Exploring the viability of XML for use in POPIN web sites
Technical Report to POPIN - Czech Republic
This part of the project is a genuine novelty on our websites: it describes the proposed standard mentioned above. Developing a mechanism for exchanging the data required us to resolve a great number of technical problems, so we adopted a strategy that divides the development into several consecutive stages:
- Data design. This step involves deciding what types of data will be collected and how they will be presented; no specific data were collected at this stage. We drew up the first syntactic description of the future XML data and created model tables for future data collection.
- Data collection. The data were obtained from several incompatible sources (various databases, publications, printed materials, statistical data from government organizations, etc.), so they had to be merged into a single information source. As XML is not yet generally supported by available tools, we decided to collect the data in a spreadsheet (MS Excel) and subsequently convert them into the XML format.
- Data conversion. As time was short (the first version was to be completed by the end of 1999), we decided to take an intermediate step: the data are stored directly in XML files without creating a real database structure. This does not involve any additional work (see the description below). As there were no suitable tools to convert the data directly, we had to develop several programs of our own that perform this task in a few steps, regardless of the actual quantity of data. XML support in current web browsers is poor and fragmented, which is why we developed an on-line converter that parses XML data into the form of an HTML document (see the addendum on mechanisms of data conversion). This is the level we have achieved so far; further stages are to be implemented in the near future.
- Database development. Our strategy provides for the creation of a front-end for the data. Once stage 3 is completed, all tabular data are to be moved into a database behind the existing mechanisms. XML will continue to be used for data import, export and download by remote users; for those unable to use XML-formatted data directly, conversion into HTML will still be available on the server side. The database layer will bring nothing new into the visible part of our site, but it will add a new range of functions to the application, especially updates, search and retrieval of particular information according to the user's needs.
- Data exchange and search. We shall propose a new mechanism for data exchange and search that is independent of the particular data of individual countries. We consider it best to develop a "master" server for the whole of POPIN, which will make this possible. Such a server should be able to query regional servers, obtain particular data and merge them into a resulting XML file. It will be possible, for example, to obtain fertility data for 1995 for all countries in which the information is available.
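The spreadsheet-to-XML conversion described in stages 2 and 3 can be sketched in a few lines. This is only an illustration under assumed conventions - the column names, the element vocabulary (category, record, indicator, year, value) and the rows_to_xml helper are invented for the example and are not the actual converter used on the Czech site:

```python
import csv
import io
from xml.sax.saxutils import escape

def rows_to_xml(csv_text, category):
    """Convert tabular data (exported from a spreadsheet as CSV)
    into a simple XML document, one <record> per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    parts = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<category name="%s">' % escape(category)]
    for row in reader:
        # Each CSV column becomes a child element of <record>.
        fields = "".join("<%s>%s</%s>" % (k, escape(v), k)
                         for k, v in row.items())
        parts.append("  <record>%s</record>" % fields)
    parts.append("</category>")
    return "\n".join(parts)

sample = "indicator,year,value\nfertility,1995,1.28\nfertility,1996,1.18\n"
print(rows_to_xml(sample, "fertility"))
```

In practice the Excel tables would first be exported to a plain-text format such as CSV, then run through a converter of this kind once per category.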
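The on-line XML-to-HTML converter mentioned in stage 3 can likewise be sketched. The element names (category, record) are again hypothetical; a production converter would handle the full agreed data structure rather than this minimal one:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def xml_to_html_table(xml_text):
    """Parse an XML category file and render its <record> elements
    as an HTML table, one column per child element."""
    root = ET.fromstring(xml_text)
    records = root.findall("record")
    if not records:
        return "<p>No data available.</p>"
    # Take the column headers from the first record's children.
    headers = [child.tag for child in records[0]]
    html = ["<table border='1'>",
            "<tr>" + "".join("<th>%s</th>" % escape(h) for h in headers) + "</tr>"]
    for rec in records:
        cells = "".join("<td>%s</td>" % escape(rec.findtext(h, "n/a"))
                        for h in headers)
        html.append("<tr>%s</tr>" % cells)
    html.append("</table>")
    return "\n".join(html)

doc = ('<category name="fertility">'
       '<record><year>1995</year><value>1.28</value></record>'
       '<record><year>1996</year><value>1.18</value></record>'
       '</category>')
print(xml_to_html_table(doc))
```

A server-side script of this shape lets browsers without XML support receive plain HTML, while XML-capable clients download the source files directly.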
We pursue the objective of implementing the features described in stages 4 and 5 of our proposal. However, this is not all we would like to do: we would like our proposal to become a real standard for data exchange within the POPIN project. Moreover, we feel the need to make our solution universal enough to be usable elsewhere, too, not only in this single application.
Compliance with our standard proposal
Our proposal is not intended to restrict the use of specialized computer programs (tools), nor is its objective to spread our own programs. However, if your data and their presentation for POPIN are to comply with the standard, they should include the following features:
- Websites describing the population development in the relevant country, presenting the country's own population data using approved standards. The Czech POPIN website can serve as a model example.
- Data sets describing the main demographic indices and tabular data in the scope and structure presented on the Czech website. This standard includes a unified naming of main tables as well as sub-tables, divided into eight basic categories. We expect the names of all categories, tables and sub-tables to be standardized in order to ensure future interconnection of regional servers by means of a search engine. A complete data-structure proposal has been created within our application in on-line form.
- As a minimum, every category should contain all available data for the years 1980 to 1999. If a value is unknown, it should be replaced with the string "n/a".
All time stamps should be described using one of the following formats:
YYYY for years;
YYYYMM for years and months;
YYYYMMDD for years, months and days;
HH:MM for hours and minutes;
HH:MM:SS for hours, minutes and seconds;
YYYYMMDD HH:MM:SS for complete time specification;
YYYYMMDD HH:MM for complete time specification without seconds.
- All tabular data should be downloadable in the XML format. In order to share a unified syntactic description of the data within XML, we have created a DTD file describing the allowed structures. The level of data presentation is to comply at least with stage 3 of the described strategy (each of the eight basic categories is represented by one XML file) or with a higher stage (the possibility of data search).
- All tabular data containing national characters should be stored in a format complying with one of the ISO-8859-x or Unicode character sets.
- It must be possible to convert all XML data automatically into an HTML document on the server side upon request.
- A detailed specification of a communication protocol for cross-country data mining will be the subject of an extended version of this standard. For now we can state the following: communication between the "master" server and regional data sources will probably be accomplished through the HTTP (web) protocol with the use of CGI. The "master" server will react to users' requests by asking each regional server whether it is able to provide the relevant data set. The regional servers then send back an encoded reply saying whether the relevant information is available and, if so, how it can be obtained from the server. Afterwards, the "master" server collects all the replies, evaluates them and gives the user an overview with links to the requested regional data.
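For illustration of what the DTD mentioned above does, a fragment for one hypothetical data category might look like the following; the element names are invented here and do not reproduce the actual DTD published on the Czech site:

```dtd
<!-- Hypothetical structure: one file per category, holding records. -->
<!ELEMENT category (record*)>
<!ATTLIST category name CDATA #REQUIRED>
<!ELEMENT record (indicator, year, value)>
<!ELEMENT indicator (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT value (#PCDATA)>
```

A validating parser can check every downloaded category file against such declarations before the data are accepted for merging.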
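The query-and-merge behaviour of the proposed "master" server can be illustrated without committing to the final protocol. In the sketch below, the regional replies are hard-coded strings standing in for HTTP/CGI responses, and the reply format (an available attribute plus record elements) is our assumption, not the specified encoding:

```python
import xml.etree.ElementTree as ET

# Simulated replies from regional servers: in the real system these
# would be fetched over HTTP/CGI; here they are hard-coded strings.
REGIONAL_REPLIES = {
    "cz": '<reply available="yes"><record><country>cz</country>'
          '<year>1995</year><fertility>1.28</fertility></record></reply>',
    "sk": '<reply available="no"/>',
}

def collect(indicator, year):
    """Ask each regional server for a data set and merge the
    available records into a single result XML document."""
    result = ET.Element("result", indicator=indicator, year=str(year))
    for country, reply_text in REGIONAL_REPLIES.items():
        reply = ET.fromstring(reply_text)
        # Only servers that report the data as available contribute records.
        if reply.get("available") == "yes":
            for record in reply.findall("record"):
                result.append(record)
    return ET.tostring(result, encoding="unicode")

print(collect("fertility", 1995))
```

The essential point is that the master server never stores the data itself; it only aggregates what the regional servers report as available.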
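The timestamp formats listed above lend themselves to a simple validity check. The sketch below is illustrative only; the function name and the regular expressions are our own choices, not part of the standard:

```python
import re

# Accepted formats, per the standard proposal above.
# Each pair maps a regular expression to the format's name.
TIME_FORMATS = [
    (re.compile(r"^\d{4}$"), "YYYY"),
    (re.compile(r"^\d{6}$"), "YYYYMM"),
    (re.compile(r"^\d{8}$"), "YYYYMMDD"),
    (re.compile(r"^\d{2}:\d{2}$"), "HH:MM"),
    (re.compile(r"^\d{2}:\d{2}:\d{2}$"), "HH:MM:SS"),
    (re.compile(r"^\d{8} \d{2}:\d{2}:\d{2}$"), "YYYYMMDD HH:MM:SS"),
    (re.compile(r"^\d{8} \d{2}:\d{2}$"), "YYYYMMDD HH:MM"),
]

def classify_timestamp(value):
    """Return the name of the matching format, or None if the
    value complies with none of the allowed formats."""
    for pattern, name in TIME_FORMATS:
        if pattern.match(value):
            return name
    return None

print(classify_timestamp("1995"))            # YYYY
print(classify_timestamp("19951231 23:59"))  # YYYYMMDD HH:MM
print(classify_timestamp("95-12-31"))        # None
```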
The main problem of the computer age is that many standards and Requests for Comments (RFCs) exist, while only a small fraction of them are actually put into practice. Admittedly, the XML specification is still a recommendation rather than a formal standard, although it is supported by a growing number of vendors. Nevertheless, if you need to adapt the existing expertise and tools to your special needs, you find that there are no adequate tools to do so. You then have two alternatives: either adapt your project to a state acceptable to standard tools, thus losing the most valuable thing in your project - the individual solution - or develop your own tools and, by combining them with the existing ones, achieve the desired result. Of course, this can require a large workload, much time or perhaps even money, but there are situations in which it really makes sense. We have embarked on this road and we believe it was a correct decision.

Both leading web browsers (MS Internet Explorer and Netscape Communicator) continue to support the rapidly developing standard, and further functional elements are being added. However, neither of them has implemented the full XML standard; each uses only a "useful subset" of it. At the moment Microsoft seems to be faster and better, at least when it comes to XML support. But even here one cannot rely on Internet Explorer using XML data exactly as defined by the standards, or in a way that is ideal from the viewpoint of both the user and the developer. We hope that many things will be resolved by the new generation of Netscape Communicator and by specialized tools (plug-ins, converters from other vendors, etc.) whose development has been announced; if our expectations are met, we will certainly take advantage of them. Nevertheless, right now we cannot do everything we want with XML without writing our own code.