4.3.5.3. SGML/XML
Overview, Susanne Dobratz
SGML/XML is a multiple-targeted strategy (see [9]).
"It allows librarians to ensure longevity of digital dissertations. Modern
hardware and redundancy can keep all the bits of an electronic thesis or
dissertation (ETD) intact. But electronic archives must be modernized
continually as new document formats become popular." As librarians always
tend to think in decades, document formats like TIFF, PostScript or PDF do not
meet their requirements. If PDF is replaced by another de facto standard (an
industry standard rather than an ISO-like one), preserving digital
documents would mean converting thousands of documents. XML can help overcome
those difficulties. "XML is the new ASCII." "If an electronic document is to
be of 'archival quality', it should be liberated from the page metaphor."
A second reason for using SGML/XML is that it ensures
reusability of documents by preserving raw data and content-based structuring
of information pieces. Preserving data for statistics and formulas in
mathematics and chemistry could allow researchers to reuse and repeat simulations,
calculations and experiments, deriving the needed data directly from an
archive.
Third, using structured information allows the reuse of
the same information or documents in different contexts, i.e., the same digital
dissertation can be used to produce an online or print version, and to produce
additional information products, like monthly proceedings containing the
abstracts of all dissertations produced within the university during the last
month, or a citation index. Additionally, the dissertation can be displayed for
different media, so a Braille reader or an automatic voice synthesizer could be
used as a back-end machine.
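
The idea of deriving several products from one structured source can be
sketched as follows. This is a minimal illustration only: the element names
(dissertation, title, author, abstract, chapter, heading, para) and both
helper functions are hypothetical and are not taken from any actual ETD
document type definition.

```python
from html import escape
import xml.etree.ElementTree as ET

# Hypothetical markup: element names are illustrative, not from a real ETD DTD.
SOURCE = """<dissertation>
  <title>Structured Archives</title>
  <author>A. Example</author>
  <abstract>Markup separates content from layout.</abstract>
  <chapter><heading>Introduction</heading><para>First paragraph.</para></chapter>
</dissertation>"""


def to_html(root):
    """Render the whole document as a simple HTML fragment (online version)."""
    parts = [
        f"<h1>{escape(root.findtext('title'))}</h1>",
        f"<p><em>{escape(root.findtext('author'))}</em></p>",
    ]
    for chapter in root.findall("chapter"):
        parts.append(f"<h2>{escape(chapter.findtext('heading'))}</h2>")
        parts.extend(f"<p>{escape(p.text)}</p>" for p in chapter.findall("para"))
    return "\n".join(parts)


def to_abstract_entry(root):
    """Extract one plain-text entry for a monthly list of abstracts."""
    author = root.findtext("author")
    title = root.findtext("title")
    abstract = root.findtext("abstract")
    return f"{author}: {title}\n  {abstract}"


root = ET.fromstring(SOURCE)
html_version = to_html(root)              # one product: online version
abstract_entry = to_abstract_entry(root)  # another product: abstracts list
```

Both outputs are computed from the same parsed source, so a correction made
in the source appears in every derived product.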
Another reason for using markup for encoding documents is
that broader and more precise retrieval can be offered to the users of an
archive. As university libraries are more and more challenged by the problem of
handling, converting, archiving and providing electronic publications, one of
the major tasks is providing a new quality for retrieval within the user
interface. Using an SGML/XML-based publishing concept enables a new quality in
the distribution of scientific contents via specific information and knowledge
management.
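
What structure-aware retrieval means in practice can be sketched as follows,
under the assumption of a small hypothetical archive format (the element
names, the ids and the search function are invented for this illustration).
A query can be restricted to one structural element, such as the title,
instead of matching anywhere in the full text.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive: element names and ids are illustrative only.
ARCHIVE = """<archive>
  <dissertation id="d1">
    <title>Polymer Chemistry</title>
    <abstract>We study catalysts.</abstract>
    <bibliography><entry>Handbook of XML</entry></bibliography>
  </dissertation>
  <dissertation id="d2">
    <title>XML in Libraries</title>
    <abstract>Archiving with markup.</abstract>
    <bibliography><entry>Catalyst design</entry></bibliography>
  </dissertation>
</archive>"""


def search(root, term, field=None):
    """Return ids of dissertations containing the term.

    With field=None the whole record is searched (full-text behaviour);
    otherwise only the named child element is searched.
    """
    term = term.lower()
    hits = []
    for diss in root.findall("dissertation"):
        scope = diss if field is None else diss.find(field)
        text = "".join(scope.itertext()) if scope is not None else ""
        if term in text.lower():
            hits.append(diss.get("id"))
    return hits


root = ET.fromstring(ARCHIVE)
full_text_hits = search(root, "xml")       # matches a bibliography entry too
title_hits = search(root, "xml", "title")  # only documents about XML
```

The full-text query also returns the chemistry dissertation, because its
bibliography mentions XML; the title-restricted query does not.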
The Extensible Markup Language (XML) is the universal
format for structured documents and data on the Web. The current W3C
Recommendations are XML 1.0 (February 1998), Namespaces (January 1999),
Associating Stylesheets (June 1999), and XSLT/XPath (November 1999); see
http://www.w3.org/XML. The development of XML started in 1996, and it has been
a W3C Recommendation since February 1998, which may make you suspect
that this is rather immature technology. But in fact the technology isn't very
new.
Before XML there was the Standard Generalized Markup
Language (SGML), developed in the early '80s, an ISO standard since 1986, and
widely used for large documentation projects, and of course HTML, whose
development started in 1990. The designers of XML simply took the best parts of
SGML, guided by the experience with HTML, and produced something that is no
less powerful than SGML, but vastly more regular and simpler to use. While SGML
was mostly used for technical documentation and much less for other kinds of
data, with XML it is the opposite.
"Structured data", such as mathematical or
chemical formulas, spreadsheets, address books, configuration parameters,
financial transactions, technical drawings, etc. are usually put on the Web
either as the output of layout programs, such as PostScript or PDF, or as
graphic formats like GIF, JPEG, PNG, VRML, and so on. Programs that
produce such data often also store it on disk, for which they can use either a
binary format or a text format. So, if somebody wants to look at the data, they
usually need the program that produced it. With XML the data can be stored
in a text format, which allows the user to read the file without having the
original program. XML is a set of rules, guidelines, conventions, whatever you
want to call them, for designing text formats for such data, in a way that
produces files that are easy to generate and read (by a computer).
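
As a minimal sketch of this point, the following stores a small series of
measurements as XML text and reads it back with nothing but a generic XML
parser. The element and attribute names (series, point, quantity, unit, t)
and the sample values are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET

# Sample data: (time, value) pairs invented for this example.
readings = [(0.0, 20.5), (1.0, 21.0), (2.0, 22.0)]

# Writing: the producing program serializes its data as XML text.
series = ET.Element("series", {"quantity": "temperature", "unit": "celsius"})
for t, value in readings:
    point = ET.SubElement(series, "point", {"t": str(t)})
    point.text = str(value)
text = ET.tostring(series, encoding="unicode")

# Reading: any XML-aware program can recover the data from the text alone,
# without access to the program that produced it.
parsed = ET.fromstring(text)
recovered = [(float(p.get("t")), float(p.text)) for p in parsed.findall("point")]
unit = parsed.get("unit")
```

Because the stored form is plain text, it can also be inspected in any text
editor, which is not possible with a binary format.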
The eXtensible Markup Language (XML)
is a markup or structuring language for documents, a so-called metalanguage,
that defines rules for the structural markup of documents independently from
any output media. XML is a "reduced" version of the Standard
Generalized Markup Language (SGML),
which has been an ISO standard since 1986. In the field of internet
publishing, SGML never achieved wide success due to the complexity of the
standard and the high cost of the tools. It prevailed only in certain areas,
such as technical documentation in large enterprises (Boeing, patent
information). The main philosophy of SGML and XML is the strict separation of
content, structure and layout of documents. Most ETD projects use either the
SGML standard (ISO 8879 with Corrigendum K of 4 December 1997) or the World
Wide Web Consortium (W3C) definition of XML 1.0 (10 February 1998, revised
6 October 2000).
The crux of all those projects was always the document type definition (DTD).