3.2.5 Preparing for Conversion to SGML/XML, Susanne Dobratz
A document type definition (DTD), in the sense of XML, defines rules or
templates that are used to produce similarly structured documents. A DTD
describes the content model of a class of documents. It consists of:
- An element declaration, which is the main part of a DTD and provides the
  structural definition. Elements can contain other elements, characters, or
  nothing. Element declarations define the name of the element and its
  logical content (sub-elements). (See [10].) An important part of the
  element declaration is the content model. It is here that the document
  architect indicates the order and occurrence of other elements or
  character data.
- A notation declaration, which defines a notation for external formats,
  e.g., graphics (GIF, JPEG), mathematics (TeX, LaTeX), 3D objects (VRML),
  and other formats that cannot be coded directly in XML.
- An entity declaration, which defines character sets and replacement
  objects for characters. Anything from a single character upward can be
  defined as a single entity. There are two basic types of entities: general
  and parameter. Parameter entities are only allowed in declarations and are
  usually used to make a DTD more readable or to control processing. General
  entities are used in the document instance, that is, in the documents
  built upon the DTD.
- An attribute list declaration, which lists the attributes and their values
  for the element types defined in the element declarations.
Defining a DTD requires a special syntax that differs from the usual XML
syntax, in which a document contains elements enclosed in "tags": a start
tag (e.g., <author>) and an end tag (e.g., </author>), producing code like
this: <author>Joe Miller</author>
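The four kinds of declarations and the contrast between DTD syntax and
element syntax can be illustrated with a small sketch. It uses the expat
parser from the Python standard library to read a document whose DTD is
embedded in the DOCTYPE. The element names, the attribute, the entity "uni",
and the sample content are illustrative assumptions and are not taken from
any of the DTDs discussed in this section.

```python
import xml.parsers.expat

# Hypothetical mini-DTD (internal subset) followed by a document instance.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE thesis [
  <!ELEMENT thesis (author, figure?)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT figure (#PCDATA)>
  <!ATTLIST figure format NOTATION (gif | jpeg) #REQUIRED>
  <!NOTATION gif SYSTEM "image/gif">
  <!NOTATION jpeg SYSTEM "image/jpeg">
  <!ENTITY uni "Example University">
]>
<thesis><author>Joe Miller, &uni;</author><figure format="gif">Figure 1</figure></thesis>
"""


def collect_declarations(xml_text):
    """Return the declared names per declaration kind and the instance text."""
    found = {"element": [], "attlist": [], "entity": [], "notation": []}
    text = []
    parser = xml.parsers.expat.ParserCreate()
    # Each handler records only the name(s) introduced by the declaration.
    parser.ElementDeclHandler = lambda name, model: found["element"].append(name)
    parser.AttlistDeclHandler = (
        lambda elname, attname, type_, default, required:
        found["attlist"].append((elname, attname))
    )
    parser.EntityDeclHandler = (
        lambda name, is_parameter, value, base, system_id, public_id, notation:
        found["entity"].append(name)
    )
    parser.NotationDeclHandler = (
        lambda name, base, system_id, public_id: found["notation"].append(name)
    )
    # Character data arrives in chunks; general entities are already expanded.
    parser.CharacterDataHandler = text.append
    parser.Parse(xml_text, True)
    return found, "".join(text)


declarations, instance_text = collect_declarations(SAMPLE)
print(declarations)
print(instance_text)
```

Note that expat is a non-validating parser: it reads the declarations and
expands the general entity in the instance, but it does not check the
document against the content models.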
The fact that currently available authoring systems for XML have still not
won wide acceptance has led different universities to adopt different
strategies for XML documents. Most of these projects were started between
1995 and 1997, at a time when XML was just emerging and no tools or
standardized DTDs were available. Viewed from today's perspective, those
projects show the need to rethink and redesign these approaches in order to
arrive at a standard.
All the DTDs presented here are built upon similar principles. A classical
dissertation (which can be seen as a monograph) consists of three main
components: an extensible title page with abstracts, declarations, etc.; the
dissertation corpus, which includes text, pictures, audio, video, tables,
and so on; and the appendices, which contain data sheets, bibliographies,
acknowledgements, and other material.
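As a rough illustration of this three-part structure, the following sketch
builds an empty dissertation skeleton with Python's standard ElementTree
module. The element names (dissertation, front, body, back, and their
children) are hypothetical and do not reproduce any of the DTDs listed in
this section.

```python
import xml.etree.ElementTree as ET


def build_skeleton(title, author):
    """Return a three-part dissertation skeleton (hypothetical element names)."""
    root = ET.Element("dissertation")

    # 1. Extensible title page: title, author, abstracts, declarations, ...
    front = ET.SubElement(root, "front")
    ET.SubElement(front, "title").text = title
    ET.SubElement(front, "author").text = author
    ET.SubElement(front, "abstract")

    # 2. Dissertation corpus: text, pictures, tables, media, ...
    ET.SubElement(root, "body")

    # 3. Appendices: data sheets, bibliographies, acknowledgements, ...
    back = ET.SubElement(root, "back")
    ET.SubElement(back, "bibliography")
    ET.SubElement(back, "acknowledgements")
    return root


skeleton = build_skeleton("An Example Thesis", "Joe Miller")
print(ET.tostring(skeleton, encoding="unicode"))
```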
The following DTDs are currently in use at different institutions:
- ETD-ML.DTD: Virginia Polytechnic Institute and State University
  (Virginia Tech)
- DiML.DTD: German Dissertationen Online project
- TDM.DTD: University of Iowa
- HutPubl.DTD: Helsinki University of Technology
- TEI-Light.DTD: Ann Arbor and Lyon
- ISOBook.DTD: University of Oslo
- TEI-based DTD with extensions for natural sciences: Swedish University of
  Agricultural Sciences, Uppsala
All of these DTDs are so-called author DTDs: they are used primarily to
support authoring and conversion, and do not primarily address document
archiving and preservation. One may ask why so many different DTDs persist.
The main reason is that the scientific orientation of these universities
varies widely. Lyon, Oslo, and Michigan, which use TEI-Light.DTD, mainly
serve students in the arts and humanities. Universities that support a
strong natural science community, such as Berlin, Helsinki, or Uppsala, have
encountered problems using TEI.DTD or DocBook.DTD. In addition, a
dissertation is often a cumulative work, e.g., in Lyon or Helsinki.
Converting from word processing forms to SGML or XML requires more planning
in advance, different tools, and broader learning about document processing
concepts than does working with PDF. In return, the end result is a
representation that is easier to preserve, more reusable, and supportive of
more powerful and effective schemes for searching and browsing. All of these
advantages, however, must be weighed against the facts that fewer people are
knowledgeable about these matters, that the tools are often more expensive
and less mature, and that the process may be complicated, difficult, and
time consuming. As of 2000, there are tens of thousands of ETDs created by
scanning (mostly by UMI, but also at sites like MIT and the National
Document Center in Greece), thousands converted from word processors into
PDF, and hundreds in SGML or XML, which illustrates the relative effort
required of students to prepare ETDs in each of these forms.
Simple word processing emphasizes layout, or what-you-see-is-what-you-get
(WYSIWYG) editing. Emphasizing what documents look like is quite distinct
from focusing on their logical structure, for which markup schemes are best
suited. Shifting from word processing representations to XML requires a
different way of thinking and a different approach. The problem is harder
than producing HTML by exporting from a word processor: it is not enough to
have a document that looks like the original, because the marked-up version
itself must be correctly tagged.
Some word processors have been extended to facilitate such
an approach. Microsoft produced SGML Author for Word as an add-on package for
Word 95, and new versions of WordPerfect can export content according to markup
schemes. Eventually it is likely that most popular word processors will export
to XML. Clearly, the resulting markup can surround document sections, headings,
paragraphs, lists, figures, tables, citations, footnotes, hyperlinks, and other
obvious constructs. In addition, regions with the same style can be tagged.
Thus, easy conversion from word processing to markup schemes requires
choosing a target DTD and then consistently using document objects and
styles so that there is a clear mapping from them to tags.
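The idea of a clear mapping from styles to tags can be sketched as a simple
lookup table. The style names, the tag names, and the input format (a list
of style/text pairs already extracted from the word processor file) are all
assumptions made for illustration; a real converter would take its tag names
from the chosen target DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from word processor style names to target tags.
STYLE_TO_TAG = {
    "Heading 1": "chapter-title",
    "Heading 2": "section-title",
    "Body Text": "p",
    "Caption": "caption",
    "Footnote Text": "footnote",
}


def convert(paragraphs):
    """Convert (style_name, text) pairs; return (root, unmapped_styles)."""
    root = ET.Element("document")
    unmapped = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style)
        if tag is None:
            # Inconsistent or ad hoc styles cannot be mapped automatically.
            # Keep the text, but flag it for manual review.
            unmapped.append(style)
            element = ET.SubElement(root, "unmapped", style=style)
        else:
            element = ET.SubElement(root, tag)
        element.text = text
    return root, unmapped


root, unmapped = convert([
    ("Heading 1", "Introduction"),
    ("Body Text", "This thesis examines ..."),
    ("My Bold Title", "Methods"),  # ad hoc style with no mapping
])
print(ET.tostring(root, encoding="unicode"))
print("Styles needing review:", unmapped)
```

The third paragraph shows why consistent use of styles matters: a heading
formatted with an ad hoc style cannot be recognized as a heading.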
Conversion from LaTeX is slightly simpler since the TeX
approach involves using formatting commands that can be mapped to tags in XML.
However, LaTeX does not require strict nesting of commands, so it may not be
clear where to place end-tags. Further, LaTeX users may not consistently use
the same sequences to designate changes in structure, making translation more
complex. Finally, LaTeX coding of mathematical expressions is very difficult to
translate to markup schemes for mathematics, like MathML.
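The end-tag problem can be made concrete with a small sketch. LaTeX
sectioning commands such as \section have no explicit end, so a converter
has to infer where each unit closes. The sketch below assumes a heavily
simplified input: one sectioning command or one paragraph per line, with no
other LaTeX commands. The output element names simply reuse the LaTeX
command names and are not taken from any particular DTD.

```python
import re
import xml.etree.ElementTree as ET

LEVELS = {"chapter": 1, "section": 2, "subsection": 3}
HEADING = re.compile(r"\\(chapter|section|subsection)\{([^}]*)\}")


def latex_to_tree(source):
    """Nest sectioning units by level; every other line becomes a <p>."""
    root = ET.Element("document")
    stack = [(0, root)]  # (level, element) for each currently open unit
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.fullmatch(line)
        if match:
            name, title = match.groups()
            level = LEVELS[name]
            # A new heading implicitly closes every open unit at the same
            # or a deeper level; this is where the end-tags are inferred.
            while stack[-1][0] >= level:
                stack.pop()
            element = ET.SubElement(stack[-1][1], name)
            ET.SubElement(element, "title").text = title
            stack.append((level, element))
        else:
            ET.SubElement(stack[-1][1], "p").text = line
    return root


sample = r"""
\chapter{Introduction}
Opening text.
\section{Background}
More text.
\chapter{Methods}
"""
print(ET.tostring(latex_to_tree(sample), encoding="unicode"))
```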
Because of the inherent complexity of converting from word
processing schemes to markup representations, it is necessary to include steps
for checking and correcting converted forms. Parsers can ensure syntactic
correctness, so detecting problems is often simple. To ensure semantic
correctness, however, manual inspection may be required. A further test would
involve rendering the marked-up document, for example to a printed or PDF form,
and ensuring that the result suitably matches the output resulting from the
original word processing version. In any case, human labor is likely to be
needed to correct conversion errors, which presupposes that students
understand enough about the process and the desired output to do this with
ease.
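The syntactic check can be automated with very little code. The following
sketch uses the parser in Python's standard library to test whether a
converted document is well-formed; the function name and the sample strings
are illustrative. The standard library parser does not validate against a
DTD, so checking a document against its content models would require a
validating parser, for example from a third-party library such as lxml.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if the text parses, else a description of the first error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        # The message includes the line and column of the problem.
        return str(error)
    return None


good = "<thesis><author>Joe Miller</author></thesis>"
bad = "<thesis><author>Joe Miller</thesis>"  # missing </author>

for label, text in (("good", good), ("bad", bad)):
    problem = check_well_formed(text)
    print(label, "->", "well-formed" if problem is None else problem)
```

A document that passes this check can still be semantically wrong, for
example a heading tagged as a paragraph, which is why manual inspection
remains necessary.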
[1] http://lcweb.loc.gov/cds/lcsh.html#lcsh20
[2] http://www.bibliothek.uni-regensburg.de/rvko/rvko.php3
[5] Edward Fox: Networked Digital Library of Theses and Dissertations,
in: Web matters, August 12, 1999, http://helix.nature.com/webmatters/library/library.html
[6] Website of the standards committee of NDLTD: http://www.ndltd.org/standards/
[7] http://dochost.rz.hu-berlin.de/epdiss/dtd-workshop/index.html
[8] Tad Lane: Scalable Vector Graphics - Web Graphics with Original-Quality
Artwork, in: BITS, November 1999, http://lanl.gov/orgs/cic/cic6/bits/november_99/novbits1.html
[9] Neill Kipp: Beyond the Paper Paradigm: XML and the Case for Markup,
in: Part II "Guideline for Writing and Designing ETDs", ETD Sourcebook,
Weisser, Moxley and Fox (editors), 1999
[10] B. Travis, D. Waldt: The SGML Implementation Guide,
Springer, Berlin-Heidelberg-New York, 1995
[11] Ed Dumbill: The State of XML, in: XML.com, June 16, 2000,
http://www.xml.com/pub/2000/06/xmleurope/keynote.html