3.2.5.1 Preparing for Conversion to SGML/XML from MS Word
Susanne Dobratz, Viviane Bouletreau
Converting MS Word documents into instances of a specified SGML or XML DTD is
a complex task. You will need the following:
- An SGML or XML document type definition (DTD) that serves as the structure
model for the output. The output SGML document is said to be valid with respect
to the specified DTD, or to be an instance of this DTD.
- A Word style sheet that holds paragraph and character styles corresponding to
the structures in the DTD. For example, suppose the DTD defines a structure for
Author that is expressed in the output file as:
<author>
<title>Dr.</title><firstname>Peter</firstname><surname>Fox</surname>
</author>
The corresponding styles then have to be defined in Word:
paragraph style: author
character styles (to be used only within an author paragraph): firstname,
surname, title
- A configuration file that maps the DTD elements to Word elements and vice
versa.
- An SGML or XML parser to check the output SGML/XML document against the DTD.
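The mapping between Word styles and DTD elements can be illustrated with a small sketch. It is not the configuration format of any particular converter: the two dictionaries and the function paragraph_to_element are hypothetical names chosen for this illustration, and only the style and element names (author, title, firstname, surname) come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping tables: Word style name -> DTD element name.
PARAGRAPH_STYLES = {"author": "author"}
CHARACTER_STYLES = {
    "title": "title",
    "firstname": "firstname",
    "surname": "surname",
}


def paragraph_to_element(para_style, runs):
    """Build one element from a styled paragraph.

    runs is a list of (character_style, text) pairs in document order.
    """
    if para_style not in PARAGRAPH_STYLES:
        raise KeyError(f"no element mapped to paragraph style {para_style!r}")
    parent = ET.Element(PARAGRAPH_STYLES[para_style])
    for char_style, text in runs:
        if char_style not in CHARACTER_STYLES:
            raise KeyError(f"no element mapped to character style {char_style!r}")
        child = ET.SubElement(parent, CHARACTER_STYLES[char_style])
        child.text = text
    return parent


author = paragraph_to_element(
    "author",
    [("title", "Dr."), ("firstname", "Peter"), ("surname", "Fox")],
)
print(ET.tostring(author, encoding="unicode"))
```

A style without an entry in the mapping raises an error instead of being silently dropped, which mirrors the need to replace unstyled formatting before conversion.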
Often the conversion is done with a plug-in that runs directly in MS Word.
Other tools work on RTF (Rich Text Format), the Microsoft exchange format.
These tools interpret the RTF file, which still carries the MS Word styles,
and export it as an SGML document. This process mostly runs in batch mode,
with little or no graphical user interface.
The following paragraphs describe several approaches:
1. Approach of the Université de Montréal, Université de Lyon 2, Universidad de
Chile
There are other approaches in development as well, especially within
The process line for converting Word files into SGML documents developed within
the CyberThèses project uses scripts written in the Omnimark language.
The input to the process line is an RTF file with a "structuring style sheet",
and the output is an SGML document encoded according to the TEI Lite DTD (see
the TEI web site at http://etext.virginia.edu/TEI.html).
The conversion process consists of three main steps:
- The first step converts the RTF file into a flat XML file encoded according
to a DTD of RTF. The resulting file is a linear sequence of paragraph elements,
each with an explicit "style name" attribute corresponding to the RTF style
name.
- The second step regenerates the hierarchical and logical structure of the
document by analyzing the style name attributes.
- Finally, an SGML parser validates the resulting SGML document against the TEI
Lite DTD.
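The second step can be sketched as follows. This is not the Omnimark code of the CyberThèses project, only a minimal Python illustration of the idea. The style names Heading1, Heading2 and Normal, the flat input element names doc and p, and the output element names body, div and head are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical style names and their nesting depth.
# Any style not listed here is treated as ordinary body text.
HEADING_LEVELS = {"Heading1": 1, "Heading2": 2}


def build_hierarchy(flat_root):
    """Turn a flat sequence of <p style="..."> elements into nested <div>s."""
    body = ET.Element("body")
    # Stack of (level, element); the body itself sits at level 0.
    stack = [(0, body)]
    for para in flat_root.findall("p"):
        level = HEADING_LEVELS.get(para.get("style", ""))
        if level is None:
            # Ordinary paragraph: attach it to the innermost open division.
            p = ET.SubElement(stack[-1][1], "p")
            p.text = para.text
            continue
        # A heading closes every open division at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        div = ET.SubElement(stack[-1][1], "div")
        head = ET.SubElement(div, "head")
        head.text = para.text
        stack.append((level, div))
    return body


flat = ET.fromstring(
    "<doc>"
    '<p style="Heading1">Introduction</p>'
    '<p style="Normal">First paragraph.</p>'
    '<p style="Heading2">Background</p>'
    '<p style="Normal">Second paragraph.</p>'
    '<p style="Heading1">Method</p>'
    "</doc>"
)
tree = build_hierarchy(flat)
print(ET.tostring(tree, encoding="unicode"))
```

The stack keeps track of the currently open divisions, so a level 2 heading ends up nested inside the preceding level 1 division, while the next level 1 heading starts a new top-level division.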
Supplementary scripts then export the SGML document to other formats (HTML,
XML).
Most of the scripts will be made available on the CyberThèses web site
(http://www.cybertheses.org); at the time of writing they are not yet online.
This system is tailored to a particular DTD, but generalizing it to other
document models should not pose any difficulty.
The "Dissertation Online" project implemented and refined a conversion strategy
for converting documents written in MS Word with a special style sheet
(dissertation.dot) into SGML instances of the DiML.dtd.
We used the SGML Author for Word, a product from Microsoft, for several reasons:
1. SGML Author is quite easy to configure.
2. It is easy to use.
3. It is less expensive than other software that produces SGML files of the
same quality.
4. It supports an international standard for tables: CALS.
5. As a Word add-on, it handles documents in the MS Word doc format better than
other tools.
6. When we started using this technology in 1997, it already supported Word 97,
which was the current version of Word at that time.
Unfortunately, Microsoft did not continue the development of this tool, so
there are no new versions available for Office 2000 or Office XP. However, for
the purposes of conversion into SGML, the internal document formats of MS Word
97, MS Word 2000 and Office XP are the same. This means that documents written
in Word 2000 or Office XP can be opened in Word 97 and converted from there.
For a successful conversion from a Word document into a DiML document you will
need:
- The DiML document type definition (diml20.dtd, calstb.dtd)
- The SGML Author for Word 97 (it may no longer be available from Microsoft
shops, but NDLTD, especially Prof. Dr. Edward Fox, may provide English versions
of it that work with English Word)
- The association file for the Microsoft SGML Author for Word (diml20.dta)
- The converter style sheet, which consists of several macros programmed to
make the preconversion process easier
- The Perl programming language (free software)
- The nsgmls parser (free software)
- Several Perl scripts to correct the transformation of tables
You must have the following software installed on your computer:
- SP (NSGMLS), a parser for SGML files by James Clark (newer versions are
available at http://openjade.sourceforge.net/doc-1.4/index.htm, but we have not
tested them)
- Run SP, a WYSIWYG tool for SP by Richard Light:
http://www.light.demon.co.uk/runsp/index.htm
- Perl, the scripting language needed to run the Perl scripts
The converter style sheet and the author's style sheet can be obtained from the
following web site: http://dochost.rz.hu-berlin.de/epdiss/vorlage.html
The converter files and Perl scripts can be obtained from
http://www.educat.hu-berlin.de/diss_online/software/tools.exe
(Perl scripts, DTD and converter file for MS SGML Author for Word:
KonverterDiML2_0.dta)
The conversion from a Microsoft Word document into an SGML document that is an
instance of the DiML.dtd used at Humboldt-University takes several steps:
Check that the styles have been used correctly.
Load the style sheet for conversion (NOT the one for the authors); see figure
below.
A special feature extracts the page numbers from the Word document by using
Word-specific text anchors. These have to be converted into hard-coded
information using a page number style sheet.
Formatting that the author has applied without using the style sheet has to be
replaced by the correct styles.
To display tables correctly later on with CSS style sheets in common browsers,
empty table cells have to be filled with a single space character.
Soft line breaks have to be preserved during the conversion. This is done by
inserting the special character sequence #BR# in their place. It is later
replaced by a special SGML tag for soft line breaks, <br/>.
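The later replacement of the placeholder can be sketched in a few lines. The project itself uses Perl scripts for its post-processing; the Python function below, including the name restore_line_breaks, is only a hypothetical illustration of the substitution described above.

```python
BR_PLACEHOLDER = "#BR#"  # marker inserted in Word before the conversion


def restore_line_breaks(sgml_text):
    """Replace every placeholder with the tag for a soft line break."""
    return sgml_text.replace(BR_PLACEHOLDER, "<br/>")


sample = "<p>first line#BR#second line</p>"
print(restore_line_breaks(sample))
```

The placeholder is plain text, so it survives the export from Word unchanged and can be swapped for markup afterwards.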

- Choose "Save as SGML" in the FILE menu.
- Load the converter file KonverterDiML2_0.DTA.
- Check the XML/SGML output using the feedback file (fbk); see figure below.

- Run the Perl scripts using the batch file preprocessor.bat.
- Parse the DiML file.
- Correct any errors manually.
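The parsing step can also be run from a script, which makes the remaining errors easier to list before they are corrected by hand. The sketch below is an assumption-laden illustration, not part of the project's tool chain: it calls nsgmls with the -s option (which suppresses the normal output so that only messages are reported) and assumes that each message line ends in the pattern line:column:type: text, preceded by the program and file name. The function names parse_messages and validate are hypothetical.

```python
import re
import shutil
import subprocess

# Assumed message layout: <program and file>:<line>:<column>:<type>: <text>
MESSAGE_RE = re.compile(
    r"^(?P<source>.+):(?P<line>\d+):(?P<col>\d+):(?P<type>[A-Z]):\s*(?P<text>.*)$"
)


def parse_messages(stderr_text):
    """Extract structured messages from the parser's error output."""
    messages = []
    for raw in stderr_text.splitlines():
        match = MESSAGE_RE.match(raw.strip())
        if match:
            messages.append(
                {
                    "source": match["source"],
                    "line": int(match["line"]),
                    "column": int(match["col"]),
                    "type": match["type"],
                    "text": match["text"],
                }
            )
    return messages


def validate(path):
    """Run nsgmls on one file and return the reported messages."""
    exe = shutil.which("nsgmls")
    if exe is None:
        raise FileNotFoundError("nsgmls was not found on PATH")
    result = subprocess.run([exe, "-s", path], capture_output=True, text=True)
    return parse_messages(result.stderr)
```

A call such as validate("thesis.sgml"), where the file name is only an example, returns an empty list when the parser reports nothing in the assumed layout.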

- Run the Perl scripts using the batch file did2html.bat.
- Check the HTML output.
- Correct any errors manually in the SGML file and repeat the transformation.

A demonstration QuickTime video may be found on the ETD-Guide server as well
(see http://www.educat.hu-berlin.de/diss_online/software/didi.mov).
Tools that export using a user-specified DTD [1]:
WordPerfect since version 7.0 (Corel) (http://www.corel.com)
FrameMaker+SGML 6.0 (Adobe) (http://www.adobe.com)
Tools that export using their own native DTD [2]:
OpenOffice (Sun/open source) (http://www.openoffice.org)
AbiWord (AbiWord/open source) (http://www.abisource.com)
KWord (KOffice, KDE Project/open source) (http://www.kde.org)
Omnimark (Omnimark) (http://www.omnimark.com)
MarkupKit (Schema) (http://www.schema.de)
Majix (Tetrasix) (http://www.tetrasix.com)
TuSTEP (RZ Uni Tübingen) (http://www.uni-tuebingen.de/zdv/tustep/index.html)
[1] Bollenbach, Markus; Rüppel, Thomas; Rocker, Andreas: FrameMaker+SGML 5.5.
[2]
[3] Ducharme, Bob: SGML CD.
[4] Smith, Norman E.: Practical Guide to SGML/XML Filters.
[5] Goldfarb, Charles; Prescod, Paul: XML Handbuch. München, Prentice Hall,
1999, ISBN 3 8277 9575 0