Master Thesis Seminar: Efficient processing and transformation of XML

Date: February 04, 2005 (Friday) at 14:15

By: Tomas Rutegård

Extensible Markup Language (XML) is quickly becoming a main approach to explicitly denoting the structure of data, particularly in text format. Accordingly, the demand for efficient processing of XML documents is rising rapidly. I have examined the current standards and technologies in the field and made a contribution of my own.

My work examines the space and time efficiency of current methods of processing and transforming XML, and the possibilities for improving on them. The work concentrates on developing a representation of XML suited to space- and time-efficient processing and transformation, based on current knowledge in modeling and coding. Such a representation, the BXR format, is developed, assessed, and found fairly satisfactory and potentially quite useful. The representation models an arbitrary XML document as the sequence of SAX (Simple API for XML) events that a SAX parser generates when parsing the document, and codes these events in a minimum-redundancy code whose codewords are formed from a 128-symbol code alphabet. Prototype tools in Java that handle this representation were developed along with the representation itself.
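The two ideas in the paragraph above, modeling a document as SAX events and assigning those events minimum-redundancy codewords over a 128-symbol alphabet, can be illustrated with a small Java sketch. This is not the BXR format or the thesis tools: the class name SaxEventCodeSketch, the string token format ("start:", "end:", "attr:", "text:"), and the choice of a radix-128 Huffman construction are all assumptions made for illustration. The sketch only computes codeword lengths, not actual codewords or a file layout.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the BXR implementation.
final class SaxEventCodeSketch {

    // Parses XML text and records one string token per SAX event.
    // The token format is an assumption made for this sketch.
    static List<String> toEvents(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start:" + qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    events.add("attr:" + atts.getQName(i) + "=" + atts.getValue(i));
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Simplification: whitespace-only text is dropped, so this model is lossy.
                // A parser may also deliver one text node in several calls.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text:" + text);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    // Counts how often each distinct event token occurs.
    static Map<String, Integer> frequencies(List<String> events) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String e : events) {
            freq.merge(e, 1, Integer::sum);
        }
        return freq;
    }

    // A subtree of the code tree: its total weight and the tokens below it.
    private static final class Node {
        final long weight;
        final List<String> symbols;

        Node(long weight, List<String> symbols) {
            this.weight = weight;
            this.symbols = symbols;
        }
    }

    // Computes Huffman codeword lengths, measured in code symbols, for a code
    // alphabet of the given size (radix >= 2). Use radix = 128 for a 128-symbol alphabet.
    static Map<String, Integer> codeLengths(Map<String, Integer> freq, int radix) {
        Map<String, Integer> lengths = new LinkedHashMap<>();
        PriorityQueue<Node> queue = new PriorityQueue<>((a, b) -> Long.compare(a.weight, b.weight));
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            lengths.put(e.getKey(), 0);
            queue.add(new Node(e.getValue(), new ArrayList<>(List.of(e.getKey()))));
        }
        if (queue.size() == 1) {
            lengths.replaceAll((k, v) -> 1); // a single token still needs one code symbol
            return lengths;
        }
        // Pad with zero-weight dummies so that every merge takes exactly `radix` nodes.
        while ((queue.size() - 1) % (radix - 1) != 0) {
            queue.add(new Node(0, new ArrayList<>()));
        }
        while (queue.size() > 1) {
            long weight = 0;
            List<String> merged = new ArrayList<>();
            for (int i = 0; i < radix; i++) {
                Node n = queue.poll();
                weight += n.weight;
                merged.addAll(n.symbols);
            }
            // Every token below the new parent moves one level deeper in the tree.
            for (String s : merged) {
                lengths.merge(s, 1, Integer::sum);
            }
            queue.add(new Node(weight, merged));
        }
        return lengths;
    }

    public static void main(String[] args) throws Exception {
        List<String> events = toEvents("<a><b>x</b><b>x</b><c/></a>");
        Map<String, Integer> lengths = codeLengths(frequencies(events), 128);
        lengths.forEach((event, len) -> System.out.println(event + " -> " + len + " code symbol(s)"));
    }
}
```

With a 128-symbol alphabet, a document with at most 128 distinct event tokens gets one code symbol per event in this sketch; longer codewords appear only once the number of distinct tokens exceeds that.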

The representation and its Java prototype tools allow very swift processing of simpler transformations, and processing of more complex transformations at no less than normal speed. They also allow swift parsing to SAX events, the output of SAX parsing, which is the de facto standard for parsing XML. Furthermore, the representation is compact: up to ten times smaller than the original text-based representation of an XML document, depending on the specific structure of the document. This allows in-memory processing of much larger bodies of XML data than the text-based representation allows. Not being text-based, the BXR representation is not quite as accessible as the original, which can be read using any text editor. In BXR, a little accessibility is traded for the benefits listed above.

Room: E:2116
