Rethinking OOXML Validation, Part 1

by Alex Brown 4. November 2010 15:09
Brussels ODF Plugfest venue

At the recent ODF Plugfest in Brussels, I was very interested to hear Jos van den Oever of KOffice present on how ODF’s alternative “flat” document format could be used to drive browser-based rendering of ODF documents. ODF defines two methods of serializing documents: one uses multiple files in a “Zip” archive; the other, the aforementioned “flat” format, combines everything into a single XML file. Seeing this approach in action gelled with some thoughts I’d been having on how better to validate OOXML documents using standards-based XML tools …

Unlike ODF, OOXML has no “flat” file format – its files are OPC packages built on top of Zip archives. However, some interesting work has already been done in this area by Microsoft’s Eric White in blog posts such as The Flat OPC Format, which points out that Microsoft Word™ (alone among the Office™ suite members [UPDATE: Word and PowerPoint can do this]) can already save in an unofficial flat format which can be processed with standards-based XML tools like XSLT processors.

Rather than having to rely on Word, or stick only to word processing documents, I thought it would be interesting to explore ways in which any OOXML document could be flattened and processed using standards-based processors. Ideally one would then also write a tool that did the opposite so that to work with OOXML content the steps would be first to flatten it, then to do the processing, and then to re-structify it into an OPC package.

Back to XProc

I have already written a number of blog posts on office document validation, and have used a variety of technical approaches to get the validation done. Most of my recent effort has been on developing the Office-o-tron, a hand-crafted Java application which functions primarily by unpacking archives to the file system before operating on their individual components. Earlier efforts using XProc had foundered on the difficulty of working with files inside a Zip archive, in particular because I was using the non-standard JAR URI scheme which, it turns out, is not capable of addressing items with certain names (e.g. “Object 1”) that one typically finds inside ODF documents.

However, the knowledge gained from developing Office-o-tron, together with a fresh look at the Zip-handling extension functions of the Calabash XProc processor, made me think there was a way XProc could be used to get the job done. Here’s how …

Inspecting an OPC package

OOXML documents are built using the Open Packaging Conventions (OPC, or ISO/IEC 29500-2), a generic means of building file formats within Zip archives which also happens to underpin the XPS format. OPC’s chief virtue – that it is very generic – is offset by much (probably too much) complexity in pursuit of this goal. Before we can know what we’ve got in an OPC package, and how to process it, some work needs to be done.

Fortunately, the essence of what we need consists of two pieces of information: a file inside the Zip guaranteed to be called “[Content_Types].xml”, and a manifest of the content of the package. XProc can get both of these pieces of information for us:

<?xml version="1.0"?>
<p:pipeline name="consolidate-officedoc"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:xo="http://xmlopen.org/officecert"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">

  <p:import href="extensions.xpl"/>

  <!-- specifies the document to be processed -->
  <p:option name="package-sysid" required="true"/>


  <!--
  
  Given the system identifier $package-sysid of an OOXML document,
  this pipeline returns a document whose root element is archive-info
  which contains two children: the [Content_Types].xml resource
  contained in the root of the archive, and a zipfile element
  created per the unzip step at:
  
  http://xmlcalabash.com/extension/steps/library-1.0.xpl
  
  -->
  <p:pipeline type="xo:archive-info">

    <p:option name="package-sysid" required="true"/>

    <cx:unzip name="content-types" file="[Content_Types].xml">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <cx:unzip name="archive-content">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <p:sink/>

    <p:wrap-sequence wrapper="archive-info">
      <p:input port="source">
        <p:pipe step="content-types" port="result"/>
        <p:pipe step="archive-content" port="result"/>
      </p:input>
    </p:wrap-sequence>

  </p:pipeline>

  <!-- get the type information and content of the package -->
  <xo:archive-info>
    <p:with-option name="package-sysid" select="$package-sysid"/>
  </xo:archive-info>

  <!-- etc -->

Executing this pipeline on a typical “HelloWorld.docx” file gives us an XML document which consists of a composite of our two vital pieces of information, as follows:

<archive-info>
  <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Override PartName="/word/comments.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"/>
    <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="xml" ContentType="application/xml"/>
    <Override PartName="/word/document.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
    <Override PartName="/word/styles.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
    <Override PartName="/docProps/app.xml"
      ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
    <Override PartName="/word/settings.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
    <Override PartName="/word/theme/theme1.xml"
      ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
    <Override PartName="/word/fontTable.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
    <Override PartName="/word/webSettings.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
    <Override PartName="/docProps/core.xml"
      ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
  </Types>
  <c:zipfile href="file:/C:/work/officecert/hello.docx">
    <c:file compressed-size="368" size="712" name="docProps/app.xml" date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="375" size="747" name="docProps/core.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="459" size="1004" name="word/comments.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="539" size="1218" name="word/document.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="407" size="1296" name="word/fontTable.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="651" size="1443" name="word/settings.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="1783" size="16891" name="word/styles.xml"
      date="2009-05-25T14:15:08.000+01:00"/>
    <c:file compressed-size="1686" size="6992" name="word/theme/theme1.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="187" size="260" name="word/webSettings.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="265" size="948" name="word/_rels/document.xml.rels"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="372" size="1443" name="[Content_Types].xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="243" size="590" name="_rels/.rels" date="1980-01-01T00:00:00.000Z"/>
  </c:zipfile>
</archive-info>

The purpose of the information in the Types element is to tell us the MIME types of the contents of the package, either specifically (in Override elements), or indirectly by associating a MIME type with file extensions (in Default elements). What we are now going to do is add another step to our pipeline that resolves all this information so that we label each of the items in the Zip file with the MIME type that applies to it.

 <p:xslt>
    <p:input port="stylesheet">
      <p:inline>
        <xsl:stylesheet version="2.0"
          xmlns:opc="http://schemas.openxmlformats.org/package/2006/content-types">

          <xsl:variable name="ooxml-mappings" select="document('ooxml-map.xml')"/>

          <xsl:template match="/">
            <c:zipfile>
              <xsl:copy-of select="/archive-info/c:zipfile/@*"/>
              <xsl:apply-templates/>
            </c:zipfile>
          </xsl:template>

          <xsl:template match="c:file">
            <xsl:variable name="entry-name" select="@name"/>
            <xsl:variable name="toks" select="tokenize($entry-name,'\.')"/>
            <xsl:variable name="ext" select="$toks[count($toks)]"/>
            <c:file>
              <xsl:copy-of select="@name"/>
              <xsl:variable name="overriden-type"
                select="//opc:Override[ends-with(@PartName,$entry-name)]/@ContentType"/>
              <xsl:variable name="default-type"
                select="//opc:Default[ends-with(@Extension,$ext)]/@ContentType"/>
              <xsl:variable name="resolved-type"
                select="if(string-length($overriden-type)) then $overriden-type else $default-type"/>
              <xsl:attribute name="resolved-type" select="$resolved-type"/>
              <xsl:attribute name="schema"
                select="$ooxml-mappings//mapping[mime-type=$resolved-type]/schema-name"/>
              <expand name="{@name}"/>
            </c:file>
          </xsl:template>

        </xsl:stylesheet>
      </p:inline>
    </p:input>
  </p:xslt>

You’ll notice I am also using an XML document called “ooxml-map.xml” as part of this enrichment process. This is a file which contains the (hard-won) information about which documents of which MIME types are governed by which of the schemas published as part of the OOXML standard. That document is available online here.
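The stylesheet only reveals how the mapping file is queried: it looks up mapping elements that have mime-type and schema-name children. As a rough sketch of what such a file could look like, here is a fragment built on that assumption. The root element name (mappings) is hypothetical, and the four entries are simply those that can be read off the enriched manifest for “hello.docx” in this post; this is not a copy of the real file.

```xml
<?xml version="1.0"?>
<!-- hypothetical root element; only mapping/mime-type/schema-name are implied by the XPath -->
<mappings>
  <mapping>
    <mime-type>application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml</mime-type>
    <schema-name>wml.xsd</schema-name>
  </mapping>
  <mapping>
    <mime-type>application/vnd.openxmlformats-officedocument.theme+xml</mime-type>
    <schema-name>dml-main.xsd</schema-name>
  </mapping>
  <mapping>
    <mime-type>application/vnd.openxmlformats-package.core-properties+xml</mime-type>
    <schema-name>opc-coreProperties.xsd</schema-name>
  </mapping>
  <mapping>
    <mime-type>application/vnd.openxmlformats-officedocument.extended-properties+xml</mime-type>
    <schema-name>shared-documentPropertiesExtended.xsd</schema-name>
  </mapping>
</mappings>
```

A MIME type with no mapping entry (the relationships parts, for instance) simply yields an empty schema attribute, which is what the enriched manifest shows.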

The result of running this additional step is to give us an enriched manifest of the OPC package content:

<c:zipfile xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:xo="http://xmlopen.org/officecert"
  xmlns:opc="http://schemas.openxmlformats.org/package/2006/content-types"
  href="file:/C:/work/officecert/hello.docx">
  <c:file name="docProps/app.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.extended-properties+xml"
    schema="shared-documentPropertiesExtended.xsd">
    <expand name="docProps/app.xml"/>
  </c:file>
  <c:file name="docProps/core.xml"
    resolved-type="application/vnd.openxmlformats-package.core-properties+xml"
    schema="opc-coreProperties.xsd">
    <expand name="docProps/core.xml"/>
  </c:file>
  <c:file name="word/comments.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"
    schema="wml.xsd">
    <expand name="word/comments.xml"/>
  </c:file>
  <c:file name="word/document.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
    schema="wml.xsd">
    <expand name="word/document.xml"/>
  </c:file>
  <c:file name="word/fontTable.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"
    schema="wml.xsd">
    <expand name="word/fontTable.xml"/>
  </c:file>
  <c:file name="word/settings.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"
    schema="wml.xsd">
    <expand name="word/settings.xml"/>
  </c:file>
  <c:file name="word/styles.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"
    schema="wml.xsd">
    <expand name="word/styles.xml"/>
  </c:file>
  <c:file name="word/theme/theme1.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.theme+xml"
    schema="dml-main.xsd">
    <expand name="word/theme/theme1.xml"/>
  </c:file>
  <c:file name="word/webSettings.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"
    schema="wml.xsd">
    <expand name="word/webSettings.xml"/>
  </c:file>
  <c:file name="word/_rels/document.xml.rels"
    resolved-type="application/vnd.openxmlformats-package.relationships+xml"
    schema="">
    <expand name="word/_rels/document.xml.rels"/>
  </c:file>
  <c:file name="[Content_Types].xml" resolved-type="application/xml" schema="">
    <expand name="[Content_Types].xml"/>
  </c:file>
  <c:file name="_rels/.rels"
    resolved-type="application/vnd.openxmlformats-package.relationships+xml"
    schema="">
    <expand name="_rels/.rels"/>
  </c:file>
</c:zipfile>

Also notice that each of the items has been given a child element called expand – this is a placeholder for the documents which we are going to expand in situ to create our flat representation of the OPC package content. The pipeline step to achieve that expansion is quite straightforward:

  <p:viewport name="archive-content" match="c:file[contains(@resolved-type,'xml')]/expand">
    <p:variable name="filename" select="/*/@name"/>
    <cx:unzip>
      <p:with-option name="href" select="$package-sysid"/>
      <p:with-option name="file" select="$filename"/>
    </cx:unzip>
  </p:viewport>

At this point, we're only expanding the content that looks like it is XML – a fuller implementation would expand non-XML content and BASE64 encode it (perfectly doable with XProc).
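For what it is worth, a sketch of that fuller implementation might look like the fragment below. It assumes that the Calabash cx:unzip step accepts a content-type option and returns a base64-encoded c:data element when that type is not an XML one; check this against the Calabash documentation before relying on it. The fragment also assumes it sits in the same pipeline as the viewport above, so that $package-sysid is in scope. The step name “binary-content” and the choice of application/octet-stream are illustrative only.

```xml
<!-- sketch: expand the non-XML parts in situ as base64 (behaviour of content-type is an assumption) -->
<p:viewport name="binary-content"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  match="c:file[not(contains(@resolved-type,'xml'))]/expand">

  <!-- inside the viewport the matched expand element is the document element -->
  <p:variable name="filename" select="/*/@name"/>

  <!-- a non-XML content type asks for the entry to be returned encoded rather than parsed -->
  <cx:unzip content-type="application/octet-stream">
    <p:with-option name="href" select="$package-sysid"/>
    <p:with-option name="file" select="$filename"/>
  </cx:unzip>

</p:viewport>
```

With the “hello.docx” example this viewport would match nothing, since every entry in that package resolves to an XML type; it only comes into play for packages carrying images or other binary parts.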

The result of applying this process is a rather large document, with all the expand elements referring to XML documents replaced by that XML document content … in other words, a flat OPC file. With the additional metadata we have placed on the containing c:file elements, we have enough information to start performing schema validation. I will look at validation in more depth in the next part of this post …

OOXML and Microsoft Office 2007 Conformance: a Smoke Test

by Alex Brown 28. March 2010 18:40

This is one in a series of popular blog articles I am re-publishing from the old Griffin Brown blog which is now closed down. This article is from April 2008. It is the same content as the original (except for some hyperlink freshening).

At the time of posting this entry caused quite a furore, even though its results were – to me anyway – as expected. Looking back I think what I wrote was largely correct, except I probably underestimated the difficulty of converting Microsoft Office to use the Strict variant of OOXML — this would require more than surgery just to the de-serialisation code!


 

I was excited to receive from Murata Makoto a set of the RELAX NG schemas for the (post-BRM) revision of OOXML, and thought it would be interesting to validate some real-world content against them, to get a rough idea of how non-conformant the standardisation of 29500 had made MS Office 2007.

Not having Office 2007 installed at work (our clients aren't using it – yet), the first problem is actually getting a reasonable sample for testing. Fortunately, the Ecma 376 specification itself is available for download from Ecma as a .docx file, and this hefty document is a reasonable basis for a smoke test ...

The main document ("document.xml") content for Part 4 of Ecma 376 weighs in at approx. 60MB of XML. Looking at it ... I'm sorry, but I'm not working on that size of document when it's spread across only two lines. Pretty-printing the thing makes it rather more usable, but pushes the file size up to around 100MB.

So we have a document and a RELAX NG schema. All that's necessary now is to use jing (or similar) and we can validate ...

Validating against the STRICT model

The STRICT conformance model is quite a bit different from Ecma 376, essentially because most of that format's most notorious features (non ISO dates, compatibility settings like autospacewotnot, VML, etc.) have been removed. Thus the expectation is that existing Office 2007 documents might be some distance away from being valid according to the strict schemas.

Sure enough, jing emitted 17MB of invalidity messages (around 122,000 of them) when validating in this scenario. Most of them seem to involve unrecognised attributes or attribute values: I would expect a document which exercised a wider range of features to generate a more diverse set of error messages.

Validating against the TRANSITIONAL model

The TRANSITIONAL conformance model is quite a bit closer to the original Ecma 376. Countries at the BRM (rather more than Ecma, as it happened) were very keen to keep compatibility with Ecma 376 and to preserve XML structures at which legacy Office features could be targeted. The expectation is therefore that an MS Office 2007 document should be pretty close to valid according to the TRANSITIONAL schema.

Sure enough (again), the result is as expected: relatively few messages (84) are emitted, and they are all of the same type, complaining for example about the element:

<m:degHide m:val="on"/>

since the allowed attribute values for val are now "true", "false", etc.; this was one of the many tidying-up exercises performed at the BRM.
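To give a feel for how mechanical this particular class of error is, here is a hedged sketch of an XSLT that rewrites the old values. It assumes that "on" and "off" map to "true" and "false" respectively, and that no other m:val attribute legitimately carries the literal values "on" or "off"; neither assumption has been checked against the schemas. It is an illustration, not something that was run as part of this smoke test.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">

  <!-- identity template: copy everything through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- assumed mapping: "on" becomes "true" -->
  <xsl:template match="@m:val[. = 'on']">
    <xsl:attribute name="m:val">true</xsl:attribute>
  </xsl:template>

  <!-- assumed mapping: "off" becomes "false" -->
  <xsl:template match="@m:val[. = 'off']">
    <xsl:attribute name="m:val">false</xsl:attribute>
  </xsl:template>

</xsl:stylesheet>
```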

Conclusions?

Such a test is only indicative, of course, but a few tentative conclusions can be drawn:

  • Word documents generated by today's version of MS Office 2007 do not conform to ISO/IEC 29500
  • Making them conform to the STRICT schema is going to require some surgery to the (de)serialisation code of the application
  • Making them conform to the TRANSITIONAL schema will require less of the same sort of surgery (since they're quite close to conformant as-is)

Given Microsoft's proven ability to tinker with the Office XML file format between service packs, I am hoping that MS Office will shortly be brought into line with the 29500 specification, and will stay that way. Indeed, a strong motivation for approving 29500 as an ISO/IEC standard was to discourage Microsoft from this kind of file format rug-pulling stunt in future.

What's next?

To repeat the exercise with ISO/IEC 26300:2006 (ODF 1.0) and a popular implementation of OpenDocument. Will anybody be brave enough to predict what kind of result that exercise will have?

ODF Forensics

by Alex Brown 14. June 2009 19:35

The other day I had a phone call from Michiel Leenaars of the NLnet Foundation, who is busy gearing up for this week's ODF plugfest in The Hague. Michiel had seen my blog post on ODF validation using pipelines, and was interested in whether I could come up with something quick and dirty for providing forensic information about pairs of ODF documents, so they could be assessed before and after they are used by a tool. This could help users check if anything has been incorrectly added or taken away during a round-trip. Here's what I came up with …

Reaching for XProc

Yes, again I am going to use an XProc pipeline to do the processing. The basic plan of attack is:

  1. take two documents
  2. generate a “fingerprint” for each of them
  3. compare the fingerprints
  4. display the result in a meaningful, human-readable form

Fingerprinting XML

For a basic comparison between documents I chose simply to compare the elements used, and the number of them. This obviously leaves out quite a bit which might also be compared (attributes, text, etc.) – but it is a useful smoke test of whether major structures have been added or lost during a round-trip.

So the overall pipeline will look like this:

<?xml version="1.0"?>
<pipeline name="get-opc-rels" xmlns="http://www.w3.org/ns/xproc"
  xmlns:xo="http://xmlopen.org/odf-fingerprint"
  xmlns:mf="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
  xmlns:cx="http://xmlcalabash.com/ns/extensions">

  <import href="extensions.xpl"/>

  <!-- the URLs of the ODF documents to be processed -->
  <option name="package-a" required="true"/>
  <option name="package-b" required="true"/>

  <!-- get the first fingerprint ... -->
  <xo:make-fingerprint name="finger-a">
    <with-option name="package-url" select="$package-a"/>
  </xo:make-fingerprint>

  <!-- ... and the second ... -->
  <xo:make-fingerprint name="finger-b">
    <with-option name="package-url" select="$package-b"/>
  </xo:make-fingerprint>

  <!-- combine them into a single document for input into an XSLT -->
  <wrap-sequence wrapper="fingerprint-pair">
    <input port="source">
      <pipe step="finger-a" port="result"/>
      <pipe step="finger-b" port="result"/>
    </input>
  </wrap-sequence>

  <!-- style into an HTML report of differences -->
  <xslt name="transform-to-html">
    <input port="stylesheet">
      <document href="style-diffs.xsl"/>
    </input>
  </xslt>

</pipeline>

A number of things are of note:

  • The ODF packages are interrogated using the JAR URI mechanism I described here
  • We’re using a custom step <xo:make-fingerprint>, which takes as its input the URL of an ODF document (“package-url”), and which emits a “fingerprint” as an XML document. Obviously this step is not built into any XProc processor, so we’ll need to write it ourselves
  • We’re using XProc’s wrap-sequence step to combine the two “fingerprint” documents into a single document
  • We’ll be relying on an XSLT transform to turn this combined document into a nice report, which will be the end result of the pipeline.

Writing the fingerprinting pipeline

To define our custom pipeline <xo:make-fingerprint> we simply author a new pipeline, and give it the type “xo:make-fingerprint”. This can then be invoked as a step. Here’s what this sub-pipeline looks like:

<pipeline type="xo:make-fingerprint">

  <!-- the URL of the ODF file to fingerprint -->
  <option name="package-url" required="true"/>

  <!-- load its manifest -->
  <load>
    <with-option name="href" select="concat('jar:',$package-url,'!/META-INF/manifest.xml')"/>
  </load>

  <!-- visit each entry in the manifest which refs an XML resource -->
  <viewport name="handle"
    match="mf:file-entry[@mf:media-type='text/xml'
    and not(starts-with(@mf:full-path,'META-INF'))]">

    <cx:message message="Loading item ..."/>

    <!-- load the entry -->
    <load name="load-item">
      <with-option name="href" select="concat('jar:',$package-url,'!/',/*/@mf:full-path)"/>
    </load>

    <!-- accumulate everything in a <wrapper> document -->
    <wrap-sequence wrapper="wrapper">
      <input port="source">
        <pipe step="load-item" port="result"/>
      </input>
    </wrap-sequence>

  </viewport>

  <!-- transform the accumulated mass into a fingerprint -->
  <xslt name="transform-to-fingerprint">
    <input port="stylesheet">
      <document href="make-fingerprint.xsl"/>
    </input>
  </xslt>

  <!-- label it with the package URL, as an attribute on the root element-->
  <add-attribute match="/*" attribute-name="package-url">
    <with-option name="attribute-value" select="$package-url"/>
  </add-attribute>

</pipeline>

Things to notice here:

  • We iterate through the ODF manifest looking for XML documents
  • All of the XML in the entire package is retrieved and combined into a single mega-document wrapped in an element named <wrapper>
  • We’re relying on an XSLT transform, “make-fingerprint.xsl”, to do the heavy lifting and turn our mega-document into a meaningful (and smaller) “fingerprint” document
  • We add the URL of the ODF document to the fingerprint using XProc’s nifty add-attribute step

The Heavy Lifting: XSLT

The XSLT to boil a document down into a fingerprint can be seen here. What it produces is a summary of the elements used in each of the namespaces the document mentions. This extract gives a flavour of the kind of result it produces:

<namespace name="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
  <element name="file-entry" count="1"/>
</namespace>
<namespace name="urn:oasis:names:tc:opendocument:xmlns:meta:1.0">
  <element name="generator" count="1"/>
</namespace>
<namespace name="urn:oasis:names:tc:opendocument:xmlns:office:1.0">
  <element name="automatic-styles" count="1"/>
  <element name="body" count="1"/>
  <element name="document-content" count="1"/>
  <element name="document-meta" count="1"/>
  <element name="document-styles" count="1"/>
  <element name="font-face-decls" count="2"/>
  <element name="meta" count="1"/>
  <element name="spreadsheet" count="1"/>
  <element name="styles" count="1"/>
</namespace>
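For readers who cannot follow the link, here is a minimal sketch of how an XSLT 2.0 stylesheet could produce output of this shape, using nested grouping. It is not the “make-fingerprint.xsl” used in the pipeline. The root element name (fingerprint) is an assumption, and the sketch simply skips elements in no namespace, such as the <wrapper> elements added by the pipeline.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <!-- hypothetical root element name -->
    <fingerprint>
      <!-- outer grouping: one <namespace> per distinct namespace URI in use -->
      <xsl:for-each-group select="//*[namespace-uri() != '']" group-by="namespace-uri()">
        <xsl:sort select="current-grouping-key()"/>
        <namespace name="{current-grouping-key()}">
          <!-- inner grouping: one <element> per distinct local name, with its tally -->
          <xsl:for-each-group select="current-group()" group-by="local-name()">
            <xsl:sort select="current-grouping-key()"/>
            <element name="{current-grouping-key()}" count="{count(current-group())}"/>
          </xsl:for-each-group>
        </namespace>
      </xsl:for-each-group>
    </fingerprint>
  </xsl:template>

</xsl:stylesheet>
```

Sorting both levels by name keeps the output stable, which makes two fingerprints easier to compare by eye as well as by machine.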

Returning now to our main pipeline, we can see it makes two calls to (or should that be “sucks on”) the sub-pipeline to generate two fingerprints. These are then wrapped with a wrap-sequence step, and we have all we need to generate the final report. Again, an XSLT transform is used to do a comparison operation and the result is emitted as an HTML document intended for human consumption. An example of what this looks like (comparing the OpenOffice and Google Docs versions of Maya’s wedding planner) can be found here.
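The comparison itself can be sketched along the following lines. This is not “style-diffs.xsl”: it emits a plain XML list of differences rather than HTML, and the output vocabulary (differences, difference, count-a, count-b) is made up for illustration. It assumes each fingerprint is a child of fingerprint-pair carrying a package-url attribute, with namespace and element children shaped like the extract above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/fingerprint-pair">
    <!-- the two fingerprints, in the order they were wrapped -->
    <xsl:variable name="a" select="*[1]"/>
    <xsl:variable name="b" select="*[2]"/>

    <differences a="{$a/@package-url}" b="{$b/@package-url}">
      <!-- visit every namespace/element combination seen in either fingerprint, once -->
      <xsl:for-each-group select="*/namespace/element"
        group-by="concat(../@name, '|', @name)">
        <xsl:sort select="current-grouping-key()"/>
        <xsl:variable name="ns" select="../@name"/>
        <xsl:variable name="local" select="@name"/>

        <!-- sum() yields 0 when a fingerprint lacks the element altogether -->
        <xsl:variable name="count-a"
          select="sum($a/namespace[@name = $ns]/element[@name = $local]/@count)"/>
        <xsl:variable name="count-b"
          select="sum($b/namespace[@name = $ns]/element[@name = $local]/@count)"/>

        <!-- report only the combinations whose tallies differ -->
        <xsl:if test="$count-a != $count-b">
          <difference namespace="{$ns}" element="{$local}"
            count-a="{$count-a}" count-b="{$count-b}"/>
        </xsl:if>
      </xsl:for-each-group>
    </differences>
  </xsl:template>

</xsl:stylesheet>
```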

Putting it to use

The results of this process need to be interpreted on a case-by-case basis. That two applications represent notionally the same document with different XML is not necessarily a fault (though I’d like to know why Maya’s Wedding Planner has 2,000 spreadsheet cells according to Google Docs, and only 51 according to OO.o).

The most useful application of this pipeline is to check for untoward data loss when a document is processed by an application – and I understand this is a particular concern of the Dutch government. With this in mind it is possible to take this pipeline further still, checking attribute differences and even textual differences. There comes a point when diff'ing XML, though, at which it is best to use a specialist tool such as the excellent DeltaXML (I have no association with this product, except knowing it is well-respected through its use among clients). Many an unsuspecting programmer has come to grief under-estimating the complexities of comparing XML documents.

Online ODF Validation

Michiel also asked whether it would be possible to make the ODF validation pipeline I blogged about previously available as an online service. Coincidentally this was something I was working on anyway, though using Java rather than XML pipelines. The result is now available here. Enjoy …

Notes on Document Conformance and Portability #4

by Alex Brown 16. May 2009 19:50

In my last post I wrote about the reaction to Microsoft's ODF support in the recent service pack released for their Office 2007 product, and in particular how claims of its "non-conformance" seemed ill-founded. Now, to look a little deeper at the conformance question, I will use an XML pipeline to validate some would-be ODF documents, to get a clear-sighted and spin-free look at what the state of ODF conformance really is.

XML Pipelines: The Next Big Thing

For many years pipelines have been recognised as something the XML community badly needed. Eager markup geeks would seek out Sean McGrath or Uche Ogbuji to hear miraculous tales of how XML pipelines could be put to work; some bold experimenters would try to coerce technologies like Apache Ant into action, and some pioneers would even specify and implement their own pipelining languages – witness, for example, Eric van der Vlist's xvif, or maybe XPL, which happily sits at the heart of the awesome Orbeon Forms framework.

Now however, the W3C is on the cusp of finalising its XProc language and this looks set to bring pipelines into the mainstream. I am convinced that XProc is the most significant specification from the W3C since XSL, and fully expect it to become as pervasive in all XML shops.

So what are pipelines? Well, as we know XML processing models can be described as conforming to the model: "in; out; shake it all about". The "in" bit is catered for by XML storage technologies (eXist maybe), and the "out" bit is catered for by web servers; XProc is for the "shake it all about" bit, where, with XSLT, it will become the engine of many an XML process. XSLT is great for transforms but less convenient for a number of day-to-day things we routinely want to do with XML: validating, stripping elements, renaming attributes, glomming together, splitting up ... Essentially, pipelines are for doing stuff to XML in a step-by-step way, but without the overhead of a full-on programming language, since XProc pipelines are written using nice, declarative XML.
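To make that concrete, here is about the smallest pipeline worth writing: two standard XProc steps chained together, the output of the first feeding the second. The element and attribute names being matched (comment, id, ref) are invented for the sake of the example.

```xml
<?xml version="1.0"?>
<pipeline xmlns="http://www.w3.org/ns/xproc" version="1.0">

  <!-- strip every <comment> element (hypothetical element name) -->
  <delete match="comment"/>

  <!-- rename every id attribute to ref (hypothetical attribute names) -->
  <rename match="@id" new-name="ref"/>

</pipeline>
```

Each step reads the result of the one before it by default, so there is no plumbing to write for a simple linear flow like this.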

Pipelines and Office Documents

One of these typical "day to day" tasks is validating XML inside ZIPs. Neither ODF nor OOXML resources are simply XML documents; they are "packages" (ZIP archives) of content which include several XML documents. So to perform a full validation, we need to visit the XML resources in the package and validate them all against their governing schemas to get an overall validation result. This is exactly the sort of scenario where XML pipelines can help.

A Walk Through

I am going to describe an XML pipeline for performing ODF validation using Calabash, a FOSS (GPL v2) implementation of XProc for the JVM written by Norm Walsh (the XProc WG chair). I'm not going to cover the absolute basics; for those (and more) consult some of the excellent material on XProc already appearing on the web.

We start, immediately after the root element, with a couple of "option" elements. These allow values to be passed in from the outside. In our case, we need the name of the package we want to validate ...

<?xml version="1.0"?>
<pipeline name="validate-odf" xmlns="http://www.w3.org/ns/xproc"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:mf="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
  xmlns:o="urn:oasis:names:tc:opendocument:xmlns:office:1.0">

  <!-- the URL of the package to be validated must be supplied by the caller -->
  <option name="package-url" required="true"/>

  <!-- whether to enforce use of the ISO/IEC 26300 schema -->
  <option name="force-26300-validation" select="'false'"/>

Next we import some extensions. Like XSLT, XProc is designed to be extensible and already additional sets of functions are becoming available. Calabash ships with a handy function for ZIP extraction which we are going to need.

  <!-- we use the Calabash extension in this library for looking inside ZIP files -->
  <import href="extensions.xpl"/>

Now we start the processing proper. This next step uses the ZIP extraction mechanism to pull the "manifest.xml" document out of the archive and outputs that XML for onward processing.

  <!-- emits the package manifest -->
  <cx:unzip file="META-INF/manifest.xml">
    <with-option name="href" select="$package-url"/>
  </cx:unzip>

As a sanity check, we are going to make sure that this manifest actually conforms to the ODF manifest schema. I made this schema by manually extracting it from the ODF 1.1 specification (here referred to as "odf-manifest.rng"). As you can see, XProc makes this kind of document validation a cinch:

  <!-- validate the manifest against the manifest schema -->
  <cx:message message="Validating manifest ..."/>
  <validate-with-relax-ng assert-valid="false">    
    <input port="schema">
      <document href="odf-manifest.rng"/>
    </input>
  </validate-with-relax-ng>

[Update: I have added an @assert-valid="false" attribute here, as this is just a 'sanity check']

Now we start to visit the individual documents in the package referenced by the manifest. This is done here using the viewport step, which offers a kind of "keyhole surgery" option allowing us to isolate bits of a document. Here we're interested in all the <file-entry> elements in the manifest which (1) have a media type of "text/xml" and (2) aren't residing in the "META-INF" folder itself.

  <!-- visit each file entry in the manifest which targets an XML resource -->
  <viewport name="handle"
    match="mf:file-entry[@mf:media-type='text/xml'
    and not(starts-with(@mf:full-path,'META-INF'))]">

For each of these <file-entry> elements, a @full-path attribute specifies the name of an XML resource in the ZIP; again we use the unzip step to pull each of these XML documents from the archive:

    <!-- assume paths are relative to package base, and extract the XML resource -->
    <cx:unzip name="get-validation-candidate">
      <with-option name="href" select="$package-url"/>
      <with-option name="file" select="/*/@mf:full-path"/>
    </cx:unzip>
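
For readers who want to check the selection logic outside XProc, here is a minimal Python sketch of the same two steps: filter the manifest's file entries, then read each resource from the ZIP. It is not part of the pipeline. The function names are hypothetical, and it assumes the "mf" prefix is bound to the standard ODF manifest namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: this is the namespace the pipeline binds to the "mf" prefix.
MF = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def q(local_name):
    """Build an ElementTree qualified name in the manifest namespace."""
    return "{%s}%s" % (MF, local_name)


def xml_entries(package):
    """Return the full-path of every text/xml entry outside META-INF."""
    with zipfile.ZipFile(package) as zf:
        manifest = ET.fromstring(zf.read("META-INF/manifest.xml"))
    paths = []
    for entry in manifest.iter(q("file-entry")):
        media_type = entry.get(q("media-type"))
        full_path = entry.get(q("full-path"), "")
        # Same two conditions as the viewport's match expression.
        if media_type == "text/xml" and not full_path.startswith("META-INF"):
            paths.append(full_path)
    return paths


def extract(package, full_path):
    """Read one resource, treating the path as relative to the package root."""
    with zipfile.ZipFile(package) as zf:
        return zf.read(full_path)
```

Like the pipeline, the sketch simply trusts the manifest: an entry that names a file missing from the archive would raise an error at extraction time.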

Once we've grabbed an XML resource, we need to work out which schema to use to validate it. Generally this can be done by looking at a @version attribute on the root element. However, ODF does not make this mandatory and so implementations are free to omit it. ODF specifies no fall-back rules, so we need to invent our own. What I've done here is to use the version specified, but fall back to the most recent published standard (1.1) when it is not specified.

    <!-- emits the schema RELAX NG that corresponds to the ODF version -->
    <choose name="get-relax-ng-schema">
      <when test="$force-26300-validation='true' or /*/@o:version='1.0'">
        <cx:message message="Validating with v1.0 schema ..."/>
        <load href="OpenDocument-schema-v1.0.rng"/>
      </when>
      <when test="/*/@o:version='1.2'">
        <cx:message message="Validating with draft v1.2 schema ..."/>
        <load href="OpenDocument-schema-v1.2-cd01-rev05.rng"/>
      </when>
      <otherwise>
        <cx:message message="Validating with v1.1 schema ..."/>        
        <load href="OpenDocument-schema-v1.1.rng"/>        
      </otherwise>
    </choose>
    <identity name="the-schema"/>
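
The fall-back rule is easy to get subtly wrong, so here is the same decision restated as a small Python sketch. The function names are hypothetical, the schema file names are the ones used in the pipeline, and the sketch assumes the "o" prefix is bound to the standard ODF office namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: this is the namespace the pipeline binds to the "o" prefix.
OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def declared_version(xml_bytes):
    """Return the version attribute of the root element, or None if absent."""
    root = ET.fromstring(xml_bytes)
    return root.get("{%s}version" % OFFICE)


def choose_schema(version, force_26300=False):
    """Mirror the three branches of the choose step."""
    if force_26300 or version == "1.0":
        return "OpenDocument-schema-v1.0.rng"
    if version == "1.2":
        return "OpenDocument-schema-v1.2-cd01-rev05.rng"
    # Missing or unrecognised version: fall back to the 1.1 schema.
    return "OpenDocument-schema-v1.1.rng"
```

Written this way, one consequence of the rule becomes obvious: a document declaring a version the pipeline does not recognise is validated against 1.1, exactly as if it had declared nothing.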

So now we have the document to validate, and the schema to use. We simply need to apply one to the other:

    <!-- and: validates the candidate against the schema -->
    <validate-with-relax-ng>
      <input port="schema">
        <pipe step="the-schema" port="result"/>
      </input>
      <input port="source">
        <pipe step="get-validation-candidate" port="result"/>
      </input>
    </validate-with-relax-ng>

  </viewport>
</pipeline>

Et voilà, a complete pipeline for validating ODF instances. Running it against packages which contain invalid XML will cause the pipeline processor to halt and report a dynamic error, for that is the default behaviour of the validate-with-relax-ng step.

Since ODF is clear that invalid XML signals non-conformance to the spec, we know that any package which fails this pipeline is, beyond argument, non-conformant.

Running It

Rob Weir helpfully provided a ZIP of the spreadsheets used for his Maya's Wedding Planner piece. Consult his blog entry for details of how these documents were produced. Putting these seven test files through our pipeline, we get this result:

Producer                   FAIL    PASS
---------------------------------------
Google                      X
KSpread                              X
Symphony                    X
OpenOffice                  ? *
Sun Plugin                  ? *
CleverAge                            X
MS Office 2007 SP2                   X
---------------------------------------
* See update below

So, Why the Failures?

  • Google failed because for some bizarre reason the manifest.xml document in its package specified a document type declaration referring to a non-existent "Manifest.dtd"; the processor cannot find this DTD and aborts with an IO Exception.
  • Symphony failed because its styles.xml document contained a date-value of "0-00-00". This fails to match the datatyping rules the ODF 1.1 schema uses to police date values.
  • OpenOffice failed because its manifest was not valid to the 1.1 schema. Now, this is an odd result as the manifest claims to be valid to version "1.2" of the ODF schema, yet consulting the latest drafts of ODF 1.2 it appears the manifest schema is not defined there, but is planned to be specified in a new "Part 3" of ODF. I cannot find Part 3 of ODF in draft – maybe the OOo code has been written, but the standards text not fitted to it yet. If somebody can point me to a public draft of this schema, I'd like to re-run this test. [Update: I have now been pointed at the draft of Part 3 of ODF 1.2, and it does indeed contain a new schema. This draft is unfinished and contains no conformance clause, so it is not really possible to know for sure whether a package conforms to it. However, the OOo package here is invalid to the schema. I am going to assume that Part 3 will mirror the draft of Part 1 of ODF 1.2, and so will require schema validity. On that (reasonable) basis this OOo package is non-conformant; but of course the draft might change tomorrow. We do not know quite what version of the spec is being targeted here ...]
  • The Sun Plugin also failed because its manifest uses a @manifest:version attribute which the 1.1 schema does not declare. Again, maybe this is valid to some draft schema I have not seen, but it certainly does not conform to any published version of ODF. As above, if I can get a new schema I can re-run the test. [Update: see bullet above, it's the same here]
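
To see why a value like "0-00-00" is rejected, here is a rough Python sketch of the lexical rules for an XSD date. It is a simplified approximation, not the datatype library the validator actually uses. It assumes the attribute is checked as a plain date (the ODF schema may also allow a date-time form, which is ignored here), and the function name is hypothetical.

```python
import re

# Simplified lexical pattern for an XSD date: optional minus sign, at least
# four year digits, two-digit month and day, optional timezone.
_XSD_DATE = re.compile(r"-?(\d{4,})-(\d{2})-(\d{2})(Z|[+-]\d{2}:\d{2})?")


def looks_like_xsd_date(value):
    """Approximate check; does not verify the day against the month length."""
    m = _XSD_DATE.fullmatch(value)
    if m is None:
        return False
    year, month, day = int(m.group(1)), int(m.group(2)), int(m.group(3))
    return year != 0 and 1 <= month <= 12 and 1 <= day <= 31
```

Under these rules "0-00-00" fails on every count: the year has too few digits, and neither month nor day can be zero.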

Conclusions

There has been a lot of spin in the blogosphere about who is, and who is not, supporting ODF at the moment. This validation test focusses on a small but important area of that discussion: conformance. One of the reasons it is important is that it is testable. From the test above we have the hard fact that most of the mainstream ODF applications are failing to emit standards-conformant ODF, even for a case as simple as "Maya's Wedding Planner". Surprisingly, when assessing conformance, it appears that KOffice, Microsoft and CleverAge are leading the conformance pack, while Sun, Google and IBM have fallen behind.

To me this merely goes to confirm one of the fundamental dynamics of standardisation: done right, standards wrench "ownership" from those who thought they owned them, and distribute that ownership through the community at large. We, as users, should be applauding the widening adoption of ODF, and should be keeping the pressure on those vendors that seem to have been left behind to raise their games.

Legal

The author's views contained in this weblog are his, and not necessarily of any organisation. Third-party contributions are the responsibility of the contributor.

This weblog’s written content is governed by a Creative Commons Licence.
