
Rethinking OOXML Validation, Part 1

[Image: the Brussels ODF Plugfest venue]

At the recent ODF Plugfest in Brussels, I was very interested to hear Jos van den Oever of KOffice present on how ODF’s alternative “flat” document format could be used to drive browser-based rendering of ODF documents. ODF defines two methods of serializing documents: one uses multiple files in a “Zip” archive; the other, the aforementioned “flat” format, combines everything into a single XML file. Seeing this approach in action gelled with some thoughts I’d been having on how better to validate OOXML documents using standards-based XML tools …

Unlike ODF, OOXML has no “flat” file format – its files are OPC packages built on top of Zip archives. However, some interesting work has already been done in this area by Microsoft’s Eric White in such blog posts as The Flat OPC Format, which points out that Microsoft Word™ (alone among the Office™ suite members [UPDATE: Word and PowerPoint can do this]) can already save to an unofficial flat format which can be processed with standards-based XML tools such as XSLT processors.

Rather than having to rely on Word, or stick only to word processing documents, I thought it would be interesting to explore ways in which any OOXML document could be flattened and processed using standards-based processors. Ideally one would then also write a tool that did the opposite, so that working with OOXML content would be a matter of flattening the package, doing the processing, and then “re-structifying” the result back into an OPC package.

Back to XProc

I have already written a number of blog posts on office document validation, and have used a variety of technical approaches to get the validation done. Most of my recent effort has been on developing the Office-o-tron, a hand-crafted Java application which works primarily by unpacking archives to the file system before operating on their individual components. Earlier efforts using XProc had foundered on the difficulty of working with files inside a Zip archive — in particular because I was using the non-standard JAR URI scheme which, it turns out, is not capable of addressing items with certain names (e.g. “Object 1”) that one typically finds inside ODF documents.

However, armed with knowledge gained from developing Office-o-tron, and looking again at the Zip-handling extension steps of the Calabash XProc processor, I began to think there was a way XProc could be used to get the job done. Here’s how …

Inspecting an OPC package

OOXML documents are built using the Open Packaging Convention (OPC, or ISO/IEC 29500-2), a generic means of building file formats within Zip archives which also happens to underpin the XPS format. OPC’s chief virtue – that it is very generic – is offset by much (probably too much) complexity in pursuit of this goal. Before we can know what we’ve got in an OPC package, and how to process it, some work needs to be done.

Fortunately, the essence of what we need consists of two pieces of information: a file inside the Zip guaranteed to be called “[Content_Types].xml”, and a manifest of the content of the package. XProc can get both of these pieces of information for us:

<?xml version="1.0"?>
<p:pipeline name="consolidate-officedoc"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:xo="http://xmlopen.org/officecert"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">

  <p:import href="extensions.xpl"/>

  <!-- specifies the document to be processed -->
  <p:option name="package-sysid" required="true"/>


  <!--
  
  Given the system identifier $package-sysid of an OOXML document,
  this pipeline returns a document whose root element is archive-info
  which contains two children: the [Content_Types].xml resource
  contained in the root of the archive, and a zipfile element
  created per the unzip step at:
  
  http://xmlcalabash.com/extension/steps/library-1.0.xpl
  
  -->
  <p:pipeline type="xo:archive-info">

    <p:option name="package-sysid" required="true"/>

    <cx:unzip name="content-types" file="[Content_Types].xml">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <cx:unzip name="archive-content">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <p:sink/>

    <p:wrap-sequence wrapper="archive-info">
      <p:input port="source">
        <p:pipe step="content-types" port="result"/>
        <p:pipe step="archive-content" port="result"/>
      </p:input>
    </p:wrap-sequence>

  </p:pipeline>

  <!-- get the type information and content of the package -->
  <xo:archive-info>
    <p:with-option name="package-sysid" select="$package-sysid"/>
  </xo:archive-info>

  <!-- etc -->

Executing this pipeline on a typical “HelloWorld.docx” file gives us an XML document which consists of a composite of our two vital pieces of information, as follows:

<archive-info>
  <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Override PartName="/word/comments.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"/>
    <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="xml" ContentType="application/xml"/>
    <Override PartName="/word/document.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
    <Override PartName="/word/styles.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
    <Override PartName="/docProps/app.xml"
      ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
    <Override PartName="/word/settings.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
    <Override PartName="/word/theme/theme1.xml"
      ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
    <Override PartName="/word/fontTable.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
    <Override PartName="/word/webSettings.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
    <Override PartName="/docProps/core.xml"
      ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
  </Types>
  <c:zipfile href="file:/C:/work/officecert/hello.docx">
    <c:file compressed-size="368" size="712" name="docProps/app.xml" date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="375" size="747" name="docProps/core.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="459" size="1004" name="word/comments.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="539" size="1218" name="word/document.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="407" size="1296" name="word/fontTable.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="651" size="1443" name="word/settings.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="1783" size="16891" name="word/styles.xml"
      date="2009-05-25T14:15:08.000+01:00"/>
    <c:file compressed-size="1686" size="6992" name="word/theme/theme1.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="187" size="260" name="word/webSettings.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="265" size="948" name="word/_rels/document.xml.rels"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="372" size="1443" name="[Content_Types].xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="243" size="590" name="_rels/.rels" date="1980-01-01T00:00:00.000Z"/>
  </c:zipfile>
</archive-info>

The purpose of the information in the Types element is to tell us the MIME types of the contents of the package, either specifically (in Override elements), or indirectly by associating a MIME type with file extensions (in Default elements). What we are now going to do is add another step to our pipeline that resolves all this information so that we label each of the items in the Zip file with the MIME type that applies to it.

 <p:xslt>
    <p:input port="stylesheet">
      <p:inline>
        <xsl:stylesheet version="2.0"
          xmlns:opc="http://schemas.openxmlformats.org/package/2006/content-types">

          <xsl:variable name="ooxml-mappings" select="document('ooxml-map.xml')"/>

          <xsl:template match="/">
            <c:zipfile>
              <xsl:copy-of select="/archive-info/c:zipfile/@*"/>
              <xsl:apply-templates/>
            </c:zipfile>
          </xsl:template>

          <xsl:template match="c:file">
            <xsl:variable name="entry-name" select="@name"/>
            <xsl:variable name="toks" select="tokenize($entry-name,'\.')"/>
            <xsl:variable name="ext" select="$toks[count($toks)]"/>
            <c:file>
              <xsl:copy-of select="@name"/>
              <xsl:variable name="overriden-type"
                select="//opc:Override[ends-with(@PartName,$entry-name)]/@ContentType"/>
              <xsl:variable name="default-type"
                select="//opc:Default[ends-with(@Extension,$ext)]/@ContentType"/>
              <xsl:variable name="resolved-type"
                select="if(string-length($overriden-type)) then $overriden-type else $default-type"/>
              <xsl:attribute name="resolved-type" select="$resolved-type"/>
              <xsl:attribute name="schema"
                select="$ooxml-mappings//mapping[mime-type=$resolved-type]/schema-name"/>
              <expand name="{@name}"/>
            </c:file>
          </xsl:template>

        </xsl:stylesheet>
      </p:inline>
    </p:input>
  </p:xslt>

You’ll notice I am also using an XML document called “ooxml-map.xml” as part of this enrichment process. This file contains the (hard-won) information about which MIME types are governed by which of the schemas published as part of the OOXML standard. That document is available online here.
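
For orientation, the shape of that mapping file can be inferred from the XPath used in the stylesheet above: it holds mapping elements, each with mime-type and schema-name children. A fragment might look like the following (the root element name here is my guess, and only a few entries are shown):

<mappings>
  <mapping>
    <mime-type>application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml</mime-type>
    <schema-name>wml.xsd</schema-name>
  </mapping>
  <mapping>
    <mime-type>application/vnd.openxmlformats-officedocument.theme+xml</mime-type>
    <schema-name>dml-main.xsd</schema-name>
  </mapping>
  <mapping>
    <mime-type>application/vnd.openxmlformats-package.core-properties+xml</mime-type>
    <schema-name>opc-coreProperties.xsd</schema-name>
  </mapping>
  <!-- … and so on, one mapping per schema-governed MIME type in ISO/IEC 29500 … -->
</mappings>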

The result of running this additional step is to give us an enriched manifest of the OPC package content:

<c:zipfile xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:xo="http://xmlopen.org/officecert"
  xmlns:opc="http://schemas.openxmlformats.org/package/2006/content-types"
  href="file:/C:/work/officecert/hello.docx">
  <c:file name="docProps/app.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.extended-properties+xml"
    schema="shared-documentPropertiesExtended.xsd">
    <expand name="docProps/app.xml"/>
  </c:file>
  <c:file name="docProps/core.xml"
    resolved-type="application/vnd.openxmlformats-package.core-properties+xml"
    schema="opc-coreProperties.xsd">
    <expand name="docProps/core.xml"/>
  </c:file>
  <c:file name="word/comments.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"
    schema="wml.xsd">
    <expand name="word/comments.xml"/>
  </c:file>
  <c:file name="word/document.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
    schema="wml.xsd">
    <expand name="word/document.xml"/>
  </c:file>
  <c:file name="word/fontTable.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"
    schema="wml.xsd">
    <expand name="word/fontTable.xml"/>
  </c:file>
  <c:file name="word/settings.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"
    schema="wml.xsd">
    <expand name="word/settings.xml"/>
  </c:file>
  <c:file name="word/styles.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"
    schema="wml.xsd">
    <expand name="word/styles.xml"/>
  </c:file>
  <c:file name="word/theme/theme1.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.theme+xml"
    schema="dml-main.xsd">
    <expand name="word/theme/theme1.xml"/>
  </c:file>
  <c:file name="word/webSettings.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"
    schema="wml.xsd">
    <expand name="word/webSettings.xml"/>
  </c:file>
  <c:file name="word/_rels/document.xml.rels"
    resolved-type="application/vnd.openxmlformats-package.relationships+xml"
    schema="">
    <expand name="word/_rels/document.xml.rels"/>
  </c:file>
  <c:file name="[Content_Types].xml" resolved-type="application/xml" schema="">
    <expand name="[Content_Types].xml"/>
  </c:file>
  <c:file name="_rels/.rels"
    resolved-type="application/vnd.openxmlformats-package.relationships+xml"
    schema="">
    <expand name="_rels/.rels"/>
  </c:file>
</c:zipfile>

Also notice that each of the items has been given a child element called expand – this is a placeholder for the documents which we are going to expand in situ to create our flat representation of the OPC package content. The pipeline step to achieve that expansion is quite straightforward:

  <p:viewport name="archive-content" match="c:file[contains(@resolved-type,'xml')]/expand">
    <p:variable name="filename" select="/*/@name"/>
    <cx:unzip>
      <p:with-option name="href" select="$package-sysid"/>
      <p:with-option name="file" select="$filename"/>
    </cx:unzip>
  </p:viewport>

At this point, we're only expanding the content that looks like it is XML – a fuller implementation would also expand non-XML content and Base64-encode it (perfectly doable with XProc).
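
As a speculative sketch of that fuller implementation (not part of the pipeline above), a second viewport could target the remaining entries. This assumes Calabash's unzip step accepts a content-type option and returns non-XML data Base64-encoded in a c:data element, as its extension step library suggests:

  <p:viewport name="binary-content" match="c:file[not(contains(@resolved-type,'xml'))]/expand">
    <p:variable name="filename" select="/*/@name"/>
    <cx:unzip>
      <p:with-option name="href" select="$package-sysid"/>
      <p:with-option name="file" select="$filename"/>
      <!-- a non-XML content type should yield Base64-encoded c:data -->
      <p:with-option name="content-type" select="'application/octet-stream'"/>
    </cx:unzip>
  </p:viewport>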

The result of applying this process is a rather large document, with all the expand elements referring to XML documents replaced by that XML document content … in other words, a flat OPC file. With the additional metadata we have placed on the containing c:file elements, we have enough information to start performing schema validation. I will look at validation in more depth in the next part of this post …
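
To give a rough idea of where this is heading (the details belong in part 2), a fragment along the following lines, dropped into the main pipeline, would run W3C XML Schema validation over each expanded part using the schema named in its @schema attribute. It assumes the 29500 schema files sit alongside the pipeline document; with assert-valid="true" the pipeline halts on the first invalid part, and a p:try/p:catch wrapper would be needed to collect errors instead:

  <p:for-each name="validate-parts">
    <p:iteration-source select="//c:file[@schema != '']"/>
    <p:variable name="schema-name" select="/*/@schema"/>

    <!-- load the governing W3C XML Schema for this part -->
    <p:load name="load-schema">
      <p:with-option name="href" select="$schema-name"/>
    </p:load>

    <!-- pull out the expanded part (the element child of c:file) -->
    <p:filter select="/*/*">
      <p:input port="source">
        <p:pipe step="validate-parts" port="current"/>
      </p:input>
    </p:filter>

    <!-- validate it; halts with an error on the first invalid part -->
    <p:validate-with-xml-schema assert-valid="true">
      <p:input port="schema">
        <p:pipe step="load-schema" port="result"/>
      </p:input>
    </p:validate-with-xml-schema>
  </p:for-each>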

OOXML and Microsoft Office 2007 Conformance: a Smoke Test


This is one in a series of popular blog articles I am re-publishing from the old Griffin Brown blog which is now closed down. This article is from April 2008. It is the same content as the original (except for some hyperlink freshening).

At the time of posting, this entry caused quite a furore, even though its results were – to me anyway – as expected. Looking back I think what I wrote was largely correct, except that I probably underestimated the difficulty of converting Microsoft Office to use the Strict variant of OOXML — this would require more than just surgery to the de-serialisation code!


 

I was excited to receive from Murata Makoto a set of the RELAX NG schemas for the (post-BRM) revision of OOXML, and thought it would be interesting to validate some real-world content against them, to get a rough idea of how non-conformant the standardisation of 29500 had made MS Office 2007.

Since I don't have Office 2007 installed at work (our clients aren't using it – yet), the first problem is actually getting a reasonable sample for testing. Fortunately, the Ecma 376 specification itself is available for download from Ecma as a .docx file, and this hefty document is a reasonable basis for a smoke test ...

The main document ("document.xml") content for Part 4 of Ecma 376 weighs in at approx. 60MB of XML. Looking at it ... I'm sorry, but I'm not working on that size of document when it's spread across only two lines. Pretty-printing the thing makes it rather more usable, but pushes the file size up to around 100MB.
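
One way to do that pretty-printing – not necessarily the way it was done for this test – is a plain identity transform with indented output:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- copy everything, letting the serializer add the indentation -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>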

So we have a document and a RELAX NG schema. All that's necessary now is to use jing (or similar) and we can validate ...
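
For those who prefer pipelines to the command line, a one-step XProc pipeline can do much the same job. This is only a sketch, with a placeholder schema name; note that with assert-valid="true" it stops at the first error, whereas jing reports every one it finds:

<?xml version="1.0"?>
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- validate the pipeline's source document against a RELAX NG schema -->
  <p:validate-with-relax-ng assert-valid="true">
    <p:input port="schema">
      <p:document href="wml.rng"/>
    </p:input>
  </p:validate-with-relax-ng>
</p:pipeline>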

Validating against the STRICT model

The STRICT conformance model is quite a bit different from Ecma 376, essentially because that format's most notorious features (non-ISO dates, compatibility settings like autospacewotnot, VML, etc.) have been removed. Thus the expectation is that existing Office 2007 documents might be some distance away from being valid according to the strict schemas.

Sure enough, jing emitted 17MB of invalidity messages (around 122,000 of them) when validating in this scenario. Most of them seem to involve unrecognised attributes or attribute values: I would expect a document which exercised a wider range of features to generate a more diverse set of error messages.

Validating against the TRANSITIONAL model

The TRANSITIONAL conformance model is quite a bit closer to the original Ecma 376. Countries at the BRM (rather more than Ecma, as it happened) were very keen to keep compatibility with Ecma 376 and to preserve XML structures at which legacy Office features could be targeted. The expectation is therefore that an MS Office 2007 document should be pretty close to valid according to the TRANSITIONAL schema.

Sure enough (again) the result is as expected: relatively few messages (84) are emitted, and they are all of the same type, complaining for example about the element:

<m:degHide m:val="on"/>

since the allowed attribute values for val are now "true", "false", etc. — this was one of the many tidying-up exercises performed at the BRM.

Conclusions?

Such a test is only indicative, of course, but a few tentative conclusions can be drawn:

  • Word documents generated by today's version of MS Office 2007 do not conform to ISO/IEC 29500
  • Making them conform to the STRICT schema is going to require some surgery to the (de)serialisation code of the application
  • Making them conform to the TRANSITIONAL schema will require less of the same sort of surgery (since they're quite close to conformant as-is)

Given Microsoft's proven ability to tinker with the Office XML file format between service packs, I am hoping that MS Office will shortly be brought into line with the 29500 specification, and will stay that way. Indeed, a strong motivation for approving 29500 as an ISO/IEC standard was to discourage Microsoft from this kind of file format rug-pulling stunt in future.

What's next?

To repeat the exercise with ISO/IEC 26300:2006 (ODF 1.0) and a popular implementation of OpenDocument. Will anybody be brave enough to predict what kind of result that exercise will have?

SC 34 WG meetings in Paris last week

[Image: the croissants of AFNOR]

Last week I was in Paris for a stimulating week of meetings of ISO/IEC JTC 1/SC 34 WGs, and as the year draws to a close it seems an opportune time to take the temperature of our XML standards space and look ahead to where we may be going next.

WG 1 (Schema languages)

WG 1 can be thought of as tending to the foundations upon which other SC 34 Standards are built - and of these foundations perhaps none is more important than RELAX NG, the schema language of many key XML technologies including ODF, DocBook and the forthcoming MathML 3.0 language. WG 1 discussed a number of potential enhancements to RELAX NG, settling on a modest but useful set which will enhance the language in response to user feedback.

A proposed new schema language for cross-reference validation (including ID/IDREF checking) was also discussed; the question here is whether to have something simple and quick (that addresses the ID/IDREF validation of RELAX NG, say), or whether to develop a more fully-featured language capable of meeting challenges like cross-document cross-reference checking in an OOXML or ODF package. It seems as if WG 1 is strongly inclining towards the latter.

Other work centred on proposing changes to clean up the unreasonable licensing restrictions which apply to "freely-available" ISO/IEC standards made available by the ITTF: the click-through licence here is obviously out-of-date, and wording is needed to attach to schemas so that they can be used on more liberal, FOSS-friendly terms. (I mentioned this before in this blog entry).

WG 4 (OOXML)

WG 4 had a full agenda. One item of business requiring immediate attention was the resolution of comments accompanying the just-voted-on set of DCOR ballots. These had received wide support from the National Bodies, though it was disappointing to see that the two NBs who had voted to disapprove had not sent delegates to the meeting. P-members are obliged both to vote on ballots and to attend meetings in SCs, so these nations (Brazil and Malaysia are the countries in question) are not properly honouring their obligations as laid down in the JTC 1 Directives:

3.1.1 P-members of JTC 1 and its SCs have an obligation to take an active part in the work of JTC 1 or the SC and to attend meetings.

I note with approval the hard line taken by the ITTF, who have just forcibly demoted 18 JTC 1 P-members who had become inactive.

Nevertheless, all comments received were resolved and the set of corrigenda will now go forward to publication, making a significant start to cleaning up the OOXML standard.

S/T

The other big topic facing WG 4 was the thorny problem of what has come to be called "Strict v Transitional" – in other words, deciding on some strategy for dealing with these two variants of the 29500 Standard.

The UK has a clear consensus on the purpose of the two formats. Transitional (aka "T") is (in the UK view) a format for representing the existing legacy of documents in the field (and those which continue to be created by many systems today); no more, and no less. Strict (aka "S") is viewed as the proper place for future innovation around OOXML.

Progress on this topic is (for me) frustratingly slow – ah! the perils of the consensus-forming process – but some pathways are beginning to become visible in the swirling mists. In particular there seems to be a mood to issue a statement that the core schemas of T are to be frozen, and that any dangerous features (such as the date representation option blogged about by WG 4 experts Gareth Horton and Jesper Lund Stocholm) be removed from T.

This will go some way to clarifying for users what to expect when dealing with a 29500-conformant document. However, I foresee work ahead to clarify matters still further, since within the two variants (Strict and Transitional) there are many sub-variants which users will need to know about. In particular, the extensibility mechanism of OOXML (MCE) allows additional structures to be introduced into documents. And so, is a "Transitional" (or "Strict") document:

  • Unextended?
  • Extended, but with only standardized extensions?
  • Extended, but with proprietary extensions?
  • Extended in a backwards-compatible way relative to the core Standard?
  • Extended in a backwards-incompatible way?

I expect WG 4 will need to work on conformance classes and content labelling mechanisms (a logo programme?) to enable implementers to convey with precision what kind of OOXML documents they can consume and emit, and for procurers to specify with precision what they want to procure.

WG 5 (Document interop)

WG 5 continues its work with TR 29166, Open Document Format (ISO/IEC 26300) / Office Open XML (ISO/IEC 29500) Translation Guidelines, setting out the high-level differences between the ISO versions of the OOXML and ODF formats. I attended to hear about a Korean idea for a new work item focussed on the use of the clipboard as an interchange mechanism.

This is interesting because the clipboard presents some particular challenges for implementers. What happens (for example) when a user selects content for copying which does not correspond to well-formed XML (from the middle of one paragraph to the middle of another)? I am interested in seeing exactly what work the Koreans will propose in this space ...

WG 6 (ODF)

Although I had registered for the WG 6 meeting, I had to take the Eurostar home on Thursday and so attempted to participate in Friday's WG 6 meeting by Skype (as much as rather intermittent wi-fi connectivity would allow).

From what I heard of it, the meeting was constructive and business-like, sorting out various items of administrivia and turning attention to the ongoing work of maintaining ISO/IEC 26300 (the International Standard version of ODF).

To this end, it is heartening to see the wheels finally creak into motion:

  • The first ever set of corrigenda to ISO/IEC 26300 has now gone to ballot
  • A second set is on the way, once a mechanism has been agreed for re-wording those bits of the Standard which are unimplementable
  • A new defect report from the UK was considered (many of these comments have already been addressed within OASIS, and so fixes are known)

Most significant of all is the work to align the ISO version of ODF with the current OASIS standard so that ISO/IEC 26300 and ODF 1.1 are technically equivalent. The National Bodies present reiterated a consensus that this was desirable (better, by far, than withdrawing ISO/IEC 26300 as a defunct standard) and are looking forward to the amendment project. The world will, then, have an ISO/IEC version of ODF which is relevant to the marketplace while waiting for a possible ISO/IEC version of ODF 1.2 – as even with a fair wind this is still around two years away from being published as an International Standard.

Records

I'll update this entry with links to documents as they become available. To start with, here are some informal records: :-)

 

[Image: nightcap]

ODF Forensics

The other day I had a phone call from Michiel Leenaars of the NLnet Foundation, who is busy gearing up for this week's ODF plugfest in the Hague. Michiel had seen my blog post on ODF validation using pipelines, and was interested in whether I could come up with something quick and dirty for providing forensic information about pairs of ODF documents, so they could be assessed before and after they are used by a tool. This could help users check if anything has been incorrectly added or taken away during a round-trip. Here's what I came up with …

Reaching for XProc

Yes, once again I am going to use an XProc pipeline to do the processing. The basic plan of attack is:

  1. take two documents
  2. generate a “fingerprint” for each of them
  3. compare the fingerprints
  4. display the result in a meaningful, human-readable form

Fingerprinting XML

For a basic comparison between documents I chose simply to compare the elements used, and the number of them. This obviously leaves out quite a bit which might also be compared (attributes, text, etc.) – but it is a useful smoke test of whether major structures have been added or lost during a round-trip.

So the overall pipeline will look like this:

<?xml version="1.0"?>
<pipeline name="get-opc-rels" xmlns="http://www.w3.org/ns/xproc"
xmlns:xo="http://xmlopen.org/odf-fingerprint"
xmlns:mf="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
xmlns:cx="http://xmlcalabash.com/ns/extensions">
<import href="extensions.xpl"/>
<!-- the URLs of the ODF documents to be processed -->
<option name="package-a" required="true"/>
<option name="package-b" required="true"/>
<!-- get the first fingerprint ... -->
<xo:make-fingerprint name="finger-a">
<with-option name="package-url" select="$package-a"/>
</xo:make-fingerprint>
<!-- ... and the second ... -->
<xo:make-fingerprint name="finger-b">
<with-option name="package-url" select="$package-b"/>
</xo:make-fingerprint>
<!-- combine them into a single document for input into an XSLT -->
<wrap-sequence wrapper="fingerprint-pair">
<input port="source">
<pipe step="finger-a" port="result"/>
<pipe step="finger-b" port="result"/>
</input>
</wrap-sequence>
<!-- style into an HTML report of differences -->
<xslt name="transform-to-html">
<input port="stylesheet">
<document href="style-diffs.xsl"/>
</input>
</xslt>
</pipeline>

A number of things are of note:

  • The ODF packages are interrogated using the JAR URI mechanism I described here
  • We’re using a custom step <xo:make-fingerprint>, which takes as its input the URL of an ODF document (“package-url”), and which emits a “fingerprint” as an XML document. Obviously this step is not built into any XProc processor, so we’ll need to write it ourselves
  • We’re using XProc’s wrap-sequence step to combine the two “fingerprint” documents into a single document
  • We’ll be relying on an XSLT transform to turn this combined document into a nice report, which will be the end result of the pipeline.

Writing the fingerprinting pipeline

To define our custom pipeline <xo:make-fingerprint> we simply author a new pipeline, and give it the type “xo:make-fingerprint”. This can then be invoked as a step. Here’s what this sub-pipeline looks like:

<pipeline type="xo:make-fingerprint">
<!-- the URL of the ODF file to fingerprint -->
<option name="package-url" required="true"/>
<!-- load its manifest -->
<load>
<with-option name="href" select="concat('jar:',$package-url,'!/META-INF/manifest.xml')"/>
</load>
<!-- visit each entry in the manifest which refs an XML resource -->
<viewport name="handle"
match="mf:file-entry[@mf:media-type='text/xml'
and not(starts-with(@mf:full-path,'META-INF'))]">
<cx:message message="Loading item ..."/>
<!-- load the entry -->
<load name="load-item">
<with-option name="href" select="concat('jar:',$package-url,'!/',/*/@mf:full-path)"/>
</load>
<!-- accumulate everything in a <wrapper> document -->
<wrap-sequence wrapper="wrapper">
<input port="source">
<pipe step="load-item" port="result"/>
</input>
</wrap-sequence>
</viewport>
<!-- transform the accumulated mass into a fingerprint -->
<xslt name="transform-to-fingerprint">
<input port="stylesheet">
<document href="make-fingerprint.xsl"/>
</input>
</xslt>
<!-- label it with the package URL, as an attribute on the root element-->
<add-attribute match="/*" attribute-name="package-url">
<with-option name="attribute-value" select="$package-url"/>
</add-attribute>
</pipeline>

Things to notice here:

  • We iterate through the ODF manifest looking for XML documents
  • All of the XML in the entire package is retrieved and combined into a single mega-document wrapped in an element named <wrapper>
  • We’re relying on an XSLT transform, “make-fingerprint.xsl”, to do the heavy lifting and turn our mega-document into a meaningful (and smaller) “fingerprint” document
  • We add the URL of the ODF document to the fingerprint using XProc’s nifty add-attribute step

The Heavy Lifting: XSLT

The XSLT to boil a document down into a fingerprint can be seen here. What it produces is a summary of the elements used in each of the namespaces the document mentions. This extract gives a flavour of the kind of result it produces:

<namespace name="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
<element name="file-entry" count="1"/>
</namespace>
<namespace name="urn:oasis:names:tc:opendocument:xmlns:meta:1.0">
<element name="generator" count="1"/>
</namespace>
<namespace name="urn:oasis:names:tc:opendocument:xmlns:office:1.0">
<element name="automatic-styles" count="1"/> 
<element name="body" count="1"/>
<element name="document-content" count="1"/> 
<element name="document-meta" count="1"/>
<element name="document-styles" count="1"/>
<element name="font-face-decls" count="2"/>
<element name="meta" count="1"/>
<element name="spreadsheet" count="1"/>
<element name="styles" count="1"/>
</namespace>
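
A stylesheet producing that shape of output need not be long. The sketch below is my own reconstruction (the root element name and other details of the real make-fingerprint.xsl may well differ): it groups every element in the combined <wrapper> document first by namespace and then by local name, emitting a count for each:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <fingerprint>
      <!-- group all elements by namespace, then by local name, with counts -->
      <xsl:for-each-group select="//*" group-by="namespace-uri()">
        <xsl:sort select="current-grouping-key()"/>
        <namespace name="{current-grouping-key()}">
          <xsl:for-each-group select="current-group()" group-by="local-name()">
            <xsl:sort select="current-grouping-key()"/>
            <element name="{current-grouping-key()}" count="{count(current-group())}"/>
          </xsl:for-each-group>
        </namespace>
      </xsl:for-each-group>
    </fingerprint>
  </xsl:template>
</xsl:stylesheet>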

Returning now to our main pipeline, we can see it makes two calls to (or should that be “sucks on”) the sub-pipeline to generate two fingerprints. These are then wrapped with a wrap-sequence step, and we have all we need to generate the final report. Again, an XSLT transform is used to do a comparison operation and the result is emitted as an HTML document intended for human consumption. An example of what this looks like (comparing the OpenOffice and Google Docs versions of Maya’s wedding planner) can be found here.

Putting it to use

The results of this process need to be interpreted on a case-by-case basis. That two applications represent notionally the same document with different XML is not necessarily a fault (though I’d like to know why Maya’s Wedding Planner has 2,000 spreadsheet cells according to Google Docs, and only 51 according to OO.o).

The most useful application of this pipeline is to check for untoward data loss when a document is processed by an application – and I understand this is a particular concern of the Dutch government. With this in mind it is possible to take this pipeline further still, checking attribute differences and even textual differences. There comes a point when diff'ing XML, though, at which it is best to use a specialist tool such as the excellent DeltaXML (I have no association with this product, beyond knowing from its use among clients that it is well-respected). Many an unsuspecting programmer has come to grief under-estimating the complexities of comparing XML documents.

Online ODF Validation

Michiel also asked whether it would be possible to make the ODF validation pipeline I blogged about previously available as an online service. Coincidentally this was something I was working on anyway, though using Java rather than XML pipelines. The result is now available here. Enjoy …

No one supports ISO ODF today?

IBM employee and ODF TC co-chair Rob Weir’s latest blog entry seeks to rebut what he terms a “disinformation campaign being waged against ODF”. The writing is curiously disjointed, and while at first I thought Rob’s famously fluent pen had been constipated by his distaste at having to descend further into the ad hominem gutter, on closer inspection I think it is perhaps a tell of Rob’s discomfort about his own past statements.

In particular, Rob takes issue with a statement that he condemns as “Microsoft FUD […] laundered via intermediaries”:

There is no software that currently implements ODF as approved by the ISO

Now Rob Weir is a great blogger, a much-praised committee chair, and somebody who can, on occasion, fearlessly produce the blunt truth like a rabbit from a hat. For this reason, I know his blog entry “Toy Soldiers” of July 2008 has enjoyed quite some exposure in standards meetings around the world, most particularly for its assertions about ODF. He wrote:

  1. No one supports ODF 1.0 today. All of the major vendors have moved on to ODF 1.1, and will be moving on to ODF 1.2 soon.
  2. No one supports OOXML 1.0 today, not even Microsoft.
  3. No one supports interoperability via translation, not Sun in their Plugin, not Novell in their OOXML support, and not Microsoft in their announced ODF support in Office 2007 SP2.

While the anti-MS line here represents the kind of robust corporate knockabout stuff we might expect, it is Rob’s statement that “no one supports ODF 1.0 today. All of the major vendors have moved on […]” which has particularly resonated with users. A pronouncement on adoption from a committee chair about his own committee’s standard is significant. And naturally, it has deeply concerned some of the National Bodies who have adopted ODF 1.0 (which is ISO/IEC 26300) as a National standard, and who now find they have adopted something which, apparently, “no one supports”.

So, far from being “Microsoft FUD”, the idea that “No one supports ODF 1.0” is in fact Rob Weir’s own statement. And it was taken up and repeated by Andy Updegrove, Groklaw and Boycott Novell, those well-known vehicles of Microsoft’s corporate will.

Today, however, this appears to have become an inconvenient truth. The rabbit that was pulled out of the hat in the interest of last summer’s spin now needs to be put into the boiler. Consequently we find Rob’s blog entry of July 2008 has been silently amended so that it now states:

  1. Few applications today support exclusively ODF 1.0 and only ODF 1.0. Most of the major vendors also support ODF 1.1, one (OpenOffice 3.x), now supports draft ODF 1.2 as well.
  2. No one supports OOXML 1.0 today, not even Microsoft.
  3. No one supports interoperability via translation, not Sun in their Plugin, not Novell in their OOXML support, and not Microsoft in their announced ODF support in Office 2007 SP2.

The pertinent change is to item 1 on this list, which now has a weasel-worded (and tellingly tautological) assertion that might make the unsuspecting reader think that ODF 1.0 was somehow supported by the major vendors. Well, is it? Who is right, the Rob Weir of 2008 or the Rob Weir of 2009? Maybe I’ve missed something, but personally I’m unaware of an upsurge in ODF 1.0 support during the last 11 months. My money is on the former Rob being right here.

Okay, I use OpenOffice.org 1.1.5 (despite its Secunia level 4 advisory) out of a kind of loyalty to ISO/IEC 26300 (ODF 1.0), but I’m often teased about being the only person on the planet who must be doing this, and onlookers wonder what the .sxw (etc.) files I produce really are.

Blog Etiquette

As a general rule, when making substantive retrospective changes to blog entries, especially controversial blog entries, it is honest dealing to draw attention to this by striking-through removed text and prominently labelling the new text as “updated”. Failing to do this can lead to the suspicion that an attempt to re-write history is underway …