Rethinking OOXML Validation, Part 1

by Alex Brown 4. November 2010 15:09
Brussels ODF Plugfest venue

At the recent ODF Plugfest in Brussels, I was very interested to hear Jos van den Oever of KOffice present on how ODF’s alternative “flat” document format could be used to drive browser-based rendering of ODF documents. ODF defines two methods of serializing documents: one uses multiple files in a “Zip” archive; the other, the aforementioned “flat” format, combines everything into a single XML file. Seeing this approach in action gelled with some thoughts I’d been having on how better to validate OOXML documents using standards-based XML tools …

Unlike ODF, OOXML has no “flat” file format – its files are OPC packages built on top of Zip archives. However, some interesting work has already been done in this area by Microsoft’s Eric White in such blog posts as The Flat OPC Format, which points out that Microsoft Word™ (alone among the Office™ suite members [UPDATE: Word and PowerPoint can do this]) can already save in an unofficial flat format which can be processed with standards-based XML tools like XSLT processors.

Rather than having to rely on Word, or stick only to word processing documents, I thought it would be interesting to explore ways in which any OOXML document could be flattened and processed using standards-based processors. Ideally one would then also write a tool that did the opposite, so that working with OOXML content would mean first flattening it, then doing the processing, and then re-structifying it into an OPC package.

Back to XProc

I have already written a number of blog posts on office document validation, and have used a variety of technical approaches to get the validation done. Most of my recent effort has been on developing the Office-o-tron, a hand-crafted Java application which functions primarily by unpacking archives to the file system before operating on their individual components. Earlier efforts using XProc had foundered on the difficulty of working with files inside a Zip archive — in particular because I was using the non-standard JAR URI scheme which, it turns out, is not capable of addressing items with certain names (e.g. “Object 1”) that one typically finds inside ODF documents.

However, the knowledge gained from developing Office-o-tron, together with another look at the Zip-handling extension functions of the Calabash XProc processor, made me think there was a way XProc could be used to get the job done. Here’s how …

Inspecting an OPC package

OOXML documents are built using the Open Packaging Convention (OPC, or ISO/IEC 29500-2), a generic means of building file formats within Zip archives which also happens to underpin the XPS format. OPC’s chief virtue – that it is very generic – is offset by much (probably too much) complexity in pursuit of this goal. Before we can know what we’ve got in an OPC package, and how to process it, some work needs to be done.

Fortunately, the essence of what we need consists of two pieces of information: a file inside the Zip guaranteed to be called “[Content_Types].xml”, and a manifest of the content of the package. XProc can get both of these pieces of information for us:

<?xml version="1.0"?>
<p:pipeline name="consolidate-officedoc"
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:xo="http://xmlopen.org/officecert"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">

  <p:import href="extensions.xpl"/>

  <!-- specifies the document to be processed -->
  <p:option name="package-sysid" required="true"/>


  <!--
  
  Given the system identifier $package-sysid of an OOXML document,
  this pipeline returns a document whose root element is archive-info
  which contains two children: the [Content_Types].xml resource
  contained in the root of the archive, and a zipfile element
  created per the unzip step at:
  
  http://xmlcalabash.com/extension/steps/library-1.0.xpl
  
  -->
  <p:pipeline type="xo:archive-info">

    <p:option name="package-sysid" required="true"/>

    <cx:unzip name="content-types" file="[Content_Types].xml">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <cx:unzip name="archive-content">
      <p:with-option name="href" select="$package-sysid"/>
    </cx:unzip>

    <p:sink/>

    <p:wrap-sequence wrapper="archive-info">
      <p:input port="source">
        <p:pipe step="content-types" port="result"/>
        <p:pipe step="archive-content" port="result"/>
      </p:input>
    </p:wrap-sequence>

  </p:pipeline>

  <!-- get the type information and content of the package -->
  <xo:archive-info>
    <p:with-option name="package-sysid" select="$package-sysid"/>
  </xo:archive-info>

  <!-- etc -->

Executing this pipeline on a typical “HelloWorld.docx” file gives us an XML document which consists of a composite of our two vital pieces of information, as follows:

<archive-info>
  <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Override PartName="/word/comments.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"/>
    <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="xml" ContentType="application/xml"/>
    <Override PartName="/word/document.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
    <Override PartName="/word/styles.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
    <Override PartName="/docProps/app.xml"
      ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
    <Override PartName="/word/settings.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
    <Override PartName="/word/theme/theme1.xml"
      ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
    <Override PartName="/word/fontTable.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
    <Override PartName="/word/webSettings.xml"
      ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
    <Override PartName="/docProps/core.xml"
      ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
  </Types>
  <c:zipfile href="file:/C:/work/officecert/hello.docx">
    <c:file compressed-size="368" size="712" name="docProps/app.xml" date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="375" size="747" name="docProps/core.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="459" size="1004" name="word/comments.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="539" size="1218" name="word/document.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="407" size="1296" name="word/fontTable.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="651" size="1443" name="word/settings.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="1783" size="16891" name="word/styles.xml"
      date="2009-05-25T14:15:08.000+01:00"/>
    <c:file compressed-size="1686" size="6992" name="word/theme/theme1.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="187" size="260" name="word/webSettings.xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="265" size="948" name="word/_rels/document.xml.rels"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="372" size="1443" name="[Content_Types].xml"
      date="1980-01-01T00:00:00.000Z"/>
    <c:file compressed-size="243" size="590" name="_rels/.rels" date="1980-01-01T00:00:00.000Z"/>
  </c:zipfile>
</archive-info>

The purpose of the information in the Types element is to tell us the MIME types of the contents of the package, either specifically (in Override elements), or indirectly by associating a MIME type with file extensions (in Default elements). What we are now going to do is add another step to our pipeline that resolves all this information so that we label each of the items in the Zip file with the MIME type that applies to it.

 <p:xslt>
    <p:input port="stylesheet">
      <p:inline>
        <xsl:stylesheet version="2.0"
          xmlns:opc="http://schemas.openxmlformats.org/package/2006/content-types">

          <xsl:variable name="ooxml-mappings" select="document('ooxml-map.xml')"/>

          <xsl:template match="/">
            <c:zipfile>
              <xsl:copy-of select="/archive-info/c:zipfile/@*"/>
              <xsl:apply-templates/>
            </c:zipfile>
          </xsl:template>

          <xsl:template match="c:file">
            <xsl:variable name="entry-name" select="@name"/>
            <xsl:variable name="toks" select="tokenize($entry-name,'\.')"/>
            <xsl:variable name="ext" select="$toks[count($toks)]"/>
            <c:file>
              <xsl:copy-of select="@name"/>
              <xsl:variable name="overriden-type"
                select="//opc:Override[ends-with(@PartName,$entry-name)]/@ContentType"/>
              <xsl:variable name="default-type"
                select="//opc:Default[ends-with(@Extension,$ext)]/@ContentType"/>
              <xsl:variable name="resolved-type"
                select="if(string-length($overriden-type)) then $overriden-type else $default-type"/>
              <xsl:attribute name="resolved-type" select="$resolved-type"/>
              <xsl:attribute name="schema"
                select="$ooxml-mappings//mapping[mime-type=$resolved-type]/schema-name"/>
              <expand name="{@name}"/>
            </c:file>
          </xsl:template>

        </xsl:stylesheet>
      </p:inline>
    </p:input>
  </p:xslt>

You’ll notice I am also using an XML document called “ooxml-map.xml” as part of this enrichment process. This is a file which contains the (hard-won) information about which MIME types are governed by which of the schemas published as part of the OOXML standard. That document is available online here.
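For readers who don’t follow the link, the XPath in the stylesheet above (mapping elements with mime-type and schema-name children) implies a structure something like the sketch below. The element names mapping, mime-type and schema-name are taken from that XPath; the root element name (“mappings”) is my guess, and the real file contains many more entries:

<!-- a sketch of the shape ooxml-map.xml must have for the stylesheet's
     XPath to work; the root element name is a guess, and the real file
     has one entry per schema-governed MIME type -->
<mappings>
  <mapping>
    <mime-type>application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml</mime-type>
    <schema-name>wml.xsd</schema-name>
  </mapping>
  <mapping>
    <mime-type>application/vnd.openxmlformats-officedocument.theme+xml</mime-type>
    <schema-name>dml-main.xsd</schema-name>
  </mapping>
  <!-- … further mappings … -->
</mappings>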

The result of running this additional step is to give us an enriched manifest of the OPC package content:

<c:zipfile xmlns:c="http://www.w3.org/ns/xproc-step"
  xmlns:cx="http://xmlcalabash.com/ns/extensions"
  xmlns:xo="http://xmlopen.org/officecert"
  xmlns:opc="http://schemas.openxmlformats.org/package/2006/content-types"
  href="file:/C:/work/officecert/hello.docx">
  <c:file name="docProps/app.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.extended-properties+xml"
    schema="shared-documentPropertiesExtended.xsd">
    <expand name="docProps/app.xml"/>
  </c:file>
  <c:file name="docProps/core.xml"
    resolved-type="application/vnd.openxmlformats-package.core-properties+xml"
    schema="opc-coreProperties.xsd">
    <expand name="docProps/core.xml"/>
  </c:file>
  <c:file name="word/comments.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"
    schema="wml.xsd">
    <expand name="word/comments.xml"/>
  </c:file>
  <c:file name="word/document.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
    schema="wml.xsd">
    <expand name="word/document.xml"/>
  </c:file>
  <c:file name="word/fontTable.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"
    schema="wml.xsd">
    <expand name="word/fontTable.xml"/>
  </c:file>
  <c:file name="word/settings.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"
    schema="wml.xsd">
    <expand name="word/settings.xml"/>
  </c:file>
  <c:file name="word/styles.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"
    schema="wml.xsd">
    <expand name="word/styles.xml"/>
  </c:file>
  <c:file name="word/theme/theme1.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.theme+xml"
    schema="dml-main.xsd">
    <expand name="word/theme/theme1.xml"/>
  </c:file>
  <c:file name="word/webSettings.xml"
    resolved-type="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"
    schema="wml.xsd">
    <expand name="word/webSettings.xml"/>
  </c:file>
  <c:file name="word/_rels/document.xml.rels"
    resolved-type="application/vnd.openxmlformats-package.relationships+xml"
    schema="">
    <expand name="word/_rels/document.xml.rels"/>
  </c:file>
  <c:file name="[Content_Types].xml" resolved-type="application/xml" schema="">
    <expand name="[Content_Types].xml"/>
  </c:file>
  <c:file name="_rels/.rels"
    resolved-type="application/vnd.openxmlformats-package.relationships+xml"
    schema="">
    <expand name="_rels/.rels"/>
  </c:file>
</c:zipfile>

Also notice that each of the items has been given a child element called expand – this is a placeholder for the documents which we are going to expand in situ to create our flat representation of the OPC package content. The pipeline step to achieve that expansion is quite straightforward:

  <p:viewport name="archive-content" match="c:file[contains(@resolved-type,'xml')]/expand">
    <p:variable name="filename" select="/*/@name"/>
    <cx:unzip>
      <p:with-option name="href" select="$package-sysid"/>
      <p:with-option name="file" select="$filename"/>
    </cx:unzip>
  </p:viewport>

At this point, we're only expanding the content that looks like it is XML – a fuller implementation would expand non-XML content and BASE64 encode it (perfectly doable with XProc).
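As a rough illustration only – and assuming, as I recall but have not re-checked against the Calabash documentation, that cx:unzip accepts a content-type option and returns non-XML content base64-encoded in a c:data element – such an expansion might look like this:

  <!-- sketch only: expand the non-XML entries as well, on the assumption
       that asking cx:unzip for a non-XML content type yields the entry
       as base64-encoded c:data -->
  <p:viewport match="c:file[not(contains(@resolved-type,'xml'))]/expand">
    <p:variable name="filename" select="/*/@name"/>
    <cx:unzip>
      <p:with-option name="href" select="$package-sysid"/>
      <p:with-option name="file" select="$filename"/>
      <p:with-option name="content-type" select="'application/octet-stream'"/>
    </cx:unzip>
  </p:viewport>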

The result of applying this process is a rather large document, with all the expand elements referring to XML documents replaced by that XML document content … in other words, a flat OPC file. With the additional metadata we have placed on the containing c:file elements, we have enough information to start performing schema validation. I will look at validation in more depth in the next part of this post …

OOXML and Microsoft Office 2007 Conformance: a Smoke Test

by Alex Brown 28. March 2010 18:40

This is one in a series of popular blog articles I am re-publishing from the old Griffin Brown blog which is now closed down. This article is from April 2008. It is the same content as the original (except for some hyperlink freshening).

At the time of posting, this entry caused quite a furore, even though its results were – to me anyway – as expected. Looking back I think what I wrote was largely correct, except I probably underestimated the difficulty of converting Microsoft Office to use the Strict variant of OOXML — this would require surgery to more than just the de-serialisation code!


 

I was excited to receive from Murata Makoto a set of the RELAX NG schemas for the (post-BRM) revision of OOXML, and thought it would be interesting to validate some real-world content against them, to get a rough idea of how non-conformant the standardisation of 29500 had made MS Office 2007.

Not having Office 2007 installed at work (our clients aren't using it – yet), the first problem is actually getting a reasonable sample for testing. Fortunately, the Ecma 376 specification itself is available for download from Ecma as a .docx file, and this hefty document is a reasonable basis for a smoke test ...

The main document ("document.xml") content for Part 4 of Ecma 376 weighs in at approx. 60MB of XML. Looking at it ... I'm sorry, but I'm not working on that size of document when it's spread across only two lines. Pretty-printing the thing makes it rather more usable, but pushes the file size up to around 100MB.

So we have a document and a RELAX NG schema. All that's necessary now is to use jing (or similar) and we can validate ...

Validating against the STRICT model

The STRICT conformance model is quite a bit different from Ecma 376, essentially because many of that format's most notorious features (non-ISO dates, compatibility settings like autospacewotnot, VML, etc.) have been removed. Thus the expectation is that existing Office 2007 documents might be some distance away from being valid according to the strict schemas.

Sure enough, jing emitted 17MB (around 122,000) of invalidity messages when validating in this scenario. Most of them seem to involve unrecognised attributes or attribute values: I would expect a document which exercised a wider range of features to generate a more diverse set of error messages.

Validating against the TRANSITIONAL model

The TRANSITIONAL conformance model is quite a bit closer to the original Ecma 376. Countries at the BRM (rather more than Ecma, as it happened) were very keen to keep compatibility with Ecma 376 and to preserve XML structures at which legacy Office features could be targeted. The expectation is therefore that an MS Office 2007 document should be pretty close to valid according to the TRANSITIONAL schema.

Sure enough (again) the result is as expected: relatively few messages (84) are emitted, and they are all of the same type, complaining e.g. of the element:

<m:degHide m:val="on"/>

since the allowed attribute values for val are now "true", "false", etc. — this was one of the many tidying-up exercises performed at the BRM.
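Presumably, then, a post-BRM serialization of the same setting would read something like this (my reconstruction, not output from any shipping product):

<m:degHide m:val="true"/>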

Conclusions?

Such a test is only indicative, of course, but a few tentative conclusions can be drawn:

  • Word documents generated by today's version of MS Office 2007 do not conform to ISO/IEC 29500
  • Making them conform to the STRICT schema is going to require some surgery to the (de)serialisation code of the application
  • Making them conform to the TRANSITIONAL schema will require less of the same sort of surgery (since they're quite close to conformant as-is)

Given Microsoft's proven ability to tinker with the Office XML file format between service packs, I am hoping that MS Office will shortly be brought into line with the 29500 specification, and will stay that way. Indeed, a strong motivation for approving 29500 as an ISO/IEC standard was to discourage Microsoft from this kind of file format rug-pulling stunt in future.

What's next?

To repeat the exercise with ISO/IEC 26300:2006 (ODF 1.0) and a popular implementation of OpenDocument. Will anybody be brave enough to predict what kind of result that exercise will have?

SC 34 WG meetings in Paris last week

by Alex Brown 9. December 2009 11:33
The croissants of AFNOR

Last week I was in Paris for a stimulating week of meetings of ISO/IEC JTC 1/SC 34 WGs, and as the year draws to a close it seems an opportune time to take the temperature of our XML standards space and look ahead to where we may be going next.

WG 1 (Schema languages)

WG 1 can be thought of as tending to the foundations upon which other SC 34 Standards are built - and of these foundations perhaps none is more important than RELAX NG, the schema language of many key XML technologies including ODF, DocBook and the forthcoming MathML 3.0 language. WG 1 discussed a number of potential enhancements to RELAX NG, settling on a modest but useful set which will enhance the language in response to user feedback.

A proposed new schema language for cross-reference validation (including ID/IDREF checking) was also discussed; the question here is whether to have something simple and quick (that addresses ID/IDREF validation for RELAX NG, say), or whether to develop a more fully-featured language capable of meeting challenges like cross-document cross-reference checking in an OOXML or ODF package. It seems as if WG 1 is strongly inclining towards the latter.

Other work centred on proposing changes for cleaning up the unreasonable licensing restrictions which apply to "freely-available" ISO/IEC standards made available by the ITTF: the click-through licence here is obviously out of date, and text is needed to attach to schemas so that they can be used on more liberal, FOSS-friendly terms. (I mentioned this before in this blog entry).

WG 4 (OOXML)

WG 4 had a full agenda. One item of business requiring immediate attention was the resolution of comments accompanying the just-voted-on set of DCOR ballots. These had received wide support from the National Bodies, though it was disappointing to see that the two NBs which had voted to disapprove had not sent delegates to the meeting. P-members are obliged both to vote on ballots and to attend meetings in SCs, and so these nations (Brazil and Malaysia are the countries in question) are not properly honouring their obligations as laid down in the JTC 1 Directives:

3.1.1 P-members of JTC 1 and its SCs have an obligation to take an active part in the work of JTC 1 or the SC and to attend meetings.

I note with approval the hard line taken by the ITTF, who have just forcibly demoted 18 JTC 1 P-members who had become inactive.

Nevertheless, all comments received were resolved and the set of corrigenda will now go forward to publication, making a significant start to cleaning up the OOXML standard.

S/T

The other big topic facing WG 4 was the thorny problem of what has come to be called "Strict v Transitional". In other words, deciding on some strategy for dealing with these two variants of the 29500 Standard.

The UK has a clear consensus on the purpose of the two formats. Transitional (aka "T") is (in the UK view) a format for representing the existing legacy of documents in the field (and those which continue to be created by many systems today); no more, and no less. Strict (aka "S") is viewed as the proper place for future innovation around OOXML.

Progress on this topic is (for me) frustratingly slow – ah! the perils of the consensus-forming process – but some pathways are beginning to become visible in the swirling mists. In particular, it seems there is a mood to issue a statement that the core schemas of T are to be frozen, and that any dangerous features (such as the date representation option blogged about by WG 4 experts Gareth Horton and Jesper Lund Stocholm) be removed from T.

This will go some way to clarifying for users what to expect when dealing with a 29500-conformant document. However, I foresee more work ahead to clarify this still further, since within the two variants (Strict and Transitional) there are many sub-variants which users will need to know about. In particular, the extensibility mechanism of OOXML (MCE) allows for additional structures to be introduced into documents. And so, is a "Transitional" (or "Strict") document:

  • Unextended ?
  • Extended, but with only standardized extensions ?
  • Extended, but with proprietary extensions ?
  • Extended in a backwards-compatible way relative to the core Standard ?
  • Extended in a backwards-incompatible way ?

I expect WG 4 will need to work on conformance classes and content labelling mechanisms (a logo programme?) to enable implementers to convey with precision what kind of OOXML documents they can consume and emit, and for procurers to specify with precision what they want to procure.

WG 5 (Document interop)

WG 5 continues its work with TR 29166, Open Document Format (ISO/IEC 26300) / Office Open XML (ISO/IEC 29500) Translation Guidelines, setting out the high-level differences between the ISO versions of the OOXML and ODF formats. I attended to hear about a Korean idea for a new work item focussed on the use of the clipboard as an interchange mechanism.

This is interesting because the clipboard presents some particular challenges for implementers. What happens (for example) when a user selects content for copying which does not correspond to well-formed XML (from the middle of one paragraph to the middle of another)? I am interested in seeing exactly what work the Koreans will propose in this space ...

WG 6 (ODF)

Although I had registered for the WG 6 meeting, I had to take the Eurostar home on Thursday and so attempted to participate in Friday's WG 6 meeting by Skype (as much as rather intermittent wi-fi connectivity would allow).

From what I heard of it, the meeting was constructive and business-like, sorting out various items of administrivia and turning attention to the ongoing work of maintaining ISO/IEC 26300 (the International Standard version of ODF).

To this end, it is heartening to see the wheels finally creak into motion:

  • The first ever set of corrigenda to ISO/IEC 26300 has now gone to ballot
  • A second set is on the way, once a mechanism has been agreed for re-wording those bits of the Standard which are unimplementable
  • A new defect report from the UK was considered (many of these comments have already been addressed within OASIS, and so fixes are known)

Most significant of all is the work to align the ISO version of ODF with the current OASIS standard so that ISO/IEC 26300 and ODF 1.1 are technically equivalent. The National Bodies present reiterated a consensus that this was desirable (better, by far, than withdrawing ISO/IEC 26300 as a defunct standard) and are looking forward to the amendment project. The world will, then, have an ISO/IEC version of ODF which is relevant to the marketplace while waiting for a possible ISO/IEC version of ODF 1.2 – as even with a fair wind this is still around two years away from being published as an International Standard.

Records

I'll update this entry with links to documents as they become available. To start with, here are some informal records :-)

 

Nightcap


ODF Forensics

by Alex Brown 14. June 2009 19:35

The other day I had a phone call from Michiel Leenaars of the NLnet Foundation, who is busy gearing up for this week's ODF plugfest in the Hague. Michiel had seen my blog post on ODF validation using pipelines, and was interested in whether I could come up with something quick and dirty for providing forensic information about pairs of ODF documents, so they could be assessed before and after they are used by a tool. This could help users check if anything has been incorrectly added or taken away during a round-trip. Here's what I came up with …

Reaching for XProc

Yes again, I am going to use an XProc pipeline to do the processing. The basic plan of attack is:

  1. take two documents
  2. generate a “fingerprint” for each of them
  3. compare the fingerprints
  4. display the result in a meaningful, human-readable form

Fingerprinting XML

For a basic comparison between documents I chose simply to compare the elements used, and the number of them. This obviously leaves out quite a bit which might also be compared (attributes, text, etc.) – but it is a useful smoke test of whether major structures have been added or lost during a round-trip.

So the overall pipeline will look like this:

<?xml version="1.0"?>
<pipeline name="get-opc-rels" xmlns="http://www.w3.org/ns/xproc"
  xmlns:xo="http://xmlopen.org/odf-fingerprint"
  xmlns:mf="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
  xmlns:cx="http://xmlcalabash.com/ns/extensions">

  <import href="extensions.xpl"/>

  <!-- the URLs of the ODF documents to be processed -->
  <option name="package-a" required="true"/>
  <option name="package-b" required="true"/>

  <!-- get the first fingerprint ... -->
  <xo:make-fingerprint name="finger-a">
    <with-option name="package-url" select="$package-a"/>
  </xo:make-fingerprint>

  <!-- ... and the second ... -->
  <xo:make-fingerprint name="finger-b">
    <with-option name="package-url" select="$package-b"/>
  </xo:make-fingerprint>

  <!-- combine them into a single document for input into an XSLT -->
  <wrap-sequence wrapper="fingerprint-pair">
    <input port="source">
      <pipe step="finger-a" port="result"/>
      <pipe step="finger-b" port="result"/>
    </input>
  </wrap-sequence>

  <!-- style into an HTML report of differences -->
  <xslt name="transform-to-html">
    <input port="stylesheet">
      <document href="style-diffs.xsl"/>
    </input>
  </xslt>

</pipeline>

A number of things are of note:

  • The ODF packages are interrogated using the JAR URI mechanism I described here
  • We’re using a custom step <xo:make-fingerprint>, which takes as its input the URL of an ODF document (“package-url”), and which emits a “fingerprint” as an XML document. Obviously this step is not built into any XProc processor, so we’ll need to write it ourselves
  • We’re using XProc’s wrap-sequence step to combine the two “fingerprint” documents into a single document
  • We’ll be relying on an XSLT transform to turn this combined document into a nice report, which will be the end result of the pipeline.

Writing the fingerprinting pipeline

To define our custom pipeline <xo:make-fingerprint> we simply author a new pipeline, and give it the type “xo:make-fingerprint”. This can then be invoked as a step. Here’s what this sub-pipeline looks like:

<pipeline type="xo:make-fingerprint">

  <!-- the URL of the ODF file to fingerprint -->
  <option name="package-url" required="true"/>

  <!-- load its manifest -->
  <load>
    <with-option name="href" select="concat('jar:',$package-url,'!/META-INF/manifest.xml')"/>
  </load>

  <!-- visit each entry in the manifest which refs an XML resource -->
  <viewport name="handle"
    match="mf:file-entry[@mf:media-type='text/xml'
           and not(starts-with(@mf:full-path,'META-INF'))]">
    <cx:message message="Loading item ..."/>
    <!-- load the entry -->
    <load name="load-item">
      <with-option name="href" select="concat('jar:',$package-url,'!/',/*/@mf:full-path)"/>
    </load>
    <!-- accumulate everything in a <wrapper> document -->
    <wrap-sequence wrapper="wrapper">
      <input port="source">
        <pipe step="load-item" port="result"/>
      </input>
    </wrap-sequence>
  </viewport>

  <!-- transform the accumulated mass into a fingerprint -->
  <xslt name="transform-to-fingerprint">
    <input port="stylesheet">
      <document href="make-fingerprint.xsl"/>
    </input>
  </xslt>

  <!-- label it with the package URL, as an attribute on the root element -->
  <add-attribute match="/*" attribute-name="package-url">
    <with-option name="attribute-value" select="$package-url"/>
  </add-attribute>

</pipeline>

Things to notice here:

  • We iterate through the ODF manifest looking for XML documents
  • All of the XML in the entire package is retrieved and combined into a single mega-document wrapped in an element named <wrapper>
  • We’re relying on an XSLT transform, “make-fingerprint.xsl”, to do the heavy lifting and turn our mega-document into a meaningful (and smaller) “fingerprint” document
  • We add the URL of the ODF document to the fingerprint using XProc’s nifty add-attribute step

The Heavy Lifting: XSLT

The XSLT to boil a document down into a fingerprint can be seen here. What it produces is a summary of the elements used in each of the namespaces the document mentions. This extract gives a flavour of the kind of result it produces:

<namespace name="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
  <element name="file-entry" count="1"/>
</namespace>
<namespace name="urn:oasis:names:tc:opendocument:xmlns:meta:1.0">
  <element name="generator" count="1"/>
</namespace>
<namespace name="urn:oasis:names:tc:opendocument:xmlns:office:1.0">
  <element name="automatic-styles" count="1"/>
  <element name="body" count="1"/>
  <element name="document-content" count="1"/>
  <element name="document-meta" count="1"/>
  <element name="document-styles" count="1"/>
  <element name="font-face-decls" count="2"/>
  <element name="meta" count="1"/>
  <element name="spreadsheet" count="1"/>
  <element name="styles" count="1"/>
</namespace>
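The linked stylesheet is the authoritative version; as a rough sketch, though, something along these lines (using XSLT 2.0 grouping, with “fingerprint” as a guessed root element name, and ignoring details such as filtering out the wrapper elements) would produce output of that general shape:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <fingerprint>
      <!-- group every element in the mega-document by namespace URI ... -->
      <xsl:for-each-group select="//*" group-by="namespace-uri()">
        <xsl:sort select="current-grouping-key()"/>
        <namespace name="{current-grouping-key()}">
          <!-- ... and within each namespace by local name, with a count -->
          <xsl:for-each-group select="current-group()" group-by="local-name()">
            <xsl:sort select="current-grouping-key()"/>
            <element name="{current-grouping-key()}"
              count="{count(current-group())}"/>
          </xsl:for-each-group>
        </namespace>
      </xsl:for-each-group>
    </fingerprint>
  </xsl:template>

</xsl:stylesheet>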

Returning now to our main pipeline, we can see it makes two calls to (or should that be “sucks on”) the sub-pipeline to generate two fingerprints. These are then wrapped with a wrap-sequence step, and we have all we need to generate the final report. Again, an XSLT transform is used to do a comparison operation and the result is emitted as an HTML document intended for human consumption. An example of what this looks like (comparing the OpenOffice and Google Docs versions of Maya’s wedding planner) can be found here.
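Again, the real style-diffs.xsl is the one linked above; a bare-bones comparison (assuming the fingerprint structure sketched earlier, reading the package-url attribute added by the pipeline, and reporting only element counts that differ) might look something like this:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:variable name="a" select="/fingerprint-pair/*[1]"/>
  <xsl:variable name="b" select="/fingerprint-pair/*[2]"/>

  <xsl:template match="/">
    <html>
      <body>
        <h1>Fingerprint differences</h1>
        <table border="1">
          <tr>
            <th>Namespace</th>
            <th>Element</th>
            <th><xsl:value-of select="$a/@package-url"/></th>
            <th><xsl:value-of select="$b/@package-url"/></th>
          </tr>
          <!-- visit every namespace/element combination seen in either fingerprint -->
          <xsl:for-each-group select="$a//element | $b//element"
            group-by="concat(../@name, ' ', @name)">
            <xsl:variable name="ns" select="../@name"/>
            <xsl:variable name="el" select="@name"/>
            <xsl:variable name="count-a"
              select="sum($a//namespace[@name = $ns]/element[@name = $el]/@count)"/>
            <xsl:variable name="count-b"
              select="sum($b//namespace[@name = $ns]/element[@name = $el]/@count)"/>
            <!-- report only where the counts disagree -->
            <xsl:if test="$count-a ne $count-b">
              <tr>
                <td><xsl:value-of select="$ns"/></td>
                <td><xsl:value-of select="$el"/></td>
                <td><xsl:value-of select="$count-a"/></td>
                <td><xsl:value-of select="$count-b"/></td>
              </tr>
            </xsl:if>
          </xsl:for-each-group>
        </table>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>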

Putting it to use

The results of this process need to be interpreted on a case-by-case basis. Just because two applications represent notionally the same document with different XML is not necessarily a fault (though I’d like to know why Maya’s Wedding Planner has 2,000 spreadsheet cells according to Google Docs, and only 51 according to OO.o).

The most useful application of this pipeline is to check for untoward data loss when a document is processed by an application – and I understand this is a particular concern of the Dutch government. With this in mind it is possible to take this pipeline further still, checking attribute differences and even textual differences. There comes a point when diffing XML, though, at which it is best to use a specialist tool such as the excellent DeltaXML (I have no association with this product, except that I know, through its use among clients, that it is well respected). Many an unsuspecting programmer has come to grief under-estimating the complexities of comparing XML documents.

Online ODF Validation

Michiel also asked whether it would be possible to make the ODF validation pipeline I blogged about previously available as an online service. Coincidentally this was something I was working on anyway, though using Java rather than XML pipelines. The result is now available here. Enjoy …

No one supports ISO ODF today?

by Alex Brown 10. June 2009 11:09

IBM employee and ODF TC co-chair Rob Weir’s latest blog entry seeks to rebut what he terms a “disinformation campaign being waged against ODF”. The writing is curiously disjointed, and while at first I thought Rob’s famously fluent pen had been constipated by his distaste at having to descend further into the ad hominem gutter, on closer inspection I think it is perhaps a tell of Rob’s discomfort about his own past statements.

In particular, Rob takes issue with a statement that he condemns as “Microsoft FUD […] laundered via intermediaries”:

There is no software that currently implements ODF as approved by the ISO

Now Rob Weir is a great blogger, a much-praised committee chair, and somebody who can, on occasion, fearlessly produce the blunt truth like a rabbit from a hat. For this reason, I know his blog entry, “Toy Soldiers” of July 2008 has enjoyed quite some exposure in standards meetings around the world, most particularly for its assertions about ODF. He wrote:

  1. No one supports ODF 1.0 today. All of the major vendors have moved on to ODF 1.1, and will be moving on to ODF 1.2 soon.
  2. No one supports OOXML 1.0 today, not even Microsoft.
  3. No one supports interoperability via translation, not Sun in their Plugin, not Novell in their OOXML support, and not Microsoft in their announced ODF support in Office 2007 SP2.

While the anti-MS line here represents the kind of robust corporate knockabout stuff we might expect, it is Rob’s statement that “no one supports ODF 1.0 today. All of the major vendors have moved on […]” which has particularly resonated for users. A pronouncement on adoption from a committee chair about his own committee’s standard is significant. And naturally, it has deeply concerned some of the National Bodies who have adopted ODF 1.0 (which is ISO/IEC 26300) as a National standard, and who now find they have adopted something which, apparently, “no one supports”.

So, far from being “Microsoft FUD”, the idea that “No one supports ODF 1.0” is in fact Rob Weir’s own statement. And it was taken up and repeated by Andy Updegrove, Groklaw and Boycott Novell, those well-known vehicles of Microsoft’s corporate will.

Today however, this appears to have become an inconvenient truth. The rabbit that was pulled out of the hat in the interest of last summer’s spin, now needs to be put into the boiler. Consequently we find Rob’s blog entry of July 2008 has been silently amended so that it now states:

  1. Few applications today support exclusively ODF 1.0 and only ODF 1.0. Most of the major vendors also support ODF 1.1, one (OpenOffice 3.x), now supports draft ODF 1.2 as well.
  2. No one supports OOXML 1.0 today, not even Microsoft.
  3. No one supports interoperability via translation, not Sun in their Plugin, not Novell in their OOXML support, and not Microsoft in their announced ODF support in Office 2007 SP2.

The pertinent change is to item 1 on this list, which now has a weasel-worded (and tellingly tautological) assertion that might make the unsuspecting reader think that ODF 1.0 was somehow supported by the major vendors. Well, is it? Who is right, the Rob Weir of 2008 or the Rob Weir of 2009? Maybe I’ve missed something, but personally I’m unaware of an upsurge in ODF 1.0 support during the last 11 months. My money is on the former Rob being right here.

Okay, I use OpenOffice.org 1.1.5 (despite its Secunia level 4 advisory) out of a kind of loyalty to ISO/IEC 26300 (ODF 1.0), but I’m often teased about being the only person on the planet who must be doing this, and onlookers wonder what the .sxw (etc.) files I produce really are.

Blog Etiquette

As a general rule, when making substantive retrospective changes to blog entries, especially controversial blog entries, it is honest dealing to draw attention to this by striking-through removed text and prominently labelling the new text as “updated”. Failing to do this can lead to the suspicion that an attempt to re-write history is underway …

Notes on Document Conformance and Portability #3

by Alex Brown 10. May 2009 13:20

Now that the furore about Microsoft’s implementation of ODF spreadsheet formulas in Office SP2 has died down a little, it is perhaps worth taking a little time to have a calm look at some of the issues involved.

Clearly, this is an area where strong commercial interests are in play, not to mention an element of sometimes irrational zeal from those who consider themselves pro or anti (mostly anti) Microsoft.

One question is whether Microsoft did “The Right Thing” by users in choosing to implement formulas the way they did. This is certainly a fair question and one over which we can expect there to be some argument.

The fact is that Microsoft’s implementation decision means that, on the face of it, they have produced an implementation of ODF which does not interoperate with other available implementations. Thus IBM blogger Rob Weir can produce a simple (possibly simplistic) spreadsheet, “Maya’s Wedding Planner”, and use it to illustrate, with helpful red boxes for the slow-witted, that Microsoft’s implementation is a “FAIL” attributable to “malice or incompetence”. For good measure he also takes a side-swipe at Sun for their non-interoperable implementation. In this view, interoperability aligning with IBM’s Symphony implementation is – unsurprisingly – presented as optimal (in fact, you can hear the sales pitch from IBM now: “well, Mr government procurement officer, looks like Sun and MS are not interoperable, you won’t want these other small-fry implementations, and Google’s web-based approach isn’t suitable – so looks like Symphony is the only choice …”)

Microsoft have argued back, of course, most strikingly in Doug Mahugh’s 1 + 2 = 1? blog posting, which appears to present some real problems with basic spreadsheet interoperability among ODF products using undocumented extensions. The MS argument is that practical ODF interoperability is a myth anyway, and so supporting it meaningfully is not possible (in fact, you can hear the sales pitch from MS now: “well, Mr government procurement officer, looks like ODF is dangerously non-interoperable: here, let me show you how IBM and Sun can’t even agree on basic features; but look, we’ve implemented ISO standard formulas, so we alone avoid that – and you can assess whether we’re doing what we claim – looks like MS Office is the only choice …”)

Personally, I think MS have been disappointingly petty in abandoning the “convention” that the other suites more or less use. I accept that these ODF implementations have limited interoperability and are unsafe for any mission-critical data, but for the benefit of the “Maya’s Wedding Planner” type of scenario, where ODF implementations can actually cut it, I think MS should have included this legacy support as an option, even if they did have to qualify that support with warning dialogs about data loss and interoperability issues.

But - vendors are vendors; it is their very purpose to compete in order to maximise their long-term profits. Users don’t always benefit from this. We really shouldn’t be surprised that we have IBM, Sun and Microsoft in disagreement at this point.

What we should be surprised about is how this interoperability fiasco has been allowed to happen within the context of a standard. To borrow Rick Jelliffe’s colourfully reported words, the whole purpose of shoving an international standard up a vendor’s backside is to get them to behave better in the interests of the users. What has gone wrong here is in the nature of the standard itself. ODF offers an extremely weak promise of interoperability, and the omission of a spreadsheet formula specification in ODF 1.1 is merely one of the more glaring facets of this problem. As XML guru James Clark wrote in 2005:

I really hope I'm missing something, because, frankly, I'm speechless. You cannot be serious. You have virtually zero interoperability for spreadsheet documents.

To put this spec out as is would be a bit like putting out the XSLT spec without doing the XPath spec. How useful would that be?

It is essential that in all contexts that allow expressions the spec precisely define the syntax and semantics of the allowed expressions.

These words were prophetic, for now we do indeed face a present zero interoperability reality.

The good news is that work is underway to fix this problem: ODF 1.2 promises, when it eventually appears, to specify formulas using the new OpenFormula specification. When that is published vendors will cease to have an excuse to create non-interoperable implementations, at least in this area.

Is SP2 conformant?

Whether Microsoft’s approach to ODF was the wisest is something over which people may disagree in good faith. Whether their approach conforms to ODF should be a neutral fact we can determine with certainty.

In a follow-up posting to his initial blast, Rob Weir sets out to show that Microsoft’s approach is non-conformant, subsequent to his previous statement that “SP2's implementation of ODF spreadsheets does not, in fact, conform to the requirements of the ODF standard”. After quoting a few selected extracts from the standard, a list is presented showing how various implementations represent a formula:

  • Symphony 1.3: =[.E12]+[.C13]-[.D13]
  • Microsoft/CleverAge 3.0: =[.E12]+[.C13]-[.D13]
  • KSpread 1.6.3: =[.E12]+[.C13]-[.D13]
  • Google Spreadsheets: =[.E12]+[.C13]-[.D13]
  • OpenOffice 3.01: =[.E12]+[.C13]-[.D13]
  • Sun Plugin 3.0: [.E12]+[.C13]-[.D13]
  • Excel 2007 SP2: =E12+C13-D13

Rob writes, “I'll leave it as an exercise to the reader to determine which one of these seven is wrong and does not conform to the ODF 1.1 standard.”

Again, this is clearly aimed at the slow-witted. One can imagine even the most hesitant pupil raising their hand: “please Mr Weir, is it Excel 2007 SP2?” Rob, however, is too smart to answer the question himself, for anybody who knows anything of ODF will know that, in fact, this is a tricky question.

Accordingly, Dennis Hamilton (ODF TC member and secretary of the ODF Interoperability and Conformance TC) soon chipped in among the blog comments to point out that ODF’s description of formulas is governed by the word “Typically”, rendering it arguably just a guideline. And, as I pointed out in my last post, it is certainly possible to read ODF as a whole as nothing more than a guideline.

(I am glad to be able to report that the word “typically” has been stripped from the draft of ODF 1.2, indicating its existence was problematic.)

Curious readers might like to look for themselves at the (normative) schema for further guidance. Here, we find the formal schema definition for formulas, with a telling comment:

<define name="formula">
  <!-- A formula should start with a namespace prefix, -->
  <!-- but has no restrictions-->
  <data type="string"/>
</define>

Which is yet another confirmation that there are no certain rules about formulas in ODF.

So I believe Rob’s statement that “SP2's implementation of ODF spreadsheets does not, in fact, conform to the requirements of the ODF standard” is mistaken on this point. This might be his personal interpretation of the standard, but it is based on an ingenious reading (argued around the meaning of comma placement, and privileging certain statements over others), and should certainly give no grounds for complacency about the sufficiency of the ODF specification.

As an ODF supporter I am keen to see defects, such as conformance loopholes, fixed in the next published ODF standard. I urge all other true supporters to read the drafts and give feedback to make ODF better for the benefit of everyone, next time around.

Notes on Document Conformance and Portability #2

by Alex Brown 7. May 2009 10:42

There has been quite a lot of heat (and rather less light) in the blogosphere lately on the topic of document conformance, and – in particular – on the topic of whether Microsoft’s newly released Service Pack for their Office suite is conformant to the Open Document Format (ODF).

Conformance is, in reality, a complex topic. When we are considering the conformance of an office application there are really two distinct types of conformance at issue:

  • Document conformance
  • Application conformance

Simply put, document conformance is an assessment of whether the documents emitted by an application are as the specification says they must be; application conformance is an assessment of whether an application does what the specification says it must.

In the overheated blogs, one can find many pronouncements about what might (or might not) be conformant, but disappointingly few informed readings of what the ODF specification actually states about conformance. Curious readers might want to investigate further and decide for themselves …

There Are No Rules

The specification to consult is Open Document Format for Office Applications (OpenDocument) v1.1, specifically section 1.5 on “Document Processing and Conformance”. From this we learn various things about validity and foreign elements, and the section closes with these words:

There are no rules regarding the elements and attributes that actually have to be supported by conforming applications, except that applications should not use foreign elements and attributes for features defined in the OpenDocument schema.

It is not necessary to re-word or interpret what is written here – the meaning is plainly stated in exactly the place we’d expect to find it. Consequently we can confidently describe ODF as a very liberal specification indeed, and know it is very hard for an application to be non-conformant, so long as its document processing behaviour passes the technical XML validation and namespace tests stated earlier in this conformance section.

So while issues of document portability and application interoperability may well be worth discussing, I cannot see from this reading of ODF 1.1 how a discussion of application conformance is relevant to the new service pack from Microsoft.

Personally, I believe this conformance statement in ODF 1.1 is badly misconceived, and I am glad that the conformance requirements of the ODF 1.2 draft have been beefed-up – even though applications are still free to omit support for features of the specification and still claim full conformance. The “application description” feature in OOXML gets it right, I think, by introducing the concept of defined feature sets which conformant applications can claim to support (or not).

Going Beyond the Spec

Document conformance and portability is a problem for ODF, as it is for any similar specification, and in recognition of this OASIS has wisely created the OASIS Open Document Format Interoperability and Conformance (OIC) TC, which is busy working on wider issues of conformance than those envisaged within ODF itself. This work is very important and I am very happy to see it underway.

OOXML needs to have a parallel activity, as its specification too will not, on its own, be sufficient to guarantee the degree of document portability or application interoperability that users rightly demand. This is a theme I hope to return to in coming blog entries over the next couple of weeks …

Notes on Document Conformance and Portability #1

by Alex Brown 4. May 2009 08:20

Richard Gillam’s handy book, Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard, contains an example of right-to-left text appearing in a prevailing left-to-right writing direction:

Avram said “מזל טוב.‏” and smiled.

Whether you see what you are meant to see here will depend on your browser's Unicode support, and whether you have Hebrew fonts installed. Properly rendered, it will look something like this:

In reading order, the first character after “said” is the “מ” character to the left of the closing quotation mark. The text then runs from right to left until the full-stop, and then resumes with “and smiled”. In Unicode, this text is not represented in rendering order, but reading order – it is up to the renderer to make space and reverse direction at the correct points. Here is the text represented as XML in a paragraph in an ODF document (get the document here):

<text:p>Avram said “&#x5de;&#x5d6;&#x5dc; &#x5d8;&#x5d5;&#x5d1;.&#x200f;” and smiled.</text:p>

One of the great things about XML is its solid basis in Unicode and therefore its use of the Universal Character Set (ISO/IEC 10646). XML defines a number of encodings for this character set, and in the XML above the numeric character reference mechanism is used for the Hebrew characters. Notice, just to the left of the full stop the use of U+200F 'RIGHT-TO-LEFT MARK' which specifies that the full stop is part of the right-to-left character sequence.

Viewing this document in three ODF applications (OpenOffice 3, Google Docs with Firefox, and the new MS Office 2007 SP2) gives the correct result every time. That is good news.

And if, for an ODF application, the character sequence did not appear correctly (if, say, the full stop was out-of-place) we would be able to say unequivocally that it was faulty; and we would be able to point to the Unicode specification where the correct behaviour was described. We (the user) would be able to bang the table and demand the bug was fixed.

This kind of process is one of the pillars of conformance testing: application conformance testing, to be exact. Where we have a solid spec and observable behaviour we can compare the two and make a judgement.

Where we don't have a solid spec, things get trickier. From the standardiser's viewpoint, and if it's not too highfalutin (and anyway, I claim Cambridge resident's special rights), we might want to quote Wittgenstein on such occasions: "Whereof one cannot speak, thereof one must be silent".
