Where is there an end of it?

OOXML and Microsoft Office 2007 Conformance: a Smoke Test


This is one in a series of popular blog articles I am re-publishing from the old Griffin Brown blog which is now closed down. This article is from April 2008. It is the same content as the original (except for some hyperlink freshening).

At the time of posting this entry caused quite a furore, even though its results were – to me anyway – as expected. Looking back I think what I wrote was largely correct, except I probably underestimated the difficulty of converting Microsoft Office to use the Strict variant of OOXML — this would require more than surgery just to the de-serialisation code!


 

I was excited to receive from Murata Makoto a set of the RELAX NG schemas for the (post-BRM) revision of OOXML, and thought it would be interesting to validate some real-world content against them, to get a rough idea of how non-conformant the standardisation of 29500 had made MS Office 2007.

Not having Office 2007 installed at work (our clients aren't using it – yet), the first problem is actually getting a reasonable sample for testing. Fortunately, the Ecma 376 specification itself is available for download from Ecma as a .docx file, and this hefty document is a reasonable basis for a smoke test ...

The main document ("document.xml") content for Part 4 of Ecma 376 weighs in at approx. 60MB of XML. Looking at it ... I'm sorry, but I'm not working on that size of document when it's spread across only two lines. Pretty-printing the thing makes it rather more usable, but pushes the file size up to around 100MB.
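As an aside, the pretty-printing can be done with nothing more than the Python standard library. This is a sketch only: minidom builds the whole DOM in memory, so for a 60MB input a streaming approach would be preferable in practice.

```python
import xml.dom.minidom as minidom

def pretty_print(xml_bytes):
    """Re-indent an XML document for human inspection.

    minidom holds the entire document in memory, so this is only
    practical for files that fit comfortably in RAM.
    """
    dom = minidom.parseString(xml_bytes)
    return dom.toprettyxml(indent="  ")
```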

So we have a document and a RELAX NG schema. All that's necessary now is to use jing (or a similar validator) and we can validate ...
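For the record, the jing invocation itself is a one-liner (the schema and file names below are placeholders, not the actual names of Murata-san's schemas):

```shell
# Validate an instance against a RELAX NG schema with jing;
# add the -c switch if the schema is in the compact (.rnc) syntax.
java -jar jing.jar wml.rng document.xml
```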

Validating against the STRICT model

The STRICT conformance model is quite a bit different from Ecma 376, essentially because many of the format's most notorious features (non-ISO dates, compatibility settings like autospacewotnot, VML, etc.) have been removed. Thus the expectation is that existing Office 2007 documents might be some distance away from being valid according to the strict schemas.

Sure enough, jing emitted 17MB of invalidity messages (around 122,000 of them) when validating in this scenario. Most of them seem to involve unrecognised attributes or attribute values; I would expect a document which exercised a wider range of features to generate a more diverse set of error messages.

Validating against the TRANSITIONAL model

The TRANSITIONAL conformance model is quite a bit closer to the original Ecma 376. Countries at the BRM (rather more than Ecma, as it happened) were very keen to keep compatibility with Ecma 376 and to preserve the XML structures at which legacy Office features could be targeted. The expectation is therefore that an MS Office 2007 document should be pretty close to valid according to the TRANSITIONAL schema.

Sure enough (again) the result is as expected: relatively few messages (84) are emitted, and they are all of the same type, complaining for example about the element:

<m:degHide m:val="on"/>

since the allowed attribute values for val are now "true", "false", etc. — this was one of the many tidying-up exercises performed at the BRM.
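A fix-up for this particular class of error is mechanical. Here is a toy sketch with Python's ElementTree; the restriction to the m:val attribute and the on/off mapping are illustrative assumptions, since real Transitional documents carry such legacy values in many more places.

```python
import xml.etree.ElementTree as ET

# OOXML math markup namespace (the m: prefix in document.xml).
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"

def normalise_onoff(root, attr="{%s}val" % M):
    """Rewrite legacy "on"/"off" attribute values as "true"/"false".

    A toy sketch of the kind of fix-up a converter would need; a real
    one would handle every on/off-typed attribute, not just m:val.
    """
    mapping = {"on": "true", "off": "false"}
    for el in root.iter():
        value = el.get(attr)
        if value in mapping:
            el.set(attr, mapping[value])
    return root
```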

Conclusions?

Such a test is only indicative, of course, but a few tentative conclusions can be drawn:

  • Word documents generated by today's version of MS Office 2007 do not conform to ISO/IEC 29500
  • Making them conform to the STRICT schema is going to require some surgery to the (de)serialisation code of the application
  • Making them conform to the TRANSITIONAL schema will require less of the same sort of surgery (since they're quite close to conformant as-is)

Given Microsoft's proven ability to tinker with the Office XML file format between service packs, I am hoping that MS Office will shortly be brought into line with the 29500 specification, and will stay that way. Indeed, a strong motivation for approving 29500 as an ISO/IEC standard was to discourage Microsoft from this kind of file format rug-pulling stunt in future.

What's next?

To repeat the exercise with ISO/IEC 26300:2006 (ODF 1.0) and a popular implementation of OpenDocument. Will anybody be brave enough to predict what kind of result that exercise will have?

Comments (5)

  • Rob Weir

    3/29/2010 3:52:49 AM |

    What is interesting, looking at this two years later, is how your analysis found only a single validity error in Transitional, in the OnOff type. But we know now, after multiple corrigenda and amendments, that Transitional was rife with errors: it invoked the wrong Unicode version and the wrong namespace version, it was missing elements needed to change-track mathematical equations, many attribute types were wrong, etc. Hundreds of pages of corrections and additions were recently made because the published OOXML standard did not match what MS Office 2007 actually writes out. And your analysis found only a single one of these problems.

    I suspect the problem is the corpus of documents used in your test.  Although you used a long document -- the standard itself -- this is pretty much the work of a single editor, and exercises a constrained subset of the capabilities of the format.

  • Alex

    3/29/2010 4:45:38 AM |

    @Rob

    Yes, there is definitely a problem of using only one document, and a problem of using only schema validation. The state of the art has advanced somewhat since then, not least with office-o-tron - http://code.google.com/p/officeotron/

  • Rob Weir

    3/29/2010 5:13:31 AM |

    Btw, we have a similar issue with testing code.  A test suite is more powerful to the extent it exercises more of the code.  In many cases this can be quantified in "code coverage" metrics and measured and reported in terms of "function coverage", "line coverage" or "path coverage".

    Have you seen anything that does this at the XML level? For example, take a collection of XML documents and the schema that they are instances of, and report the "coverage", which might include listing which elements, attributes and attribute values have and have not been exercised. In code coverage, merely visiting each function is a good start, but doesn't ensure that you've taken all paths (say, all cases in a switch statement). Full line coverage is better, but doesn't guarantee that you've covered all possible states. There is probably a similar set of metrics for markup: all elements, all attributes, all attribute values, all nestings to N levels (for small N), etc.

  • Alex

    3/29/2010 5:09:04 PM |

    @Rob

    Yes, it's common to produce these kinds of content metrics in XML content projects (for example before transforming an existing corpus or upgrading a schema that governs an existing corpus). The kind of reports I see in my day job often record element usage, with counts of the actual child and parent elements used, together with attribute usage.
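    The core of such a report is a sketch's worth of code. Here is a minimal Python version of the kind of tally described above (the function name and the choice of metrics are mine):

```python
from collections import Counter
import xml.etree.ElementTree as ET

def usage_report(xml_text):
    """Tally element names, per-element attribute usage, and
    parent/child nestings -- the content metrics described above.
    An illustrative sketch, not a production profiler."""
    elements = Counter()
    attributes = Counter()
    nestings = Counter()
    root = ET.fromstring(xml_text)

    def walk(el):
        elements[el.tag] += 1
        for name in el.attrib:
            attributes[(el.tag, name)] += 1
        for child in el:
            nestings[(el.tag, child.tag)] += 1
            walk(child)

    walk(root)
    return elements, attributes, nestings
```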

    I've not seen these kinds of reports formally compared to what is permitted by a model. One would need to take a view on which set of possibilities one wanted to compare against -- any non-trivial XML document model will permit an astronomical number of valid documents, and the trick is knowing which the significant ones are.

    I had thought about doing this for the upcoming ODF review -- again the key is finding possible structures which are significant. This is in part prompted by finding the "hyperlinks allow nested hyperlinks" problem. I'm sure ODF (and OOXML) permit all kinds of weird grammatical constructs which they either shouldn't, or which should have governing prose in the spec ...

  • Rick Jelliffe

    3/30/2010 11:47:02 AM |

    It is possible to get validation verification and coverage reports from Schematron validation. ISO SVRL has an element fired-rule which gives the context XPaths that have matched a node in the document.

    These can be used to *verify* that all rules in the schema have been exercised by some test suite (e.g. by a process of simple string matching of the sch:rule/@id against the sch:fired-rule/@id).

    Testing that the schema *covers* all nodes in the document suite is a bit more tricky as an operation on the SVRL. However, it is very convenient to do this inside the Schematron schema: a wildcard as the last sch:rule/@context will catch unmatched productions, and these can be reported with an sch:report element, perhaps labelled with something like @role="not-covered" so that the results can be readily filtered in or out of the other results in the SVRL file.  
