The unsplittability of XML

From Trephine

Jump to: navigation, search
« NDJ vs other popular formats Site improvements - reducing dependencies »

[subscribe] Recent blog entries

Live Demos

The unsplittability of XML

Yesterday's article compared newline delimited JSON to other popular formats. One commenter felt that I was being disingenuous towards XML by stating that it is not easy to split for parallel processing. I stand by my assertion, and here's why:

I appreciate that it's possible to use XML in such a way as to make it capable of being split easily, however the format as a whole does not lend itself in this way. That is, whereas every properly formatted NDJ file shares the characteristic that it is easy to split (newlines are the only delimiter), not every XML file has this feature. In fact, it's fairly trivial to come up with examples that would be difficult or impossible to split correctly and efficiently.

For the sake of demonstration, lets construct a format for storing a list of text blobs. Something like this:

<blobs>
<text>Here is some text</text>
<text>Here is some more text</text>
</blobs>

The strategy for splitting this file (as I envision it - please correct me if I'm mistaken) would be to jump forward in the file, then walk backwards until a <text> tag is encountered, with some special rules about what to do if the splitter happens to land right on an open or close tag.

One way we can break the splitter is by introducing XML comments:

<blobs>
<text>Here is some text</text>
<!--
<text>Ignore this entry!</text>
-->
<text>Here is some more text</text>
</blobs>

Suppose the splitter hit inside the "Ignore" entry. It walks backwards, finds the opening text tag and splits accordingly. Whatever process receives the second chunk unknowingly processes the text which should have been ignored.

Another way our split strategy could break is by introducing unparsed character data:

<blobs>
<text>Here is some text</text>
<text><![CDATA[<text>Example of a text element.</text>]]></text>
<text>Here is some more text</text>
</blobs>

In this case, if the splitter happens upon the word "Example", it will walk back and find the text tag which is really part of the character data. This causes two problems for the process that receives the second chunk. First, the CDATA text entry will be improperly parsed, losing the leading and trailing literal text tags. Second, the XML parser will now presumably choke when it receives a second closing text tag before a beginning one.

And this is just for a trivally simple XML file format with two types of tags (<blobs> and <text>). The problem is compounded further by more complex XML schemas. It should be fairly obvious that any schema which allows nesting of the elements to be split would be subject to similar problems, even without comments and CDATA.

So in summary, it's possible to have a perfectly valid XML document which is virtually impossible to split without scanning forward from the beginning of the file in order to maintain the context. And if you have to scan from the beginning, then it's not really "easy" to split, after all.

--Jim R. Wilson (jimbojw) 17:31, 17 April 2009 (UTC)