NDJ vs other popular formats

From Trephine

A commenter on my previous article about newline delimited JSON (NDJ) asked what the benefit of NDJ is over plain JSON. This article compares NDJ against three other popular data formats: CSV, XML and JSON.

Feature                                  XML       CSV       JSON   NDJ
Data objects of arbitrary complexity     Yes       --        Yes    Yes
Human readable                           --        Yes       Yes    Yes
Easy to stream                           Yes [1]   Yes [2]   --     Yes
Easy to split for parallel processing    --        Yes [2]   --     Yes
Supports inline comments                 Yes       --        --     Yes
Whitespace agnostic                      Yes       --        Yes    --
Implementations widely available         Yes       Yes       Yes    -- [3]
Notes
  1. Requires a SAX capable library and sufficient skill to leverage it.
  2. Values in a CSV file may contain newlines, so care must be taken to distinguish between a newline inside a value and a newline that delimits a record.
  3. That is, not yet ;)
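
To make note 2 concrete, here is a minimal sketch using Python's standard csv and json modules. The sample row is made up for illustration. It writes the same two-field record both ways and counts the physical lines each format produces.

```python
import csv
import io
import json

# A made-up record whose second value contains a newline.
row = ["id-1", "first line\nsecond line"]

# CSV: the newline survives inside the quoted value, so a single
# record spans two physical lines.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
csv_text = buf.getvalue()
csv_physical_lines = csv_text.splitlines()

# NDJ: JSON escapes the newline as the two characters \ and n,
# so the record always stays on one physical line.
ndj_text = json.dumps(row) + "\n"
ndj_physical_lines = ndj_text.splitlines()

print(len(csv_physical_lines), len(ndj_physical_lines))
```

A naive line splitter would cut the CSV record in half, while a proper CSV parser still reads it back as one record. The NDJ record can be split on newlines without any parsing at all.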

Compared to the other formats described here, the real benefit of NDJ arises from its partitionability. That is, it's fairly trivial to break up an enormous NDJ file into manageable chunks so that each can be processed by a separate entity (thread, process, computer) and/or at different times.
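
As a rough sketch of what that partitioning might look like in Python: the helper names (split_into_chunks, sum_chunk) and the "n" field are hypothetical, chosen only for this example, and a thread pool stands in for whatever separate entities would really do the work.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(lines, chunk_size):
    """Group lines into lists of at most chunk_size lines each."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

def sum_chunk(chunk):
    """Parse every non-blank line as its own JSON document and sum the
    hypothetical "n" field. No state is shared between chunks."""
    return sum(json.loads(line)["n"] for line in chunk if line.strip())

# Five made-up NDJ records, one per line.
ndj_text = '{"n": 1}\n{"n": 2}\n{"n": 3}\n{"n": 4}\n{"n": 5}\n'
lines = ndj_text.splitlines()

# Splitting needs no JSON parsing: any line boundary is a valid cut.
chunks = split_into_chunks(lines, 2)

# Each chunk could go to a thread, a process, or another machine.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(sum_chunk, chunks))

total = sum(partials)
print(total)
```

The point of the sketch is that the splitting step never has to understand the data. It only has to find newlines.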

In fact, this is exactly how Hadoop Streaming works. Given a set of one or more data files, Hadoop splits the files into lines and sends chunks of lines off to individual nodes for processing (this is the "map" part of MapReduce). Of all the data formats compared in this article, only NDJ is suitable as-is for input to a Hadoop Streaming job. CSV comes close, but as mentioned in the notes, values in a CSV file may contain non-delimiting newlines.
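
For illustration, a streaming-style mapper over NDJ input might look something like the sketch below. It follows the usual Hadoop Streaming convention of reading lines and writing tab-separated key/value lines. The run_mapper name and the "type" field are assumptions made up for this example, and in-memory streams stand in for stdin and stdout.

```python
import io
import json

def run_mapper(in_stream, out_stream):
    """Read one JSON record per line and emit "key<TAB>1" for each.

    The "type" field is a hypothetical key for this example.
    """
    for line in in_stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        key = record.get("type", "unknown")
        out_stream.write(f"{key}\t1\n")

# Made-up sample input, including one blank line.
sample = '{"type": "click"}\n{"type": "view"}\n\n{"type": "click"}\n'
out = io.StringIO()
run_mapper(io.StringIO(sample), out)
mapper_output = out.getvalue()
print(mapper_output, end="")
```

In an actual streaming script the same function would presumably be called as run_mapper(sys.stdin, sys.stdout). Because each line is a complete record, the mapper never needs to know which chunk of the file it was handed.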

Of course, NDJ is not suitable for every problem - nothing ever is - but by using it as an intermediate file format for a given programming problem, you're already on the road to parallelization, without even having to think about it.

I hope this helped explain the pros and cons of newline delimited JSON as compared to other popular data formats. I look forward to your comments!

--Jim R. Wilson (jimbojw) 14:29, 16 April 2009 (UTC)