NDJ vs other popular formats
A commenter on my previous article about newline delimited JSON (NDJ) asked what benefit NDJ offers over plain JSON. This article compares NDJ against several other popular data formats: XML, CSV and JSON.
| Feature | XML | CSV | JSON | NDJ |
|---|---|---|---|---|
| Data objects of arbitrary complexity | Yes | -- | Yes | Yes |
| Human readable | -- | Yes | Yes | Yes |
| Easy to stream | Yes [1] | Yes [2] | -- | Yes |
| Easy to split for parallel processing | -- | Yes [2] | -- | Yes |
| Supports inline comments | Yes | -- | -- | Yes |
| Whitespace agnostic | Yes | -- | Yes | -- |
| Implementations widely available | Yes | Yes | Yes | -- [3] |
Notes:
1. Requires a SAX-capable library and sufficient skill to leverage it.
2. Values in a CSV file may contain newlines, so care must be taken to distinguish between a newline inside a value and a record delimiter.
3. That is, not yet ;)
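The CSV caveat in the notes is easy to demonstrate. Here is a minimal Python sketch using only the standard `csv` and `json` modules (the sample record is made up for illustration). A CSV value containing a newline spills across two physical lines, while JSON serialization escapes the newline, so the record stays on one line:

```python
import csv
import io
import json

# Hypothetical record whose "note" field contains a newline.
record = {"id": 1, "note": "first line\nsecond line"}

# CSV: the writer quotes the value, but the newline stays literal.
csv_buffer = io.StringIO()
csv.writer(csv_buffer, lineterminator="\n").writerow([record["id"], record["note"]])
csv_lines = csv_buffer.getvalue().splitlines()

# NDJ: json.dumps escapes the newline, so the record is a single line.
ndj_lines = json.dumps(record).splitlines()

print(len(csv_lines))  # 2 physical lines for one CSV row
print(len(ndj_lines))  # 1 physical line for one NDJ record
```

A splitter that cuts the CSV version on the newline would hand half a row to each worker; the NDJ version has no such hazard.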
Compared to the other formats described here, the real benefit of NDJ is its partitionability. Because each record occupies exactly one line, every newline is a safe place to cut, so it's fairly trivial to break up an enormous NDJ file into manageable chunks that can each be processed by a separate entity (thread, process, computer), at different times, or both.
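To make that concrete, here is a rough Python sketch of the idea, not a production splitter. The helper names `chunks` and `process_chunk`, the `value` field, and the chunk size are all assumptions made for the example, and threads stand in for whatever workers you actually have:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunks(lines, size):
    """Yield successive lists of at most `size` lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk):
    """Parse each line independently and sum a hypothetical 'value' field."""
    return sum(json.loads(line)["value"] for line in chunk)


# Stand-in for the lines of a large NDJ file.
ndj_lines = [json.dumps({"value": n}) for n in range(10)]

# Each chunk is handed to a separate worker; no chunk needs to know about the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(process_chunk, chunks(ndj_lines, 4)))

total = sum(partial_sums)
print(partial_sums, total)
```

The important part is that `chunks` never has to look inside a line to decide where to cut. Swapping the thread pool for processes or remote machines changes the plumbing, not the splitting.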
In fact, this is exactly how Hadoop Streaming works. Given a set of one or more data files, Hadoop splits the files into lines and sends chunks of lines off to individual nodes for processing (this is the "map" part of MapReduce). Of all the data formats compared in this article, only NDJ can be fed to such a line-oriented job as-is, since a record never spans more than one line. CSV comes close, but as mentioned in the notes, CSV files may contain non-delimiting newlines.
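For a feel of what the map side might look like, here is a hedged Python sketch of a streaming-style mapper: it reads lines, parses each as JSON, and writes tab-separated key/value pairs. The `type` field, the function names, and the sample input are invented for illustration; a real mapper script would pass `sys.stdin` and `sys.stdout` to `run`.

```python
import io
import json
import sys


def map_lines(lines):
    """Yield 'key<TAB>1' for every NDJ record, keyed on a hypothetical 'type' field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield f"{record['type']}\t1"


def run(stream_in, stream_out):
    """Read NDJ lines from stream_in and write one key/value pair per line."""
    for pair in map_lines(stream_in):
        stream_out.write(pair + "\n")


# Local demonstration with in-memory input instead of real job input.
sample = io.StringIO('{"type": "click"}\n\n{"type": "view"}\n')
run(sample, sys.stdout)
```

Because the mapper only ever sees whole lines, it does not matter which node receives which slice of the file.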
Of course, NDJ is not suitable for every problem (nothing ever is), but by using it as an intermediate file format for a given programming problem, you're already on the road to parallelization without even having to think about it.
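Using NDJ as an intermediate format takes very little code. The following Python sketch assumes a throwaway temp file and two hypothetical helpers, `write_ndj` and `read_ndj`; it writes one JSON document per line and reads the records back lazily, one at a time:

```python
import json
import os
import tempfile


def write_ndj(path, records):
    """Write one JSON document per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_ndj(path):
    """Lazily yield one parsed record per non-blank line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Made-up records of differing shapes, to show arbitrary nesting survives.
records = [{"step": 1, "tags": ["a", "b"]}, {"step": 2, "nested": {"ok": True}}]
path = os.path.join(tempfile.mkdtemp(), "intermediate.ndj")

write_ndj(path, records)
round_tripped = list(read_ndj(path))
print(round_tripped == records)
```

Since `read_ndj` is a generator, a consumer can start working on the first record before the rest of the file has been read.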
I hope this helped explain the pros and cons of newline delimited JSON as compared to other popular data formats. I look forward to your comments!
--Jim R. Wilson (jimbojw) 14:29, 16 April 2009 (UTC)