CSVs: The good, the bad, and the ugly

CSVs are a relatively popular data format, it seems particularly common as a format for providing exports of medium-sized datasets. My day job involves processing lots of these types of data sets, and so I’ve developed a set of strong opinions on CSVs, which are documented here.

One feature of CSVs that is often considered a big advantage is that it’s an incredibly simple file format. However, this can be misleading.

The Good

The Bad

The Ugly

A great many people seem to produce and consume CSVs without using a real CSV library. This is because the format of CSVs is so seductively simple, many people don’t realize it has escaping and quoting rules. Many people seem to generate their CSVs with file.write(",".join(fields)) which generates an unparsable mess when one of the fields contains a quote or a comma.

This can generate a catastrophic mess, because there’s no mechanism for re-synchronization. Consider the following:

name,number
Alex,7
Jeremy,4
Jane,"3
Ali,12"

Many CSV parsers will produce something like the following, which is the correct parse according to RFC 4180:

[
	("Alex", "7"),
	("Jeremy", "4"),
	("Jane", "3\nAli,12"),
]

However, some may also produce the following, which is what you’d get if you treat quotes as a literal or naively split on commas:

[
	("Alex", "7"),
	("Jeremy", "4"),
	("Jane", "\"3"),
	("Ali", "12\""),
]

Or even the following, if you implicitly close all quotes at the end of a line:

[
	("Alex", "7"),
	("Jeremy", "4"),
	("Jane", "3"),
	("Ali", "12"),
]

This makes parsing CSV file a somewhat fraught experience. This is exacerbated by the fact that most CSV libraries contain various heuristics about how to handle these sorts of cases, making the results inconsistent across languages.

Conclusions

My experience consuming CSVs generated from dozens of different systems is that the quoting, escaping, and other parse errors I highlighted as “The Ugly” present a large problem. Unfortunately, there’s no competing format I’m aware of that supports streaming processing, imposes low storage overhead, and is widely supported by tooling. Such a format would be tremendously valuable… if it were widely adopted. There are several alternatives which succeed on some axis:

Ultimately there’s no format that corrects all of CSVs flaws while retaining its benefits.