CSVs: The good, the bad, and the ugly
CSVs are a relatively popular data format; they seem particularly common as a format for providing exports of medium-sized datasets. My day job involves processing a lot of these sorts of datasets, and so I’ve developed a set of strong opinions on CSVs, which are documented here.
One feature of CSVs that is often considered a big advantage is that the format is incredibly simple. However, this simplicity can be misleading.
The Good
- They support streaming processing: you don’t need to hold an entire file in memory to process it (see the sketch after this list).
- They’re relatively space-efficient: the format imposes low overhead on top of the data itself (specifically, column names are not repeated for each record).
- Parsers are widely available in every popular programming language, and many small-data processing tools (e.g. Excel) support them natively.
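To make the streaming point concrete, here’s a minimal Python sketch (the file name and column name are hypothetical) that aggregates an arbitrarily large CSV while holding only one record in memory at a time:

import csv

total = 0
# newline="" lets the csv module correctly handle newlines embedded in
# quoted fields.
with open("measurements.csv", newline="") as f:
    # The reader yields one record at a time; the file is never loaded
    # into memory all at once.
    for row in csv.DictReader(f):
        total += int(row["value"])
print(total)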
The Bad
- CSVs have cousin formats: tab-separated values and pipe-separated values. These are often used semi-interchangeably with CSVs, but they are slightly less well supported, and if you parse a file expecting the wrong delimiter you’ll get silent bugs (see the sketch after this list).
- CSVs have no types: everything is a string. This means no native sequence types, and also no notion of nullability that’s distinct from the empty string (also illustrated in the sketch below).
- Most popular CSV libraries are poorly optimized, making CSV parsing much slower than it needs to be.
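Here’s a short Python sketch, with invented data, illustrating both the silent delimiter bug and the nullability problem:

import csv
import io

# Parsing tab-separated data with the default comma delimiter raises
# no error; each line silently becomes a single-column row.
tsv_data = "name\tnumber\nAlex\t7\n"
print(list(csv.reader(io.StringIO(tsv_data))))
# [['name\tnumber'], ['Alex\t7']]
print(list(csv.reader(io.StringIO(tsv_data), delimiter="\t")))
# [['name', 'number'], ['Alex', '7']]

# None and the empty string are indistinguishable after a round trip.
buf = io.StringIO()
csv.writer(buf).writerow(["Alex", None, ""])
print(list(csv.reader(io.StringIO(buf.getvalue()))))
# [['Alex', '', '']]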
The Ugly
A great many people seem to produce and consume CSVs without using a real CSV library. Because the format is so seductively simple, many people don’t realize it has escaping and quoting rules, and instead generate their CSVs with file.write(",".join(fields)), which produces an unparsable mess as soon as one of the fields contains a quote or a comma.
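As a concrete illustration (the field values are invented), here’s what the naive approach emits next to Python’s csv module, which applies the quoting rules:

import csv
import sys

fields = ["Jane", 'said "hi", then left']

# Naive approach: no quoting or escaping at all, so the comma inside
# the second field silently becomes a field separator.
sys.stdout.write(",".join(fields) + "\n")
# Jane,said "hi", then left

# A real CSV library quotes the field and doubles the embedded quotes.
csv.writer(sys.stdout).writerow(fields)
# Jane,"said ""hi"", then left"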
This can be catastrophic, because there’s no mechanism for re-synchronization: once a parser’s notion of quoting diverges from the writer’s, every subsequent record can be misread. Consider the following:
name,number
Alex,7
Jeremy,4
Jane,"3
Ali,12"
Many CSV parsers will produce something like the following, which is the correct parse according to RFC 4180:
[
("Alex", "7"),
("Jeremy", "4"),
("Jane", "3\nAli,12"),
]
However, some parsers will produce the following, which is what you’d get if you treat quotes as literal characters and naively split on commas:
[
("Alex", "7"),
("Jeremy", "4"),
("Jane", "\"3"),
("Ali", "12\""),
]
Or even the following, if you implicitly close all quotes at the end of a line:
[
("Alex", "7"),
("Jeremy", "4"),
("Jane", "3"),
("Ali", "12"),
]
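To see the first two behaviors side by side, here’s a sketch that runs the sample above through Python’s csv module (which implements the RFC 4180 behavior) and through a naive comma split:

import csv
import io

data = 'name,number\nAlex,7\nJeremy,4\nJane,"3\nAli,12"\n'

# A real parser treats the quoted field as spanning the newline,
# yielding the RFC 4180 parse (header row included):
print(list(csv.reader(io.StringIO(data))))
# [['name', 'number'], ['Alex', '7'], ['Jeremy', '4'], ['Jane', '3\nAli,12']]

# Naively splitting on commas yields the second, incorrect parse, with
# literal quote characters left embedded in the fields:
print([line.split(",") for line in data.splitlines()])
# [['name', 'number'], ['Alex', '7'], ['Jeremy', '4'], ['Jane', '"3'], ['Ali', '12"']]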
This makes parsing CSV files a somewhat fraught experience. It’s exacerbated by the fact that most CSV libraries contain various heuristics for handling these sorts of cases, making the results inconsistent across languages.
Conclusions
My experience consuming CSVs generated by dozens of different systems is that the quoting, escaping, and other parse errors I highlighted as “The Ugly” present a large problem. Unfortunately, there’s no competing format I’m aware of that supports streaming processing, imposes low storage overhead, and is widely supported by tooling. Such a format would be tremendously valuable… if it were widely adopted. There are several alternatives, each of which succeeds on some of these axes:
- Line-delimited JSON works, but imposes a high size overhead (field names are repeated for every record; see the sketch after this list) and is not well supported in non-programming environments.
- Parquet files work well, but streaming is a tad more complex (you need to be able to seek to the end of the file to read the metadata before you can stream the contents) and aren’t as widely supported.
- Excel files don’t support streaming well and are relatively poorly supported in programming environments.
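For illustration, here’s a minimal sketch of the line-delimited JSON approach in Python (the file name and records are invented); note that the keys are repeated on every line, which is exactly where the size overhead comes from:

import json

records = [{"name": "Alex", "number": 7}, {"name": "Jeremy", "number": 4}]
with open("people.ndjson", "w") as f:
    for record in records:
        # One JSON object per line; field names are repeated each time.
        f.write(json.dumps(record) + "\n")

# Reading it back streams naturally: one record per line.
with open("people.ndjson") as f:
    for line in f:
        print(json.loads(line))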
Ultimately, there’s no format that corrects all of CSV’s flaws while retaining its benefits.