Parsing CSV Files in WinRT
I am proud to announce that my CSV Parser for Windows RT is now available on MSDN Code Samples.
Creating a CSV parser sounds like an easy task, but it’s the developer equivalent of quicksand.
CSV stands for Comma Separated Values and, despite the name, the fields in files of this "format" are often not separated by commas.
Often, tabs or pipe characters are used instead.
Additionally, parsing a CSV file is not as straightforward as it seems.
One would think that it would be as simple as splitting the raw text first by line to get the records and then by delimiter to extract the fields.
Simple enough, right?
Year,Make,Model
2003, Chevy, Impala LS
2007, Honda, Accord
1985, Chevy, Caprice Classic
Like many things in our line of work, it can be that simple, but it rarely is.
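For reference, the naive "split by line, then by delimiter" idea can be sketched roughly as follows. This is only an illustration in Python (the same logic maps directly onto string.Split in .NET); the function name parse_naive and the whitespace trimming are assumptions made for the example, not part of the published parser.

```python
def parse_naive(text, delimiter=","):
    """Split text into records by line, then into fields by delimiter."""
    records = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            # Trim the padding that follows each comma in the sample data.
            records.append([field.strip() for field in line.split(delimiter)])
    return records


sample = """Year,Make,Model
2003, Chevy, Impala LS
2007, Honda, Accord
1985, Chevy, Caprice Classic"""

rows = parse_naive(sample)
for row in rows:
    print(row)
```

On clean data like this, every record comes back with exactly three fields, which is why the approach looks so tempting.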
The Slippery Slope
Here’s an example from a feed about earthquake data from the USGS.
Notice how that nice simple code you had in mind suddenly gets more complex.
Src,Eqid,Version,Datetime,Lat,Lon,Magnitude,Depth,NST,Region
ci,15211745,0,"Thursday, September 6, 2012 20:42:58 UTC",34.3067,-117.1348,1.5,19.20,13,"Southern California"
ci,15211737,0,"Thursday, September 6, 2012 19:39:45 UTC",32.5963,-116.9800,1.6,2.30,35,"San Diego County urban area, California"
nc,71838765,0,"Thursday, September 6, 2012 19:38:04 UTC",37.3273,-122.1048,2.0,0.10, 8,"San Francisco Bay area, California"
Clearly, a simple string.Split(',') won’t cut it.
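To make the failure concrete, here is a small sketch (Python again, purely illustrative) that applies a plain comma split to the header and to the first earthquake record above. The variable names are made up for the example.

```python
header = "Src,Eqid,Version,Datetime,Lat,Lon,Magnitude,Depth,NST,Region"
row = ('ci,15211745,0,"Thursday, September 6, 2012 20:42:58 UTC",'
       '34.3067,-117.1348,1.5,19.20,13,"Southern California"')

expected_count = len(header.split(","))  # number of columns the header declares
naive_fields = row.split(",")            # splits inside the quoted date too

print(expected_count)      # 10
print(len(naive_fields))   # 12
print(naive_fields[3])     # "Thursday   (the date field has been cut apart)
```

The two commas inside the quoted Datetime value are treated as delimiters, so the record ends up with two extra fields and every column after the date shifts out of place.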
But wait, it gets worse.
Descent into Development Hell
The first rule of plain text files is that once you pick a special character, you have to come up with a way to escape that character anywhere else it may appear.
It’s a slippery slope, to be sure. First, we made the delimiter character a comma; then, when a comma legitimately appears in our data, we have to make it clear that it’s content, not structure.
Not a big deal, right? Just put the field inside quotes. But what happens when you have to escape the escape characters?
Year,Make,Model,Associated Quote,Artist
1964, Chevy, Impala, ""Rolling in my 64"", Snoop Dog
1982, Chevy, Corvette, ""Little red Corvette, baby you're way too fast"",Prince
Various, Various, Various, ""My main objective is, Benz's and Lexus's"", Lord Tariq
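One common way to cope with quoted fields and escaped quotes is a small character-by-character state machine. The sketch below is a minimal Python illustration and not the code of the published parser. It assumes the convention described in RFC 4180, where a field is wrapped in double quotes and a literal quote inside it is written as two quotes. The name parse_line and the test record (a reformatted version of the Prince row above) are assumptions made for this example. It handles a single record only; quoted fields that span multiple lines would need extra work.

```python
def parse_line(line, delimiter=",", quote='"'):
    """Split one record into fields, honoring quoted fields and doubled quotes."""
    fields = []
    current = []          # characters of the field being built
    in_quotes = False     # True while inside a quoted field
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    current.append(quote)  # doubled quote -> one literal quote
                    i += 1                 # skip the second quote of the pair
                else:
                    in_quotes = False      # closing quote
            else:
                current.append(ch)         # delimiters are plain content here
        elif ch == quote:
            in_quotes = True               # opening quote (accepted leniently)
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))        # last field has no trailing delimiter
    return fields


# Hypothetical record written with RFC 4180-style escaping.
record = '1982,Chevy,Corvette,"""Little red Corvette, baby you\'re way too fast""",Prince'
fields = parse_line(record)
print(len(fields))   # 5
print(fields[3])     # "Little red Corvette, baby you're way too fast"

# The delimiter is a parameter, so tab- or pipe-separated files work the same way.
print(parse_line("2003|Chevy|Impala LS", delimiter="|"))
```

Keeping the delimiter and quote character as parameters is a deliberate choice here, since many real-world "CSV" files use tabs or pipes instead of commas.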
Just Use RegEx!
True, but using RegEx means you have to understand RegEx completely. RegEx, when properly used, works great.
But few people know RegEx all that well.
The bottom line is summed up in this quote:
“Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems.” - Jamie Zawinski
But This is the 21st Century, Use XML or JSON!
Many organizations make extensive use of the CSV file format for data exchange.
In fact, many government agencies expose their datasets on Data.gov via CSV.
Simply put: JSON or XML may not be an option for some organizations.
It’s hard to beat a file format that’s as easy to create as CSV. (It can be as simple as choosing Save As in Excel.)
The bottom line is that there’s a great deal of CSV content out in the wild, and developers should spend their time creating great apps, not parsing files.
