I recently needed to be able to parse CSV export from excel, in Javascript. The only way I could find to parse it was a stream-oriented parser, so I came up with this that seems to cope with the majority of strangeness, including multiline cells, quotation marks, and escape and non-escaped cells.
It’s implemented in coffeescript:
class CSV
constructor: (data) ->
@raw = data
@_parse(@raw)
_parse: ->
@rows = []
row = []
cell = ""
offset = 0
mode = CSV.RAW
while offset < @raw.length
ch = @raw.charAt(offset)
adjacent = @raw.charAt(offset+1)
if mode == CSV.RAW
if ch == ","
row.push cell
cell = ""
else if ch == "\r"
@rows.push row
row = []
cell = ""
else if ch == "\""
mode = CSV.ESC
else
cell += ch
else if mode == CSV.ESC
if ch == "\""
if adjacent == "\""
cell += ch
offset += 1
else
mode = CSV.RAW
else
cell += ch
else
throw "Invalid csv parser mode"
offset += 1
@rows
CSV.RAW = 1
CSV.ESC = 2
@CSV = CSV
This is a good example of code which will block the browser in a big-time way. For example, if you tried to parse a one megabyte CSV file, you’ll probably crash webkit mobile devices, and get script timeout errors in slower javascript implementations. Whoever, because all the looping logic is in a while, it’d be straightforward to make the parser run inside a setInterval loop, and let the browser remain responsive.