ppolv’s blog

February 25, 2008

Parsing CSV in erlang

Filed under: erlang — ppolv @ 9:23 pm

So I need to parse a CSV file in erlang.

Although CSV files have a very simple structure, simply calling lists:tokens(Line, ",") for each line in the file won’t do the trick, as there can be quoted fields that span more than one line and contain commas or escaped quotes.
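
For example, in the following made-up CSV the second record is spread over two lines: its first field contains a comma and its second field contains an escaped quote and a line break, yet it is still one record.

name,notes
"Smith, John","nickname ""JS""
prefers email"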

A detailed discussion of string parsing in erlang can be found in the excellent article Parsing text and binary files with Erlang by Joel Reymont. And its very first example is parsing a CSV file! But being the first example, it was written with simplicity rather than completeness in mind, so it doesn’t take quoted or multi-line fields into account.

Now we will write a simple parser for RFC-4180 documents (which is way cooler than parsing plain old CSV files ;-) ). As the format is really simple, we won’t use yecc or leex, but will parse the input file by hand using binaries, lists and lots of pattern matching.

Our goals are:

  • Recognize fields delimited by commas, records delimited by line breaks
  • Recognize quoted fields
  • Be able to parse quotes, commas and line breaks inside quoted fields
  • Ensure that all records have the same number of fields
  • Provide a fold-like callback interface, in addition to a return-all-records-in-file function

What the parser won’t do:

  • Unicode. We will treat the file as a binary and consider each character to be ASCII, 1 byte wide. To parse Unicode files, you can use xmerl_ucs:from_utf8/1 and then process the resulting list instead of the raw binary (see the sketch just below)
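
As a tiny illustration of that route (this sketch is an addition, not part of the original parser, and the function name is made up), reading a UTF-8 file and decoding it to a list of Unicode code points could look like:

%% Added sketch: read a UTF-8 encoded file and decode it with xmerl_ucs;
%% a list-based variant of the parser could then consume this list
%% instead of the raw binary.
read_unicode_file(FileName) ->
    {ok, Bin} = file:read_file(FileName),
    xmerl_ucs:from_utf8(binary_to_list(Bin)).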

A quick look suggests that the parser will pass through the following states (a short worked example follows the list):
CSV parsing states (state diagram)

  • Field start: at the beginning of each field. Whitespace is kept as part of an unquoted field, but any whitespace before a quoted field is discarded

  • Normal: an unquoted field

  • Quoted: inside a quoted field

  • Post Quoted: after a quoted field. Whitespace can appear between a quoted field and the next field/record, and is discarded
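
To make the states concrete, here is a hand-traced example (added for illustration, not from the original post) of scanning the record  "a,b" , c  :

%% Input:  "a,b" , c
%%
%%   "     field_start -> quoted        (opening quote)
%%   a,b   quoted                       (the comma here is just data)
%%   "     quoted -> post_quoted        (closing quote)
%%   ' '   post_quoted                  (whitespace after the quotes, discarded)
%%   ,     -> field_start               (field delimiter, field "a,b" is complete)
%%   ' '   field_start                  (whitespace kept, the next field is unquoted)
%%   c     field_start -> normal
%%   \n    end of record: {"a,b"," c"} is assembled and passed to the callback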

Parsing state

While parsing, we will use the following record to keep track of the current state

-record(ecsv,{
   state = field_start,  %%field_start|normal|quoted|post_quoted
   cols = undefined, %%how many fields per record
   current_field = [],
   current_record = [],
   fold_state,
   fold_fun  %%user supplied fold function
   }).

API functions

parse_file(FileName,InitialState,Fun) ->
   {ok, Binary} = file:read_file(FileName),
    parse(Binary,InitialState,Fun).
    
parse_file(FileName)  ->
   {ok, Binary} = file:read_file(FileName),
    parse(Binary).

parse(X) ->
   R = parse(X,[],fun(Fold,Record) -> [Record|Fold] end),
   lists:reverse(R).
		
parse(X,InitialState,Fun) ->
   do_parse(X,#ecsv{fold_state=InitialState,fold_fun = Fun}).

The three-argument functions provide the fold-like interface, while the single-argument ones return a list with all the records in the file.
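
For example, assuming the code above is compiled into a module named csv (the module name, the sample data and the counting fun are made up for this illustration), a shell session might look like:

1> Csv = <<"name,comment\n\"Smith, John\",\"says \"\"hi\"\"\"\n">>.
<<"name,comment\n\"Smith, John\",\"says \"\"hi\"\"\"\n">>
2> csv:parse(Csv).
[{"name","comment"},{"Smith, John","says \"hi\""}]
3> csv:parse(Csv, 0, fun(Count, _Record) -> Count + 1 end).
2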

Parsing

Now the fun part!
The transitions (State x Input -> NewState) are almost 1:1 derived from the diagram, with minor changes (like the handling of field and record delimiters, which is common to both the normal and post_quoted states).
Inside a quoted field, a double quote must be escaped by preceding it with another double quote. It’s really easy to distinguish this case by matching against

<<$",$",_/binary>>

a sort of “lookahead”, in yacc terminology.
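
As a standalone illustration (the function name here is invented for the example, it is not part of the parser), the two-byte match cleanly separates the three cases:

%% Added example: classify what a leading double quote means inside a
%% quoted field, using the same two-byte pattern as the parser.
quote_kind(<<$",$",_/binary>>) -> escaped_quote;  %% "" is data, an escaped quote
quote_kind(<<$",_/binary>>)    -> closing_quote;  %% a single " ends the field
quote_kind(_)                  -> not_a_quote.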

 
%% --------- Field_start state ---------------------
%%whitespace, loop in field_start state
do_parse(<<32,Rest/binary>>,S = #ecsv{state=field_start,current_field=Field})->		
	do_parse(Rest,S#ecsv{current_field=[32|Field]});

%%it's a quoted field, discard previous whitespace
do_parse(<<$",Rest/binary>>,S = #ecsv{state=field_start})->		
	do_parse(Rest,S#ecsv{state=quoted,current_field=[]});

%%anything else is an unquoted field
do_parse(Bin,S = #ecsv{state=field_start})->
	do_parse(Bin,S#ecsv{state=normal});	
		
		
%% --------- Quoted state ---------------------	
%%Escaped quote inside a quoted field	
do_parse(<<$",$",Rest/binary>>,S = #ecsv{state=quoted,current_field=Field})->
	do_parse(Rest,S#ecsv{current_field=[$"|Field]});		
	
%%End of quoted field
do_parse(<<$",Rest/binary>>,S = #ecsv{state=quoted})->
	do_parse(Rest,S#ecsv{state=post_quoted});
	
%%Anything else inside a quoted field
do_parse(<<X,Rest/binary>>,S = #ecsv{state=quoted,current_field=Field})->
	do_parse(Rest,S#ecsv{current_field=[X|Field]});
	
do_parse(<<>>, #ecsv{state=quoted})->	
	throw({ecsv_exception,unclosed_quote});
	
	
%% --------- Post_quoted state ---------------------		
%%consume whitespace after a quoted field
do_parse(<<32,Rest/binary>>,S = #ecsv{state=post_quoted})->	
	do_parse(Rest,S);


%%---------Comma and New line handling. ------------------
%%---------Common code for post_quoted and normal state---

%%EOF in a new line, return the records
do_parse(<<>>, #ecsv{current_record=[],fold_state=State})->	
	State;
%%EOF in the last line, add the last record and continue
do_parse(<<>>,S)->	
	do_parse(<<>>,new_record(S));

%% skip carriage return (windows files use CRLF)
do_parse(<<$\r,Rest/binary>>,S = #ecsv{})->
	do_parse(Rest,S);

%% new record
do_parse(<<$\n,Rest/binary>>,S = #ecsv{}) ->
	do_parse(Rest,new_record(S));
	
do_parse(<<$, ,Rest/binary>>,S = #ecsv{current_field=Field,current_record=Record})->	
	do_parse(Rest,S#ecsv{state=field_start,
					  current_field=[],
					  current_record=[lists:reverse(Field)|Record]});


%%A double quote anywhere other than the cases already handled is an error
do_parse(<<$",_Rest/binary>>, #ecsv{})->	
	throw({ecsv_exception,bad_record});
	
%%Anything other than whitespace or line ends in post_quoted state is an error
do_parse(<<_X,_Rest/binary>>, #ecsv{state=post_quoted})->
 	throw({ecsv_exception,bad_record});

%%Accumulate Field value
do_parse(<<X,Rest/binary>>,S = #ecsv{state=normal,current_field=Field})->
	do_parse(Rest,S#ecsv{current_field=[X|Field]}).

Record assembly and callback

Convert each record to a tuple, and check that it has the same number of fields as the previous records. Then invoke the callback function with the new record and the previous fold state.

%%check the record size against the previous one, and update the state.
new_record(S=#ecsv{cols=Cols,current_field=Field,current_record=Record,fold_state=State,fold_fun=Fun}) ->
	NewRecord = list_to_tuple(lists:reverse([lists:reverse(Field)|Record])),
	if
		(tuple_size(NewRecord) =:= Cols) or (Cols =:= undefined) ->
			NewState = Fun(State,NewRecord),
			S#ecsv{state=field_start,cols=tuple_size(NewRecord),
					current_record=[],current_field=[],fold_state=NewState};
		
		(tuple_size(NewRecord) =/= Cols) ->
			throw({ecsv_exception,bad_record_size})
	end.

Final notes

We used a single function, do_parse/2, with many clauses to do the parsing. In a more complex scenario, you will probably use different functions for different sections of the grammar you are parsing. You could also first tokenize the input and then parse the resulting token stream; this can make your work simpler even if you aren’t using a parser generator like yecc (this is the approach I’m using to parse LDAP filters). A small sketch of that idea follows.
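
As an added sketch of that tokenize-then-parse idea (the function and token names are invented for this example, and quoting is deliberately ignored to keep it short), a first pass could emit a flat token list and a second pass could fold it into records:

%% Added sketch, not part of the parser above.
%% to_records(tokenize(<<"a,b\nc,d">>)) =:= [{"a","b"},{"c","d"}]

%% Scanner pass: turn the binary into {field,String} and newline tokens.
tokenize(Bin) -> tokenize(Bin, [], []).

tokenize(<<>>, [], Toks) ->
    lists:reverse(Toks);
tokenize(<<>>, Field, Toks) ->
    lists:reverse([newline, {field, lists:reverse(Field)} | Toks]);
tokenize(<<$, ,Rest/binary>>, Field, Toks) ->
    tokenize(Rest, [], [{field, lists:reverse(Field)} | Toks]);
tokenize(<<$\n,Rest/binary>>, Field, Toks) ->
    tokenize(Rest, [], [newline, {field, lists:reverse(Field)} | Toks]);
tokenize(<<C,Rest/binary>>, Field, Toks) ->
    tokenize(Rest, [C | Field], Toks).

%% Parser pass: group the token stream into one tuple per record.
to_records(Tokens) -> to_records(Tokens, [], []).

to_records([], [], Records) ->
    lists:reverse(Records);
to_records([], Fields, Records) ->
    lists:reverse([list_to_tuple(lists:reverse(Fields)) | Records]);
to_records([{field, F} | Rest], Fields, Records) ->
    to_records(Rest, [F | Fields], Records);
to_records([newline | Rest], Fields, Records) ->
    to_records(Rest, [], [list_to_tuple(lists:reverse(Fields)) | Records]).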


17 Comments »

  1. $n should be $\n

    Comment by anoop aryal — September 17, 2008 @ 5:20 am

  2. Took your idea and played with it. Here is my version of the CSV parser using lists. Presumably if one uses a read_unicode_file function to read a file, the parse/1 function would be able to process unicode CSV files.

    %
    % Parse RFC 4180, CSV files.
    % GG20081222
    %

    -module(csv).
    -export([parse/1, parse_file/1]).
    -author("gerald.gutierrez@gmail.com").

    parse_file(Fn) ->
    {ok, Data} = file:read_file(Fn),
    parse(binary_to_list(Data)).

    parse(Data) -> lists:reverse(parse(Data, [])).

    parse([], Acc) -> Acc;
    parse(Data, Acc) ->
    {Line, Tail} = parse_line(Data),
    parse(Tail, [Line|Acc]).

    parse_line(Data) ->
    {Line, Tail} = parse_line(Data, []),
    {lists:reverse(Line), Tail}.

    parse_line([13,10|Data], Acc) -> {Acc, Data};
    parse_line([10|Data], Acc) -> {Acc, Data};
    parse_line([13|Data], Acc) -> {Acc, Data};
    parse_line([], Acc) -> {Acc, []};
    parse_line([$,,$,|Data], Acc) -> parse_line(Data, [""|Acc]);
    parse_line([$,|Data], Acc) -> parse_line(Data, Acc);
    parse_line(Data, Acc) ->
    {Fld, Tail} = parse_field(Data),
    parse_line(Tail, [Fld|Acc]).

    parse_field([$"|Data]) ->
    {Fld, Tail} = parse_fieldq(Data, “”),
    {lists:reverse(Fld), Tail};
    parse_field(Data) ->
    {Fld, Tail} = parse_field(Data, “”),
    {lists:reverse(Fld), Tail}.

    parse_field([$,|Tail], Acc) -> {Acc, [$,|Tail]};
    parse_field([13|Tail], Acc) -> {Acc, [13|Tail]};
    parse_field([10|Tail], Acc) -> {Acc, [10|Tail]};
    parse_field([], Acc) -> {Acc, []};
    parse_field([Ch|Tail], Acc) -> parse_field(Tail, [Ch|Acc]).

    parse_fieldq([$",$"|Tail], Acc) -> parse_fieldq(Tail, [$"|Acc]);
    parse_fieldq([$"|Tail], Acc) -> {Acc, Tail};
    parse_fieldq([Ch|Tail], Acc) -> parse_fieldq(Tail, [Ch|Acc]).

    Comment by Gerald Gutierrez — December 23, 2008 @ 2:32 am

  3. Fixed a couple of bugs, added support for CR, LF, and CRLF end of lines as well as a small regression test suite using eunit.

    %% ---
    %% Parse csv formatted data (RFC-4180)
    %% ---

    -module(csv).
    -import(lists, [reverse/1]).
    -export([parse_file/3,parse_file/1,parse/3,parse/1]).

    -record(ecsv,{
    state = field_start, %%field_start|normal|quoted|post_quoted
    cols = undefined, %%how many fields per record
    current_field = [],
    current_record = [],
    fold_state,
    fold_fun %%user supplied fold function
    }).

    %% --------- Exported ------------------------------
    parse_file(FileName,InitialState,Fun) ->
    {ok, Binary} = file:read_file(FileName),
    parse(Binary,InitialState,Fun).

    parse_file(FileName) ->
    {ok, Binary} = file:read_file(FileName),
    parse(Binary).

    parse(Binary) ->
    R = parse(Binary,[],fun(Fold,Record) -> [Record|Fold] end),
    lists:reverse(R).

    parse(Binary,InitialState,Fun) ->
    do_parse(Binary,#ecsv{fold_state=InitialState,fold_fun=Fun}).

    %% --------- Field_start state ---------------------
    %%whitespace, loop in field_start state
    do_parse(<<32,Rest/binary>>,S = #ecsv{state=field_start,current_field=Field})->
    do_parse(Rest,S#ecsv{current_field=[32|Field]});

    %%its a quoted field, discard previous whitespaces
    do_parse(<<$",Rest/binary>>,S = #ecsv{state=field_start})->
    do_parse(Rest,S#ecsv{state=quoted,current_field=[]});

    %%anything else, is an unquoted field
    do_parse(Bin,S = #ecsv{state=field_start})->
    do_parse(Bin,S#ecsv{state=normal});

    %% --------- Quoted state ---------------------
    %%Escaped quote inside a quoted field
    do_parse(<<$",$",Rest/binary>>,S = #ecsv{state=quoted,current_field=Field})->
    do_parse(Rest,S#ecsv{current_field=[$"|Field]});

    %%End of quoted field
    do_parse(<<$",Rest/binary>>,S = #ecsv{state=quoted})->
    do_parse(Rest,S#ecsv{state=post_quoted});

    %%Anything else inside a quoted field
    do_parse(<<X,Rest/binary>>,S = #ecsv{state=quoted,current_field=Field})->
    do_parse(Rest,S#ecsv{current_field=[X|Field]});

    do_parse(<<>>, #ecsv{state=quoted})->
    throw({ecsv_exception,unclosed_quote});

    %% --------- Post_quoted state ---------------------
    %%consume whitespaces after a quoted field
    do_parse(<<32,Rest/binary>>,S = #ecsv{state=post_quoted})->
    do_parse(Rest,S);

    %%---------Comma and New line handling. ------------------
    %%---------Common code for post_quoted and normal state---

    %%EOF in a new line, return the records
    do_parse(<<>>, #ecsv{current_record=[],fold_state=State})->
    State;
    %%EOF in the last line, add the last record and continue
    do_parse(<<>>,S)->
    do_parse(<<>>,new_record(S));

    %% new record windows
    do_parse(<<$\r,$\n,Rest/binary>>,S = #ecsv{})->
    do_parse(Rest,new_record(S));

    %% new record pre Mac OSX 10
    do_parse(<<$\r,Rest/binary>>,S = #ecsv{}) ->
    do_parse(Rest,new_record(S));

    %% new record Unix
    do_parse(<<$\n,Rest/binary>>,S = #ecsv{}) ->
    do_parse(Rest,new_record(S));

    do_parse(<<$, ,Rest/binary>>,S = #ecsv{current_field=Field,current_record=Record})->
    do_parse(Rest,S#ecsv{state=field_start,
    current_field=[],
    current_record=[lists:reverse(Field)|Record]});

    %%A double quote in any other place than the already managed is an error
    do_parse(<<$",_Rest/binary>>, #ecsv{})->
    throw({ecsv_exception,bad_record});

    %%Anything other than whitespace or line ends in post_quoted state is an error
    do_parse(<<_X,_Rest/binary>>, #ecsv{state=post_quoted})->
    throw({ecsv_exception,bad_record});

    %%Accumulate Field value
    do_parse(<<X,Rest/binary>>,S = #ecsv{state=normal,current_field=Field})->
    do_parse(Rest,S#ecsv{current_field=[X|Field]}).

    %%check the record size against the previous one, and update the state.
    new_record(S=#ecsv{cols=Cols,current_field=Field,current_record=Record,fold_state=State,fold_fun=Fun}) ->
    NewRecord = list_to_tuple(lists:reverse([lists:reverse(Field)|Record])),
    if
    (tuple_size(NewRecord) =:= Cols) or (Cols =:= undefined) ->
    NewState = Fun(State,NewRecord),
    S#ecsv{state=field_start,cols=tuple_size(NewRecord),
    current_record=[],current_field=[],fold_state=NewState};

    (tuple_size(NewRecord) =/= Cols) ->
    throw({ecsv_exception,bad_record_size})
    end.

    %% -------- Regression tests ------------------------
    %% From the erl interpreter run csv:test() to run regression tests.
    %% See eunit for more information.
    -define(TEST,true).
    -ifdef(TEST).
    -include_lib("eunit/include/eunit.hrl").

    csv_test_() ->
    [% empty binary
    ?_assertEqual([], parse(<>)),
    % Unix LF
    ?_assertEqual([{"1A","1B","1C"},{"2A","2B","2C"}], parse(<>)),
    % Unix LF with extra spaces after quoted element stripped
    ?_assertEqual([{"1A","1B","1C"},{"2A","2B","2C"}], parse(<>)),
    % Unix LF with extra spaces preserved in unquoted element
    ?_assertEqual([{" 1A ","1B","1C"},{"2A","2B","2C"}], parse(<>)),
    % Pre Mac OSX 10 CR
    ?_assertEqual([{"1A","1B","1C"},{"2A","2B","2C"}], parse(<>)),
    % Windows CRLF
    ?_assertEqual([{"1A","1B","1C"},{"2A","2B","2C"}], parse(<>)),
    % Quoted element
    ?_assertEqual([{"1A","1B"}], parse(<>)),
    % Nested quoted element
    ?_assertEqual([{"1A","\"1B\""}], parse(<>)),
    % Quoted element with embedded LF
    ?_assertEqual([{"1A","1\nB"}], parse(<>)),
    % Missing 2nd quote
    ?_assertThrow({ecsv_exception,unclosed_quote}, parse(<>)),
    % Bad record size
    ?_assertThrow({ecsv_exception,bad_record_size}, parse(<>))
    ].

    -endif.

    Comment by lkraz — December 31, 2008 @ 4:16 am

  4. It seems that the website is stripping some of the code when doing a simple cut and paste. Email me at luke.krasnoff@gmail.com for the code.

    Comment by lkraz — December 31, 2008 @ 4:21 am

  5. Thanks all for your comments!

    @Gerald: you are right, your parsing functions are able to work on unicode data without any change other than decoding the initial unicode (xmerl has functions for that)

    @lkraz: The eunit tests look neat, I really like them.

    Comment by ppolv — December 31, 2008 @ 1:36 pm

  6. Is there a license by which someone can use this code. It looks pretty excellent.

    Comment by Dave — April 14, 2009 @ 9:23 pm

    @Dave: I didn’t include any explicit license info as it’s a simple algorithm, not a complete system. Just go ahead and use it if it is helpful!
    You can consider it to be in the public domain.

    Comment by ppolv — April 14, 2009 @ 9:43 pm

  8. Thanks! That’s very helpful.

    Comment by Dave — April 14, 2009 @ 9:44 pm

  9. Nice drawing too! Which tool did you use for that?

    Comment by Gregory Van de Wiele — May 18, 2009 @ 7:43 pm

  10. Would you consider turning this into an full-on Erlang library on Github/Bitbucket? I’m sure the Erlang community would be greatly appreciative.

    Comment by Carlo Cabanilla — May 25, 2009 @ 12:48 am

    • Yeah, it would be great to get this code up on github or something.

      Comment by joe — July 23, 2009 @ 5:23 pm

  11. Great work! Is there a download link? The code posted earlier seems to have missing chars and doesn’t compile.
    thanks

    Comment by Praveen — June 16, 2009 @ 7:52 pm

  12. For my application I have to parse a CSV file using Erlang, so I am using Gerald Gutierrez’s code as mentioned above. But in this code I have found two issues, which are listed below.

    %% just copy and paste his code.
    parse_file(Fn) ->
    {ok, Data} = file:read_file(Fn),
    parse(binary_to_list(Data)).

    parse(Data) -> lists:reverse(parse(Data, [])).

    parse([], Acc) -> Acc;
    parse(Data, Acc) ->
    {Line, Tail} = parse_line(Data),
    parse(Tail, [Line|Acc]).

    parse_line(Data) ->
    {Line, Tail} = parse_line(Data, []),
    {lists:reverse(Line), Tail}.

    parse_line([13,10|Data], Acc) -> {Acc, Data};
    parse_line([10|Data], Acc) -> {Acc, Data};
    parse_line([13|Data], Acc) -> {Acc, Data};
    parse_line([], Acc) -> {Acc, []};
    parse_line([$,,$,|Data], Acc) -> parse_line(Data, [""|Acc]);
    parse_line([$,|Data], Acc) -> parse_line(Data, Acc);
    parse_line(Data, Acc) ->
    {Fld, Tail} = parse_field(Data),
    parse_line(Tail, [Fld|Acc]).

    parse_field([34|Data]) ->
    {Fld, Tail} = parse_fieldq(Data, ""),
    {lists:reverse(Fld), Tail};
    parse_field(Data) ->
    {Fld, Tail} = parse_field(Data, ""),
    {lists:reverse(Fld), Tail}.

    parse_field([$,|Tail], Acc) -> {Acc, [$,|Tail]};
    parse_field([13|Tail], Acc) -> {Acc, [13|Tail]};
    parse_field([10|Tail], Acc) -> {Acc, [10|Tail]};
    parse_field([], Acc) -> {Acc, []};
    parse_field([Ch|Tail], Acc) -> parse_field(Tail, [Ch|Acc]).

    parse_fieldq([34,34|Tail], Acc) -> parse_fieldq(Tail, [34|Acc]);
    parse_fieldq([34|Tail], Acc) -> {Acc, Tail};
    parse_fieldq([Ch|Tail], Acc) -> parse_fieldq(Tail, [Ch|Acc]).

    This code works fine but has two issues. 1. Since the code splits values on double quotes ("") and commas (,), in the following example, where the First Name contains a double-quoted string within it, the parser will create one extra field.

    "Type","First Name","Last Name","Email"
    "Contact","Ashwani Garg ------"All Pain Will End."","","itisashwani4u@gmail.com"

    result:-
    [["contact"],["Ashwani Garg ------"],["All Pain Will End."],[],["itisashwani4u@gmail.com"]]

    expected result:-
    [["contact"],["Ashwani Garg ------All Pain Will End."],[],["itisashwani4u@gmail.com"]]

    2. For the following kind of CSV, it truncates some values: First Name,Last Name,Middle Name,Name,Nickname,E-mail Address,Home Street,Home City,Home Postal Code,Home State,Home Country/Region,Home Phone,Home Fax,Mobile Phone,Personal Web Page,Business Street,Business City,Business Postal Code,Business State,Business Country/Region,Business Web Page,Business Phone,Business Fax,Pager,Company,Job Title,Department,Office Location,Notes

    Affection,,,Affection,,,,,,,,+919845141544,,+919845141544,,,,,,,,,,,,,,,
    result:-
    [["Affection"],[],[],["Affection"],[],[],[],[],[],[],[],["+919845141544"],[],["+919845141544"],[],[],[],[],[],[],[]]
    expected result:-
    [["Affection"],[],[],["Affection"],[],[],[],[],[],[],[],["+919845141544"],[],["+919845141544"],[],[],[],[],[],[],[],[],[],[],[],[],[],[]]

    Please help me …

    Comment by Abhimanyu Singh — October 7, 2009 @ 4:48 pm

    • This is my version that I use regularly:

      parse_csv(String) -> parse_csv_val(re:replace(String,"\r","\n", [global, {return, list}]),no,[],[]).
      parse_csv_val([],no,Vals,DV) -> lists:reverse([lists:reverse(Vals)|DV]);
      parse_csv_val([],{q,V},Vals,DV) -> lists:reverse([lists:reverse([lists:reverse(V)|Vals])|DV]);
      parse_csv_val([],Val,Vals,DV) -> lists:reverse([lists:reverse([lists:reverse(Val)|Vals])|DV]);
      parse_csv_val([$\n|Rest],no,Vals,DV) -> parse_csv_val(Rest,no,[],[lists:reverse(Vals)|DV]);
      parse_csv_val([$\n|Rest],{q,V},Vals,DV) -> parse_csv_val(Rest,{q,[$\n|V]},Vals,DV);
      parse_csv_val([$\n|Rest],Val,Vals,DV) -> parse_csv_val(Rest,no,[],[lists:reverse([lists:reverse(Val)|Vals])|DV]);
      parse_csv_val([$"|Rest],no,Vals,DV) -> parse_csv_val(Rest,{q,[]},Vals,DV);
      parse_csv_val([$"|Rest],{q,V},Vals,DV) -> parse_csv_val(Rest,no,[lists:reverse(V)|Vals],DV);
      parse_csv_val([$,|Rest],no,Vals,DV) -> parse_csv_val(Rest,no,Vals,DV);
      parse_csv_val([$,|Rest],{q,V},Vals,DV) -> parse_csv_val(Rest,{q,[$,|V]},Vals,DV);
      parse_csv_val([$,|Rest],Val,Vals,DV) -> parse_csv_val(Rest,no,[lists:reverse(Val)|Vals],DV);
      parse_csv_val([X|Rest],no,Vals,DV) -> parse_csv_val(Rest,[X],Vals,DV);
      parse_csv_val([X|Rest],{q,V},Vals,DV) -> parse_csv_val(Rest,{q,[X|V]},Vals,DV);
      parse_csv_val([X|Rest],Val,Vals,DV) -> parse_csv_val(Rest,[X|Val],Vals,DV).

      Comment by Andrew — November 20, 2012 @ 11:40 am

  13. “Parsing CSV in erlang | ppolv’s blog” was in fact a incredibly good posting, . Keep writing and I am going to keep on browsing! Regards -Angelo

    Comment by Zack — May 31, 2013 @ 3:47 am

  14. Thanks for posting “Parsing CSV in erlang | ppolv’s blog”. I reallymight definitely be coming back for far more browsing and commenting soon enough. Many thanks, Latia

    Comment by Aracely — August 9, 2013 @ 6:34 pm

