ppolv’s blog

May 9, 2008

fun with mochiweb’s html parser and xpath

Filed under: erlang — ppolv @ 11:07 pm

Some days ago, while reading Roberto Saccon’s blog, I noticed that mochiweb has an html parser.

So a couple of days later, here I am giving it a try.
The structure of the generated tree is quite simple (I’m only interested in text and element nodes).
A text node is represented as an erlang binary(), an element node as a 3-element tuple {Tag,Attrs,Contents}, where Tag is the element’s tag, Attrs is a list of {Name,Value} pairs and Contents is a list of the element’s contents.

In contrast with xmerl, the input document is never converted to string(). Text, tags, attribute names and values generated by mochiweb’s parser are all binary(), which I guess helps with both parsing speed and memory consumption.
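
For instance, here is a rough sketch of what the parser returns for a tiny document (the shape follows the {Tag,Attrs,Contents} description above, though the exact output may differ slightly):

1> mochiweb_html:parse(<<"<html><body><p class=\"intro\">Hello</p></body></html>">>).
{<<"html">>,[],
 [{<<"body">>,[],
   [{<<"p">>,
     [{<<"class">>,<<"intro">>}],
     [<<"Hello">>]}]}]}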

My final goal with the parser would be to use it inside tsung http benchmarks. Currently you have to use regular expressions to refer to elements of the page; I’d prefer to use XPath instead. Also, the regexp module can become painfully slow when you are working with multiple expressions, so using an html parser + XPath makes sense, as the cost of parsing the html is amortized across all the xpath expressions.

I wrote a simple XPath interpreter that works over mochiweb’s html tree. Currently it’s capable of executing XPath expressions with

  • self, child, descendant_or_self and attribute axes
  • predicates
  • some predefined functions (count/1, starts-with/2, ...)
  • user-defined functions

There is a .tar.gz file in the mochiweb group containing the xpath code.
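
As a quick sketch of the API (the tree comes from mochiweb_html:parse/1 as above; the result values here are made up for illustration):

Tree = mochiweb_html:parse(Html),
Srcs = mochiweb_xpath:execute("//img/@src", Tree),
%% Srcs is a list of binaries, e.g. [<<"/img/logo.png">>,<<"http://example.com/banner.gif">>]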

Example: simple HTML Screen Scraping in Erlang

To demonstrate how to use the parser/xpath combination, we’ll write a function that, given a URL pointing to a web page, returns a summary of how much the page weighs, in the spirit of Web Page Analyzer. For simplicity, we’ll only consider

  • the size of the page itself
  • the size of the images
  • the size of the external scripts files
  • the size of the external css files

so we won’t be searching for things like flash or other types of embedded objects.
Also, we’ll be working on the raw html page as returned by the server, so if the page contains javascript code that adds images to the page, we won’t be aware of that.

The steps are:

  1. Request the desired page using http
  2. Parse the body of the response using mochiweb’s parser
  3. Using XPath, search for the url of all the images, external scripts and css contained in the page
  4. Request the size of each of the documents at the urls obtained in the previous step

Since this is an erlang example, there should be some concurrency involved, shouldn’t there? So at step 4, instead of sequentially scanning all the urls, we will spawn a new process for each and make all the requests in parallel. Nice and simple.

Let’s start with our main function:


page_info(URL) ->
    case http:request(URL) of
        {ok,{_,Headers,Body}} ->
            got_page_info(URL,content_length(Headers),Body);
        {error,Reason} -> 
            {error,Reason}
    end.

It makes an http request to the given URL and, if everything goes well, passes the results to got_page_info/3 for processing.


got_page_info(URL, PageSize,Body) ->
    Tree = mochiweb_html:parse(Body),
    Imgs = remove_duplicates(mochiweb_xpath:execute("//img/@src",Tree)),
    Css = remove_duplicates(mochiweb_xpath:execute("//link[@rel='stylesheet']/@href",Tree)),
    Scripts = remove_duplicates(mochiweb_xpath:execute("//script/@src",Tree)),
  
    URLCtx = url_context(URL),
    spawn_workers(URLCtx,img,lists:map(fun  binary_to_list/1,Imgs)),
    spawn_workers(URLCtx,css,lists:map(fun  binary_to_list/1,Css)),
    spawn_workers(URLCtx,script,lists:map(fun  binary_to_list/1,Scripts)),
    
    TRef = erlang:send_after(?TIMEOUT,self(),timeout),
    State = #state{page=PageSize,
                  timer=TRef,
                  errors=[],
                  img=0,
                  css=0,
                  script=0},
    wait_for_responses(State,length(Imgs) + length(Css) + length(Scripts)).

Now we have reached the interesting part. First, parse the http body (mochiweb_html:parse/1), then search the document using xpath (mochiweb_xpath:execute/2). With those results in hand, spawn all the bots that will be in charge of querying the urls (spawn_workers/3). Then sit and wait (wait_for_responses/2) until all the bots have finished or the timeout expires.

We remove duplicates since the same image could be used multiple times in the same page, and we want to count it only once. URLCtx will be used by the bots to resolve relative and absolute urls.
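
The post doesn’t show the module preamble this code relies on; here is a minimal sketch of what is assumed (the record fields are inferred from the code above, the timeout value is arbitrary, and inets must be started for http:request to work):

%% Assumed preamble, not shown in the original code; values are illustrative.
-define(TIMEOUT, 10000).   %% how long to wait for all the bots, in milliseconds

-record(state, {page,      %% size of the html page itself, in bytes
                timer,     %% timer reference returned by erlang:send_after/3
                errors,    %% list of {URL,Reason} for failed requests
                img,       %% accumulated size of images
                css,       %% accumulated size of stylesheets
                script}).  %% accumulated size of external scripts

%% And somewhere before calling page_info/1:
%% inets:start().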


wait_for_responses(State,0) ->
    finish(State,0);

wait_for_responses(State,Count) ->
    receive
        {component,Type,URL,Info} ->
            NewState = add_info(State,Type,URL,Info),
            wait_for_responses(NewState,Count-1);
        timeout ->
            finish(State,Count)
    end.

The protocol between the master and the bots is simple:
each bot will report its findings by sending a message back to the originating process, and then die.

In the master process, if the count of remaining bots is 0 then all bots have finished, and we only need to return the gathered info. If not, we wait until some message arrives.
We are interested in two kinds of messages:

  • a message from a worker process: we update the state and loop again
  • a timeout: the timeout has expired, so we don’t wait any longer for the remaining bots to complete.

For the purposes of this exercise, this is enough. In a real application we would probably want to do some additional work, like watching for crashes in the bots we’ve spawned. We would also need to kill any remaining processes when the timeout expires, as sketched below.
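
A minimal sketch of how that could look while keeping the same message protocol (this is an addition of mine, not code from the post): spawn the bots with spawn_monitor/1 so the master is also notified of crashes, and treat an abnormal 'DOWN' message like an error reply.

%% Sketch: a crash-aware variant of spawn_workers/3.
spawn_workers(URLCtx,Type,URLs) ->
    Master = self(),
    lists:foreach(fun(URL) ->
        spawn_monitor(fun() ->
            Master ! {component,Type,URL,get_component_info(URLCtx,URL)}
        end)
    end, URLs).

%% ...plus two extra clauses in the receive of wait_for_responses/2:
%%     {'DOWN',_Ref,process,_Pid,normal} ->
%%         %% the bot exited normally; its {component,...} message is counted separately
%%         wait_for_responses(State,Count);
%%     {'DOWN',_Ref,process,_Pid,Reason} ->
%%         %% the bot crashed before reporting: record it as an error
%%         wait_for_responses(add_info(State,unknown,unknown,{error,Reason}),Count-1);

Keeping the {Pid,Ref} pairs returned by spawn_monitor/1 around would also let the master kill any leftover bots with exit(Pid,kill) when the timeout fires.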


add_info(State = #state{css=Css},css,_URL,{ok,Info}) ->
    State#state{css = Css + Info};
add_info(State = #state{img=Img},img,_URL,{ok,Info}) ->
    State#state{img = Img + Info};
add_info(State = #state{script=Script},script,_URL,{ok,Info}) ->
    State#state{script = Script + Info};
add_info(State = #state{errors=Errors},_Type,URL,{error,Reason}) ->
    State#state{errors=[{URL,Reason}|Errors]}. 

This is self-explanatory. Update the current state with the new info, counting how many bytes are in images, scripts and css files.


spawn_workers(URLCtx,Type,URLs) ->
    Master = self(),
    lists:foreach(fun(URL) -> 
      spawn(fun()-> Master ! {component,Type,URL,get_component_info(URLCtx,URL)} end)
    end,URLs).

The code above is all we need to create the bot processes: one bot per URL, and each bot sends a message back to the master with the result of evaluating get_component_info/2.


get_component_info(URLCtx,URL) ->
    FullURL = full_url(URLCtx,URL),
    case http:request(head,{FullURL,[]},[],[]) of
        {ok, {_,Headers,_Body}} ->
            {ok,content_length(Headers)};
        {error,Reason} -> 
            {error,Reason}
    end.

And the rest, mainly utility functions:


remove_duplicates(L) ->
    sets:to_list(sets:from_list(L)).

% extract content-length from the http headers
content_length(Headers) ->
    list_to_integer(proplists:get_value("content-length",Headers,"0")).

%% abs url inside the same server, e.g. /img/image.png
full_url({Root,_Context},ComponentUrl=[$/|_]) ->
    Root ++ ComponentUrl;

%% full url, e.g. http://other.com/img.png
full_url({_Root,_Context},ComponentUrl="http://"++_) ->
    ComponentUrl;

% everything else is considered a relative path; obviously this is wrong for paths like ../img
full_url({Root,Context},ComponentUrl) ->
    Root ++ Context ++ "/" ++ ComponentUrl.

% returns the domain and the current context path, e.g.
% url_context("http://www.some.domain.com/content/index.html")
%     -> {"http://www.some.domain.com", "/content"}
url_context(URL) ->
    {http,_,Root,_Port,Path,_Query} = http_uri:parse(URL),
    Ctx = string:sub_string(Path,1, string:rstr(Path,"/")),
    {"http://"++Root,Ctx}.
    
% cancel the timeout timer and return results as a tuple
finish(State,Remaining) ->
    #state{page=PageSize,
           img=ImgSize,
           css=CssSize,
           script=ScriptSize,
           errors=Errors,
           timer=TRef} = State,
    erlang:cancel_timer(TRef),
    {PageSize,ImgSize,CssSize,ScriptSize,Errors,Remaining}.

% pretty print
report({PageSize,ImgSize,CssSize,ScriptSize,Errors,Missing}) ->
    io:format("html size: ~.2fkb~n",[PageSize/1024]),
    io:format("images:    ~.2fkb~n",[ImgSize/1024]),
    io:format("styleshets:~.2fkb~n",[CssSize/1024]),
    io:format("scripts:   ~.2fkb~n",[ScriptSize/1024]),
    Total = PageSize + ImgSize + CssSize + ScriptSize,
    io:format("~nTotal:     ~.2fkb~n",[Total/1024]),
    lists:foreach(fun({URL,Code}) -> io:format("~s error!: ~p~n",[URL,Code]) end, Errors),
    case Missing of
        0 -> ok;
        Missing  -> io:format("Timeouts: ~b~n",[Missing])
    end.

That’s it. Now a sample erlang session to see our code in action 🙂

51> L = page_tester:page_info("https://ppolv.wordpress.com").
{39077,38799,4672,20756,[],0}
52> page_tester:report(L).
HTML Size: 38.16kb
Images: 37.89kb
Stylesheets: 4.56kb
Scripts: 20.27kb

Total: 100.88kb
ok
53>

Easy, isn’t it?

p.s.
it looks like I finally learned how to easily embed erlang code in wordpress, taking some ideas from code-syntax-highlighting-for-blogs-with-vim

February 29, 2008

esolr, an erlang text search client library for Apache Solr

Filed under: erlang — ppolv @ 4:42 am

From the Apache Solr website:

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat.

Nice, a full text search engine easily accessible from anywhere. Just HTTP, no special binding required.

I’ve just hacked esolr, a simple, almost untested and featureless erlang client for Solr ;-). Well, there weren’t many operations to implement, really. The basic and useful ones are

  • Add/Update documents esolr:add/1
  • Delete documents esolr:delete/1
  • Search esolr:search/2

Also, there are functions to perform commits to the index (to make all changes since the last commit available for searching) and to optimize the index (a time-consuming operation, see the Solr documentation). Besides issuing commit and optimize operations explicitly, the library also allows performing them periodically at user-defined intervals. In the case of commits, they can also be made to take place automatically after each add or delete operation (mainly useful for development rather than production code).

Quick start:

  1. Install Solr 1.2
  2. Run it with the sample configuration provided (/example$ java -jar start.jar)
  3. Make sure it is running correctly: open a browser at http://localhost:8983/solr/admin/
  4. Get esolr from the trapexit forum
  5. Look at the html API documentation
  6. Compile the sources (RFC4627.erl, from http://www.lshift.net/blog/2007/02/17/json-and-json-rpc-for-erlang is included)
  7. Start the esolr library esolr:start_link()
  8. Play around

To compile, open an erlang console in the directory where the .erl files reside, and type:

28> c(rfc4627).
{ok,rfc4627}
29> c(esolr).
{ok,esolr}

then start the esolr process, using default configuration:

30>esolr:start_link().
{ok,}

Add some documents. Here we are adding two documents, one in each call to esolr:add/1. The id and name fields are defined in the sample Solr schema, id is.. you know, the ID for the document.

31> esolr:add([{doc,[{id,"a"},{name,<<"Look me mom!, I'm searching now">>}]}]).
ok
32> esolr:add([{doc,[{id,"b"},{name,<<"Yes, searching from the erlang console">>}]}]).
ok

Commit the changes.

33>esolr:commit().
ok

Search. We search for the word “search” and specify that we want all the normal fields plus the document score for the query, that we want the results in ascending order by id, and that we want the matches highlighted for us.

34> esolr:search("search",[{fields,"*,score"},{sort,[{id,asc}]},{highlight,"name"}]).
{ok,[{"numFound",2},{"start",0},{"maxScore",0.880075}],
    [{doc,[{"id",<<"a">>},
           {"sku",<<"a">>},
           {"name",<<"Look me mom!, I'm searching now">>},
           {"popularity",0},
           {"timestamp",<<"2008-02-28T23:42:15.642Z">>},
           {"score",0.628625}]},
     {doc,[{"id",<<"b">>},
           {"sku",<<"b">>},
           {"name",<<"Yes, searching from the erlang console">>},
           {"popularity",0},
           {"timestamp",<<"2008-02-28T23:43:26.997Z">>},
           {"score",0.880075}]}],
    [{"highlighting",
      {obj,[{"a",
             {obj,[{"name",
                    [<<"Look me mom!, I'm <em>searching</em> now">>]}]}},
            {"b",
             {obj,[{"name",
                    [<<"Yes, <em>searching</em> from the erlang "...>>]}]}}]}}]}

Read the API docs to find all the functions/options implemented so far.

Have fun!

February 25, 2008

Parsing CSV in erlang

Filed under: erlang — ppolv @ 9:23 pm

So I need to parse a CSV file in erlang.

Although files in CSV format have a very simple structure, simply calling string:tokens(Line,",") for each line in the file won’t do the trick, as there could be quoted fields that span more than one line and contain commas or escaped quotes.
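
To see why, try it in the shell on a record with a quoted field (the values are made up):

1> string:tokens("\"Doe, John\",42", ",").
["\"Doe"," John\"","42"]

Three pieces instead of the expected two, and the quotes are still there.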

A detailed discussion of string parsing in erlang can be found in the excellent Parsing text and binary files with Erlang article by Joel Reymont. And the very first example is parsing a CSV file! But being the first example, it was written with simplicity rather than completeness in mind, so it didn’t take quoted or multi-line fields into account.

Now we will write a simple parser for RFC-4180 documents (which is way cooler than parsing plain old CSV files 😉 ). As the format is really simple, we won’t use yecc or leex, but will parse the input file by hand using binaries, lists and lots of pattern matching.

Our goals are

  • Recognize fields delimited by commas, records delimited by line breaks
  • Recognize quoted fields
  • Being able to parse quotes, commas and line breaks inside quoted fields
  • Ensure that all records have the same number of fields
  • Provide a fold-like callback interface, in addition to a return-all-records-in-file function

What the parser won’t do:

  • Unicode. We will treat the file as binary and consider each character as ASCII, 1 byte wide. To parse unicode files, you can use xmerl_ucs:from_utf8/1, and then process the resulting list instead of the raw binary

A quick look suggests that the parser will pass through the following states (the original post illustrates them with a “CSV parsing states” diagram):

  • Field start: at the beginning of each field. Whitespace should be considered part of an unquoted field, but any whitespace before a quoted field is discarded

  • Normal: inside an unquoted field

  • Quoted: inside a quoted field

  • Post quoted: after a quoted field. Whitespace can appear between a quoted field and the next field/record, and should be discarded

Parsing state

While parsing, we will use the following record to keep track of the current state

-record(ecsv,{
   state = field_start,  %%field_start|normal|quoted|post_quoted
   cols = undefined, %%how many fields per record
   current_field = [],
   current_record = [],
   fold_state,
   fold_fun  %%user supplied fold function
   }).

API functions

parse_file(FileName,InitialState,Fun) ->
   {ok, Binary} = file:read_file(FileName),
    parse(Binary,InitialState,Fun).
    
parse_file(FileName)  ->
   {ok, Binary} = file:read_file(FileName),
    parse(Binary).

parse(X) ->
   R = parse(X,[],fun(Fold,Record) -> [Record|Fold] end),
   lists:reverse(R).
		
parse(X,InitialState,Fun) ->
   do_parse(X,#ecsv{fold_state=InitialState,fold_fun = Fun}).

The three-argument functions provide the fold-like interface, while the single-argument one returns a list of all the records in the file.
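
For example, here are a couple of hedged usage sketches of the fold interface (the function names and file name are made up):

%% Count the records without building them all up in memory:
count_records(FileName) ->
    parse_file(FileName, 0, fun(Count,_Record) -> Count + 1 end).

%% Collect the first field of every record, in file order:
first_fields(FileName) ->
    lists:reverse(parse_file(FileName, [],
                             fun(Acc,Record) -> [element(1,Record)|Acc] end)).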

Parsing

Now the fun part!
The transitions (State x Input -> NewState) are almost 1:1 derived from the diagram, with minor changes (like the handling of field and record delimiters, which is common to both the normal and post_quoted states).
Inside a quoted field, a double quote must be escaped by preceding it with another double quote. It’s really easy to distinguish this case by matching against

<<$",$",_/binary>>

a sort of “lookahead”, in yacc’s lexicon.

 
%% --------- Field_start state ---------------------
%%whitespace, loop in field_start state
do_parse(<<32,Rest/binary>>,S = #ecsv{state=field_start,current_field=Field})->		
	do_parse(Rest,S#ecsv{current_field=[32|Field]});

%%its a quoted field, discard previous whitespaces		
do_parse(<<$",Rest/binary>>,S = #ecsv{state=field_start})->		
	do_parse(Rest,S#ecsv{state=quoted,current_field=[]});

%%anything else, is a unquoted field		
do_parse(Bin,S = #ecsv{state=field_start})->
	do_parse(Bin,S#ecsv{state=normal});	
		
		
%% --------- Quoted state ---------------------	
%%Escaped quote inside a quoted field	
do_parse(<<$",$",Rest/binary>>,S = #ecsv{state=quoted,current_field=Field})->
	do_parse(Rest,S#ecsv{current_field=[$"|Field]});		
	
%%End of quoted field
do_parse(<<$",Rest/binary>>,S = #ecsv{state=quoted})->
	do_parse(Rest,S#ecsv{state=post_quoted});
	
%%Anything else inside a quoted field
do_parse(<<X,Rest/binary>>,S = #ecsv{state=quoted,current_field=Field})->
	do_parse(Rest,S#ecsv{current_field=[X|Field]});
	
do_parse(<<>>, #ecsv{state=quoted})->	
	throw({ecsv_exception,unclosed_quote});
	
	
%% --------- Post_quoted state ---------------------		
%%consume whitespaces after a quoted field	
do_parse(<<32,Rest/binary>>,S = #ecsv{state=post_quoted})->	
	do_parse(Rest,S);


%%---------Comma and New line handling. ------------------
%%---------Common code for post_quoted and normal state---

%%EOF in a new line, return the records
do_parse(<<>>, #ecsv{current_record=[],fold_state=State})->	
	State;
%%EOF in the last line, add the last record and continue
do_parse(<<>>,S)->	
	do_parse(<<>>,new_record(S));

%% skip carriage return (windows files use CRLF)
do_parse(<<$\r,Rest/binary>>,S = #ecsv{})->
	do_parse(Rest,S);		
		
%% new record
do_parse(<<$\n,Rest/binary>>,S = #ecsv{}) ->
	do_parse(Rest,new_record(S));
	
do_parse(<<$, ,Rest/binary>>,S = #ecsv{current_field=Field,current_record=Record})->	
	do_parse(Rest,S#ecsv{state=field_start,
					  current_field=[],
					  current_record=[lists:reverse(Field)|Record]});


%%A double quote in any other place than the already managed is an error
do_parse(<<$",_Rest/binary>>, #ecsv{})->	
	throw({ecsv_exception,bad_record});
	
%%Anything other than whitespace or line ends in post_quoted state is an error
do_parse(<<_X,_Rest/binary>>, #ecsv{state=post_quoted})->
 	throw({ecsv_exception,bad_record});

%%Accumulate Field value
do_parse(<<X,Rest/binary>>,S = #ecsv{state=normal,current_field=Field})->
	do_parse(Rest,S#ecsv{current_field=[X|Field]}).

Record assembly and callback

Convert each record to a tuple, and check that it has the same number of fields as the previous records. Then invoke the callback function with the new record and the previous fold state.

%%check the record size against the previous one, and update the state.
new_record(S=#ecsv{cols=Cols,current_field=Field,current_record=Record,fold_state=State,fold_fun=Fun}) ->
	NewRecord = list_to_tuple(lists:reverse([lists:reverse(Field)|Record])),
	if
		(tuple_size(NewRecord) =:= Cols) or (Cols =:= undefined) ->
			NewState = Fun(State,NewRecord),
			S#ecsv{state=field_start,cols=tuple_size(NewRecord),
					current_record=[],current_field=[],fold_state=NewState};
		
		(tuple_size(NewRecord) =/= Cols) ->
			throw({ecsv_exception,bad_record_size})
	end.

Final notes

We used a single function, do_parse/2, with many clauses to do the parsing. In a more complex scenario, you will probably use different functions for different sections of the grammar you are parsing. You could also first tokenize the input and then parse the resulting token stream; this can make your work simpler even if you aren’t using a parser generator like yecc (this is the approach I’m using to parse ldap filters).

February 15, 2008

erlang, ssl and asn1

Filed under: erlang — ppolv @ 3:44 pm

I’ve been playing with erlang and asn1, implementing an ldap plugin for the tsung load testing tool. Erlang comes with nice support for working with ber-encoded asn1 data; particularly handy is its ability to recognize packet boundaries and deliver network data one asn1 packet at a time, rather than as a raw byte stream.

One of the extended operations defined in the ldap protocol, the startTLS command, allows the use of an unencrypted, plain tcp socket to connect to the server and later “upgrade” the same connection to use ssl. To implement this in erlang, the way to go is the new ssl module, since it is capable of establishing an ssl session over an already connected gen_tcp socket, something that previous OTP versions couldn’t do. Sadly for me, this new ssl module doesn’t seem to be able to recognize asn1 packets yet. Luckily, the buffering code required is very simple to implement in erlang.

Here is the code I’m using:

%%The buffer consist of the data received and the length of the current packet,
%%undefined if the length is still unknown.
-record(asn1_packet_state,    {
    length = undefined,
    buffer = <<>>
}).

%%The push function simply appends the data to the end of the buffer.
push(<<>>,S) ->
    S;
push(Data,S =#asn1_packet_state{buffer = B}) ->
    S#asn1_packet_state{buffer = <<B/binary,Data/binary>>}.

%% Try to extract a packet from the buffer, if the length is unknown, calculate it first.
get_packet(S = #asn1_packet_state{length=undefined,buffer= <<>>}) ->
    {none,S};
get_packet(S = #asn1_packet_state{length=undefined,buffer=Buffer}) ->
    case packet_length(Buffer) of
        {ok,Length} -> extract_packet(S#asn1_packet_state{length=Length});
         not_enough_data -> {none,S}
    end;
get_packet(S) -> extract_packet(S).

%% Extract the packet if there is enough data available.
extract_packet(#asn1_packet_state{length=N,buffer=Buffer}) when (size(Buffer) >= N) ->
    <<Packet:N/binary,Rest/binary>> = Buffer,
    {packet,Packet,#asn1_packet_state{length=undefined,buffer=Rest}};

extract_packet(S) when is_record(S,asn1_packet_state) -> {none,S}.

%%Extract the packet size from the packet header.
packet_length(Buffer) ->
    try asn1rt_ber_bin:decode_tag_and_length(Buffer) of
       {Tag, Len,_Rest,RemovedBytes} ->  {ok,Len+RemovedBytes}
    catch
        _Type:_Error ->
            if
                (size(Buffer) > ?MAX_HEADER) -> throw({invalid_packet,Buffer});
                true -> not_enough_data  %%incomplete header
            end
    end.

So whenever you get data from the network, you push/2 it into the buffer. Then you can get_packet/1 from the buffer, keeping in mind that there could be no complete packet yet, or that more than one packet could be present in the buffer; get_packet/1 will return either {none,Buffer} or {packet,Packet,Buffer}.
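
As a hedged sketch of how the buffer could be driven from an active ssl socket (the ?MAX_HEADER macro used by packet_length/1 also needs a define somewhere; the value, the loop and the handle_asn1_packet/1 callback below are assumptions of mine, not code from the post):

-define(MAX_HEADER, 8).  %% assumed: give up if no valid BER header appears within 8 bytes

%% Started as recv_loop(Socket, #asn1_packet_state{}).
%% Push incoming data, then deliver every complete packet found in the buffer.
recv_loop(Socket, Buffer0) ->
    receive
        {ssl, Socket, Data} ->
            Buffer1 = deliver_packets(push(Data, Buffer0)),
            recv_loop(Socket, Buffer1);
        {ssl_closed, Socket} ->
            ok
    end.

deliver_packets(Buffer0) ->
    case get_packet(Buffer0) of
        {none, Buffer1} ->
            Buffer1;
        {packet, Packet, Buffer1} ->
            handle_asn1_packet(Packet),   %% hypothetical handler for one asn1 packet
            deliver_packets(Buffer1)
    end.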

February 4, 2008

It isn’t performance, it’s scalability

Filed under: random — ppolv @ 8:09 pm

Recently I’ve been reading about cache technologies, memcached in particular. The architecture behind memcached is what caught my attention: it’s so clear and simple. Beautiful.

Then I came across this performance comparison between memcached and ehcache (almost a year old, but I just read it, sorry). At first I was somewhat surprised and, to be honest, disappointed. I thought memcached would compare better than that… but… wait…
The author of the article has experience in the world of java caching (actually, he is the maintainer of ehcache), but I think the benchmark isn’t really fair. While the memcached setup is already a distributed one, the ehcache setup isn’t. While the memcached setup is ready to hold TERABYTES of data, the ehcache setup isn’t.

I’m not saying that ehcache can’t be distributed; what I’m saying is that the costs associated with distributing ehcache aren’t represented in the benchmark.

Anyway, the point isn’t about performance; it’s scalability that matters! And that’s why you should always read benchmarks carefully.
If my web application isn’t a big, heavily-accessed one (like the intranet applications I’m used to working with), then an entirely in-process caching strategy is fine, and easy to implement and understand too.
But for high-load scenarios, with many web servers involved, benchmarks like this one don’t apply.

Update: Unfair Benchmarks of Ehcache vs Memcached is another response to the original article.
