ppolv’s blog

May 9, 2008

fun with mochiweb’s html parser and xpath

Filed under: erlang — ppolv @ 11:07 pm

Some days ago, while reading Roberto Saccon’s blog, I noticed that mochiweb has an html parser.

So a couple of days later, here I am giving it a try.
The structure of the generated tree is quite simple (I’m only interested in text and element nodes).
A text node is represented as an erlang binary(). An element node is a 3-element tuple {Tag,Attrs,Contents}, where Tag is the element’s tag, Attrs is a list of {Name,Value} pairs and Contents is a list of the element’s contents.
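
For instance, the fragment <p class="intro">Hello <b>world</b></p> would be parsed into roughly the following term (written by hand here just to illustrate the shape, not captured parser output):

{<<"p">>, [{<<"class">>, <<"intro">>}],
 [<<"Hello ">>,
  {<<"b">>, [], [<<"world">>]}]}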

In contrast with xmerl, the input document is never converted to string(). Text, tags, attribute names and values generated by mochiweb’s parser are all binary(), which I guess helps with both parsing speed and memory consumption.

My final goal with the parser would be to use it inside tsung http benchmarks. Currently you have to use regular expressions to refer to elements of the page; I’d prefer to use XPath instead. Also, the regexp module can become painfully slow when you are working with multiple expressions, so using an html parser + XPath makes sense, as the cost of parsing the html is amortized across all the xpath expressions.

I wrote a simple XPath interpreter that works over mochiweb’s html tree. Currently it’s capable of executing xpath expressions with the following (a small usage sketch follows the list):

  • self, child, descendant_or_self and attribute axes
  • predicates
  • some predefined functions (count/1, starts-with/2, …)
  • user-defined functions
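
A quick usage sketch (example/1 is just for illustration; execute/2 takes the xpath expression as a string plus the parsed tree, and attribute queries come back as a list of binaries, as in the scraping example below):

%% illustrative only: axes, predicates and a predefined function
example(Body) ->
    Tree = mochiweb_html:parse(Body),
    %% descendant axis plus attribute access
    Srcs = mochiweb_xpath:execute("//img/@src", Tree),
    %% a predicate combined with a predefined function (starts-with/2)
    Ext  = mochiweb_xpath:execute("//a[starts-with(@href,'http://')]/@href", Tree),
    {Srcs, Ext}.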

There is a .tar.gz file in the mochiweb group containing the xpath code.

Example: simple HTML Screen Scraping in Erlang

To demonstrate how to use the parser/xpath combination, we’ll write a function that, given a URL pointing to a web page, returns a summary of how much the page weighs, in the spirit of Web Page Analyzer. For simplicity, we’ll only consider

  • the size of the page itself
  • the size of the images
  • the size of the external scripts files
  • the size of the external css files

so we won’t be searching for things like flash or other types of embedded objects.
Also, we’ll be working on the raw html page as returned by the server, so if the page contains javascript code that adds images to the page, we won’t be aware of that.

The steps are:

  1. Request the desired page using http
  2. Parse the body of the response using mochiweb’s parser
  3. Using XPath, search for the urls of all the images, external scripts and css files contained in the page
  4. Request the size of each of the documents at the urls obtained in the previous step

Since this is an erlang example, there should be some concurrency involved, shouldn’t there? So at step 4, instead of sequentially scanning all the urls, we will spawn a new process for each one and make all the requests in parallel. Nice and simple.

Let’s start with our main function:


page_info(URL) ->
    case http:request(URL) of
        {ok,{_,Headers,Body}} ->
            got_page_info(URL,content_length(Headers),Body);
        {error,Reason} -> 
            {error,Reason}
    end.

It makes an http request to the given URL and, if everything goes well, passes the results on to got_page_info/3.
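
The code below also uses a #state record and a ?TIMEOUT macro that I don’t show elsewhere; a minimal version consistent with how they are used (the field defaults and the 10-second value are just assumptions) would be:

%% not shown in the rest of the post: definitions assumed from their usage below
-define(TIMEOUT, 10000).      % ms to wait for the bots; the value is arbitrary

-record(state, {page,         % size of the html page itself, in bytes
                img = 0,      % accumulated size of images
                css = 0,      % accumulated size of stylesheets
                script = 0,   % accumulated size of scripts
                errors = [],  % [{URL,Reason}] for failed requests
                timer}).      % timer reference from erlang:send_after/3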


got_page_info(URL, PageSize,Body) ->
    Tree = mochiweb_html:parse(Body),
    Imgs = remove_duplicates(mochiweb_xpath:execute("//img/@src",Tree)),
    Css = remove_duplicates(mochiweb_xpath:execute("//link[@rel='stylesheet']/@href",Tree)),
    Scripts = remove_duplicates(mochiweb_xpath:execute("//script/@src",Tree)),
  
    URLCtx = url_context(URL),
    spawn_workers(URLCtx,img,lists:map(fun binary_to_list/1,Imgs)),
    spawn_workers(URLCtx,css,lists:map(fun binary_to_list/1,Css)),
    spawn_workers(URLCtx,script,lists:map(fun binary_to_list/1,Scripts)),
    
    TRef = erlang:send_after(?TIMEOUT,self(),timeout),
    State = #state{page=PageSize,
                  timer=TRef,
                  errors=[],
                  img=0,
                  css=0,
                  script=0},
    wait_for_responses(State,length(Imgs) + length(Css) + length(Scripts)).

Now we have reached the interesting part. First, parse the http body (mochiweb_html:parse/1), then search the document using xpath (mochiweb_xpath:execute/2). With those results in hand, spawn all the bots that will be in charge of querying the urls (spawn_workers/3). Then sit and wait (wait_for_responses/2) until all the bots have finished or the timeout expires.

We remove duplicates, since the same image could be used multiple times in the same page and we want to count it only once. URLCtx will be used by the bots to resolve relative and absolute urls.


wait_for_responses(State,0) ->
    finish(State,0);

wait_for_responses(State,Count) ->
    receive
        {component,Type,URL,Info} ->
            NewState = add_info(State,Type,URL,Info),
            wait_for_responses(NewState,Count-1);
        timeout ->
            finish(State,Count)
    end.

The protocol between the master and the bots is simple:
each bot will report its findings by sending a message back to the originating process, and then die.

In the master process, if the count of remaining bots is 0 then all bots have finished, and we only need to return the gathered info. If not, we wait until some message arrives.
We are interested in two kinds of messages:

  • a message from a worker process. We update the state and loop again.
  • a timeout. The timeout has expired, so we don’t wait any longer for the remaining bots to complete.

For the purposes of this exercise, this is enough. In a real application we would probably want to do some additional work, like watching for crashes in the bots that we’ve spawned. We would also need to kill all the remaining processes when the timeout expires.
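
For instance, one way (just a sketch, not part of the code in this post; spawn_worker_monitored/3 is a hypothetical helper) would be to spawn each bot with spawn_monitor/1, so a crash is reported as an error instead of silently eating into the timeout:

%% sketch only: spawn each bot with a monitor so a crash produces a
%% 'DOWN' message that the master can treat as an error result.
spawn_worker_monitored(URLCtx, Type, URL) ->
    Master = self(),
    spawn_monitor(fun() ->
        Master ! {component, Type, URL, get_component_info(URLCtx, URL)}
    end).

%% wait_for_responses/2 would then need extra clauses along these lines:
%%   {'DOWN', _Ref, process, _Pid, normal} ->
%%       wait_for_responses(State, Count);               % bot already reported
%%   {'DOWN', _Ref, process, _Pid, Reason} ->
%%       wait_for_responses(add_info(State, other, unknown, {error, Reason}),
%%                          Count - 1);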


add_info(State = #state{css=Css},css,_URL,{ok,Info}) ->
    State#state{css = Css + Info};
add_info(State = #state{img=Img},img,_URL,{ok,Info}) ->
    State#state{img = Img + Info};
add_info(State = #state{script=Script},script,_URL,{ok,Info}) ->
    State#state{script = Script + Info};
add_info(State = #state{errors=Errors},_Type,URL,{error,Reason}) ->
    State#state{errors=[{URL,Reason}|Errors]}. 

This is self-explanatory: update the current state with the new info, counting how many bytes are in images, scripts and css files.


spawn_workers(URLCtx,Type,URLs) ->
    Master = self(),
    lists:foreach(fun(URL) -> 
      spawn(fun()-> Master ! {component,Type,URL,get_component_info(URLCtx,URL)} end)
    end,URLs).

The code above is all we need to create the bot processes: one bot per URL, and each bot sends a message back to the master with the result of evaluating get_component_info/2.


get_component_info(URLCtx,URL) ->
    FullURL = full_url(URLCtx,URL),
    case http:request(head,{FullURL,[]},[],[]) of
        {ok, {_,Headers,_Body}} ->
            {ok,content_length(Headers)};
        {error,Reason} -> 
            {error,Reason}
    end.

And the rest, mainly utility functions:


remove_duplicates(L) ->
    sets:to_list(sets:from_list(L)).

% extract content-length from the http headers
content_length(Headers) ->
    list_to_integer(proplists:get_value("content-length",Headers,"0")).

%% absolute url on the same server, e.g. /img/image.png
full_url({Root,_Context},ComponentUrl=[$/|_]) ->
    Root ++ ComponentUrl;

%% full url, e.g. http://other.com/img.png
full_url({_Root,_Context},ComponentUrl="http://"++_) ->
    ComponentUrl;

% everything else is considered a relative path.. obviously it's wrong for things like ../img
full_url({Root,Context},ComponentUrl) ->
    Root ++ Context ++ "/" ++ ComponentUrl.

% returns the domain and current context path, e.g.
% url_context("http://www.some.domain.com/content/index.html")
%      -> {"http://www.some.domain.com", "/content"}
url_context(URL) ->
    {http,_,Root,_Port,Path,_Query} = http_uri:parse(URL),
    Ctx = string:sub_string(Path,1, string:rstr(Path,"/")),
    {"http://"++Root,Ctx}.
    
% cancel the timeout timer and return results as a tuple
finish(State,Remaining) ->
    #state{page=PageSize,
           img=ImgSize,
           css=CssSize,
           script=ScriptSize,
           errors=Errors,
           timer=TRef} = State,
    erlang:cancel_timer(TRef),
    {PageSize,ImgSize,CssSize,ScriptSize,Errors,Remaining}.

% pretty print
report({PageSize,ImgSize,CssSize,ScriptSize,Errors,Missing}) ->
    io:format("html size: ~.2fkb~n",[PageSize/1024]),
    io:format("images:    ~.2fkb~n",[ImgSize/1024]),
    io:format("stylesheets:~.2fkb~n",[CssSize/1024]),
    io:format("scripts:   ~.2fkb~n",[ScriptSize/1024]),
    Total = PageSize + ImgSize + CssSize + ScriptSize,
    io:format("~nTotal:     ~.2fkb~n",[Total/1024]),
    lists:foreach(fun({URL,Code}) -> io:format("~s error!: ~p~n",[URL,Code]) end, Errors),
    case Missing of
        0 -> ok;
        Missing  -> io:format("Timeouts: ~b~n",[Missing])
    end.

That’s it. Now a sample erlang session to see our code in action 🙂

51> L = page_tester:page_info("https://ppolv.wordpress.com").
{39077,38799,4672,20756,[],0}
52> page_tester:report(L).
HTML Size: 38.16kb
Images: 37.89kb
Stylesheets:4.56kb
Scripts: 20.27kb

Total: 100.88kb
ok
53>

Easy, isn’t it?

p.s.
It looks like I finally learned how to easily embed erlang code in wordpress, taking some ideas from code-syntax-highlighting-for-blogs-with-vim!

Comments

  1. […] through a post on pplov’s blog we came to know that he had contributed an xpath parser for mochiweb which can be downloaded at the […]

    Pingback by developers.hover.in » intern challenge - 1 — May 24, 2009 @ 3:31 pm

  2. Do you plan to implement css3 selectors over mochiweb_html?

    Comment by edbond — June 21, 2009 @ 8:20 am

  3. Hi Pablo,

    Great work on the mochiweb_xpath stuff.

    I ran into a bug in the core mochiweb_html parser, as documented here

    http://groups.google.com/group/mochiweb/browse_thread/thread/a4e01aba5fb7bb66/0a51eaffca8ac9ee?lnk=raot

    Assuming you’re the owner of mochiweb_xpath, would it be possible to update the distribution to include the latest version of mochiweb_html [or even get Bob and co to include it in the mochiweb dist ?]

    Regards,

    Comment by justin — September 14, 2009 @ 10:04 am

  4. Saccon’s post link is dead:

    http://www.rsaccon.com/2007/11/mochiweb-got-html-parser.html

    Comment by Elena — December 23, 2009 @ 12:59 pm

  5. Also #state definition is missing. Thanks.

    Comment by Elena — December 23, 2009 @ 1:01 pm

  6. This is awesome!! You should put this up on GitHub and make this into a full-fledged Erlang module!

    Comment by Carlo Cabanilla — January 23, 2010 @ 8:34 pm

  7. Anyway to request pages without httpc? Chicago boss doesn’t support inets

    Comment by drew — November 28, 2011 @ 1:02 am

    • Chicago does support httpc so nevermind this

      Comment by drew — November 28, 2011 @ 1:37 am

  8. […] other people are interested in the same use case, so you can learn from […]

    Pingback by Is Erlang the right choice for a webcrawler? - MicroEducate — March 2, 2022 @ 8:44 pm

