Nokogiri vs Hpricot

I ran into a performance problem while using the Mechanize library to scrape an HTML page.

After fetching the page:

require 'mechanize'
@agent = WWW::Mechanize.new
# The URL must be passed as a quoted string
@page = @agent.get("http://www.webometrics.info/top4000.asp")
#.....
 

I loop 50 times to extract the data I need; in each iteration I use the search() method as follows:

50.times do |i|
  institute = Institute.new
  institute.name = @page.search("/table/tr[#{i}]/td[2]/a").inner_html
end
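One detail worth checking in a loop like this: XPath positions start at 1, while `50.times` counts from 0, so `tr[0]` matches nothing and row 50 is never reached. A small sketch of building the expressions for rows 1 through 50, assuming that is the intended range (`row_paths` is a name made up for illustration):

```ruby
# XPath positions are 1-based, so rows 1..50 are addressed with 1.upto(50).
# The expression template is the one used in the post; `row_paths` is a
# hypothetical name introduced only for this sketch.
row_paths = 1.upto(50).map { |i| "/table/tr[#{i}]/td[2]/a" }

puts row_paths.first
puts row_paths.last
```

Each of these strings could then be passed to search() in place of the interpolated one.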

Since the loop seemed to take a long time, I printed the time before and after it to see exactly how long it takes:

puts Time.now   # timestamp before the loop
50.times do |i|
  institute = Institute.new
  institute.name = @page.search("/table/tr[#{i}]/td[2]/a").inner_html
end
puts Time.now   # timestamp after the loop

And this is what I got:

Thu Dec 18 13:09:16 +0200 2008
Thu Dec 18 13:09:28 +0200 2008

The loop took 12 seconds! Looking for a way to make it faster, I found that Mechanize uses the Hpricot library to parse HTML. I then looked for a library that parses faster than Hpricot and found an Hpricot vs Nokogiri benchmark showing that Nokogiri is much faster at searching by XPath. So I gave it a try, and the results were surprising.

All I needed to do to make it work in my code was to add the following:


require 'nokogiri'   # the library name is lowercase in require
WWW::Mechanize.html_parser = Nokogiri::HTML
@agent = WWW::Mechanize.new
#.........
puts Time.now   # timestamp before the loop
50.times do |i|
  institute = Institute.new
  institute.name = @page.search("/table/tr[#{i}]/td[2]/a").inner_html
end
puts Time.now   # timestamp after the loop

Running that gave the following:

Thu Dec 18 13:19:20 +0200 2008
Thu Dec 18 13:19:20 +0200 2008

That means it took less than a second!
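Because both timestamps fall within the same second, the printed times cannot say how far below one second the loop ran. A minimal sketch of a finer measurement using Ruby's standard Benchmark module follows; the `time_search` helper and the stand-in workload are assumptions for illustration, not part of my original code:

```ruby
require 'benchmark'

# Hypothetical helper: runs the given block `iterations` times and
# returns the total elapsed wall-clock time in seconds as a Float.
def time_search(iterations)
  Benchmark.realtime do
    iterations.times { |i| yield i }
  end
end

# Stand-in workload (an assumption); in the real script this would be
# the @page.search(...) call from the loop.
elapsed = time_search(50) { |i| "/table/tr[#{i}]/td[2]/a".length }

puts format('50 searches took %.3f seconds', elapsed)
```

Running the same helper once with each parser would give two directly comparable numbers.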

That showed how much faster Nokogiri is than Hpricot when it comes to searching by XPath.

Written By:

Alfred Nagy

Comments

1

In this part

50.times do |i|
  institute = Institute.new
  institute.name = @page.search("/table/tr[#{i}]/td[2]/a").inner_html
end

you are searching _50_ times, but I believe you can optimize this by searching only once and getting the list.

2

Actually, I just used "50.times" to show that my code really did run 50 iterations. The point here is not about optimizing how many times I search; it's about comparing the behaviour of Nokogiri vs Hpricot in that situation.

3

What versions of Hpricot / Nokogiri?

4

When I wrote this blog post I was using Hpricot 0.6 and Nokogiri 1.2.3.
I am not sure whether this comparison still holds for newer versions.
