HTML Scraping Using JavaScript (For Google Gadgets)


Over the past few months, people here at eSpace have been developing Google gadgets. Many of these gadgets depend on data gathered from other websites that offer no XML or RSS feed exposing that data directly. Since this is a problem we will keep running into, we started looking for a more generic solution that could be used in any gadget depending on such a data source.

 

The solution we needed boiled down to one of these three options:

  1. Using a scraping service such as Dapper or Yahoo Pipes to do the scraping on our behalf and return a well-formed XML file to use in any gadget
  2. Creating a Google App Engine application that we call; it scrapes the data and returns XML to us
  3. Using JS to scrape HTML pages directly

The first and second solutions may seem the same, and in practice they are, except that Dapper isn't that reliable, as it sometimes fails under heavy load. Google App Engine, on the other hand, has proven to hold up under high request rates. Still, I liked the third solution and decided to give it a try and see whether it would get the job done. I thought that scraping HTML using JS would be easy to do in any Google gadget, but I was proven wrong. I will summarize the trials I made here, from the ones that failed to the last one, which worked.

  1. Depending on the Google API method "_IG_FetchXmlContent". This failed right away because the method expects an XML document and was handed an HTML page instead. It gave me a parse error on the DOCTYPE line. The result is FAILURE.
  2. Depending on the Google API method "_IG_FetchContent". This returned the HTML as is, and it was then time to parse it using the DOM parsers built into browsers. I tried doing so in Firefox, but also got a parse error, because the page is an HTML document rather than an XML one and the available parsers only accept XML. The result is FAILURE.
  3. Repeating step 2, but after using a regular expression to keep only the inner HTML of the body tag. The DOM parser failed on one of the comment lines in the HTML page. Such comments may appear in many pages, so this isn't generic enough of a solution to be accepted. The result is FAILURE.
  4. Using a regular expression to get the body's inner HTML, adding it to a hidden div, and then using normal JS methods to traverse the DOM nodes with this div as the root. The result is SUCCESS.
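
The body-extraction step used in trials 3 and 4 can be tried on its own. What follows is only a minimal sketch, not the gadget code itself: the helper name extractBodyHtml and the sample page are made up for illustration, and the pattern assumes the page has a single body element. Because the regular expression treats the page as plain text, the DOCTYPE line and any HTML comments do not get in its way.

```javascript
// Hypothetical helper (not part of the gadget API): returns the markup
// between <body ...> and </body>, or null when no body element is found.
function extractBodyHtml(html) {
  // [\s\S]* matches any character, including line breaks.
  var match = /<body[^>]*>([\s\S]*)<\/body>/i.exec(html);
  return match ? match[1] : null;
}

// Made-up sample page with a DOCTYPE and a comment inside the body.
var samplePage =
  '<!DOCTYPE html>\n' +
  '<html><head><title>Demo</title></head>\n' +
  '<body class="home">\n' +
  '<!-- a comment that an XML parser might reject -->\n' +
  '<p class="main">Hello</p>\n' +
  '</body></html>';

// Holds only what was between the body tags, comment included.
var bodyHtml = extractBodyHtml(samplePage);
```

The extracted string is what later gets assigned to the hidden div's innerHTML, where the browser's forgiving HTML parser takes over.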

Since the fourth trial was successful, I made a generic method that anyone can use in their gadget. This simple method just fetches the HTML and scrapes it using a scraping function you supply. To understand what I mean, have a look at the function definition first:


scrapeHTMLBody = function(url, dataHolderId, scrapeFunction){}

As the definition shows, the function takes three parameters:

  • url: the URL to retrieve the HTML from
  • dataHolderId: the id of the hidden div that the retrieved HTML will be added to
  • scrapeFunction: a function that takes the id of the hidden div, treats that div as the root element, and uses JS to get the desired data (everyone should write their own according to what they want to retrieve).

This is the implementation of it:

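// Fetch the page at `url` as text and hand it to operate().
var scrapeHTMLBody = function(url, dataHolderId, scrapeFunction){
  _IG_FetchContent(url, function(responseText){ operate(responseText, dataHolderId, scrapeFunction); });
};

// Copy the inner HTML of <body> into the hidden div, then run the scrape function.
var operate = function(responseText, dataHolderId, scrapeFunction){
  // [\s\S]* matches across line breaks; the i flag also accepts <BODY>.
  var body = /<body[^>]*>([\s\S]*)<\/body>/i.exec(responseText);
  if (!body) return; // no body element found, nothing to scrape
  _gel(dataHolderId).innerHTML = body[1];
  scrapeFunction(dataHolderId);
};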


These two functions fetch the HTML page, extract the inner HTML of the body, and then call the scrape function, passing it the id of the hidden div that now contains the body's HTML.
It is now your responsibility to write the scraping function you need, treating this div as the root of your DOM tree.
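
To see how a call fits together, here is a rough sketch that can be run outside a gadget. It rests on several assumptions: _IG_FetchContent and _gel are replaced by simple stand-ins (inside a real gadget the container provides them, so you would drop the stand-ins), and the fake page, the holder object, the 'dataHolder' id, the example.com URL and the countParagraphs callback are all made up for illustration.

```javascript
// Stand-ins for the gadget API so the sketch runs outside a gadget container.
// In a real gadget these are provided for you; do not redefine them there.
var fakePage = '<html><body><p>one</p><p>two</p></body></html>';
var holder = { innerHTML: '' }; // plays the role of the hidden div
var _IG_FetchContent = function (url, callback) { callback(fakePage); };
var _gel = function (id) { return id === 'dataHolder' ? holder : null; };

// Same flow as the two functions described above.
var operate = function (responseText, dataHolderId, scrapeFunction) {
  var body = /<body[^>]*>([\s\S]*)<\/body>/i.exec(responseText);
  if (!body) { return; } // no body element, nothing to scrape
  _gel(dataHolderId).innerHTML = body[1];
  scrapeFunction(dataHolderId);
};
var scrapeHTMLBody = function (url, dataHolderId, scrapeFunction) {
  _IG_FetchContent(url, function (responseText) {
    operate(responseText, dataHolderId, scrapeFunction);
  });
};

// Hypothetical scrape callback: counts the <p> tags now held by the holder.
var paragraphCount = -1;
var countParagraphs = function (dataHolderId) {
  var markup = _gel(dataHolderId).innerHTML;
  paragraphCount = (markup.match(/<p[\s>]/g) || []).length;
};

// The only line you would keep in a real gadget (with your own URL and id).
scrapeHTMLBody('http://example.com/page.html', 'dataHolder', countParagraphs);
```

In a real gadget the holder would be an actual hidden div in the gadget markup, and the callback would walk its DOM nodes rather than count tags in a string.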

This is an example of a scraping function I defined:

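var scrape = function(dataHolderId){
  // Elements with class "main" inside the hidden div
  var elements = _gel(dataHolderId).getElementsByClassName('main');
  var noktas = [];
  var num = elements.length;
  var i;
  // Take every second element and keep the HTML of its first child
  for(i=0 ; i<num ; i+=2) noktas.push(elements[i].childNodes[0].innerHTML);
  // Show each collected item as a paragraph in the gadget
  for(i=0 ; i<noktas.length ; i++){
    var e = document.createElement('p');
    e.innerHTML = noktas[i];
    document.body.appendChild(e);
  }
};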

That's it! You should now be ready to use these two functions in any gadget whose data source needs to be scraped.
An advantage of this method is that all the parsing is done on the client machine rather than on a separate scraping server.
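
One possible variation, sketched here under assumptions rather than taken from the gadget as published: instead of passing the id of a hidden div that already exists in the page, the function can create a div that is never attached to the document and hand that element straight to the scrape function. The name scrapeHTMLBodyDetached and the optional doc parameter are hypothetical; doc exists only so the function can be exercised outside a browser, and _IG_FetchContent is still expected to come from the gadget container.

```javascript
// Variant sketch: no holder id; the callback receives a detached <div>.
// `doc` is a hypothetical extra parameter (defaults to the page's document).
var scrapeHTMLBodyDetached = function (url, scrapeFunction, doc) {
  doc = doc || document;
  // _IG_FetchContent is provided by the gadget container.
  _IG_FetchContent(url, function (responseText) {
    var body = /<body[^>]*>([\s\S]*)<\/body>/i.exec(responseText);
    if (!body) { return; }               // no body element, nothing to scrape
    var root = doc.createElement('div'); // never attached to the page
    root.innerHTML = body[1];
    scrapeFunction(root);                // callback gets the element itself
  });
};
```

With this variant the scrape function works on the element directly, for example calling root.getElementsByClassName('main') instead of looking the div up by id first.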

Written By:

Osama Breka (http://bionuc-tech.blogspot.com/)

Comments

1

That's a handy piece of code! Well done! I just wanted to know if there is a reason why the div should be embedded in the existing document. I mean, why not just create a div, set its inner HTML, and return it in <code>scrapeHTMLBody</code> instead of passing an id?

2

I agree with you totally that it is unnecessary and that the div should be created without passing a div id.

I have already changed the function to do so, but it's my fault that I didn't update the code here.

But anyway, thanks for your nice comment, and I hope this function is useful.

3

Have you tried biterscripting for web page scraping or harvesting ? There are several scripts posted on the web for extracting various things from web pages. The one I like is at http://www.biterscripting.com/SS_WebPageToText.html . It extracts plain text out of a web page. To try, download biterscripting free at http://www.biterscripting.com . Install all their sample scripts using the following command

script "http://www.biterscripting.com/Download/SS_AllSamples.txt"

Then run the command

script SS_WebPageToText.txt page("http://www.espace.com.eg/blog/2009/04/05/html-scrapping-using-javascript-for-google-gadgets")

The above command will scrape this web page itself. I just tried it on my computer.

Randi
