HTML Scraping Using JavaScript (For Google Gadgets)
Over the past few months, people here at eSpace have been developing Google gadgets. Many of these gadgets depend on data gathered from other websites that offer no XML or RSS feed exposing this data directly. Since we will face this problem every now and then, we started thinking about a more generic solution to use in any gadget that depends on such a data source.
The solution we needed boiled down to one of these three:
- Using a scraping service such as Dapper or Yahoo Pipes to do the scraping on our behalf and return a well-formed XML file to use in any gadget
- Creating a Google App Engine application that we call; it scrapes the data and returns XML to us
- Using JS to scrape HTML pages
The first and second solutions may seem the same, and they essentially are, except that Dapper isn't that reliable: it sometimes fails under heavy load. Google App Engine, on the other hand, has proven to survive high request rates. Anyway, I liked the third solution and decided to give it a try and see whether it would get the job done. I thought that scraping HTML using JS would be easy to do in any Google gadget, but I was proven wrong. I will summarize the trials I made here, from those that failed to the last one, which worked.
- Depending on the Google API method "_IG_FetchXmlContent". This failed right away because the method expects an XML document and was given an HTML page; it gave me a parse error on the doctype line. The result is FAILURE.
- Depending on the Google API method "_IG_FetchContent". This returned the HTML as is, and it was then time to parse it using the DOM parsers already built into browsers. I tried this in Firefox but also got a parse error, because the document is HTML rather than XML and the available parsers expect only XML. The result is FAILURE.
- Repeating the second trial, but after using a regular expression to take only the inner HTML of the body tag. The DOM parser failed on one of the comment lines in the HTML page, and comments may appear in many pages, so this isn't generic enough to be accepted. The result is FAILURE.
- Using a regular expression to get the body's inner HTML, adding it to a hidden div, and then using normal JS methods for traversing DOM nodes with this div as the root. The result is SUCCESS.
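The regular-expression step shared by the third and fourth trials can be sketched as a small pure function. The name `extractBodyInnerHtml` and the sample page are assumptions made for illustration; only the idea of capturing the text between the body tags comes from the trials above.

```javascript
// Hypothetical helper: return the markup between <body ...> and </body>,
// or null when the page has no body element.
function extractBodyInnerHtml(html) {
  // [^>]* skips any attributes on the opening tag; [\s\S]* also matches newlines.
  var match = /<body[^>]*>([\s\S]*)<\/body>/i.exec(html);
  return match ? match[1] : null;
}

// Made-up sample page with a doctype and a comment, the two things that
// tripped the XML parsers in the failed trials.
var samplePage =
  '<!DOCTYPE html>\n' +
  '<html><head><title>Sample</title></head>\n' +
  '<body class="home">\n' +
  '<!-- a comment -->\n' +
  '<p>Hello</p>\n' +
  '</body></html>';

console.log(extractBodyInnerHtml(samplePage));
```

The captured string still contains the comment. In the fourth trial it is handed to the browser's HTML parser through a hidden div rather than to an XML parser, which is the difference between the third and fourth trials.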
Since the fourth trial was successful, I made a generic method that anyone can use in their gadget. This simple method just gets the HTML and scrapes it using the scraping function you supply. To understand what I mean, have a look at the function definition first:
scrapeHTMLBody = function(url, dataHolderId, scrapeFunction){
As the definition shows, the function takes three parameters:
- url: the URL to retrieve the HTML from
- dataHolderId: the id of the hidden div that the retrieved HTML will be added to
- scrapeFunction: a function that takes the id of the hidden div, treats that div as the root element, and uses JS to get the desired data (everyone should write their own according to what they want to retrieve).
This is the implementation of it:
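scrapeHTMLBody = function(url, dataHolderId, scrapeFunction){
  // Body reconstructed (lost in the original): fetch with _IG_FetchContent.
  _IG_FetchContent(url, function(responseText){
    operate(responseText, dataHolderId, scrapeFunction);
  });
}
operate = function(responseText, dataHolderId, scrapeFunction){
  var body = /<body[^>]*>([\s\S]*)<\/body>/i.exec(responseText);
  if (!body) return; // no <body> element found
  var bodyData = body[1];
  // Reconstructed step: let the browser parse the markup inside the hidden div.
  document.getElementById(dataHolderId).innerHTML = bodyData;
  scrapeFunction(dataHolderId);
}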
These two functions fetch the HTML page, extract the body's inner HTML, and then call the scrape function, passing it the id of the hidden div that contains the HTML body.
It is now your responsibility to write the desired scraping function, treating this div as the root of your DOM tree.
This is an example of a scraping function I defined:
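To try the flow outside a gadget container, the two functions can be written with the fetch call and the document passed in as parameters. This is only a sketch, not the original implementation: `makeScraper`, `fakeFetch`, `fakeDocument`, and the sample markup are hypothetical, and in a real gadget the fetcher would presumably be `_IG_FetchContent` and the document the browser's own `document`.

```javascript
// Hypothetical factory: builds a scrapeHTMLBody-style function around a
// fetcher (url, callback) and a document-like object.
function makeScraper(fetchContent, doc) {
  return function scrapeHTMLBody(url, dataHolderId, scrapeFunction) {
    fetchContent(url, function (responseText) {
      var body = /<body[^>]*>([\s\S]*)<\/body>/i.exec(responseText);
      if (!body) return; // nothing to scrape
      // In a browser this assignment makes the HTML parser build DOM nodes.
      doc.getElementById(dataHolderId).innerHTML = body[1];
      scrapeFunction(dataHolderId);
    });
  };
}

// Stand-ins so the sketch runs without a gadget container or a browser.
var holder = { innerHTML: '' };
var fakeDocument = {
  getElementById: function (id) { return id === 'dataHolder' ? holder : null; }
};
function fakeFetch(url, callback) {
  callback('<html><body><p>joke 1</p></body></html>');
}

var scrapedIds = [];
var scrapeHTMLBody = makeScraper(fakeFetch, fakeDocument);
scrapeHTMLBody('http://example.com/', 'dataHolder', function (dataHolderId) {
  scrapedIds.push(dataHolderId);
});
console.log(holder.innerHTML);
```

Passing the fetcher and the document in keeps the regular-expression and hidden-div logic unchanged while letting it be exercised with stubs before it is dropped into a gadget.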
scrape = function(dataHolderId){
var elements =
var noktas = [];
var num = elements.length;
for(var i=0 ; i<num ; i+=2) noktas.push(elements[i].childNodes[0].innerHTML);
for(var i=0 ; i<noktas.length ; i++){
var e = document.createElement('p');
e.innerHTML = noktas[i];
document.body.appendChild(e);
}
}
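The data-collection half of a scrape function like the one above can be sketched on its own. The names `collectEveryOther`, `fakeCell`, and `fakeRoot` and the tag name `td` are assumptions chosen for illustration, not the selector the original function used.

```javascript
// Hypothetical data-collection half of a scrape function: returns the
// innerHTML of the first child of every second element with the given tag.
function collectEveryOther(root, tagName) {
  var elements = root.getElementsByTagName(tagName);
  var items = [];
  for (var i = 0; i < elements.length; i += 2) {
    var firstChild = elements[i].childNodes[0];
    if (firstChild) items.push(firstChild.innerHTML);
  }
  return items;
}

// Stand-in for the hidden div so the sketch runs outside a browser.
function fakeCell(html) {
  return { childNodes: [{ innerHTML: html }] };
}
var fakeRoot = {
  getElementsByTagName: function (tagName) {
    return tagName === 'td'
      ? [fakeCell('first'), fakeCell('skipped'), fakeCell('second')]
      : [];
  }
};

var items = collectEveryOther(fakeRoot, 'td');
console.log(items);
```

In a gadget, the root would be the hidden div itself, for example `collectEveryOther(document.getElementById(dataHolderId), 'td')`, and the returned array could then be rendered however the gadget needs.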
That's it! I think you are now ready to use these two functions in any gadget whose data source has to be scraped.
An advantage of this method is that all the parsing and scraping is done on the client machine rather than on a separate scraping server.
Written By:
Osama Breka (http://bionuc-tech.blogspot.