more on web harvesting

Jun 4, 2010

in HTTP scripting, page scraping, web harvesting

Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1
http://scrubyt.org (ruby)
HPricot.com : “a swift, liberal HTML parser with a fantastic library” (ruby)
http://brightplanet.com : “Pioneers in Harvesting the Deep Web”
…

Update 2010-06-05/06:
One night later I am still very impressed by scrubyt, and I would like to try it on a real-life example quite soon.
Actually, in a way scrubyt does what I also do with my JHwis toolkit, though of course it looks as if it goes far (?!?) beyond that. JHwis navigates through web sites in a programmed way and downloads certain HTML files to disk for further processing. Those HTML files contain HTML tables, and there is already a nice Perl library, which I wrap in a command-line utility, that extracts HTML tables into CSV files. These CSV files are not of a kind that you can load directly into a spreadsheet GUI such as OpenOffice Calc. They need further mechanical processing and refinement before they can be loaded into database tables.
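The table-to-CSV step can be sketched roughly as follows. This is not the Perl library or the JHwis wrapper mentioned above, just a minimal stand-in using only the Python standard library; the names `TableExtractor` and `html_tables_to_csv` and the sample markup are made up for illustration, and nested tables are deliberately not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each (non-nested) <table> as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._rows = None  # rows of the table currently open
        self._row = None   # cells of the row currently open
        self._cell = None  # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_tables_to_csv(html):
    """Return one CSV string per table found in the HTML text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    results = []
    for rows in parser.tables:
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        results.append(buf.getvalue())
    return results


# Hypothetical sample page fragment.
sample = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget, large</td><td> 9.50 </td></tr>
</table>
"""
csv_texts = html_tables_to_csv(sample)
print(csv_texts[0])
```

Real-world pages usually need more than this (missing end tags, `colspan`, tables inside tables), which is exactly why the resulting CSV files tend to need the further refinement described above.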
With scrubyt’s help you (apparently) extract an XML file from the deeply nested HTML table structures of a web page.
Years ago, when I started my project, I created CSV files. A couple of years later I also created XML files, but I never adapted the entire tool chain to make use of them.
My XML files reflect only the data that I want to use.
scrubyt’s XML files reflect (I think) the entire table structure.
Nowadays, with XSLT processors, you can “easily” develop an XSL script (aka “stylesheet”) that extracts just the portion you are really interested in.
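To make the idea of "extract only the interesting portion" concrete, here is a small sketch. Note that it is not XSLT: the Python standard library has no XSLT processor, so this uses `xml.etree.ElementTree` path expressions as a stand-in for what a stylesheet would do. The element names (`page`, `table`, `row`, `cell`) and the `extract_prices` helper are invented for the example and are not scrubyt's actual output format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML that mirrors an entire page's table structure.
full_xml = """
<page>
  <table id="nav"><row><cell>Home</cell><cell>About</cell></row></table>
  <table id="prices">
    <row><cell>Widget</cell><cell>9.50</cell></row>
    <row><cell>Gadget</cell><cell>12.00</cell></row>
  </table>
</page>
"""


def extract_prices(xml_text):
    """Build a reduced <prices> document from the 'prices' table only."""
    root = ET.fromstring(xml_text.strip())
    out = ET.Element("prices")
    # Select only the rows of the table we care about; ignore the rest.
    for row in root.findall(".//table[@id='prices']/row"):
        cells = [(c.text or "").strip() for c in row.findall("cell")]
        item = ET.SubElement(out, "item", name=cells[0])
        item.text = cells[1]
    return out


reduced = extract_prices(full_xml)
print(ET.tostring(reduced, encoding="unicode"))
```

An XSL stylesheet would express the same selection declaratively, with a template matching the one table of interest, which keeps the extraction rule separate from the program that runs it.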
To be continued …
