Category: web harvesting
-
Does harvesting HTML-obfuscated websites look like horror to you?
I just completed two tasks in which I faced obfuscated CGI forms. It was quite a challenge, and at the beginning I didn’t anticipate that I would succeed. But it’s done.
Now I am rather eager to apply my technology to interesting and lucrative tasks.
-
CPAN: Scrappy – The All Powerful Web Spidering, Scraping, Creeping Crawling Framework
-
An Introduction to Testing Web Applications with twill and Selenium – O’Reilly Media
Too cheap not to own it – I thought a little, and now I am reading it.
-
more on web harvesting
-
- Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1 – http://scrubyt.org (ruby)
- HPricot.com : “a swift, liberal HTML parser with a fantastic library” (ruby)
- http://brightplanet.com : “Pioneers in Harvesting the Deep Web”
- …
Update 2010-06-05/06:
One night later I am still very impressed by scrubyt, and I rather want to try it on a real-life example quite soon.
Actually, in a way scrubyt does what I also do with my JHwis toolkit, but of course it looks as if it goes far (?!?) beyond that. JHwis navigates through websites in a programmed way and downloads certain HTML files to disk for further processing. Those HTML files contain HTML tables, and there is already a nice Perl library, which I wrap into a command line utility, that extracts the HTML tables into CSV files. These CSV files are not really of a kind that you can directly load into a spreadsheet GUI utility like OpenOffice Calc or whatever. They need further mechanical processing and refinement before they can be loaded into database tables.
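For illustration only, the table-to-CSV step could be sketched in a few lines of Python using the standard library’s `html.parser` (my real tool wraps a Perl library, which this is not; nested tables and other real-world messiness are deliberately ignored here):

```python
# Illustrative sketch: flatten the <tr>/<td|th> cells of an HTML
# fragment into CSV rows. Not the actual JHwis/Perl tool.
import csv
import io
from html.parser import HTMLParser

class TableToCSV(HTMLParser):
    """Collect the text of every table cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def table_to_csv(html):
    parser = TableToCSV()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()

html = ("<table><tr><th>name</th><th>price</th></tr>"
        "<tr><td>foo</td><td>1.50</td></tr></table>")
print(table_to_csv(html), end="")
```

As noted above, the resulting CSV usually still needs mechanical clean-up before it can go into a database table.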
With scrubyt’s help you (apparently) extract an XML file from the quite nested HTML table structures of a web page.
Years ago, when I started my project, I created CSV files. A couple of years later, I also created XML files. But I never adapted the entire tool chain to make use of these XML files.
My XML files reflect exactly the data that I want to make use of.
scrubyt’s XML files reflect (I think) the entire table structure.
Nowadays, with XSLT processors, you can “easily” develop an XSL script (aka “stylesheet”) that extracts just the portion you are really interested in.
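For example, a stylesheet along these lines could pull just two columns out of such an XML table dump and emit CSV (the element names `tables`, `table`, `row`, `name`, `price` are made up for illustration; a real scrubyt dump will look different):

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch: turn <row> elements of an XML table dump
     into CSV lines, keeping only the <name> and <price> columns. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- one CSV line per row -->
  <xsl:template match="/tables/table/row">
    <xsl:value-of select="name"/>
    <xsl:text>,</xsl:text>
    <xsl:value-of select="price"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- suppress all other text content -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Any XSLT 1.0 processor (e.g. xsltproc) should be able to apply such a stylesheet.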
To be continued … -