- I remember, I had left a link here at my Aleph-Soft.com website
- that leads me to my slightly more extensive dedicated article
- of course, while I read it, I switch to the sources of that article, so that I can improve the article “en passant”; OMG: running that DocBook website toolchain even works after at least a year or so! I’m amazed. well, not updating software does have some positive side effects.
- does LiveHTTPHeaders still work with my current Firefox? LiveHTTPHeaders is one of the reasons I still keep my Firefox updated, although I chose Chromium as my main browser on all platforms (*** bookmark ***)
- what about its cousin ieHTTPHeaders for IE? WTF, where does it actually live and get maintained? alright, I assume Jonas Blunck is the creator and maintainer
- is there anything like *HTTPHeaders for Chrome/Chromium? that would be nice; I would have to make my respective tool read its logfile then
- creating a Perl script from LiveHTTPHeaders’s log file still works
- integrated that Perl script into my framework for that kind of stuff
- download the root HTML page, parse it, and extract the 1st few bits of information wanted
- download the 1st linked page; the navigation doesn’t go further / deeper than this
- TBD: extract the information details from that linked page; CAVEAT: there is an optional intermediate (“region”) level within that page
- …
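The download-and-follow steps above can be sketched roughly like this. My actual toolkit is Perl-based, so this is only an illustrative Python sketch; the sample HTML and the `first_link` helper are made up for the example (in the real run the root page would of course come off the network):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, in document order."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def first_link(html):
    """Return the first hyperlink found in an HTML page, or None."""
    p = LinkExtractor()
    p.feed(html)
    return p.links[0] if p.links else None

# In the real toolchain the root page would be downloaded first, e.g.:
#   html = urllib.request.urlopen(root_url).read().decode("utf-8")
sample = '<html><body><a href="/region/overview.html">Overview</a></body></html>'
print(first_link(sample))  # → /region/overview.html
```

Following the first linked page is then just a second download of that extracted URL; the navigation does not go deeper than this.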
Category: page scraping
-
my new page scraping assignment – getting familiar again with my toolkit
For my new page scraping assignment I thought for a while about trying a much more modern approach. That actually kept me from really starting it for quite a couple of weeks now, because it seemed so very tedious and I thought I don’t have like 3 shots at it. This week I thought about going with my own old approach and about making use of the state-of-the-art technology at a (slightly) later stage. That should work. So where is my software and where is my documentation? (This article is getting extended and updated these days in early November 2011.)
-
CPAN: Scrappy – The All Powerful Web Spidering, Scraping, Creeping Crawling Framework
Scrappy – metacpan.org: “Scrappy – The All Powerful Web Spidering, Scraping, Creeping Crawling Framework”
-
more on web harvesting
-
- Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1 – http://scrubyt.org (ruby)
- HPricot.com : “a swift, liberal HTML parser with a fantastic library” (ruby)
- http://brightplanet.com : “Pioneers in Harvesting the Deep Web”
- …
Update 2010-06-05/06:
One night later, I am still very impressed by scrubyt, and I rather want to try it on a real-life example quite soon.
Actually, in a way scrubyt does what I also do with my JHwis toolkit, but of course it looks as if it goes far (?!?) beyond that. JHwis navigates in a programmed way through web sites, and it downloads certain HTML files to disk for further processing. Those HTML files contain HTML tables, and there is already a nice Perl library, which I wrap into a command line utility, that extracts HTML tables into CSV files. These CSV files are actually not really of a kind that you can directly load into a spreadsheet GUI utility like OpenOffice Calc or whatever. They need further mechanical processing and refinement before they can get loaded into database tables.
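The table-to-CSV step could look roughly like the following. The real tool wraps an existing Perl library behind a command line utility, so this Python sketch (with a made-up sample table) only illustrates the idea of flattening `<tr>`/`<td>` cells into CSV rows:

```python
import csv
import io
from html.parser import HTMLParser

class TableToRows(HTMLParser):
    """Collects the text of <td>/<th> cells, grouped per <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell or []).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def table_to_csv(html):
    """Return the cells of an HTML table fragment as CSV text."""
    p = TableToRows()
    p.feed(html)
    out = io.StringIO()
    csv.writer(out).writerows(p.rows)
    return out.getvalue()

html = "<table><tr><th>Year</th><th>Value</th></tr><tr><td>2010</td><td>42</td></tr></table>"
print(table_to_csv(html), end="")
```

As noted above, such CSV output would still need further mechanical refinement before loading it into database tables.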
With scrubyt’s help (apparently) you extract an XML file from the quite nested HTML table structures of a web page.
Years ago, when I started my project I created CSV files. A couple of years later, I also created XML files. But I never adapted the entire tool chain to make use of these XML files.
My XML files reflect exactly the data that I want to make use of.
scrubyt’s XML files reflect (I think) the entire table structure.
Nowadays, with XSLT processors, you “easily” develop an XSL script (aka “stylesheet”) that extracts the portion that you are really interested in.
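As a rough illustration of “extracting the portion you are really interested in”, here is a sketch that does the same kind of selection with an ElementTree path expression instead of a full XSL stylesheet; the document layout and the element names (`table`, `row`, `value`) are hypothetical, not scrubyt’s actual output format:

```python
import xml.etree.ElementTree as ET

def extract_values(xml_text, path):
    """Pull only the interesting elements out of a larger XML document,
    in the spirit of a small XSL stylesheet selecting one portion."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(path)]

# Hypothetical export reflecting an entire table structure:
doc = """<export>
  <meta><generated>2010-06-06</generated></meta>
  <table>
    <row><year>2010</year><value>42</value></row>
    <row><year>2011</year><value>43</value></row>
  </table>
</export>"""
print(extract_values(doc, "./table/row/value"))  # → ['42', '43']
```

A real XSL stylesheet could of course do more, e.g. reshape the selected nodes into a new document, but the selection step is the part my tool chain would need.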
To be continued … -