For my new page scraping assignment I thought for a while about trying a much more modern approach.
That actually kept me from really starting it for a couple of weeks now, because it seemed so very tedious and I figured I don’t have three shots at it.
This week I thought about going with my own old approach and making use of the state-of-the-art technology at a (slightly) later stage. That should work.
So where is my software and where is my documentation?
- I remember I had left a link here at my Aleph-Soft.com website
- that leads me to my slightly more extensive dedicated article
- of course, while I read it, I switch to the sources of that article, so that I can improve the article “en passant”; OMG: running that DocBook website toolchain even works after at least a year or so! I’m amazed. Well, not updating software does have some positive side effects.
- does LiveHTTPHeaders still work with my current Firefox? LiveHTTPHeaders is one of the reasons I still keep my Firefox updated, although I chose Chromium as my main browser on all platforms (*** bookmark ***)
- what about its cousin ieHTTPHeaders for IE? WTF, where does it actually live and get maintained? alright, I assume Jonas Blunck is the creator and maintainer
- is there anything like *HTTPHeaders for Chrome/Chromium? that would be nice; I would then have to make my respective tool read its log file
- creating a Perl script from LiveHTTPHeaders’s log file still works (a rough sketch of the idea follows after this list)
- integrated that Perl script into my framework for that kind of stuff
- download the root HTML page, parse it, and extract the first few bits of information wanted (a sketch follows below)
- download the first linked page; the navigation doesn’t go any further or deeper than this
- TBD: extract the information details from that linked page; CAVEAT: there is an optional intermediate (“region”) level within that page (a purely hypothetical sketch for handling that follows below, too)
- …
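Since the list mentions turning a LiveHTTPHeaders log into a Perl script: here is a rough, hypothetical sketch of the idea only. I’m not quoting the exact log format from memory; the sketch simply assumes that every captured exchange contains a line holding just the full request URL, and it replays those URLs with LWP. A real converter would presumably also carry over headers, cookies and POST data.

```perl
#!/usr/bin/perl
# Hypothetical sketch only: replay the GET requests recorded in a saved
# LiveHTTPHeaders log. The parsing is deliberately naive and assumes that
# every captured exchange contains a line consisting of just the full URL;
# adjust it to whatever the log actually looks like.
use strict;
use warnings;
use LWP::UserAgent;

my $logfile = shift @ARGV or die "usage: $0 livehttpheaders.log\n";
open my $fh, '<', $logfile or die "cannot open $logfile: $!\n";

my @urls;
while ( my $line = <$fh> ) {
    chomp $line;
    push @urls, $line if $line =~ m{^https?://\S+$};   # crude URL heuristic
}
close $fh;

my $ua = LWP::UserAgent->new( agent => 'my-scraper/0.1' );
for my $url (@urls) {
    my $res = $ua->get($url);
    printf "%s -> %s\n", $url, $res->status_line;
}
```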
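The two items about downloading the root page and the first linked page could look roughly like this. The URL, the choice of the page title as the “first bit of information”, and the assumption that the wanted navigation link is simply the first <a> element are placeholders, not the real site’s structure.

```perl
#!/usr/bin/perl
# Hypothetical sketch: fetch the root page, pull out one bit of information
# (here: the <title>), find the first navigation link and fetch that page too.
# The URL and all structural assumptions are placeholders.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use URI;

my $root_url = 'http://www.example.com/';   # placeholder, not the real site
my $ua       = LWP::UserAgent->new( agent => 'my-scraper/0.1' );

my $res = $ua->get($root_url);
die "root fetch failed: ", $res->status_line, "\n" unless $res->is_success;

my $tree = HTML::TreeBuilder->new_from_content( $res->decoded_content );

# "first few bits of information" -- standing in for that: the page title
my $title = $tree->look_down( _tag => 'title' );
print "title: ", ( $title ? $title->as_text : '(none)' ), "\n";

# assumed: the wanted link is simply the first <a> with an href
my $first_a = $tree->look_down( _tag => 'a' );
die "no link found on the root page\n"
    unless $first_a and defined $first_a->attr('href');
my $next_url = URI->new_abs( $first_a->attr('href'), $root_url );

my $res2 = $ua->get($next_url);
print "linked page $next_url -> ", $res2->status_line, "\n";

$tree->delete;   # HTML::TreeBuilder objects need explicit cleanup
```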
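And for the still-open (“TBD”) step, the interesting part is the optional intermediate “region” level. A purely hypothetical way to deal with it: look for region containers first, and if there are none, treat the whole page as one unnamed region. All tag and class names below are invented; only the fallback logic is the point.

```perl
# Purely hypothetical sketch for the "TBD" step: extract detail records from
# the linked page, where an intermediate "region" level may or may not exist.
use strict;
use warnings;
use HTML::TreeBuilder;

sub extract_details {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look for the optional intermediate level first ...
    my @regions = $tree->look_down( _tag => 'div', class => 'region' );
    # ... and fall back to "the whole page is one unnamed region"
    @regions = ($tree) unless @regions;

    my @records;
    for my $region (@regions) {
        my $name = '(no region)';
        if ( $region != $tree ) {
            my $heading = $region->look_down( _tag => 'h2' );
            $name = $heading ? $heading->as_text : '(unnamed region)';
        }
        for my $detail ( $region->look_down( _tag => 'div', class => 'detail' ) ) {
            push @records, { region => $name, detail => $detail->as_text };
        }
    }

    $tree->delete;
    return @records;
}
```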
(This article is getting extended and updated these days in early November 2011.)