my new page scraping assignment – getting familiar again with my toolkit

For my new page scraping assignment, I spent a while considering a much more modern approach.
That kept me from really getting started for a couple of weeks, because it seemed so tedious and I felt I did not have three attempts to spare.
This week I decided to go with my own old approach and to bring in state-of-the-art technology at a (slightly) later stage. That should work.
So where is my software and where is my documentation?
  • I remember, I had left a link here at my Aleph-Soft.com website
  • that leads me to my slightly more extensive dedicated article
  • of course, while I read it, I switch to the sources of that article, so that I can improve the article “en passant”; OMG: running that DocBook website toolchain still works after at least a year or so! I’m amazed. Well, not updating software does have some positive side effects.
  • does LiveHTTPHeaders still work with my current Firefox? LiveHTTPHeaders is one of the reasons I still keep my Firefox updated, although I chose Chromium as my main browser on all platforms (*** bookmark ***)
  • what about its cousin ieHTTPHeaders for IE? WTF, where does it actually live and get maintained? alright, I assume Jonas Blunck is the creator and maintainer
  • is there anything like *HTTPHeaders for Chrome/Chromium? that would be nice; I would have to make my respective tool read its logfile then
  • creating a perl script from LiveHTTPHeaders’s log file still works
  • integrated that perl script into my framework for that kind of stuff
  • download the root HTML page, parse it, and extract the first few bits of information wanted
  • download the 1st linked page; the navigation doesn’t go further / deeper than this
  • TBD: extract the information details from that linked page; CAVEAT: there is an optional intermediate (“region”) level within that page
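
The last three steps of the list can be sketched roughly as follows. This is not my actual Perl framework, only a minimal Python illustration using the standard library. The names `fetch`, `extract_links`, and `parse_regions` are made up for this sketch, and the markup it expects on the linked page (an optional `h2` heading per region, `li` elements for the items) is purely an assumption, since the real target site is not described here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch(url):
    """Download a page and decode it (hypothetical helper, not used offline)."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Return all links found in the root page, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


class RegionParser(HTMLParser):
    """Group <li> items under the most recent <h2> (ASSUMED markup).

    Items that appear before any heading end up under the key None,
    which models the optional "region" level.
    """

    def __init__(self):
        super().__init__()
        self.regions = {}
        self.current = None
        self._tag = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "li"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "h2":
            self.current = text
        else:
            self.regions.setdefault(self.current, []).append(text)
        self._tag = None


def parse_regions(html):
    """Return a dict mapping region name (or None) to a list of item texts."""
    parser = RegionParser()
    parser.feed(html)
    parser.close()
    return parser.regions
```

In use, it would be something like `extract_links(fetch(root_url), root_url)` for the root page, followed by `parse_regions(fetch(link))` for the single linked page, because the navigation goes no deeper than that. Keeping the parsing functions separate from `fetch` means they can be tried on saved HTML without touching the network.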
(This article is getting extended and updated these days in early November 2011.)
