web scraping afternoon

This wasn’t meant to be yet another web scraping afternoon.

This afternoon started with me trying to recover a little from a hard time.
I had two trial days for a website testing job with Selenium, I am in the middle of a couple of recruitment processes, and I don’t want to tell you about the real trouble.

Out of curiosity I searched oreilly.com for literature on Selenium and found a “Short Cut” document.
I skimmed the chapter on “twill”.
Before I really dived into the chapter on Selenium, I summed up what I really liked and disliked about Selenium.
Of course, being able to use XPath is great.
With Selenium you hardly need to be aware that a website makes use of JavaScript at all; you simply leave that to the browser engine, initially Firefox together with the Selenium IDE.
What I actually hate is when your HTTP scripting depends on desktop computers running a browser, plus some remote-control software connecting the server, where your “HTTP scripts” actually run, to the web browser(s) that you make use of.
I did a little superficial research on perl/ruby + mechanize + xpath.
Yes, there is still scrubyt around, but isn’t that vaporware now itself?
I found Perl’s WWW::Scraper::TidyXML – “TidyXML and XPath support for Scraper”. Not bad. But then it’s from around 2003, and it seems to be vaporware. My e-mail to the author could not be delivered (“over quota”), so I guess it is really no longer maintained.
WWW::Mechanize::Firefox seems nice; have a look at WWW::Mechanize::Firefox::Cookbook!
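The “mechanize + xpath” idea above, meaning pages fetched over plain HTTP and queried with XPath without any browser in the loop, can be sketched roughly like this. It is only an illustration under loud assumptions: it uses Python instead of Perl or Ruby, it sticks to the standard library (whose ElementTree supports only a subset of XPath), and it assumes the page is already well-formed XML, which real-world HTML usually is not without a tidy step first. The function names and the sample page are made up for the example.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url):
    """Fetch a page over plain HTTP; no browser or remote control involved.

    Assumption: the page is UTF-8 encoded.
    """
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_links(xhtml):
    """Return (text, href) pairs for every <a> element that has an href."""
    # Assumption: the markup is well-formed XML (e.g. already tidied).
    root = ET.fromstring(xhtml)
    # ".//a[@href]" is within the XPath subset ElementTree understands.
    return [(a.text, a.get("href")) for a in root.findall(".//a[@href]")]


# Hypothetical sample page, standing in for the result of fetch(...).
SAMPLE = """<html><body>
<ul>
  <li><a href="/selenium">Selenium</a></li>
  <li><a href="/twill">twill</a></li>
  <li><a name="anchor-only">no link</a></li>
</ul>
</body></html>"""

links = extract_links(SAMPLE)
print(links)  # [('Selenium', '/selenium'), ('twill', '/twill')]
```

Against a live site you would pass the result of `fetch(url)` to `extract_links`, but that only works when the server returns well-formed markup; otherwise parsing fails, which is exactly the gap a Tidy-to-XML step is meant to fill. JavaScript-generated content stays invisible to this approach.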
…
