Blog
-
web scraping afternoon
This wasn’t meant to be yet another web scraping afternoon.
This afternoon started with me trying to recover a little from a hard time.
I had two trial days for a web-site testing job with Selenium, I am in the middle of a couple of recruitment processes, and I don’t want to tell you about the real trouble.
- I got curious and searched oreilly.com for literature on Selenium and found a “Short Cut” document.
- I found something.
- I had a few looks over the chapter on “twill”.
- Before I really dived into the chapter on Selenium, I summed up what I really liked and disliked about Selenium.
- Of course, being able to use XPath is great.
- With Selenium you are hardly aware that a web site uses JavaScript at all; you just leave that to the browser engine, initially Firefox with the Selenium IDE.
- I actually hate it when HTTP scripting depends on desktop computers running a browser, plus some remote-control software to connect the server where your “HTTP scripts” actually run with the web browser(s) that you make use of.
- I did a little superficial research on: perl/ruby + mechanize + xpath.
- Yes, there is still scrubyt around, but isn’t that vaporware now itself?
- Found Perl’s WWW::Scraper::TidyXML – “TidyXML and XPath support for Scraper”. Not bad. But then it’s from around 2003, and it seems to be abandoned. My e-mail to the author could not be delivered (“over quota”), so I guess it really is no longer maintained.
- WWW::Mechanize::Firefox seems nice; have a look at WWW::Mechanize::Firefox::Cookbook!
- …
-
EDI for Ruby (edi4r)
Actually they refer to EDIFACT here.
You can use this software to output JSON, which you can then process in any other software.
-
WWW::Mechanize::Firefox – search.cpan.org
Support for Javascript and XPath.
What about recording (or capturing) such a script?
-
perl, cpan: WWW::Scripter
WWW::Scripter – search.cpan.org
From the POD there:
DESCRIPTION
This is a subclass of WWW::Mechanize that uses the W3C DOM and provides support for scripting.
No actual scripting engines are provided with WWW::Scripter, but are available as separate plugins. (See also the “SEE ALSO” section below.)
So it supports the DOM, but no XPath expressions yet.
And there is JavaScript support through plugins.
-
An Introduction to Testing Web Applications with twill and Selenium – O’Reilly Media
Too cheap not to own it – I thought about it a little, and now I am reading it.
-
HSDD = hypoactive sexual desire disorder
A link to the abstract of the conference article / press release.
From that abstract:
CONCLUSION: Cerebral activation patterns in women with HSDD differs from those in women with normal sexual function and may reflect differences in how they interpret sexual stimuli.
In other words: Women with low libidos ‘have different brains’.
Have a good laugh!!!
Here is a lengthy discussion of the “miserable” approach in that article.
-
Selenium+XPather: e.g. verifyTextPresent vs. verifyElementPresent
Selenium usually records clicks and tests based on text strings instead of true, native-language-independent XPath expressions. But you can always find the right XPath expression yourself (or with the help of XPather, a Firefox extension) and make use of it in your Selenium code.
Caveat: the XPath expression that XPather gives you needs one more ‘/’ at the beginning to be useful in your Selenium code.
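A tiny Python sketch of that caveat, assuming XPather hands you an absolute path that starts with a single ‘/’. As far as I know, Selenium’s locator syntax treats a locator starting with ‘//’ as XPath without any prefix, and also accepts an explicit ‘xpath=’ prefix. The helper name and the example path are made up for illustration.

```python
def to_selenium_locator(xpath: str) -> str:
    """Turn an XPath copied from a tool like XPather into a Selenium locator.

    Hypothetical helper, not part of Selenium or XPather.
    """
    xpath = xpath.strip()
    if xpath.startswith("//"):
        return xpath            # already starts with '//', leave it alone
    if xpath.startswith("/"):
        return "/" + xpath      # add the one extra leading slash
    return "xpath=" + xpath     # relative expression: use the explicit prefix


# Made-up example path, just to show the transformation.
print(to_selenium_locator("/html/body/div[2]/form/input[1]"))
```

The result can then be pasted as the target of a command such as verifyElementPresent.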
Yes, these XPath expressions are lengthy, and you may think they overspecify the location in question, but then: when will that lengthy XPath expression ever fail? When your HTML programmer changes his code. And that is exactly what you should insist on being informed about in the first place. Track your HTML programmer! If you don’t, he will screw you without any mercy. You don’t want to screw him, but you need to know the consequences of what he is doing. Not in every detail, but more details are better than no details at all.
We replaced verifyTextPresent with verifyElementPresent, and it worked “out of the box”. We gained native language independence immediately.
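To illustrate why the element check is language independent, here is a minimal Python sketch. It does not use Selenium at all: the standard library’s xml.etree.ElementTree (which only supports a subset of XPath) stands in for the browser, and the two page snippets, the function names and the XPath are all made up for this example.

```python
import xml.etree.ElementTree as ET

# Two made-up, well-formed versions of the same page; only the language differs.
PAGE_EN = ('<html><body><div id="login"><form>'
           '<button name="go">Log in</button></form></div></body></html>')
PAGE_DE = ('<html><body><div id="login"><form>'
           '<button name="go">Anmelden</button></form></div></body></html>')

# Locates the button by structure and attributes, not by its visible text.
BUTTON_XPATH = ".//div[@id='login']/form/button[@name='go']"


def text_present(page: str, text: str) -> bool:
    """Rough stand-in for verifyTextPresent: search the page's text content."""
    root = ET.fromstring(page)
    return text in "".join(root.itertext())


def element_present(page: str, xpath: str) -> bool:
    """Rough stand-in for verifyElementPresent: look the element up by XPath."""
    root = ET.fromstring(page)
    return root.find(xpath) is not None


for label, page in (("en", PAGE_EN), ("de", PAGE_DE)):
    print(label, text_present(page, "Log in"), element_present(page, BUTTON_XPATH))
```

The text check only passes on the English page, while the element check passes on both.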