{"id":2010,"date":"2010-10-29T22:32:00","date_gmt":"2010-10-29T22:32:00","guid":{"rendered":"http:\/\/www.b.shuttle.de\/hayek\/Hayek\/Jochen\/wp\/blog-en\/2010\/10\/29\/web-scraping-afternoon\/"},"modified":"2010-10-29T22:32:00","modified_gmt":"2010-10-29T22:32:00","slug":"web-scraping-afternoon","status":"publish","type":"post","link":"https:\/\/wp.jochen.hayek.name\/blog-en\/2010\/10\/29\/web-scraping-afternoon\/","title":{"rendered":"web scraping afternoon"},"content":{"rendered":"<p>This wasn&#8217;t meant to be yet another web scraping afternoon.<\/p>\n<p>This afternoon started with me trying to recover a little from a hard time.<br \/>\nI had two probation days for a web-site testing job with Selenium, I am in the middle of a couple of recruitment processes, and I don&#8217;t want to tell you about the real trouble.<\/p>\n<p><\/p>\n<ul>\n<li>I got intrigued to search oreilly.com for literature on Selenium and found a &#8220;Short Cut&#8221; document.<\/li>\n<li>I found something.<\/li>\n<li>I had a few looks at the chapter on &#8220;twill&#8221;.<\/li>\n<li>Before I really dived into the chapter on Selenium, I summed up what I really liked and disliked about Selenium.<\/li>\n<li>Of course, being able to use XPath is great.<\/li>\n<li>With Selenium you somehow aren&#8217;t aware at all that a web-site makes use of JavaScript; you just leave this to the browser engine, initially to Firefox and the Selenium IDE.<\/li>\n<li>I actually hate it if your HTTP scripting depends on desktop computers running a browser and some remote control software to connect your server, where your &#8220;HTTP scripts&#8221; actually run, and the web browser(s) that you make use of.<\/li>\n<li>I did a little superficial research on: perl\/ruby + mechanize + xpath.<\/li>\n<li>Yes, there is still scrubyt around, but isn&#8217;t that vaporware now itself?<\/li>\n<li>Found perl&#8217;s\u00a0<a 
href=\"http:\/\/search.cpan.org\/perldoc?WWW::Scraper::TidyXML\">WWW::Scraper::TidyXML<\/a> &#8211; &#8220;TidyXML and XPath support for Scraper&#8221;. Not bad. But then it&#8217;s from around 2003, and it seems to be vaporware. My e-mail to the author could not be delivered (&#8220;over quota&#8221;), so I guess it&#8217;s seriously no longer maintained.<\/li>\n<li>WWW::Mechanize::Firefox seems to be nice; have a look at\u00a0<a href=\"http:\/\/search.cpan.org\/perldoc?WWW::Mechanize::Firefox::Cookbook\">WWW::Mechanize::Firefox::Cookbook<\/a>!<\/li>\n<li>\u2026<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This wasn&#8217;t meant to be yet another web scraping afternoon. This afternoon started with me trying to recover a little from a hard time. I had two probation days for a web-site testing job with Selenium, I am in the middle of a couple of recruitment processes, and I don&#8217;t want to tell you about [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_share_on_mastodon":"0"},"categories":[666],"tags":[],"class_list":["post-2010","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"share_on_mastodon":{"url":"","error":""},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paO
0kP-wq","jetpack_likes_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/2010","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/comments?post=2010"}],"version-history":[{"count":0,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/2010\/revisions"}],"wp:attachment":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/media?parent=2010"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/categories?post=2010"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/tags?post=2010"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}