{"id":932,"date":"2011-11-04T10:15:00","date_gmt":"2011-11-04T10:15:00","guid":{"rendered":"http:\/\/www.b.shuttle.de\/hayek\/Hayek\/Jochen\/wp\/blog-en\/2011\/11\/04\/my-new-page-scraping-assignment-getting-familiar-again-with-my-toolkit\/"},"modified":"2023-08-25T17:23:31","modified_gmt":"2023-08-25T15:23:31","slug":"my-new-page-scraping-assignment-getting-familiar-again-with-my-toolkit","status":"publish","type":"post","link":"https:\/\/wp.jochen.hayek.name\/blog-en\/2011\/11\/04\/my-new-page-scraping-assignment-getting-familiar-again-with-my-toolkit\/","title":{"rendered":"my new page scraping assignment \u2013 getting familiar again with my toolkit"},"content":{"rendered":"<p><\/p>\n<div>\nFor my new page scraping assignment, I thought for a while about trying a much more modern approach.<\/div>\n<div>\nThat actually kept me from really starting for a couple of weeks, because it seemed so very tedious and I thought I could not afford something like three attempts at it.<\/div>\n<div>\nThis week I thought about going with my own old approach and making use of state-of-the-art technology at a (slightly) later stage. That should work.<\/div>\n<div>\nSo where is my software and where is my documentation?<\/div>\n<ul>\n<li>I remember I had left a link <a href=\"http:\/\/aleph-soft.com\/JHwis.html\">here<\/a> at my <a href=\"http:\/\/aleph-soft.com\/\">Aleph-Soft.com<\/a> website<\/li>\n<li>that leads me to my slightly more extensive <a href=\"http:\/\/aleph-soft.com\/JHwis\/\">dedicated article<\/a><\/li>\n<li>of course, while I read it, I switch to the sources of that article, so that I can improve the article &#8220;en passant&#8221;; OMG: running that DocBook website toolchain still works after at least a year or so! I&#8217;m amazed. 
Well, not updating software does have some positive side effects.<\/li>\n<li>does\u00a0<span><a href=\"http:\/\/livehttpheaders.mozdev.org\/\">LiveHTTPHeaders<\/a> still work with my current Firefox?\u00a0<\/span><span>LiveHTTPHeaders is one of the reasons I still keep my Firefox updated, although I chose Chromium as my main browser on all platforms (*** <b>bookmark<\/b> ***)<\/span><\/li>\n<li><span>what about its cousin <a href=\"http:\/\/www.blunck.info\/iehttpheaders.html\">ieHTTPHeaders<\/a> for IE? WTF, where does it actually live and get maintained? alright, I assume\u00a0<a href=\"http:\/\/www.blunck.info\/\">Jonas Blunck<\/a>\u00a0is the creator and maintainer<\/span><\/li>\n<li><span>is there anything like *HTTPHeaders for Chrome\/Chromium? that would be nice; I would then have to make my corresponding tool read its log file<\/span><\/li>\n<li>creating a Perl script from LiveHTTPHeaders&#8217;s log file still works<\/li>\n<li>integrated that Perl script into my framework for that kind of stuff<\/li>\n<li>download the root HTML page, parse it, extract the first few bits of information wanted<\/li>\n<li>download the first linked page; the navigation doesn&#8217;t go further \/ deeper than this<\/li>\n<li>TBD: extract the information details from that linked page; CAVEAT: there is an optional intermediate (&#8220;region&#8221;) level within that page<\/li>\n<li><span>\u2026<\/span><\/li>\n<\/ul>\n<div>\n<div>\n(This article is being extended and updated in early November 2011.)<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>For my new page scraping assignment, I thought for a while about trying a much more modern approach. That actually kept me from really starting for a couple of weeks, because it seemed so very tedious and I thought I could not afford something like three attempts at it. 
This week I thought about [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_share_on_mastodon":"0"},"categories":[101,103,282,413],"tags":[],"class_list":["post-932","post","type-post","status-publish","format-standard","hentry","category-docbook","category-docbook-website","category-jhwis","category-page-scraping"],"share_on_mastodon":{"url":"","error":""},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paO0kP-f2","jetpack_likes_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/932","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/comments?post=932"}],"version-history":[{"count":1,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/932\/revisions"}],"predecessor-version":[{"id":12645,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/932\/revisions\/1264
5"}],"wp:attachment":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/media?parent=932"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/categories?post=932"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/tags?post=932"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}