{"id":2398,"date":"2010-06-30T00:06:00","date_gmt":"2010-06-30T00:06:00","guid":{"rendered":"http:\/\/www.b.shuttle.de\/hayek\/Hayek\/Jochen\/wp\/blog-en\/2010\/06\/30\/spidering-hacks-oreilly-media\/"},"modified":"2010-06-30T00:06:00","modified_gmt":"2010-06-30T00:06:00","slug":"spidering-hacks-oreilly-media","status":"publish","type":"post","link":"https:\/\/wp.jochen.hayek.name\/blog-en\/2010\/06\/30\/spidering-hacks-oreilly-media\/","title":{"rendered":"Spidering Hacks &#8211; O&#8217;Reilly Media"},"content":{"rendered":"<p>\t\t\t\t<b>CHAPTER ONE<\/b><br \/><b><br \/><\/b><br \/><b>Hack #2  \u2013 Best Practices for You and Your Spider\u00a0\u00a0\u00a0\u00a0<\/b><br \/>\n\u2026 <b><br \/><\/b><br \/><b>Be Liberal in What You Accept<\/b><br \/>\n\u2026 This is an inexact science, to put it mildly. \u2026<br \/>\n\u2026<br \/>\nMonitor your spider\u2019s output on a regular basis to make sure it\u2019s working as expected [Hack #31], make the appropriate adjustments as soon as possible to avoid losing ground with your data gathering, and design your spider to be as adaptive to site redesigns [Hack #32] as possible.<br \/>\n\u2026<br \/><b>Don\u2019t Reinvent the Wheel<\/b><\/p>\n<ul>\n<li><b>\u2026 <\/b><\/li>\n<li><b>Best Practices for You<\/b><\/li>\n<\/ul>\n<p>If you must scrape HTML, do so sparingly. If the information you want is available only embedded in an HTML page, try to find a \u201cText Only\u201d or \u201cPrint this Page\u201d variant; these usually have far less complicated HTML and a higher content-to-presentation markup quotient, and they don\u2019t tend to change all that much (by comparison) during site redesigns.<br \/><b>Hack #4  \u2013 Registering Your Spider<\/b><br \/>\nBy the way, you might think that your spider is minimal or low-key enough that nobody\u2019s going to notice it. That\u2019s probably not the case. 
In fact, sites like Webmaster World (http:\/\/www.webmasterworld.com) have entire forums devoted to identifying and discussing spiders. Don\u2019t think that your spider is going to get ignored just because you\u2019re not using a thousand online servers and spidering millions of pages a day.<br \/><b>Naming Your Spider<\/b><br \/>\n\u2026 There are web sites, like http:\/\/www.iplists.com, devoted to tracking IP addresses of legitimate spiders. \u2026<br \/><b>Hack #5  \u2013 <\/b><b>Preempting Discovery<\/b><br \/>\nNo matter how gentle and polite your spider is, sooner or later you\u2019re going to be noticed. Some webmaster\u2019s going to see what your spider is up to, and they\u2019re going to want some answers.<br \/><b>\u2026<\/b><br \/><b>Hack #6  \u2013 <\/b><b>Keeping Your Spider Out of Sticky Situations<\/b><br \/><b>Bad Spider, No Biscuit!<\/b><br \/>\n\u2026 There is nothing stopping a disgruntled site from revising its TOS to deny a spider\u2019s access, and then sending you a \u201ccease and desist\u201d letter. \u2026 Spidering another site\u2019s content and reappropriating it into your own framed pages is bad. Don\u2019t do it. \u2026<br \/><b>Competitive Intelligence<\/b><br \/>\nSome sites complain because their competitors access and spider their data\u2014data that\u2019s publicly available to any browser\u2014and use it in their competitive activities. You might agree with them and you might not, but the fact is that such scraping has been the object of legal action in the past. Bidder\u2019s Edge was sued by eBay (http:\/\/pub.bna.com\/lw\/21200.htm) for such a spider. 
\u2026<br \/><b>Possible Consequences of Misbehaving Spiders<\/b><br \/>\n\u2026 But considering lawyer\u2019s fees, the time it\u2019ll take out of your life, and the monetary penalties that might be imposed on you, a lawsuit is bad enough, and it\u2019s a good enough reason to make sure that your spiders are behaving and your intent is fair.<b> \u2026<\/b><br \/><b>CHAPTER TWO<br \/>Assembling a Toolbox<\/b><br \/><b>Hacks #8\u201332<\/b><br \/><b>\u2026<\/b><br \/><b><br \/><\/b><br \/><b>Chapter 4 Gleaning Data from Databases<\/b><br \/>\n\u2026<br \/><b>Hack #69<\/b><b>\u00a0\u2013\u00a0<\/b><b>Aggregating RSS and Posting Changes<\/b><br \/>\n-&gt; meta feeds, aggregating feeds, \u2026<br \/><b><br \/><\/b>\t\t\t\t<\/p>\n","protected":false},"excerpt":{"rendered":"<p>CHAPTER ONEHack #2 \u2013 Best Practices for You and Your Spider\u00a0\u00a0\u00a0\u00a0 \u2026 Be Liberal in What You Accept \u2026 This is an inexact science, to put it mildly. \u2026 \u2026 Monitor your spider\u2019s output on a regular basis to make sure it\u2019s working as expected [Hack #31], make the appropriate adjustments as soon as possible 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_share_on_mastodon":"0"},"categories":[666],"tags":[],"class_list":["post-2398","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"share_on_mastodon":{"url":"","error":""},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paO0kP-CG","jetpack_likes_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/2398","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/comments?post=2398"}],"version-history":[{"count":0,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/2398\/revisions"}],"wp:attachment":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/media?parent=2398"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/ca
tegories?post=2398"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/tags?post=2398"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}