Spidering Hacks – O’Reilly Media

CHAPTER ONE


Hack #2 – Best Practices for You and Your Spider    


Be Liberal in What You Accept
… This is an inexact science, to put it mildly. …

Monitor your spider’s output on a regular basis to make sure it’s working as expected [Hack #31], make the appropriate adjustments as soon as possible to avoid losing ground with your data gathering, and design your spider to be as adaptive to site redesigns [Hack #32] as possible.
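One way to act on the monitoring advice above is to run an automated sanity check over each run's output, so a site redesign shows up as a report of problems rather than as silently empty data. The following is a minimal sketch, not the book's method: the field names `title` and `url`, the `min_records` threshold, and the function name `check_scrape` are all assumptions made for illustration.

```python
def check_scrape(records, required_fields=("title", "url"), min_records=1):
    """Return a list of human-readable problems found in one run's output.

    `records` is a list of dicts produced by a (hypothetical) spider run.
    An empty list means the output passed every check.
    """
    problems = []

    # A sudden drop in record count is often the first sign of a redesign.
    if len(records) < min_records:
        problems.append(
            f"expected at least {min_records} records, got {len(records)}"
        )

    # Every record should carry a non-empty string for each required field.
    for i, rec in enumerate(records):
        for field in required_fields:
            value = rec.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty field '{field}'")

    return problems
```

Calling `check_scrape(records)` after every run and alerting whenever the returned list is non-empty is usually enough to catch a broken scraper within one run of the breakage.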

Don’t Reinvent the Wheel

  • Best Practices for You

If you must scrape HTML, do so sparingly. If the information you want is available only embedded in an HTML page, try to find a “Text Only” or “Print this Page” variant; these usually have far less complicated HTML and a higher content-to-presentation markup quotient, and they don’t tend to change all that much (by comparison) during site redesigns.
Hack #4 – Registering Your Spider
By the way, you might think that your spider is minimal or low-key enough that nobody’s going to notice it. That’s probably not the case. In fact, sites like Webmaster World (http://www.webmasterworld.com) have entire forums devoted to identifying and discussing spiders. Don’t think that your spider is going to get ignored just because you’re not using a thousand online servers and spidering millions of pages a day.
Naming Your Spider
… There are web sites, like http://www.iplists.com, devoted to tracking IP addresses of legitimate spiders. …
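In practice, naming a spider usually means sending a descriptive User-Agent header with every request. The sketch below uses Python's standard `urllib.request`; the spider name `ExampleSpider/0.1`, the information URL, and the helper `build_request` are hypothetical, and the `(+URL)` suffix is a widely used convention rather than something taken from this hack.

```python
import urllib.request

# Hypothetical values: replace with your own spider's name and info page.
SPIDER_NAME = "ExampleSpider/0.1"
INFO_URL = "http://www.example.com/spider.html"


def build_request(url, name=SPIDER_NAME, info_url=INFO_URL):
    """Build a request that identifies the spider and where to learn about it."""
    user_agent = f"{name} (+{info_url})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Passing the returned object to `urllib.request.urlopen` sends the header, so a webmaster reading the server logs sees a name and a page to visit instead of a generic library default.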
Hack #5 – Preempting Discovery
No matter how gentle and polite your spider is, sooner or later you’re going to be noticed. Some webmaster’s going to see what your spider is up to, and they’re going to want some answers.

Hack #6 – Keeping Your Spider Out of Sticky Situations
Bad Spider, No Biscuit!
… There is nothing stopping a disgruntled site from revising its TOS to deny a spider’s access, and then sending you a “cease and desist” letter. … Spidering another site’s content and reappropriating it into your own framed pages is bad. Don’t do it. …
Competitive Intelligence
Some sites complain because their competitors access and spider their data (data that’s publicly available to any browser) and use it in their competitive activities. You might agree with them and you might not, but the fact is that such scraping has been the object of legal action in the past. Bidder’s Edge was sued by eBay (http://pub.bna.com/lw/21200.htm) for such a spider. …
Possible Consequences of Misbehaving Spiders
… But considering lawyer’s fees, the time it’ll take out of your life, and the monetary penalties that might be imposed on you, a lawsuit is bad enough, and it’s a good enough reason to make sure that your spiders are behaving and your intent is fair.
CHAPTER TWO
Assembling a Toolbox

Hacks #8–32



Chapter 4 Gleaning Data from Databases

Hack #69 – Aggregating RSS and Posting Changes
-> meta feeds, aggregating feeds, …
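
The general idea of aggregating feeds and reporting only what changed can be sketched as follows. This is an illustration under stated assumptions, not the hack's own code: it handles plain RSS 2.0 documents already fetched as strings, treats an item's `<link>` as its identity, and the names `parse_items` and `aggregate` are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_items(rss_xml):
    """Yield (link, title) pairs from an RSS 2.0 document given as a string."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        if link:  # items without a link cannot be tracked, so skip them
            yield link, title


def aggregate(feeds, seen):
    """Merge several feeds and return only items not seen before.

    `feeds` is an iterable of RSS document strings; `seen` is a set of
    links from earlier runs, updated in place as new items are found.
    """
    new_items = []
    for rss_xml in feeds:
        for link, title in parse_items(rss_xml):
            if link not in seen:
                seen.add(link)
                new_items.append((link, title))
    return new_items
```

Persisting `seen` between runs (for example, in a small file) is what turns the merged feed into a list of changes. Namespaced formats such as RSS 1.0 would need namespace-aware tag lookups that this sketch omits.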

