what do you think about working through elance.com as a “provider”?
Blog
-
Spidering Hacks – O’Reilly Media
CHAPTER ONE
Hack #2 – Best Practices for You and Your Spider
…
Be Liberal in What You Accept
… This is an inexact science, to put it mildly. …
…
Monitor your spider’s output on a regular basis to make sure it’s working as expected [Hack #31], make the appropriate adjustments as soon as possible to avoid losing ground with your data gathering, and design your spider to be as adaptive to site redesigns [Hack #32] as possible.
…
Don’t Reinvent the Wheel
…
Best Practices for You
If you must scrape HTML, do so sparingly. If the information you want is available only embedded in an HTML page, try to find a “Text Only” or “Print this Page” variant; these usually have far less complicated HTML and a higher content-to-presentation markup quotient, and they don’t tend to change all that much (by comparison) during site redesigns.
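To make the “content-to-presentation markup quotient” a bit more tangible, here is a minimal sketch of how one might compare two variants of a page. It is my own illustration, not code from the book: the function name `content_ratio` and the way the ratio is defined (visible non-whitespace characters divided by total characters) are assumptions.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes of an HTML document, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def content_ratio(html):
    """Hypothetical metric: visible non-whitespace characters / total characters."""
    if not html:
        return 0.0
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    visible = len("".join(text.split()))  # drop all whitespace before counting
    return visible / len(html)
```

If the “Print this Page” variant scores noticeably higher than the regular page, it is probably the easier one to scrape.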
Hack #4 – Registering Your Spider
By the way, you might think that your spider is minimal or low-key enough that nobody’s going to notice it. That’s probably not the case. In fact, sites like Webmaster World (http://www.webmasterworld.com) have entire forums devoted to identifying and discussing spiders. Don’t think that your spider is going to get ignored just because you’re not using a thousand online servers and spidering millions of pages a day.
Naming Your Spider
… There are web sites, like http://www.iplists.com, devoted to tracking IP addresses of legitimate spiders. …
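One common way to give a spider a name is to send a descriptive User-Agent header with every request. The sketch below is my own and not quoted from the book; the spider name, version and info URL are hypothetical placeholders.

```python
import urllib.request

# Hypothetical values; replace them with your spider's real name and info page.
SPIDER_NAME = "ExampleSpider"
SPIDER_VERSION = "0.1"
INFO_URL = "http://www.example.com/spider.html"


def user_agent(name=SPIDER_NAME, version=SPIDER_VERSION, info_url=INFO_URL):
    """Build a User-Agent string that names the spider and points to an info page."""
    return f"{name}/{version} (+{info_url})"


def build_request(url):
    """Return a Request object that identifies the spider; nothing is sent yet."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent()})
```

Passing the result of `build_request(...)` to `urllib.request.urlopen` would then fetch the page under that name.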
Hack #5 – Preempting Discovery
No matter how gentle and polite your spider is, sooner or later you’re going to be noticed. Some webmaster’s going to see what your spider is up to, and they’re going to want some answers.
…
Hack #6 – Keeping Your Spider Out of Sticky Situations
Bad Spider, No Biscuit!
… There is nothing stopping a disgruntled site from revising its TOS to deny a spider’s access, and then sending you a “cease and desist” letter. … Spidering another site’s content and reappropriating it into your own framed pages is bad. Don’t do it. …
Competitive Intelligence
Some sites complain because their competitors access and spider their data—data that’s publicly available to any browser—and use it in their competitive activities. You might agree with them and you might not, but the fact is that such scraping has been the object of legal action in the past. Bidder’s Edge was sued by eBay (http://pub.bna.com/lw/21200.htm) for such a spider. …
Possible Consequences of Misbehaving Spiders
… But considering lawyer’s fees, the time it’ll take out of your life, and the monetary penalties that might be imposed on you, a lawsuit is bad enough, and it’s a good enough reason to make sure that your spiders are behaving and your intent is fair. …
CHAPTER TWO
Assembling a Toolbox
Hacks #8–32
…
Chapter 4 Gleaning Data from Databases
…
Hack #69 – Aggregating RSS and Posting Changes
-> meta feeds, aggregating feeds, …
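As a rough idea of what a meta feed involves, here is a minimal sketch that merges the items of several RSS 2.0 documents, drops duplicates and sorts by date. It is my own illustration and not the hack from the book; the function names and the choice of the link as the duplicate key are assumptions.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_items(rss_xml):
    """Extract title, link and date of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        try:
            when = parsedate_to_datetime(pub) if pub else None
        except (TypeError, ValueError):
            when = None  # unparsable date: keep the item, sort it last
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "date": when,
        })
    return items


def aggregate(feeds):
    """Merge several feeds (XML strings), newest item first, no duplicate links."""
    seen = set()
    merged = []
    for rss_xml in feeds:
        for entry in parse_items(rss_xml):
            key = entry["link"] or entry["title"]  # assumed duplicate key
            if key in seen:
                continue
            seen.add(key)
            merged.append(entry)
    merged.sort(
        key=lambda e: e["date"].timestamp() if e["date"] else 0.0,
        reverse=True,
    )
    return merged
```

Fetching the feeds and posting the changes somewhere are left out on purpose; the sketch covers only the aggregation step.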
-
jobs.perl.org needs a couple of changes: let’s start brainstorming!
For me as a freelancer it’s very clear:
- There must be separate feeds for freelance and salaried staff.
- There should be a way to comment on job postings. For example, if the original poster doesn’t close the job, it makes sense to get that information from somebody else, perhaps from somebody who was involved in some way. And yes, such comments cannot be anonymous.
- …
What else?
Yes, I tried to contact Ask Bjørn Hansen at ask(AT)perl.org before I started this here, but without success.
-
“Senior Software Engineer – Perl” / “Germany, Karlsruhe” / “Pay rate: 70,00 €/h” / CLOSED
“Closed”, so the recruiter says.
What a pity that comments on job postings are not possible on jobs.perl.org.
-
“Tour de Babel” by Steve Yegge
[…]
My whirlwind tour will cover C, C++, Lisp, Java, Perl, (all
languages we use at Amazon), Ruby (which I just plain like), and Python,
which is in there because — well, no sense getting ahead of ourselves,
now.
[…]
-
“A Quick Tour of Ruby” by Steve Yegge
Very nice to read.
Ruby used to annoy me simply by existing. I first heard about Ruby
years ago, in maybe 1997 or 1998, and folks said it was kind of like
Perl, but “cleaner”, whatever that meant. Ruby fans back then seemed
like a tiny minority of rebels and fringe separatists.
Ruby irked me primarily because we already had Perl, which was
working just fine thank you very much. And if for some strange reason
you didn’t like Perl, we had Python. If Perl fans were dog owners, and
Python fans were cat owners, then Ruby fans seemed like ferret owners.
They could go on and on about how much they adored their
beady-eyed albino stretch-limo rats, and how cute they were,
but we all knew they were just looking for attention. Nobody really
wants a pet rat. (Ferret owners will correct me and say they’re not
rodents; they’re more closely related to weasels and skunks. As if that
helps.) Regardless, I didn’t want to have anything to do with Ruby.
Last year, though, I was looking at a bunch of different languages
in the hopes of finding one to replace Perl for small- to medium-sized
tasks. One day my magic Perl dust had worn off rather suddenly, and I’d
joined the growing ranks of people who were beginning to notice the
emperor was a wee bit underdressed. But all the alternatives to Perl
looked pretty bad themselves, and I started judging languages by how far
I’d get into the reference manual before throwing it across the room.
I eventually picked up a Ruby book — …Steve …’s home page.
Personally, I still love both of them. I can afford that in the comp.lang.* area and in some others as well, though not where my girlfriend is concerned, of course.
I actually came across Steve when I was searching for elisp.
-
iPhone apps that I need sooner or later
- Telefonkarte (Calling Card): supports all sorts of call-through telecom providers; even your FRITZ!Box at home can serve as one, and this app does support it
- …
-
how to avoid accidentally quitting Firefox?
Is there a config variable for that?
There is a checkbox labeled “Warn me when closing multiple tabs”. That does the job.
-
first steps in IRC with pidgin
- “Add Account”: within pidgin, add one account with IRC as the protocol for each IRC server/user pair (e.g. irc.freenode.net) that you want to use
- “Join a Chat” (below Buddies): select the right account (i.e. one of your IRC protocol/server accounts), enter the channel (including the ‘#’), and leave the password blank. Here we are!
Did I mention recently how much I love my pidgin?
I did all this with a (fink) pidgin on my MacBook running Snow Leopard (OS X), but I have no doubt that it will also run on my openSUSE Samsung notebook.
-
networks and logos
Where do you get the personalised logos or badges of the various networks:
To be continued …