Category: web scraping
-
“Apache Nutch” is a highly extensible and scalable open-source web crawler software project.
- https://en.wikipedia.org/wiki/Apache_Nutch
- https://en.wikipedia.org/wiki/Web_crawler
- https://nutch.apache.org – official website
- https://wiki.apache.org/nutch – official wiki
- https://wiki.apache.org/nutch/Nutch2Crawling – “a description of the crawling jobs and field to database mappings”
- https://www.amazon.de/dp/1590596870 – apress: Building Search Applications with Lucene and Nutch
- https://www.amazon.de/dp/1783286857 – PACKT: Web Crawling and Data Mining with Apache Nutch (as of 2017-06-20, PACKT no longer lists this product, but it is still dated 2014-… and available “somewhere”)
- https://www.amazon.de/dp/1156025532 – LLC Books: Free Search Engine Software: Lucene, Apache Solr, Yacy, Dataparksearch, Nutch, Pubchemsr, Sciencenet, Xapian, Opensearchserver, Grub, Ht-Dig
-
Google+ Scraper – retrieve data from Google+ profiles with Node.js and CoffeeScript
fhemberger/googleplus-scraper – GitHub
A lot of JavaScript, CoffeeScript, Node.js, etc.
-
Firefox Add-on “Dafizilla Table2Clipboard”
Dafizilla Table2Clipboard :: Add-ons for Firefox
sources on Sourceforge.net
If you want to paste table data into Microsoft Excel or OpenOffice Calc with the row and column layout preserved, simply use Table2Clipboard.
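Outside the browser, the same idea can be sketched in a few lines of Python. This is only an illustration of the general technique (turning HTML table rows into tab-separated text, which spreadsheet programs split into cells on paste), not the add-on's actual code. The names `TableToTsv` and `html_table_to_tsv` are made up for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableToTsv(HTMLParser):
    """Collect the cell text of every <tr> row in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text chunks of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace so tabs or newlines inside a cell
            # cannot break the column layout.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_tsv(html):
    """Return the table rows as lines of tab-separated cell text."""
    parser = TableToTsv()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)
```

The returned string can be written to a file or placed on the clipboard with whatever tool is at hand, then pasted into a spreadsheet.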
-
Does harvesting HTML-obfuscated websites look like a horror to you?
I just completed two tasks in which I faced obfuscated CGI forms. It was quite a challenge, and at the outset I did not expect to succeed. But it’s done.
Now I am rather eager to apply my technology to interesting and lucrative tasks.
-
CPAN: Scrappy – The All Powerful Web Spidering, Scraping, Creeping Crawling Framework
Scrappy – metacpan.org: “Scrappy – The All Powerful Web Spidering, Scraping, Creeping Crawling Framework”