web scraping – wp.jochen.hayek.name/blog-en

https://en.wikipedia.org/wiki/Apache_Nutch
https://en.wikipedia.org/wiki/Web_crawler
https://nutch.apache.org – official website
https://wiki.apache.org/nutch – official wiki
https://wiki.apache.org/nutch/Nutch2Crawling – “a description of the crawling jobs and field to database mappings”
https://www.amazon.de/dp/1590596870 – apress: Building Search Applications with Lucene and Nutch
https://www.amazon.de/dp/1783286857 – PACKT: Web Crawling and Data Mining with Apache Nutch (2017-06-20: PACKT no longer lists this product, but it is still dated 2014-… and available “somewhere”)
https://www.amazon.de/dp/1156025532 – LLC Books: Free Search Engine Software: Lucene, Apache Solr, Yacy, Dataparksearch, Nutch, Pubchemsr, Sciencenet, Xapian, Opensearchserver, Grub, Ht://Dig

2017-06-20

VTI’s tutorial on “web scraping with LWP”

Perltuts.com | Interactive Perl tutorials

2012-08-03

Google+ Scraper – retrieve data from Google+ profiles with NodeJS and CoffeeScript

fhemberger/googleplus-scraper – GitHub

A lot of JavaScript, CoffeeScript, NodeJS, etc.

2012-01-20

Firefox Add-on “Dafizilla Table2Clipboard”

Dafizilla Table2Clipboard :: Add-ons for Firefox

sources on Sourceforge.net

If you want to paste table data into Microsoft Excel or OpenOffice Calc with the rows and columns preserved, simply use Table2Clipboard.

2012-01-20

Matthew P. Sisk’s project HTML-TableExtract

HTML-TableExtract

2012-01-06

HTML::TableExtract – metacpan.org

HTML::TableExtract – Perl module for extracting the content contained in tables within an HTML document, either as text or encoded element trees. – metacpan.org
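
To give an idea of what "extracting the content contained in tables" as text involves, here is a minimal sketch in Python, using only the standard library's `html.parser`. It is not the HTML::TableExtract API (that is a Perl module); the names `TableTextExtractor` and `extract_tables` are made up for this illustration, and the sketch assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in the given HTML string."""
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Calling `extract_tables(page_source)` on a fetched page returns one list of rows per table; picking tables by header names or by depth and position, which the Perl module is built for, would have to be layered on top.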

2012-01-06

does harvesting HTML-obfuscated websites look like a horror to you?

I just completed two tasks in which I faced obfuscated CGI forms. It was quite a challenge, and at the outset I did not expect to succeed. But it’s done.

Now I am rather eager to apply my technology to interesting and lucrative tasks.

2012-01-05

quora.com/Web-Scraping

Web Scraping – Quora

2011-11-17

CPAN: Scrappy – The All Powerful Web Spidering, Scraping, Creeping Crawling Framework

Scrappy – metacpan.org: “Scrappy – The All Powerful Web Spidering, Scraping, Creeping Crawling Framework”

2011-10-08

Category: web scraping

automating & scraping the Web with JavaScript and Puppeteer

“Apache Nutch” is a highly extensible and scalable open source web crawler software project