Category: web scraping
-
Automating & scraping the Web with JavaScript and Puppeteer
https://codeburst.io/a-guide-to-automating-scraping-the-web-with-javascript-chrome-puppeteer-node-js-b18efb9e9921
https://developers.google.com/web/tools/puppeteer/
https://github.com/puppeteer/puppeteer/
-
“Apache Nutch” is a highly extensible and scalable open source web crawler software project
https://en.wikipedia.org/wiki/Apache_Nutch
https://en.wikipedia.org/wiki/Web_crawler
https://nutch.apache.org – official website
https://wiki.apache.org/nutch – official wiki
https://wiki.apache.org/nutch/Nutch2Crawling – “a description of the crawling jobs and field to database mappings”
https://www.amazon.de/dp/1590596870 – Apress: Building Search Applications with Lucene and Nutch
https://www.amazon.de/dp/1783286857 – Packt: Web Crawling and Data Mining with Apache Nutch (2017-06-20: Packt no longer lists this product, but it is still dated 2014-… and available “somewhere”)
https://www.amazon.de/dp/1156025532 –…
-
VTI’s tutorial on “web scraping with LWP”
Perltuts.com | Interactive Perl tutorials
-
Google+ Scraper – retrieve data from Google+ profiles with NodeJS and CoffeeScript
fhemberger/googleplus-scraper – GitHub
A lot of JavaScript, CoffeeScript, NodeJS, etc.
-
Firefox Add-on “Dafizilla Table2Clipboard”
Dafizilla Table2Clipboard :: Add-ons for Firefox
Sources on SourceForge.net
If you want to paste data into Microsoft Excel or OpenOffice Calc with the correct row and column layout, simply use Table2Clipboard.
-
HTML::TableExtract – metacpan.org
HTML::TableExtract – Perl module for extracting the content contained in tables within an HTML document, either as text or encoded element trees. – metacpan.org
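To illustrate the idea of pulling table contents out of an HTML document as text, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not the HTML::TableExtract API, only a rough analogue of the technique. The names `TableTextExtractor`, `extract_tables` and `SAMPLE` are made up for this example, and nested tables are deliberately not handled.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table and row.

    Hypothetical helper for illustration; nested tables are not supported.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def extract_tables(html):
    """Return all tables found in `html` as lists of rows of cell strings."""
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample input for demonstration.
SAMPLE = (
    "<table>"
    "<tr><th>Tool</th><th>Language</th></tr>"
    "<tr><td>Puppeteer</td><td>JavaScript</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    for row in extract_tables(SAMPLE)[0]:
        print("\t".join(row))
```

Joining the cells with tabs, as in the last lines, gives a form that pastes cleanly into a spreadsheet. For real-world pages with nested or irregular tables, a dedicated module such as HTML::TableExtract is likely the better choice.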
-
Does harvesting HTML-obfuscated websites look like a horror to you?
I just completed two tasks in which I faced obfuscated CGI forms. It was quite a challenge, and at the start I did not expect to succeed. But it’s done. Now I am eager to apply my technique to interesting and lucrative tasks.
-
CPAN: Scrappy – The All Powerful Web Spidering, Scraping, Creeping Crawling Framework
Scrappy – metacpan.org: “Scrappy – The All Powerful Web Spidering, Scraping, Creeping Crawling Framework”