“Apache Nutch” is a highly extensible and scalable open source web crawler software project

https://en.wikipedia.org/wiki/Apache_Nutch
https://en.wikipedia.org/wiki/Web_crawler
https://nutch.apache.org – official website
https://wiki.apache.org/nutch – official wiki
https://wiki.apache.org/nutch/Nutch2Crawling – “a description of the crawling jobs and field to database mappings”
https://www.amazon.de/dp/1590596870 – apress: Building Search Applications with Lucene and Nutch
https://www.amazon.de/dp/1783286857 – PACKT: Web Crawling and Data Mining with Apache Nutch (2017-06-20: PACKT no longer lists this product, but it is still dated 2014-… and available “somewhere”)
https://www.amazon.de/dp/1156025532 –…

web harvesting and my toolkit JHwis

Years ago I implemented a toolkit that I call JHwis. Now and then I think I should have done more advertising for it. I have been using software created with that toolkit for downloading bank account statements and other stuff for years now. I would like to prove to you that it’s also very well suited for…