wp.jochen.hayek.name/blog-en

Category: pdftohtml

pdfguru – subscription trap – I got my money back
I used pdfguru a couple of times to get “image PDFs” OCR-ed. Nice service actually. It cost me € 0.99 per week. Sort of really, really cheap.

But then they charged me € 49.99, and I found out that they thought I had opted for an ongoing monthly subscription. That’s called a subscription trap, in German: “Abofalle”. I searched the web.
- https://www.scam-detector.com/validator/pdfguru-com-review/ – go to the bottom of that page!
That’s mean!

I contacted my bank. OMG, what a painful process altogether!

I also contacted pdfguru and asked them to reimburse me. They replied that this was impossible: once charged, they cannot reimburse me.

My bank accepted the “plea” and started the process. Several days later, pdfguru sent me a mail saying I had been reimbursed. A few days later the transfer showed up in my cash account. I am happy.

BTW: If you are looking for a similar (but …) service, look for “pdf24” and then for OCR. I keep using their web page for that service.
2026-02-06
PDF OCR
- https://tools.pdf24.org/en/ocr-pdf
  - asks the user which language is involved – and that’s important
- https://smallpdf.com/pdf-ocr
  - does not ask the user which language is involved
2026-01-30
“pdftohtml -xml” – only the poppler suite supports “-xml”
- https://forum.xpdfreader.com/viewtopic.php?f=3&t=41211
- only the poppler toolset (which is derived from xpdf) has “pdftohtml -xml”; xpdf’s own pdftohtml does not
- https://en.wikipedia.org/wiki/Poppler_(software)
- https://poppler.freedesktop.org
- https://anongit.freedesktop.org/git/poppler/poppler.git
One of my most favourite tools.

I have been using it for years now – on a daily basis. (I came across it in my local Ruby user group many years ago.)

Of course it only works on PDFs that actually contain text.

Luckily there are tools and services that “OCR” your “image PDF”, in case your PDF file does not actually contain the text it displays.
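
For illustration, here is a minimal Python sketch of how the output of “pdftohtml -xml” can be read. It assumes poppler's pdftohtml is on the PATH and that the output has the usual shape (a pdf2xml root with page elements containing text elements). The function names and the sample document are made up for this example. If no text elements come back, the PDF is probably an “image PDF” that needs OCR first.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdf_to_xml(pdf_path):
    """Run poppler's pdftohtml and return its XML output as bytes."""
    # -xml: XML output, -i: ignore images, -stdout: write to standard output
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        check=True,
        capture_output=True,
    )
    return result.stdout


def text_rows(xml_source):
    """Return (page, top, left, text) for every non-empty <text> element."""
    root = ET.fromstring(xml_source)
    rows = []
    for page in root.iter("page"):
        number = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() also collects text inside nested <b> or <i> elements
            content = "".join(node.itertext()).strip()
            if content:
                rows.append(
                    (number, int(node.get("top")), int(node.get("left")), content)
                )
    return rows


# Hand-written sample in the shape of pdftohtml's XML output (not real data).
SAMPLE = """<pdf2xml producer="poppler" version="0.86.1">
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<fontspec id="0" size="12" family="Helvetica" color="#000000"/>
<text top="100" left="80" width="120" height="14" font="0">Invoice total</text>
<text top="100" left="400" width="60" height="14" font="0"><b>49.99</b></text>
</page>
</pdf2xml>"""

rows = text_rows(SAMPLE)
has_text = bool(rows)  # False would suggest an image-only PDF
for row in rows:
    print(row)
```

With a real file, `text_rows(pdf_to_xml("bill.pdf"))` would do the same; the file name is just a placeholder.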

I am editing the XML result in Emacs with nXML mode, and I developed a RELAX NG grammar for context-sensitive editing of such XML files.
I am annotating these XML files using specific XML comments.
For PDF files from several providers I created scripts for automated annotation. (Best case: find lvalue and rvalue together. Most of the time I find at least lvalue.)
I created scripts to extract the details from those annotations. And they create text that resembles my (personal home-made / home-maintained) bank statements – so I can “reconcile” them.
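
A minimal Python sketch of what such an extraction step could look like. The comment format used here (“annotation: key = value”) is purely an assumption for illustration, as are the function name, the keys and the sample values. The actual annotation format is not described in this post.

```python
from xml.dom import minidom

PREFIX = "annotation:"  # hypothetical marker, not the real annotation format


def extract_annotations(xml_source):
    """Collect 'key = value' pairs from XML comments that start with PREFIX."""
    doc = minidom.parseString(xml_source)
    found = {}

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == child.COMMENT_NODE:
                body = child.data.strip()
                if body.startswith(PREFIX):
                    key, sep, value = body[len(PREFIX):].partition("=")
                    if sep:  # ignore comments without a '='
                        found[key.strip()] = value.strip()
            else:
                walk(child)

    walk(doc)
    return found


# Hand-written sample: pdftohtml-style XML with two made-up annotations.
ANNOTATED = """<pdf2xml producer="poppler" version="0.86.1">
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="80" width="120" height="14" font="0">Invoice total</text>
<text top="100" left="400" width="60" height="14" font="0">49.99</text>
<!-- annotation: amount = 49.99 -->
<text top="130" left="80" width="120" height="14" font="0">Invoice date</text>
<!-- annotation: date = 2021-01-15 -->
</page>
</pdf2xml>"""

annotations = extract_annotations(ANNOTATED)
# One line in a statement-like layout, ready to compare with a bank statement.
statement_line = "{date} {amount}".format(**annotations)
print(statement_line)
```

minidom is used here because it keeps comments in the parsed tree, so the annotations can stay right next to the text they describe.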

I am processing every bill PDF like that.

I am processing every contract PDF like that. I guess you understand how much better it is to read and annotate a text file in place instead of keeping notes outside the source. Yes, that’s of course like inline documentation within programming language source files.

Just in case anybody reads this and finds it useful: Of course I am able and most willing to provide far more details.
2021-02-02