wp.jochen.hayek.name/blog-en

on PDF

Nowadays, on the Web or through e-mail, you receive more and more documents as PDF files instead of on paper.

Roughly speaking, PDF documents are expected to display the same way on every computer
platform (as opposed to documents created by the usual word processing
software). This is regarded as a major advantage of PDF.

PDF vs. fonts vs. platform (in)dependence vs. resizability/scalability

Whenever a PDF document makes use of outline fonts and stroke fonts as opposed to bitmap fonts (see the Wikipedia article on computer fonts), you can rescale your document to different sizes without any loss of quality in the fonts used. This is generally considered another major advantage.
But computer fonts are not in the public domain, so a PDF document that does not carry its fonts with it may be displayed with different fonts on different platforms, depending on which fonts are available there.

So what can we do about the platform dependency stemming from fonts?
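
One common answer is to embed the fonts in the PDF file itself, so that the viewer does not have to fall back on whatever fonts the platform offers. As a rough sketch (not a robust tool), the following Python snippet checks whether raw PDF bytes mention one of the keys /FontFile, /FontFile2 or /FontFile3, under which a font descriptor references an embedded font program. The function name and the two sample fragments are made up for illustration, and the check will miss anything stored inside compressed object streams.

```python
import re

# Keys under which a PDF font descriptor references an embedded font program.
EMBEDDED_FONT_KEYS = re.compile(rb"/FontFile[23]?\b")


def has_embedded_font_marker(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes mention an embedded font program.

    Naive by design: objects inside compressed object streams are not seen.
    """
    return EMBEDDED_FONT_KEYS.search(pdf_bytes) is not None


# Two hypothetical font descriptor fragments, made up for illustration only.
embedded = (
    b"<< /Type /FontDescriptor /FontName /ABCDEF+DejaVuSans /FontFile2 12 0 R >>"
)
not_embedded = b"<< /Type /FontDescriptor /FontName /Helvetica >>"

print(has_embedded_font_marker(embedded))
print(has_embedded_font_marker(not_embedded))
```

For a real file you would pass in the result of open(path, "rb").read(); a negative answer from this sketch only means that no marker was visible in the uncompressed parts of the file.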

Now you know: different kinds of PDF documents come with different advantages and also disadvantages.

I am interested here in PDF documents that are not rendered into “one bitmap per page” but rather contain the source document’s text. Extracting that text as plain text is more or less a piece of cake, and software for
this purpose already exists.
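
To give a feeling for why this is feasible, here is a minimal sketch in Python. It assumes an uncompressed page content stream in which text is shown with the Tj operator and simple literal strings. Real PDF files usually compress their streams and use font-specific encodings, which is exactly what the existing extraction software takes care of. The function name and the sample stream are my own illustration, not taken from any particular tool.

```python
import re

# A literal string operand followed by the Tj ("show text") operator.
# Backslash-escaped characters are tolerated; unescaped nested parentheses are not.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_shown_text(content_stream: bytes) -> list[str]:
    """Collect the strings shown by Tj operators, in stream order."""
    texts = []
    for match in TJ_PATTERN.finditer(content_stream):
        raw = match.group(1)
        # Undo the escaping of parentheses and backslashes.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        texts.append(raw.decode("latin-1"))
    return texts


# A made-up, uncompressed content stream for one page.
sample_stream = rb"""BT
/F1 12 Tf
72 720 Td
(Hello, PDF) Tj
0 -14 Td
(second line \(escaped\)) Tj
ET
"""

print(extract_shown_text(sample_stream))
```

The sketch only recovers the strings themselves; positions, fonts and reading order across columns are ignored.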

PDF basics

Before I dive with you into what information we want to extract from PDF files, I want to explain PDF a little.

I am honestly not too deep into PDF, but I
understand it as an advanced and optimized version of PostScript. My limited
knowledge of PostScript is as follows (a slightly longer version can be found in
the Wikipedia article):

Now you have an idea of what PDF looks like, and you may have a vague idea of what is possible with PDF and what isn’t.
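
To make that idea a little more concrete, the following Python sketch assembles a tiny one-page PDF by hand: a header line, five numbered objects (catalog, page tree, page, content stream, font), a cross-reference table with the byte offset of each object, and a trailer. The function name is hypothetical, the text is assumed to contain no parentheses or backslashes, and the font is the standard Helvetica referenced by name only, so it is not embedded.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF: header, objects, xref table, trailer."""
    # Page content: begin text, select font, move to position, show text, end text.
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed by xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


pdf = build_minimal_pdf("Hello")
print(pdf.decode("latin-1"))
```

Printing the result shows that such a file is mostly readable text, with the content stream using a small stack-style operator language reminiscent of PostScript.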
