{"id":2068,"date":"2010-10-05T12:42:00","date_gmt":"2010-10-05T12:42:00","guid":{"rendered":"http:\/\/www.b.shuttle.de\/hayek\/Hayek\/Jochen\/wp\/blog-en\/2010\/10\/05\/on-pdf\/"},"modified":"2010-10-05T12:42:00","modified_gmt":"2010-10-05T12:42:00","slug":"on-pdf","status":"publish","type":"post","link":"https:\/\/wp.jochen.hayek.name\/blog-en\/2010\/10\/05\/on-pdf\/","title":{"rendered":"on PDF"},"content":{"rendered":"<p>Nowadays, on the Web or through e-mail, you are getting more and more documents as PDF files instead of on paper.<\/p>\n<p>Roughly speaking, PDF documents are expected to display the same way on every computer platform (as opposed to documents created by ordinary word processing software). This is regarded as a major advantage of PDF.<\/p>\n<p><span><b>PDF vs. fonts vs. platform (in)dependence vs. resizability\/scalability<\/b><\/span><\/p>\n<p>Whenever a PDF document uses outline fonts or stroke fonts as opposed to bitmap fonts (see the Wikipedia <a href=\"http:\/\/en.wikipedia.org\/wiki\/Computer_font\">article on computer fonts<\/a>), you can rescale the document to different sizes without the fonts losing quality. 
This is generally considered another major advantage.<br \/>\nBut most computer fonts are not in the public domain, so unless a PDF document carries its fonts with it, each platform displays it with whatever fonts are available there.<\/p>\n<p>So what can we do against platform dependency stemming from fonts?<\/p>\n<ul>\n<li>Include the fonts: that&#8217;s the approach used by <a href=\"http:\/\/en.wikipedia.org\/wiki\/PDF\/A\">PDF\/A<\/a>.<br \/><a href=\"http:\/\/en.wikipedia.org\/wiki\/PDF\/A\">PDF\/A<\/a> is used especially where documents must remain available after many years, as in document archives.<br \/>The major downside of this approach: <a href=\"http:\/\/en.wikipedia.org\/wiki\/PDF\/A\">PDF\/A<\/a> documents can be considerably bigger than ordinary PDF documents, because storing the fonts within them takes a lot of space.<\/li>\n<li>Another approach is to render text and fonts into ready-made bitmaps.<br \/>Of course, documents of this kind display best when the pixels of the document map 1:1 to the pixels of your screen or your printer output.<br \/>Any resizing or rescaling results in poor quality.<br \/>And, as you will appreciate, no text (as text) is left in such a PDF document, so you will not be able to extract any text from it.<\/li>\n<\/ul>\n<p>\nNow you know: different kinds of PDF documents come with different advantages and disadvantages.<\/p>\n<p>I am interested here in PDF documents that are <b>not<\/b> rendered into &#8220;one bitmap per page&#8221; but rather contain the source document&#8217;s <b>text<\/b>. 
Extracting that text as plain text is fairly easy, and software for this purpose already exists.<\/p>\n<p><span><b>PDF basics<\/b><\/span><\/p>\n<p>Before I dive with you into what information we want to extract from PDF files, I want to explain PDF a little.<\/p>\n<p>I am honestly not too deep into PDF, but I understand it as an advanced and optimized version of <a href=\"http:\/\/en.wikipedia.org\/wiki\/PostScript\">PostScript<\/a>. Here is the little I know about PostScript (you can find a slightly longer version <a href=\"http:\/\/en.wikipedia.org\/wiki\/PostScript#The_language\">here<\/a> in the Wikipedia article):<\/p>\n<ul>\n<li>It&#8217;s a stack-based programming language that, like Forth, uses reverse Polish notation.<\/li>\n<li>It has data structures like arrays and dictionaries, but nothing more abstract than that.<\/li>\n<li>Subprograms are called, or rather regarded as, operators of the stack machine.<\/li>\n<li>Some relevant details may be encoded in operator names.<\/li>\n<li>Some other relevant details (like page numbers) are encoded in comment lines; see the article on the PostScript <a href=\"http:\/\/en.wikipedia.org\/wiki\/Document_Structuring_Conventions\">Document Structuring Conventions<\/a>. I have no clue what corresponds to that in PDF. Maybe there are language elements for that.<\/li>\n<\/ul>\n<p>Now you have an idea of what PDF looks like, and you may have a vague idea of what is possible with PDF and what isn&#8217;t.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Nowadays, on the Web or through e-mail, you are getting more and more documents as PDF files instead of on paper. Roughly speaking, PDF documents are expected to display the same way on every computer platform (as opposed to documents created by ordinary word processing software). 
This is regarded as a major advantage of [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false,"_share_on_mastodon":"0"},"categories":[666],"tags":[],"class_list":["post-2068","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"share_on_mastodon":{"url":"","error":""},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paO0kP-xm","jetpack_likes_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/2068","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/comments?post=2068"}],"version-history":[{"count":0,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/2068\/revisions"}],"wp:attachment":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/media?parent=2068"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wp.
jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/categories?post=2068"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/tags?post=2068"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}