{"id":1745,"date":"2011-05-20T11:05:00","date_gmt":"2011-05-20T11:05:00","guid":{"rendered":"http:\/\/www.b.shuttle.de\/hayek\/Hayek\/Jochen\/wp\/blog-en\/2011\/05\/20\/extracting-infos-from-a-rather-detailed-pdf-from-a-software-developers-point-of-view\/"},"modified":"2011-05-20T11:05:00","modified_gmt":"2011-05-20T11:05:00","slug":"extracting-infos-from-a-rather-detailed-pdf-from-a-software-developers-point-of-view","status":"publish","type":"post","link":"https:\/\/wp.jochen.hayek.name\/blog-en\/2011\/05\/20\/extracting-infos-from-a-rather-detailed-pdf-from-a-software-developers-point-of-view\/","title":{"rendered":"extracting infos from a rather detailed PDF (from a software developer&#8217;s point of view)"},"content":{"rendered":"<p>When I need to access a PDF, I prefer to read the XML that &#8220;<span>pdftohtml -xml<\/span>&#8221; creates for it. Although XML::Simple lacks some features I miss, I find that module rather convenient.<\/p>\n<p>Think of a pay slip as a PDF. It has quite a regular structure. (Of course, you might also want to receive an XML representation of it directly from the salary software, but that&#8217;s another issue. In this particular case that looked rather hard to achieve.)<br \/>\nThere are <i>labels<\/i> and there are <i>values<\/i>. I want to access <i>values<\/i> by their <i>labels<\/i>. Therefore I need a specification describing where the value belonging to a specific label is located relative to that label. I do this by giving a <i><u>relative<\/u> rectangular range<\/i>\u00a0\/ <i>region<\/i>. All text strings provided by &#8220;<span>pdftohtml -xml<\/span>&#8221; (i.e. the <span>text<\/span> elements) get stored in a matrix (X<span>\u00d7Y<\/span>). So far there have been no big obstacles to accessing the value for a label by scanning the matrix within that relative rectangular region.<br \/>\nUsually I neither want nor need to specify where the label is located on the page. 
Why would you want to specify that, as long as it&#8217;s not necessary?<br \/>\nBut certain labels appear more than once. In that case I add the absolute rectangular region of the label. Of course, this spec is as terse as possible. In the pdftohtml XML, a page has its origin at the upper left corner (native PDF coordinates start at the lower left). So if the label is just above y=500, you need to give neither the upper left corner nor the lower right corner of the respective rectangular region; a bound on y suffices. This makes the label\/value spec just as verbose as needed.<br \/>\n(Right, I know a picture would help: <a href=\"http:\/\/en.wikipedia.org\/wiki\/A_picture_is_worth_a_thousand_words\">A picture is worth a thousand words<\/a>.)<\/p>\n<p>My software is implemented in Perl, and so far the label\/value specs are done programmatically. Of course, I would like to have a spec as XML or as a DSL, but I am not there yet.<\/p>\n<p>To be continued \u2026<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If I access PDF, I rather read the XML created by &#8220;pdftohtml -xml&#8221; for a PDF file. Although there are features that I miss with XML::Simple, I find that module rather convenient. Think of a pay slip as PDF. It has quite a regular structure. 
(Of course, you might also want to receive an XML [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_share_on_mastodon":"0"},"categories":[666],"tags":[],"class_list":["post-1745","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"share_on_mastodon":{"url":"","error":""},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paO0kP-s9","jetpack_likes_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/1745","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/comments?post=1745"}],"version-history":[{"count":0,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/1745\/revisions"}],"wp:attachment":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/media?parent=1745"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/
wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/categories?post=1745"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/tags?post=1745"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}