{"id":816,"date":"2012-01-07T23:24:00","date_gmt":"2012-01-07T23:24:00","guid":{"rendered":"http:\/\/www.b.shuttle.de\/hayek\/Hayek\/Jochen\/wp\/blog-en\/2012\/01\/07\/table_pdf2csv-pl-extracting-tables-from-pdf-saving-them-as-csv\/"},"modified":"2012-01-07T23:24:00","modified_gmt":"2012-01-07T23:24:00","slug":"table_pdf2csv-pl-extracting-tables-from-pdf-saving-them-as-csv","status":"publish","type":"post","link":"https:\/\/wp.jochen.hayek.name\/blog-en\/2012\/01\/07\/table_pdf2csv-pl-extracting-tables-from-pdf-saving-them-as-csv\/","title":{"rendered":"table_pdf2csv.pl : extracting tables from PDF, saving them as CSV"},"content":{"rendered":"<ul>\n<li>I leave the PDF extraction step to &#8220;<i><a href=\"http:\/\/pdftohtml.sourceforge.net\/\">pdftohtml<\/a> -xml<\/i>&#8221;.<\/li>\n<li>My Perl script tells you at which &#8220;<i>physical columns<\/i>&#8221; text is found within the PDF file.<\/li>\n<li>You choose which &#8220;<i>physical columns<\/i>&#8221; really make sense to you as <i>logical column<\/i> starters.<\/li>\n<li>Now you run my Perl script with those few relevant physical columns specified,<br \/>\nand it creates a CSV file for you.<\/li>\n<li>For each <i>logical row<\/i>, a few <i>physical rows<\/i> get created.<\/li>\n<li>If you want, you can merge cells from neighboring rows into <i>logical cells<\/i>;<br \/>\nyou can use <i>LibreOffice Calc<\/i>, <i>OpenOffice Calc<\/i>, or <i>Excel<\/i> for this step.<\/li>\n<\/ul>\n<div>Does this sound interesting to you?<\/div>\n<div>Update 2015-05-25: Uploaded the Perl script and its shell script wrapper to\u00a0<a href=\"https:\/\/github.com\/JochenHayek\/misc\">https:\/\/github.com\/JochenHayek\/misc<\/a>.<\/div>\n","protected":false},"excerpt":{"rendered":"<p>I leave the PDF extraction step to &#8220;pdftohtml -xml&#8221;. My Perl script tells you at which &#8220;physical columns&#8221; text is found within the PDF file. 
You choose which &#8220;physical columns&#8221; really make sense to you as logical column starters. Now you run my Perl script with those few relevant physical columns specified, and it creates [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_share_on_mastodon":"0"},"categories":[75],"tags":[],"class_list":["post-816","post","type-post","status-publish","format-standard","hentry","category-csv"],"share_on_mastodon":{"url":"","error":""},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paO0kP-da","jetpack_likes_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/816","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/comments?post=816"}],"version-history":[{"count":0,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/816\/revisions"}],"wp:attachment":[{"href":"https:\/\/wp.joche
n.hayek.name\/blog-en\/wp-json\/wp\/v2\/media?parent=816"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/categories?post=816"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/tags?post=816"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}