{"id":7586,"date":"2017-09-27T11:14:44","date_gmt":"2017-09-27T09:14:44","guid":{"rendered":"http:\/\/www.b.shuttle.de\/hayek\/hayek\/jochen\/wp\/blog-en\/?p=7586"},"modified":"2017-09-27T11:14:44","modified_gmt":"2017-09-27T09:14:44","slug":"pdftohtml-xml","status":"publish","type":"post","link":"https:\/\/wp.jochen.hayek.name\/blog-en\/2017\/09\/27\/pdftohtml-xml\/","title":{"rendered":"&#8220;pdftohtml&#8221; \u2013 the one PDF utility I cannot &#8220;be&#8221; without"},"content":{"rendered":"<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Poppler_(software)\">https:\/\/en.wikipedia.org\/wiki\/Poppler_(software)<\/a><\/li>\n<li>\u2013 a very nice description of the\u00a0poppler-utils<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Xpdf\">https:\/\/en.wikipedia.org\/wiki\/Xpdf<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/Entware-ng\/Entware-ng\">https:\/\/github.com\/Entware-ng\/Entware-ng<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/Entware-ng\/Entware-ng\/wiki\">https:\/\/github.com\/Entware-ng\/Entware-ng\/wiki<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/Entware-ng\/Entware-ng\/wiki\/Install-on-Synology-NAS\">https:\/\/github.com\/Entware-ng\/Entware-ng\/wiki\/Install-on-Synology-NAS<\/a><\/li>\n<\/ul>\n<p>To be precise, I mean &#8220;<em>pdftohtml <span style=\"text-decoration: underline\">-xml<\/span><\/em>&#8221;, which converts a PDF to XML. This is my usual command line:<\/p>\n<p><code>$ pdftohtml -xml -i -nomerge -hidden FILE.pdf<\/code><\/p>\n<p>or, with an explicit output file name:<\/p>\n<p><code>$ pdftohtml -xml -i -nomerge -hidden FILE.pdf FILE.pdftohtml.xml<\/code><\/p>\n<p>Sometimes I need to run &#8220;pdftohtml -xml&#8221; on the command line, directly on a file that lives on my NAS; it really is an essential utility for me.<\/p>\n<p>CAVEAT: Be sure you have <em>poppler-utils<\/em> installed, not <em>xpdf<\/em> \u2013\u00a0<em>xpdf&#8217;s<\/em> <em>pdftohtml<\/em> is badly outdated (their version numbering schemes 
are different):<\/p>\n<p><code>root@DiskStation:~# \/opt\/bin\/opkg install xpdf<br \/>\nInstalling xpdf (4.00-1) to root...<br \/>\nDownloading http:\/\/pkg.entware.net\/binaries\/x86-64\/xpdf_4.00-1_x86-64.ipk<br \/>\nConfiguring xpdf.<br \/>\nroot@DiskStation:~# \/opt\/bin\/opkg search \/opt\/bin\/pdftohtml<br \/>\nxpdf - 4.00-1<br \/>\nroot@DiskStation:~# \/opt\/bin\/pdftohtml --help<br \/>\npdftohtml version 4.00<br \/>\nroot@DiskStation:~# \/opt\/bin\/opkg remove xpdf<br \/>\nRemoving package xpdf from root...<br \/>\nroot@DiskStation:~# \/opt\/bin\/opkg install poppler-utils<br \/>\nInstalling poppler-utils (0.53.0-1) to root...<br \/>\nDownloading http:\/\/pkg.entware.net\/binaries\/x86-64\/poppler-utils_0.53.0-1_x86-64.ipk<br \/>\nConfiguring poppler-utils.<br \/>\nroot@DiskStation:~# \/opt\/bin\/opkg search \/opt\/bin\/pdftohtml<br \/>\npoppler-utils - 0.53.0-1<br \/>\nroot@DiskStation:~# \/opt\/bin\/pdftohtml --help<br \/>\npdftohtml version 0.53.0<\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/en.wikipedia.org\/wiki\/Poppler_(software) \u00a0\u2013 very nice description of the\u00a0poppler-utils https:\/\/en.wikipedia.org\/wiki\/Xpdf https:\/\/github.com\/Entware-ng\/Entware-ng https:\/\/github.com\/Entware-ng\/Entware-ng\/wiki https:\/\/github.com\/Entware-ng\/Entware-ng\/wiki\/Install-on-Synology-NAS I actually mean &#8220;pdftohtml -xml&#8221; \u2013 which creates XML from PDF, and this is my command line: $ pdftohtml -xml -i -nomerge -hidden FILE.pdf resp.: $ pdftohtml -xml -i -nomerge -hidden FILE.pdf FILE.pdftohtml.xml Sometimes I need to run &#8220;pdftohtml -xml&#8221; (on the command line) 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_share_on_mastodon":"0"},"categories":[255,397,575],"tags":[],"class_list":["post-7586","post","type-post","status-publish","format-standard","hentry","category-ipkg","category-opkg","category-synology"],"share_on_mastodon":{"url":"","error":""},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paO0kP-1Ym","jetpack_likes_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/7586","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/comments?post=7586"}],"version-history":[{"count":0,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/7586\/revisions"}],"wp:attachment":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/media?parent=7586"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wp.jochen.hay
ek.name\/blog-en\/wp-json\/wp\/v2\/categories?post=7586"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/tags?post=7586"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}