{"id":6458,"date":"2016-04-20T06:00:47","date_gmt":"2016-04-20T04:00:47","guid":{"rendered":"http:\/\/www.b.shuttle.de\/hayek\/hayek\/jochen\/wp\/blog-en\/?p=6458"},"modified":"2022-12-23T12:27:13","modified_gmt":"2022-12-23T11:27:13","slug":"using-xpath-on-non-xml-html-how-to-tidy-dirty-html","status":"publish","type":"post","link":"https:\/\/wp.jochen.hayek.name\/blog-en\/2016\/04\/20\/using-xpath-on-non-xml-html-how-to-tidy-dirty-html\/","title":{"rendered":"using XPath on non-XML HTML \u2013 how to tidy dirty HTML?"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Scraping HTML with XPath is far nicer than low-level text processing. But how do you proceed if your XPath tool cannot deal with the HTML because it is not conformant XHTML, that is, not well-formed XML?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">My XPath tool is XMLStarlet:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/XMLStarlet\">https:\/\/en.wikipedia.org\/wiki\/XMLStarlet<\/a><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">It can also reformat HTML so that XPath expressions can be applied. I pipe &#8220;dirty HTML&#8221; through this command line:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ xmlstarlet fo --html --recover 2&gt;\/dev\/null<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Scraping HTML with XPath is far nicer than low-level text processing. But how do you proceed if your XPath tool cannot deal with the HTML because it is not conformant XHTML, that is, not well-formed XML? My XPath tool is XMLStarlet: It can also reformat HTML so that XPath expressions can be applied. 
I [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":true,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"using XPath on non-XML HTML \u2013 how to tidy dirty HTML?","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_share_on_mastodon":"0"},"categories":[851,732],"tags":[],"class_list":["post-6458","post","type-post","status-publish","format-standard","hentry","category-xmlstarlet","category-xpath"],"share_on_mastodon":{"url":"","error":""},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paO0kP-1Ga","jetpack_likes_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/6458","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/comments?post=6458"}],"version-history":[{"count":2,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/6458\/revisions"}],"predecessor-version":[{"id":12252,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/posts\/6458\/revisions\/12252"}],"wp:attachme
nt":[{"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/media?parent=6458"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/categories?post=6458"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.jochen.hayek.name\/blog-en\/wp-json\/wp\/v2\/tags?post=6458"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}