wp.jochen.hayek.name/blog-en

using XPath on non-XML HTML – how to tidy dirty HTML?

Scraping HTML with XPath is far nicer than low-level text processing. But how do you proceed if your XPath tool cannot deal with the HTML because it is not valid XHTML, i.e. not well-formed XML?

My XPath tool is XMLStarlet.

It can also reformat HTML so that XPath expressions can be applied. I pipe “dirty HTML” through this command line:

$ xmlstarlet fo --html --recover 2>/dev/null
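
To show how the tidied output can then be queried, here is a minimal sketch that chains the command above with an XPath selection. The function name extract_links, the sample markup and the expression //a are made up for illustration only, and the sketch assumes xmlstarlet is on the PATH:

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not part of XMLStarlet):
# reads dirty HTML on stdin, repairs it, and prints the href of every <a>.
extract_links() {
    # Step 1: parse as HTML, recover what can be recovered, hide parser warnings.
    xmlstarlet fo --html --recover 2>/dev/null |
        # Step 2: for each <a> element, print its href attribute and a newline.
        xmlstarlet sel -t -m '//a' -v '@href' -n
}

# Made-up sample input: <p>, <br> and the second <a> are never closed.
printf '%s\n' '<html><body><p>unclosed<a href="/one">first</a><br><a href="/two">second</body></html>' |
    extract_links
```

The second xmlstarlet call only works because the first one has already turned the tag soup into well-formed XML; feeding the sample line directly into xmlstarlet sel would normally fail with parse errors.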