using XPath on non-XML HTML – how to tidy dirty HTML?

johayek

10 years ago

Scraping HTML with XPath is far nicer than low-level text processing. But how do you proceed if your XPath tool cannot handle the HTML because it is neither valid XHTML nor otherwise well-formed XML?

My XPath tool is XMLStarlet:

https://en.wikipedia.org/wiki/XMLStarlet

It can also reformat HTML so that XPath expressions can be applied to it. I pipe the “dirty HTML” through this command:

$ xmlstarlet fo --html --recover 2>/dev/null
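To show where this fits in a scraping pipeline, here is a minimal sketch that feeds the repaired output into a second XMLStarlet call, which evaluates the XPath expression. The function name `tidy_and_select` and the sample markup are made up for illustration; the sketch assumes the binary is installed as `xmlstarlet` (some distributions name it `xml`).

```shell
# Hypothetical helper: read dirty HTML on stdin, print the values
# matched by the XPath expression given as the first argument.
tidy_and_select() {
  # Step 1: parse as HTML, recover from errors, emit well-formed XML.
  # Step 2: evaluate the XPath expression against the repaired document.
  xmlstarlet fo --html --recover 2>/dev/null \
    | xmlstarlet sel -t -v "$1" -n
}

# Made-up sample input: unclosed <p> and <li> tags, no XML declaration.
dirty_html='<html><head><title>Demo page</title></head><body><p>First<p>Second<ul><li>one<li>two</ul></body></html>'

# Extract the page title from the sample.
printf '%s\n' "$dirty_html" | tidy_and_select '//title'
```

Because the HTML parser puts the elements in no namespace, plain expressions such as `//title` or `//a/@href` should work without any namespace prefixes.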
