Using XPath on non-XML HTML – how to tidy dirty HTML?

Scraping HTML with XPath is far nicer than low-level text processing. But how do you proceed if your XPath tool cannot handle the HTML because it is not valid XHTML, that is, not well-formed XML?

My XPath tool is XMLStarlet.

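It can also reformat HTML so that XPath expressions can be applied. I pipe the “dirty HTML” through this command line, which parses the input as HTML (--html), recovers whatever is parsable (--recover), and discards the parser's error messages (2>/dev/null):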

$ xmlstarlet fo --html --recover 2>/dev/null
