wp.jochen.hayek.name/blog-en

using XPath on non-XML HTML – how to tidy dirty HTML?

Filed in: xmlstarlet, XPath

Scraping HTML with XPath is far nicer than low-level text processing. But how do you proceed if your XPath tool cannot deal with the HTML because it is not well-formed XML, i.e. not XHTML-conformant?

My XPath tool is XMLStarlet:

https://en.wikipedia.org/wiki/XMLStarlet

It can also reformat HTML so that XPath expressions can be applied to it. I pipe “dirty HTML” through this command line:

$ xmlstarlet fo --html --recover 2>/dev/null
