Extract XML tags and values from the CLI using sed

Extracting specific tags and values from XML is not something we do on a daily basis. In my case, I had to figure out how to extract all DevCoops blog post titles using the sitemap.xml file. Here’s how I did it.

Prerequisites

Solution

Step 1. First, I need the <loc> tag only. This is where the URLs are stored. To get them, run:

curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d'

^<loc>: matches lines that start with <loc>.
!d: deletes every line that does not match (! negates the address, d deletes), so only the <loc> lines are kept.
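To see what the filter does without hitting the network, here is a minimal sketch that runs the same sed expression against a small inline sample. The sample document, its example.com URLs, and the sample_sitemap and loc_lines names are made up for illustration; the real sitemap may be laid out differently (for instance indented, in which case ^<loc> would not match).

```shell
#!/bin/sh
# Hypothetical sample standing in for the real sitemap.xml.
sample_sitemap() {
cat <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/first-post/</loc>
<lastmod>2021-01-01T00:00:00+00:00</lastmod>
</url>
<url>
<loc>https://example.com/second-post/</loc>
</url>
</urlset>
EOF
}

# Delete every line that does NOT start with <loc>; only the <loc> lines survive.
loc_lines=$(sample_sitemap | sed '/^<loc>/!d')
printf '%s\n' "$loc_lines"
```

With this sample, only the two <loc> lines are printed, still wrapped in their tags.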

Step 2. Next, remove the <loc> tags.

curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g'

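-e (optional): --expression=script. Adds a command (script) to run; repeat it to pass several commands to a single sed invocation instead of piping through multiple sed instances.
s/<[^>]*>//g: removes every tag, i.e. anything from < up to the next >, leaving only the text between the tags.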
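Since -e lets one sed process run several commands, the two piped sed calls can be folded into a single invocation. A minimal sketch on a made-up inline sample (the sample_sitemap function, the urls variable, and the example.com URLs are hypothetical placeholders, not the real sitemap content):

```shell
#!/bin/sh
# Hypothetical sample standing in for the real sitemap.xml.
sample_sitemap() {
cat <<'EOF'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/first-post/</loc>
</url>
<url>
<loc>https://example.com/second-post/</loc>
</url>
</urlset>
EOF
}

# First expression: keep only lines starting with <loc>.
# Second expression: strip every <...> tag from the lines that remain.
urls=$(sample_sitemap | sed -e '/^<loc>/!d' -e 's/<[^>]*>//g')
printf '%s\n' "$urls"
```

Against the live sitemap, the same two expressions would follow curl -s https://devcoops.com/sitemap.xml in the pipeline; -s is curl's silent flag, which hides the progress meter.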

Conclusion

Parsing XML with regular expressions is strongly discouraged: it is hard to get right in practice, and the approach shown here only works when each <loc> element sits at the start of its own line. Dedicated XML parsers are better suited to the job.

Feel free to leave a comment below and if you find this tutorial useful, follow our official channel on Telegram.
