Extracting specific tags and values from XML is not something most of us do on a daily basis. In my case, I had to figure out how to extract all DevCoops blog post titles using the sitemap.xml file. Here's how I did it.
Prerequisites
- curl
- sed
Solution
Step 1. First, I need the <loc> tags only, since this is where the URLs are stored. To get them, run:
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d'
- ^<loc>: match lines that start with <loc>.
- !d: delete every line that does not match, so only the <loc> lines are kept.
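To see what the filter does without hitting the network, here is a minimal sketch that runs the same sed expression against an inline sample. The sample sitemap, its example.com URLs, and the function names filter_loc and filter_loc_indented are assumptions made for illustration; they are not taken from the real DevCoops sitemap.

```shell
# Hypothetical sample sitemap; the example.com URLs are placeholders.
sample='<urlset>
<url>
<loc>https://example.com/first-post/</loc>
<lastmod>2022-01-01</lastmod>
</url>
<url>
<loc>https://example.com/second-post/</loc>
</url>
</urlset>'

# Delete every line that does NOT start with <loc>.
filter_loc() {
  sed '/^<loc>/!d'
}

# Variant (an assumption, not part of the original command):
# also keep <loc> lines that are indented with spaces or tabs.
filter_loc_indented() {
  sed '/^[[:space:]]*<loc>/!d'
}

printf '%s\n' "$sample" | filter_loc
```

Because the pattern is anchored with ^, a line where <loc> is preceded by indentation would not match the original expression. If the sitemap you are working with is pretty-printed, something like the second variant may be needed.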
Step 2. Next, remove the <loc> tags.
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g'
- -e (optional): --expression=script. Lets you pass one or more commands (scripts) to a single sed invocation instead of running sed more than once.
- s/<[^>]*>//g: removes every tag occurrence, leaving only the text between the tags.
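Since -e accepts multiple expressions, the two steps can likely be folded into a single sed process. The sketch below shows that idea on an inline sample; the sample data, the example.com URLs, and the function name extract_urls are hypothetical and only serve the illustration.

```shell
# Hypothetical sample; the example.com URLs are placeholders.
sample='<url>
<loc>https://example.com/first-post/</loc>
<lastmod>2022-01-01</lastmod>
</url>
<url>
<loc>https://example.com/second-post/</loc>
</url>'

# One sed process, two expressions:
#   1) drop lines that do not start with <loc>
#   2) strip every <...> tag from the lines that remain
extract_urls() {
  sed -e '/^<loc>/!d' -e 's/<[^>]*>//g'
}

printf '%s\n' "$sample" | extract_urls
```

Against the live sitemap, the same function could be used as: curl -s https://devcoops.com/sitemap.xml | extract_urls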
Conclusion
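Parsing XML with regular expressions is strongly discouraged, since it is hard to get right in practice. The commands above only work when each <loc> element sits on its own line and starts at the beginning of that line. For anything beyond a quick one-off, a proper XML parser is the better tool.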
Feel free to leave a comment below and if you find this tutorial useful, follow our official channel on Telegram.