I wanted to automate some analytics, so I first needed to pull a list of all blog post links (URLs) from the sitemap.xml file. Here’s what I came up with.
Prerequisites
- curl
- sed
Solution
Let’s take the DevCoops sitemap.xml file as an example.
Step 1. First, GET the sitemap.xml file using curl.
curl https://devcoops.com/sitemap.xml
Example:
...
<url>
<loc>https://devcoops.com/install-bc-linux-macos-windows/</loc>
<lastmod>2022-03-19T00:00:00+01:00</lastmod>
</url>
<url>
<loc>https://devcoops.com/repair-and-optimize-mysql-databases/</loc>
<lastmod>2022-03-20T00:00:00+01:00</lastmod>
</url>
...
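Step 2. Delete every line that doesn’t start with the <loc> tag using the sed stream editor. The pattern is anchored with ^, so it only matches when <loc> is the very first thing on the line, as in the output above.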
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d'
Example:
...
<loc>https://devcoops.com/install-bc-linux-macos-windows/</loc>
<loc>https://devcoops.com/repair-and-optimize-mysql-databases/</loc>
...
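The pattern in Step 2 is anchored with ^, so it keeps a line only when <loc> is the very first thing on it. That holds for the excerpt shown here, but a sitemap with indented entries would produce no matches at all. Below is a minimal sketch of a more tolerant filter, assuming a sed that supports POSIX character classes. The helper name keep_loc_lines and the example.com sample are made up for illustration.

```shell
# Made-up sample with indented <loc> lines; this is not the real sitemap.
sample='<url>
    <loc>https://example.com/first-post/</loc>
    <lastmod>2022-03-19T00:00:00+01:00</lastmod>
</url>
<url>
  <loc>https://example.com/second-post/</loc>
</url>'

# Hypothetical helper: keep lines whose first non-blank text is <loc>,
# delete everything else. Leading whitespace on kept lines is preserved.
keep_loc_lines() {
  sed '/^[[:space:]]*<loc>/!d'
}

printf '%s\n' "$sample" | keep_loc_lines
```

The kept lines still carry their indentation, so a later step would need to trim it if the sitemap is formatted this way.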
Step 3. Get rid of the tags.
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g'
Example:
...
https://devcoops.com/install-bc-linux-macos-windows/
https://devcoops.com/repair-and-optimize-mysql-databases/
...
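Steps 2 and 3 start two separate sed processes. The same two expressions can also be passed to a single sed call with repeated -e options, which should give identical output. Here is a small sketch on made-up sample data. The function name extract_urls is hypothetical.

```shell
# Made-up sample in the same shape as the sitemap excerpt.
sample='<url>
<loc>https://example.com/first-post/</loc>
<lastmod>2022-03-19T00:00:00+01:00</lastmod>
</url>
<url>
<loc>https://example.com/second-post/</loc>
<lastmod>2022-03-20T00:00:00+01:00</lastmod>
</url>'

# One sed process, two expressions applied in order:
#   1) delete lines that do not start with <loc>
#   2) strip every tag from the lines that remain
extract_urls() {
  sed -e '/^<loc>/!d' -e 's/<[^>]*>//g'
}

printf '%s\n' "$sample" | extract_urls
```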
Step 4. Save the output in a file. For instance:
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g' > sitemap_results.txt
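One thing to keep in mind when redirecting a pipeline into a file: by default the shell reports only the exit status of the last command, and the > redirection creates the file even when nothing is written. A failed download can therefore leave an empty sitemap_results.txt without any visible error. The sketch below wraps the same two sed commands in a hypothetical helper, save_urls, that fails when no URLs were extracted. The input is made up, and mktemp is assumed to be available.

```shell
# Hypothetical helper: reads sitemap XML on stdin, writes one URL per line
# to the file named by $1, and returns non-zero if that file ends up empty.
save_urls() {
  sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g' > "$1"
  # -s is true only when the file exists and has a size greater than zero
  [ -s "$1" ]
}

# Demo on made-up input instead of a live download.
out_file=$(mktemp)
printf '<url>\n<loc>https://example.com/first-post/</loc>\n</url>\n' | save_urls "$out_file" \
  && echo "saved $(wc -l < "$out_file" | tr -d ' ') URL(s)"
rm -f "$out_file"
```

With the real sitemap, the helper would take the place of the sed part of the pipeline, with curl piped into save_urls sitemap_results.txt.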
Bonus tip(s):
- Since sitemap.xml also stores the “extra” Jekyll pages that aren’t posts, including /categories/, /tags/, /contact/ and /privacy-policy/, and these are written as the last 4 lines of the file, run the following command instead to remove them:
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g' | ghead -n -4 > sitemap_results.txt
- If working on macOS, use ghead instead of head, as the latter doesn’t support negative line counts. On systems with GNU coreutils (most Linux distributions), plain head accepts -n -4 directly. To install ghead, run:
brew install coreutils
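Dropping the last 4 lines depends on those pages staying at the end of the sitemap. A sketch of an alternative is to filter them out by path, wherever they appear. It assumes the four pages live directly under the site root with exactly the paths listed above. It also uses grep, which is not in the prerequisites. The helper name drop_extra_pages and the example.com URLs are placeholders.

```shell
# Hypothetical helper: drop the non-post pages by path, regardless of position.
# The pattern only matches top-level pages such as https://host/tags/.
drop_extra_pages() {
  grep -v -E '^https?://[^/]+/(categories|tags|contact|privacy-policy)/$'
}

# Made-up URL list with the extra pages mixed in between posts.
printf '%s\n' \
  'https://example.com/first-post/' \
  'https://example.com/categories/' \
  'https://example.com/second-post/' \
  'https://example.com/tags/' \
  'https://example.com/contact/' \
  'https://example.com/privacy-policy/' | drop_extra_pages
```

Because this does not rely on negative line counts, it needs neither ghead nor GNU head.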
Conclusion
Parsing XML with regular expressions is strongly discouraged because it is hard to get right in practice: the commands above only work while each <loc> tag sits alone at the start of its own line. A proper XML parser is a more robust choice.
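As one possible illustration of a parser-based approach, the sketch below calls Python's standard-library XML parser from the shell. This is a swapped-in technique, not part of the original solution. It assumes python3 is installed, which is not in the prerequisites, and the helper name extract_locs_xml and the sample input are made up.

```shell
# Hypothetical helper: reads sitemap XML on stdin and prints the text of every
# <loc> element, using a real XML parser instead of regular expressions.
extract_locs_xml() {
  python3 -c '
import sys
import xml.etree.ElementTree as ET

root = ET.parse(sys.stdin.buffer).getroot()
for el in root.iter():
    # Tags look like "{namespace}loc" when the sitemap declares a namespace.
    if el.tag.rsplit("}", 1)[-1] == "loc" and el.text:
        print(el.text.strip())
'
}

# Made-up input: indented, and with two entries sharing one line.
printf '%s\n' \
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' \
  '  <url><loc>https://example.com/first-post/</loc></url><url><loc>https://example.com/second-post/</loc></url>' \
  '</urlset>' | extract_locs_xml
```

Since the parser works on elements rather than lines, indentation and several tags on one line should not affect the result.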
Feel free to leave a comment below, and if you find this tutorial useful, follow our official channel on Telegram.