Simple CLI command to extract URLs from the sitemap.xml file

I wanted to automate some analytics, so I first needed to pull a list of all blog post links (URLs) from the sitemap.xml file. Here’s what I came up with.

Prerequisites

curl
sed

Solution

Let’s take the DevCoops sitemap.xml file as an example.

Step 1. First, GET the sitemap.xml file using curl.

```
curl https://devcoops.com/sitemap.xml
```

Example:

```
...
<url>
<loc>https://devcoops.com/install-bc-linux-macos-windows/</loc>
<lastmod>2022-03-19T00:00:00+01:00</lastmod>
</url>
<url>
<loc>https://devcoops.com/repair-and-optimize-mysql-databases/</loc>
<lastmod>2022-03-20T00:00:00+01:00</lastmod>
</url>
...
```

Step 2. Delete every line that doesn’t start with the <loc> tag, using the sed stream editor.

```
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d'
```

Example:

```
...
<loc>https://devcoops.com/install-bc-linux-macos-windows/</loc>
<loc>https://devcoops.com/repair-and-optimize-mysql-databases/</loc>
...
```
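
The `^<loc>` anchor only matches when the tag sits at the very start of a line, as it does in the output above. Sitemaps that indent their tags would come out empty. A minimal sketch that tolerates leading whitespace, assuming each <loc> element is still on its own line (the helper name `keep_loc_lines` and the example.com sample are made up for illustration):

```shell
# keep_loc_lines: hypothetical helper. Reads a sitemap on stdin and keeps
# only the lines whose first non-blank text is a <loc> tag.
keep_loc_lines() {
  sed '/^[[:space:]]*<loc>/!d'
}

# Indented sample input, invented for this example.
printf '%s\n' \
  '<url>' \
  '  <loc>https://example.com/first-post/</loc>' \
  '  <lastmod>2022-03-19T00:00:00+01:00</lastmod>' \
  '</url>' | keep_loc_lines
```

The indentation is kept in the output; only the non-<loc> lines are dropped.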

Step 3. Get rid of the tags.

```
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g'
```

Example:

```
...
https://devcoops.com/install-bc-linux-macos-windows/
https://devcoops.com/repair-and-optimize-mysql-databases/
...
```
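
Steps 2 and 3 can probably be folded into a single sed call. The sketch below is one way to do it, not the only one: `-n` turns off sed’s default printing, and the `p` flag prints a line only when the substitution matched. The helper name `extract_loc_urls` and the sample data are assumptions for illustration.

```shell
# extract_loc_urls: hypothetical helper. Reads a sitemap on stdin and prints
# the text between <loc> and </loc>, one URL per line. Leading and trailing
# whitespace around the element is tolerated.
extract_loc_urls() {
  sed -n 's/^[[:space:]]*<loc>\(.*\)<\/loc>[[:space:]]*$/\1/p'
}

# Sample input, invented for this example.
printf '%s\n' \
  '<url>' \
  '<loc>https://example.com/first-post/</loc>' \
  '<lastmod>2022-03-19T00:00:00+01:00</lastmod>' \
  '</url>' | extract_loc_urls
```

With the real sitemap, the call would look like `curl https://devcoops.com/sitemap.xml | extract_loc_urls`.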

Step 4. Save the output in a file. For instance:

```
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g' > sitemap_results.txt
```
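
One thing to watch when automating this: the `>` redirect truncates sitemap_results.txt even if the download fails or nothing matches. A rough sketch of a guard, assuming you’d rather keep the previous file in that case (the helper name `save_urls` and the temp file template are made up for illustration):

```shell
# save_urls: hypothetical helper. Reads a sitemap on stdin, extracts the URLs
# with the same two sed commands as above, and writes them to the file named
# by $1 only if at least one URL was found.
save_urls() {
  out_file=$1
  tmp_file=$(mktemp "${TMPDIR:-/tmp}/sitemap_urls.XXXXXX") || return 1
  sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g' > "$tmp_file"
  if [ -s "$tmp_file" ]; then
    # Non-empty result: replace the contents of the target file.
    cat "$tmp_file" > "$out_file"
    rm -f "$tmp_file"
  else
    # Empty result: leave any existing target file untouched.
    rm -f "$tmp_file"
    echo "no <loc> entries found; $out_file left untouched" >&2
    return 1
  fi
}
```

Usage would be `curl https://devcoops.com/sitemap.xml | save_urls sitemap_results.txt`.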

Bonus tip(s):

The sitemap.xml also lists the “extra” Jekyll pages that aren’t posts: /categories/, /tags/, /contact/ and /privacy-policy/. Since they end up as the last 4 lines of the output, you can drop them by running the following command instead:
```
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g' | ghead -n -4 > sitemap_results.txt
```
The command above uses ghead (GNU head) because the stock head on macOS doesn’t support negative line counts; with GNU head on Linux, plain head -n -4 does the same. To install ghead on macOS, run:
```
brew install coreutils
```
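
If installing coreutils isn’t an option, two alternatives are sketched below. The first drops the last N lines with plain awk. The second swaps in a different technique: it filters the four pages by path instead of by position, so it would keep working if they stop being the last lines. Both helper names are made up for illustration.

```shell
# drop_last_lines: hypothetical helper. Prints stdin minus its last N lines
# (N defaults to 4, must be at least 1). It buffers N lines in a ring and
# prints a line only once N newer lines have arrived.
drop_last_lines() {
  awk -v n="${1:-4}" 'NR > n { print buf[NR % n] } { buf[NR % n] = $0 }'
}

# drop_extra_pages: hypothetical helper. Removes URLs whose final path
# segment is one of the four non-post pages named in the text above.
# Note: a post with the slug "tags" or "contact" would be removed as well.
drop_extra_pages() {
  grep -v -E '/(categories|tags|contact|privacy-policy)/$'
}

# Six sample lines, invented for this example; the last four are dropped.
printf '%s\n' post-1 post-2 extra-1 extra-2 extra-3 extra-4 | drop_last_lines 4
```

Either helper can take the place of `ghead -n -4` in the pipeline above.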

Conclusion

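Parsing XML with regular expressions is generally discouraged: the commands above only work as long as every <loc> element sits on its own line, and they break on sitemaps that are formatted differently. A proper XML parser is the more robust choice.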
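
As one example of such a tool, here is a sketch using xmllint, assuming it is installed (it ships with libxml2). The helper name `extract_loc_xml` is made up. Output formatting depends on the xmllint version: releases from libxml2 2.9.9 onward print one node per line, while older ones may run the URLs together, so treat this as a starting point.

```shell
# extract_loc_xml: hypothetical helper that uses an XML parser instead of
# regular expressions. local-name() sidesteps the sitemap XML namespace, and
# the trailing "-" tells xmllint to read the document from stdin.
extract_loc_xml() {
  xmllint --xpath '//*[local-name()="loc"]/text()' -
}
```

Usage would be `curl https://devcoops.com/sitemap.xml | extract_loc_xml`. Unlike the sed pipeline, this does not care whether the sitemap is indented or written on a single line.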

Feel free to leave a comment below and if you find this tutorial useful, follow our official channel on Telegram.
