Categories
Spreadsheets

Import a Google Sitemap XML file into Google Sheets

I wanted to get my sitemap data into Google Sheets for a few content audit and statistics dashboards.

The IMPORTXML() function in Google Sheets sounded pretty self explanatory and the XPATH syntax seemed almost graspable.

=IMPORTXML("https://www.example.com/sitemap.xml", "//url/loc")

But it just wouldn’t sodding work.

It turned out that this was because Google XML Sitemaps declare a namespace.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

Once a namespace has been declared it’s no good saying you want //url/loc, you have to specifically ask for that data from the applicable namespace. But IMPORTXML() doesn’t support that.

The trick I eventually found was to use LOCAL-NAME with a search filter, which lets you pick out the loc data by searching for it by name regardless of namespace.

This is the working formula:

=IMPORTXML("https://www.example.com/sitemap.xml", "//*[local-name() ='url']/*[local-name() ='loc']")

You can use the same method to get any of the other data in the sitemap.

=IMPORTXML("https://www.example.com/sitemap.xml", "//*[local-name() ='url']/*[local-name() ='lastmod']")

Import more than one column from a sitemap

Update: Since first writing this I’ve found a formula which brings the whole sitemap in a multi-column table:

=importxml("http://example.com/sitemap.xml","//*[local-name() = 'url']")

I’m not sure if I simply hadn’t tried that before or whether Google Sheets has tweaked things recently. Alternatively, the only other option seems to be to bring in the columns one at a time and rely on them lining up.

PATH syntax uses a pipe to separate multiple queries but this creates just a single column. As this is my first use of XPATH I don’t know if this is a flaw of XPATH or Google Sheets, or a super important design feature.

=importxml("http://example.com/sitemap.xml","//*[local-name() ='url']/*[local-name() ='loc'] | //*[local-name() ='url']/*[local-name() ='lastmod']")

Google Sheets showing values merged into one column

I’ve only been able to get around this by placing separate IMPORTXML formulas in adjacent cells (eg. Allowing for headers in row 1,  put ‘loc’ formula in A2 and ‘lastmod’ in B2).

As long as the sitemap file is properly structured the rows should line up.

Google Sheets using multiple formulas to create two columns

Why bother?

It’s really easy to import a Google XML Sitemap into Excel, but i wanted it in Google sheets so i could use this data alongside Google Analytics stats, which are much easier to import into Google Sheets.

I can now:

  • Find unviewed pages by comparing the sitemap contents against Google Analytics pageviews
  • Use the latest page edits in statistics dashboards

11 replies on “Import a Google Sitemap XML file into Google Sheets”

Thanks for sharing really good technique. I have a question, is there a way if we add an option to in the query part of importxml to exclude certain item if it matches the criteria? example

=importxml(importxml(“http://example.com/sitemap.xml”,”//*/chanel | not(//*/chanel/item=”safely”)

meaning if branches under //*/chanel/item consist word “safely” then skip that item.

Wow. Thank you for the information. Btw, as to your problem with adjacent cells, I was able to resolve it using transpose function.

I’ve got what I’m trying to to partially working.., can someone help me?

I’m trying to import the following the same guidelines above, but i’m having and issue.

——————–
Using this:
//*[local-name() =’url’]/*[local-name() =’image:image’]/*[local-name() =’image:loc’]

——————–
To try and import this:

http://evryng.com/gallery
… daily
… 0.75
… 2016-07-18


……
………… http://www.example.com/image.jpg
……
……
………… Gallery
……

What am I doing wrong?

hi there i really appricate it and thankful for your help wondering if you can help because i tried your method but its not working do you have any other suggestions that may help to get all the URLs from my sitemap

Hi Iman, your link is to a ‘sitemap index’ file which is a group of sitemaps rather than an individual sitemap, it lists the URLs of sitemaps rather than the URLs of pages. That happens when you either choose to have different sitemaps for different groups of content, or when you have over 50,000 pages which is Google’s limit per individual sitemap and they have to be broken up.

You can change the formula to load the content of the Sitemap Index file, which will bring back the URLs of the individual sitemaps

=IMPORTXML("https://duo.com/sitemap_index.xml", "//*[local-name() ='sitemap']/*[local-name() ='loc']")

Or point to one of your individual sitemaps:

=IMPORTXML("https://duo.com/page-sitemap.xml", "//*[local-name() ='url']/*[local-name() ='loc']")

Leave a Reply

Your email address will not be published.