How to run a website metadata audit in Google Sheets

Google Sheets has a formula that lets you scrape values from a website: =importxml().

Get a list of your web pages (you can do this by importing your xml sitemap into Google Sheets) and then use the =importxml() formula to scrape values for things like metadata title and description, or on-page text like your Heading 1.

woometadata

Importxml(URL,Xpath)

Importxml has two parameters:

URL: is your web page address

XPATH: is the description of the content you’re requesting. w3schools.com has a pretty useful overview of the syntax.

Google Sheets importxml()

Fetch individual metadata values

Assuming you have the URL in column A, and column headings in row 1, the formulas you would use starting from row 2 are:

H1:

=importxml(A2,"//h1")

Browser Title

=importxml(A2,"//title")

Description

=importxml(A2,"//meta[@name='Description']/@content")

Keywords

=importxml(A2,"//meta[@name='Keywords']/@content)

There seems to be a hidden speed limit; I find that the first few hundred fields will populate really quickly but then it really slows down to just one cell updating every 10-20 seconds. Making a cup of tea or writing a blog post about it can help kill the time.

Fetch multiple values in one request

You can fetch multiple values in a single request. I presume this will help speed up the process but I don’t have it working reliably enough to use it in place of the above method.

Use a pipe separater in the XPATH value to request more than one item.

Fetching the H1 and Title in a single request can be done like this:

=importxml(A2,"//h1 | //title")

The first thing to be aware of is that this will bring each value in as a new row. To convert that to columns instead wrap the formula in =transpose():

=transpose(importxml(A2,"//h1 | //title"))

The problem I find with this method is that it gives the results in the order they appear on the page, not the order you request them.

For example the following formula doesn’t bring back keywords first:

=transpose(importxml(A1,"//meta[@name='Keywords']/@content | //h1 | //title | //meta[@name='Description']/@content"))

For me it returns Title, Description, Keywords, H1 because that’s the order in the HTML I’m scraping.

The bigger problem is that where a field doesn’t exist, its column doesn’t get skipped. Where keywords are blank it doesn’t leave a blank cell, it shifts the H1 into the column that contains keywords for other pages.

You can only really use this then for fields you know will be filled in.

My workflow for scraping metadata values

The value gets cached in Google Sheets for a few hours before it will fetch new content. Before this happens you should copy and paste as values to overwrite the formula with the content it fetched.

My process is:

  1. Get the URL list prepared, perhaps sorted by priority (such as an audit of most viewed pages, or all level 1 & 2 landing pages)
  2. Put the formulas in for the first page, and drag the formulas down to autofill about 50 rows.
  3. Wait for Google Sheets to fetch the values and then select all but the last row (e.g. rows 2-49 if you had filled the formula to row 50)
  4. Copy and Paste special: Paste values only
  5. Drag the formula from your last rown down about 50 rows
  6. Repeat 3-5

My method of filtering Analytics spam

There’s a growing problem of spam traffic in Google Analytics, but recently I’ve been able to eradicate most of ours with one very simple filter.

Filter Spam traffic from Google Analytics

Spammers are using the Analytics Measurement Protocol. Unlike the normal JavaScript implementation of Analytics, this is a server-side method. This means they can push data straight into your Analytics reporting without visiting your website.

Google analytics is a pretty open system. The account ID is all that’s needed to push data to your reports and these IDs are numeric so it’s easy for them to automate account IDs and have their servers push pageview and event data straight into your account.

Fake events recorded in Google Analytics

How to detect and remove spam

When something like a pageview is recorded in Analytics there are also a lot of secondary dimensions recorded that you can use to verify whether it’s spam or not.

Real and fake referrals in Google Analytics
I used the Hostname dimension. It records the domain that the visitor was viewing when triggering the pageview. By adding the hostname secondary dimension to pageview, referral and event reports I realised that this was missing for the fake spam data. In the screenshot you can see genuine referrals with the hostname set to historicengland.org.uk. The hostname for spam traffic was (not set).

When I first tried to filter traffic based on hostname I set up a filter to exclude hostnames matching “” (blank) or “(not set)” but I discovered that a blank hostname is not the same thing as a missing hostname.

I changed this to a filter which I think is working well so far:

Only include traffic where hostname equals .

Google Analytics View Filter

The full stop is a regular expression wildcard for any 1 character.

Another option would be to specify your domain name, but I felt that that approach was too specific. There are genuine reasons for using alternative hostnames, like Google Translate. Also in case the domain for our website did ever change then I didn’t want the Google Analytics set up to be so heavily customised that the rest of the team might change something which stops it working.

There are some other links which I used along the way to understand the issue, but they tended to rely on individually filtering out specific referral sites and I didn’t want to have to keep on top of that!

3 XML Sitemap tricks for Google Sheets

In the last post I described how I eventually managed to import a Google XML sitemap file into Google sheets. In short, this:

=importxml("http://example.com/sitemap.xml","//*[local-name() = 'url']")

I’ve been building on that recently and thought I’d share a few ways I’ve been using it.

  • Filtering the sitemap by path
  • Filtering the sitemap by date
  • Counting pages

Continue reading 3 XML Sitemap tricks for Google Sheets

Import a Google Sitemap XML file into Google Sheets

I wanted to get my sitemap data into Google Sheets for a few content audit and statistics dashboards.

The IMPORTXML() function in Google Sheets sounded pretty self explanatory and the XPATH syntax seemed almost graspable.

=IMPORTXML("https://www.example.com/sitemap.xml", "//url/loc")

But it just wouldn’t sodding work.

Continue reading Import a Google Sitemap XML file into Google Sheets