There’s a growing problem of spam traffic in Google Analytics, but recently I’ve been able to eradicate most of ours with one very simple filter.
Google Analytics is a pretty open system. The account ID is all that’s needed to push data to your reports, and because these IDs are numeric it’s easy for spammers to generate them automatically and have their servers push pageview and event data straight into your account.
How to detect and remove spam
When something like a pageview is recorded in Analytics there are also a lot of secondary dimensions recorded that you can use to verify whether it’s spam or not.
I used the Hostname dimension. It records the domain that the visitor was viewing when the pageview was triggered. By adding Hostname as a secondary dimension to the pageview, referral and event reports, I realised that it was missing for the spam data. In the screenshot you can see genuine referrals with the hostname set to historicengland.org.uk. The hostname for spam traffic was (not set).
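The same check can be run outside the Analytics interface. Here is a minimal Python sketch, assuming you have exported a report with Hostname as a secondary dimension. The row layout and the values are made up for illustration, and `split_by_hostname` is a hypothetical helper rather than part of any Google tool.

```python
NOT_SET = "(not set)"  # how the reports display a missing hostname

# Made-up rows standing in for an exported referral report.
rows = [
    {"source": "partner-site.example", "hostname": "historicengland.org.uk"},
    {"source": "spam-referrer.example", "hostname": "(not set)"},
    {"source": "news-site.example", "hostname": "historicengland.org.uk"},
]


def split_by_hostname(rows):
    """Separate rows that have a real hostname from rows shown as (not set)."""
    genuine = [row for row in rows if row["hostname"] != NOT_SET]
    suspect = [row for row in rows if row["hostname"] == NOT_SET]
    return genuine, suspect


genuine, suspect = split_by_hostname(rows)
print("Likely spam sources:", [row["source"] for row in suspect])
```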
When I first tried to filter traffic based on hostname I set up a filter to exclude hostnames matching “” (blank) or “(not set)” but I discovered that a blank hostname is not the same thing as a missing hostname.
I changed this to a filter which I think is working well so far:
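Only include traffic where hostname matches .
The full stop is a regular expression wildcard that matches any one character, so the filter keeps any hit whose hostname is at least one character long and drops hits that have no hostname at all.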
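To see why the include filter succeeds where the exclude filter did not, here is a rough Python model of the two filters. It rests on two assumptions that are mine rather than anything confirmed by Google: a hit with no hostname gives the filter nothing to test (modelled here as `None`), and filter patterns can match anywhere in the field (hence `re.search`). The function names are hypothetical.

```python
import re

INCLUDE_ANY_CHARACTER = re.compile(r".")
EXCLUDE_BLANK_OR_NOT_SET = re.compile(r"^$|^\(not set\)$")


def kept_by_include_filter(hostname):
    """Include filter: keep the hit only if the hostname matches the pattern."""
    if hostname is None:
        # Assumption: with no hostname there is nothing to match, so the hit is dropped.
        return False
    return INCLUDE_ANY_CHARACTER.search(hostname) is not None


def kept_by_exclude_filter(hostname):
    """Exclude filter: drop the hit only if the hostname matches the pattern."""
    if hostname is None:
        # Assumption: with no hostname there is nothing to match, so the hit is kept.
        return True
    return EXCLUDE_BLANK_OR_NOT_SET.search(hostname) is None


for hostname in ["historicengland.org.uk", "", None]:
    print(repr(hostname), kept_by_include_filter(hostname), kept_by_exclude_filter(hostname))
```

Under these assumptions the missing hostname slips past the exclude filter but is removed by the include filter, which is consistent with what I saw in our reports.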
Another option would be to specify your domain name, but I felt that approach was too specific. There are genuine reasons for traffic to arrive under alternative hostnames, such as Google Translate. Also, if the domain for our website ever changed, I didn’t want the Google Analytics setup to be so heavily customised that someone else on the team could change something and stop it working.
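The trade-off between the two patterns can be sketched in Python. This is only an illustration: `translate.googleusercontent.com` is my assumption of the kind of hostname a translated page view might report, `kept` is a hypothetical helper, and `None` again stands for a hit with no hostname.

```python
import re

ANY_HOSTNAME = re.compile(r".")
OUR_DOMAIN_ONLY = re.compile(r"historicengland\.org\.uk")


def kept(pattern, hostname):
    """Model an include filter: keep the hit only if a hostname exists and matches."""
    return hostname is not None and pattern.search(hostname) is not None


hostnames = ["historicengland.org.uk", "translate.googleusercontent.com", None]
for hostname in hostnames:
    print(
        repr(hostname),
        "any-hostname filter keeps:", kept(ANY_HOSTNAME, hostname),
        "domain-only filter keeps:", kept(OUR_DOMAIN_ONLY, hostname),
    )
```

Both patterns drop the hit with no hostname, but only the single full stop keeps the translated view.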
There are some other links which I used along the way to understand the issue, but they tended to rely on individually filtering out specific referral sites and I didn’t want to have to keep on top of that!