Bitesize: Google API Documentation Leak
12th Jun, 2024
Steve Job
A couple of weeks ago, there was a massive Google API leak.
It wasn’t ever meant to be seen by the public, and has been verified by multiple sources as a legitimate Google document.
Naturally, a lot of SEOs have been talking about it because it contained around 8,000 elements relating to Google’s search engine, as well as thousands of other things relating to other areas of interest for Google.
This video from Rand Fishkin of Sparktoro provides a digestible breakdown of this leak and some of the main points, but a lot of it is in ‘SEO speak’ which needs translating.
This is what this Bitesize blog is for, as well as to offer some slightly different opinions on this API leak to Rand.
Who is Rand Fishkin?
Rand Fishkin has authority to speak on this because he’s the founder of Moz, an SEO tool that’s been around for decades.
He left Moz and moved to Sparktoro, moving away from SEO but remaining very vocal about Google and their monopoly on search.
Some of his takes are great, some I don’t agree with as much, but we all have very different takes on things!
An important caveat
There have been some wild assumptions from the SEO community about this leak and what it means for websites.
One of the biggest assumptions is that these are confirmed ranking factors.
They’re not, as far as we know.
There is a lot of vital context missing from this document, and it’s important to remember that.
What did Google say?
‘We would caution against making inaccurate assumptions about search based on out-of-context, outdated, or incomplete information.’ – Google
Rand’s view on this statement is that the document is not out of date (it’s new), it’s not out of context, and the information within it is complete.
I would caution against that level of confidence, because we don’t know that it is up to date. A brand new algorithm update rolled out on Monday 10th June, and there’s a new version 5 of the API that’s missing a lot of what was leaked, so I wouldn’t call it current.
There’s also a huge lack of context: we might have the name of an attribute Google is tracking, but we don’t know exactly what that attribute is or what goes into it.
Navboost and Clickstream
The leak highlighted that Navboost, a Google ranking system first disclosed during the DOJ trial, draws on data from Google Chrome.
In the video, Rand references an internal email between Googlers about Navboost that was brought up in the Department of Justice trial last year. The email dates from the early 2000s but only came to light a few years ago, and I question how relevant to SEO in 2024 an email from around 20 years ago can be.
He mentions Chrome Clickstream and Navboost in his video.
Chrome Clickstream is information collected about a user as they browse websites or use the browser. Search engines use this Clickstream data set to see where users have searched for a term, clicked a result, and then gone back to the Google search results. All of the clicks we make make up the Clickstream.
Clickstream tracks things like:
- if you’re a unique or repeat visitor to a website,
- the terms that you type into the search engines,
- the page you land on first,
- the amount of time you spend on that page,
- the features on that page that you engage with,
- where and when an item is added or removed from a cart,
- where you go next,
- if the back button is used.
All of that makes up Chrome Clickstream and is what Google is using, according to this leaked document.
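To make that a little more concrete, here’s a minimal sketch of what a single Clickstream event could look like as a data record. This is purely illustrative: the field names below are my own, not attribute names from the leak.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Purely illustrative: these field names are not taken from the leaked API.
@dataclass
class ClickstreamEvent:
    user_id: str                  # anonymised or logged-in identifier
    query: str                    # the term typed into the search engine
    landing_page: str             # the page the user lands on first
    dwell_time_seconds: float     # time spent on that page
    engaged_features: list[str]   # on-page features interacted with (video, cart, etc.)
    device: str                   # "desktop" or "mobile"
    used_back_button: bool        # whether the user bounced back to the results
    is_repeat_visitor: bool       # unique vs repeat visitor to the site
    timestamp: datetime           # when the click happened
    next_url: Optional[str] = None  # where the user went next, if known
```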
Question from the team: Does this mean that if a user visits a website via Firefox or Safari, it won’t impact rankings nearly as much as if they were using Chrome?
If you’re logged in to a Google account, Google can still track a lot of what you do even when you’re not on Chrome, because you’re still using Google.
At the same time, Firefox and a number of other browsers now have their own anti-tracking features built in.
Firefox blocks third-party cookies by default, as do many other browsers now, but your behaviour will still have some impact on rankings.
Navboost re-ranks webpages based on click logs of user behaviour. It’s part of the Chrome data set and part of the Clickstream. It used to be a standalone algorithm a long time ago, but now it’s just part of the wider algorithm.
We only know about it because it came out during the Department of Justice antitrust trial over whether Google’s search engine constitutes an illegal monopoly.
Navboost remembers clicks over roughly the past year and splits them up by desktop, mobile and so on. It was a ranking signal, and probably still is; it’s just not technically called Navboost anymore, as it’s part of the wider algorithm, but it’s easier to explain it as its own separate entity.
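As a rough illustration of how re-ranking from click logs could work in principle, here’s a minimal sketch that boosts pages with better long-click behaviour over roughly the past year, split by device. This is my own simplification to show the idea, not Google’s actual Navboost implementation, and the field names are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def rerank_by_clicks(results, click_log, device="desktop", window_days=365):
    """Re-order an initial ranking using aggregated click behaviour.

    results:   list of URLs in their original ranked order.
    click_log: list of dicts like {"url": ..., "device": ..., "timestamp": ...,
               "long_click": bool} - a stand-in for real click logs.
    """
    cutoff = datetime.utcnow() - timedelta(days=window_days)
    scores = defaultdict(float)

    for click in click_log:
        if click["device"] != device or click["timestamp"] < cutoff:
            continue
        # Long clicks (the user stayed on the page) count for more than quick bounces.
        scores[click["url"]] += 1.0 if click["long_click"] else 0.2

    # Blend the original position with the click score: a lower position number
    # is better, a higher click score is better.
    def blended_key(item):
        position, url = item
        return position - scores[url]

    return [url for _, url in sorted(enumerate(results), key=blended_key)]
```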
Google quality rater feedback and whitelists
The quality rater feedback/guidelines are something I’d always recommend anyone working in SEO or content reads. They’re not a fun read and it’s a long document, but it provides good criteria.
Although Rand says he’s surprised they’re a ranking factor, I’m not; it makes a lot of sense. AI still needs to be taught plenty, as do all learning algorithms, so people still need to tell it what they think of its output.
With regards to the whitelists being used by Google to suppress malicious sources of misinformation, that’s not at all surprising.
Google has said for a long time that expertise, authority and trust are really important.
White papers are one of the best markers of this, because nine times out of ten they’re research papers. Google has always said they don’t manipulate the search engine results pages unless there is a lot of misinformation, or there are amber alerts in the news that people need to see, so that in times of crisis the relevant information is at the top. This makes a lot of sense.
You might be aware of the “medic” update in 2018, known as the ‘your money or your life’ update, which mostly hit medical sites that weren’t a true authority and were spreading medical misinformation or claims not backed up by research.
Google started looking at authority, expertise and trust as very important ranking factors, so we’ve known this was extremely important to SEO for years.
Toxic backlinks
Rand says that the leak shows toxic backlinks exist and can harm your website’s rankings.
He sees this as a validating point for many in the SEO community. I think Rand has made a fairly large assumption here.
Yes, there is a score called ‘BadBackLinks_penalised’ that is an attribute Google is tracking. The leaked document says ‘whether this doc is penalized by BadBackLinks, in which case we should not use improvanchor score in mustang ascorer’.
Mustang and Ascorer are primary algorithms to do with ranking pages. Yes, it says the doc is penalised by BadBackLinks, but BadBackLinks here is a doc-level attribute, and whether that still feeds into rankings is another matter. Even in the main documentation that came out, sources have suggested that a lot of this has been decoupled. So yes, Google is tracking everything from a link perspective, but it may well have been decoupled from much of the ranking process, purely because the disavow file, for example, doesn’t appear anywhere in this API leak, and the disavow file is what you’re supposed to submit if you know you’ve been buying bad backlinks and have been penalised for them.
The assumption has been made that the disavow file has been used to train Google’s spam algorithm so it knows what is and isn’t a good backlink. They do call them bad backlinks, but we don’t know whether there is a ranking penalty for them.
The leak also mentions a Penguin penalty. Penguin doesn’t technically exist anymore as a standalone algorithm; it’s part of the greater algorithm, and this is one of the problems.
Google’s advice on backlinks is: if you know you’ve been buying dodgy backlinks and you’ve been penalised, add those URLs to a disavow file and upload it to Search Console.
If you have been buying dodgy backlinks and you haven’t been penalised, don’t worry.
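For reference, a disavow file itself is just a plain text file: one URL or domain: entry per line, with lines starting with # treated as comments. Here’s a minimal sketch of building one in Python; the domains and URLs are made up.

```python
# Minimal sketch: writing a disavow file in the documented plain-text format.
# The domains and URLs below are made-up examples.
bad_domains = ["spammy-links.example", "paid-directory.example"]
bad_urls = ["https://another-site.example/dodgy-page.html"]

with open("disavow.txt", "w") as f:
    f.write("# Links disavowed after a manual link review\n")
    for domain in bad_domains:
        f.write(f"domain:{domain}\n")   # disavow every link from this domain
    for url in bad_urls:
        f.write(f"{url}\n")             # disavow a single linking page

# The resulting file is then uploaded via the disavow links tool in Search Console.
```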
Realistically, the more well known your website, the longer you’ll spend updating that disavow file, purely because other people think that linking to valid sites like yours makes their own site look more legitimate, which is a problem.
Imagine you work for Amazon: your entire job would just be uploading a disavow file every single day because of the number of sites linking to you. It’s not a realistic or good use of time.
So, Google tracks and keeps score of these backlinks, but then uses that data to train its own spam algorithm so you don’t have to spend your entire life updating disavow files manually.
Limiting the number of sites of a specific type in search results
An interesting point is that Google may be limiting the number of sites of a given type that can appear in the search results.
We know they deduplicated the SERPs a while ago. A few years ago, you’d search for a product and the top ten results would all be Amazon, with the same content and the same images; the only difference was that the items were technically being sold by different people.
That was unhelpful, so Google started to return only two or three Amazon results alongside other types of content.
Now it turns out they may be limiting the number of sites of each type that can appear at all.
For example, they might say that for a particular search, only ten blogs can appear in the results. This is an interesting takeaway from the leak, but again, a lot of this is an assumption and we don’t know exactly how many blogs, or other types of site, will show up.
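To show the mechanics of what ‘limiting by type’ could mean, here’s a small sketch that caps how many results of each site type survive into the final list. The caps and site types are invented for illustration; the leak doesn’t tell us the real numbers.

```python
from collections import defaultdict

def cap_results_by_type(results, caps):
    """Illustrative only: drop results once a site type has hit its cap.

    results: list of (url, site_type) tuples in ranked order.
    caps:    dict such as {"blog": 10, "ecommerce": 3} - made-up numbers.
    """
    seen = defaultdict(int)
    capped = []
    for url, site_type in results:
        if seen[site_type] >= caps.get(site_type, float("inf")):
            continue  # this type has already used up its slots
        seen[site_type] += 1
        capped.append((url, site_type))
    return capped
```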
Mentions of ‘entities’ being as important as backlinks?
Rand talks about the leak highlighting that mentions of ‘entities’ online may be as important as backlinks, but that there needs to be more testing from SEOs here.
This goes back to one of the first BrightonSEOs I ever attended, a long time ago. Greg Gifford did a talk called ‘The Internet is Entities’. What he meant is that everything that exists on the internet is an entity, each entity relates to other entities, and Google is constantly trying to figure out the relationships between them all.
For example, ‘Rand Fishkin’ and ‘Sparktoro’ are two entities and Google has figured out the relationship between them. In the same way, if you type in ‘Steve Jobs’ you’re probably going to see a lot of Apple-related pages come up in search.
Google has figured out the relationship between those entities.
This is essentially where E-E-A-T comes from. Google is figuring out who you are, what you’re an authority in, whether it trusts you, and whether other people are mentioning you and your content.
They may not link directly to you, but they might mention you by name in their content and because Google now reads all of the content on every webpage it visits, it’s starting to form relationships between more entities and remember those relationships for future searches.
This is not actually that new; the internet was built on entities.
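A toy way to picture this is a small graph of entities and the relationships between them. The entities below come from the examples above; the relationship labels and structure are my own illustration, not anything from the leak.

```python
# Toy entity graph: nodes are entities, edges are the relationships a search
# engine might infer between them. Labels are illustrative only.
entity_graph = {
    ("Rand Fishkin", "Sparktoro"): "founder_of",
    ("Rand Fishkin", "Moz"): "founder_of",
    ("Steve Jobs", "Apple"): "co_founder_of",
}

def related_entities(entity, graph=entity_graph):
    """Return the entities connected to the given entity, with the relationship."""
    related = []
    for (a, b), relation in graph.items():
        if a == entity:
            related.append((b, relation))
        elif b == entity:
            related.append((a, relation))
    return related

print(related_entities("Rand Fishkin"))
# [('Sparktoro', 'founder_of'), ('Moz', 'founder_of')]
```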
Site wide, not just page wide, implications of page titles
The most interesting thing that many SEOs may not have been aware of is the idea that page titles have site wide implications.
There’s a title match score for the site, not just for the page. Rand says that a lot of the things that SEOs thought were page specific, are actually a site wide signal for Google.
We all thought page titles were page related, unsurprisingly!
We thought they had a page wide impact on SERPs, nothing else. Potentially they impact the entire site.
Does this change what we do as SEOs? No. It just means that we now know the page titles we use have a much broader effect than we previously thought.
Question from the team: We now know that page titles have a site wide impact, but do we know what this impact is? Is it just added context for the wider site and linking search entities, or does it impact something entirely different?
We don’t know. The only reason we know it’s a site wide score is because the attribute literally says ‘site’.
That’s all we know: it affects the site, we just don’t know how.
I’m making an educated guess at the moment that it’s from an entity point of view, using it to form relationships between company names and keywords and things like that.
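Purely to illustrate what a site-wide title score could look like, here’s a sketch that rolls hypothetical page-level title match scores up into one score per site. Nothing here comes from the leak beyond the idea that a ‘site’-level title score exists; the scores, URLs and averaging are all assumptions.

```python
from statistics import mean
from urllib.parse import urlparse

# Hypothetical page-level title match scores (0-1) keyed by URL.
page_title_scores = {
    "https://example.com/blue-widgets": 0.9,
    "https://example.com/red-widgets": 0.7,
    "https://example.com/about-us": 0.2,
}

def sitewide_title_score(page_scores):
    """Average the page-level scores per site - one simple way a per-page
    signal could roll up into a site-wide one."""
    per_site = {}
    for url, score in page_scores.items():
        site = urlparse(url).netloc
        per_site.setdefault(site, []).append(score)
    return {site: mean(scores) for site, scores in per_site.items()}

print(sitewide_title_score(page_title_scores))
# roughly {'example.com': 0.6}
```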
Do paid-for clicks improve organic rankings?
In his video, Rand discussed whether paid ad clicks can boost a site’s organic rankings if the Chrome clickstream is used for rankings. He says that over the years, marketers have noticed an effect where they invest more budget into PPC and see an SEO boost.
Rand questions whether it could be possible that Google is discounting the specific paid search clicks, but not discounting when people click those links and then share them with friends, or open them on other devices, or bookmark them, or email them to themselves for later. These might have an impact, so paying for clicks could actually boost your organic rankings.
This is similar to the distinction between ‘(not set)’ and ‘direct’ traffic in GA4. If GA4 can’t figure out where you came from, you’re classed as ‘direct’ or ‘(not set)’, and if you share PPC links via WhatsApp, Facebook Messenger and so on, those clicks don’t technically exist as far as Google is aware.
A Facebook Messenger click will not be attributed to Facebook because it’s not part of the tool, so it will go under ‘direct’ traffic until Google eventually figures out where it came from.
So, I can see how these ‘direct’ clicks might theoretically impact rankings, but again we don’t know for sure.
Site authority as a score
According to the leaked API, site authority is a score.
Google has said that Domain Authority is not a thing many, many times.
Lots of people called Google liars when this document was leaked, because it contains a score called Site Authority. Everyone was up in arms, saying ‘you said Domain Authority is not a thing, but you track Site Authority’.
I wouldn’t necessarily say Google have lied to us, I’d say that:
A) They’ve said Domain Authority is not a ranking factor, so they might be tracking it but not using it as a ranking factor.
And B) we don’t know what Site Authority is made up of; it’s definitely a thing, but it might not be the same thing entirely.
Googlers may also have been obfuscating the truth here slightly because they don’t want us to know all of their secret recipes to ensure that people don’t try to game the system.
It’s the same as KFC not telling us all of the herbs and spices in their chicken, or Coca-Cola not telling you exactly what their ingredients are and in what quantities, because it’s their secret sauce.
Question from the team: Why would we not automatically assume that Google looks at Domain Authority? It sounds like a sensible thing to do.
I think it’s because they’ve obfuscated what they do to some degree. Their argument is that they don’t track Domain Authority, they track Site Authority, so their terminology is different. That could be their entire argument in a nutshell – just semantics.
They definitely track it and store it and look at it. Whether or not it’s used as a ranking factor, we don’t know.
The reason we say we don’t know is because you would assume a high site authority page would automatically always be at the top of search, but that’s not always the case.
It’s difficult because we can’t see what makes up Google’s Site Authority, but I have seen sites with next to no Domain Authority from Majestic, Moz and others outrank sites with exceptionally high scores from those tools.
But, Google could be tracking Site Authority very differently to Domain Authority, we just don’t know at this point.
What are we actually changing?
What we really need to do as an industry is go back to what SEO should be, which is a lot of testing. You have a theory, you test it.
Apart from that, we don’t really change anything based on this leak. It’s still about writing good content, building the entities and building relationships. It’s still all about focusing on expertise, authority, and trust.
With link building and digital PR, the only thing I’d probably change is that chasing links from really, really small sites isn’t worth it, because it’s going to get expensive for little return.
The Chrome Clickstream can devalue links from low-traffic sites, so go big and aim for really big, relevant sites linking to your website, which is something we’ve always said.
Page title matching is still a thing, it just has a much bigger site-wide effect now. So still pay attention to this.
The only other thing to mention is zero-click search.
This is mentioned in the documents, and with Clickstream data in play it’s potentially not something you want to target: zero-click searches don’t benefit you if no one clicks through, because Google then thinks ‘people are searching but not clicking, so they’re not finding what they need from you’.
Realistically, that is where ‘no right answer’, or NORA, keyword intent comes in. Google’s AI results usually appear for no-right-answer searches, so you’re probably not going to get those clicks from a user’s search anyway.
This is backed up by a blog on Ahrefs.com which says ‘No Right Answer keywords may become more important with the rise of AI search engines. Some believe this kind of keyword is where AI-generated responses will take most of the clicks from organic results’.