Linking suggestions, a look behind the scenes

The internal linking suggestions tool has been a staple of the Yoast SEO Premium plugin since 2017. After all these years it was time for an overhaul. In a dramatic turn of events, we actually took some lessons from the way search engines work. Ultimately, this led to a little internal search engine of our own, which we've released today with Yoast SEO 14.7! This post will take you on a journey behind the scenes.

A search engine?

Yes, indeed! If you look at it, our internal linking suggestion tool works just like a little internal search engine.

First of all, it creates a search query. This search query takes the form of a list of the most prominent words of the post that you are currently writing. Secondly, the tool creates an index, just like Google, of the content of your website. It saves the most prominent words of each post, page or term in a separate table in your WordPress database. Last, but not least, the tool uses the search query to search the index for relevant content. This all happens on your own server. In the end, this leads to a list of link suggestions, not unlike a search engine results page!

Do you want to know more? Let us dive deeper into the inner workings of this little search engine.

A deeper dive

The search query

The first step in the internal linking suggestions algorithm is to summarise the content of your post, page or term. This is more complicated than it may look because text has a natural order to it. If you change the order of two words in a sentence, its meaning can change completely.

However, there is an easy solution that has been tried and tested for decades. We just ignore the ordering! This approach is called the bag-of-words approach. Because we do not want the search query to grow very large, we limit the number of words in the query to the 20 most used ones. We call these the prominent words of your post, page or term page.
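To make the bag-of-words idea concrete, here is a minimal Python sketch. It is an illustration only, not the plugin's actual code: the function name `prominent_words` and the simple regular-expression tokenizer are assumptions made for this example.

```python
import re
from collections import Counter


def prominent_words(text, limit=20):
    """Return the `limit` most frequent words in `text` with their counts.

    Word order is thrown away: only the counts remain.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(limit)


text = "The cat sat on the mat. The dog chased the cat."
top = prominent_words(text, limit=2)
print(top)  # [('the', 4), ('cat', 2)]
```

Because only the counts are kept, "dog bites man" and "man bites dog" end up with the same bag of words.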

Adding word form support

Before we create this list of prominent words, however, we apply a linguistic method called stemming. This 'collapses' all the different word forms you use in your content to one canonical form. This makes sure that the content within your posts, pages and terms can be compared, even when you use different word forms. For example, the words 'cats' and 'cat' both collapse to the word form 'cat'. Both words indicate our feline friends, so it makes sense that they would count as one prominent word.

By the way, for many languages like Spanish, English, German, Indonesian and Swedish, we automatically filter out common function words like 'the', 'one' and 'many'. Since these words say little about what a text is about, leaving them out improves the suggestions even more.
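As a rough illustration of these two steps, the sketch below counts stems and skips function words. Both the `toy_stem` function and the `FUNCTION_WORDS` list are deliberately naive stand-ins invented for this example. A real stemmer handles far more word forms, and real function-word lists are much longer and differ per language.

```python
from collections import Counter

# Hypothetical, tiny function-word list, for illustration only.
FUNCTION_WORDS = {"the", "a", "one", "many", "and", "are", "is"}


def toy_stem(word):
    """Very naive stemmer: strips a trailing plural 's' and nothing else."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stemmed_counts(words):
    """Count stems, skipping function words."""
    return Counter(toy_stem(w) for w in words if w not in FUNCTION_WORDS)


words = "the cat and the many cats are one happy family".split()
counts = stemmed_counts(words)
print(counts["cat"])  # 2: 'cat' and 'cats' collapse to the same stem
```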

The index

Every search engine needs an index. In our case, this index takes the form of a table in your site's database. This table connects each post, page and term of your website to a list of its prominent words. More specifically: the table connects each indexable (the record Yoast SEO keeps for a post, page or term) to its prominent words. This makes querying the index fast and efficient.

When you save a post, page or term, its prominent words get added to the index. On top of the words themselves, we also save how often they and their word forms occur in that specific piece of content. This way, we can tell just how prominent these words are. Another way to save the prominent words of your website's content is to use the internal linking indexing tool on the Yoast SEO tools page of your website. In one go, this computes and saves the prominent words of all your publicly available posts, pages and terms that have not been indexed yet. This is useful when you have just installed Yoast SEO Premium on a site, or when the indexing algorithm has been improved.
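The sketch below shows what such an index table could look like. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the WordPress database. The table name, column names and the `save_prominent_words` helper are all hypothetical and chosen for this example, so they do not reflect the plugin's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (indexable, stem) pair, plus its count.
conn.execute(
    "CREATE TABLE prominent_words ("
    " indexable_id INTEGER, stem TEXT, weight INTEGER,"
    " PRIMARY KEY (indexable_id, stem))"
)


def save_prominent_words(conn, indexable_id, counts):
    """Replace the stored prominent words of one indexable."""
    conn.execute(
        "DELETE FROM prominent_words WHERE indexable_id = ?", (indexable_id,)
    )
    conn.executemany(
        "INSERT INTO prominent_words VALUES (?, ?, ?)",
        [(indexable_id, stem, weight) for stem, weight in counts.items()],
    )


save_prominent_words(conn, 1, {"cat": 3, "food": 2})
save_prominent_words(conn, 2, {"dog": 4, "food": 1})

# Which indexables use the stem 'food', and how often?
rows = conn.execute(
    "SELECT indexable_id, weight FROM prominent_words"
    " WHERE stem = ? ORDER BY indexable_id",
    ("food",),
).fetchall()
print(rows)  # [(1, 2), (2, 1)]
```

Storing one row per word makes it cheap to look up every piece of content that shares a given prominent word.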

The search algorithm

Now that we have a representation of the content on your site, as well as the content of the post, page or term you are currently writing, we need to match them together. What we need is a way to compare the prominent words of one post with those of another. If they are similar, we can say that their contents are similar as well. With a reasonable amount of confidence, of course.

We can compare two pieces of content with a mathematical concept called a similarity measure. This is a mathematical function that takes two vectors (more on that later) and outputs a number that indicates how similar these two vectors are. A high number would mean that both are very similar. Before we can use a similarity measure, though, we need to transform the lists of prominent words to these nifty vectors. There is a pretty easy way to do this for bag-of-words models.

Transforming a bag-of-words model to a vector

A vector is basically a list of numbers. For a similarity measure to work, both vectors need to have the same number of items. For bag-of-words models, this can be done by labeling each position in a vector with a word that can occur on your website. The number located at that position in a vector would be the word's weight. In this case the weight would be the number of times that word occurs in the content.

For example, the word 'bear' may always be tied to position 5. A value of 0 in a specific vector would mean that it does not occur in the post, page or term tied to that vector. A weight of 3 means that it occurs 3 times.
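A small Python sketch of this transformation follows. The helper `to_vector` and the two example posts are made up for illustration. The vocabulary is simply sorted so that every word always lands on the same position.

```python
def to_vector(counts, vocabulary):
    """Turn a bag of words into a list of weights, one slot per vocabulary word."""
    return [counts.get(word, 0) for word in vocabulary]


post_a = {"bear": 3, "honey": 1}
post_b = {"honey": 2, "bee": 4}

# Every word that occurs anywhere gets a fixed position.
vocabulary = sorted(set(post_a) | set(post_b))  # ['bear', 'bee', 'honey']

vector_a = to_vector(post_a, vocabulary)  # [3, 0, 1]
vector_b = to_vector(post_b, vocabulary)  # [0, 4, 2]
```

Both vectors have the same length, and the 0 in `vector_a` records that 'bee' does not occur in the first post.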

The similarity measure

There are many similarity measures that are useful for text. We decided to use a commonly used one, the cosine similarity measure. The cosine similarity measure has two big advantages over other measures.

The first advantage is that the value is always bounded between 0 and 1, because word weights are never negative. A value of 0 means that the two vectors are not at all similar, whereas a value of 1 means that they point in exactly the same direction: the same words in the same proportions. This means that interpreting a similarity score is relatively easy, even for humans.

The second advantage is that it automatically takes the length of the two vectors into account. In turn, this means that the length of a text does not influence the resulting score. Without this correction, text length could be a problem, since you naturally use the prominent words more often in a longer text. That would skew the suggestions in a bad way, giving more weight to longer texts.
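The cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below is a plain Python version of that textbook formula, with example vectors invented for illustration. It shows both properties: scores stay between 0 and 1 for non-negative weights, and a text that is ten times longer but uses the same words in the same proportions still scores 1.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


short_text = [3, 0, 1]
long_text = [30, 0, 10]  # same proportions, ten times as many words
unrelated = [0, 4, 0]    # shares no words with short_text

same_topic = cosine_similarity(short_text, long_text)   # approximately 1.0
no_overlap = cosine_similarity(short_text, unrelated)   # 0.0
```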

A secret ingredient

Before we can fit the parts together to create the little search engine, we add another ingredient. We multiply the weight of each prominent word with its inverse document frequency. The document frequency of a word is the number of documents (in this case posts, pages and terms) in which a word occurs.

Why are we interested in a word's document frequency? Because the more often you use a word, the less useful it is when generating internal linking suggestions. Let us take an example to illustrate this. Let us say that you have, like Yoast, a blog about SEO. The word 'SEO' would occur in almost every post, page or term on your blog. The fact that two posts share the word SEO does not add anything useful at all. In fact, it leads to noise in the algorithm, with almost every piece of content matching with every other piece!

To fix this problem, we multiply each prominent word's weight by the inverse of its document frequency. This is just a fancy way of saying that the more posts, pages and terms a word occurs in, the less overall weight it gets when generating internal linking suggestions.
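Here is a minimal sketch of this weighting in Python. It takes the description literally and divides each count by the word's document frequency. That exact formula is an assumption for this example, since many search systems use a logarithmic or smoothed variant instead. The function names and the sample index are hypothetical as well.

```python
def document_frequencies(index):
    """Count in how many documents each word occurs."""
    df = {}
    for counts in index.values():
        for word in counts:
            df[word] = df.get(word, 0) + 1
    return df


def apply_idf(counts, df):
    """Scale each weight by 1 / document frequency (assumed formula)."""
    return {word: weight / df.get(word, 1) for word, weight in counts.items()}


# A small SEO blog: 'seo' occurs in every post.
index = {
    1: {"seo": 5, "link": 2},
    2: {"seo": 4, "keyword": 3},
    3: {"seo": 6, "snippet": 1},
}

df = document_frequencies(index)
weighted = apply_idf(index[1], df)
# 'seo' drops from 5 to 5/3, while 'link' keeps its weight of 2.
```

After weighting, the rarer word 'link' counts for more than 'seo' in post 1, even though 'seo' is used more often there.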

Tying it all together

Now that we have all the parts of the search engine, we can tie them together:

  1. You open a post, page or term page.
  2. While writing your awesome content, a list of its most prominent words is calculated.
  3. This list is sent to your website's server.
  4. The server computes a list of the most similar content.
  5. This list gets sent back to your internet browser.
  6. The internal linking suggestion tool shows you a list of other awesome content that you can link to.
  7. Whenever you save the post, page or term page, its prominent words are saved to your website's database.
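The server-side part of these steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the plugin's implementation: the index is a plain dictionary keyed by post title, the weighting divides counts by document frequency, and the names `suggest`, `idf_weighted` and `cosine` are invented for this example.

```python
import math


def idf_weighted(counts, df):
    """Scale each count by 1 / document frequency (assumed formula)."""
    return {w: c / df.get(w, 1) for w, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest(query_counts, index, limit=5):
    """Rank indexed content by similarity to the draft's prominent words."""
    df = {}
    for counts in index.values():
        for word in counts:
            df[word] = df.get(word, 0) + 1
    query = idf_weighted(query_counts, df)
    scored = [
        (cosine(query, idf_weighted(counts, df)), title)
        for title, counts in index.items()
    ]
    scored = [(score, title) for score, title in scored if score > 0]
    scored.sort(reverse=True)
    return [title for score, title in scored[:limit]]


index = {
    "Internal linking guide": {"seo": 4, "link": 6, "anchor": 2},
    "Keyword research basics": {"seo": 5, "keyword": 7},
    "Baking sourdough": {"bread": 8, "flour": 3},
}
draft = {"seo": 3, "link": 5}  # prominent words of the post being written

suggestions = suggest(draft, index)
print(suggestions)  # ['Internal linking guide', 'Keyword research basics']
```

The sourdough post shares no prominent words with the draft, so it scores 0 and is left out of the suggestions.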

That is it. Now you know how the internal linking suggestions are generated. With this useful tool, you can add links without needing to rummage through your website for useful content to link to. Happy blogging!

Read more: A much-improved internal linking tool: What’s new? »