Indexables: Functional specification

Yoast SEO's "Indexables" framework provides an abstraction layer for interacting with post metadata relating to SEO.

A page-centric model of the web

A large part of what our software does is store, manage, and evaluate information relating to pages. Each of these pages has a unique (canonical) URL.

This is how most search engines and systems 'think' about the web. They build a map of all the pages they know about, based on their URLs. We do the same thing. When we have that map, we can easily check, update, and manage information about a given page.

On the surface, this seems like a straightforward concept. But words like 'page' have hidden complexity and nuance - especially in the context of WordPress.

For example, in WordPress, posts stored in the database don't get stored with a URL. Every time the system needs to know the URL of a page, it has to be calculated (based on the user-defined URL structure settings for the site). That's computationally expensive.
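
To illustrate why this matters, here's a minimal sketch of deriving a URL from a permalink structure. It's written in Python for brevity rather than WordPress' PHP, and it isn't WordPress' actual implementation: the `Post` class and `build_permalink` function are hypothetical names, and only the structure tags (`%year%`, `%monthnum%`, `%day%`, `%category%`, `%postname%`) come from WordPress itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    """Hypothetical stand-in for a stored post. Note that it has no URL field."""
    slug: str
    published: date
    category: str


def build_permalink(site_url: str, structure: str, post: Post) -> str:
    """Expand a permalink structure such as '/%year%/%monthnum%/%postname%/'."""
    replacements = {
        "%year%": f"{post.published.year:04d}",
        "%monthnum%": f"{post.published.month:02d}",
        "%day%": f"{post.published.day:02d}",
        "%category%": post.category,
        "%postname%": post.slug,
    }
    path = structure
    for tag, value in replacements.items():
        path = path.replace(tag, value)
    return site_url.rstrip("/") + path


post = Post(slug="example-page", published=date(2018, 10, 20), category="news")
print(build_permalink("https://www.example.com", "/%year%/%monthnum%/%day%/%postname%/", post))
```

In this sketch every value is already in memory. In a real site, values such as the category typically need further lookups, so repeating this work for every URL on every request adds up.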

But processing overheads aren't the only challenge here - there are also scenarios where it's not clear what we mean by 'page'.

But what's a page?

Beyond what we might conceive to be a conventional 'page' on a website, we might also have archive views (e.g., all posts published by a given author), alternate content formats (e.g., RSS feeds), taxonomies (e.g., tags and categories), error templates (e.g., 404 pages), paginated results, and other esoteric types of content. These are all 'pages', as far as search engines are concerned.

From an SEO perspective, each of these scenarios must be handled differently - each with its own rules and conditions. Even a simple blog post may have dozens of values that we need to consider and evaluate. These range from crawling and indexing controls, to content evaluation scores, keywords, presentation settings, media, and beyond. We must consider all of these fields and the relationships between them, in the process of determining what SEO metadata should be output on the page.

For example, simply determining the appropriate canonical URL of a page requires extensive querying and evaluation.

For larger sites, all of that logic, storage, and processing can impact performance - particularly in WordPress, where the database structure isn't designed or optimized for this kind of requirement.

Furthermore, websites contain many 'pages' which we don't want to evaluate for SEO purposes. Some content types may exist within the system (e.g., to be used solely within an admin view), but are never exposed on a public URL. It doesn't make sense for us to store and process information about these, because they're not indexable by search engines.

Knowing what is and isn't an indexable is key to performant metadata management.

What's an indexable?

An indexable is any resource that can (theoretically) be indexed by a search engine, against a given URL. That includes many content types beyond just 'pages' - like categories, author archives, paginated states of date archives, media files, and more.

Examples:

https://www.example.com/example-page/ - A conventional webpage.
https://www.example.com/example-category/page/2/ - A paginated state of a category archive.
https://www.example.com/2018/10/20/ - A date archive.
https://www.example.com/author/laura/ - An author archive.
https://www.example.com/colors/red/ - A custom taxonomy term archive.

NB, we intentionally exclude any non-public pages, as well as pages which return errors.
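
The exclusion rule above can be sketched as a simple predicate. This is an illustration in Python only, not Yoast SEO's code: the `Candidate` class, its fields, and the `/internal-report/` and `/missing/` URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of a URL we are considering."""
    url: str
    kind: str          # illustrative label, e.g. "post" or "author-archive"
    is_public: bool    # whether the resource is exposed publicly
    status_code: int   # HTTP status the URL would return


def is_indexable(candidate: Candidate) -> bool:
    """Keep only public resources that do not return an error."""
    return candidate.is_public and candidate.status_code < 400


candidates = [
    Candidate("https://www.example.com/example-page/", "post", True, 200),
    Candidate("https://www.example.com/author/laura/", "author-archive", True, 200),
    Candidate("https://www.example.com/internal-report/", "admin-only", False, 200),
    Candidate("https://www.example.com/missing/", "post", True, 404),
]

indexables = [c.url for c in candidates if is_indexable(c)]
print(indexables)
```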

Yoast SEO's Indexables table(s) in WordPress

Yoast SEO creates and manages indexables in WordPress with a dedicated database table. This stores all of the information we might need from an SEO perspective, about every indexable we know about. That means that when we want to query a given page to determine what the SEO metadata should be, we can do so extremely efficiently.

This process operates silently in the background, and seamlessly synchronizes with WordPress' native metadata fields and processes.

The table also automatically populates and updates itself. When we encounter an indexable that we don't know about, we create a new record, so that the data is available on subsequent requests. We also provide a (re)indexing process in our admin tools, which proactively builds our indexables table from the site's database.

With the indexables table in place, we have an 'SEO-centric' view of the website, which is focused on pages (and the metadata which should be output on them).

Indexing

Our indexables table is constructed and maintained via two methods:

Various optimization processes in the Yoast SEO interface will prompt users to undertake an 'indexing' process, as a prerequisite for various tools and controls.
Requests to previously undiscovered indexables will trigger a lazy generation process.

These processes ensure that the indexables table is always a complete and accurate representation of the site.
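
As a rough sketch of how the two methods relate, the lazy path can be modeled as 'look up, and build on a miss', with the proactive path simply walking every known URL through that same lookup. The Python below is an assumption-laden illustration, not the plugin's implementation: `IndexableStore`, `get`, `index_all`, and the `build_record` callback are all hypothetical names.

```python
from typing import Callable, Dict, Iterable

Record = Dict[str, str]


class IndexableStore:
    """Hypothetical in-memory stand-in for the indexables table."""

    def __init__(self, build: Callable[[str], Record]) -> None:
        self._build = build                     # the expensive calculation
        self._records: Dict[str, Record] = {}
        self.builds = 0                         # how often we had to calculate

    def get(self, url: str) -> Record:
        """Lazy path: build and store a record the first time a URL is requested."""
        if url not in self._records:
            self._records[url] = self._build(url)
            self.builds += 1
        return self._records[url]

    def index_all(self, urls: Iterable[str]) -> int:
        """Proactive path: ensure every URL has a record; return how many were new."""
        before = self.builds
        for url in urls:
            self.get(url)
        return self.builds - before


def build_record(url: str) -> Record:
    # Placeholder for the real metadata calculation.
    return {"permalink": url, "slug": url.rstrip("/").rsplit("/", 1)[-1]}


store = IndexableStore(build_record)
store.get("https://www.example.com/example-page/")   # first request: builds the record
store.get("https://www.example.com/example-page/")   # second request: served from the store
created = store.index_all([
    "https://www.example.com/example-page/",
    "https://www.example.com/author/laura/",
])
print(store.builds, created)
```

Here the first page is only calculated once, and the proactive pass creates a record only for the URL that hadn't been requested yet.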

What types of indexables does Yoast SEO store?

Types of indexables we store include:

All public* posts and taxonomies
The homepage
Author archives (for authors with published, public posts)

We also store several 'patterns' which represent template and content types where it isn't valuable or necessary to include discrete indexables for every possible permutation. These include:

Post type, taxonomy and date archives
Error pages
Internal search results

*We consider a page to be 'public' when the public attribute for the post/taxonomy type is set to true in register_post_type/register_taxonomy.
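
A minimal sketch of that check, in Python for illustration only. The registry dictionary and the `internal_note` type are hypothetical; the one WordPress detail relied on is that `register_post_type` treats a missing `public` argument as false.

```python
from typing import Dict, List

# Hypothetical registry mirroring the arguments passed to register_post_type.
REGISTERED_POST_TYPES: Dict[str, Dict[str, bool]] = {
    "post": {"public": True},
    "page": {"public": True},
    "revision": {"public": False},
    "internal_note": {},  # hypothetical type registered without a 'public' argument
}


def public_post_types(registered: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the names of post types whose 'public' argument is true."""
    return sorted(name for name, args in registered.items() if args.get("public", False))


print(public_post_types(REGISTERED_POST_TYPES))
```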

Use-cases

When we have a robust understanding of all of the public pages on a site, we can use our database to power functionality and tools. For example:

When retrieving metadata for a page's <head>, we can make a single database request for all of the relevant, pre-calculated fields.
When constructing an XML sitemap, we can instantly determine which indexables should or shouldn't be included.
Other software and systems can easily integrate with, modify, and build on our logic.
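
The first two use-cases can be sketched with a small in-memory SQLite table. This is a simplified illustration in Python: the table name, its columns, and the helper functions are assumptions made for the example, and don't describe the plugin's real schema.

```python
import sqlite3
from typing import Dict, List, Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE indexable (
        permalink         TEXT PRIMARY KEY,
        title             TEXT,
        description       TEXT,
        canonical         TEXT,
        is_robots_noindex INTEGER NOT NULL DEFAULT 0
    )
    """
)
conn.executemany(
    "INSERT INTO indexable VALUES (?, ?, ?, ?, ?)",
    [
        ("https://www.example.com/example-page/", "Example page", "An example.",
         "https://www.example.com/example-page/", 0),
        ("https://www.example.com/author/laura/", "Laura", "Posts by Laura.",
         "https://www.example.com/author/laura/", 1),
    ],
)


def head_metadata(db: sqlite3.Connection, permalink: str) -> Optional[Dict[str, str]]:
    """Fetch every pre-calculated <head> field for a page in one query."""
    row = db.execute(
        "SELECT title, description, canonical FROM indexable WHERE permalink = ?",
        (permalink,),
    ).fetchone()
    if row is None:
        return None
    return dict(zip(("title", "description", "canonical"), row))


def sitemap_urls(db: sqlite3.Connection) -> List[str]:
    """List the permalinks that are not marked noindex."""
    rows = db.execute(
        "SELECT permalink FROM indexable WHERE is_robots_noindex = 0 ORDER BY permalink"
    )
    return [row[0] for row in rows]


print(head_metadata(conn, "https://www.example.com/example-page/"))
print(sitemap_urls(conn))
```

Because the values are stored per URL, neither function needs to recalculate anything at request time.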

Altering indexables behavior

Most users won't ever need to interact directly with the indexables table or logic. However, advanced users may wish to customize the behavior to fit their needs. To enable this, we provide a range of filters to alter the default behavior or interact with the table:

You can disable the creation of new indexables.
You can exclude specific post types and taxonomies.
You can force a (re)indexing process via WP CLI.
