Although many SEOs are familiar with the concepts of optimizing for crawl budget and blocking bots from indexing pages, the details are often overlooked. Best practices have significantly changed in recent years, so it is important to stay up-to-date.

A small modification to your robots.txt file or robots tags can have a significant impact on your website.

In this article, we will explain how to use robots meta tags and the x-robots-tag to control the indexation of your website’s content. This will help you ensure that your content is being indexed correctly and improve your website’s SEO.

Robots Meta Tags

The robots meta tag can be used to control how search engine crawlers index a webpage and follow the links on it. The tag is placed in the <head> section of the page’s HTML code.

This is what a robots meta tag looks like in the source code of a page:

<meta name="robots" content="noindex" />

The tags allow you to control how search engines handle your page, including whether or not to include it in the index.

Robots Meta Tags Usage

Robots meta tags are used to control how Google indexes your web page’s content. This includes:

  • Whether or not to include a page in search results
  • Whether or not to follow the links on a page (even if it is blocked from being indexed)
  • Requests not to index the images on a page
  • Requests not to show cached results of the web page on the SERPs
  • Requests not to show a snippet (meta description) for the page on the SERPs

The robots meta tag controls how search engine crawlers index and follow the links on your website. You can use it to tell crawlers what to do with your pages and content, and there are different attributes and directives that control how it works. Below, we share code examples you can use to request that search engines handle your pages in a certain way.

A meta name=“robots” tag placed in the HTML of each URL tells search engine crawlers whether they are allowed to index the content of the page and whether they are allowed to follow any links on it. This helps search engines understand which content on the page is important and which links are worth crawling in order to pass along link equity.

The “robots” meta tag applies to all website crawlers; however, you can also target a specific user agent, such as “googlebot”. It is rarely necessary to use multiple meta robots tags to set instructions for specific spiders.
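
For example, here is a hedged sketch of a user-agent-specific tag alongside a generic one (the directive values are purely illustrative):

<meta name="robots" content="noindex, follow" />
<meta name="googlebot" content="noindex, nofollow" />

The first tag applies to all crawlers, while the second sets instructions for Googlebot specifically.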

There are two important considerations when using meta robots tags:

  • Similar to robots.txt, the meta tags are directives, not mandates, so they may be ignored by some bots.
  • The robots nofollow directive only applies to links on that page. A crawler may still reach the target page via a link on another page or website that isn’t nofollowed, so the bot may still arrive at and index your undesired page.

Here’s the list of all meta robots tag directives:

  • index: Tells search engines to show this page in search results. This is the default state if no directive is specified.
  • noindex: Tells search engines not to show this page in search results.
  • follow: Tells search engines to follow all the links on this page and pass equity, even if the page isn’t indexed. This is the default state if no directive is specified.
  • nofollow: Tells search engines not to follow any link on this page or pass equity.
  • all: Equivalent to “index, follow”.
  • none: Equivalent to “noindex, nofollow”.
  • noimageindex: Tells search engines not to index any images on this page.
  • noarchive: Tells search engines not to show a cached link to this page in search results.
  • nocache: Same as noarchive, but only used by Internet Explorer and Firefox.
  • nosnippet: Tells search engines not to show a meta description or video preview for this page in search results.
  • notranslate: Tells search engines not to offer a translation of this page in search results.
  • unavailable_after: Tells search engines to no longer index this page after a specified date.
  • noodp: Now deprecated, it once prevented search engines from using the page description from DMOZ in search results.
  • noydir: Now deprecated, it once prevented Yahoo from using the page description from the Yahoo directory in search results.
  • noyaca: Prevents Yandex from using the page description from the Yandex directory in search results.
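
Multiple directives can be combined in a single tag by separating them with commas. As a rough sketch (this particular combination is chosen purely for illustration):

<meta name="robots" content="noindex, follow, noarchive, nosnippet" />

This asks search engines to keep the page out of their index and to show neither a cached copy nor a snippet, while still following the links on the page.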

Because search engines have different capabilities, not all of them can use all robots meta tags, as explained by Yoast. Furthermore, some search engines are not transparent about what they do and don’t support in terms of these tags.

X-Robots-Tag

An alternative way to control how search engines crawl and index your webpages is to use the x-robots-tag instead of the standard meta robots tags.

While it is relatively easy to add robots meta tags to HTML pages, the x-robots-tag is more involved to set up. However, if you want to control how non-HTML content, such as PDFs, is handled, you will need to use the x-robots-tag.

The x-robots-tag is sent as part of the HTTP header response for a URL, and any directive that can be used in a robots meta tag can also be specified as an x-robots-tag.

Here’s an example of what an x-robots-tag header response looks like:

x-robots-tag: noindex, nofollow

To use the x-robots-tag, you will need to be able to access your website’s header.php, .htaccess, or server configuration file. If you do not have access to these, you can use robots meta tags to instruct crawlers.

When to Use the X-Robots-Tag

The x-robots-tag is not as easy to use as the meta robots tags, but it allows you to tell search engines how to index and crawl other file types.

Use the x-robots-tag when:

  • You need to control how search engines crawl and index non-HTML file types
  • You need to serve directives at global level (sitewide) rather than at page-level

How to Set Up Robots Meta Tags and X-Robots-Tag

There are two methods of controlling how search engines index your site: robots meta tags and the x-robots-tag. Generally, setting up robots meta tags is easier than the x-robots-tag, but the implementation can differ depending on your CMS and/or server type.

Here’s how to use meta robots tags and the x-robots-tag on common setups:

Using Robots Meta Tags in HTML Code

Add your robots meta tags to the <head> section of your page’s HTML code.

For example, if you want search engines not to index the page but still want its links to be followed, use:

<meta name="robots" content="noindex, follow" />

Using Robots Meta Tags on WordPress

If you’re using Yoast SEO, open the ‘Advanced’ tab in the Yoast block below the page editor.

To prevent a page from being indexed (the “noindex” directive), set the “Allow search engines to show this page in search results?” dropdown to No. To prevent links from being followed, set “Should search engines follow links on this page?” to No.

Using Robots Meta Tags on Shopify

You can add robots meta tags to your Shopify site by editing the <head> section of your theme.liquid layout file.

To set the directives for a specific page, add the below code to this file:

{% if handle contains 'page-name' %}
<meta name="robots" content="noindex, follow">
{% endif %}

This code indicates to search engines that they should not index the specific page, but follow all of the links on that page.

You will need to set the directives on each page separately.
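
If your theme exposes Shopify’s standard template object (an assumption worth verifying in your own theme), a sketch like the following would apply the directive to every internal search results page rather than a single handle:

{% if template contains 'search' %}
<meta name="robots" content="noindex, follow">
{% endif %}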

Using X-Robots-Tag on an Apache Server

To use the x-robots-tag on an Apache web server, add the following to your site’s .htaccess or httpd.conf file.

<Files ~ ".pdf$">
Header set X-Robots-Tag "noindex, follow"
</Files>

The example above matches .pdf files and tells search engines not to index them, but to follow any links they contain.

Using X-Robots-Tag on an Nginx Server

If you’re running an Nginx server, add the below to your site’s .conf file:

location ~* \.pdf$ {
add_header X-Robots-Tag "noindex, follow";
}

This tells search engines not to index PDF files, but to follow any links they contain.

Robots directives and SEO

The robots.txt file is the first gatekeeper of your website: it tells bots not to request certain pages at all, before any content is served.

Robots directives fall into two groups: those that control indexing and those that control whether links are followed and pass link equity. Robots meta tags only take effect after the page has loaded, while X-Robots-Tag headers take effect as soon as the server responds to a request and offer more granular control.

Blocking Bots to Save Server Bandwidth

You will see many user-agents in your log files that take up bandwidth without providing much value in return, such as:

  • SEO crawlers, such as MJ12bot (from Majestic) or Ahrefsbot (from Ahrefs).
  • Tools that save digital content offline, such as Webcopier or Teleport.
  • Search engines that are not relevant in your market, such as Baidu (Baiduspider) or Yandex (YandexBot).

A suboptimal solution would be to block these spiders with robots.txt. This is not guaranteed to be effective, and it is a fairly public declaration that could give competitors insight into what you are doing.

A better approach is to edit your .htaccess file so that requests from these user agents receive a 403 Forbidden response.
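
As a hedged sketch for an Apache server with mod_rewrite enabled (the user-agent strings below are examples; check your own log files for the exact tokens):

RewriteEngine On
# Return 403 Forbidden to selected crawlers based on their User-Agent string
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|AhrefsBot|WebCopier|Teleport) [NC]
RewriteRule .* - [F,L]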

Internal Site Search Pages Using Crawl Budget

If a website’s internal search results are generated on static URLs, this can take up a lot of the website’s “crawl budget” and may cause problems like thin or duplicate content.

One sub-optimal solution to preventing crawler traps is to disallow the directory with robots.txt. However, this limits your ability to rank for key customer searches and for such pages to pass link equity.

The best way to handle this is to map high-volume, relevant queries to existing URLs that are already friendly to search engines. For example, if someone searches for “samsung phone,” instead of creating a new page for /search/samsung-phone, you would redirect them to the /phones/samsung page.
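
Using the hypothetical paths from the example above, the mapping on an Apache server could be a simple 301 redirect in .htaccess:

# Send the internal search URL to the equivalent category page
Redirect 301 /search/samsung-phone /phones/samsung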

If your internal search uses parameter-based URLs, you can also configure that parameter in Google Search Console to ask Google not to crawl it.

If you allow crawling, you should check if the pages are high quality and likely to rank well. If they’re not, you can add a “noindex, follow” directive to prevent them from being indexed while you work on improving the quality.

Blocking Parameters with Robots

The use of query string parameters, such as those generated by faceted navigation or tracking, can lead to decreased crawl budget, duplicate content URLs, and splitting of ranking signals.

A sub-optimal solution is to disallow crawling of parameters with robots.txt or to add a “noindex” robots meta tag, as this would block the flow of link equity through those pages.

It’s generally best practice to ensure that each parameter has a clear purpose and to implement ordering rules that only use keys once and prevent empty values. Adding a rel=canonical link attribute to relevant parameter pages can help combine ranking ability. You can also configure all parameters in Google Search Console, which provides more granular options for communicating crawling preferences. For more information, see the Search Engine Journal’s parameter handling guide.
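
For instance, a canonical link on a hypothetical faceted URL such as /phones/samsung?color=black could point back to the clean category page (the domain and paths here are placeholders):

<link rel="canonical" href="https://www.example.com/phones/samsung" />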

Blocking Admin or Account Areas

Search engines should not be able to access any private content, such as admin or account areas.

A sub-optimal solution to keeping private pages out of the SERPs would be to use robots.txt to block the directory. However, this is not guaranteed to work.

The best practice approach is to use password protection or to send a “noindex” directive in the X-Robots-Tag HTTP header.
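
On an Apache server, a sketch of both options for a hypothetical /admin/ directory could go in that directory’s .htaccess file (mod_headers and mod_auth_basic assumed; the .htpasswd path is a placeholder):

# Option 1: send a noindex header for everything served from this directory
Header set X-Robots-Tag "noindex, nofollow"

# Option 2: require a login before any content is served
AuthType Basic
AuthName "Restricted area"
AuthUserFile /path/to/.htpasswd
Require valid-user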

Blocking Marketing Landing Pages and Thank You Pages

Issue: You may need to exclude certain URLs from your organic search results if you don’t want people to stumble upon them accidentally. For example, you wouldn’t want people to find your thank you pages through a search engine if they haven’t converted yet. Similarly, you may want to exclude dedicated email or CPC campaign landing pages from your organic search results.

A sub-optimal solution would be to disallow the files with robots.txt, as this wouldn’t prevent the link from being included in the search results.

Best practice approach: Use a “noindex” meta tag.

Manage On-Site Duplicate Content

Some websites need a printer friendly version of a page, but they want to make sure that the original page is the one that is recognized by search engines. Other websites have duplicate content because of poor development practices, such as having the same item for sale on multiple category URLs.

A sub-optimal solution is to disallow the duplicate pages in robots.txt or to noindex them. Google will eventually treat the links on long-term noindexed pages as “nofollow” as well, which prevents the duplicate pages from passing along any link equity.

The best way to deal with duplicate content is to remove the duplicate source and 301 redirect it to the search engine friendly URL. If there is a reason for the duplicate content to exist, add a rel=canonical link attribute to consolidate ranking signals.
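
As a sketch, assuming a hypothetical /print/ URL structure on an Apache server, the redirect could look like this in .htaccess:

# Permanently redirect printer-friendly URLs to their canonical equivalents
RedirectMatch 301 ^/print/(.*)$ /$1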

Thin Content on Account-Related Pages

Pages such as login, register, shopping cart, and checkout are content-light but necessary for users; they don’t offer much value to search engines.

A sub-optimal solution would be to disallow the files with robots.txt, as this wouldn’t prevent the link from being included in the search results.

On most websites these pages are few in number, and implementing robots handling for them is unlikely to have any noticeable impact on your KPIs. If you feel the need to do something, a “noindex” directive is usually the best choice, unless people are actually searching for these pages.

Conclusion

You can prevent technical SEO mistakes by taking the time to understand the different directives.

Having control over how your pages are crawled and indexed can help to keep unwanted pages out of the search engine results pages (SERPs), prevent search engines from following unnecessary links, and give you control over how your site’s snippets are displayed, among other things.

Configuring your robots meta tags and x-robots-tags correctly improves your site’s communication with search engines and helps ensure that all of your content can be properly read and indexed.

