Last updated on March 29th, 2022

Beginner's Guide: How to Use a robots.txt File

What is a robots.txt file?

A robots.txt file is a plain text file that follows the Robots Exclusion Protocol to tell web crawlers which file paths or pages they may or may not access.

Why use a robots.txt file?

It gives web admins a way to:

Disallow crawlers from crawling unimportant pages.

Stop web crawlers from reaching sensitive or important information on the site.

Avoid overloading the site with too many crawl requests (crawl traffic).

How does a robots.txt file work?

Robots.txt is a plain text file placed in the root directory of a site. Before crawling a site, every well-behaved bot looks for its robots.txt file and then follows the instructions it contains while crawling. Suppose the file contains the following rule:

User-agent: *
Disallow: /confidential

Then compliant bots will not crawl any path beginning with /confidential and will crawl the rest of the site.
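As a quick sketch, you can check how that rule is interpreted with Python's standard-library robots.txt parser (the site URL and paths here are hypothetical examples, not from the original article):

```python
# Sketch: checking the example rule with Python's built-in robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /confidential",
]

parser = RobotFileParser()
parser.parse(rules)

# Paths under /confidential are blocked for every crawler ...
print(parser.can_fetch("*", "https://www.example.com/confidential/report.html"))  # False
# ... while everything else remains crawlable.
print(parser.can_fetch("*", "https://www.example.com/blog/post.html"))  # True
```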

A robots.txt file consists of two basic parts: the user-agent line and the directives that follow it.

User-agent:

The User-agent line names the search engine spider to which the directives that follow apply; all other search engine spiders can ignore them.

Directives:

These directive lines come immediately after the user-agent, and there can be multiple directives for one user-agent. They tell the user-agent which pages, directories, and subdirectories it is or is not allowed to crawl. The most common directives are:

Allow, Disallow, Crawl-delay, Sitemap.

User-agent: Bingbot
Disallow: /

The above rule tells the user-agent Bingbot not to crawl any part of the site.

Allow:

This directive tells search engine spiders that they can crawl a page or subfolder even though its parent folder or directory is disallowed. Note: not every crawler supports Allow, but Googlebot honors it.
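A sketch of this Allow/Disallow interplay, again with Python's stdlib parser and hypothetical paths. One caveat: Python's parser applies the first matching rule, so the Allow line is placed before the Disallow here, whereas Google's crawler instead picks the most specific (longest) matching rule regardless of order:

```python
# Sketch: an Allow rule carving an exception out of a Disallow.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Allow: /private/public-page.html",  # exception to the rule below
    "Disallow: /private/",               # everything else under /private/ is blocked
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://www.example.com/private/public-page.html"))  # True
print(parser.can_fetch("*", "https://www.example.com/private/secret.html"))       # False
```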

Disallow:

This directive tells search engine spiders not to crawl the given page, folder, or subfolders.

Crawl-delay:

This directive specifies how long (in seconds) a bot should wait before loading and crawling content.

Note: Googlebot ignores this directive.

Sitemap:

This directive is independent of the user-agent; it specifies the location of the XML sitemap so search engine spiders can crawl your webpages more effectively.
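Both Crawl-delay and Sitemap values can be read programmatically with Python's `urllib.robotparser` (the `site_maps()` method needs Python 3.8+; the site and sitemap URLs are hypothetical examples):

```python
# Sketch: reading Crawl-delay and Sitemap values with the stdlib parser.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /tmp/",
    "Sitemap: https://www.example.com/sitemap.xml",  # hypothetical sitemap URL
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.crawl_delay("*"))  # 10
print(parser.site_maps())       # ['https://www.example.com/sitemap.xml']
```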

A list of common search engine user agents:

User agent            Search engine   Purpose
Googlebot             Google          General
Googlebot-Image       Google          Images
Googlebot-Mobile      Google          Mobile
Googlebot-News        Google          News
Googlebot-Video       Google          Video
Mediapartners-Google  Google          AdSense
AdsBot-Google         Google          Google Ads
baiduspider-image     Baidu           Images
baiduspider-mobile    Baidu           Mobile
baiduspider-news      Baidu           News
baiduspider-video     Baidu           Video
bingbot               Bing            General
msnbot                Bing            General
msnbot-media          Bing            Images & Video
adidxbot              Bing            Ads
slurp                 Yahoo!          General
yandex                Yandex          General

How to check a robots.txt file?

You can check whether a robots.txt file is present on a website using the method below.

  1. Visit a root domain like example.com and append /robots.txt to it:
https://www.example.com/robots.txt

Upon entering the above URL in the address bar, you will get one of two results:

  1. If a robots.txt file already exists, you will see its contents. The default file for most WordPress-hosted domains is:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
  2. If the site doesn't have a robots.txt file, the URL may return a 404 HTTP status code. This can be fixed by creating a robots.txt file in a plain text editor such as Notepad and uploading it to the root directory.
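This check can also be scripted. Below is a small sketch in Python; the function names (`robots_url`, `has_robots_txt`) are illustrative helpers, not a standard API, and example.com is a placeholder site:

```python
# Sketch: build a site's robots.txt URL and check whether the file exists.
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen
from urllib.error import HTTPError

def robots_url(site: str) -> str:
    """Return the canonical robots.txt location for any URL on a site."""
    parts = urlsplit(site)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

def has_robots_txt(site: str) -> bool:
    """True if the site serves a robots.txt (i.e. the URL is not a 404)."""
    try:
        with urlopen(robots_url(site), timeout=10) as resp:
            return resp.status == 200
    except HTTPError as err:
        return err.code != 404  # 404 means no robots.txt file exists

# robots.txt always lives at the root, whatever page you start from:
print(robots_url("https://www.example.com/blog/post"))
# https://www.example.com/robots.txt
```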

How to create and use a robots.txt file?

Create a plain text file with UTF-8 encoding and save it as robots.txt.

Add your instructions to the newly created file.

Sample robots.txt file with instructions

# Comment Example 1: Block only Googlebot

User-agent: Googlebot
Disallow: /

# Comment Example 2: Block Googlebot and Adsbot

User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

# Comment Example 3: Block all but AdsBot crawlers

User-agent: *
Disallow: /
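Example 1 above can be sanity-checked with Python's stdlib parser: only Googlebot is blocked, while agents with no matching rule group fall through to the implicit allow (the URL is a hypothetical example):

```python
# Sketch: a rule group that blocks only Googlebot.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

url = "https://www.example.com/page.html"
print(parser.can_fetch("Googlebot", url))  # False: Googlebot is blocked everywhere
print(parser.can_fetch("Bingbot", url))    # True: no rule group matches Bingbot
```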

Upload it to the root directory

Upload the robots.txt file to the root of the website host, for example:

https://www.example.com/robots.txt

Testing robots.txt file

You can use the robots.txt Tester in Google Search Console to test your robots.txt file.

Some useful robots.txt rules:

To disallow crawling of the entire site:

User-agent: *
Disallow: /

To disallow crawling of a specific directory and its contents:

User-agent: *
Disallow: /calendar/
Disallow: /junk/
Disallow: /books/fiction/contemporary/
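The directory rules above can be verified the same way; note that only the listed paths (and anything beneath them) are blocked, while sibling paths such as a hypothetical /books/fiction/classic/ stay crawlable:

```python
# Sketch: verifying which paths the directory rules block.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /calendar/",
    "Disallow: /junk/",
    "Disallow: /books/fiction/contemporary/",
]

parser = RobotFileParser()
parser.parse(rules)

base = "https://www.example.com"  # hypothetical site
print(parser.can_fetch("*", base + "/junk/old.html"))           # False: under /junk/
print(parser.can_fetch("*", base + "/books/fiction/classic/"))  # True: sibling directory
```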

Limitations of the robots.txt file

  1. A page disallowed in robots.txt can still be indexed in SERPs if other sites link to it.
  2. Some crawlers may not obey the instructions in robots.txt, so it is better to safeguard sensitive information with password protection or a noindex meta tag on the page.
  3. You must use the proper syntax when instructing different web crawlers, as some crawlers might not understand or obey malformed rules.

Author

Pandith is the creative head of the Web Marketers Guide (WMG) and an experienced digital marketing analyst. He has worked as a marketing professional for various start-ups and immigration industries.

Reading time: 4 minutes