What is Robots.txt?

The Robots.txt file is one of the most important, yet often underestimated, files a website should have.

Robots.txt can:

  • tell search engines which files of your site they can crawl
  • hide some folders or file types from Google
  • help search engines crawl your website more efficiently
  • save your web server's resources
  • tell search engines where your sitemap is
  • much more

This file tells the bots of search engines, and other software, what they should or shouldn't analyse.
Misusing this file can easily make a website almost totally disappear from the Google index. It happens more frequently than you'd think, especially when a development version of a site goes live: developers and webmasters can forget to remove the directive that inhibits crawling of the website.


A quick look at Robots.txt

Robots.txt is a simple text file, placed in the top-level directory of a website. Its goal is to suggest to search engines and other spiders which files, pages or directories they can or can't crawl.

There can be multiple reasons for not wanting a file crawled:

  • low-quality content
  • competitive reasons
  • irrelevant pages
  • thank you pages
  • dynamically generated pages
  • admin pages
  • shopping cart
  • …

You need to consider that the robots.txt file can be read by anyone: it can be reached at /robots.txt on any domain and subdomain.
This means that you should NOT list sensitive pages in this file.

Most of the time the content of the robots.txt file is respected but, since its rules are "directives", i.e. suggestions, crawlers can still access the pages it lists even if you ask them not to.
Google, the main traffic driver for my readers, usually honours this file.

How you can use the Robots.txt file

The main use of the Robots.txt file is to "allow" or "disallow" URLs, such as pages, folders or certain file types, to be crawled by search engines. Google, Bing, Yahoo and Yandex take these directives into consideration, with some exceptions.

The file uses three main directives:
User-agent:
It tells which bot the following instructions refer to.

Disallow:
Specifies the paths the bot should not access. If no path is specified, this instruction is ignored.

Allow:
Specifies the paths the bot can access. If no path is specified, this instruction is ignored.

This is how you should implement the robots.txt:
User-agent: [bot name, * represents all the bots]
Disallow: [file(s), file types, folder(s) you don't want to be crawled]
Allow: [file(s), file types, folder(s) you want to be crawled]
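
A combined example may help: the folder and file names below are just hypothetical, but they show how an Allow rule can carve out an exception inside a disallowed folder.

Disallow a folder for all bots, but keep one file inside it crawlable
User-agent: *
Disallow: /private/
Allow: /private/public-page.html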

Within Google Search Console, you can verify that everything is fine and that the file doesn't block any important resources.

Disallow all bots from crawling your site
User-agent: *
Disallow: /

Disallow one of the Google bots from crawling your cart page
User-agent: Googlebot
Disallow: /cart
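
As a side note, you can sanity-check how a crawler is likely to interpret rules like the one above with Python's standard urllib.robotparser module. This is just a sketch: the domain is a placeholder, and real bots may apply their own extensions to the rules.

```python
from urllib.robotparser import RobotFileParser

# Rules equivalent to the example above
rules = """
User-agent: Googlebot
Disallow: /cart
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Googlebot is blocked from /cart, but other paths stay crawlable
print(parser.can_fetch("Googlebot", "https://www.example.com/cart"))   # False
print(parser.can_fetch("Googlebot", "https://www.example.com/about"))  # True
```

The same parse/can_fetch pair works for any user agent string, so you can test several bots against one rule set before deploying the file.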

Indexing vs Crawling

Indexing and crawling have different meanings, and it's important to know the difference between the two processes.

Indexing is the process of including a URL in the search engine index, so the page can be found with a specific query.
Crawling is the process of fetching a file (web page, image, document, etc.), analysing it and, eventually, including it in the search engine index.

This means that, if you don't want a URL to be available in the search results, you should prevent the file from being crawled. If your file has already been crawled once, you can still hide it from the search engine results page (with some other methods I'll talk about in a future post).

Not crawling a file doesn't automatically mean that the URL is not indexed. In fact, you may still see the file in the search results because of, for example, some backlinks it receives.

If the crawler finds an internal or external link to a file you labelled as "disallow", the bot will register the URL. But since the file is excluded from crawling, the bot won't collect the information (meta title, meta description, and other data) it usually includes.

This kind of indexing happens when the bot finds the URL before it knows it's a blocked resource. Once the bot finds this URL in the robots.txt, it'll skip the regular crawling, although the file is still technically indexed.
The URL can then be reached via a site: query, but without all the information a regularly indexed page includes.

Sitemap in Robots.txt

To make sure Google and other search engines can find your sitemap, you can add it to the Robots.txt with the following directive:

Sitemap: https://www.domain.com/sitemap.xml
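
The same urllib.robotparser module mentioned earlier can also read this directive (via site_maps(), available since Python 3.8); again, the domain below is just a placeholder.

```python
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /cart

Sitemap: https://www.example.com/sitemap.xml
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# site_maps() returns the URLs from all Sitemap lines, or None if there are none
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml']
```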
