No matter how mature SEO becomes, robots.txt remains relevant and will stay that way for a long time. Let’s explore what this file is and how to use it correctly.
What is robots.txt?
Robots.txt is a text file located in the root directory of a website that sets rules for search engine bots: which sections of the site they may crawl (scan) and, by extension, which sections can be indexed.
Why do you need robots.txt?
👉 It allows you to permit or forbid search engine bots to crawl certain folders or sections of your website, which in turn controls what gets indexed in the search engine.
Key commands in robots.txt
- User-agent: Specifies which search engine bot the rules below apply to. If a specific bot name is mentioned, the rules apply only to that bot. If "*" is used, the rules apply to all search bots.
- Disallow: Prohibits bots from crawling (and thus indexing) a specified section or folder of the site.
- Allow: Permits crawling of a specified section of the site, even inside an otherwise restricted folder.
- Sitemap: Specifies the path to the sitemap.
- Host: Defined the primary mirror (main domain) of the website. This directive was mainly used by Yandex and is now deprecated.
- Crawl-delay: Sets a pause between successive bot requests, often used to ease the load on large websites. Many sources consider it outdated (Google ignores it), but it is still sometimes applied as a legacy practice. A combined example of these directives is shown below.
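The domain, paths, and delay value in this illustrative file are placeholders, not recommendations:
User-agent: *
Disallow: /admin/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
This tells every bot to stay out of the "admin" folder, asks for a 10-second pause between requests (for bots that still honor Crawl-delay), and points to the sitemap.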
How to block your website from indexing via robots.txt?
User-agent: *
Disallow: /
This "/" means that everything starting from the root folder is blocked from being indexed.
If you want to block just a specific folder from indexing, you can specify it like this:
User-agent: *
Disallow: /admin
In this case, the entire site will be indexed except for the "admin" folder (more precisely, any URL whose path starts with "/admin", since the rule is a prefix match).
To block a specific pattern of URLs, you can use a wildcard (e.g., "*"):
User-agent: *
Disallow: /*parts_of_url*
This tells bots that any URL containing the specified pattern should not be indexed.
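For instance, a hypothetical rule to keep bots away from any URL that contains a session parameter could look like this:
User-agent: *
Disallow: /*sessionid=
Here the pattern matches any address containing "sessionid=", no matter which folder it appears in.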
How to allow website indexing via robots.txt?
User-agent: *
Disallow:
Leaving the Disallow value empty (no "/") allows bots to scan everything within the domain.
How to allow only specific folders to be indexed via robots.txt?
User-agent: *
Disallow: /
Allow: /admin
Here, the entire site is blocked from indexing except for the "admin" folder.
The "allow" command is usually used when you have a complex website structure where certain items within restricted folders need to be accessible to bots.
How to create robots.txt?
Robots.txt is typically created manually based on the following considerations:
👉 Analysis of the website root;
👉 Analysis of the website’s URL structure.
Usually, SEO specialists handle this when launching a website. If you don’t create a robots.txt file, your website will still be indexed by default.
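If you just need a safe starting point, a minimal permissive file is usually enough; the sitemap URL here is a placeholder for your own:
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml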
How to test robots.txt?
Testing your robots.txt file can be done with the dedicated tool in Google Search Console, which lets you check whether your directives behave as intended.
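If you want to sanity-check draft rules locally before publishing them, Python's standard urllib.robotparser module can parse them and answer allow/deny questions. This is only a quick optional sketch with placeholder rules and URLs, not a substitute for the Search Console check:
from urllib.robotparser import RobotFileParser

# A draft rule set to test, parsed directly (no network request needed)
draft_rules = [
    "User-agent: *",
    "Disallow: /admin/",
]

parser = RobotFileParser()
parser.parse(draft_rules)

# Ask whether any bot ("*") may fetch a given URL under these rules
print(parser.can_fetch("*", "https://example.com/admin/settings"))  # False
print(parser.can_fetch("*", "https://example.com/blog/post"))       # True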
It’s important to note that robots.txt directives are not always followed 100%, and blocking a URL in robots.txt does not guarantee it stays out of the index. For stronger control, you can also add a robots meta tag to the page’s HTML, like this:
<meta name="robots" content="noindex, nofollow">
The bot recognizes this tag as soon as it loads and scans the HTML, even if it never consults robots.txt. Keep in mind, though, that a bot can only see the tag if it is allowed to crawl the page, so a noindex tag will not work on a URL that is simultaneously blocked by a Disallow rule.
These are the most important aspects of working with robots.txt. It’s a crucial step in launching a project and performing a technical SEO audit, one that should not be overlooked. We hope this material has been helpful to you. Until next time!