In short, a Robots.txt file controls how search engines access your website.
This text file contains “directives” that tell search engines which pages they may access (“Allow”) and which pages they may not (“Disallow”).
Screenshot of our Robots.txt file
Adding the wrong directives here can negatively impact your rankings, as it can hinder search engines from crawling individual pages (or your entire website).
Robots are applications that “crawl” through websites, documenting (i.e. “indexing”) the information they cover.
With regard to the Robots.txt file, these robots are referred to as User-agents.
You may also hear them called:
These are not the official User-agent names of search engine crawlers. In other words, you would not “Disallow” a “Crawler”; you would need to use the official name of the search engine's crawler (the Google crawler is called “Googlebot”).
You can find a full list of web robots here.
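To see why the official name matters, here is a minimal sketch using the urllib.robotparser module from Python's standard library. The rules, the can_crawl helper, and the example.com URL are made up for illustration, and real crawlers may interpret a file somewhat differently than this parser does.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(rules, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules: one addressed to a nickname, one to the official name.
nickname_rules = "User-agent: Crawler\nDisallow: /\n"
official_rules = "User-agent: Googlebot\nDisallow: /\n"
page = "https://example.com/page.html"

# The nickname group does not apply to Googlebot, so nothing is blocked.
blocked_by_nickname = not can_crawl(nickname_rules, "Googlebot", page)

# The official name applies, so the whole site is off limits to that bot.
blocked_by_official = not can_crawl(official_rules, "Googlebot", page)
```

In this sketch only the group that names “Googlebot” has any effect on Google's crawler.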
Your Robots.txt file is a means to speak directly to search engine bots, giving them clear directives about which parts of your site you want crawled (or not crawled).
You need to understand the “syntax” used to create your Robots.txt file.
User-agent: State the name of the robot you are referring to (e.g. Google's or Yahoo's crawler). Again, you will want to refer to the full list of user-agents for help.
Disallow: If you want to block access to pages or a section of your website, state the URL path here.
Allow: If you want to unblock a URL path within a blocked parent directory, enter that URL subdirectory path here.
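As a rough illustration of how the three directives fit together, the sketch below feeds a small, made-up robots.txt to Python's urllib.robotparser. The paths, the is_allowed helper, and example.com are assumptions for the example. Note that this parser applies the first matching rule in a group, which is why the Allow line is listed before the Disallow line here; Google documents that it picks the most specific matching rule instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /private/ but unblock one page inside it.
rules = """\
User-agent: *
Allow: /private/press-kit.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def is_allowed(path, agent="Googlebot"):
    """Check a path on the (made-up) example.com site against the rules."""
    return parser.can_fetch(agent, "https://example.com" + path)


unblocked_page = is_allowed("/private/press-kit.html")  # matches the Allow line
blocked_page = is_allowed("/private/notes.html")        # matches the Disallow line
other_page = is_allowed("/blog/")                       # matches no rule at all
```

Paths that match no rule are treated as crawlable, so only the rest of /private/ is off limits.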
In short, you can use robots.txt to tell these crawlers, “Index these pages but don’t index these other ones.”
It may seem counterintuitive to “block” pages from search engines. There are a number of reasons to do so:
Directories are a good example.
You’d probably want to hide those that may contain sensitive data like:
Google has stated numerous times that it’s important to keep your website “pruned” of low-quality pages. Having a lot of garbage on your site can drag down performance.
Check out our content audit for more details.
You may want to exclude any pages that contain duplicate content. For example, if you offer “print versions” of some pages, you wouldn’t want Google to index duplicate versions as duplicate content could hurt your rankings.
However, keep in mind that people can still visit and link to these pages, so if the information is the type you don’t want others to see, you’ll need to use password protection to keep it private.
There are probably also some pages that contain sensitive information you don’t want to show on a SERP.
Robots.txt is actually fairly simple to use.
You literally tell robots which pages to “Allow” (which means they may crawl them) and which ones to “Disallow” (which they’ll skip).
You’ll use the latter only once to list the pages you don’t want spiders to crawl. The “Allow” command is only used when you want a page to be crawled, but its parent page is “Disallowed.”
Here’s what the robots.txt for my website looks like:
The initial user-agent command tells all web robots (i.e. *) – not just ones for specific search engines – that these instructions apply to them.
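A small sketch of the wildcard in action, again using Python's urllib.robotparser. The rules, the /checkout/ path, and two of the bot names are invented for the example; the point is only that a group addressed to * is applied to any robot that has no group of its own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group addressed to every robot via the * wildcard.
rules = """\
User-agent: *
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# One real bot name next to two made-up ones; none is named in the rules.
agents = ["Googlebot", "ExampleBot", "AnotherSpider"]
url = "https://example.com/checkout/cart.html"

# Each agent falls back to the * group, so each gets the same answer.
results = {agent: parser.can_fetch(agent, url) for agent in agents}
```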
First, you will need to write your directives into a text file.
Next, upload the text file to your site’s top-level directory. This is typically done via cPanel.
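If you would rather generate the file than type it by hand, the first step could be sketched like this in Python. The build_robots_txt and write_robots_txt helpers are hypothetical names, not part of any library, and the upload itself is left to whatever tool your host provides.

```python
from pathlib import Path


def build_robots_txt(user_agent, disallow=(), allow=()):
    """Render one robots.txt group as plain text."""
    lines = [f"User-agent: {user_agent}"]
    # Allow lines go first so first-match parsers see the exceptions early.
    lines += [f"Allow: {path}" for path in allow]
    lines += [f"Disallow: {path}" for path in disallow]
    return "\n".join(lines) + "\n"


def write_robots_txt(folder, text):
    """Save the directives as robots.txt inside the given local folder."""
    target = Path(folder) / "robots.txt"
    target.write_text(text, encoding="utf-8")
    return target
```

For example, build_robots_txt("*", disallow=["/print/"]) produces a two-line file that asks all robots to stay out of a /print/ section.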
Your live file will always come right after the “.com/” in your URL. Ours, for example, is located at https://webris.org/robots.txt.
If it were located at www.webris.com/blog/robots.txt, the crawlers wouldn’t even bother looking for it and none of its commands would be followed.
If you have subdomains, make sure they have their own robots.txt files as well. For example, our training.webris.org subdomain has its own set of directives. This is incredibly important to check when running SEO audits.
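The location rule is mechanical enough to express in a few lines of Python. The robots_url helper below is a hypothetical name and the page URLs are invented; it simply keeps the scheme and host of any page and swaps the path for /robots.txt, which also shows why each subdomain ends up with its own file.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop the path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


main_site = robots_url("https://webris.org/blog/some-post/")
subdomain = robots_url("https://training.webris.org/some-course/")
```

Here main_site points at the file on webris.org, while subdomain points at a separate file on training.webris.org.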
Google offers a free robots.txt tester tool that you can use to check your file.
It is located in Google Search Console under Crawl > Robots.txt Tester.
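For a quick local check alongside Google's tool, a script along these lines can list which URLs a given robot would be kept out of. It is only a sketch: blocked_urls is a hypothetical helper, the rules and URLs are invented, and Python's parser does not follow every rule the way Google's crawler does (for instance, it matches plain path prefixes and uses the first matching rule).

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_text, urls, agent="Googlebot"):
    """Return the URLs from `urls` that `agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Hypothetical rules and pages for the check.
sample_rules = """\
User-agent: *
Disallow: /print/
Disallow: /admin/
"""

pages = [
    "https://example.com/",
    "https://example.com/print/article-1.html",
    "https://example.com/blog/article-1.html",
    "https://example.com/admin/login",
]

blocked = blocked_urls(sample_rules, pages)
```

To test a live site instead of a string, RobotFileParser also offers set_url and read, which download the file before you call can_fetch.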
Now that you understand this important element of SEO, check your own site to ensure search engines are indexing the pages you want and ignoring those you wish to keep out of SERPs.
Going forward, you can continue using robots.txt to inform search engines how they are to crawl your site.