The purpose of a robots.txt file is to give guidance to website robots such as Googlebot about which parts of a website should be crawled and indexed and which parts should be excluded. This standard is known as the Robots Exclusion Protocol (REP); it is not a new concept and has been around since the early 1990s.
The basic idea of a robots.txt file is straightforward. By writing a plain text file we are able to instruct robots which particular areas of a website are off limits to some or all of them. The robots.txt file should be placed in the root directory of a website; this is where robots that adhere to the REP will look for it.
When a major search robot visits a website it will first look for the robots.txt file to see whether it is permitted to crawl the website, and for exclusions that prevent it from crawling certain parts of the site. Comments can be placed in the robots.txt file using the '#' character. Some examples of typical robots.txt files are listed below:
Example 1. Crawl everything. (This is the same as having no robots.txt file)
# Crawl everything
User-agent: *
Allow: /
Example 2. All robots excluded from crawling the entire site
# Crawl nothing
User-agent: *
Disallow: /
Example 3. All robots excluded from crawling a specific directory
# Exclude reports directory from being crawled
User-agent: *
Disallow: /reports/
Example 4. Exclude certain robots from crawling the entire site
# Exclude Googlebot from crawling
User-agent: Googlebot
Disallow: /
Example 5. Wildcards
# Block all query strings from being crawled
# (Google and Bing permit the use of wildcards)
User-agent: *
Disallow: /*?
Example 6. End of URL matching using $
# Block all php scripts from being crawled
User-agent: *
Disallow: /*.php$
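The wildcard and end-of-URL matching in the two examples above can be approximated by translating a rule into a regular expression. The sketch below illustrates the matching semantics only; it is not Google's or Bing's actual implementation, and the function name is our own.

```python
import re

def robots_rule_matches(rule: str, path: str) -> bool:
    """Return True if a Disallow rule using '*' and '$' wildcards matches a URL path."""
    anchored = rule.endswith("$")  # '$' pins the rule to the end of the URL
    if anchored:
        rule = rule[:-1]
    # Escape regex metacharacters, then turn the robots '*' into '.*'.
    pattern = re.escape(rule).replace(r"\*", ".*")
    pattern = "^" + pattern + ("$" if anchored else "")
    return re.match(pattern, path) is not None

print(robots_rule_matches("/*.php$", "/scripts/index.php"))  # True
print(robots_rule_matches("/*.php$", "/index.php?x=1"))      # False: '$' requires the URL to end in .php
print(robots_rule_matches("/*?", "/search?q=robots"))        # True
```

Note the interaction shown in the second call: because of the '$' anchor, a php script requested with a query string is not blocked by the /*.php$ rule.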
Example 7. Mixed directives
# Block bingbot from crawling the entire site
User-agent: bingbot
Disallow: /

# Block all php scripts and the reports directory
# from being crawled by any other robot
User-agent: *
Disallow: /*.php$
Disallow: /reports/
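Compliant robots apply rules like these before fetching a page. Python's standard-library urllib.robotparser can sanity-check literal path rules such as Example 3 (note it does not implement the Google/Bing '*' and '$' wildcard extensions, so wildcard rules cannot be tested this way):

```python
from urllib.robotparser import RobotFileParser

# Parse the rules from Example 3 directly; against a live site you would
# point set_url() at the site's /robots.txt and call read() instead.
rp = RobotFileParser()
rp.parse([
    "# Exclude reports directory from being crawled",
    "User-agent: *",
    "Disallow: /reports/",
])

print(rp.can_fetch("Googlebot", "http://example.com/reports/q1.html"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))       # True
```

Googlebot falls under the "User-agent: *" group here because no group names it explicitly, so the reports directory is off limits while the rest of the site may be crawled.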
robots.txt file considerations
- Robots can disregard a robots.txt file. This is particularly true of malware robots that scan the web for security vulnerabilities and of robots that harvest email addresses for spam purposes.
- The robots.txt file is publicly accessible, which means that anyone can see which areas of a website robots should not access. Information that should not be publicly searchable should be protected in a more secure manner, for example with password protection or by only allowing access to certain directories from a specific IP address or IP range. Better still, never host information on a web server that should not end up in the public domain.
- The robots.txt file name is case sensitive and must be in lower case; do not use upper case text such as 'ROBOTS.TXT'.
- If a page has already been indexed and is then blocked in the robots.txt file, it may not get dropped from the index. To ensure it is removed, allow the page to be crawled and add <meta name="robots" content="noindex,nofollow"> to it; a robot that is blocked from crawling the page can never see the meta tag.
- Each subdomain of a root domain needs its own robots.txt file.
- During website development, developers often exclude all search engines from crawling the site via a robots.txt file. These exclusions should be removed before a website is launched.
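A pre-launch check for the last point can reuse the same standard-library parser, since a leftover site-wide block is easy to detect. This is a minimal sketch, assuming the robots.txt contents are already in hand; the function name and the list of robots are illustrative choices.

```python
from urllib.robotparser import RobotFileParser

def major_bots_allowed(robots_txt: str) -> bool:
    """Return True if the major search robots may crawl the site root."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return all(
        rp.can_fetch(bot, "/")
        for bot in ("Googlebot", "bingbot", "DuckDuckBot")
    )

# A development-time block that must be removed before launch:
print(major_bots_allowed("User-agent: *\nDisallow: /"))  # False
print(major_bots_allowed("User-agent: *\nAllow: /"))     # True
```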
During each website audit Seopler checks that a robots.txt file exists and that it is not inadvertently blocking any of the major search engine robots. Sign up for one of our free or paid plans to test your robots.txt file.