Jay Harris is Cpt. LoadTest

a .net developers blog on improving user experience of humans and coders
Filed under: Blogging | SEO

You may have heard of Robots.txt. Or you may have seen requests for /robots.txt in your web traffic logs and, if the file doesn't exist, the resulting HTTP 404s. But what is this robots file, and what does it do?

Introduction to Robots.txt

Robots.txt is a file on a web server that directs robots (a.k.a. spiders or web crawlers) on which files and directories to ignore when indexing a site. The file lives in the root directory of the domain and is typically used to hide areas of a site from search engine indexing, either to keep a page off of Google's radar (such as my DasBlog login page) or because a page or image is not relevant to the traditional content of the site (maybe a mockup page for a CSS demo contains content about puppies, and you don't want to mislead a potential audience). Robots request this file prior to indexing your site, and its absence indicates that the robot is free to index the entire domain. Also, note that each sub-domain uses its own Robots.txt. When a spider is indexing msdn.microsoft.com, it won't look for the file on www.microsoft.com; MSDN needs its own copy of Robots.txt.
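For illustration, a crawler looks for the file at the root of whichever host it is indexing; the URLs below are only hypothetical examples of that lookup:

# Each sub-domain answers for its own Robots.txt
http://msdn.microsoft.com/robots.txt   (rules for msdn.microsoft.com only)
http://www.microsoft.com/robots.txt    (rules for www.microsoft.com only)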

How do I make a Robots.txt?

Robots.txt is a simple text file. You can create it in Notepad, Word, Emacs, DOS Edit, or your favorite text editor. Also, the file belongs in the root of the domain on your web server.

Allow all robots to access everything:

The most basic file authorizes all robots to index the entire site. The asterisk [*] for User-agent indicates that the rule applies to all robots, and leaving the value of Disallow blank rather than including a path disallows nothing, which effectively allows everything.

# Allow all robots to access everything
User-agent: *
Disallow:

Block all robots from accessing anything:

Conversely, with only one more character, we can invert the entire file and block everything. By setting Disallow to a root slash, every file and directory stemming from the root (in other words, the entire site) will be blocked from robot indexing.

# Block all robots from accessing anything
User-agent: *
Disallow: /

Allow all robots to index everything except scripts, logs, images, and that CSS demo on Puppies:

Disallow is a partial-match string; setting Disallow to "/image" would match both /images/ and /imageHtmlTagDemo.html. Disallow can also be included multiple times with different values to disallow a robot from multiple files and directories.

# Block all robots from accessing scripts, logs,
#    images, and that CSS demo on Puppies
User-agent: *
Disallow: /images/
Disallow: /logs/
Disallow: /scripts/
Disallow: /demos/cssDemo/puppies.html
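To make the partial-match behavior concrete, here is a minimal sketch; the /image prefix is only an illustration and is not part of the example above:

# Sketch: Disallow is a prefix match, so this single rule
#    blocks both /images/ and /imageHtmlTagDemo.html
User-agent: *
Disallow: /image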

Block all robots from accessing anything, except Google, which is only blocked from images:

Just as a browser has a user agent, so does a robot. For example, "Googlebot/2.1 (http://www.google.com/bot.html)" is one of the user agents for Google's indexer. Like Disallow, the User-agent value in Robots.txt is a partial-match string, so simply setting the value to "Googlebot" is sufficient for a match. Also, the User-agent and Disallow entries cascade, and the most specific User-agent match is the one that is recognized.

# Block all robots from accessing anything,
#    except Google, which is only blocked from images
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /images/
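As a rough walkthrough of how the cascade above is applied (assuming a robot obeys the most specific matching User-agent group):

# "Googlebot/2.1 (http://www.google.com/bot.html)" matches the more
#    specific Googlebot group, so Google is blocked only from /images/
# Every other robot falls back to the * group and is blocked from everything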

Shortcomings of Robots.txt

Similar to the Code of the Order of the Brethren, Robots.txt "is more what you'd call 'guidelines' than actual rules." Robots.txt is not a standardized protocol, nor is it a requirement. Only the "honorable" robots such as the Google or Yahoo search spiders adhere to the file's instructions; other less-honorable bots, such as a spam spider searching for email addresses, largely ignore the file.

Also, do not use the file for access control. Robots.txt is just a suggestion for search indexing, and will by no means block requests to a disallowed directory or file. These disallowed URLs are still freely available to anyone on the web. Additionally, the contents of this file can be used against you, as the items you place in it may indicate areas of the site that are intended to be secret or private; this information could be used to prioritize candidates for a malicious attack, with disallowed pages being the first places to target.

Finally, this file must be located in the root of the domain: www.mydomain.com/robots.txt. If your site is in a sub-folder of the domain, such as www.mydomain.com/~username/, the file must still be in the root of the domain, and you may need to speak with your webmaster to get your modifications added to the file.

Friday, 15 May 2009 09:31:37 (Eastern Daylight Time, UTC-04:00)

Tuesday, 06 October 2009 16:55:01 (Eastern Daylight Time, UTC-04:00)
Just as you would validate your HTML and CSS with the W3C Validation Service, it is also a good idea to use one of the many robots.txt validators to make sure that you have got the syntax correct in your robots.txt file.

I strongly agree that robots.txt should not be used as a form of access control. From a web security testing point of view, one of the first things we would check is whether there are any sensitive directories listed in the robots.txt file. Sometimes we find some very interesting things. :)

Cheers,
Stuart.