Robots.txt

From Hackepedia

If you're wondering how search engines get your website's information, it's through automated tools, or robots, called spiders. Once a spider discovers a new domain name, like onedaywebsite.ca, it asks the domain for a specific file called robots.txt. The robots.txt file has several purposes, most notably telling the robot what to exclude. This standard is known as the Robots Exclusion Protocol, or robots.txt protocol. Take a second right now to try https://www.google.com/robots.txt, and add /robots.txt to the root URL of your own website to see if the file exists. (Don't try a subdirectory like example.com/example/subdirectory/robots.txt; the file only lives at the root.)

If you don't yet have a robots.txt file, you can create one with just the following two lines:

User-agent: *
Disallow: 

This tells all robots visiting your website that they are welcome to visit every file on it. If you don't want any robots indexing your website on their search engines, simply add a / after Disallow:

User-agent: *
Disallow: /
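To sanity-check what these two policies mean in practice, here's a short sketch using Python's standard-library urllib.robotparser; the domain is the article's placeholder, and "mybot" is a hypothetical crawler name:

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow means everything is allowed.
allow_all = RobotFileParser()
allow_all.parse(["User-agent: *", "Disallow:"])
print(allow_all.can_fetch("mybot", "https://onedaywebsite.ca/page.html"))  # True

# Disallow: / blocks the whole site.
deny_all = RobotFileParser()
deny_all.parse(["User-agent: *", "Disallow: /"])
print(deny_all.can_fetch("mybot", "https://onedaywebsite.ca/page.html"))  # False
```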

Every robot identifies itself with a User-agent string, and you can set different rules for each one. For example,

User-agent: googlebot
Disallow: /private/

tells googlebot to stay out of your /private/ directory.
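Per-agent rules can be checked the same way with Python's urllib.robotparser; a sketch, with onedaywebsite.ca as the placeholder domain and "otherbot" as a hypothetical second crawler:

```python
from urllib.robotparser import RobotFileParser

# googlebot is barred from /private/; all other robots may crawl freely.
rp = RobotFileParser()
rp.parse([
    "User-agent: googlebot",
    "Disallow: /private/",
    "",
    "User-agent: *",
    "Disallow:",
])
print(rp.can_fetch("googlebot", "https://onedaywebsite.ca/private/a.html"))  # False
print(rp.can_fetch("otherbot", "https://onedaywebsite.ca/private/a.html"))   # True
```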

That's the basics. There are also a few nonstandard extensions. Allow makes an exemption from a Disallow rule; for example, to give robots access to one specific file inside an otherwise disallowed directory:

Allow: /private/onepublicfile.pdf
Disallow: /private/
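One caveat: crawlers differ on how Allow/Disallow conflicts are resolved (Google uses the most specific, i.e. longest, matching path, while Python's standard-library parser applies rules in file order), so listing the Allow line first, as above, is the safest ordering. A quick sketch, again with placeholder names:

```python
from urllib.robotparser import RobotFileParser

# The Allow line is listed first so order-sensitive parsers
# see it before the broader Disallow.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /private/onepublicfile.pdf",
    "Disallow: /private/",
])
print(rp.can_fetch("anybot", "https://onedaywebsite.ca/private/onepublicfile.pdf"))  # True
print(rp.can_fetch("anybot", "https://onedaywebsite.ca/private/secret.html"))        # False
```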

There's also the Sitemap directive, which points crawlers at your sitemap (the URL should be absolute):

Sitemap: https://www.onedaywebsite.ca/sitemap.xml
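Sitemap lines are also exposed by Python's urllib.robotparser, via site_maps() (available since Python 3.8); a sketch using the article's placeholder domain:

```python
from urllib.robotparser import RobotFileParser

# Sitemap lines are collected regardless of which User-agent
# section they appear in.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "Sitemap: https://www.onedaywebsite.ca/sitemap.xml",
])
print(rp.site_maps())  # ['https://www.onedaywebsite.ca/sitemap.xml']
```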

The Host directive lets websites with multiple mirrors specify their preferred domain:

Host: newsite.example.com


Now you can take another look at https://www.google.com/robots.txt and understand what's in the file. Take the time to build your robots.txt so there are no surprises about which of your files end up indexed on the popular search engines.

It's worth noting that there's no technical requirement to follow robots.txt; it's just commonly accepted behaviour among popular search engines. There's nothing stopping a bot, or a human, from looking at the files and directories listed under Disallow:, which is often where you can find interesting things! If you don't want your files or directories published, they should never be made available in the first place.

It's also worth consulting a list of popular robots when building your robots.txt file.

I wonder if there is a search engine that only hosts Disallow information?