Robots.txt Database

Robots.txt files tell search engines (and other robots) which parts of a website they may "crawl" and show in search results. If a website doesn't want certain content to show up in search results (from Google, for example), it can specify a set of rules telling search engines which content not to crawl. Adding a page to the robots.txt file does not guarantee it won't show up in search results, but it greatly decreases the likelihood, and the page's actual content will not appear there. See Moz's overview for more detail on what goes into a robots.txt file.
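
To make the rules concrete, here is a short, hypothetical robots.txt file; the paths and sitemap URL are invented for illustration and do not come from any real site:

    # Rules for all crawlers
    User-agent: *
    # Do not crawl anything under /internal/ or this specific report
    Disallow: /internal/
    Disallow: /reports/draft-report.pdf
    # Everything else may be crawled
    Allow: /
    Sitemap: https://www.example.gov/sitemap.xml

A crawler that honors the file skips the disallowed paths when indexing, though the rules are advisory and depend on the crawler's cooperation.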

Robots.txt files are public (try viewing the CDC's robots.txt, for example). This project has gathered robots.txt files from 9,000+ government websites and made them searchable. For example, to find all of the election-related rules, try searching "election" below. Each search result also includes a link to the Internet Archive's Wayback Machine; in some cases, you can find content there that has been removed from the live version of a website but was previously archived.
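
If you want to reproduce a lookup yourself, the sketch below (a minimal Python example, not the project's actual code) fetches a single robots.txt file, filters its rules by keyword, and builds a Wayback Machine link for that file. The requests dependency, function names, and example domain are assumptions made for illustration.

    import requests

    def search_robots(domain: str, keyword: str) -> list[str]:
        """Return lines from the domain's robots.txt that mention the keyword."""
        url = f"https://{domain}/robots.txt"
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return [
            line.strip()
            for line in response.text.splitlines()
            if keyword.lower() in line.lower()
        ]

    def wayback_url(domain: str) -> str:
        """Link to the Wayback Machine's capture history for the robots.txt file."""
        return f"https://web.archive.org/web/*/https://{domain}/robots.txt"

    if __name__ == "__main__":
        domain = "www.cdc.gov"  # example domain mentioned above
        for line in search_robots(domain, "pdf"):
            print(line)
        print("Archived copies:", wayback_url(domain))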

For more information on how the 9,000+ websites were collected, please see GitHub.

Example searches: covid, testimony, pdf, election