Google's Robots.txt Parser is Now Open Source (googleblog.com)

From a blog post: For 25 years, the Robots Exclusion Protocol (REP) was only a de-facto standard. This had frustrating implications sometimes. On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large? Today, we announced that we're spearheading the effort to make the REP an internet standard. While this is an important step, it means extra work for developers who parse robots.txt files.

We're here to help: we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 90's. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.
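
For the curious, here is a minimal sketch of checking a URL against a robots.txt file with the open-sourced library (https://github.com/google/robotstxt). It assumes the repository's robots.h header, its googlebot::RobotsMatcher class with the OneAgentAllowedByRobots method, and an Abseil-capable build setup; treat the details as illustrative rather than authoritative.

    // Minimal sketch: would "FooBot" be allowed to fetch this URL?
    #include <iostream>
    #include <string>

    #include "robots.h"  // from google/robotstxt; the library depends on Abseil

    int main() {
      // Contents of the site's robots.txt (hypothetical example).
      const std::string robots_txt =
          "User-agent: *\n"
          "Disallow: /private/\n";
      const std::string user_agent = "FooBot";
      const std::string url = "https://example.com/private/data.html";

      googlebot::RobotsMatcher matcher;
      const bool allowed =
          matcher.OneAgentAllowedByRobots(robots_txt, user_agent, url);
      std::cout << (allowed ? "allowed" : "disallowed") << std::endl;
      return 0;
    }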

Comments:
  • To make sure they get their spam indexed more easily, while other people will use it to put public data behind a paywall.
    • There are valid reasons to control which of your pages are indexed and which aren't. Granted, I don't rely much on robots.txt; I use authentication to prevent sites from hitting data I don't want to publicly share.
      However, if you have data that is meant to be transient and real-time, a Google search result would be out of date the moment the information was collected. And if you have some sort of usage statistics, you really don't want Google hitting your page and giving you false numbers. Then there are some

      • There are valid reasons to control which of your pages are indexed and which aren't.

        robots.txt does not control what is and is not indexed by Google.

        • by fred911 ( 83970 )

          'robots.txt does not control what is and is not indexed by Google.'

          Absolutely correct, it controls what should be crawled. The robots meta tag controls whether a page is indexed.

          For those who care to read: https://support.google.com/web... [google.com]

          • 'robots.txt does not control what is and is not indexed by Google.'

            Absolutely correct, it controls what should be crawled. The robots meta tag controls whether a page is indexed.

            The problem is that nobody but Google and a small minority of webmasters understand this. (A short illustration of the difference appears after the comments below.)

    • by Ksevio ( 865461 )
      How would it be exploited? It's already an open standard and Google provides tools to make indexing easier.
  • Everyone except Google believes robots.txt controls what search engines are allowed to index rather than simply what they crawl.

  • "...how should they deal with robots.txt files that are hundreds of megabytes large?"

    Holy crap, I never imagined that a robots.txt file could or would be "hundreds of megabytes in size".

    Maybe I need to get out more.
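
To illustrate the crawl-versus-index distinction raised in the comments above: a Disallow rule in robots.txt asks compliant crawlers not to fetch a path at all, while a robots meta tag (or an X-Robots-Tag HTTP header) on a page that is fetched asks search engines not to index it. A page blocked by robots.txt can still show up in search results if other sites link to it, because the crawler never gets to see any noindex directive. The path below is a hypothetical example:

    # robots.txt -- "please don't crawl this", not "please don't index this"
    User-agent: *
    Disallow: /private/

    <!-- in the HTML of a page that may be crawled but should not be indexed -->
    <meta name="robots" content="noindex">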
