Google's Robots.txt Parser is Now Open Source (googleblog.com)

From a blog post: For 25 years, the Robots Exclusion Protocol (REP) was only a de-facto standard. This had frustrating implications sometimes. On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large? Today, we announced that we're spearheading the effort to make the REP an internet standard. While this is an important step, it means extra work for developers who parse robots.txt files.

We're here to help: we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 90's. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.
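
For the curious, here is a minimal sketch of checking a URL against a robots.txt file with the open-sourced library (https://github.com/google/robotstxt). It assumes the repository's robots.h header, its googlebot::RobotsMatcher class with the OneAgentAllowedByRobots method, and an Abseil-capable build setup; treat the details as illustrative rather than authoritative.

    // Minimal sketch: would "FooBot" be allowed to fetch this URL?
    #include <iostream>
    #include <string>

    #include "robots.h"  // from google/robotstxt; the library depends on Abseil

    int main() {
      // Contents of the site's robots.txt (hypothetical example).
      const std::string robots_txt =
          "User-agent: *\n"
          "Disallow: /private/\n";
      const std::string user_agent = "FooBot";
      const std::string url = "https://example.com/private/data.html";

      googlebot::RobotsMatcher matcher;
      const bool allowed =
          matcher.OneAgentAllowedByRobots(robots_txt, user_agent, url);
      std::cout << (allowed ? "allowed" : "disallowed") << std::endl;
      return 0;
    }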

Comments:
  • To make sure they get their spam indexed more easily, while other people will use it to put public data behind a paywall.
    • There are valid reasons to control which of your pages are indexed and which aren't. Granted, I don't rely much on robots.txt; I use authentication to prevent sites from hitting data I don't want to publicly share.
      However, if you have data that is meant to be transient and real-time, a Google search result would be out of date the moment the information was collected. And if you have some sort of usage statistics, you really don't want Google hitting your page and giving you false numbers. Then there are some

      • There are valid reasons to control which of your pages are indexed and which aren't.

        robots.txt does not control what is and is not indexed by Google.

        • by fred911 ( 83970 )

          'robots.txt does not control what is and is not indexed by Google.'

          Absolutely correct, it controls what should be crawled. The robots meta tag controls whether a page is indexed.

          For those who care to read: https://support.google.com/web... [google.com]

          • 'robots.txt does not control what is and is not indexed by Google.'

            Absolutely correct, it controls what should be crawled. The robots meta tag controls whether a page is indexed.

            The problem is that nobody but Google and a small minority of webmasters understand this. (A short illustration of the difference appears after the comments below.)

    • by Ksevio ( 865461 )
      How would it be exploited? It's already an open standard and Google provides tools to make indexing easier.
  • Everyone except Google believes robots.txt controls what search engines are allowed to index rather than simply what they crawl.

  • "...how should they deal with robots.txt files that are hundreds of megabytes large?"

    Holy crap, I never imagined that a robots.txt file could or would be "hundreds of megabytes in size".

    Maybe I need to get out more.
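
To illustrate the crawl-versus-index distinction raised in the comments above: a Disallow rule in robots.txt asks compliant crawlers not to fetch a path at all, while a robots meta tag (or an X-Robots-Tag HTTP header) on a page that is fetched asks search engines not to index it. A page blocked by robots.txt can still show up in search results if other sites link to it, because the crawler never gets to see any noindex directive. The path below is a hypothetical example:

    # robots.txt -- "please don't crawl this", not "please don't index this"
    User-agent: *
    Disallow: /private/

    <!-- in the HTML of a page that may be crawled but should not be indexed -->
    <meta name="robots" content="noindex">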
