Google's Robots.txt Parser is Now Open Source (googleblog.com) 32
From a blog post: For 25 years, the Robots Exclusion Protocol (REP) was only a de-facto standard. This had frustrating implications sometimes. On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large? Today, we announced that we're spearheading the effort to make the REP an internet standard. While this is an important step, it means extra work for developers who parse robots.txt files.
We're here to help: we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 90's. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.
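The two corner cases the post mentions (a leading BOM and very large files) can be sketched with Python's standard `urllib.robotparser`. This is only an illustration, not how Google's C++ library works: the `load_rules` helper and the 500 KiB cap are assumptions made for this sketch.

```python
import codecs
from urllib.robotparser import RobotFileParser

# Assumed cap for this sketch; a real crawler would pick its own limit.
MAX_ROBOTS_BYTES = 500 * 1024


def load_rules(raw: bytes) -> RobotFileParser:
    """Parse raw robots.txt bytes, tolerating a UTF-8 BOM and oversized input."""
    body = raw[:MAX_ROBOTS_BYTES]
    if len(raw) > MAX_ROBOTS_BYTES:
        # Drop the trailing partial line so a cut-off rule is not misread.
        last_newline = body.rfind(b"\n")
        body = body[: last_newline + 1] if last_newline != -1 else b""
    if body.startswith(codecs.BOM_UTF8):
        # Without this, the BOM sticks to the first "User-agent" key
        # and that line is no longer recognised.
        body = body[len(codecs.BOM_UTF8):]
    text = body.decode("utf-8", errors="replace")
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /private/\n")
blocked = rules.can_fetch("examplebot", "https://example.com/private/page")
allowed = rules.can_fetch("examplebot", "https://example.com/public")
```

Truncating at a line boundary is one possible policy; ignoring an oversized file entirely would be another.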
They replaced it with the Rust version (Score:5, Funny)
it runs 69 times faster.
Re: They replaced it with the Rust version (Score:1, Interesting)
That comment is referring to this recent /. submission [slashdot.org].
Frankly, I don't see why Rust gets so much ridicule, hate and downmodding here at /., especially with it being a community driven open source project.
It's clear that Rust is an innovative language that is pushing the boundaries of security and performance.
Maybe the Perl, PHP and C++ fanatics here feel threatened by it. If you know only one programming language, then I can see how something like Rust can be scary. It's like when horse carriage drivers fi
Re: (Score:1)
I don't have any problem with Rust (quite the opposite, I'm very interested in seeing how languages like Rust and Go affect software development going forward), but regarding that submission specifically, Brave's performance increases seem to largely have come from rewriting their adblocker to handle things in a uBlock Origin manner rather than an ABP manner.
Don't explain the joke (Score:1)
The problem with rust is not so much the language, but everything else. The boasting, the "let's rewrite $whatever in rust!", the evangelising, the claims that don't hold water but the claimants cannot even (or just won't) understand why not even if you explain it in small words, and worst of all, the strongly SJW-flavoured toxic community complete with idiot "inclusive" "code of conduct". That claim of rust making an adblocker faster is plain false: The kicker is the replacement algorithm. But of course it
Re: (Score:2)
This will be exploited by SEO Spammers (Score:1)
Re: (Score:2)
There are valid reasons to control which of your pages are indexed and which aren't. Granted, I don't rely much on robots.txt but use authentication to prevent sites from hitting data I don't want to publicly share.
However, if you have data that is meant to be transient and real time, a Google Search Result would be out of date the moment it collected the information. Then if you have some sort of usage statistics, you really don't want Google hitting your page to give you false numbers. Then there are some
Re: (Score:2)
There are valid reasons to control which of your pages are indexed and which aren't.
robots.txt does not control what is and is not indexed by Google.
Re: (Score:2)
'robots.txt does not control what is and is not indexed by Google.'
Absolutely correct, it controls what should be crawled. The robots meta tag controls whether a page is indexed.
For those who care to read: https://support.google.com/web... [google.com]
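The crawl-versus-index distinction can be sketched as two separate checks using only Python's standard library. The helper names `may_crawl` and `may_index` are hypothetical, and this is a rough illustration rather than how any search engine actually implements it.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def may_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """robots.txt answers: may this agent fetch the URL at all?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def may_index(html: str) -> bool:
    """The robots meta tag answers: may the fetched page be indexed?"""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


crawl_ok = may_crawl("User-agent: *\nDisallow: /tmp/\n", "examplebot",
                     "https://example.com/tmp/report")
index_ok = may_index('<head><meta name="robots" content="noindex, nofollow"></head>')
```

Note that the second check only works on a page that was fetched: a crawler that obeys a Disallow rule never sees the page's meta tag.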
Re: (Score:2)
'robots.txt does not control what is and is not indexed by Google.'
Absolutely correct, it controls what should be crawled. The robots meta tag controls whether a page is indexed.
The problem is that nobody but Google and a small minority of webmasters understands this.
Re: (Score:2)
robots.txt is a scam (Score:2)
Everyone except Google believes robots.txt controls what search engines are allowed to index rather than simply crawl.
What?? (Score:2)
"...how should they deal with robots.txt files that are hundreds of megabytes large?"
Holy crap, I never imagined that a robots.txt file could or would be "hundreds of megabytes in size".
Maybe I need to get out more.