Huge Site Ranking Dataset Donated to the Common Crawl Foundation

Posted by Unknown Lamer
from the fuzzy-feelings dept.
Greg Lindahl writes "blekko is donating search engine ranking data for 140 million domains and 22 billion URLs to the Common Crawl Foundation. Common Crawl is a non-profit dedicated to making the greatest (yet messiest) dataset of our time, the web, available to everyone, including tinkerers, hackers, activists, and new companies. blekko's ranking data will initially be used to improve the quality of Common Crawl's 8-billion-webpage public crawl of the web, and eventually will be directly available to the public."

  • by plover (150551) on Wednesday December 19, 2012 @10:01AM (#42336455) Homepage Journal

    I didn't realize the web wasn't available to everyone, including tinkerers, hackers, activists, and new companies. Thank $(DEITY) the Common Crawlers are here to make sure that my port 80 hasn't yet been pried from my cold, dead fingers.

    • by Nyder (754090)

      I didn't realize the web wasn't available to everyone, including tinkerers, hackers, activists, and new companies. Thank $(DEITY) the Common Crawlers are here to make sure that my port 80 hasn't yet been pried from my cold, dead fingers.

      I have no idea what is going on here. I am stoned (I live in Washington State, it's like law or something) so I'll admit that maybe I'm not in the right state for thinking. (I did a pun there, sorry)

      I did actually go to the web site, saw they were hiring, and read the FAQ.

      Still have no idea why they are doing what they are doing. Hoping someone will explain the purpose of Common Crawlers in terms I can understand and maybe a car analogy.

      • It is a data set about the internet. This saves you from having to crawl the web, analyze it, and build your own database.
      • Re:I didn't know (Score:5, Informative)

        by L1s4 (2798519) on Wednesday December 19, 2012 @10:53AM (#42336957)
        The idea is to give everyone access to crawl data. If you work at a large search company, you have access to crawl data. You can also set up crawlers to get the data yourself, but that is expensive, and having countless crawlers doing duplicative work is not ideal. Our idea is that there should be one common repository for crawl data that anyone can use. Researchers are using it for NLP, IR, sentiment analysis, and many other things, such as measuring the adoption of metadata formats (http://www.webdatacommons.org/). Educators are using it as a real-world dataset to teach big data techniques in the classroom. Developers and entrepreneurs are using it for startups.

        Sorry I don't have a car analogy :) Feel free to email me if you have any other questions: lisa at commoncrawl dot org
        • by Nyder (754090)

          The idea is to give everyone access to crawl data. If you work at a large search company, you have access to crawl data. You can also set up crawlers to get the data yourself, but that is expensive, and having countless crawlers doing duplicative work is not ideal. Our idea is that there should be one common repository for crawl data that anyone can use. Researchers are using it for NLP, IR, sentiment analysis, and many other things, such as measuring the adoption of metadata formats (http://www.webdatacommons.org/). Educators are using it as a real-world dataset to teach big data techniques in the classroom. Developers and entrepreneurs are using it for startups.

          Sorry I don't have a car analogy :) Feel free to email me if you have any other questions: lisa at commoncrawl dot org

          Thanks, that explains it better.

      • by plover (150551)

        I was actually mocking the Slashdot story editor for claiming they were providing a copy of the web instead of providing a copy of web metadata obtained by crawling.

        Here's your car analogy: Web crawling is like a guy driving down every street in town, taking a picture of every vehicle he sees, and analyzing the pictures to figure out make, model, year, license plate number, etc. The guy can either sell the information or publish it freely under a Creative Commons license.

  • .... it's better than being judged.