

Common Crawl Foundation Providing Data For Search Researchers
mikejuk writes with an excerpt from an article in I Programmer: "If you have ever thought that you could do a better job than Google but were intimidated by the hardware needed to build a web index, then the Common Crawl Foundation has a solution for you. It has indexed 5 billion web pages, placed the results on Amazon EC2/S3, and invites you to make use of it for free. All you have to do is set up your own Amazon EC2 Hadoop cluster and pay for the time you use; accessing the data itself is free. The idea is to open up the whole area of web search to experiment and innovation. So if you want to challenge Google, you can no longer use the excuse that you can't afford it."
Their weblog promises source code for everything eventually. One thing I've always wondered is why no distributed crawlers or search engines have ever come about.
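For anyone wondering what "accessing the data is free" looks like in practice, here is a minimal sketch of pulling the crawl archives straight from the public S3 bucket with boto3. The bucket name and key layout below are assumptions (they have changed over the years), so check the foundation's documentation for the crawl you actually want before running anything.

```python
# Minimal sketch of reading Common Crawl data directly from S3.
# BUCKET and PREFIX are assumptions -- adjust them to whatever the
# foundation documents for the crawl you are interested in.
import gzip

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "commoncrawl"   # assumed name of the public bucket
PREFIX = "crawl-data/"   # assumed key layout for the archive files

# Anonymous client: reading the data is free, you only pay for the
# EC2 time and transfer you use while processing it.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a handful of archive files under the prefix.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Stream the first archive and peek at the gzipped record stream.
if resp.get("Contents"):
    key = resp["Contents"][0]["Key"]
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
    with gzip.GzipFile(fileobj=body) as fh:
        print(fh.read(200))  # first few hundred bytes of the records
```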
Re:Interesting, however (Score:5, Insightful)
Crawling that many web pages may or may not be a small part of the problem, but it isn't a small problem in itself. This likely saves people a lot of time and effort, which they can then devote to their own research.
Maybe it will cost a fortune to analyze that much data, but there isn't really any way of getting around the cost if you need that much data. Besides, for what it's worth, the linked article suggests that a Hadoop run against the data costs about $100. I'm sure the real cost depends on the extent and efficiency of your analysis, but that is hardly "a fortune."
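To put that ~$100 figure in context, a job at that scale is typically a single pass over the archives that counts or classifies pages. Below is a minimal sketch, as a Hadoop Streaming mapper/reducer in Python, that tallies pages per host. The one-record-per-line input is an assumption for illustration; the real archives are gzipped ARC files that need their own record reader, and the actual bill depends on how many instance-hours the pass takes.

```python
#!/usr/bin/env python
# Sketch of the kind of Hadoop Streaming job priced at roughly $100:
# a mapper/reducer pair that tallies pages per host. It assumes each
# input line starts with a URL (the real archives need a custom input
# format), so treat this as an illustration, not a drop-in job.
import sys
from urllib.parse import urlparse


def mapper():
    """Emit 'host<TAB>1' for every input line that starts with a URL."""
    for line in sys.stdin:
        url = line.split("\t", 1)[0].strip()
        host = urlparse(url).netloc
        if host:
            print(f"{host}\t1")


def reducer():
    """Sum counts per host (Hadoop delivers input sorted by key)."""
    current, total = None, 0
    for line in sys.stdin:
        host, _, count = line.rstrip("\n").partition("\t")
        if host != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = host, 0
        total += int(count or 0)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()
```

You would ship the same script as both the -mapper and -reducer of a Hadoop Streaming job (or run it on Elastic MapReduce) and pay only for the cluster time it consumes, which is where an estimate like $100 comes from.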
Re:Saves you on bandwidth (Score:5, Insightful)
Bitch moan, bitch moan. If I had a need for such a dataset, I think I'd be damn grateful that I didn't have to collect it myself. As for the cost of processing the pages, the article suggests that running a Hadoop job on the whole dataset on EC2 might be in the neighborhood of $100. That's not that costly.