Common Crawl Foundation Providing Data For Search Researchers
mikejuk writes with an excerpt from an article in I Programmer: "If you have ever thought that you could do a better job than Google but were intimidated by the hardware needed to build a web index, then the Common Crawl Foundation has a solution for you. It has indexed 5 billion web pages, placed the results on Amazon EC2/S3 and invites you to make use of it for free. All you have to do is set up your own Amazon EC2 Hadoop cluster and pay for the time you use it — accessing the data is free. The idea is to open up the whole area of web search to experiment and innovation. So if you want to challenge Google, now you can't use the excuse that you can't afford it."
Their weblog promises source code for everything eventually. One thing I've always wondered is why no distributed crawlers or search engines have ever come about.
Saves you on bandwidth (Score:3, Informative)
Re: (Score:1)
Re:Saves you on bandwidth (Score:5, Insightful)
Bitch moan, bitch moan. If I had a need for such a dataset, I think I'd be damn grateful that I didn't have to collect it myself. As for the cost of processing the pages, the article suggests that running a Hadoop job on the whole dataset on EC2 might be in the neighborhood of $100. That's not that costly.
Re: (Score:3, Interesting)
Won't perform for that sort of money (Score:3)
Go build your own processing cluster and see whether you can do it for less than what EC2 would charge. Once you're finished, you could make a business out of it and compete with Amazon.
Re: (Score:3)
If you're an academic, running a single Hadoop job like that is not as useful as it sounds. In research, you never know what you want until you do something and realize that's not it. To write a paper you'd want to run at least 10-20 full jobs, all slightly different.
Luckily, lots of unis have their own clusters (aka beowulfs - I can't believe I have to point that out on slashdot...). It would really be great if the data could be duplicated so people could run their jobs on their own local setups.
Re: (Score:3)
No, actually, you can't.
Re: (Score:3)
I wonder how big of a torrent file that would make....
Is this an Amazon sponsor thingy? (Score:2)
I mean, hosting the stuff on Amazon's servers is one thing - it's got to be hosted somewhere - but what makes me uncomfortable is that anyone who wants to do research on the data ends up having to pay Amazon.
Hmm ....
Re: (Score:3)
Re: (Score:1)
Re: (Score:3)
A conspiracy? You're going to have to pay someone for the compute time. It's not like most people have big clusters lying around, so plenty of them are going to opt to pay Amazon anyway.
As for selling access to the data on physical media, it doesn't look like there is anything to stop you from taking advantage of Amazon's Export Service to get the data set on physical media.
Re: (Score:2)
Must be a conspiracy set up by Amazon to get people to pay for vast amounts of compute time. Why not allow people to purchase copies of the data on hard disk or tape? 5 billion pages, at 100K each (a high estimate perhaps), is 500 TB. Compressed with a good algorithm, you could probably get that down to well under 100 TB. Not "that much" if this is the kind of research you are interested in.
How much would that tape and tape drive or hard disk cost you to get started? How would that cost compare with the initial 750 hours of free compute time on EC2?
Re: (Score:2)
I don't get it. You are going to have to pay someone if you want to do any research on it. If you don't want to pay Amazon you could either crawl the data yourself, or pay the cost of transferring the data out of Amazon's cloud.
Re: (Score:1)
Re: (Score:3)
I mean, hosting the stuff on Amazon's servers is one thing - it's got to be hosted somewhere - but what makes me uncomfortable is that anyone who wants to do research on the data ends up having to pay Amazon.
Hmm ....
So you expect the researchers to FedEx 100,000 2TB hard drives to you on request? We're talking about 200 petabytes of data here. It's gonna take forever to transfer no matter how wide your intertubes are. A shipping container of hard drives is literally the only way to move this much data in a timely manner.
Since there's no easy way to move the data, it only makes sense to run your code on the cluster where the data currently resides.
Interesting, however (Score:4, Interesting)
Re:Interesting, however (Score:5, Insightful)
It may or may not be a small part of the problem, but it isn't a small problem to crawl that many web pages. This likely lets people save a lot of time and effort which they can then devote to their unique research.
Maybe it will cost a fortune to analyze that much data, but there isn't really any way of getting around the cost if you need that much data. Besides, for what it's worth, the linked article suggests that a Hadoop run against the data costs about $100. I'm sure the real cost depends on the extent and efficiency of your analysis, but that is hardly "a fortune."
Re: (Score:3)
It may or may not be a small part of the problem, but it isn't a small problem to crawl that many web pages.
Indeed, and there are more crawlers on the net than might be commonly supposed. Our home site is regularly visited by bots from Google, Bing, and Yandex, and occasionally by several others. The entire site (tens of GB) was recently slurped in a single visit by an unknown bot at an EC2 IP address. That bot's [botsvsbrowsers.com] user-agent string was not the same as the one used by the Common Crawl Foundation's bot.
Re: (Score:2)
Re: (Score:1)
Which is why Google sucks. Nobody willing to compete except Microsoft. And well... that really isn't going to bring competition to the market.
It should be obvious (Score:5, Interesting)
Because being 'distributed' is not a magic wand. (Nor is 'crowdsourcing', nor 'open source', nor half a dozen other terms often used as buzzwords in defiance of their actual (technical) meanings.) You still need substantial bandwidth and processing power to handle the index; being distributed just makes the problems worse, as now you also need bandwidth and processing power to coordinate the nodes.
Re:It should be obvious (Score:4, Informative)
Except the editor is wrong, since distributed search engines do exist [wikimedia.org].
Re: (Score:2)
Re: (Score:2)
also, check out YaCy - yacy.de
very powerful, decentralised, open source. Excellent. I ran a node for a while on my VPS; the results were good, although it struggled on 384MB of RAM.
Fix GOOG's braindead pageranking system (Score:4, Interesting)
An essential improvement is coming up with a way to identify and rank by actual information content. No, I have no idea how to do that. I'm just a biologist, struggling with plain old "I." AI is beyond me.
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
Surely it would be possible to tweak the algorithm so outbound links don't detract from the site, and keep things mathematically sound?
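For anyone wanting to experiment with the parent's tweak, the textbook algorithm is small enough to prototype directly. Here is a minimal power-iteration PageRank in Python — a sketch of the version in the original Brin/Page paper, not whatever Google runs today. Note that in this formulation a page's own score comes only from its inbound links, so adding outbound links just splits the rank it passes along:

```python
# Minimal PageRank power iteration (textbook sketch, not Google's system).
# graph maps each node to its list of outbound links; damping is 0.85 as in
# the original paper.
def pagerank(graph, damping=0.85, iters=50):
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iters):
        new = {node: (1 - damping) / n for node in graph}
        for node, outs in graph.items():
            if not outs:  # dangling node: spread its rank evenly
                for other in new:
                    new[other] += damping * rank[node] / n
            else:
                share = damping * rank[node] / len(outs)
                for out in outs:
                    new[out] += share
        rank = new
    return rank

# Tiny example: two pages linking to each other end up with equal rank.
ranks = pagerank({"a": ["b"], "b": ["a"]})
```

Any "penalty" for outbound links would therefore come from heuristics layered on top of this core, not from the math itself.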
Re: (Score:1)
It's got a long way to go. (Score:1)
Wait, what? (Score:5, Interesting)
It currently consists of an index of 5 billion web pages, their page rank, their link graphs and other metadata, all hosted on Amazon EC2.
The crawl is collated using a MapReduce process, compressed into 100Mbyte ARC files which are then uploaded to S3 storage buckets for you to access. Currently there are between 40,000 and 50,000 filled buckets waiting for you to search.
Each S3 storage bucket is 5TB. [amazon.com]
5TB * 40,000 / 5 billion = 42MB/web page
Either they made a typo, my math is wrong, or they started crawling the HD porn sites first. I really hope it's not the latter because 200 petabytes of porn will be the death of so many geeks that the year of Linux on the desktop might never come.
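For what it's worth, the parent's arithmetic holds up in binary units — *if* you assume every bucket really is full at the 5TB cap, which turns out to be the faulty premise:

```python
# Sanity-check the parent's per-page figure, assuming ~40,000 S3 buckets
# each completely full at the 5 TiB cap (an assumption, not a measurement).
buckets = 40_000
bucket_bytes = 5 * 2**40      # 5 TiB per bucket
pages = 5_000_000_000
per_page = buckets * bucket_bytes / pages
print(per_page / 2**20)       # roughly 42 MiB per page, matching the parent
```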
Re: (Score:2)
42 MB is not really that big for a "modern" webpage. People put a lot of images on their web pages these days. Add Flash apps or forums to that, and many sites get quite big. Text-only pages exist mainly in the realm of geeks. When you include sites like IBM, Apple, HP, Dell, etc., you're getting GBs of data.
Re:Wait, what? (Score:4, Informative)
for a modern "website" 42MB isn't large... but for any single "webpage" it is quite large and not common - even with tons of images
Re: (Score:3)
Re: (Score:1)
200 petabytes of porn
We need a mascot for such an invaluable resource. I vote we call it Petabear
Re: (Score:1)
Re: (Score:2)
....so how much would it cost (dollars) to run a single map-reduce word-count against that?
Also, why not do the torrent thing? E.g. 100GB torrent dumps, with more uploaded on a regular basis?
Re: (Score:2)
Because:
a) They'd have to pay to seed it
b) The data changes frequently (it is a web crawler after all)
c) Not everyone has the servers necessary to process that much data, while anyone can use Hadoop on Amazon
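As for the grandparent's word-count question: that's the canonical Hadoop Streaming job, where the mapper and reducer can be plain scripts. A sketch in Python (the file name `wc.py` and the local pipe test are illustrative; the actual EC2 cost depends on instance type and hours, which the article pegs at very roughly $100 for a full pass):

```python
#!/usr/bin/env python
# Hadoop Streaming word count: the mapper emits "word<TAB>1" pairs and the
# reducer sums counts per word, relying on Hadoop's sort phase to group
# identical keys together before they reach the reducer.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reducer(lines):
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__" and len(sys.argv) > 1:
    step = sys.argv[1]  # "map" or "reduce"
    for out in (mapper if step == "map" else reducer)(sys.stdin):
        print(out)
```

Locally you can simulate the whole pipeline with `cat pages.txt | python wc.py map | sort | python wc.py reduce`; on EC2 you'd pass the two modes as the -mapper and -reducer arguments to the streaming jar.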
Re: (Score:1)
Hi, sorry, there is a typo on the CC website. There are currently 323,694 items in the current Common Crawl bucket (commoncrawl-002), and each file is very close to 100MB in size (the total bucket size is 32.3 TB). There are also another 132,133 items in our older bucket, which we will be moving over to the current bucket shortly.
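For the curious, those ~100MB files are in the Internet Archive's ARC container format, gzip-compressed. A rough sketch of walking the records in one file — the filename, and the assumption of v1-style space-separated header lines ending in a byte count, are mine; check the ARC spec and Common Crawl's own docs before relying on this:

```python
import gzip

def iter_arc_records(path):
    # Yield (header_fields, body_bytes) from a gzipped ARC file.
    # Assumes each record starts with a header line of the form
    #   URL IP-address archive-date content-type length
    # followed immediately by `length` bytes of content.
    with gzip.open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break                  # end of file
            header = line.strip()
            if not header:
                continue               # blank separator between records
            fields = header.split(b" ")
            length = int(fields[-1])   # last header field is body length
            body = f.read(length)
            yield fields, body

# e.g. tally content types in one chunk (filename is hypothetical):
# from collections import Counter
# print(Counter(f[3] for f, _ in iter_arc_records("chunk.arc.gz")))
```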
Re: (Score:2)
Re: (Score:1)
Re: (Score:1)