MapReduce For the Masses With Common Crawl Data

New submitter happyscientist writes "This is a nice 'Hello World' for using Hadoop MapReduce on Common Crawl data. I was interested when Common Crawl announced themselves a few weeks ago, but I was hesitant to dive in. This is a good video/example that makes it clear how easy it is to start playing with the crawl data."
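For readers who haven't touched MapReduce before, the model itself is simple even though Hadoop's machinery is not. Below is a minimal conceptual sketch in plain Python (not the Hadoop API, and not the Common Crawl example from the video) showing the three phases — map, shuffle, reduce — with a word count, the canonical "Hello World" of MapReduce:

```python
from collections import defaultdict

def mapper(record):
    # Map phase: emit (key, value) pairs — here, (word, 1) per word
    for word in record.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce phase: combine all values for one key
    return (key, sum(values))

def mapreduce(records):
    # Shuffle phase: group mapper output by key before reducing
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

pages = ["hello world", "hello common crawl"]
print(mapreduce(pages))
# → {'hello': 2, 'world': 1, 'common': 1, 'crawl': 1}
```

In Hadoop you supply only the mapper and reducer; the framework handles the shuffle, distribution, and fault tolerance across the cluster — which is what makes the same two functions scale to billions of crawled pages.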
  • by Anonymous Coward on Monday December 19, 2011 @01:48AM (#38421662)

    more than 50% of any given sentence sounds like gibberish. And yet you know someone somewhere is as excited as you were when you got your first floppy drive...

  • Regarding crawling (Score:3, Interesting)

    by gajop ( 1285284 ) on Monday December 19, 2011 @06:16AM (#38422562)

    Hmm, since this is a similar topic, I'll ask a question of a personal nature.

    I recently wrote a crawler to collect certain information from a website, to help me gather data sets for a small machine learning project.
    While I followed robots.txt and respected nofollow links, the site's TOU prohibited it. When I checked with the admin, I was told that gathering the information is not allowed, as the site owns it (as stated in the TOU).

    The data, however, is publicly available, so you wouldn't actually have to agree to the TOU to collect it. Since it's data I wanted, I concluded I should at least grab a small sample (less than 1% of the total, around 200MB) to see whether anything useful could be done with it.

    What are your thoughts, /.? Should I have abandoned the attempt, did I do the right thing, or should I disregard their plea and simply collect as much as I please (spread over a long period, so as not to hammer their bandwidth)?
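    For what it's worth, the robots.txt side of what the parent describes can be checked mechanically with Python's standard library. A small sketch (the robots.txt content and the `MyBot` user agent are made up for illustration):

    ```python
    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt content; in practice you would fetch
    # http://example.com/robots.txt before crawling the site.
    robots_txt = """\
    User-agent: *
    Disallow: /private/
    """

    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())

    # Check each URL against the rules before fetching it
    print(rp.can_fetch("MyBot", "http://example.com/data/page1.html"))     # → True
    print(rp.can_fetch("MyBot", "http://example.com/private/page2.html"))  # → False
    ```

    Of course, robots.txt compliance is a technical courtesy, not a legal answer — it says nothing about whether the TOU is enforceable against someone who never agreed to it.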
