MapReduce For the Masses With Common Crawl Data

Posted by timothy
from the gotta-be-here-some-place dept.
New submitter happyscientist writes "This is a nice 'Hello World' for using Hadoop MapReduce on Common Crawl data. I was interested when Common Crawl announced themselves a few weeks ago, but I was hesitant to dive in. This is a good video/example that makes it clear how easy it is to start playing with the crawl data."

  • by Anonymous Coward on Monday December 19, 2011 @01:48AM (#38421662)

    More than 50% of any given sentence sounds like gibberish. And yet you know someone somewhere is as excited as you were when you got your first floppy drive...

  • Regarding crawling (Score:3, Interesting)

    by gajop (1285284) on Monday December 19, 2011 @06:16AM (#38422562)

    Hmm, similar article, so I'll ask a question of a personal nature.

    I recently created a crawler to collect certain information from a website, to help me gather data sets for a small machine learning project.
    While I followed robots.txt and respected nofollow links, the site's TOU was against it. After checking with the admin, I was told that gathering the information is not allowed, as the site owns it (as written in the TOU).

    The data, however, is publicly available, so you wouldn't actually have to agree to the TOU to collect it. Since it's data I wanted, I still concluded I should get at least a small sample (less than 1% of the total data, around 200MB) to see whether anything can even be done with it.

    What are your thoughts, /.? Should I have abandoned the attempt, did I do the right thing, or should I even disregard their plea and simply get as much as I please (over a long period of time, so as not to hammer its bandwidth)?
