Software Supercomputing The Internet

Brown Dog: a Search Engine For the Other 99 Percent (of Data)

aarondubrow writes: We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata — the critical data about the data, such as when and how and by whom it was produced — is nonexistent. Led by Kenton McHenry, a team at the National Center for Supercomputing Applications is working to change that. Recipients in 2013 of a $10 million, five-year award from the National Science Foundation, the team is developing software that allows researchers to manage and make sense of vast amounts of digital scientific data that is currently trapped in outdated file formats. The NCSA team recently demonstrated two publicly-available services to make the contents of uncurated data collections accessible.
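
TFA doesn't spell out an API, so purely as an illustration of the shape of such a service, here is a minimal Python sketch of posting a legacy file to a hypothetical format-conversion endpoint. The host, route, and response handling are assumptions for illustration only, not Brown Dog's actual interface:

    import requests

    # Hypothetical conversion service; the real Brown Dog endpoints are not given in the summary.
    SERVICE = "https://converter.example.org"

    def convert(path, target_format):
        """Upload a file and ask the (assumed) service to convert it to target_format."""
        with open(path, "rb") as f:
            resp = requests.post(
                f"{SERVICE}/convert/{target_format}",  # assumed route
                files={"file": f},
                timeout=300,
            )
        resp.raise_for_status()
        # Assume the service replies with a URL where the converted file can be fetched.
        return resp.text.strip()

    if __name__ == "__main__":
        # e.g. turn an old SPSS .sav file into CSV
        print(convert("legacy_results.sav", "csv"))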
  • oblig xkcd (Score:2, Insightful)

    by irussel ( 78667 )
  • by GrpA ( 691294 ) on Tuesday October 07, 2014 @10:55PM (#48088601)

    The problem is that 99%* of data is actually trapped behind paywalls...

    Which is more of a problem than the format. If the data was available without the paywall, then the format probably wouldn't matter as much.

    GrpA

    *99% is a made-up statistic - just like the original article. I assume it means "lots..."

    • by Vesvvi ( 1501135 ) on Tuesday October 07, 2014 @11:48PM (#48088813)

      Although you have a point, you don't understand the realities of science, data, and publishing.

      Journal articles never contain sufficient information to replicate an experiment. That's been reported multiple times and discussed here before, in particular around the study on how difficult or impossible it is to reproduce published research. Many jumped into the fray with fraud claims when that report hit, but the reality is that it's just not possible to lay out every little detail in a publication, and those details matter a LOT. As a consequence, it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific.

      The data is not hidden behind paywalls: there is minimal useful data in the publications. Of course, the paywalls do hide the methods descriptions, which is pretty bad.

      There are two major obstacles to the dissemination of useful data. The first is that the metadata is nearly always absent or incomplete, and the format issue is a subset of this problem. The second is that data is still "expensive" enough that we can't trivially keep a copy of all of it. This means it requires careful stewardship if it's going to be archived, and no one is paying for that.

      • it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific.

        i used to make my living doing just such a thing! and hope to again one day...

        i was sort of an SPSS jockey and data interpreter for a geospatial HCI research project

        the supervising professor set the parameters for the study, i came on b/c he had absolutely no idea what to do with all his data...it was 8 years of a study that

  • by mlts ( 1038732 ) on Tuesday October 07, 2014 @10:59PM (#48088627)

    Isn't gathering, indexing, and trying to make heads or tails of data exactly what Splunk is designed for? It's a commercial utility, and not cheap by any means... but at least it's one software package meant to sift through data and generate reports/graphs/etc. on it (a rough sketch of the kind of query I mean is below).

    Disclaimer: Not associated with them, but have ended up using their products at multiple installations with very good results (mainly keeping customers happy with a morning PDF report that all is well, with the charts to prove it.)
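
    For instance, a minimal Python sketch of firing a one-shot search at Splunk's REST API. The host, credentials, and index name are placeholders of my own (nothing from TFA); the /services/search/jobs endpoint with exec_mode=oneshot is Splunk's documented REST interface for a blocking search:

        import requests

        # Placeholders: point these at your own Splunk instance.
        SPLUNK = "https://splunk.example.org:8089"
        AUTH = ("admin", "changeme")

        # One-shot search: blocks until the search finishes and returns results directly.
        resp = requests.post(
            f"{SPLUNK}/services/search/jobs",
            auth=AUTH,
            verify=False,  # the management port often runs a self-signed cert
            data={
                "search": "search index=main | stats count by sourcetype",
                "exec_mode": "oneshot",
                "output_mode": "json",
            },
        )
        resp.raise_for_status()

        # Print one row per sourcetype with its event count.
        for row in resp.json()["results"]:
            print(row["sourcetype"], row["count"])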

  • by swell ( 195815 ) <jabberwock@poetic.com> on Tuesday October 07, 2014 @11:24PM (#48088709)

    wow
    I'm still struggling with the first 99 percent, and now you tell me there's more?

    This must be the Dark Matter that the rumors are about. Oh, how the elusive tendrils of reality converge on the delicate neurons of the deranged mind.

  • What will this project do for me? How do I get old, worn-out data files converted out of dead proprietary formats into something usable or useful? Or is this project only for certain types of researcher? (aka those with oodles of money)

    • i can point you to Systems Science research projects that are based on testing high level data analysis algorithms and the data set they need is immaterial...it just has to fit certain parameters

      the data in TFA could be used for just such a task

      in one example, a PhD researcher was developing a speech recognition algorithm improvement and he actually used the entire digital catalogue of some musician as his data set...it produced interesting results that were not part of testing his hypothesis at all...hard

  • We all have? (Score:4, Interesting)

    by jader3rd ( 2222716 ) on Wednesday October 08, 2014 @12:53AM (#48089039)
    I honestly don't recall that ever being a problem. It may have happened to me, but it must have been so long ago and so infrequently that I can't recall ever failing to find something I was expecting to find.
  • I've been kicking around the internet since before the web, back when one was delighted by the capabilities of Gopher, etc.
    And:
    "We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata..."

    Nope, not even once.

    • I think the thing that bothers me most, and it might be closely related, is finding information on the internet with no date attached to it. You'll search for something, find something that looks like a news article, and there are no dates. No information about when it happened.
      • This times a BILLION.
        Just to have a crawlable time/date stamp on pages consistently....delightful.

        IIRC there's a Google search operator like sort:date or something that will give you the freshest pages first, but I've never found it useful. Dunno, maybe changing ad content 'refreshes' page ages to the point of meaninglessness?

  • Has this class of data been termed "Dark Data" yet?

  • So if I have a YAML or MP3 file, I can convert it...?

"If it ain't broke, don't fix it." - Bert Lantz

Working...