Software Supercomputing The Internet

Brown Dog: a Search Engine For the Other 99 Percent (of Data)

aarondubrow writes: We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata — the critical data about the data, such as when and how and by whom it was produced — is nonexistent. Led by Kenton McHenry, a team at the National Center for Supercomputing Applications is working to change that. Recipients in 2013 of a $10 million, five-year award from the National Science Foundation, the team is developing software that allows researchers to manage and make sense of vast amounts of digital scientific data that is currently trapped in outdated file formats. The NCSA team recently demonstrated two publicly-available services to make the contents of uncurated data collections accessible.
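
TFA doesn't spell out an API, so purely as an illustration of the shape of such a service, here is a minimal Python sketch of posting a legacy file to a hypothetical format-conversion endpoint. The host, route, and response handling are assumptions for illustration only, not Brown Dog's actual interface:

    import requests

    # Hypothetical conversion service; the real Brown Dog endpoints are not given in the summary.
    SERVICE = "https://converter.example.org"

    def convert(path, target_format):
        """Upload a file and ask the (assumed) service to convert it to target_format."""
        with open(path, "rb") as f:
            resp = requests.post(
                f"{SERVICE}/convert/{target_format}",  # assumed route
                files={"file": f},
                timeout=300,
            )
        resp.raise_for_status()
        # Assume the service replies with a URL where the converted file can be fetched.
        return resp.text.strip()

    if __name__ == "__main__":
        # e.g. turn an old SPSS .sav file into CSV
        print(convert("legacy_results.sav", "csv"))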
  • oblig xkcd (Score:2, Insightful)

    by irussel ( 78667 )
  • by GrpA ( 691294 ) on Tuesday October 07, 2014 @10:55PM (#48088601)

    The problem is that 99%* of data is actually trapped behind paywalls...

    Which is more of a problem than the format. If the data was available without the paywall, then the format probably wouldn't matter as much.

    GrpA

    *99% is a made-up statistic - just like the original article. I assume it means "lots..."

    • by Vesvvi ( 1501135 ) on Tuesday October 07, 2014 @11:48PM (#48088813)

      Although you have a point, you don't understand the realities of science, data, and publishing.

      Journal articles never contain sufficient information to replicate an experiment. That's been reported multiple times and discussed here before, in particular around the study on how difficult or impossible it is to reproduce published research. Many jumped into the fray with fraud claims when that report hit, but the reality is that it's just not possible to lay out every little detail in a publication, and those details matter a LOT. As a consequence, it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific.

      The data is not hidden behind paywalls: there is minimal useful data in the publications. Of course, the paywalls do hide the methods descriptions, which is pretty bad.

      There are two major obstacles to the dissemination of useful data. The first is that the metadata is nearly always absent or incomplete, and the format issue is a subset of this problem. The second is that data is still "expensive" enough that we can't trivially keep a copy of all of it. This means it requires careful stewardship if it's going to be archived, and no one is paying for that.

      • it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific.

        i used to make my living doing just such a thing! and hope to again one day...

        i was sort of an SPSS jockey and data interpreter for a geospatial HCI research project

        the supervising professor set the parameters for the study, i came on b/c he had absolutely no idea what to do with all his data...it was 8 years of a study that

  • by mlts ( 1038732 ) on Tuesday October 07, 2014 @10:59PM (#48088627)

    Isn't gathering, indexing, and trying to make heads or tails of data exactly what Splunk is designed for? It's a commercial utility, and not cheap by any means... but at least it's one software package meant to sift through data and generate reports/graphs/etc. on it (a rough sketch of the kind of query I mean is below).

    Disclaimer: Not associated with them, but have ended up using their products at multiple installations with very good results (mainly keeping customers happy with a morning PDF report that all is well, with the charts to prove it.)
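
    For instance, a minimal Python sketch of firing a one-shot search at Splunk's REST API. The host, credentials, and index name are placeholders of my own (nothing from TFA); the /services/search/jobs endpoint with exec_mode=oneshot is Splunk's documented REST interface for a blocking search:

        import requests

        # Placeholders: point these at your own Splunk instance.
        SPLUNK = "https://splunk.example.org:8089"
        AUTH = ("admin", "changeme")

        # One-shot search: blocks until the search finishes and returns results directly.
        resp = requests.post(
            f"{SPLUNK}/services/search/jobs",
            auth=AUTH,
            verify=False,  # the management port often runs a self-signed cert
            data={
                "search": "search index=main | stats count by sourcetype",
                "exec_mode": "oneshot",
                "output_mode": "json",
            },
        )
        resp.raise_for_status()

        # Print one row per sourcetype with its event count.
        for row in resp.json()["results"]:
            print(row["sourcetype"], row["count"])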

  • by swell ( 195815 ) <jabberwock@poetic.com> on Tuesday October 07, 2014 @11:24PM (#48088709)

    wow
    I'm still struggling with the first 99 percent, and now you tell me there's more?

    This must be the Dark Matter that the rumors are about. Oh, how the elusive tendrils of reality converge on the delicate neurons of the deranged mind.

  • What will this project do for me? How do I get old, worn-out data files converted out of dead proprietary formats into something usable or useful? Or is this project only for certain types of researcher? (aka those with oodles of money)

    • i can point you to Systems Science research projects that are based on testing high level data analysis algorithms and the data set they need is immaterial...it just has to fit certain parameters

      the data in TFA could be used for just such a task

      in one example, a PhD researcher was developing a speech recognition algorithm improvement and he actually used the entire digital catalogue of some musician as his data set...it produced interesting results that were not part of testing his hypothesis at all...hard

  • We all have? (Score:4, Interesting)

    by jader3rd ( 2222716 ) on Wednesday October 08, 2014 @12:53AM (#48089039)
    I honestly don't recall that ever being a problem. It may have happened to me, but it must have been so long ago and so infrequently that I can't recall ever failing to find something I was expecting to find.
  • I've been kicking around the internet since before the web, back when one was delighted by the capabilities of Gopher, etc.
    And:
    "We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata..."

    Nope, not even once.

    • I think the thing that bothers me most, and it might be closely related, is finding information on the internet with no date attached to it. You'll search for something, find something that looks like a news article, and there are no dates. No information about when it happened.
      • This times a BILLION.
        Just to have a crawlable time/date stamp on pages consistently....delightful.

        IIRC there's a Google search operator like sort:date or something that will give you the freshest pages first, but I've never found it useful. Dunno, maybe changing ad content 'refreshes' page ages to the point of meaninglessness?

  • Has this class of data been termed "Dark Data" yet?

  • So if I have a YAML or MP3 file, I can convert it...?

"If it ain't broke, don't fix it." - Bert Lantz

Working...