Brown Dog: a Search Engine For the Other 99 Percent (of Data)
aarondubrow writes: We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata — the critical data about the data, such as when, how, and by whom it was produced — is nonexistent. Led by Kenton McHenry, a team at the National Center for Supercomputing Applications is working to change that. Recipients in 2013 of a $10 million, five-year award from the National Science Foundation, the team is developing software that allows researchers to manage and make sense of vast amounts of digital scientific data currently trapped in outdated file formats. The NCSA team recently demonstrated two publicly available services that make the contents of uncurated data collections accessible.
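To give a rough feel for what services like this look like from the researcher's side, here is a minimal sketch of posting a legacy file to a format-conversion and metadata-extraction service over HTTP. The service URL, endpoints, and field names below are hypothetical placeholders, not Brown Dog's actual API:

    # Minimal sketch of calling a hypothetical conversion/extraction service.
    # The URL, endpoints, filenames, and field names are placeholders.
    import requests

    SERVICE = "https://converter.example.org"   # placeholder service address

    # Ask the service to convert a legacy-format file into something readable...
    with open("results_1997.sdf", "rb") as f:
        resp = requests.post(SERVICE + "/convert?to=csv", files={"file": f})
    resp.raise_for_status()
    with open("results_1997.csv", "wb") as out:
        out.write(resp.content)

    # ...and to extract whatever metadata it can recover from the file.
    with open("results_1997.sdf", "rb") as f:
        meta = requests.post(SERVICE + "/extract", files={"file": f}).json()
    print(meta.get("creator"), meta.get("created"), meta.get("format"))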
oblig xkcd (Score:2, Insightful)
The problem isn't the format of the data... (Score:5, Informative)
The problem is that 99%* of data is actually trapped behind paywalls...
Which is more of a problem than the format. If the data were available without the paywall, the format probably wouldn't matter as much.
GrpA
*99% is a made-up statistic - just like the original article. I assume it means "lots..."
Re: (Score:1)
And you keep coming back to spam here. Can't be a very successful site. But I guess you've earned the right to spam and troll Slashdot now that you've left. Being part of the problem is so cool, eh?
The problem isn't the format of the data... (Score:5, Insightful)
Although you have a point, you don't understand the realities of science, data, and publishing.
Journal articles never contain sufficient information to replicate an experiment. That's been reported multiple times and also discussed here previously, in particular when the study came out about how difficult or outright impossible it is to reproduce published research. Many jumped into the fray with fraud claims when that report hit, but the reality is that it's just not possible to lay out every little detail in a publication, and those details matter a LOT. As a consequence, it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific.
The data is not hidden behind paywalls: there is minimal useful data in the publications. Of course, the paywalls do hide the methods descriptions, which is pretty bad.
There are two major obstacles to dissemination of useful data. The first is that the metadata is nearly always absent or incomplete, and the format issue is a subset of this problem. The second is that data is still "expensive" enough that we can't trivially just keep a copy of all of it. This means it requires careful stewardship if it's going to be archived, and no one is paying for that.
academic dishonesty (Score:3)
i used to make my living doing just such a thing! and hope to again one day...
i was sort of an SPSS jockey and data interpreter for a geospatial HCI research project
the supervising professor set the parameters for the study, i came on b/c he had absolutely no idea what to do with all his data...it was 8 years of a study that
Isn't this what Splunk is for? (Score:5, Informative)
Isn't gathering, indexing, and trying to make heads or tails of data what Splunk is designed for? It's a commercial utility, and not cheap by any means... but at least it's one software package meant to sift through data and generate reports/graphs/etc. on it.
Disclaimer: Not associated with them, but have ended up using their products at multiple installations with very good results (mainly keeping customers happy with a morning PDF report that all is well, with the charts to prove it.)
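For the curious, getting that kind of morning summary out of Splunk programmatically is roughly this involved (a minimal sketch assuming the splunk-sdk-python package, with placeholder host, credentials, and search):

    # Sketch: run a one-shot Splunk search and print a tiny summary report.
    # Assumes the splunk-sdk-python package; host and credentials are placeholders.
    import splunklib.client as client
    import splunklib.results as results

    service = client.connect(
        host="splunk.example.com", port=8089,
        username="admin", password="changeme",
    )

    # Count events per sourcetype over the last 24 hours.
    stream = service.jobs.oneshot(
        "search index=main earliest=-24h | stats count by sourcetype")

    for row in results.ResultsReader(stream):
        if isinstance(row, dict):
            print(row["sourcetype"], row["count"])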
"the Other 99 Percent" (Score:3, Funny)
wow
I'm still struggling with the first 99 percent, and now you tell me there's more?
This must be the Dark Matter that the rumors are about. Oh, how the elusive tendrils of reality converge on the delicate neurons of the deranged mind.
I have plenty of old scientific data files (Score:2)
What will this project do for me? How do I get old, worn-out data files converted out of dead proprietary formats into something usable or useful? Or is this project only for certain types of researcher? (aka those with oodles of money)
Re: (Score:2)
i can point you to Systems Science research projects that are based on testing high level data analysis algorithms and the data set they need is immaterial...it just has to fit certain parameters
the data in TFA could be used for just such a task
in one example, a PhD researcher was developing a speech recognition algorithm improvement and he actually used the entire digital catalogue of some musician as his data set...it produced interesting results that were not part of testing his hypothesis at all...hard
We all have? (Score:4, Interesting)
huh? (Score:2)
I've been kicking around the internet since before the web, back when one was delighted by the capabilities of gopher, etc.
And:
"We've all experienced the frustration of trying to access information on websites, only to find that the data is trapped in outdated, difficult-to-read file formats and that metadata..."
Nope, not even once.
Re: (Score:2)
Re: (Score:2)
This times a BILLION.
Just to have a consistent, crawlable time/date stamp on pages... that alone would be delightful.
IIRC there's a Google search operator like sort:date or something that will give you the freshest pages first, but I've never found it useful - dunno, maybe changing ad content 'refreshes' page ages to the point of meaninglessness?
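When pages do carry a machine-readable date, it's usually in a meta tag (Open Graph's article:published_time, Dublin Core's DC.date, or a plain "date" tag), and a crawler can pick it up with something like this sketch (Python with requests and BeautifulSoup; the URL is a placeholder, and many pages set none of these tags):

    # Sketch: pull a machine-readable publication date from a page, if present.
    # Checks a few common meta-tag conventions; plenty of pages set none of them.
    import requests
    from bs4 import BeautifulSoup

    def page_date(url):
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for attrs in (
            {"property": "article:published_time"},  # Open Graph
            {"name": "DC.date"},                     # Dublin Core
            {"name": "date"},                        # generic
        ):
            tag = soup.find("meta", attrs=attrs)
            if tag and tag.get("content"):
                return tag["content"]
        return None

    print(page_date("https://example.com/some-article"))  # placeholder URL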
Dark Data (Score:2)
Has this class of data been termed "Dark Data" yet?
Re: (Score:2)
How do you spell microfiche again? (Score:1)