Monday, January 30, 2012

Open data initiatives

There has always been a vague desire on the part of experimentalists for easy public access to data that has already been published and is in the public domain.

In psycholinguistics, the first such database I know of is Reinhold Kliegl's PMR2.  

In my lab, we are (or rather I am) also thinking of some way to provide easy access to our published lab data (in addition to listing it on PMR2, it would be good to have a local copy so that one is not 100% dependent on an archive maintained by someone else). 

I just heard about another data repository, aimed more at linguists (but also at psychologists): CLARIN-D.

In this context, I've been thinking about what properties our local repository should have (this is only about our own public repository). Here is a preliminary list:

1. Data access should require login and registration, as well as an "I agree" button to record agreement to the terms of the data release (below).
2. The data should be released on the condition that any new result derived from the data should be uploaded there so that people can follow the history of what happened with that data. The people downloading the data should cite the original work where the data were reported.
3. Once the re-analysis of the original data is in the public domain and the new analysis has been uploaded, it would be ideal if there were room for comments from others (e.g., a response from the original authors). That is, this would be like a blog, but the blog should be integrated seamlessly with the repository and not be a separate interface (see PMR2 for an example of what I would not like to have). Downloaders and users of our data should also agree to show us their reanalysis before publishing it, so that we have the chance to respond if we find something we disagree with (e.g., how extreme values were removed).
4. We should release the full data for the published study, e.g., all items used, all fillers (excluding filler experiments that are not published yet), and the raw as well as the analyzed data (which could be non-raw, e.g., aggregated). People often dislike releasing all their data (I have had several people refuse to release the raw data, making re-analysis effectively impossible); they limit the release to just the data in the format that allows exactly the analysis already done, nothing further. What's the use of that kind of a data release? Suppose I want to look for a particular kind of confound in the data, and I can only do it if I have the raw data. A release of the reduced dataset would then be useless (I have been in this situation, and I could not use the released data).
5. Our own analysis for a particular dataset should be a Sweave'd document, with the .Rnw file, the .R source, the data itself, and a PDF. Ideally the paper should be the Sweave file. If every downloadable item has this collection of files, it will have a completely predictable structure that is easy for an outsider to understand. I know that developing standards is hard, even within the confines of our own lab, but it might be worth it.
6. The data should not be stored in .Rda files, but rather as plain-text files. I have had some problems opening, in a new version of R, .Rda files that were created with an older version of R.
7. There has to be a contact person locally whom people from outside can contact (and it's not gonna be me!). That's the central problem with good ideas; they always require some work.
8. It should be possible to upload a new, improved data analysis even after the data is published. For example, I published a paper in 2004, when the state of my statistical knowledge was even more miserable than it is right now. I would like to post a revised analysis, done to the best of my current ability. There should be space for that in the interface. This cannot count as a re-analysis of the dataset by an outside third party, and therefore it should be presented as part of the lab data set but marked as "revised data analysis", or something like that.
9. What about our re-analyses of *other* people's data? For example, Titus reanalyzed the Meseguer et al. dataset; this should not be presented as original data from our lab. There should be a separate section for showcasing reanalyses that we did.
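
To make points 5 and 6 a bit more concrete, here is a minimal sketch (in Python, just for illustration) of how one might check that a downloadable item has the predictable structure described above. The function name `check_release` and the particular file extensions treated as "plain text" or "binary" are my assumptions, not a fixed standard; a real lab convention would have to be agreed on first.

```python
from pathlib import PurePath

# Assumed conventions (hypothetical, for illustration only):
# every release needs Sweave source, R source, and a compiled PDF.
REQUIRED = {".rnw": "Sweave source", ".r": "R source", ".pdf": "compiled PDF"}
# Extensions we treat as plain-text data (an assumption).
TEXT_DATA = {".txt", ".csv", ".tsv", ".dat"}
# Binary R formats that point 6 asks us to avoid.
BINARY_DATA = {".rda", ".rdata"}


def check_release(filenames):
    """Return a list of problems with a release; an empty list means it looks fine."""
    suffixes = [PurePath(name).suffix.lower() for name in filenames]
    problems = []

    # Point 5: each required component must be present.
    for ext, label in REQUIRED.items():
        if ext not in suffixes:
            problems.append(f"missing {label} ({ext})")

    # Point 6: the data must be available as plain text ...
    if not any(s in TEXT_DATA for s in suffixes):
        problems.append("no plain-text data file found")

    # ... and binary R files should be flagged.
    for name, s in zip(filenames, suffixes):
        if s in BINARY_DATA:
            problems.append(f"binary R data file should be exported as text: {name}")

    return problems


if __name__ == "__main__":
    for problem in check_release(["analysis.R", "data.Rda"]):
        print(problem)
```

Something like this could be run by the local contact person before an item goes online; it only looks at file names, so it says nothing about whether the contents are actually complete.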

I'm looking forward to suggestions for further improvements (from anyone, not just lab members; that's why this is in the public domain).

