Seeding the Commons Data Sharing Project Complete.

The federally funded component of the UWS Seeding the Commons data description and sharing project has been completed. A comprehensive description of the project is available via a previous blog post. Descriptions of 21 UWS data collections are now available in Research Data Australia. Some of the collections are open access, some are available via mediated access (contact the researcher to discuss access conditions) and some are metadata (description) only. The collections are also represented in Trove, the National Library’s single search portal, and are discoverable by Google, Google Scholar and other search engines. Data collections with a DOI will be indexed in the new Thomson Reuters Data Citation Index, allowing them to be formally cited in research papers. Congratulations to all participating researchers, who now have their data or data descriptions accessible to a vast audience of international scholars, including potential collaborators.

Lessons Learned

A shift in the culture of data sharing is required to ensure that data does not remain the lost output of research. Whilst some have embraced sharing, others still insist they ‘just don’t want to’ share their data. A concerted effort is required to raise the awareness of UWS researchers on the benefits of data sharing through a campaign of communication, education and engagement for all in the research data lifecycle. If you know of a data sharing success story don’t be shy, spread the word.

Data description is complex and can appear daunting until all the pieces fall into place. A cheat sheet is in development and available on request to assist. Once refined the sheet will be published.

What’s Next?

Researchers may self-submit data descriptions and/or small data sets into the UWS Research Data Catalogue. Library staff will complete the metadata, confirm the record with the submitter and make it available in Research Data Australia.

Researchers wanting to share data/descriptions but who are unsure about self-submission may contact Susan Robbins at s.robbins@uws.edu.au or 9852 5458.

UWS is currently working on Cr8it (pronounced Crate-it) – a web-based packaging application for research data. It will give users an organised view of their files, with as much metadata as possible automatically extracted from the files. Cr8it will let researchers identify related objects and organise them into a data ‘package’, adding more metadata and context if required, such as associating a package with a research institute, facility or experiment. Researchers will then be able to send the packages to the Research Data Catalogue and eventually push them out to a variety of other destinations, such as blogs or discipline repositories. Cr8it is currently at the proof of concept stage. To try out this service or learn more, please contact eResearch@uws.edu.au.

Currently we are investigating ‘use cases’ for depositing data at various stages of the research life cycle (e.g. at the inception of a research idea, when applying for a grant, when it is funded, etc.). These are mentioned in a previous blog post.

What’s in it for Me?

The opportunity to stay ahead of the pack. Aside from the practical issues of data storage and preservation, you could increase opportunities for collaboration and the impact of your research globally.

To arrange storage space for working data, or a secure space to archive and preserve data, contact: Toby O’Hara from eResearch at t.ohara@uws.edu.au or 4736 0928

To arrange for your data collection to be described and reflected in Research Data Australia (and associated locations) contact:

Susan Robbins Research Coordinator (Library) at s.robbins@uws.edu.au or 9852 5458

‘Moving Forward’

A series of UWS data-related webinars and workshops will soon be available to assist everyone involved in the research data lifecycle. To be informed when they are scheduled, please email Susan Robbins at s.robbins@uws.edu.au.

“Piled Higher and Deeper” by Jorge Cham
www.phdcomics.com

No Such Thing as a Dumb Question

I’m working with external collaborators – can I give them access to our Research Data Repository?

This is being investigated as a priority, but unavailable at present.

I trip over boxes of old interviews in my lounge room – can you take them, digitise them?

UWS Archives is able to store these for the duration of the ethics application; contact RAMS. After the ethics expiry date, the data collection will be evaluated to determine the next step.

I’m retiring and have 20 years worth of research data on floppy discs. Can I give them to you to digitise, preserve and archive?

At present we don’t offer this option, but you can self-submit them and (soon) use Cr8it (see above) to manage the collection. Assistance is available. Contact Toby O’Hara at t.ohara@uws.edu.au or x2928.

Data are the New Black: Data Sharing in the National/International Arena

A selection of data related activities occurring internationally.

CSIRO to embrace open access. Hare, Julie. The Australian [Canberra, A.C.T.] 11 July 2012: 31

THE CSIRO is making freely available 200,000 research papers dating back to the 1920s on its new, open-access repository. It is also creating a portal to contain most of the raw research data used by the organisation since its inception.

“It’s a massive job. We will eventually have 86 years of data in the repository,” said Jon Curran, CSIRO’s general manager of communications.

“We are anticipating this is where the world of science is heading.

“The mood is there. And we know the more visible the work the more excitement and energy that is generated.” …

GigaScience (http://www.gigasciencejournal.com/), an innovative new journal handling ‘big-data’ from the entire spectrum of life sciences, has now been launched by BGI.

Geoscience Data Journal: new Wiley open access data journal

“It is becoming increasingly important that the data which underpins key findings should be made more available to allow for the further analysis and interpretation of those results,” said Mike Davis, Vice President and Managing Director, Life Sciences Wiley. “The ability of researchers to create and collect often huge new data sets has been growing rapidly in parallel with options for their storage and retrieval in a wide range of data repositories. We are launching the Geoscience Data Journal in response to these important developments.”

http://au.wiley.com/WileyCDA/PressRelease/pressReleaseId-104139.html

Hindawi Datasets International

Publishing a Dataset Paper in Datasets International is all about the underlying raw and tabular data that the author has obtained during his experiment. Every table or image should be accompanied with a full description of how this data has been obtained, for instance, if you provide us with a graph; you should provide us with the tabular data you have used to draw this graph.

Datasets should contain detailed explanation of the methodology and materials used in conducting the experiment/observation and no final results or conclusions. Accordingly, manuscripts should be submitted along with all the relevant data.

Wikidata aims to create a free knowledge base about the world that can be read and edited by humans and machines alike. It will provide data in all the languages of the Wikimedia projects, and allow for the central access to data in a similar vein as Wikimedia Commons does for multimedia files.

Google Scholar already contains citations to datasets represented by a DOI.

An example of a data citation in a reference list

Birgisdottir, L., and Thiede, J.R.N. (2002). Carbon and density analysis of sediment core PS1243-1. PANGAEA. doi:10.1594/PANGAEA.87536.

Cited in Jourabchi, P., L’Heureux, I., Meile, C., & Cappellen, P. V. (2010). Physical and chemical steady-state compaction in deep-sea sediments: Role of mineral reactions. Geochimica Et Cosmochimica Acta, 74(12), 3494-3513. Retrieved from www.scopus.com

Creative Commons License
Seeding the Commons Data Sharing Project Complete. by Susan Robbins is licensed under a Creative Commons Attribution 3.0 Unported License.

Mixing our Research Data Metaphors: Seeding the commons, capturing data & taming ‘wild’ research data

By Peter Sefton and Peter Bugeia, with input from the UWS eResearch community and beyond

About this post

During 2012 the University of Western Sydney (UWS) will be rolling out a Research Data Repository (RDR), which we outlined in a previous post. In this post we will dig deeper into the architecture and look at how a couple of the components interact. Specifically: how does a lab-level data management application talk to the institution-level Research Data Repository when a researcher wants to archive a data set for reuse and citation? This work is a partnership with researchers and technicians at the Hawkesbury Institute for the Environment (HIE), our NSW eResearch partner Intersect, the UWS library and IT, and the UWS eResearch team.

Non-technical summary: The data capture application for environmental scientists at HIE will be aimed at obtaining and managing data for immediate use and re-use. This post describes the technical approach we will use to allow researchers to create a data set from one or more data sources, ask the system to keep it for the long term in the UWS Research Data Repository, and issue an identifier they can use to cite it in a research publication. Keeping data in the RDR means both adding data to the Research Data Storage (RDS) component and maintaining a record about the data in the Research Data Catalogue (RDC).

Technical summary (contains jargon which is explained below): The data-curation interface between the ANDS-funded Data Capture (DC21) and Seeding the Commons (SC20) projects at UWS has now been specified. Data sets identified by researchers as important in the DC21 application will be harvested by the institutional Research Data Repository using the OAI-PMH protocol with a RIF-CS payload. Data librarians will check and improve collection descriptions and, for those of significant re-use potential, publish them to Research Data Australia. On publication, the Research Data Repository application will move data from a pre-published to a published state. Pre-published data may be openly accessible for collaboration purposes but will not have DOI identifiers or guaranteed persistence.

Data capture and seeding the commons

We have two Australian National Data Service (ANDS) projects running at UWS at the moment.

  1. There’s a Data Capture project, which, amongst other capabilities, is designed to capture some of the ‘wild’ data, organizing it into collections that can be secured, referenced and re-used by others. This is known as DC21, AKA the Climate Change and Energy Research Data Capture Project.

    Data might be considered ‘wild’ if there are questions about its long term management (will we be able to find it ten years from now?) or its short term safety (is it backed up?), or if its status is not known (is it raw or cleansed?).

  2. There’s a Seeding the Commons project which, amongst other things, is aimed at establishing a catalogue application which publishes descriptions of collections of data available for re-use on a search site: Research Data Australia.

Here’s what the DC21 application is doing:

This project will develop the data architecture and associated software systems to automatically capture data and meta-data from three instruments. The motivation for the project is that on completion the systems developed will serve as a basis for including the additional instruments utilised by CCERF and other research groups at UWS.

And it has a close connection to the Seeding the Commons project SC20.

The project is closely aligned and is partly dependent on the UWS Seeding the Commons project (SC20). The meta-data collected in this project will be contributed to the UWS eResearch Metadata Store. SC20 will be developing RIF-CS and OAI-PMH compliance for the UWS eResearch Metadata Store to allow for it to be harvested into the ARDC.

OAI-PMH, RIF-CS?

  1. OAI-PMH is a web protocol allowing one service to pull data from another. It’s very similar to RSS and Atom used to keep track of updates on websites by software like Google Reader.

  2. RIF-CS is the data format used to publish catalogue descriptions of research data and associated entities like people and projects to Research Data Australia. RIF-CS is an ANDS-specific format which is not sufficient on its own to capture a full set of archival and management data about research data collections, but our initial analysis is that it will be sufficient to communicate between the data capture application and the centralised research data repository. 
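To make the ‘pull’ model a little more concrete, here is a minimal Python sketch of what a harvester does: build a ListRecords request, then read the record identifiers out of the response. The verb and parameter names come from the OAI-PMH specification; everything else is an assumption for illustration, including the base URL, the sample response, the record identifiers and the use of `rif` as the metadata prefix for RIF-CS. It is not the DC21 or ReDBox code.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="rif", from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    'rif' as the RIF-CS metadata prefix is an assumption; a real
    repository advertises its prefixes via the ListMetadataFormats verb.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Only ask for records changed on or after this date.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return the identifiers of all non-deleted records in a response."""
    root = ET.fromstring(response_xml)
    ids = []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue  # the source has withdrawn this record
        ids.append(header.findtext("oai:identifier", namespaces=OAI_NS))
    return ids


# Hypothetical response; a real one carries a RIF-CS document per record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:dc21.example.org:collection/1</identifier>
        <datestamp>2012-07-01</datestamp>
      </header>
      <metadata><!-- RIF-CS payload would go here --></metadata>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:dc21.example.org:collection/2</identifier>
        <datestamp>2012-07-02</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://dc21.example.org/oai", from_date="2012-07-01"))
print(record_identifiers(SAMPLE_RESPONSE))
```

A real harvester would also follow the `resumptionToken` that OAI-PMH uses to page through long result lists; that is left out here to keep the sketch short.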

From data capture – to data embalming, er, preservation and re-consumption

Luc Small of Intersect has written up the DC21 application.

While it’s called a ‘capture’ application, with connotations of Gerald Durrell style antics in the wilds, trapping temperature readings and soil moisture readings with tranquilizer darts, DC21 is really about data domestication. Sure we need to obtain data, but it’s not just about raw, untamed, data; technicians and researchers do things to the data. They clean it and analyse it, and make useful collections out of data from different sources.

The bit we’re interested in for this post is the point at which someone says “I’m ready to write this up” – at this point they will want to make sure their research is defensible, reproducible and, perhaps most importantly, citable. Before we go on to talk about this process, let’s look at some of the assumptions we’re making about the DC21 application.

Design Considerations

  • Data capture applications contain working data that might be reworked, cleaned or deleted before it is published or used as the basis for a publication or report.

  • Research projects are born, they run and they get completed. Research facilities are built and will eventually become obsolete. Data capture systems which service these projects and facilities are likely to suffer the same fate – they will not always have governance in place to ensure that they persist over long periods of time. (Yes, we know it’s in the requirements spec that every app is ‘sustainable’ but let’s be realistic).

  • The Research Data Repository (RDR) and its sub parts (the data storage system and the Research Data Catalogue RDC) capture important institutional assets.  To maintain these research data assets, the RDR will need to have governance in place to ensure its long term persistence.

  • The RDR will have RIF-CS-over-OAI-PMH and other interfaces that are needed for compliance and data discovery, meaning that data capture applications need not have these (but they can, of course).

  • A data set that is required for validation of research should have a persistent identifier expressed as an HTTP URI.  (Handles and DOIs can both be used to make URIs, with some benefits and attendant risks).

  • Publicly accessible data sets, as well as those that are expected to be cited even if not available as Open Access

  • And an implementation detail: At UWS, the ReDBox Research Data Catalogue application will be the software that runs the Seeding the Commons and RDC projects.
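The point above about expressing persistent identifiers as HTTP URIs can be illustrated with a few lines of Python. This is only a sketch: the function name is ours, the DOI in the usage line is a made-up example, and we assume the public `https://doi.org/` resolver as the default.

```python
from urllib.parse import quote


def doi_to_uri(doi, resolver="https://doi.org/"):
    """Turn a bare DOI (optionally prefixed with 'doi:') into an HTTP URI."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Every DOI starts with the '10.' directory indicator and has a
    # prefix/suffix pair separated by a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping '/'.
    return resolver + quote(doi, safe="/")


# Hypothetical DOI, for illustration only.
print(doi_to_uri("doi:10.5555/12345678"))
```

The identifier stays the same whichever resolver is used; only the URI form depends on the resolver staying available, which is one of the attendant risks mentioned above.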

Rules of Engagement

Here are some rules of engagement, which are emerging as we get further into the design process for the Research Data Repository (RDR), data capture (DC21) and Research Data Catalogue applications (SC20). These rules are helping to ensure that the research data being captured is robust and well managed.  Data sets that are needed to validate research, and which researchers want to be citable:

  • Must be deposited in the Research Data Storage component (RDS) of the RDR or another persistent store that meets the same standards for data preservation. Note that much data will be in the RDS already, in which case deposit is a state-change rather than a move.

  • Must be described in the Research Data Catalogue (RDC) with a link to where the data resides. (Support will be available for this from the library).

  • Data capture applications must have a mechanism for a researcher to ask for a data set to be ‘curated’ so it is available for a defined period and correctly described, for example if they want to use it as the basis of a publication.

The current solution

Against the background of our medium-term plans for a UWS Research Data Repository, and the above design considerations, rules of engagement and requirements, the technical teams from the Data Capture project and the Seeding the Commons project spent the best part of a day working out a whiteboard sketch of the interfaces between the lab-level working data management application and the repository.

While this high level solution design assumes ReDBox, other metadata store applications could be slotted in instead – the interface is standards based (RIF-CS over OAI-PMH).

The whiteboard looked like this. Below, we’ll simplify that with a proper diagram made on a computer.

Figure 1 Interface between data capture application and the Research Data Repository (using OAI-PMH and the RIF-CS standard for metadata about research data)

There are two main interface points:

  1. Name authority lookup, where every bit of metadata entered into DC21 is as high as possible in quality, via:

    1. A linked-data approach using HTTP URIs (AKA URLs) as names for things, as per the Gospel According to Tim.

    2. A single source of truth via the Mint component of ReDBox for data like subject codes, people, organisations etc.

  2. The ‘curation boundary’ where DC21 hands over metadata to the Research Data Catalogue and, once that metadata has been curated by data librarians, data is pulled into the public-facing facet of the Research Data Store.

The first of these is already done in DC21 – as far as we know this is the first time a service other than ReDBox has been connected to an instance of the Mint as an authority. We will talk more about the importance of name authorities as ‘sources of institutional truth’ and the use of identifiers as our Research Data Repository project proceeds. For now, we will note that, as far as possible, every time someone fills out a form with something the institution already knows (the name of a person, a grant code, etc.), the data is looked up in the name authority rather than relying on people typing strings or on local look-up tables. The UWS Research Data Catalogue is going to be ‘no strings attached’, as in text-strings. URIs all the way!
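As a rough illustration of ‘no strings attached’, the Python sketch below looks a typed name up in an authority and stores the resulting URI in the form, not the text the person typed. The in-memory list stands in for a name authority such as the Mint; the entries, URIs and function names are all hypothetical and do not reflect the Mint’s actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorityEntry:
    uri: str    # the identifier that gets stored
    label: str  # the human-readable name shown in the form


# Hypothetical stand-in for the institutional name authority.
AUTHORITY = [
    AuthorityEntry("http://example.org/people/1", "Jane Citizen"),
    AuthorityEntry("http://example.org/people/2", "John Citizen"),
    AuthorityEntry("http://example.org/grants/G-001", "Soil moisture grant"),
]


def lookup(query):
    """Return entries whose label contains the query, ignoring case."""
    q = query.strip().lower()
    if not q:
        return []
    return [e for e in AUTHORITY if q in e.label.lower()]


def record_creator(form, entry):
    """Return a copy of the form with the creator stored as a URI."""
    return {**form, "creator": entry.uri}


matches = lookup("jane")
form = record_creator({"title": "Soil moisture readings"}, matches[0])
print(form)
```

Because the form holds a URI, two collections by the same person point at the same thing even if the name was typed differently each time.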

The more important interface is the second, which is the main subject of this post: it handles the deposit of data collections into the trusted Research Data Repository.

Based on all the design considerations and rules of engagement outlined above, the ‘curation boundary’ needs to be crossed when a researcher wants to keep an archival snapshot of a particular data set.

The story here is designed for data sets of moderate size, like those we’re getting from the Hawkesbury Institute for the Environment.

So, here’s the story:

  1. A researcher uses the DC21 application to find a number of data files from across two of the facilities at the institute, conducts some analysis and writes an article. (The system remembers every download from the data store.)

    The researcher asks for the particular data set used for the article to be published/curated, either by uploading the data back into the system, or clicking on a search history.

    The DC21 application bundles the requested data, with as much provenance and metadata as possible, such as adding raw data.

    The DC21 application sets a flag against that downloaded collection to mark it as ready for publication – meaning it will start appearing in the OAI-PMH feed. The DC21 application will also remember that the data behind the collection has been referenced in a collection. This is to ensure that the data is not subsequently deleted or modified without due consideration for the collection.

  2. The Research Data Catalogue, which is part of the Research Data Repository, picks up the new collection record from the OAI-PMH feed and puts it in the ‘ReDBox inbox’.

  3. The team of data librarians sees the new data set in the inbox, adds missing metadata for management and discovery purposes, maybe contacting the researcher for more information, and publishes the data.

  4. The Data Catalogue application mints a new DOI for the data set, and causes the data to be copied into the public part of the research data store. (Yes, we have to work out some of the details about when IDs get minted in this process – this step might need to happen earlier.)

  5. Later, another researcher can discover the data by searching the web, through a discovery service like Research Data Australia, or via the Research Data Catalogue directly. They get a URL version of the DOI for the data set.

  6. When someone downloads the data using the DOI-URL, they’re redirected to the data in the Research Data Store.

Figure 2 Step-by-step data curation and publishing process
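The steps above amount to a small state machine for each collection, which the following Python sketch tries to capture. The state names, the class and the rule that a DOI is attached at publication are our reading of the story, not the actual DC21 or ReDBox implementation (and, as noted in step 4, the point at which IDs get minted may yet move earlier). The DOI in the usage lines is made up.

```python
from enum import Enum


class State(Enum):
    WORKING = "working data in the capture application"
    READY = "flagged as ready for publication"
    IN_INBOX = "harvested into the catalogue inbox"
    PUBLISHED = "published with a DOI"


# Each state may only move to the next step in the story.
ALLOWED = {
    State.WORKING: {State.READY},
    State.READY: {State.IN_INBOX},
    State.IN_INBOX: {State.PUBLISHED},
    State.PUBLISHED: set(),
}


class Collection:
    def __init__(self, title):
        self.title = title
        self.state = State.WORKING
        self.doi = None

    def advance(self, new_state, doi=None):
        """Move to new_state, refusing to skip steps in the workflow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.name} to {new_state.name}")
        if new_state is State.PUBLISHED:
            if doi is None:
                raise ValueError("publishing requires a DOI")
            self.doi = doi
        self.state = new_state

    @property
    def in_oai_feed(self):
        # Assumption: once flagged, the record stays visible in the feed.
        return self.state is not State.WORKING

    @property
    def locked(self):
        # Underlying data must not be deleted or modified casually
        # once a collection refers to it.
        return self.state is not State.WORKING


c = Collection("Article data set")
c.advance(State.READY)      # researcher asks for curation
c.advance(State.IN_INBOX)   # catalogue harvests the record
c.advance(State.PUBLISHED, doi="10.5555/12345678")  # hypothetical DOI
print(c.state.name, c.doi)
```

Modelling deposit as a change of state, with no copy implied, fits the earlier observation that much of the data will already be sitting in the Research Data Storage component.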

Copyright Peter Sefton and Peter Bugeia, 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>