Seeding the Commons Data Sharing Project Complete.

The federally funded component of the UWS Seeding the Commons data description and sharing project has been completed. A comprehensive description of the Project is available via a previous blog post. Descriptions of 21 UWS data collections are now available in Research Data Australia. Some of the collections are open access, some are available via mediated access (contact the researcher to discuss access conditions) and some are metadata (description) only. The collections are also represented in Trove, the National Library’s single search portal, and are discoverable by Google, Google Scholar and other search engines. Data collections with a DOI will be indexed in the new Thomson Reuters Data Citation Index, allowing them to be formally cited in research papers. Congratulations to all participating researchers, who now have their data or data description accessible to a vast audience of international scholars, including potential collaborators.

Lessons Learned

A shift in the culture of data sharing is required to ensure that data does not remain the lost output of research. Whilst some have embraced sharing, others still insist they ‘just don’t want to’ share their data. A concerted effort is required to raise UWS researchers’ awareness of the benefits of data sharing through a campaign of communication, education and engagement for everyone involved in the research data lifecycle. If you know of a data sharing success story, don’t be shy: spread the word.

Data description is complex and can appear daunting until all the pieces fall into place. A cheat sheet is in development and is available on request. Once refined, the sheet will be published.

What’s Next?

Researchers may self-submit data descriptions and/or small data sets into the UWS Research Data Catalogue. Library staff will complete the metadata, confirm the record with the submitter and make it available in Research Data Australia.

Researchers wanting to share data/descriptions but who are unsure about self-submission may contact Susan Robbins at s.robbins@uws.edu.au or 9852 5458.

UWS is currently working on Cr8it (pronounced Crate-it), a web-based packaging application for research data. It will give users an organised view of their files, with as much metadata as possible automatically extracted from the files. Cr8it will let researchers identify related objects and organise them into a data ‘package’, adding more metadata and context if required, such as associating a package with a research institute, facility or experiment. Researchers will then be able to send the packages to the Research Data Catalogue and eventually push them out to a variety of other destinations, such as blogs or discipline repositories. Cr8it is currently at the proof of concept stage. To try out this service or learn more, please contact eResearch@uws.edu.au.

Currently we are investigating ‘use cases’ for depositing data at various stages of the research life cycle (e.g. at the inception of a research idea, when applying for a grant, when it is funded, etc.). These are mentioned in a previous blog post.

What’s in it for Me?

The opportunity to stay ahead of the pack. Aside from the practical issues of data storage and preservation, you could increase opportunities for collaboration and the impact of your research globally.

To arrange storage space for working data, or a secure space to archive and preserve data, contact: Toby O’Hara from eResearch at t.ohara@uws.edu.au or 4736 0928

To arrange for your data collection to be described and reflected in Research Data Australia (and associated locations) contact:

Susan Robbins Research Coordinator (Library) at s.robbins@uws.edu.au or 9852 5458

‘Moving Forward’

A series of UWS data-related webinars and workshops will soon be available to assist anyone and everyone involved in the research data lifecycle. To be informed when they are scheduled, please email Susan Robbins at s.robbins@uws.edu.au.

“Piled Higher and Deeper” by Jorge Cham
www.phdcomics.com

No Such Thing as a Dumb Question

I’m working with external collaborators – can I give them access to our Research Data Repository?

This is being investigated as a priority, but is unavailable at present.

I trip over boxes of old interviews in my lounge room – can you take them, digitise them?

UWS Archives is able to store these for the duration of the ethics application; contact RAMS. After the ethics expiry date, the data collection will be evaluated to determine the next step.

I’m retiring and have 20 years’ worth of research data on floppy discs. Can I give them to you to digitise, preserve and archive?

At present we don’t offer this option, but you can self-submit them and (soon) use Cr8it (see above) to manage the collection. Assistance is available. Contact Toby O’Hara at t.ohara@uws.edu.au or x2928.

Data are the New Black: Data Sharing in the National/International Arena

A selection of data-related activities occurring nationally and internationally.

CSIRO to embrace open access Hare, Julie. The Australian [Canberra, A.C.T] 11 July 2012: 31

THE CSIRO is making freely available 200,000 research papers dating back to the 1920s on its new, open-access repository. It is also creating a portal to contain most of the raw research data used by the organisation since its inception.

“It’s a massive job. We will eventually have 86 years of data in the repository,” said Jon Curran, CSIRO’s general manager of communications.

“We are anticipating this is where the world of science is heading.

“The mood is there. And we know the more visible the work the more excitement and energy that is generated.” …

GigaScience (http://www.gigasciencejournal.com/), an innovative new journal handling ‘big-data’ from the entire spectrum of life sciences, has now been launched by BGI.

Geoscience Data Journal: new Wiley open access data journal

“It is becoming increasingly important that the data which underpins key findings should be made more available to allow for the further analysis and interpretation of those results,” said Mike Davis, Vice President and Managing Director, Life Sciences Wiley. “The ability of researchers to create and collect often huge new data sets has been growing rapidly in parallel with options for their storage and retrieval in a wide range of data repositories. We are launching the Geoscience Data Journal in response to these important developments.”

http://au.wiley.com/WileyCDA/PressRelease/pressReleaseId-104139.html

Hindawi Datasets International

Publishing a Dataset Paper in Datasets International is all about the underlying raw and tabular data that the author has obtained during his experiment. Every table or image should be accompanied with a full description of how this data has been obtained, for instance, if you provide us with a graph; you should provide us with the tabular data you have used to draw this graph.

Datasets should contain detailed explanation of the methodology and materials used in conducting the experiment/observation and no final results or conclusions. Accordingly, manuscripts should be submitted along with all the relevant data.

Wikidata aims to create a free knowledge base about the world that can be read and edited by humans and machines alike. It will provide data in all the languages of the Wikimedia projects, and allow for the central access to data in a similar vein as Wikimedia Commons does for multimedia files.

Google Scholar already contains citations to datasets represented by a DOI.

An example of a data citation in a reference list

Birgisdottir, L., and Thiede, J.R.N. (2002) Carbon and density analysis of sediment core PS1243-1. PANGAEA. doi:10.1594/PANGAEA.87536.

Cited in Jourabchi, P., L’Heureux, I., Meile, C., & Cappellen, P. V. (2010). Physical and chemical steady-state compaction in deep-sea sediments: Role of mineral reactions. Geochimica Et Cosmochimica Acta, 74(12), 3494-3513. Retrieved from www.scopus.com

Creative Commons License
Seeding the Commons Data Sharing Project Complete. by Susan Robbins is licensed under a Creative Commons Attribution 3.0 Unported License.

Research Data Repository (RDR) progress report, May 2013

The RDR project at UWS started in 2010 with the purchase of some storage infrastructure, and was expanded in scope in 2012, based on this scoping document. Work began in earnest in June 2012 when project manager Toby O’Hara joined the team. We set out with these broad principles in mind:

The repository will consist of two main components:

  1. A scalable storage service linked to a combination of local and cloud-based high performance computing. Some data may also reside in other, trusted storage systems such as national infrastructure or discipline repositories with suitable governance in place.

  2. A catalogue of research data for internal use in management, and external use in dissemination and collaboration.

But the project is about much more than supplying storage and computing. It is about creating an organisational capability and culture of managing research data throughout the research lifecycle. We aim:

  • To enable research in all disciplines at UWS to take place efficiently and effectively on existing and new data sets.

  • To enable the validation of research through appropriate management of data inputs and outputs.

  • To enable re-use of data in new research, which will cite the creators of data sets at UWS.

  • To comply with funder requirements and codes of practice.

Those two main components are now established. We have both working storage (RDS) and archival storage (RDR) now commissioned and working on a small scale. (Note that terminology on this project has changed a bit – the RDR used to refer to all the components but it became quite clumsy to talk about ‘the archival repository part of the broader Research Data Repository’).

Figure – Super-simple view of the Research Data Repository with the two main kinds of storage – Working vs Archival

On top of that simple view, we can show how the RDR sits with other systems.

Figure – RDR interaction with two other services. Dropbox.com integration is a simple one-way approach, while the HIEv data capture application interacts with both working and archival storage via the Catalogue.

There are many, many ways that these services could be extended but we have identified three high priorities from consulting with UWS researchers, and talking to other eResearch teams, which we’ll talk about in more detail below:

  1. Adding support for distributed version control systems used by tech-savvy researchers to manage software code and documents.

  2. Adding more support for distributed file-systems like Dropbox, but with better support for data security, access control and the ability to add eResearch applications over the top of the storage.

  3. Dealing with the looming ‘feral file’ problem, where data storage tends to fill up, and there is a lack of options for researchers to hand over data to an archival store.

Dealing with source-code and document version-control systems

There are two widely used distributed version control systems: Git and Mercurial. Many researchers use these to manage program code and/or document sources for publications in text markup such as LaTeX and, increasingly, Markdown, via tools like knitr in the R environment. We are working to add support for this class of repository in our repository, which should be fairly straightforward, as the modern distributed code repositories support the key use-case by design. That is, they allow you to ‘push’ code changes to more than one repository, so a UWS member of a team that is already happily working with, say, BitBucket could push repository changes to a UWS archival repository for safe keeping, as well as to the team repository. Why would they want to do this? It’s not about short-term risk, but about having copies of data that are independent of service providers that might come and go in the medium to long term. And it’s about exactly the same use-cases for packaging data and depositing it in an archival repository as with any other data project: when projects end, when articles are published, and so on. More on this in a post soon.
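As a rough illustration of the ‘push to more than one repository’ idea, the Python sketch below builds the Git commands a team member might run to register a second remote and then push the same branch to both. The remote name `uws-archive`, the example URL, the branch name and the helper functions are all hypothetical; the address and workflow of the UWS archival repository are not described in this post.

```python
import subprocess


def push_commands(branch, remotes, new_remote=None):
    """Build git command lines that push `branch` to every remote.

    `remotes` lists remotes already configured in the clone (e.g. "origin").
    `new_remote` is an optional (name, url) pair to register first.
    """
    commands = []
    targets = list(remotes)
    if new_remote is not None:
        name, url = new_remote
        commands.append(["git", "remote", "add", name, url])
        targets.append(name)
    for remote in targets:
        commands.append(["git", "push", remote, branch])
    return commands


def run_all(commands, repo_dir):
    """Run each command inside an existing clone, stopping on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=repo_dir, check=True)


# Hypothetical archive remote: the name and URL are placeholders only.
plan = push_commands(
    "master",
    ["origin"],
    new_remote=("uws-archive", "https://example.org/archive/project.git"),
)
for cmd in plan:
    print(" ".join(cmd))
```

Nothing is executed until `run_all(plan, repo_dir)` is called on a real clone, so the plan can be inspected first. The team’s existing remote is left untouched; the archive is simply one more destination for the same history.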

Future file systems

The Dropbox.com file sync-and-share product is a clear winner in the distributed file-system stakes. It has a low-friction viral quality that lets it spread in ways that permeate and subvert our institutional networks and command-and-control structures. And it has an unparalleled ease of use (see note 1). But there are two major problems:

  1. There are some kinds of data for which one should NOT use Dropbox.com: the researcher has to decide if they are meeting ethical standards, funder requirements and layers of institutional policy.

  2. And while Dropbox.com has an API (an interface against which third parties can write software applications), it is severely limited for doing the kind of ‘bridging’ work we want to do between the RDS working-data store and the RDR archival store.

So, the fact that Dropbox.com is so popular, and so good, makes it clear that even if we can’t match it completely, we should be thinking about how to provide a similar service so research teams can:

  • Store stuff on all their devices and have it automatically synchronise between them, with some limits on re-sharing.

  • Invite others that they identify as collaborators to see the files. (No, that does not mean getting them to fill in and sign a form to apply for a university account, the way I have heard it described at a big university not far from here. It means I send you an invitation by email, you log in using something that (a) suits you and (b) works, for example a Gmail account, and once I’m sure that you are you, the sharing starts. Yes, there are exceptions where we need higher levels of assurance, but for most collaborations too many barriers mean people will revert to Dropbox and smuggled USB drives.)

And, beyond what Dropbox.com can provide:

  • Store stuff in the right jurisdiction.

  • Allow eResearch tools, such as the one we cover next, to access data via full-service machine interfaces (APIs).

There is a promising new application in this space, run by AARNet, called Cloudstor+. It gives Australian researchers 100GB of free storage, which can be expanded at low cost. It runs on the open source OwnCloud platform.

But note that there are many kinds of data that should NOT be placed in sharing-syncing services for various privacy and other legal reasons.

Creating a bridge between working file-storage and the archive

We are now starting to hand out file-shares, which will, of course, fill up with files as researchers begin to take advantage of the storage space. But what will happen to those files when articles are published, projects and grants finish, or research staff leave the institution? There are good reasons in all these situations to make sure that data are catalogued and that stuff is transferred to the Archival Store.

But it would be naïve to think that just because there are good reasons for these things to happen, they will. That’s why we have been working out how to encourage researchers to deposit data at various points in the existing research lifecycle. See our previous post on data management use-cases, where we look at how and, more importantly, why people might be motivated to catalogue and deposit data.

Some data will come to the catalogue via applications like HIEv – the environmental data capture application. At the Hawkesbury Institute for the Environment (which is where the HIE in the name comes from) data is captured by technical research infrastructure and routed automatically to HIEv, where institute staff and collaborators can work with it. When they use a data set and publish an article or create a data set for re-use then they can trigger the process of having it sent to archival storage and cataloguing.

But for data that is not coming through a data capture application (uncatalogued, ‘wild’ or ‘feral’ data), we want to provide a way for research teams to:

  • Look at their file-share and see all their (file-based) stuff.

  • Select groups of things that belong together, by directory, by file-type, by a search query, or by picking them out manually.

  • Add metadata to contextualise and explain the files, to support future re-use, and to explain how data supports published findings.

  • Publish/archive the data by sending it to ReDBox, the archival part of the overall Research Data Repository, where librarians will help optimise metadata and mind the data for the appropriate length of time.

Enter CrateIt (or Cr8it, pronounced Crate-it), an application to enable a user to pack-and-label-and-send as just described. In this part of the RDR project Lloyd is writing an OwnCloud plugin which can be used to find, preview, describe, pack and send research data files from the working store to the Research Data Repository for archival storage (or, in the case of very large data sets, send links to the files).

We have written previously about a prototype application that does a lot of this already, but the OwnCloud version is promising because it is integrated with OwnCloud’s existing sharing and replication services, so Cr8it can take advantage of OwnCloud’s access control services.
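To make the pack-and-label steps more concrete, here is a minimal Python sketch of the idea: select files from a share by file type, attach some descriptive metadata, and bundle both into a single package. This is not Cr8it code (Cr8it is an OwnCloud plugin); the function names, the `manifest.json` layout and the choice of a zip file are assumptions made purely for illustration, and the ‘send’ step to the archival store is left out.

```python
import json
import zipfile
from pathlib import Path


def select_files(share_dir, suffixes):
    """Return files under share_dir whose extension is in suffixes, sorted."""
    root = Path(share_dir)
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


def build_crate(share_dir, suffixes, metadata, crate_path):
    """Pack the selected files plus a JSON manifest into one zip file.

    crate_path should sit outside share_dir so the crate is never
    selected as one of its own contents.
    """
    root = Path(share_dir)
    files = select_files(root, suffixes)
    manifest = {
        # Free-form description supplied by the researcher (hypothetical keys).
        "metadata": metadata,
        # Paths are stored relative to the share so the crate is portable.
        "files": [p.relative_to(root).as_posix() for p in files],
    }
    with zipfile.ZipFile(crate_path, "w", zipfile.ZIP_DEFLATED) as crate:
        for p in files:
            crate.write(p, arcname=p.relative_to(root).as_posix())
        crate.writestr("manifest.json", json.dumps(manifest, indent=2))
    return manifest
```

A call such as `build_crate("/path/to/share", [".csv"], {"title": "Example data set"}, "/path/to/crate.zip")` would package every CSV file on the share alongside its description. Selecting by directory or by search query would follow the same pattern, with a different filter inside `select_files`.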

What next?

Work is proceeding now on the three priorities mentioned above: integration with version control systems, file-sharing and synchronisation, and the Cr8it application for corralling files.

Beyond that, the future is less certain. The roadmap for eResearch at UWS, which is now more or less complete but yet to be approved by the eResearch Steering Committee, calls for a steady roll-out of:

  • More data capture applications at more sites, including research institutes and research groups.

  • Developing institute and school level data management plans following the lead of the Hawkesbury Institute for the Environment.

  • Further integrating data management services into the research lifecycle.

  • Improved integration with computing resources and collaboration tools.

  • Incremental improvements and upgrades to all existing services.

    1. For a quirky take on this, consider Les Orchard’s musing on how it treats him like he treats his pets. This is an interesting way to think about service provision:

    consider these pointers for being nice to animals:

    • Give them a reason to come to you. Don’t chase after and grab.

    • If they want to leave, let them. Don’t hold on and squeeze tight.

    • If you are allowed to pick them up, hold them gently yet offer enough support to make them feel safe.

    • Pay attention to their reactions, learn what kind of attention they like. This gives them a reason to come back when you let them leave.

    Les lives with bunnies, I live with a dog. With dogs you need to show them very explicitly where they rank in the family pack (i.e. below the humans). That’s not a strategy I’d recommend IT or eResearch staff take with your local institute director!

Creative Commons License
Research Data Repository (RDR) progress report, May 2013 by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.