Potential Research Data Repository data-management use-cases. For discussion.

By Peter Sefton and Toby O’Hara

This post looks at some of the ways that researchers and the staff who support them might interact with the new Research Data Repository (RDR) we are building at the University of Western Sydney (UWS). This is a document for discussion. As implementation manager, Toby has consulted with several stakeholders on how the RDR will be used, and we’d like feedback, preferably via the comments on this blog, or by email if that suits you better. This post will be of most interest to those involved in the RDR project at UWS, to those currently engaged in setting up or improving an institutional research data repository, and to the potential ‘customers’ of a data repository.

And it might be of interest to the anonymous blogger the Library Loon, who doesn’t approve of most research data management planning. She points out:

Most data-management models (the one she is thinking of no exception) map extremely poorly to the research-project cycles and timelines that researchers are accustomed to. The milestones researchers think about—grant applications, awards, data capture, data analysis, interim-report writing, article authoring, renewal applications, and so forth—barely appear in data-management models.

http://gavialib.com/2012/08/data-lifecycles-versus-research-lifecycles/

What we’re going to talk about here is quite similar to what I think is happening at many Australian institutions. A lot of the thinking here follows the lead of Vicki Picasso’s team at Newcastle and their work on building an institutional research data catalogue that responds to institutional triggers, including grant applications and awards.

The goal of our RDR project is to provide the benefits we described before: backed-up, well-described, reusable data that fosters collaboration and data citation, not to mention keeping funding bodies happy; all with the least possible negative impact on the research community. To accomplish this, the RDR project aims to fit in with the existing research lifecycle. Below, we outline several scenarios where various participants interact with the RDR. If these scenarios make sense after further consultation, we will use them to inform our data management planning at the university.

A word about the diagrams

The diagrams in this post all have slightly differing levels of detail – the idea is to illustrate each point once, rather than repeat things. A few things to keep in mind when reading the diagrams:

  • Research data and the methods for producing data come in all varieties. The diagrams are necessarily simple, so as not to exclude data types, or different ways of handling data.

  • While not all the diagrams include a data management plan, plans will be part of our new repository workflows; the intention is to start with human-readable plans and then progress to machine-driven planning, for example making it possible for research management systems to send reminders when data deposits are due.

  • The Research Data Repository has two components. The Research Data Store (RDS) is for storing the data itself, in groups, packages, notebooks, bundles, or “collections”. The Research Data Catalogue (RDC) is for storing descriptions of the data, and these descriptions tie back to the data, which is deposited in the RDS or in trusted stores elsewhere.

The library is responsible for managing the Research Data Catalogue component of the RDR. In each case, there is a loop back from a data librarian to the researcher whenever a data description needs to be improved, to maximise its ability to advertise the existence of the data, promote re-use and assist in data management.

Information from the university’s research management systems about researchers, grants and publications enriches the RDR and triggers notifications, either to the RDR itself or to a designated librarian mailbox.

As mentioned elsewhere in this blog, there is an overarching objective to raise awareness of what data has been collected and may be available for reuse. One of the ways this can happen is by sending the data descriptions to the ANDS Research Data Australia service (the pink boxes, below).
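
To make the overall shape concrete, here is a rough sketch in PlantUML notation (the notation we use for diagram sources elsewhere on this blog) of the parts just described. The component names follow the terminology of this post; the connections are illustrative only, not a finalised design.

@startuml
' Overview sketch: names follow the post, connections are illustrative
component "Research Management Systems" as RMS
actor Researcher
actor "Data Librarian" as DL
package "Research Data Repository" {
  component "Research Data Catalogue" as RDC
  component "Research Data Store" as RDS
}
component "ANDS Research Data Australia" as RDA

RMS --> RDC : grant and publication information, triggers
RMS --> DL : notifications (librarian mailbox)
Researcher --> RDS : deposit data
Researcher --> RDC : describe data
DL --> Researcher : loop back to improve descriptions
RDC --> RDS : descriptions tie back to deposited data
RDC --> RDA : advertise data descriptions
@enduml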

Library-led deposit

The first use case, library-led deposit, deals with:

  • historical research completed some time ago,

  • research that has recently completed, or

  • research that is ongoing, but has been flagged as a potential candidate for inclusion in the repository for strategic reasons.

Our first scenario is already happening: as a way of introducing the concept of a Research Data Catalogue to the university and getting experience in managing one, the library is leading a project funded by the Australian National Data Service (ANDS), known as “Seeding the Commons”. This project involves the Library team identifying places where data might reside, via discussions with research administration, searching for grants awarded and looking at UWS’s publications.

The Seeding the Commons program is a way of getting some experience with the business of describing research data and relating it to other entities such as people, organisational units, research projects and grants. Many of the processes we’re going through at this stage are not scalable to the whole enterprise over the long term; a librarian will not be this directly involved in every deposit of data in the future, though the library does figure in all the stories we’re telling in this post and the next about how the RDR will work. Nor will we make the ‘build it and they will come’ mistake of expecting researchers to start spontaneously depositing data.
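
A minimal PlantUML sketch of this scenario; the participants follow the description above and the message names are ours and illustrative only.

@startuml
actor Researcher
participant "Data Librarian" as DL
participant "Research Admin" as RA
participant "Research Data Catalogue" as RDC
participant "Research Data Australia" as RDA

DL -> RA : discuss where data might reside
DL -> DL : search grants awarded and UWS publications
DL -> Researcher : ask about candidate data
Researcher --> DL : location, description, access conditions
DL -> RDC : create collection description
loop until the description can properly advertise the data
  DL -> Researcher : request improvements to the description
  Researcher --> DL : revised description
end
RDC -> RDA : publish description
@enduml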

Data Capture deposits

A lot of research data is generated by machines such as sensors or instruments, and can be captured, labelled and described close to the data source so that people other than the researchers who collected it know what it is and how it might be reused. ANDS has funded a number of projects in this area under its Data Capture stream. We have one of these projects at UWS, being built for the Hawkesbury Institute for the Environment: it will capture data coming from large forest-based experiments, house it in a working data store where researchers can interrogate it and work with it, then send it to the Research Data Repository for archiving. In this workflow there will be some automated data processing; once a certain class of data is well described it can flow from a facility automatically (a month’s worth of weather data does not need a research librarian to describe it every month), but for some other scenarios, particularly compiling a data set to support a publication (covered below), humans will still be involved in selecting and describing data.

There is an additional level of detail in this diagram that is not in the library-led process above: it makes explicit the fact that the RDR encompasses both the storage and the catalogue. The storage is where the data is deposited; the catalogue is where the descriptions of the data are kept and (where permissible) shared with other systems.
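
A hedged sketch of this flow, using the HIE project as the example; the split between automated and human-mediated deposit follows the weather-data and publication examples above, and the message names are illustrative.

@startuml
participant "Instruments / sensors" as Inst
participant "Data Capture app" as DC
database "Working Data Store" as WDS
database "Research Data Store" as RDS
participant "Research Data Catalogue" as RDC

Inst -> DC : raw experimental data
DC -> WDS : captured, labelled, described data
alt routine, well-described data (e.g. monthly weather data)
  WDS -> RDS : automated deposit for archiving
  WDS -> RDC : automated metadata feed
else data set compiled to support a publication
  WDS -> RDC : human-selected and described collection
  WDS -> RDS : deposit the selected data
end
@enduml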

Publishing/citation-driven deposit

We don’t need to point out that publishing is one of the key parts of the research life cycle. In our initial work on the UWS Research Data Repository, we have been approached by researchers wanting to deposit data somewhere accessible (anywhere!) because it is required by the journal to which they’re submitting. This will become more and more important as funders start to mandate open access data along with open access publications, and the scholarly process (not just the scholarly communications process) re-forms itself. Note that in this scenario a DOI – a Digital Object Identifier – is created for a data set so it can be cited like a publication. This is not hugely important to researchers yet, but we’re betting that it will rapidly become so as systems to collect data citation metrics come online and, almost certainly, start counting towards government reporting.
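
A sketch of the publication-driven scenario. Which service actually mints the DOI is an implementation detail we have not settled, so treat the “DOI minting service” participant as an assumption; the rest follows the description above.

@startuml
actor Researcher
database "Research Data Store" as RDS
participant "Research Data Catalogue" as RDC
participant "DOI minting service" as DOI
participant Journal

Researcher -> RDS : deposit data set supporting the article
Researcher -> RDC : describe the data set
RDC -> DOI : request DOI for the collection
DOI --> RDC : DOI
RDC --> Researcher : citable identifier for the data
Researcher -> Journal : submit article citing the data DOI
@enduml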

The Library Loon’s timely post about making sure that data management plans align with research practice matches our thinking on this. We’re working with researchers at the Hawkesbury Institute for the Environment on how data capture and repository systems need to be configured to support their publications, for example where they use R to fetch data, clean it, run models, then generate the figures for an article. More on that soon.

Grant-driven data deposit

Grants are key to the research lifecycle. At the University of Western Sydney, work has already been done on integrating data management into the research lifecycle, starting with applications for internal grants. We know that changing the research culture will take some time, but eventually thinking about eResearch requirements (not just data management, but computing and collaboration needs) will become normal for all researchers, just as ethics forms are for many now.

We present two scenarios here: one when a grant starts and the researcher is prompted to finish and deposit a data management plan, and another when the grant finishes and there is a check to make sure the data management plan has been followed. In between, of course, there might be other research-lifecycle events that trigger data deposits.
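
Sketched in sequence form, with dividers separating the start and end of the grant; the triggers come from the research management systems mentioned earlier, and the message names are illustrative.

@startuml
participant "Research Management System" as RMS
actor Researcher
participant "Research Data Catalogue" as RDC
database "Research Data Store" as RDS

== Grant starts ==
RMS -> Researcher : prompt to finish the data management plan
Researcher -> RDC : deposit data management plan

== During the project ==
RMS -> Researcher : reminders at other lifecycle events
Researcher -> RDS : data deposits as they fall due

== Grant finishes ==
RMS -> RDC : trigger end-of-grant check
RDC --> Researcher : confirm the plan has been followed
@enduml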

Reporting-driven data deposit

In Australia one of the key drivers, and a driver with a very heavy right foot, is the government. All universities have to report on publications via HERDC and on research excellence via ERA, which means there are reporting processes in place that form a large part of the research lifecycle. This is another place to tie in data management processes for data of strategic significance. The next diagram shows a couple of scenarios that could be driven by the reporting cycle: either a significant publication, where it is important to make sure that data is kept for reproducibility and reuse, or reporting on research of global significance, where advertising data might lead to more such research.
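
And a sketch of the reporting-driven scenarios; again, the actors and message names are illustrative rather than a settled workflow.

@startuml
participant "Reporting cycle (HERDC / ERA)" as Rep
participant "Data Librarian" as DL
actor Researcher
database "Research Data Store" as RDS
participant "Research Data Catalogue" as RDC
participant "Research Data Australia" as RDA

Rep -> DL : flag significant publication or research
DL -> Researcher : request deposit of the supporting data
Researcher -> RDS : deposit data for reproducibility and reuse
Researcher -> RDC : describe the collection
RDC -> RDA : advertise the data
@enduml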

Talk back

We recognise that there’s room for improvement in each of the scenarios above, and not just because we’ve kept the sequences high-level and skipped over some intricacies. We hope this will drive some discussion and exploration of options. Perhaps your university is also implementing similar processes and technologies; if so, please feel free to use us as a sounding board. If there is sufficient interest in any of the scenarios above, we’re happy to organise a forum for discussion.

Copyright  Peter Sefton  and Toby O’Hara 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

My trip to Open Repositories 2012, Edinburgh

By Peter Sefton

In the week of 9th – 13th of July I attended the Open Repositories 2012 conference in Edinburgh. I’m on the organising committee, taking a special interest in the ‘developer challenge’ event, which has become an important strand of the conference DNA. I was chair of the judging panel, and tried to help Mahendra Mahey of UKOLN and team encourage entrants, provide feedback on ideas and so on. That’s a contributing factor in my not getting to that many sessions, but I still came away with a good sense of what repository developers are thinking about and working on. I’ve had quite a few exciting discussions about what we can do for Open Repositories 2013 at Prince Edward Island.

The good news is that the sessions are online in video form, so I can go back to the ones that I wanted to see and missed, particularly the session on name and data identifiers. If the ORCID ID system works, then that will be a very good thing; watch the presentation for why. Simeon Warner makes the point that if repositories want to play an important role in scholarship then they need to engage with ORCID. I think this is a strong argument for repositories getting involved early, rather than sitting back and waiting to see if ORCID fails, which it might if everyone sits back and waits. But it might not fail, and then the repositories that stay out will be marginalised in a world of research metrics glued together by ORCID IDs.

It’s a bit difficult to correlate sessions from the program with the YouTube recordings, but you can search for the ones of interest.

It’s all about Research Data

OK, so it’s not all about data, but overall the conference continued the trend of increasing attention on Research Data Repositories (RDRs) and the research process. And that’s not just me, that’s the data speaking (although the data are of dubious quality). Peter Burnhill wrapped up proceedings with a talk illustrated by a tweet-driven word-cloud, built from tweets about the conference. Data was big, along with the obvious terms like open and repository.

Figure 1 Image via http://cdrs.columbia.edu/cdrsmain/2012/07/the-news-from-edinburgh-and-open-repositories-2012/ attributed to Adam Field

When the conference first started in Sydney 7 years ago it was mostly about Institutional Publications Repositories (IPRs), with some discussion about how they might be better integrated into research processes such as data collection and publishing. Back in those days presentations even started out with helpful definitions of the word repository. IPRs are still with us, of course, but the discussion of integration with scholarly processes of all kinds has moved from “we should” to “we are”.

There are now a lot more general digital-library-type systems being discussed too. The change in approach as we move beyond IPRs to RDRs was the topic of a presentation that I gave in the short-papers stream at the conference, prepared with four other Australians from three other institutions. I have posted the presentation, including the speaker notes, on my blog. By the way, this presentation won best-in-session.

Research data management had a couple of sessions, including this one with an appearance by Natasha Simons from Griffith (about six minutes in) talking about their Research Hub, and a very different perspective from Anthony Beitz at Monash (at about 58 minutes), where there is no hub. Anthony makes the point that a single research data management system is never going to suit all researchers and emphasises the importance of dealing with research communities. In between is Sally Rumsey talking about Oxford’s developing institutional approach. All three are worth watching for those of us implementing research data systems, whether centralised or not.

The challenge

One of the big words in the word cloud is “challenge”, which reflects the amount of effort that JISC put into promoting the developer challenge. I’ll say a bit about this year’s winners, using the notes I gave to Mahendra.

The winners

The winning entries in the developer competition both came from the data side. The runners-up, Keith Gilbertson and Linda Newman, showed an idea for a small, simple mobile app to capture video or audio and deposit it to a repository, but with a twist: the option to send it to a machine or human transcription service. The idea of using Microsoft’s speech conversion service got this one the special Microsoft prize, some .NET Gadgeteer hardware.

Patrick McSweeney’s winning entry “Data Engine” tackled research data management by bringing useful tools for data wrangling and visualisation into the repository. Patrick picked the challenges facing one PhD student in Engineering to illustrate a prevalent problem: a lack of generic tools for managing tabular data. There’s plenty of action in this space at the moment – for example, the DC21 Data Capture app being developed for the Hawkesbury Institute for the Environment at UWS has some things in common with Pat’s app, as does Orbital, which Nick Jackson and Joss Winn from Lincoln talked about in the ‘non-traditional content’ session, where I guess ‘traditional’ means PDF versions of papers.

I am hoping that Pat can leverage his win to make it out to Australia for eResearch Australasia, and we can get him talking with some of our local developers.

From the runners-up I was very excited by “Is this research readable?”, an idea by Cameron Neylon, implemented by Ben O’Steen. The idea is to get some hard data on the accessibility of research: take a statistically significant number of research articles via a randomly generated set of DOIs and get people world-wide, in different places and on different networks, to report whether they can access them. This entry is a very important contribution to the ongoing debate about the evolution of scholarly publishing. If implemented fully it will allow a global, crowd-sourced, statistically significant survey of how much of the online scholarly record is actually accessible to various people in various parts of the world, a topic about which much has been said but for which we have very little hard data. I hope that Cameron, along with his employer PLOS, will continue this important campaign – perhaps declaring a world-wide ‘Is this research readable’ month later in 2012.

Back home

I’ll be feeding insights and information from the conference into the various projects we have running at UWS and talking with the library systems team about the latest in Fedora Commons repositories. We’re running the Fedora-compatible ReDBox system as part of our Research Data Repository infrastructure. There was some interesting discussion in the Fedora Commons session about how to align the way various applications can play nicely together; this is definitely something the Australian community should get involved in.

Figure 2 Me, chairing a session, photo by Jonathan Markow from the DuraSpace Foundation

Copyright  Peter Sefton, 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

Connecting Data Capture applications to Research Data Catalogues/Registries/Stores

By Peter Sefton

This document is a write-up from a presentation given by Peter Sefton to a round-table discussion on Metadata Stores projects hosted by Intersect, our NSW state eResearch organisation. It is probably of most interest to people involved in Australian National Data Service (ANDS) and NeCTAR, or similar projects in the area of research data management.

We’re posting it here on the UWS eResearch blog so we can get feedback from others in the community, including eResearch groups, data librarians, users of data capture applications, and so on, as well as participants from the meeting – as there was very limited time for discussion on the day.

Please use the comments. Did I misrepresent the way things are at your institution? Did I leave out any design patterns?

There are two parts.

  1. A quick survey of architectural approaches to connecting Data Capture applications to Research Data Catalogues and central repositories. Data Capture applications are designed to manage research data as it comes off machines, out of experiments or is otherwise collected by researchers.

  2. A short discussion of requirements for a Data Capture application that would run on a file store and allow researchers to create useful collections from any file-based data. We don’t want to accumulate petabytes of research data and not know what it all is, how long to keep it, and so on. This follows up a couple of blog posts over on Peter Sefton’s site speculating about this kind of application.

Connecting Data Capture to Catalogues/Repositories

ANDS has funded new infrastructure in Australian research institutions via a number of programs. This document looks at how the software solutions produced by two of those programs talk to each other: how Data Capture applications feed data to what ANDS calls Metadata Stores. ANDS recommends that institutions maintain a central research data registry, which is where the term “Metadata Store” is used:

ANDS has already funded development of various metadata stores solutions (VIVO, ReDBox, TARDIS). Different metadata store solutions have different ranges of functionality and fill different niches. ANDS strongly recommends a central metadata hub for institutions to track their data collections as intellectual assets. 

http://metadata-stores.blogspot.com.au/p/solutions.html

There doesn’t seem to be a single ANDS definition of the term Metadata Store, but for the purposes of this exercise we will concentrate on its role in storing descriptions of research data collections, and use the term we use at UWS for the part of the metadata store that covers the all-important data collections: the Research Data Catalogue (RDC).

Not all institutions are building central RDCs. Notably, in Australia, Monash University works with a decentralised model where research groups have their own infrastructure with feeds directly to the national discovery service, Research Data Australia, and Data Capture applications are owned by research groups. This model reflects the reality that researchers work across institutional boundaries.

Figure 1 The simplest model for hooking up Data Capture to “Metadata Stores” – leave out the central catalogue! OAI-PMH is the Open Archives Initiative Protocol for Metadata Harvesting and RIF-CS is the ANDS format for metadata-about-data interchange

Most other institutions in Australia seem to be taking ANDS’s advice and installing central catalogues or registries (metadata stores if you must) of research data and associated entities such as people, organisations, projects and grants. As part of the planning for our Research Data Catalogue project at UWS, and for community meetings hosted by Intersect NSW, I did a quick investigation via mailing lists and personal emails to find out how people are moving data from Data Capture applications to their central research data infrastructure. The following rather busy diagram shows some of the various ways that data and metadata are being shifted from one place to another.

This diagram is informed by real use-cases, but as I put it together quickly it almost certainly has mistakes and omissions. I didn’t have time to do justice to all the information people sent me, but it would be good to spell out each case in detail, get some data on which patterns work for which kinds of data, and start to think about standards so that Data Capture Application X can be hooked up to Catalogue Y without too much engineering. Maybe ANDS or Intersect or one of the other state eResearch bodies can help with this?

Figure 2 Some of the patterns for moving data from capture to data store and captured metadata to catalogue

The diagram shows systems as rectangles and interfaces as circles, with arrows showing which way metadata and data get moved. The Research Data Catalogue is where metadata goes, and the Research Data Store is where data goes. Data Capture to RDC is a push; RDC requesting metadata from Data Capture is a pull. You’ll see a mixture of push and pull protocols in the diagram.

For example, the DC21 application used at UWS is like DC_A. The catalogue periodically pull-polls DC21 using OAI-PMH to ask for new metadata. If there is any, the Research Data Store pulls the data via HTTP, the standard web protocol for fetching resources.
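
As a sequence sketch, that pull pattern looks something like this. ListRecords is the standard OAI-PMH verb for harvesting; the polling interval and error handling are left out, and the message names are otherwise illustrative.

@startuml
participant "Research Data Catalogue" as RDC
participant "DC21 (DC_A)" as DC
database "Research Data Store" as RDS

loop periodically
  RDC -> DC : OAI-PMH ListRecords (RIF-CS)
  DC --> RDC : new metadata records, if any
  alt new records describe data
    RDS -> DC : HTTP GET for each data file referenced
    DC --> RDS : data
  end
end
@enduml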

At the other end of the spectrum, applications like the one Conal Tuohy told me about at La Trobe (similar to DC_F) use the SWORD push protocol, which is built on the Atom Publishing Protocol and is also shown in the diagram, to push both metadata and data in a single package (SWORD does more than that, of course). There are also some instances of mixed approaches, like DC_B, where an application pushes metadata and data into a staging directory and both get pulled from there.
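
The DC_F-style SWORD deposit is roughly the mirror image. The deposit receipt is standard SWORD behaviour; how the package is split between catalogue and store is an implementation detail, so this is a sketch rather than a specification.

@startuml
participant "Data Capture app (DC_F)" as DC
participant "SWORD endpoint" as SW
participant "Research Data Catalogue" as RDC
database "Research Data Store" as RDS

DC -> SW : push package (metadata + data, AtomPub-based)
SW -> RDC : metadata from the package
SW -> RDS : data from the package
SW --> DC : deposit receipt
@enduml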

One protocol not yet seen is OData – another AtomPub variant, like SWORD, tuned for data deposit.

Part 2: The gap: generic Data Capture for files

The second part of the discussion was about Data Capture for files that are being put straight onto a research data store. This follows on from a presentation I made previously, “File wrangling for researchers / Feral-data capture”, and a follow-up, “Watching the file watcher: more on capturing feral research data for long term curation”. These are just notes, but I hope to convene a meeting soon to start discussing how to meet these requirements. How do we make sense of the data accumulating in research data stores? We can’t automate everything for every new project (Data Capture apps run at around $100,000 to write).

At UWS we are continuing to explore what this kind of application would look like. We have a group of third-year computer science students working on a project in this area this semester.

So what do we need for a generic file-based DC app?

Requirements:

  • Dropbox.com-style simplicity for basic collaboration:

    • Simple traditional-style storage

    • Easy sharing with collaborators

  • Simple support for identifying and describing ‘obvious’ collections like “everything in this directory”

  • Support for making collections from disparate resources such as linking videos to transcripts, or gathering all the data, scripts and generated graphs for an article (see the sketch after this list).
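
Pulling those requirements together, here is a hedged sketch of how such a generic app might behave. Everything here is hypothetical; no such application exists at UWS yet, and the participants and messages are ours.

@startuml
actor Researcher
participant "Generic file-based DC app" as App
database "Research Data Store" as RDS
participant "Research Data Catalogue" as RDC

Researcher -> RDS : copy or sync files (Dropbox-style)
App -> RDS : watch for new and changed files
Researcher -> App : mark "everything in this directory" as a collection
Researcher -> App : link disparate items (video to transcript, data to scripts and graphs)
App -> RDC : describe and register the collection
RDC --> Researcher : prompt for a better description if needed
@enduml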

Drivers:

  • Backup! Researchers know they need it, but often don’t have it.

  • Compliance with policy on data management, and funder mandates (at UWS this is being introduced via internal grants)

  • Publication-driven:

    • Publisher requires data

    • Researcher wants to do reproducible research

    • Citable data (maybe, but we need a culture of data citation to drive the practice)

I suggest that we start working with researchers who want to publish data collections to go with journal publications; they are motivated to get this done, in many cases by journal requirements.

What to do?

  1. Is there an existing web application, running over the top of a data store, that we can build on? (There’s one at the University of Sydney that I hope to get a demo of soon.)

  2. And, depending on the answer to (1), is there support for building or adapting a storage-coupled data capture app as part of the Metadata Stores projects being run right now at Australian institutions?

Comments?

Figure 3 Remember, capturing stuff is one thing, but once it’s caught you need to figure out what to do with it.

[Update 2012-08-07]

If anyone feels moved to draw their own diagram of their data capture app and how it connects to a catalogue/RDA, you can do so using PlantUML, an open-source UML diagramming tool. There is an online form at http://www.plantuml.com/plantuml/form where you can type in source like the component diagram (http://plantuml.sourceforge.net/component.html) I used for Figure 2 above:

@startuml
' Interfaces (circles): the protocols and mechanisms for moving metadata and data
() "OAI-PMH + RIF-CS" as OAIPMH
() "Curated OAI-PMH + RIF-CS" as OAIPMH1
() "Staging area" as DB
() "Atom Feed" as Atom
() "Atom Publishing Protocol" as Atompub
() "File copy" as cp
() HTTP
() "Web form" as web
() SWORD

' Each Data Capture app exposes or consumes one of the interfaces above
package "Data Capture Apps" {
 [Web upload] --> web
 [DC_A] <-- OAIPMH
 [DC_B] --> DB
 [DC_C] <-- OAIPMH
 [DC_D] <-- Atom
 [DC_E] --> Atompub
 [DC_F] --> SWORD
}

component "Research Data Australia" as RDA

package "Research Data Repository" {
   component "Research Data Catalogue" as RDC
   component "Research Data Store" as RDS
}

' Metadata flowing into the catalogue, by pull and by push
DB <-- RDC
web --> RDC
OAIPMH <-- RDC
Atom <-- RDC
Atompub --> RDC
SWORD --> RDC
SWORD --> RDS

' Curated metadata flowing out to the national discovery service
RDC -> OAIPMH1
OAIPMH1 -> RDA

' Data flowing into the store
RDS <-- web
RDS --> HTTP
RDS --> DB
HTTP --> DC_A
RDS --> cp
cp --> DC_B
@enduml

Copyright  Peter Sefton, 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>