By Peter Sefton
This document is a write-up from a presentation given by Peter Sefton to a round-table discussion on Metadata Stores projects hosted by Intersect, our NSW state eResearch organisation. It is probably of most interest to people involved in Australian National Data Service (ANDS) and NeCTAR, or similar projects in the area of research data management.
We’re posting it here on the UWS eResearch blog so we can get feedback from others in the community, including eResearch groups, data librarians, users of data capture applications, and so on, as well as participants from the meeting – as there was very limited time for discussion on the day.
Please use the comments. Did I misrepresent the way things are at your institution? Did I leave out any design patterns?
There are two parts.
A quick survey of architectural approaches to connecting Data Capture applications to Research Data Catalogues and central repositories. Data Capture applications are designed to manage research data as it comes off machines, out of experiments or is otherwise collected by researchers.
A short discussion of requirements for a Data Capture application that would run on a file store and allow researchers to create useful collections of data from any file-based data. We don’t want to accumulate petabytes of research data and not know what it all is, how long to keep it and so on. This follows up a couple of blog posts over on Peter Sefton’s site speculating about this kind of application.
Connecting Data Capture to Catalogues/Repositories
ANDS has funded new infrastructure in Australian research institutions via a number of programs. This document looks at how the software solutions produced by two of those programs talk to each other: how Data Capture applications feed data to what ANDS calls Metadata Stores. ANDS recommends that institutions maintain a central research data registry, which is where this term “Metadata Store” is used:
ANDS has already funded development of various metadata stores solutions (VIVO, ReDBox, TARDIS). Different metadata store solutions have different ranges of functionality and fill different niches. ANDS strongly recommends a central metadata hub for institutions to track their data collections as intellectual assets.
There doesn’t seem to be a single ANDS definition of the term Metadata Store, but for the purposes of this exercise we will concentrate on its role in storing descriptions of research data collections, and use the term we use at UWS for the part of the metadata store that covers the all-important data collections: Research Data Catalogue (RDC).
Not all institutions are building central RDCs – notably in Australia, Monash University works with a decentralised model where research groups have their own infrastructure with feeds directly to the national discovery service, Research Data Australia. Data Capture applications are owned by research groups. This model reflects the reality that researchers work across institutional boundaries.
Figure 1 The simplest model for hooking-up Data Capture to “Metadata Stores” – Leave out the central catalogue! OAI-PMH is a Protocol for Metadata Harvesting and RIF-CS is the ANDS format for metadata-about-data interchange
Most other institutions in Australia seem to be taking ANDS’s advice and installing central catalogues or registries (metadata stores if you must) of research data, and associated entities such as people, organisations, projects and grants. As part of the planning for our Research Data Catalogue project at UWS and community meetings hosted by Intersect NSW, I did a quick investigation via mailing lists and personal emails to find out how people are moving data from Data Capture applications to their central research data infrastructure. The following rather busy diagram shows some of the various ways that data and metadata are being shifted from one place to another.
This diagram is informed by real use-cases but as I put it together quickly it almost certainly has mistakes and omissions. I didn’t have time to do justice to all the information people sent me, but it would be good to spell out each case in detail and get some data on which patterns work for which kinds of data, and start to think about standards so Data Capture Application X can be hooked up to Catalogue Y without too much engineering. Maybe ANDS or Intersect or one of the other state eResearch bodies can help with this?
Figure 2 Some of the patterns for moving data from capture to data store and captured metadata to catalogue
The diagram shows systems as rectangles and interfaces as circles, with arrows showing which way metadata and data get moved. The Research Data Catalogue is where metadata goes, and the Research Data Store is where data goes. Data Capture sending metadata to the RDC is a push; the RDC requesting metadata from Data Capture is a pull. You’ll see there is a mixture of push and pull protocols in this diagram.
For example, the DC21 application used at UWS is like DC_A. The Catalogue periodically polls DC21 using OAI-PMH to ask for new metadata. If there is any, the Research Data Store then pulls the data via HTTP, the standard web protocol for fetching resources.
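The pull pattern can be sketched in a few lines of Python. This is only an illustration of what one polling cycle might look like, not the code DC21 or any catalogue actually runs. The base URL, the sample response and the function names are made up for the example, and the metadataPrefix value "rif" is an assumption about how the RIF-CS feed is named. The verb, parameter names and XML namespace come from the OAI-PMH 2.0 protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, metadata_prefix="rif", from_date=None):
    """Build a ListRecords request, optionally limited to changes since from_date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # date of the previous successful poll
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return identifiers of the records in a ListRecords response, skipping deleted ones."""
    root = ET.fromstring(response_xml)
    identifiers = []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue
        identifiers.append(header.findtext("oai:identifier", namespaces=OAI_NS))
    return identifiers


# Hypothetical response, trimmed to the parts this sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:dc21.example.org:1</identifier></header></record>
    <record><header status="deleted"><identifier>oai:dc21.example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

poll_url = build_list_records_url("http://dc21.example.org/oai", from_date="2012-08-01")
new_records = record_identifiers(SAMPLE_RESPONSE)
```

In a real harvester the catalogue would fetch poll_url on a schedule, and for each new record the data store would issue an ordinary HTTP GET for the data that the metadata points to.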
At the other end of the spectrum, applications like one Conal Tuohy told me about at La Trobe (similar to DC_F) use the SWORD push protocol, which is also shown in the diagram and is built on the Atom Publishing Protocol, to push both metadata and data in a single package (it does more than that, of course). There are also some mixed approaches like DC_B, where an application pushes metadata and data into a staging directory and both get pulled from there.
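A push deposit of this kind might look something like the following Python sketch. It is not the La Trobe implementation. The function name and the layout inside the zip (metadata.xml plus a data/ folder) are invented for illustration, and a real repository would define its own packaging. The Packaging and In-Progress headers and the SimpleZip packaging identifier are taken from the SWORD 2.0 profile.

```python
import io
import zipfile

# Packaging identifier for a plain zip file in the SWORD 2.0 profile.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def build_sword_deposit(metadata_xml, data_files, filename="deposit.zip"):
    """Bundle metadata and data into one zip; return (headers, body) for an HTTP POST."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        # Hypothetical layout: one metadata file plus a data/ folder.
        package.writestr("metadata.xml", metadata_xml)
        for name, content in data_files.items():
            package.writestr("data/" + name, content)
    body = buffer.getvalue()
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": "attachment; filename=" + filename,
        "Packaging": SIMPLE_ZIP,
        "In-Progress": "false",  # tells the server the deposit is complete
    }
    return headers, body


headers, body = build_sword_deposit(
    "<collection><title>Example collection</title></collection>",
    {"readings.csv": b"time,value\n0,1.5\n"},
)
```

The application would then POST body with these headers to the repository's collection URL, for example with urllib.request.Request(url, data=body, headers=headers, method="POST"). Keeping the packaging step separate from the network call makes it easy to inspect the package before anything is sent.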
One protocol not yet seen is OData – another AtomPub variant like SWORD tuned for data deposit.
Part 2: The gap: generic Data Capture for files
The second part of the discussion was about Data Capture for files that are being put straight onto a research data store. This follows on from a presentation I made previously, “File wrangling for researchers / Feral-data capture”, and a follow-up, “Watching the file watcher: more on capturing feral research data for long term curation”. These are just notes, but I hope to convene a meeting soon to start discussing how to meet these requirements. How do we make sense of the data accumulating in research data stores? We can’t automate everything for every new project (Data Capture apps run at around $100,000 to write).
At UWS we are continuing to explore what this kind of application would look like. We have a group of third-year computer science students working on a project in this area this semester.
So what do we need for a generic file-based DC app?
Requirements:
Dropbox.com style simplicity for basic collab;
Simple traditional-style storage
Easy sharing with collaborators
Simple support for identifying and describing ‘obvious’ collections like “everything in this directory”
Support for making collections from disparate resources such as linking videos to transcripts, or gathering all the data, scripts and generated graphs for an article.
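To make the ‘obvious’ collection case concrete, here is a rough Python sketch of how an application might describe “everything in this directory” as a single collection. It is an assumption about one possible approach, not a description of any existing tool. The function name and the shape of the returned record are invented for the example, and it does not attempt the harder case of linking disparate resources.

```python
import hashlib
import os


def describe_directory(root, title):
    """Describe every file under root as one collection, with sizes and checksums."""
    files = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-directories in a predictable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MB chunks so large data files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            files.append({
                "path": os.path.relpath(path, root).replace(os.sep, "/"),
                "bytes": os.path.getsize(path),
                "sha256": digest.hexdigest(),
            })
    return {"title": title, "files": files}
```

A record like this could be written next to the data or sent on to a catalogue. The checksums give later curators a way to confirm that the files are still the ones the researcher described.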
Drivers:
Backup! Researchers know they need it, often don’t have it.
Compliance with policy on data management, and funder mandates (at UWS this is being introduced via internal grants)
Publication-driven;
Publisher requires data
Researcher wants to do reproducible research
Citable data (maybe, but we need a culture of data citation to drive the practice of data-citation)
I suggest that we start working with researchers who want to publish data collections to go with journal publications; they are motivated to get this done, in many cases by journal requirements.
What to do?
1. Is there an existing web application that runs over the top of a data store and that we can build on? (There’s one at the University of Sydney that I hope to get a demo of soon.)
2. Depending on the answer to (1), is there support for building or adapting a storage-coupled Data Capture app as part of the Metadata Stores project being run right now at Australian institutions?
Comments?
Figure 3 Remember, capturing stuff is one thing, but once it’s caught you need to figure out what to do with it.
[Update 2012-08-07]
If anyone feels moved to draw their own diagram of their data capture app and how it connects to a catalogue/RDA, you can do so using PlantUML, an open source UML diagramming app. There is an online form, http://www.plantuml.com/plantuml/form, where you can type in source like the component diagram (http://plantuml.sourceforge.net/component.html) I used for Figure 2 above:
@startuml
() "OAI-PMH + RIF-CS" as OAIPMH
() "Curated -OAI-PMH + RIF-CS" as OAIPMH1
() "Staging area" as DB
() "Atom Feed" as Atom
() "Atom Publishing Protocol" as Atompub
() "File copy" as cp
() HTTP
() "Web form" as web
() SWORD
package "Data Capture Apps" {
[Web upload] --> web
[DC_A] <-- OAIPMH
[DC_B] --> DB
[DC_C] <-- OAIPMH
[DC_D] <-- Atom
[DC_E] --> Atompub
[DC_F] --> SWORD
}
component "Research Data Australia" as RDA
package "Research Data Repository" {
component "Research Data Catalogue" as RDC
component "Research Data Store" as RDS
}
DB <-- RDC
web --> RDC
OAIPMH <-- RDC
Atom <-- RDC
Atompub --> RDC
SWORD --> RDC
SWORD --> RDS
RDC -> OAIPMH1
OAIPMH1 -> RDA
RDS <-- web
RDS --> HTTP
RDS --> DB
HTTP --> DC_A
RDS --> cp
cp --> DC_B
@enduml
Copyright Peter Sefton, 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>