Connecting Data Capture applications to Research Data Catalogues/Registries/Stores

By Peter Sefton

This document is a write-up from a presentation given by Peter Sefton to a round-table discussion on Metadata Stores projects hosted by Intersect, our NSW state eResearch organisation. It is probably of most interest to people involved in Australian National Data Service (ANDS) and NeCTAR, or similar projects in the area of research data management.

We’re posting it here on the UWS eResearch blog so we can get feedback from others in the community, including eResearch groups, data librarians, users of data capture applications, and so on, as well as participants from the meeting – as there was very limited time for discussion on the day.

Please use the comments. Did I misrepresent the way things are at your institution? Did I leave out any design patterns?

There are two parts.

  1. A quick survey of architectural approaches to connecting Data Capture applications to Research Data Catalogues and central repositories. Data Capture applications are designed to manage research data as it comes off machines, out of experiments or is otherwise collected by researchers.

  2. A short discussion of requirements for a Data Capture application that would run on a file store and allow researchers to create useful collections from any file-based data. We don’t want to accumulate petabytes of research data and not know what it all is, how long to keep it, and so on. This follows up a couple of blog posts over on Peter Sefton’s site speculating about this kind of application.

Connecting Data Capture to Catalogues/Repositories

ANDS has funded new infrastructure in Australian research institutions via a number of programs. This document looks at how the software solutions produced by two of those programs talk to each other: how Data Capture applications feed data to what ANDS calls Metadata Stores. ANDS recommends that institutions maintain a central research data registry, which is where the term “Metadata Store” comes in:

ANDS has already funded development of various metadata stores solutions (VIVO, ReDBox, TARDIS). Different metadata store solutions have different ranges of functionality and fill different niches. ANDS strongly recommends a central metadata hub for institutions to track their data collections as intellectual assets. 

http://metadata-stores.blogspot.com.au/p/solutions.html

There doesn’t seem to be a single ANDS definition of the term Metadata Store, but for the purposes of this exercise we will concentrate on its role in storing descriptions of research data collections, and use the term we use at UWS for the part of the metadata store that covers the all-important data collections: Research Data Catalogue (RDC).

Not all institutions are building central RDCs – notably in Australia, Monash University works with a decentralised model in which research groups run their own infrastructure, feeding directly to the national discovery service, Research Data Australia. In that model Data Capture applications are owned by research groups, reflecting the reality that researchers work across institutional boundaries.

Figure 1 The simplest model for hooking-up Data Capture to “Metadata Stores” – Leave out the central catalogue! OAI-PMH is the Open Archives Initiative Protocol for Metadata Harvesting and RIF-CS is the ANDS format for metadata-about-data interchange

Most other institutions in Australia seem to be taking ANDS’s advice and installing central catalogues or registries (metadata stores if you must) of research data and associated entities such as people, organisations, projects and grants. As part of the planning for our Research Data Catalogue project at UWS, and for community meetings hosted by Intersect NSW, I did a quick investigation via mailing lists and personal emails to find out how people are moving data from Data Capture applications to their central research data infrastructure. The following rather busy diagram shows some of the various ways that data and metadata are being shifted from one place to another.

This diagram is informed by real use-cases but as I put it together quickly it almost certainly has mistakes and omissions. I didn’t have time to do justice to all the information people sent me, but it would be good to spell out each case in detail, gather some data on which patterns work for which kinds of data, and start to think about standards so that Data Capture Application X can be hooked up to Catalogue Y without too much engineering. Maybe ANDS or Intersect or one of the other state eResearch bodies can help with this?

Figure 2 Some of the patterns for moving data from capture to data store and captured metadata to catalogue

The diagram shows systems as rectangles and interfaces as circles, with arrows showing which way metadata and data move. The Research Data Catalogue is where metadata goes, and the Research Data Store is where data goes. Data Capture sending metadata to the RDC is a push; the RDC requesting metadata from Data Capture is a pull. You’ll see there is a mixture of push and pull protocols in the diagram.

For example, the DC21 application used at UWS is like DC_A: the Catalogue periodically pull-polls DC21 using OAI-PMH to ask for new metadata, and if there is any, the Research Data Store pulls the data via HTTP, the standard web protocol for fetching resources.
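
To make the pull pattern concrete, here is a minimal Python sketch of the catalogue side – not the actual DC21 or ReDBox code. It polls an assumed OAI-PMH endpoint for new records, and a second function shows the data-store side pulling a data file over plain HTTP. The endpoint URL and the metadata prefix are assumptions for illustration only.

import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical endpoint and metadata prefix -- real values depend on how the
# Data Capture app exposes its OAI-PMH feed
OAI_ENDPOINT = "https://dc21.example.edu.au/oai"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def harvest_new_records(from_date):
    """Pull-poll the Data Capture app for metadata records added since from_date."""
    url = (OAI_ENDPOINT
           + "?verb=ListRecords&metadataPrefix=rif"
           + "&from=" + from_date)
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    # Each <record> wraps a RIF-CS description of one data collection
    return tree.findall(".//oai:record", OAI_NS)

def pull_data_file(data_url, local_path):
    """The Research Data Store side: fetch the data itself over plain HTTP."""
    with urllib.request.urlopen(data_url) as response, open(local_path, "wb") as out:
        out.write(response.read())

for record in harvest_new_records("2012-08-01"):
    print(ET.tostring(record, encoding="unicode")[:200])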

At the other end of the spectrum, applications like the one Conal Tuohy told me about at La Trobe (similar to DC_F) use the SWORD push protocol, which is built on the Atom Publishing Protocol and also shown in the diagram, to push both metadata and data in a single package (SWORD does more than that, of course). There are also some mixed approaches like DC_B, where an application pushes metadata and data into a staging directory and both get pulled from there.
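
For comparison, here is an equally minimal sketch of a SWORD-style push like DC_F, assuming a SWORD 1.x deposit endpoint. The collection URI, credentials and filename are made up for illustration; a real client would take them from the target repository’s SWORD service document.

import base64
import urllib.request

# Hypothetical SWORD collection URI on the Research Data Repository
SWORD_COLLECTION = "https://rdc.example.edu.au/sword/deposit/datasets"

def sword_deposit(package_path, username, password):
    """Push metadata and data to the repository as one zipped package."""
    with open(package_path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(SWORD_COLLECTION, data=body, method="POST")
    request.add_header("Content-Type", "application/zip")
    request.add_header("Content-Disposition", "filename=" + package_path)
    # SWORD 1.3 packaging header; the value identifies how the package is laid out
    request.add_header("X-Packaging", "http://purl.org/net/sword-types/METSDSpaceSIP")
    credentials = base64.b64encode((username + ":" + password).encode()).decode()
    request.add_header("Authorization", "Basic " + credentials)
    with urllib.request.urlopen(request) as response:
        return response.read()  # Atom entry describing the new deposit

# Example (invented credentials): sword_deposit("dataset_package.zip", "depositor", "secret")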

One protocol not yet seen is OData – another AtomPub-derived protocol that, like SWORD, could be used for data deposit.

Part 2: The gap: generic Data Capture for files

The second part of the discussion was about Data Capture for files that are being put straight onto a research data store. This follows on from a presentation I made previously, “File wrangling for researchers / Feral-data capture”, and a follow-up post, “Watching the file watcher: more on capturing feral research data for long term curation”. These are just notes, but I hope to convene a meeting soon to start discussing how to meet these requirements. How do we make sense of the data accumulating in research data stores? We can’t automate everything for every new project (Data Capture apps cost around $100,000 each to write).

At UWS we are continuing to explore what this kind of application would look like. We have a group of third-year computer science students working on a project in this area this semester.

So what do we need for a generic file-based DC app?

Requirements:

  • Dropbox.com style simplicity for basic collaboration;

    • Simple traditional-style storage

    • Easy sharing with collaborators

  • Simple support for identifying and describing ‘obvious’ collections like “everything in this directory”

  • Support for making collections from disparate resources such as linking videos to transcripts, or gathering all the data, scripts and generated graphs for an article.

Drivers:

  • Backup! Researchers know they need it, often don’t have it.

  • Compliance with policy on data management, and funder mandates (at UWS this is being introduced via internal grants)

  • Publication-driven;

    • Publisher requires data

    • Researcher wants to do reproducible research

    • Citable data (maybe – but we need a culture of data citation to drive the practice)

I suggest that we start working with researchers who want to publish data collections to go with journal publications; they are motivated to get this done, in many cases by journal requirements.

What to do?

  1. Is there an existing web application that runs over the top of a data store which we can build on? (There’s one at the University of Sydney that I hope to get a demo of soon.)

  2. And depending on the answer to (1), is there support for building or adapting a storage-coupled data capture app as part of the Metadata Stores projects being run right now at Australian institutions?

Comments?

Figure 3 Remember, capturing stuff is one thing, but once it’s caught you need to figure out what to do with it.

[Update 2012-08-07]

If anyone feels moved to draw their own diagram of their data capture app and how it connects to a catalogue/RDA, you can do so using PlantUML, an open source UML diagramming tool. There is an online form at http://www.plantuml.com/plantuml/form where you can type in source like the component-diagram code (http://plantuml.sourceforge.net/component.html) I used for Figure 2 above:

@startuml 
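' Interfaces (the circles in Figure 2) sit between the Data Capture apps and the
' Research Data Repository; arrow direction shows which way metadata and data
' move (push vs pull)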
() "OAI-PMH + RIF-CS" as OAIPMH
() "Curated -OAI-PMH + RIF-CS" as OAIPMH1
() "Staging area" as DB
()  "Atom Feed" as Atom
() "Atom Publishing Protocol" as Atompub
() "File copy" as cp
() HTTP
() "Web form" as web
() SWORD 

package "Data Capture Apps" {
 [Web upload] --> web
 [DC_A] <-- OAIPMH
 [DC_B] --> DB
 [DC_C] <-- OAIPMH
 [DC_D] <-- Atom
 [DC_E] --> Atompub
 [DC_F] --> SWORD

}



component "Research Data Australia" as RDA

package "Research Data Repository" {
   component "Research Data Catalogue" as RDC
   component "Research Data Store" as RDS
}
DB <-- RDC
web --> RDC
OAIPMH  <-- RDC
Atom <-- RDC
Atompub --> RDC
SWORD --> RDC
SWORD --> RDS
RDC -> OAIPMH1
OAIPMH1 -> RDA
RDS <-- web
RDS --> HTTP
RDS --> DB
HTTP --> [DC_A]
RDS --> cp
cp --> [DC_B]
@enduml

Copyright  Peter Sefton, 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

Mixing our Research Data Metaphors: Seeding the commons, capturing data & taming ‘wild’ research data

By Peter Sefton and Peter Bugeia, with input from the UWS eResearch community and beyond

About this post

During 2012 the University of Western Sydney (UWS) will be rolling out a Research Data Repository (RDR), which we outlined in a previous post. In this post we dig deeper into the architecture and look at how a couple of the components interact; specifically, how does a lab-level data management application talk to the institution-level Research Data Repository when a researcher wants to archive a data set for reuse and citation? This work is a partnership between researchers and technicians at the Hawkesbury Institute for the Environment (HIE), our NSW eResearch partner Intersect, the UWS library and IT, and the UWS eResearch team.

Non-technical summary: The data capture application for environmental scientists at HIE will be aimed at obtaining and managing data for immediate use and re-use. This post describes the technical approach we will use to allow researchers to create a data set from one or more data sources, ask the system to keep it for the long term in the UWS Research Data Repository, and issue an identifier they can use to cite it in a research publication. Keeping data in the RDR means both adding data to the Research Data Storage (RDS) component and maintaining a record about the data in the Research Data Catalogue (RDC).

Technical summary (contains jargon which is explained below): The data-curation interface between the ANDS-funded Data Capture (DC21) and Seeding the Commons (SC20) projects at UWS has now been specified. Data sets identified by researchers as important in the DC21 application will be harvested by the institutional Research Data Repository using the OAI-PMH protocol with a RIF-CS payload. Data librarians will check and improve collection descriptions and, for those of significant re-use potential, publish them to Research Data Australia. On publication, the Research Data Repository application will move data from a pre-published to a published state. Pre-published data may be openly accessible for collaboration purposes but will not have DOI identifiers or guaranteed persistence.

Data capture and seeding the commons

We have two Australian National Data Service (ANDS) projects running at UWS at the moment.

  1. There’s a Data Capture project which, amongst other capabilities, is designed to capture some of the ‘wild’ data, organising it into collections that can be secured, referenced and re-used by others. This is the Climate Change and Energy Research Data Capture Project, known as DC21.

    Data might be considered ‘wild’ if there are questions about its long-term management (will we be able to find it ten years from now?), its short-term safety (is it backed up?), or its status is not known (is it raw or cleansed?).

  2. There’s a Seeding the Commons project which, amongst other things, is aimed at establishing a catalogue application that publishes descriptions of collections of data available for re-use to a search site: Research Data Australia.

Here’s what the DC21 application is doing:

This project will develop the data architecture and associated software systems to automatically capture data and meta-data from three instruments. The motivation for the project is that on completion the systems developed will serve as a basis for including the additional instruments utilised by CCERF and other research groups at UWS.

And it has a close connection to the Seeding the Commons project SC20.

The project is closely aligned and is partly dependent on the UWS Seeding the Commons project (SC20). The meta-data collected in this project will be contributed to the UWS eResearch Metadata Store. SC20 will be developing RIF-CS and OAI-PMH compliance for the UWS eResearch Metadata Store to allow for it to be harvested into the ARDC.

OAI-PMH, RIF-CS?

  1. OAI-PMH is a web protocol that allows one service to pull metadata from another. It’s very similar to the RSS and Atom feeds used by software like Google Reader to keep track of updates on websites.

  2. RIF-CS is the data format used to publish catalogue descriptions of research data and associated entities like people and projects to Research Data Australia. RIF-CS is an ANDS-specific format which is not sufficient on its own to capture a full set of archival and management data about research data collections, but our initial analysis is that it will be sufficient to communicate between the data capture application and the centralised research data repository. A skeletal example of a RIF-CS collection record is sketched below.
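
To give a feel for the format, here is a much-simplified RIF-CS collection record, parsed with Python just to show its shape. The group, key and description values are invented for illustration; the full schema, required elements and vocabularies are documented by ANDS.

import xml.etree.ElementTree as ET

# A skeletal RIF-CS record (simplified; see the ANDS content providers' guide
# for the full schema). The identifiers and names below are made up.
RIFCS_SAMPLE = """<registryObjects xmlns="http://ands.org.au/standards/rif-cs/registryObjects">
  <registryObject group="University of Western Sydney">
    <key>uws.edu.au/collection/hie-example-2012</key>
    <originatingSource>https://dc21.example.edu.au</originatingSource>
    <collection type="dataset">
      <name type="primary"><namePart>Example eddy flux data set, 2012</namePart></name>
      <description type="brief">Meteorological and flux data from HIE towers.</description>
    </collection>
  </registryObject>
</registryObjects>"""

NS = {"rif": "http://ands.org.au/standards/rif-cs/registryObjects"}
root = ET.fromstring(RIFCS_SAMPLE)
for name_part in root.findall(".//rif:collection/rif:name/rif:namePart", NS):
    print("Collection name:", name_part.text)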

From data capture – to data embalming, er, preservation and re-consumption

Luc Small of Intersect has written up the DC21 application.

While it’s called a ‘capture’ application, with connotations of Gerald Durrell style antics in the wilds, trapping temperature readings and soil moisture readings with tranquilizer darts, DC21 is really about data domestication. Sure, we need to obtain data, but it’s not just about raw, untamed data; technicians and researchers do things to the data. They clean it and analyse it, and make useful collections out of data from different sources.

The bit we’re interested in for this post is the point at which someone says “I’m ready to write this up” – at this point they will want to make sure their research is defensible, reproducible and, perhaps most importantly, citable. Before we go on to talk about this process, let’s look at some of the assumptions we’re making about the DC21 application.

Design Considerations

  • Data capture applications contain working data that might be reworked, cleaned or deleted before it is published or used as the basis for a publication or report.

  • Research projects are born, they run and they get completed. Research facilities are built and will eventually become obsolete. Data capture systems which service these projects and facilities are likely to suffer the same fate – they will not always have governance in place to ensure that they persist over long periods of time. (Yes, we know it’s in the requirements spec that every app is ‘sustainable’ but let’s be realistic).

  • The Research Data Repository (RDR) and its sub-parts (the data storage system and the Research Data Catalogue, RDC) capture important institutional assets. To maintain these research data assets, the RDR will need to have governance in place to ensure its long term persistence.

  • The RDR will have RIF-CS-over-OAI-PMH and other interfaces that are needed for compliance and data discovery, meaning that data capture applications need not have these (but they can, of course).

  • A data set that is required for validation of research should have a persistent identifier expressed as an HTTP URI.  (Handles and DOIs can both be used to make URIs, with some benefits and attendant risks).

  • This applies to publicly accessible data sets, as well as those that are expected to be cited even if they are not available as Open Access.

  • And an implementation detail: at UWS, the ReDBox Research Data Catalogue application will be the software underpinning the Seeding the Commons and RDC projects.

Rules of Engagement

Here are some rules of engagement, which are emerging as we get further into the design process for the Research Data Repository (RDR), data capture (DC21) and Research Data Catalogue applications (SC20). These rules are helping to ensure that the research data being captured is robust and well managed.  Data sets that are needed to validate research, and which researchers want to be citable:

  • Must be deposited in the Research Data Storage component (RDS) of the RDR or another persistent store that meets the same standards for data preservation. Note that much data will be in the RDS already; deposit is then a state-change rather than a move.

  • Must be described in the Research Data Catalogue (RDC) with a link to where the data resides. (Support will be available for this from the library).

  • Data capture applications must have a mechanism for a researcher to ask for a data set to be ‘curated’ so it is available for a defined period and correctly described, for example if they want to use it as the basis of a publication.

The current solution

Against the background of our medium-term plans for a UWS Research Data Repository, and the above design considerations, rules of engagement and requirements, the technical teams from the Data Capture project and the Seeding the Commons project spent the best part of a day working out a whiteboard sketch of the interfaces between the lab-level working data management application and the repository.

While this high-level solution design assumes ReDBox, other metadata store applications could be slotted in instead – the interface is standards-based (RIF-CS over OAI-PMH).

The whiteboard looked like this. Below, we’ll simplify that with a proper diagram made on a computer.

Figure 1 Interface between data capture application and the Research Data Repository (using OAI-PMH and the RIF-CS standard for metadata about research data)

There are two main interface points:

  1. Name authority lookup, which ensures that the metadata entered into DC21 is of the highest possible quality, via:

    1. A linked-data approach using HTTP URIs (AKA URLs) as names for things, as per the Gospel According to Tim.

    2. A single source of truth via the Mint component of ReDBox for data like subject codes, people, organisations etc.

  2. The ‘curation boundary’, where DC21 hands over metadata to the Research Data Catalogue and, once that metadata has been curated by data librarians, data is pulled into the public-facing part of the Research Data Store.

The first of these is already done in DC21 – as far as we know this is the first time a service other than ReDBox has been connected to an instance of the Mint as an authority. We will talk more about the importance of name authorities as ‘sources of institutional truth’ and the use of identifiers as our Research Data Repository project proceeds. For now, we will note that as far as possible, every time someone fills out a form with something the institution already knows (the name of a person, a grant code etc.) the data is looked up in the name authority, rather than relying on people typing strings or on local look-up tables. The UWS Research Data Catalogue is going to be ‘no strings attached’, as in text-strings. URIs all the way! A rough sketch of what such a lookup might look like follows.
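
This sketch shows the general idea of a name-authority lookup from a data capture application. The endpoint path, query parameters and response fields below are all assumptions – the real interface depends entirely on how the local Mint instance is configured.

import json
import urllib.parse
import urllib.request

# Hypothetical lookup endpoint -- the real path and parameters depend on the
# local Mint configuration
MINT_LOOKUP = "https://mint.example.edu.au/default/lookup/people"

def lookup_party(family_name):
    """Ask the name authority for matching people, instead of accepting a
    free-text string; each hit should carry a persistent HTTP URI."""
    query = urllib.parse.urlencode({"q": family_name, "count": 5})
    with urllib.request.urlopen(MINT_LOOKUP + "?" + query) as response:
        results = json.load(response)
    # Keep the URI, not the typed string: 'no strings attached'
    # ('label' and 'uri' are placeholder field names for illustration)
    return [(hit.get("label"), hit.get("uri")) for hit in results.get("results", [])]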

The more important interface is the second, which is the main subject of this post: it handles deposit of data collections into the trusted Research Data Repository.

Based on all the design considerations and rules of engagement outlined above, the ‘curation boundary’ needs to be crossed when a researcher wants to keep an archival snapshot of a particular data set.

The story here is designed for data sets of moderate size, like those we’re getting from the Hawkesbury Institute for the Environment.

So, here’s the story:

  1. A researcher uses the DC21 application to find a number of data files from across two of the facilities at the institute, conducts some analysis and writes an article. (The system remembers every download from the data store).

    The researcher asks for the particular data set used for the article to be published/curated, either by uploading the data back into the system or by clicking on an entry in their search history.

    The DC21 application bundles the requested data with as much provenance and metadata as possible, for example adding the underlying raw data.

    The DC21 application sets a flag against that downloaded collection to mark it as ready for publication – meaning it will start appearing in the OAI-PMH feed. The DC21 application will also remember that the data behind the collection has been referenced in a collection. This is to ensure that the data is not subsequently deleted or modified without due consideration for the collection.

  2. The Research Data Catalogue, which is part of the Research Data Repository, picks up the new collection record from the OAI-PMH feed and puts it in the ‘ReDBox inbox’.

  3. The team of data librarians see the new data set in the inbox, add missing metadata for management and discovery purposes (maybe contacting the researcher for more information), and publish the data.

  4. The Data Catalogue application mints a new DOI for the data set, and causes the data to be copied into the public part of the research data store. (Yes, we have to work out some of the details about when IDs get minted in this process – this step might need to happen earlier.)

  5. Later, another researcher can discover the data via a web search, a discovery service like Research Data Australia, or the Research Data Catalogue directly; either way they get a URL version of the DOI for the data set.

  6. When someone downloads the data using the DOI-URL, they’re redirected to the data in the Research Data Store.
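
As a tiny illustration of step 6: the DOI-URL simply redirects to wherever the Research Data Store serves the data from, so any HTTP client that follows redirects will do the job. The DOI below is invented.

import urllib.request

# Invented DOI for an archived data set -- minted by the Research Data
# Catalogue in step 4 above
DOI_URL = "https://doi.org/10.99999/uws-hie-example-dataset"

def resolve_dataset(doi_url):
    """Follow the DOI redirect chain to the data set's home in the
    Research Data Store (urllib follows HTTP redirects automatically)."""
    with urllib.request.urlopen(doi_url) as response:
        return response.geturl()

print("Data served from:", resolve_dataset(DOI_URL))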

Figure 2 Step-by step data curation and publishing process

Copyright Peter Sefton and Peter Bugeia, 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

Climate Change and Energy Research – Intersect eResearch Project Summary

Climate Change and Energy Research – Intersect eResearch Project Summary

[This document was prepared by Intersect for the DC21 project being run by the University of Western Sydney, funded by the Australian National Data Service. We’re posting it here as part of the story about how UWS is building a Research Data Repository.]

Dr Luc Small | 20 February 2012 | 4.1


Intersect is developing and deploying technology aimed at assisting research into climate change and energy by enhancing the management of environmental sensor data. This document describes the core functionality of the proposed project and is targeted at researchers and their support teams who may wish to join the project as collaborators.

Background and Context

There has been a significant rise in the number of sensors and sensor networks used in environmental research in recent years. This growth has brought with it the challenge of managing sensor infrastructure and the data produced by the increasing numbers of deployed sensors.

Three classes of instruments are targeted for this project, from which data and meta-data will be collected:

  • Eddy Flux Towers – Collect meteorological and flux data (e.g. surface-atmosphere exchanges of CO2, water vapour and energy).

  • Whole tree chambers – Collect meteorological data regarding the environmental impact on a tree wholly encapsulated within the chamber. 

  • Weather stations – Collect meteorological data.

While these instruments are the current focus of the project, the project aims to be sensor/infrastructure agnostic and therefore more generally applicable to sensor data management.

Problem Statement

The problem of insufficient sensor infrastructure and data management affects researchers, data technicians and infrastructure managers. Its impacts include:

  • lost or misplaced sensor data.

  • inadequate recording of how and where data was collected

  • inadequate recording of quality assurance, gap filling and other post-processing done to the data, and the assumptions made by the data technician during post-processing.

  • scientific conclusions based on less-than-ideally managed source data that is prone to error.

A successful solution would:

  • store data in a secure, backed-up, centralised location.

  • record rich metadata about how and where data was collected.

  • record rich metadata about the post-processing done to the data.

  • provide an intuitive means by which researchers can access data and be fully informed about its nature by consulting its associated metadata.

Project Deliverables

  1. Infrastructure management: The ability to keep track of sensor infrastructure (for example, flux towers and weather stations) and individual sensors, and changes to the sensors and/or the infrastructure.

  2. Raw sensor data acquisition: Manual and automated data acquisition from files generated by sensors and/or their data-loggers.

  3. Versioned data storage: Permanent retention of raw sensor data. When datasets are quality assured, gap filled, or transformed, new versions of the datasets are created, time-stamped, and related back to the original raw sensor datasets (see the sketch after this list). Data is stored in a centralised fashion that can be easily backed up.

  4. Data sharing: Data can be downloaded by those within the research group. Data comes with detailed meta-data describing the sensor and infrastructure used to acquire the dataset and any transformations that may have been done to it. Meta-data can be made available to Research Data Australia to make the research data more readily discoverable by other scientists.

  5. Data upload: As noted above, new versions of a dataset can be uploaded to the system and linked to the original raw dataset. This allows the process of data transformation to be tracked and ensures that it is a non-destructive process because all datasets created are retained.
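
Here is a minimal sketch, in Python, of the versioning relationship described in deliverable 3: every derived dataset is time-stamped and keeps a pointer back to the dataset it was produced from, so raw data is never overwritten. The class, field names and file paths are invented for illustration.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class DatasetVersion:
    path: str                        # where the file sits on the data store
    description: str                 # e.g. "raw sensor data", "gap filled"
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    parent: Optional["DatasetVersion"] = None   # None for raw sensor data

def lineage(version: DatasetVersion) -> List[str]:
    """Walk back to the original raw dataset, for provenance display."""
    chain = []
    while version is not None:
        chain.append(f"{version.created:%Y-%m-%d} {version.description}: {version.path}")
        version = version.parent
    return chain

raw = DatasetVersion("store/raw/tower1_2012-08-01.dat", "raw sensor data")
gap_filled = DatasetVersion("store/derived/tower1_2012-08-01_gapfilled.dat",
                            "gap filled", parent=raw)
print("\n".join(lineage(gap_filled)))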

More Information

If you face a similar problem and would like to join as a collaborator, or you face a related problem and are interested in understanding more about this project and potentially reusing components of it, please contact your local eResearch Analyst or email era@intersect.org.au.

Key Facts

Funding Source/Amount:

ANDS – Data Capture Program – $200k

Lead Organisation/CIs:

University of Western Sydney

Hawkesbury Institute for the Environment

Prof. Ian Anderson

Timeframe:

Development commenced in December 2011 and the system will go live in the latter half of 2012.

Related Projects:

TERN/OzFlux: The present project is best regarded as supporting the precursor activities that enable the delivery of quality assured data to a facility such as OzFlux.

A Day in the Life…

A new sensor is installed on a flux tower. Data files are retrieved from the associated datalogger once a day and placed on networked storage. The infrastructure manager:

  • Adds the sensor to the catalogue of sensors associated with the flux tower.

  • Associates the sensor data files with the sensor record.

  • Provides detailed meta-data about the sensor, such as its make, model, position on flux tower, etc.

  • Removes the sensor record for the faulty sensor that this new model has replaced.

Over the days that follow, data starts flowing in from the new sensor. The data technician:

  • Downloads the raw sensor data that has been collected.

  • Gap fills part of the data where the sensor has recorded readings that lie outside the band of expected values.

  • Uploads the gap filled data along with an explanation of the post-processing applied to the data.

The system stores the gap filled data as a new version. This new version is automatically associated with the original raw sensor data. The raw data remains available, unmodified, for future reference. The researcher:

  • Explores the sensors available on the flux tower and selects the one she’s interested in.

  • Browses the data available for the sensor and takes note of the data technician’s comments about any post-processing steps that have been performed.

  • Selects the gap filled version of the data since it is most appropriate in this instance.

  • Downloads the gap filled sensor data and commences analysis.

  • Is aided in analysis and write-up by having the full details of the flux tower, sensor, and post-processing step at hand.

  • Finds anomalies in the gap filled data and isolates the post-processing as the cause by looking at the raw sensor data.

  • Having decided this is “the” dataset, asks the system to archive a copy and mint a new DOI so that it can be cited like an article and retrieved from the UWS Research Data Repository.

Please refer to the diagram below for an indication of how other stakeholders will interact with this project.


[Figure: DC21 stakeholder diagram]


This document by Intersect Australia is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.