Who is DORA?

[Update 2014-09-11: fixed some grammatical misalignments]

Avid readers of this blog will have noticed that we’ve suddenly started talking about DORA quite a lot. So who, why and/or what is a DORA?

The simple answer is that it is a Digital Object Repository for Academe, which tells you that we came up with a snappy acronym but perhaps leaves you wanting a little more information.

As part of our mission of supporting (? encouraging, enabling, proselytizing, enforcing, …) eResearch at UWS, we’ve taken a step back from the coalface and tried to paint The Big Picture™ of what systems supporting eResearch should look like. One output of this was the set of principles and practices; another is a high-level architecture of an eResearch system, which looks like this:

Overview of an eResearch system, with DORA

A Little Mora¹

The basic idea is that a DORA provides a good place to store research data while researchers are working on it. To this end there are a few key features any potential DORA must support, some of which come directly from our principles, while others are more process-oriented. A DORA must:

  • safely store research data and its associated metadata in a way which keeps them linked. Conceptually there is a combined object in the DORA which contains both the data and metadata (No Data Without Metadata)
  • allow versioning of these combined data objects
  • allow search for data objects
  • support the Upload and Researcher APIs to allow scripted operations:
    • Upload API allows automated upload of research data and associated metadata
    • Researcher API allows searching for and downloading of objects, and then uploading of modified or processed versions of them (see the sketch after this list)
  • support the Publisher API to allow a clean transfer of information on data objects to an institutional data catalogue, as well as a possible transfer of the data itself to another repository (depending on the nature of the data, the repositories and an institution’s policies)
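To make these APIs a little more concrete, here is a minimal sketch of what a scripted round trip might look like. Everything in it is hypothetical: the base URL, endpoint paths, field names and auth scheme are illustrative placeholders for whatever generic API we end up designing, not a specification.

```python
import requests

BASE = "https://dora.example.edu.au/api"  # hypothetical DORA endpoint
AUTH = {"Authorization": "Token s3cr3t"}  # hypothetical auth scheme

# Upload API: deposit data and metadata as one combined object
# (No Data Without Metadata).
with open("run-42.csv", "rb") as f:
    resp = requests.post(
        f"{BASE}/objects",
        headers=AUTH,
        files={"data": f},
        data={"metadata": '{"title": "Run 42", "experiment": "Exp-1"}'},
    )
obj_id = resp.json()["id"]

# Researcher API: search for objects, download one ...
hits = requests.get(f"{BASE}/objects", headers=AUTH,
                    params={"q": "Run 42"}).json()
data = requests.get(f"{BASE}/objects/{obj_id}/data", headers=AUTH).content

processed = data.upper()  # stand-in for real processing

# ... and upload the processed result as a new version of the
# same object, rather than overwriting it.
requests.post(
    f"{BASE}/objects/{obj_id}/versions",
    headers=AUTH,
    files={"data": ("run-42-processed.csv", processed)},
)
```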

In an ideal world, there would be one DORA which would do everything for everyone, but honestly that seems so unlikely that we have to acknowledge that we will end up with a small number of DORAe and for any given research project we will pick the most appropriate one. This is another place where the APIs come in – if all DORAe support the same APIs then they become drop-in functional replacements for each other. Additionally, behind these APIs there could be a small ecosystem of cooperating tools – a simple repository for storing, an indexer for searching, a preview generator, etc – further reducing the need to find One Perfect Tool which Does Everything Brilliantly. (Separation of Concerns)

The catch here is, of course, that it is unlikely that two different potential DORAe will come out of the box supporting exactly the same APIs, so there’s a good chance that we will have to write some code to adapt each out-of-the-box API to the generic one we design. One possible light in this particular darkness is the extent to which we can use something like SWORD v2 as that common API.
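SWORD v2 is an AtomPub-based deposit protocol over HTTP, so the Upload side of such an adapter could be quite thin. Here’s a rough sketch of a SWORD v2 deposit; the collection URL and credentials are placeholders, and the packaging details would depend on the repository at the other end:

```python
import requests

# Placeholder SWORD v2 collection URI for a hypothetical repository.
COLLECTION = "https://repo.example.edu.au/sword2/collections/research-data"

with open("run-42-package.zip", "rb") as pkg:
    resp = requests.post(
        COLLECTION,
        data=pkg,
        auth=("depositor", "password"),
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "attachment; filename=run-42-package.zip",
            # SimpleZip packaging: a zip of the data files plus metadata.
            "Packaging": "http://purl.org/net/sword/package/SimpleZip",
            "In-Progress": "false",  # a complete deposit, not a staged one
        },
    )

resp.raise_for_status()
# On success the server returns 201 Created plus an Atom entry
# describing (and pointing at) the deposited object.
print(resp.status_code)
```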

DORA and AAAA

So how does a DORA work with our AAAA data management methodology? To our great relief, pretty well:

Acquire
the getting of the data and metadata in the first place. It’s not really shown in the diagram, but essentially the outputs of the acquisition are the data and metadata on the filesystem at the bottom of the diagram.
Archive
the combining of the data and metadata and the uploading of it into the DORA, via the Upload API.
Act
the stuff a researcher does to the data. The data is fetched via the Researcher API and updated versions are written back to the DORA, also via the Researcher API. This is where the versioning capability of DORA comes into play.
Advertise
information about the data is packaged up and transferred into the Institutional Data Catalogue, as sketched below. Optionally, the data described in the catalogue may be transferred to the Institutional Data Store.
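As a sketch of that Advertise step, a Publisher API transfer might look like the following. Again, every URL and field name is a hypothetical placeholder; a real Publisher API would also need to carry licensing and access-rights information:

```python
import requests

DORA = "https://dora.example.edu.au/api"            # hypothetical
CATALOGUE = "https://catalogue.example.edu.au/api"  # hypothetical

# Publisher API: pull the description of a finished data object ...
meta = requests.get(f"{DORA}/objects/42/metadata").json()

# ... and advertise it in the Institutional Data Catalogue. The
# record points back at the data rather than duplicating it.
record = {
    "title": meta["title"],
    "description": meta.get("description", ""),
    "source_uri": f"{DORA}/objects/42",
}
requests.post(f"{CATALOGUE}/records", json=record)
```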

As you can see, a DORA sits at the heart of this and is pretty key to making it all work, which is why we may seem to be banging on about DORAe rather a lot.

Creative Commons Licence
Who is DORA? by David Clarke is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

  1. I should take a moment to apologise for having opened this particular PanDORA’s Box of punnery.
    Sorry

Is HIEv aDORAble?

[Update 2014-09-04: added a definition of DORA]

This week we held another of our tool/hack days at UWS eResearch. This time it was at the Hawkesbury Campus, with Gerry Devine, the data manager for the Hawkesbury Institute for the Environment. The tool in question is the DIVER product (AKA DC21 and HIEv).

Where did Intersect DIVER come from?

Intersect DIVER was originally developed by Intersect in 2012 for the University of Western Sydney’s Hawkesbury Institute for the Environment as a means to automatically capture and secure time series and other data from the Institute’s extensive field-based facilities and experiments. Known at HIE as “the HIEv”, Intersect DIVER has been adopted as the Institute’s primary data capture application. For more information, see http://intersect.org.au/content/intersect-diver.

We wanted to evaluate DIVER against our Principles for eResearch software with a view to using it as a generic DORA working data repository.

Hang on! A DORA? What’s that?

DORA is a term coined by UWS eResearch Analyst David Clarke for a generic Digital Object Repository for Academe (yes, Fedora‘s an example of the species). We expressed it thusly in our principles:

At the core of eResearch practice is keeping data safe (remember: No Data Without Metadata). Different classes of data are safest in different homes, but ideally each data set or item should live in a repository where:

  • It can be given a URI
  • It can be retrieved/accessed via a URI by those who should be allowed to see it, and not by those who should not
  • There are plans in place to make sure the URI resolves to something useful as long as it is likely to be needed (which may be "as long as possible").
DORA Diagram
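Those repository principles are testable in scripts. Here’s a minimal sketch of the kind of check you could run, assuming an HTTP-resolvable URI and token-based access control (the URI and token are, of course, made up):

```python
import requests

uri = "https://dora.example.edu.au/objects/42"  # hypothetical data-set URI

# Resolvable by those who should be allowed to see it ...
ok = requests.get(uri, headers={"Authorization": "Token s3cr3t"})
assert ok.status_code == 200, "authorised access should resolve to something useful"

# ... and not by those who should not (a refusal or a redirect to a
# login page both count as "not resolving" for anonymous users).
denied = requests.get(uri, allow_redirects=False)
assert denied.status_code != 200, "anonymous access should be refused"
```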

The DIVER software is running at HIE, with more than 50 "happy scientists" (as Gerry puts it) using it to manage the research data files, including those automatically deposited from the major research facility equipment.

HIEv Shot

So, what’s the verdict?

Is DIVER a good generic DORA?

The DIVER data model is based entirely on files, which is quite a different approach from CKAN, which we looked at a few weeks ago, or Omeka, which we’re going to look at in a fortnight’s time; both of those use a ‘digital object’ model where an object has metadata and zero or more files.

DIVER does many things right:

  • It has metadata, so there’s No Data Without Metadata (but with some limitations; see below)

  • It has API access for all the main functionality, so researchers doing reproducible research can build recipes to fetch and put data, run models and so on from their language of choice (see the sketch after this list).

  • The API works well out of the box with hardly any fuss.

  • It makes some use of URIs as names for things in the data packages it produces, so that published data packages do use URIs to describe the research context.

  • It can extract metadata from some files and make it searchable.
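To give a flavour of those recipes, here is a minimal sketch of a fetch-process-put cycle over a token-authenticated JSON API. The hostname, endpoint paths and parameter names are illustrative rather than quoted from the DIVER documentation:

```python
import requests

HIEV = "https://hiev.example.edu.au"  # placeholder hostname
TOKEN = "my-api-token"                # per-user API token (hypothetical)

# Search for data files by name (endpoint and parameters illustrative).
hits = requests.get(
    f"{HIEV}/data_files/search.json",
    params={"auth_token": TOKEN, "filename": "weather_station"},
).json()

# Fetch the first match and run some analysis over it.
raw = requests.get(hits[0]["url"], params={"auth_token": TOKEN}).content
processed = raw  # stand-in for the actual model run or analysis

# Put the processed result back so it is versioned and shareable.
requests.post(
    f"{HIEV}/data_files/upload.json",
    params={"auth_token": TOKEN},
    files={"file": ("weather_station_processed.csv", processed)},
    data={"type": "PROCESSED"},
)
```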

But there are some issues that would need to be looked at for deploying DIVER into new places:

  • The metadata model in DIVER is complicated: it has several different, non-standard ways to represent metadata, most of which are not configurable or extensible, and a lot of the metadata is not currently searchable.

  • DIVER has two configurable ‘levels’ of metadata that automatically group files together; at HIE they are "Facility" and "Experiment". There is no extensible per-installation metadata, such as CKAN’s simple generic name/value user-addable fields, and those two levels are the only major configuration change you can make to customise an installation. This is a very common issue with this kind of software: no matter how many levels of hierarchy there are, a case will come along that breaks the built-in model.

    In my opinion the solution is not to put this kind of contextual stuff into repository software at all. Gerry Devine and I have been trying to address this by working out ways to separate descriptions of research context from the repository, so the repository worries only about keeping well-described content, and the research context is described by a human-and-machine-readable website, ontology or database as appropriate, with whatever structure the researchers need to describe what they’re doing. Actually, Gerry is doing all the work, building a new semantic CMS app that can describe research context independently of other eResearch apps.

  • There are a couple of hard-wired file preview functions (for images) and derived files (OCR and speech recognition) but no plugin system for adding new ones, so any new deployment that needed new derived file types would need a customisation budget.

  • The only data format from which DIVER can extract metadata is the proprietary TOA5 format owned by the company that produces the institute’s data-loggers (see the sketch after this list); NetCDF support would be more useful.

  • There are some user interface issues to address, such as making the default page for a data-file more compact.
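For those who haven’t met it, TOA5 is a table-oriented ASCII logger format: comma-separated data preceded by four header rows carrying the station/logger environment, the field names, their units, and the record-processing type. That header is where the extractable metadata lives. A minimal sketch of pulling it out (the filename and field names are examples):

```python
import csv
import pandas as pd

path = "weather_station.dat"  # placeholder TOA5 file

# TOA5 files start with four header rows, then the data.
with open(path, newline="") as f:
    rows = csv.reader(f)
    environment = next(rows)  # station name, logger model, program, ...
    fields = next(rows)       # column names, e.g. TIMESTAMP, AirTC_Avg
    units = next(rows)        # e.g. TS, Deg C
    processing = next(rows)   # e.g. "", Avg

# Per-column metadata: {field: (unit, processing)}.
metadata = dict(zip(fields, zip(units, processing)))

# The measurements themselves begin on row five.
df = pd.read_csv(path, skiprows=4, names=fields, parse_dates=["TIMESTAMP"])
print(metadata.get("AirTC_Avg"))  # e.g. ('Deg C', 'Avg')
```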

Conclusion

There is a small community around the open source DIVER product, with two deployments using it for very different kinds of research data. To date the DIVER community doesn’t have an agreed roadmap for where it might be heading and how the issues above might be addressed.

So at this stage I think it is suitable for re-deployment only into research environments which closely resemble HIE, probably including the same kinds of data-logger (I haven’t seen the other installation, so can’t comment on that). It might be possible to develop DIVER into a more generic product, but at the moment there is no obvious business case for doing that rather than adapting a more widely adopted, more generic application. I think the way forward is for the current user communities (of which I consider myself a member) to consider the benefits of incremental change towards a more generic solution as they maintain and enhance the existing deployments, balancing local feature development against the potential benefits of attracting a broader community of users.

And another thing …

We discovered some holes in our end-to-end workflow for publishing data from HIEv to our Institutional Data Repository, and some gaps in the systems documentation, which we’re addressing as a matter of urgency.

Creative Commons License
Is HIEv aDORAble? by Peter Sefton is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.