Notes
At UWS, we haven’t tried to drive change with top-down policy. Instead, we’ve taken a practical, project-based approach which has allowed a data architecture to evolve. The eResearch Roadmap calls for a series of data capture applications to be developed for data-intensive research, along with a generic application to cover the long tail of research data.
The 4A Vision
For the purposes of this presentation we will talk about the ‘4A’ approach to research data management: Acquire, Act, Archive and Advertise. We chose these terms, rather than the 2Rs (Reuse and Reproduce) of the conference theme, to throw a slightly different light on the same set of issues. The presentation will examine each of these ‘A’s in turn and explain how they have helped us to organize our thinking in developing a target technical data architecture and integrated, end-to-end, data-related business processes and services. These processes and services involve research technicians and support staff, researchers and their collaborators, library staff, information technology staff, the office of research services, and external service providers such as the Australian National Data Service and the National Library of Australia. The presentation will also discuss how all of this relates to the research project life cycle and grant funding approval.
Acquiring the data
We are attacking data acquisition (known as Data Capture by the Australian National Data Service, ANDS 1) in two ways:
With discipline-specific applications for key research groups. A number of these have been developed in Australia recently (for example MyTARDIS 2); we will talk about one developed at UWS. With ANDS funding, UWS is building an open source automated research data capture system (the HIEv) for the Hawkesbury Institute for the Environment. It automatically gathers time-series sensor data and other data from a number of field facilities and experiments, and gives researchers and their authorised collaborators easy self-service discovery of, and access to, that data.
With generic services: data storage via simple file shares; integration with cloud storage, including Dropbox.com and other distributed file systems; and source-code repositories, such as public and private GitHub and Bitbucket stores, for working code and textual data.
Acting on data
The data acquisition services described above are there in the first instance to allow researchers to use data. With our environmental researchers, we are developing techniques for building reusable data sets. These include the raw data and commented scripts that clean the data (e.g. a comment “filter out known bad-days when the facility was not operating”) and then re-organize it, via resampling or other operations, into useful ‘clean’ data that can be fed to models, plotted etc. and used as the basis of publications. Demo: the presentation will include a live demonstration of using HIEv to work on data and create a data archive.
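To illustrate the kind of commented cleaning script we have in mind, here is a minimal sketch using only the Python standard library. The readings, the -999.0 fill value, the bad-day list and the function name `clean_and_resample` are all hypothetical; this is not code from HIEv, and real scripts would read files exported from the data capture application.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean

# Hypothetical raw readings: (ISO timestamp, air temperature in degrees C).
RAW_READINGS = [
    ("2013-01-01T00:00:00", 21.0),
    ("2013-01-01T12:00:00", 29.0),
    ("2013-01-02T00:00:00", -999.0),  # assumed fill value while offline
    ("2013-01-02T12:00:00", -999.0),
    ("2013-01-03T00:00:00", 19.0),
    ("2013-01-03T12:00:00", 27.0),
]

# Filter out known bad days when the facility was not operating.
KNOWN_BAD_DAYS = {date(2013, 1, 2)}


def clean_and_resample(readings, bad_days):
    """Drop readings taken on bad days, then resample to daily means."""
    by_day = defaultdict(list)
    for stamp, value in readings:
        day = datetime.fromisoformat(stamp).date()
        if day in bad_days:
            continue  # raw data stays untouched; we only skip it here
        by_day[day].append(value)
    # One mean value per remaining day, in date order.
    return {day: mean(values) for day, values in sorted(by_day.items())}


if __name__ == "__main__":
    for day, value in clean_and_resample(RAW_READINGS, KNOWN_BAD_DAYS).items():
        print(day.isoformat(), value)
```

Keeping the raw readings and the script together means the ‘clean’ data can be regenerated, and the comments record why each day was excluded.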
From action to archive
Once both reusable base data sets and publication-specific operations on data (to create plots etc.) exist, there are several workflows in which various parties trigger deposit of finished, fixed, citable data into a repository. Our project team mapped out several scenarios in which data are deposited by different actors with different drivers, including motivations that are both carrot (my data set will be cited) and stick (the funder/journal says I have to deposit). Services are being crafted to fit in with these identified workflows, rather than building new things and assuming “they will come”.
Archiving the data
The University of Western Sydney has established a Research Data Repository i (RDR), the central component of which is a Research Data Catalogue, running on the ReDBox open source repository platform. While individual data acquisition applications such as HIEv are considered to have a finite lifespan, the RDR will provide ongoing curation of important research datasets. This service is set up to use the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to harvest data sets from the working-data applications, including the HIEv data-acquisition application and the CrateIt data packaging service.
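As a rough sketch of what an OAI-PMH harvest involves, the following builds a ListRecords request and pulls record identifiers out of a response. The base URL, the identifiers and the sample response are hypothetical, and `oai_dc` is used only because every OAI-PMH provider must support it; this is not the configuration used by the RDR, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, trimmed response of the kind a provider might return.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:dataset-1</identifier>
        <datestamp>2013-01-01</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:dataset-2</identifier>
        <datestamp>2013-02-01</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix, from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvest by datestamp
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return the identifier from every record header in a response."""
    root = ET.fromstring(response_xml)
    path = ".//oai:record/oai:header/oai:identifier"
    return [element.text for element in root.iterfind(path, OAI_NS)]


if __name__ == "__main__":
    print(list_records_url("https://example.org/oai", "oai_dc", "2013-01-01"))
    print(record_identifiers(SAMPLE_RESPONSE))
```

A real harvester would also follow resumption tokens and handle deleted records, both of which are omitted here.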
Advertising the data
As with Institutional Publications Repositories, one of the key functions of the Research Data Repository is to disseminate metadata about holdings to aggregation services and give data a web presence. Many Australian institutions are connected to the Research Data Australia discovery service 6, which harvests metadata via an ANDS-defined standard over the OAI-PMH harvesting protocol. As far as we know, there is so far no Google-Scholar-like service that harvests data about data sets via direct web crawling, so there are no firm standards for how to embed such metadata in a page. We are, however, tracking the development of the Schema.org vocabulary, which is driven largely by Google and a group of its peer companies, and the work described above on data packaging with RDFa metadata is intended to be consumed by direct crawlers. It is possible to unzip a CrateIt package and expose it to the web, thus creating a machine-readable entry-point to the data within the Zip/BagIt archive.
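As a hedged illustration of embedding data set metadata in a page for crawlers, the sketch below renders an HTML fragment with RDFa Lite attributes using the Schema.org `Dataset` type. The function name `dataset_rdfa`, the choice of properties and the example values are our assumptions for illustration; this is not the markup that CrateIt produces.

```python
from html import escape


def dataset_rdfa(name, description, url, creator):
    """Render a Schema.org Dataset description as HTML with RDFa Lite."""
    return (
        '<div vocab="http://schema.org/" typeof="Dataset">\n'
        f'  <h1 property="name">{escape(name)}</h1>\n'
        f'  <p property="description">{escape(description)}</p>\n'
        f'  <a property="url" href="{escape(url)}">Landing page</a>\n'
        '  <span property="creator" typeof="Person">'
        f'<span property="name">{escape(creator)}</span></span>\n'
        "</div>"
    )


if __name__ == "__main__":
    # All values below are hypothetical.
    print(
        dataset_rdfa(
            name="Example soil moisture time series",
            description="Daily means derived from raw sensor readings.",
            url="https://example.org/datasets/1",
            creator="A. Researcher",
        )
    )
```

The same fragment is readable by people in a browser and by a crawler that understands RDFa, which is the property we want from a package entry-point.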
Looking to the future, the University is also considering plans for an over-arching discovery hub, which would bring together all metadata about research, including information on publications, people and organisations.
Technical architecture
The following diagram shows the first end-to-end data capture to archiving pathways to be turned on at the University of Western Sydney, covering Acquisition and Action on data (use) and Archiving and Advertising of data for reuse. Note the inclusion of a name-authority service, which is used to ensure that all metadata flowing through the system is unambiguous and linked-data-ready 7. The name authority is populated with data about people, grants and subject codes from databases within the research services section of the university and from community-maintained ontologies. A notable omission from the architecture is integration with the Institutional Publications Repository; we hope to be able to report on progress joining up that piece of the infrastructure via a Research Hub at Open Repositories 2014.
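The role of the name authority can be sketched as a lookup that replaces free-text labels with a single URI per entity. Everything here is hypothetical: the URIs, the labels, the field names and the functions `resolve` and `link_metadata` are invented for illustration and do not describe the interface of the service in the diagram.

```python
# Hypothetical authority records: every label that refers to the same
# person or grant maps to exactly one URI.
NAME_AUTHORITY = {
    "j. smith": "https://example.org/people/0001",
    "jane smith": "https://example.org/people/0001",
    "grant 12345": "https://example.org/grants/12345",
}


def resolve(label):
    """Return the URI for a label, or None if the authority has no match."""
    key = " ".join(label.lower().split())  # ignore case and extra spaces
    return NAME_AUTHORITY.get(key)


def link_metadata(record, fields=("creator", "grant")):
    """Return a copy of record with the given fields tied to URIs."""
    linked = dict(record)
    for field in fields:
        label = record.get(field)
        if label is None:
            continue
        # Keep the human-readable label next to the unambiguous URI.
        linked[field] = {"label": label, "uri": resolve(label)}
    return linked


if __name__ == "__main__":
    record = {"title": "Example data set", "creator": "Jane  Smith"}
    print(link_metadata(record))
```

Because two spellings of the same name resolve to one URI, records from different applications can later be joined without guessing which “J. Smith” is meant.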
i Project materials refer to the repository as a project which includes both working and archival storage as well as some computing resources, drawing a line around ‘the repository’ that is larger than would be usual for a presentation at Open Repositories.