eResearch Tools day. Is CKAN aDORAble?

On Tuesday August 12th the UWS eResearch team had one of our irregular-but-should-be-regular tool-hacking days, looking again at the CKAN data repository software. There were three main aims:

  1. Evaluate the software as a working-data repository for an engineering lab and maybe an entire institute, similar to the way HIEv sits in the Hawkesbury Institute for the Environment.

  2. Evaluate the software as a generic research data management solution for groups wanting to capture data into a repository as part of their research. Does CKAN fit with our principles for eResearch Development and Selection? Joss Winn wrote about CKAN as a research data store a couple of years ago, explaining why they chose it at Lincoln, and there was a workshop last year in London which produced some requirements documents etc.

  3. Provide a learning opportunity for staff, giving them a chance to try new things and develop skills (such as using an API, picking up a bit of Python etc).

What happened?

David demoed CKAN and showed:

  • Simple map-based visualization using a spreadsheet of capital cities he found on the internet

  • Simple plotting of some maths-data

And then what?

We then broke up into small groups (mainly of size one, if we’re honest), to investigate different aspects of CKAN.

  • Katrina and Carmi: Looking at the ability to upload Excel files by ingesting some data.gov.au datasets. What can be done, what can’t? What happens with metadata?
  • David: Looking into the upload of a HIEv package/cr8it crate into CKAN. Can we automagically get the metadata out and stash it in CKAN? Can we represent the package’s file structure in CKAN?
  • Alf: Document this instance and preview infrastructure needs.
  • PeterS: Previews for Markdown and other files; getting stuff out of files; events/queues; RDF and URIs.
  • PeterB: TOA5 uploads from HIEv
  • Lloyd: POST an ADELTA record into CKAN.

So how did we do?

Well, we got data moving around via a number of methods – spreadsheets went in via the web interface, documents went in over the API, documents came out over the API.

We learnt the differences between CKAN’s structured and unstructured data. "Structured" data is essentially tabular data: if you’re bringing it in via a CSV or a spreadsheet then it’s structured. What this means is that it gets stored as a relational table within CKAN and in principle this means you can access particular rows. Unstructured data is anything else, and you can access all of a blob or none of it.
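
To make the difference concrete, here is a minimal sketch (the CKAN URL, API key and resource id are placeholders, assuming a stock CKAN 2.x instance): a resource that has been loaded into the DataStore can be queried row by row with the datastore_search API action, whereas an unstructured resource can only be fetched whole.

```python
import requests

CKAN = "http://ckan.example.edu"   # placeholder CKAN instance
API_KEY = "my-api-key"             # placeholder API key
RESOURCE_ID = "resource-uuid"      # placeholder resource id

# Structured ("DataStore") resource: individual rows can be queried.
rows = requests.get(
    CKAN + "/api/3/action/datastore_search",
    params={"resource_id": RESOURCE_ID, "limit": 5},
    headers={"Authorization": API_KEY},
).json()["result"]["records"]
for row in rows:
    print(row)

# Unstructured ("blob") resource: all you can do is fetch the whole file,
# via the URL that resource_show reports for it.
res = requests.get(
    CKAN + "/api/3/action/resource_show",
    params={"id": RESOURCE_ID},
    headers={"Authorization": API_KEY},
).json()["result"]
blob = requests.get(res["url"], headers={"Authorization": API_KEY})
with open("blob.bin", "wb") as f:
    f.write(blob.content)
```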

We found gists handy for passing code snippets and wee “how-to” texts between the team on Slack.

My CKAN day…

Peter B

We had a reasonably successful day. I found the upload of a file resource through the CKAN API (from Python) worked much more easily with the extra documentation. We had some problems with the security key, in that the API wouldn’t run for me or Peter S when using our own keys, but it all worked when we used each other’s – reason: unknown. From a Python script we were able to open a specially formatted CSV file (TOA5 format from Campbell Scientific, which has 2 additional rows of metadata at the top), decode the first 2 rows and turn the metadata into name/value pairs when we created the CKAN dataset. So this was fairly flexibly done. A lot of our HIE climate change data is formatted this way, which means we should be able to ingest records fairly readily as CSV.
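
A rough sketch of the approach (the CKAN URL and key are placeholders, and the interpretation of the two header rows is illustrative only – real TOA5 headers vary with logger configuration):

```python
import csv
import requests

CKAN = "http://ckan.example.edu"   # placeholder CKAN instance
API_KEY = "my-api-key"             # placeholder API key

def toa5_to_ckan_dataset(path, dataset_name):
    """Read the two metadata rows at the top of a TOA5-style CSV and create a
    CKAN dataset whose extras carry that metadata as name/value pairs."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        names = next(reader)    # first header row: metadata field names (illustrative)
        values = next(reader)   # second header row: the corresponding values
    extras = [{"key": k, "value": v} for k, v in zip(names, values)]

    resp = requests.post(
        CKAN + "/api/3/action/package_create",
        json={"name": dataset_name, "extras": extras},
        headers={"Authorization": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()["result"]
```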

Alf

I wrote some short instructions (in a gist) on how to start up our CKAN instance.

Unfortunately the rest of the time produced more heat than light, as I read up on CKAN’s web-based previewing feature, which uses Recline.js as well as Data Proxy, but it is still a little unclear to me how it all ties together.

Peter B pointed out that extracting individual rows from datasets is possible if the dataset is kept in a database underneath CKAN rather than as a file "blob". So I did some reading and partial setup of the CKAN Data Storer Extension. The setup guide is aimed at someone with more Python experience than me, so I got trapped in "celery and pasta (paster) land" for most of the afternoon!

David

Initial success in dusting off my long-dormant Python skills and getting data in and out via the API was followed by losing a lot of time trying to extract the RDFa from the HIEv package’s HTML. Neither hand-crafted extraction nor Python’s [rdfadict](https://pypi.python.org/pypi/rdfadict) could get it all out (in fact, the library got nothing. Nothing!). The lesson here is to be sure that we put metadata in a place and a form from which we can get it out programmatically.

Notwithstanding that, CKAN had a lot going for it in terms of upload and access, but it wasn’t immediately clear how it would handle complex metadata within its data model.

Carmi

At Tools Day I learned, for the first time, how to create a new dataset item and upload a data file to it via the CKAN API using Python. That was the highlight for me. It was also interesting to see what is possible in terms of visualising data. I uploaded a few Excel spreadsheets and the graphing interface was very user-friendly. I would like to see it utilised for data visualisation on the Centre for the Development of Western Sydney’s website.

Petie

This time posting actual data to CKAN seemed easier – I am assuming the documentation must have improved. I managed to put together something that could create new datasets and attach new files – a potential denial of service attack against CKAN or a tool for testing its scalability. And at Peter B’s suggestion I worked on some very simple code to extract metadata and CSV from TOA5 files, as used by Campbell Scientific data loggers residing at the Hawkesbury Institute for the Environment.
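
A minimal sketch of that kind of loop (instance URL, key and sample file are placeholders – and something to point only at a test instance):

```python
import requests

CKAN = "http://ckan-test.example.edu"   # placeholder: a *test* CKAN instance
API_KEY = "my-api-key"                  # placeholder API key

# Create a batch of throw-away datasets, each with one small file attached,
# to see how the instance copes with sustained API traffic.
for i in range(100):
    pkg = requests.post(
        CKAN + "/api/3/action/package_create",
        json={"name": "load-test-%03d" % i},
        headers={"Authorization": API_KEY},
    ).json()["result"]

    with open("sample.csv", "rb") as f:
        requests.post(
            CKAN + "/api/3/action/resource_create",
            data={"package_id": pkg["id"], "name": "sample.csv"},
            files={"upload": f},
            headers={"Authorization": API_KEY},
        )
```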

The $64,000 Question: is CKAN up to it?

In general, yes: CKAN seems to be a reasonable platform for data management that aligns well with our principles.

It has the basic features we need:

  • APIs for getting stuff in and out and searching

  • A discovery interface with faceted search

  • Previews for different file types

There are some limitations.

  • Despite what it says on the website and what Joss Winn reports, it’s not really ‘linked-data-ready’

  • It does have extensible metadata, but there’s no formal support for recognized ‘proper’ metadata schemas, just name/value pairs

There are some questions still to explore:

  • How well will it scale? We can probe this easily enough by pumping a lot of data into it

  • How robust and transactional is the data store? If we have different people or processes trying to act on the same objects at the same time will it cope or collapse?

  • Can we use more sophisticated metadata? We might look at the ability to add an RDF file that contains richer metadata than the built-in fields. How hard would this be? Could we allow richer forms for filling out, say, MODS metadata?

  • Ditto for using URIs. How easy would it be to add real linked data support? Would a hack do? I.e. instead of storing plain name/value pairs, allow some conventions like name (URI)/value (URI) – a rough sketch of this idea follows the list. Again, how easy would it be to hack the user interface to support things like autocomplete against name authorities rather than collecting yet more strings?
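
A rough sketch of the kind of convention we have in mind – nothing CKAN supports natively, just a discipline applied to its existing extras: the key is a property URI and the value is a URI drawn from a name authority (the ORCID and vocabulary URI below are placeholders).

```python
# Hypothetical convention: extras whose keys are property URIs and whose
# values are URIs from a name authority, rather than free-text strings.
extras = [
    {"key": "http://purl.org/dc/terms/creator",
     "value": "http://orcid.org/0000-0000-0000-0000"},      # placeholder ORCID
    {"key": "http://purl.org/dc/terms/subject",
     "value": "http://example.org/vocab/plant-ecosystems"},  # placeholder vocabulary term
]
```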

Lessons learned

We didn’t talk to each other as much as we should have. This was possibly due to the venue – our offices – which meant people went to their desks. Next time we’ll be in a more interactive venue.

David was held up by the design of the data packages from HIEv – we need to revise the data packaging so that it has metadata in easy-to-use JSON as well as metadata embedded in RDFa.

Creative Commons Licence
eResearch Tools day. Is CKAN aDORAble? by members of the UWS eResearch team is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Research Data Repository Services Delivered in Stage One

The Research Data Repository over the last year has delivered an impressive array of infrastructure and services.

Services that exist now

Service / How the service is delivered
  • Service: A researcher can request a Research Shared Drive up to 1 TB, with multiple users and access anywhere on UWS campus. An FAQ is online.
    How delivered: The request can originate from the researcher or from eResearch, and then the ITS team provision the share in accordance with the support plan. The request form is online.
  • Service: A researcher can back up their git repository onto the Research Data Store.
    How delivered: The service is delivered ad hoc by the eResearch team.
  • Service: A researcher can request a virtual machine.
    How delivered: The request can originate from the researcher or from eResearch, and then the ITS team provision the virtual machine in accordance with the relevant SOP.
  • Service: A researcher can deposit their research data in the Research Data Catalogue.
    How delivered: There are two ways to initiate the request: by self-service using an online form, or in discussions with eResearch and the Library Research Services team. Once initiated, the Research Services Coordinator – Library follows Library procedure in creating a new collection record and storing the data collection (as applicable).
  • Service: Library systems can harvest metadata from UWS and web sources of truth on a regular basis.
    How delivered: This metadata is stored in the Research Data Catalogue and provides lookup for applications like ReDBox and HIEv. The service is delivered in accordance with Library procedures.
  • Service: A researcher can use the Data Management Plan Checklist.
    How delivered: This is self-service by obtaining a copy of the checklist online, with support from the eResearch team as needed.
  • Service: A researcher can create a Data Management Plan.
    How delivered: This is self-service by obtaining a copy of the checklist online, with support from the eResearch team as needed. The eResearch team can and do occasionally write Data Management Plans on behalf of researchers, using the same template.
  • Service: A researcher can find out about data management practices and services on the web.
    How delivered: This is self-service via reading website content and following links for more information, with assistance from the eResearch team.
 

External services that we are supporting

Service / How the service is delivered
  • Service: A researcher can obtain a NeCTAR virtual machine of up to 2 cores at a time for up to 3 months.
    How delivered: eResearch can assist with access and set-up.
  • Service: A researcher can apply for medium and large (high intensity) virtual machines from NeCTAR.
  • Service: A researcher can get a Cloudstor+ account through AARNET; this is cloud storage for research, located within Australia.
    How delivered: eResearch are actively promoting this service and seeking user evaluations of it.
  • Service: A ReDBox administrator can raise a bug or issue with QCIF for resolution.
    How delivered: QCIF provide support, with assistance from the eResearch team.
 

What infrastructure has been delivered

Infrastructure – Storage
  • 127 TB of high-quality disk for researchers and research-related uses has been deployed. This storage is highly flexible and extensible and can be utilised as SAN or NAS depending on the need.
      – Migration of all data from the old 70 TB SAN.
  • Established a new service, the Research Shared Drive (SIF share).
      – New FAQ/README with instructions for installation, and also best practices in data management.
      – New support plan through close coordination between eResearch and ITS.
      – 10 research teams are currently using the RDS.
  • Storage has been connected to a number of virtual machines for research specific projects and applications.
Collaborative Storage
  • Explored and trialled several collaborative storage solutions, including Oxygen Cloud, WOS cloud, SparkleShare, and OwnCloud.
  • Selected OwnCloud based on experience at other organisations (such as AARNET and Lincoln University in the UK).
  • A trial was conducted whereby a link was made between Dropbox and the Research Shared Drive. The team set up a Dropbox account which can receive a copy of a researcher’s Dropbox, and store that data on the same Researcher’s Shared Drive. This system is still in development stages.
  • A trial was conducted whereby a link was made between Source Code Repositories (version control systems) and the Research Data Store. The link is demonstrated by a UWS git server which clones public access git repositories. By way of example, we cloned the eResearch-apps repository.
Up Next:
  • Trial a collaborative storage option based on OwnCloud.
  • Establish a mechanism by which a user pushes their git repository to UWS storage.
  • Serve the needs of researchers who use other version control systems such as Mercurial and SubVersion.
Infrastructure – Compute
  • 4 servers have been provisioned for research use (2 existing from HIE and 2 provided through the RDR); this is the Research Cluster.
  • The Research Cluster comprises 160 processor cores and 1,024 GB of memory.
  • 6 VMs which had been created previously were successfully migrated onto the Research Cluster.
  • There are 9 virtual machines which have been created in the research cluster, with plans to migrate more virtual machines across from the School of Medicine and other schools and institutes.
  • We can provision up to approximately 40 ‘medium intensity’ virtual machines.
Up Next:
  • Create canned virtual machines which come ready-made with the tools needed to analyse data.
Infrastructure – Software
  • New packaging software was developed for research data, called CrateIt (Cr8it). Cr8it was started under two different approaches. The first approach was to leverage a toolset called The Fascinator, and the other approach was to incorporate new features into OwnCloud.
  • Document conversion, such as ePub generation, was ported into OwnCloud-Cr8it.
  • An automatic generation of a combined metadata catalogue record plus manifest was started. The manifest will be human and machine readable, leveraging work done by the HIEv (DC21) project.
Up Next:
  • Create a Cr8it trial and roll it out.
  • Flesh out what metadata record needs to be created by the Cr8it packaging process.
Research Data Catalogue
  • A simple form was developed that a researcher can use to indicate that they have a data set they would like to archive.
  • A pro forma questionnaire has been developed by the Research Services team at the Library. A process for including a new data set was also developed by the Library Research team.
  • 3 new procedure documents were created which formalised the ingest of metadata from RHESYS (University Research Management System) and from external sources, such as ReDBox wiki, NHMRC and ARC. Approximately 1,500 researchers and 500 projects are in the Research Data Catalogue available via lookup when a new data collection record is created.
  • New Research Data Catalogue entries (30+) were added to Research Data Australia, searchable by anyone with web access.
  • The ReDBox application was set up so that people who create data sets at UWS also have their unique details merged with an existing (or newly created) record in the National Library of Australia database, which is linked to any other data sets or publications which they have created in the same field or under the same name.
  • A new feature in ReDBox was added whereby an administrator can view the results of ingesting records about people and research projects. These results are presented in the form of ingest reports, describing what was ingested, modified, or removed, to support Quality Assurance going forward.
  • A ReDBox support agreement was negotiated with QCIF, which provides bug fixes and technical support until December 2014.
  • A new wizard for creating a data management plan inside the data catalogue is currently being trialled. The idea is that any data management plan which is created will be stored in the catalogue along with the data, and can be exported as a pdf if needed.
Services – Research Data Management
  • A new Data Management Plan Checklist was created.
  • A new Data Management Plan Template was created.
  • An additional page was added to the Office of Research Services pages, which included:
      – Data Management defined,
      – Data Management best practices,
      – Links to RDR services,
      – Links to external services and more information as applicable, and
      – Standard pro forma language that researchers can use to complete their research application forms.
  • Internal application forms were improved to ask researchers to explain how data management will be addressed, including:
      – the internal grant application for UWS-funded research, and
      – the application form to start a new external grant application through ORS.
  • eResearch interviewed researchers with live projects and created 3 Data Management Plans using the Data Management Plan Template, plans which have been provided to the researchers.
  • eResearch interviewed managers of research facilities and drafted 4 Data Management Plans thus far, which have been provided to the facility managers.
Up Next:
  • Finalise Data Management Plans for our research facilities. In addition eResearch is currently assisting with new shared drives for these facilities (this is really BAU but is within the scope of the project).
  • Deposit the Data Management Plans in the Research Data Catalogue.
 

4A Data Management: Acquiring, Acting-on, Archiving & Advertising research data at the University of Western Sydney

This is a presentation with speaker notes from the Open Repositories 2013 conference at Prince Edward Island in Canada, as presented by Peter Sefton, written with Peter Bugeia.

[Update 2013-07-25: Added missing link to Kangaroo video]

Creative Commons Licence
4a Data Management by Peter Sefton and Peter Bugeia is licensed under a Creative Commons Attribution 3.0 Unported License

Slide 1

Notes

Abstract

There has been significant Government investment in Australia in repository and eResearch infrastructure over the last several years, to provide all universities with an institutional repository for publications, and via the Australian National Data Service to encourage the creation of institution-wide Research Data Catalogues, and research Data Capture applications. Further rounds of funding have added physical data storage and cloud computing services. This presentation looks at an example of how these streams of money have been channeled together at the University of Western Sydney to create a joined-up vision for research data management across the institution and beyond, creating an environment where data may be used by research teams within and outside of the institution. Alongside of the technical services, we report on early work with researchers to create a culture of replicable use of data, towards the vision of truly reproducible research.

This presentation will show a proven end-to-end design for research data flows, starting from a research group, The Hawkesbury Institute for the Environment, where a large sensor network gathers data for use by institute researchers, in-situ, with data flowing-through to an institutional data repository and catalogue, and thence to Research Data Australia – a national data search engine. We also discuss a parallel workflow with a more generic focus – available to any researcher. We also report on work we have done to improve metadata capture at source, and to create infrastructure that will support the entire research data lifecycle. We include demonstrations of two innovations which have emerged from the associated project work: the first is of a new tool for researchers to find, organize, package and publish datasets; the second is of a new packaging format which has both human-readable and machine-readable components.


Slide 2

Notes

Some of the work we discuss here was funded by the Australian National Data Service. See:

Seeding the commons project to describe data sets at UWS and the Data catalogue project.

HIEv Data Capture at the Hawkesbury Institute for the Environment


The talk

Notes

We’ll use the four A’s to talk about some issues in data management.

We need a simple framework which covers it all, to capture how we work with research data from cradle to grave:

We need to Acquire the raw data and make it secure and available to be worked on.

We need to Act on the data: cleanse it while keeping track of how it was cleansed, and analyse it using tools that support our research, all while maintaining the data’s provenance.

We need to Archive the data from working storage to an archival store, making it citable.

We need to Advertise that the data exists so that others can discover it and use it confidently with simple access mechanisms and simple tools.

4A must work for

high-intensity research data such as that from gene sequences, sensor networks, astronomy, medical diagnostic equipment, etc.

the long tail of unstructured research data.


For example

Notes

In the presentation, Peter Sefton used the short video linked here as an ice-breaker.

If only data capture were as simple as catching a kangaroo in a shopping bag!


Australian Government Initiatives in Research Data Management

Notes

There have been several rounds of investment in (e)research infrastructure in Australia over the last decade, including substantial investments to get institutional publications repositories established.

Australian National Data Service (ANDS) $50M (link)

National eResearch Collaboration Tools and Resources (NeCTAR) project (link) $50M

Research Data Storage Infrastructure (RDSI) $50M (link)

Implemented to date:

National Research Data Catalogue – Research Data Australia

Standard approach to updating the Catalogue (OAI-PMH and rif-cs)

10+ Institutional Metadata Repositories implemented

120+ data capture applications implemented across 30+ research organisations

Upgrade of High Performance Computing infrastructure

Colocation of data storage and computing


Slide 6

Notes

UWS is a young (~20 years) university performing well above most of its contemporaries in research.


Slide 7

Notes

This slide by Prof Andrew Cheetham – the Deputy Vice Chancellor for Research shows that UWS performs very well at attracting competitive grant income from the Australian Research Council.


Slide 8

Notes

UWS is concentrating its research into flagship institutes – here we will be talking in more detail about HIE, our environmental institute, which does research cutting across different disciplines, spanning from the leaf level to the ecosystem level.


Slide 9

Notes


Slide 10

Notes

Intersect is the peak eResearch organisation in the state of NSW:

Intersect was formed in 2008 in response to research IT needs.

The term ‘eResearch’ is used to refer to the application of advanced information and communication technologies to the practice of research. It enhances existing research processes, making them more efficient and effective, and it enables new kinds of research processes. eResearch brings together the effective management and organisation of research data with computing infrastructure and software applications to enable research and to facilitate collaboration between researchers.

eResearch loosely translates to e-Science or cyberinfrastructure, depending on which part of the world you come from.

Intersect is a not for profit company which is owned by its members (see list on next page)

Intersect currently consists of 60 staff, with eResearch Analysts on-site at member organisations (this is unique in Australian eResearch).

Services include: data capture solutions / software development, high-end data storage infrastructure, research data management planning, high performance computing (Intersect administers its own supercomputing facility and provides its members with a share of Australia’s leading computing infrastructure at the Australian National University), virtual computing, consulting, training, and strategic advice.

UWS is a member of Intersect


Slide 11

Notes

These are Intersect’s members. Intersect also collaborates with other eResearch organisations throughout Australia.

The slide is a photo from the recent Hackfest event. This is an annual fun competition for software developers to use open government data in innovative ways. Intersect hosted the NSW chapter of the event.


eResearch @ UWS

Notes

The eResearch unit at UWS is a small team, currently reporting to the Deputy Vice Chancellor, Research. See our FAQ.


Slide 13

Notes

At UWS, we haven’t tried to drive change with top-down policy. Instead, we’ve taken a practical, project-based approach which has allowed a data architecture to evolve. The eResearch Roadmap calls for a series of data capture applications to be developed for data-intensive research, along with a generic application to cover the long tail of research data.

The 4A Vision

For the purposes of this presentation we will talk about the ‘4A’ approach to research data management – Acquire, Act, Archive and Advertise. The choice of different terms from the 2Rs Reuse and Reproduce of the conference theme is intended to throw a slightly different light on the same set of issues. The presentation will examine each of these ‘A’s in turn and explain how they have helped us to organize our thinking in developing a target technical data architecture and integrated data-related end-to-end business processes and services involving research technicians and support staff, researchers and their collaborators, library staff, information technology staff, office of research services, and external service providers such as the Australian National Data Service and the National Library of Australia. The presentation will also discuss how all of this relates to the research project life cycle and grant funding approval.

Acquiring the data

We are attacking data acquisition (known as Data Capture by the Australian National Data Service, ANDS 1) in two ways:

With discipline specific applications for key research groups. A number of these have been developed in Australia recently (for example MyTARDIS 2), we will talk about one developed at UWS. With ANDS funding, UWS is building an open source automated research data capture system (the HIEv) for the Hawkesbury Institute for the Environment to automatically gather time-series sensor data and other data from a number of field facilities and experiments, providing researchers and their authorised collaborators with easy self-service discovery and access to that data.

Generic services for data storage via simple file shares, integration with cloud storage (including Dropbox.com and other distributed file systems), and source-code repositories such as public and private GitHub and Bitbucket stores for working code and textual data.

Acting on data

The data Acquisition services described above are there in the first instance to allow researchers to use data. With our environmental researchers, we are developing techniques for developing reusable data sets which include raw data, commented scripts to clean the data (eg a comment “filter out known bad-days when the facility was not operating”) then re-organize it via resampling or other operations into useful ‘clean’ data that can be fed to models, plotted etc and used as the basis of publications. Demo: the presentation will include a live demonstration of using HIEv to work on data and create a data archive.
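
As a small illustration of the kind of cleaning script described above (a sketch only, in Python/pandas with made-up file names, column names and dates; much of the real work at HIE is done in R):

```python
import pandas as pd

# Load raw sensor output (file and column names are made up for illustration).
raw = pd.read_csv("facility_raw.csv", parse_dates=["timestamp"], index_col="timestamp")

# Filter out known bad days when the facility was not operating (placeholder dates).
bad_days = pd.to_datetime(["2013-02-11", "2013-02-12"])
clean = raw[~raw.index.normalize().isin(bad_days)]

# Resample to hourly means so the cleaned series can be fed to models or plotted.
hourly = clean.resample("1H").mean()
hourly.to_csv("facility_clean_hourly.csv")
```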

From action to archive

Having created both re-usable base data sets and publication-specific operations on data to create plots etc there are several workflows where various parties trigger deposit of finished, fixed, citable data into a repository. Our project team mapped out several scenarios where data are deposited with different actors and drivers including motivations that are both carrot (my data set will be cited) and stick (the funder/journal says I have to deposit). Services are being crafted to fit in with these identified workflows rather than build new things and assume “they will come”.

Archiving the data

The University of Western Sydney has established a Research Data Repository[i] (RDR), the central component of which is a Research Data Catalogue, running on the ReDBOX open source repository platform. While individual data acquisition applications such as HIEv are considered to have a finite lifespan, the RDR will provide on-going curation of important research datasets. This service is set up to harvest data sets from the working-data applications, including the HIEv data-acquisition application and the CrateIt data packaging service, using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
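
For readers unfamiliar with OAI-PMH, a harvest is just a sequence of HTTP requests; a minimal sketch of what the catalogue’s harvester does against a data source (the endpoint URL is a placeholder):

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder OAI-PMH endpoint exposed by a working-data application such as HIEv.
ENDPOINT = "http://hiev.example.edu/oai"

resp = requests.get(ENDPOINT, params={"verb": "ListRecords", "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)

OAI = "{http://www.openarchives.org/OAI/2.0/}"
for record in root.iter(OAI + "record"):
    header = record.find(OAI + "header")
    # Each record identifies a data package the catalogue can ingest.
    # (A real harvester would also follow resumptionToken paging.)
    print(header.findtext(OAI + "identifier"), header.findtext(OAI + "datestamp"))
```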

Advertising the data

As with Institutional Publications Repositories, one of the key functions of the Research Data Repository is to disseminate metadata about holdings to aggregation services and give data a web presence. Many Australian institutions are connected to the Research Data Australia discovery service 6, which harvests metadata via an ANDS-defined standard over the OAI-PMH harvesting protocol. There is so far no Google-Scholar-like service which is harvesting data about data sets via direct web crawling (that we know about), so there are no firm standards for how to embed data in a page, but we are tracking the developments of the Schema.org vocabulary, which is driven largely by Google and a group of companies which are Google’s peers, and the work described above on data packaging with RDFa metadata is intended to be consumed by direct crawlers. It is possible to unzip a CrateIt package and expose it to the web, thus creating a machine-readable entry-point to the data within the Zip/BagIt archive.

Looking to the future, the University is also considering plans for an over-arching discovery hub, which would bring together all metadata about research, including information on publications, people, and organisations.

Technical architecture

The following diagram shows the first end-to-end data capture to archiving pathways to be turned on at the University of Western Sydney, covering Acquisition and Action on data (use) and Archiving and Advertising of data for reuse. Note the inclusion of a name-authority service which is used to ensure that all metadata flowing through the system is unambiguous and linked-data-ready 7. The name authority is populated with data about people, grants and subject codes from databases within the research services section of the university and from community-maintained ontologies. A notable omission from the architecture is integration with the Institutional Publications Repository – we hope to be able to report on progress joining up that piece of the infrastructure via a Research Hub at Open Repositories 2014.

[i] Project materials refer to the repository as a project which includes both working and archival storage as well as some computing resources, drawing a line around ‘the repository’ that is larger than would be usual for a presentation at Open Repositories.


Slide 14

Notes

There are a number of major research facilities at HIE, here are two whole-tree chambers which allow control over temperature, moisture and atmospheric CO2.


Slide 15

Notes

This diagram shows the end-to-end data and application architecture which Intersect and UWS eResearch built to capture data from HIE sensors and other sources. Each of the columns roughly equates to one of the four As. Once data is packaged in the HIEv, it is stored in the Research Data Store and there is a corresponding record for it in the Research Data Catalogue. The data packaging format produced by the HIEv, along with the delivery protocol, are key to the architecture: the data packaging format (based on BagIt) is stand-alone from the HIEv and self-describing, and the delivery protocol (OAI-PMH) is well-defined and standards-based. These are discussed in more detail in later slides. When other data capture applications are developed at UWS, to integrate into and extend the architecture they will simply need to package data in the same format and produce and deliver the same metadata via the same delivery protocol as the HIEv.


Slide 16

Notes

This diagram shows how the four ‘A’s fit together for HIE. Acquisition and action are closely related – it is important to provide services which researchers actually want to use and to build in data publishing and packaging services rather than setting up an archive, and hoping they come to it with data.


Slide 17

Notes

The HIEv/DC21 application is available as open source:

Funded by ANDS

Developed by Intersect

Automated data capture

Ruby on Rails application

Agile development methodology

Went live in Jan 2013.

1200 files, 15 GB of RAW data, 25 users.

120 files auto-uploaded nightly, +1GB per week

Expected to reach 50,000 files in next couple of years

Now extended to include Eucface data

Possibly to be extended to include Genomic data (20TB per year)

Integrated with UWS data architecture

Supports the full 4 As – links Acquire to Act to Archive


Slide 18

Notes

Acting on data: our researchers are now starting to do work with the HIEv system; here’s an API client developed by Dr Remko Duursma for consuming HIEv data from R.


Slide 19

Notes

Acting on data: researchers can pull data either manually or via API calls and do work, such as this R plot.


From acting to archiving…

Notes

The following few slides show how a user can select some files…


Slide 21

Notes

… look at file metadata …


Slide 22

Notes

… add files to a cart …


Slide 23

Notes

… download the files in a zip package …


Slide 24

Notes

… inside the zip the files are structured using the bagit format …


Slide 25

Notes

… with a standalone README.html file containing all the metadata we know about the files and associated research context (experiments, facilities) …


Slide 26

Notes

… with detail about every file as per the HIEv application itself


Slide 27

Notes

… and embedded machine readable metadata using RDFa lite attributes


Slide 28

Notes

… the RDFa metadata describes the data-set as a graph.

Completed packages flow through to the Research Data Catalogue via an OAI-PMH feed, and there they are given a DOI so they can be cited. The hand-off between systems is important: once a DOI is issued the data set has to be kept indefinitely and must not be changed.


Slide 29

Notes

Advertising – data. This is a record about an experiment on Research Data Australia.


Slide 30

Notes

Acquiring the data – long tail.

We looked in some detail at how the HIEv data capture application works for environmental data – but what about researchers who are on the long tail, and who don’t have specific software applications for their group?

We are working on a similar Acquire and Act service that will operate with files, and trying to make it as useful and attractive as possible. Most research teams we talk to at UWS are using Dropbox or one of the other ‘Share, Sync, See’ services. Dropbox has limitations on what we can do with its APIs and does not play nicely with authentication schemes other than its own, so we are looking at building ‘Acquire and Act’ services using an open source alternative: OwnCloud.

Our application is known as Cr8it (Crate-it).


Slide 31

Notes

A number of techniques employed at UWS:

the “R” drive

research-project-oriented data shares

synchronisation with dropbox and owncloud

synchronisation with github and svn

References

1. Burton, A. & Treloar, A. Designing for Discovery and Re-Use: the ‘ANDS Data Sharing Verbs’ Approach to Service Decomposition. International Journal of Digital Curation 4, 44–56 (2009).

2. Androulakis, S. MyTARDIS and TARDIS: Managing the Lifecycle of Data from Generation to Publication. In eResearch Australasia 2010 (2010), at <http://ccaeducause1.caudit.edu.au/index.php/eraust/2010/paper/view/62>

3. Sefton, P. M. The Fascinator – Desktop eResearch and Flexible Portals (2009), at <https://smartech.gatech.edu/handle/1853/28483>

4. Kunze, J., Boyko, A., Vargas, B., Madden, L. & Littman, J. The BagIt File Packaging Format (V0.97), at <http://tools.ietf.org/html/draft-kunze-bagit-06>

5. W3C RDFa Working Group & others. RDFa Core 1.1 Recommendation (2012), at <http://www.w3.org/TR/rdfa-syntax/>

6. Wolski, M., Richardson, J. & Rebollo, R. Shared benefits from exposing research data. In 32nd Annual IATUL Conference (2011), at <http://iatul2011.bg.pw.edu.pl/proceedings/ft/Wolski_M.pdf>

7. Berners-Lee, T. Linked Data (2006), at <http://www.w3.org/DesignIssues/LinkedData.html>

Seeding the Commons Data Sharing Project Complete.

The federally funded component of the UWS Seeding the Commons data description and sharing project has been completed. A comprehensive description of the Project is available via a previous blog post. Descriptions of 21 UWS data collections are now available in Research Data Australia. Some of the collections are open access, some are available via mediated access (contact the researcher to discuss access conditions) and some are metadata (description) only. The collections are also represented in Trove, the National Library’s single search portal, and discoverable by Google, Google Scholar and other search engines. Data collections with a DOI will be indexed in the new Thomson Reuters Data Citation Index, allowing them to be formally cited in research papers. Congratulations to all participating researchers who now have their data/data description accessible to a vast audience of international scholars, including potential collaborators.

Lessons Learned

A shift in the culture of data sharing is required to ensure that data does not remain the lost output of research. Whilst some have embraced sharing, others still insist they ‘just don’t want to’ share their data. A concerted effort is required to raise the awareness of UWS researchers on the benefits of data sharing through a campaign of communication, education and engagement for all in the research data lifecycle. If you know of a data sharing success story don’t be shy, spread the word.

Data description is complex and can appear daunting until all the pieces fall into place. A cheat sheet is in development and available on request to assist. Once refined the sheet will be published.

What’s Next?

Researchers may self-submit data descriptions and/or small data sets into the UWS Research Data Catalogue. Library staff will complete the metadata, confirm the record with the submitter and make it available in Research Data Australia.

Researchers wanting to share data/descriptions but who are unsure about self-submission may contact Susan Robbins at s.robbins@uws.edu.au or 9852 5458.

UWS is currently working on Cr8it (pronounced Crate-it) – a web-based packaging application for research data. It will give users an organised view of their files and as much metadata as possible automatically extracted from the files. Cr8it will let researchers identify related objects and organise them into a data ‘package’, adding more metadata and context if required, such as associating a package with a research institute, facility or experiment. Researchers will then be able to send the packages to the Research Data Catalogue and eventually push them out to a variety of other destinations, such as blogs or discipline repositories. Cr8it is currently at the proof of concept stage. To try out this service or learn more, please contact eResearch@uws.edu.au.

Currently we are investigating ‘use cases’ for depositing data at various stages of the research life cycle (e.g. at the inception of a research idea, when applying for a grant, when it is funded, etc.). These are mentioned in a previous blog post.

What’s in it for Me?

The opportunity to stay ahead of the pack. Aside from the practical issues of data storage and preservation, you could increase opportunities for collaboration and the impact of your research globally.

To arrange storage space for working data, or a secure space to archive and preserve data, contact: Toby O’Hara from eResearch at t.ohara@uws.edu.au or 4736 0928

To arrange for your data collection to be described and reflected in Research Data Australia (and associated locations) contact:

Susan Robbins Research Coordinator (Library) at s.robbins@uws.edu.au or 9852 5458

‘Moving Forward’

A series of UWS data-related webinars and workshops will soon be available to assist anyone/everyone involved in the research data lifecycle. To be informed when they are scheduled, please email Susan Robbins at s.robbins@uws.edu.au.

“Piled Higher and Deeper” by Jorge Cham
www.phdcomics.com

No Such Thing as a Dumb Question

I’m working with external collaborators – can I give them access to our Research Data Repository?

This is being investigated as a priority, but is unavailable at present.

I trip over boxes of old interviews in my lounge room – can you take them, digitise them?

UWS Archives is able to store these for the duration of the ethics application – contact RAMS. After the ethics expiry date, the data collection will be evaluated to determine the next step.

I’m retiring and have 20 years worth of research data on floppy discs. Can I give them to you to digitise, preserve and archive?

At present we don’t offer this option, but you can self-submit them and (soon) utilise Cr8it (see above) to manage the collection. Assistance is available. Contact Toby O’Hara at t.ohara@uws.edu.au, x2928.

Data are the New Black: Data Sharing in the National/International Arena

A selection of data related activities occurring internationally.

CSIRO to embrace open access Hare, Julie. The Australian [Canberra, A.C.T] 11 July 2012: 31

THE CSIRO is making freely available 200,000 research papers dating back to the 1920s on its new, open-access repository. It is also creating a portal to contain most of the raw research data used by the organisation since its inception.

“It’s a massive job. We will eventually have 86 years of data in the repository,” said Jon Curran, CSIRO’s general manager of communications.

“We are anticipating this is where the world of science is heading.

“The mood is there. And we know the more visible the work the more excitement and energy that is generated.” …

GigaScience (http://www.gigasciencejournal.com/), an innovative new journal handling ‘big data’ from the entire spectrum of life sciences, has now been launched by BGI

Geoscience Data Journal – a new Wiley open access data journal

“It is becoming increasingly important that the data which underpins key findings should be made more available to allow for the further analysis and interpretation of those results,” said Mike Davis, Vice President and Managing Director, Life Sciences Wiley. “The ability of researchers to create and collect often huge new data sets has been growing rapidly in parallel with options for their storage and retrieval in a wide range of data repositories. We are launching the Geoscience Data Journal in response to these important developments.”

http://au.wiley.com/WileyCDA/PressRelease/pressReleaseId-104139.html

Hindawi Datasets International

Publishing a Dataset Paper in Datasets International is all about the underlying raw and tabular data that the author has obtained during their experiment. Every table or image should be accompanied with a full description of how this data has been obtained; for instance, if you provide us with a graph, you should provide us with the tabular data you have used to draw this graph.

Datasets should contain detailed explanation of the methodology and materials used in conducting the experiment/observation and no final results or conclusions. Accordingly, manuscripts should be submitted along with all the relevant data.

Wikidata aims to create a free knowledge base about the world that can be read and edited by humans and machines alike. It will provide data in all the languages of the Wikimedia projects, and allow for the central access to data in a similar vein as Wikimedia Commons does for multimedia files.

Google Scholar already contains citations to datasets represented by a DOI.

An example of a data citation in a reference list

Birgisdottir, L., and Thiede, J.R.N. (2002) Carbon and density analysis of sediment core PS1243-1. PANGAEA. doi:10.1594/PANGAEA.87536.

Cited in Jourabchi, P., L’Heureux, I., Meile, C., & Cappellen, P. V. (2010). Physical and chemical steady-state compaction in deep-sea sediments: Role of mineral reactions. Geochimica Et Cosmochimica Acta, 74(12), 3494-3513. Retrieved from www.scopus.com

Creative Commons License
Seeding the Commons Data Sharing Project Complete. by Susan Robbins is licensed under a Creative Commons Attribution 3.0 Unported License.

Research Data Repository (RDR) progress report, May 2013

The RDR project at UWS started in 2010 with the purchase of some storage infrastructure, and was expanded in scope in 2012, based on this scoping document. Work began in earnest in June 2012 when project manager Toby O’Hara joined the team. We set out with these broad principles in mind:

The repository will consist of two main components:

  1. A scalable storage service linked to a combination of local and cloud-based high performance computing. Some data may also reside in other, trusted storage systems such as national infrastructure or discipline repositories with suitable governance in place.

  2. A catalogue of research data for internal use in management, and external use in dissemination and collaboration.

But the project is about much more than supplying storage and computing. It is about creating an organisational capability and culture of managing research data throughout the research lifecycle. We aim:

  • To enable research in all disciplines at UWS to take place efficiently and effectively on existing and new data sets.

  • To enable the validation of research through appropriate management of data inputs and outputs.

  • For re-use in new research which will cite the creators of data sets at UWS.

  • For compliance with funder requirements and codes of practice.

Those two main components are now established. We have both working storage (RDS) and archival storage (RDR) now commissioned and working on a small scale. (Note that terminology on this project has changed a bit – the RDR used to refer to all the components but it became quite clumsy to talk about ‘the archival repository part of the broader Research Data Repository’).

Figure – Super-simple view of the Research Data Repository with the two main kinds of storage – Working vs Archival

On top of that simple view, we can show how the RDR sits with other systems.

Figure – RDR interaction with two other services. Dropbox.com integration is a simple one-way approach, while the HIEv data capture application interacts with both working and archival storage via the Catalogue

There are many, many ways that these services could be extended but we have identified three high priorities from consulting with UWS researchers, and talking to other eResearch teams, which we’ll talk about in more detail below:

  1. Adding support for distributed version control systems used by tech-savvy researchers to manage software code and documents.

  2. Adding more support for distributed file-systems like Dropbox, but with better support for data security, access control and the ability to add eResearch applications over the top of the storage.

  3. Dealing with the looming ‘feral file’ problem, where data storage tends to fill up and there is a lack of options for researchers to hand over data to an archival store.

Dealing with source-code and document version-control systems

There are two widely used distributed version control systems: git and Mercurial. Many researchers use these to manage program code and/or document sources for publications in text markup such as LaTeX and, increasingly, Markdown, via tools like knitr in the R environment. We are working to add support for this class of repository in our repository, which should be fairly straightforward, as the modern distributed code repositories support the key use-case by design. That is, they allow you to ‘push’ code changes to more than one repository, so a UWS member of a team that is already happily working with, say, Bitbucket could push repository changes to a UWS archival repository for safe keeping, as well as to the team repository. Why would they want to do this? It’s not about short term risk, but about having copies of data that are independent of service providers that might come and go in the medium to long term. And it’s about exactly the same use-cases for packaging data and depositing in an archival repository as with any other data project, when projects end, articles are published etc. More on this in a post soon.

Future file systems

The Dropbox.com file sync-and-share product is a clear winner in the distributed file-system stakes. It has a low-friction viral quality that lets it spread in ways that permeated and subverted our institutional networks and command-and-control structures. And it has an unparalleled ease of use[1]. But there are two major problems:

  1. There are some kinds of data for which one should NOT use Dropbox.com: the researcher has to decide if they are meeting ethical standards, funder requirements and layers of institutional policy.

  2. And while Dropbox.com has an API – an interface against which third parties can write software applications – it is severely limited for doing the kind of ‘bridging’ work we want to do between the RDS working-data store and the RDR archival store.

So, the fact that Dropbox.com is so popular, and so good, makes it clear that even if we can’t match it completely, we should be thinking about how to provide a similar service so research teams can:

  • Store stuff on all their devices and have it automatically synchronise between them, with some limits about re-sharing.

  • Invite others that they identify as collaborators to see the files. (No, that does not mean getting them to fill in and sign a form to apply for a university account, the way I have heard it described at a big university not far from here; it means I send you an invitation by email, you log in using something that (a) suits you and (b) works – for example, a Gmail account – and once I’m sure that you are you, the sharing starts. Yes, there are exceptions where we need higher levels of assurance, but for most collaborations too many barriers mean people will revert to Dropbox and smuggled USB drives.)

And, beyond what Dropbox.com can provide:

  • Store stuff in the right jurisdiction.

  • Allow eResearch tools, such as the one we cover next to access data via full-service machine interfaces (APIs).

There is a promising new application in this space now, run by AARNET, called Cloudstor+. This gives Australian researchers 100 GB of free storage, which can be expanded at low cost. It runs on the open source OwnCloud platform.

But note that there are many kinds of data that should NOT be placed in sharing-syncing services for various privacy and other legal reasons.

Creating a bridge between working file-storage and the archive.

We are now starting to hand out file-shares, which will, of course, fill up with files as researchers begin to take advantage of the storage space. But what will happen to those files when articles are published, projects and grants finish, research staff leave the institution? There are good reasons in all these situations to make sure that data are catalogued, and stuff is transferred to the Archival Store.

But it would be naïve to think that just because there are good reasons for these things to happen that they will. That’s why we have been working out how to encourage researchers to deposit data at various points in the existing research lifecycle – see our previous post on data management use-cases, where we look at how and, more importantly, why people might be motivated to catalogue and deposit data.

Some data will come to the catalogue via applications like HIEv – the environmental data capture application. At the Hawkesbury Institute for the Environment (which is where the HIE in the name comes from) data is captured by technical research infrastructure and routed automatically to HIEv, where institute staff and collaborators can work with it. When they use a data set and publish an article or create a data set for re-use then they can trigger the process of having it sent to archival storage and cataloguing.

But for data that is not coming through a data capture application, uncatalogued, ‘wild’ or ‘feral’ data we want to provide a way for research teams to:

  • Look at their file-share and see all their (file-based) stuff.

  • Select groups of things that belong together, by directory, by file-type, by a search query, or by picking them out manually.

  • Add metadata to contextualise and explain the files, to support future re-use, and to explain how the data supports published findings.

  • Publish/archive the data by sending to ReDBOX, the archival part of the overall Research Data Repository, where librarians will help optimise metadata and mind the data for the appropriate length of time.

Enter CrateIt (or Cr8it – that’s ‘Crate-it’), an application to enable a user to pack-and-label-and-send as just described. In this part of the RDR project Lloyd is writing an OwnCloud plugin which can be used to find, preview, describe, pack and send research data files from the working store to the Research Data Repository for archival storage (or, in the case of very large data sets, send links to the files).

We have written previously about a prototype application that does a lot of this already but the OwnCloud version is promising because it is integrated with OwnCloud’s existing sharing and replication services so Cr8it can take advantage of its access control services.
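
As an illustration of the packing step only (Cr8it’s actual implementation is an OwnCloud plugin and may work quite differently), here is a sketch using the Library of Congress bagit-python library, with placeholder paths and metadata:

```python
import os
import shutil
import bagit   # pip install bagit (Library of Congress bagit-python)

# Files the researcher has selected in the working store (placeholder paths).
selected = ["/rds/project-x/raw/run1.csv",
            "/rds/project-x/docs/methods.txt"]

# Stage copies of the selected files, then turn the directory into a BagIt bag
# carrying some descriptive metadata; the bag can then be zipped and handed to
# the Research Data Catalogue.
staging = "/tmp/project-x-crate"
os.makedirs(staging, exist_ok=True)
for path in selected:
    shutil.copy(path, staging)

bag = bagit.make_bag(staging, {
    "Source-Organization": "University of Western Sydney",
    "Contact-Name": "A. Researcher",                              # placeholder
    "External-Description": "End-of-project data for Project X",  # placeholder
})
bag.validate()
```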

What next?

Work is proceeding now on the three priorities mentioned above; integration with version control systems, file-sharing and synchronisation and the Cr8it application for corralling files.

Beyond that, the future is less certain; the roadmap for eResearch at UWS, which is now more or less complete but yet to be approved by the eResearch Steering Committee, calls for a steady roll-out of:

  • More data capture applications at more sites, including research institutes and research groups.

  • Developing institute and school level data management plans following the lead of the Hawkesbury Institute for the Environment.

  • Further integrating data management services into the research lifecycle.

  • Improved integration with computing resources and collaboration tools.

  • Incremental improvements and upgrades to all existing services.

    [1] For a quirky take on this, consider Les Orchard’s musing on how it treats him like he treats his pets. This is an interesting way to think about service provision:

    consider these pointers for being nice to animals:

    • Give them a reason to come to you. Don’t chase after and grab.

    • If they want to leave, let them. Don’t hold on and squeeze tight.

    • If you are allowed to pick them up, hold them gently yet offer enough support to make them feel safe.

    • Pay attention to their reactions, learn what kind of attention they like. This gives them a reason to come back when you let them leave.

    Les lives with bunnies; I live with a dog. With dogs you need to show them very explicitly where they rank in the family pack (i.e. below the humans). That’s not a strategy I’d recommend IT or eResearch staff take with your local institute director!

Creative Commons License
Research Data Repository (RDR) progress report, May 2013 by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

eResearch projects, quick update

[update 2013-04-09 – a couple of minor corrections]

This week the eResearch steering committee at the University of Western Sydney is meeting for the first time. We will be bringing the new committee up to speed with all the existing projects, and diving into detail on some key projects.

This is a very quick high-level overview of the status of our major projects, all of which have been reported-on here on the blog before, apart from the very newest.

Finished

The Seeding the Commons project was recently successfully completed. This project was funded by the Australian National Data Service (ANDS) to establish infrastructure for a Research Data Catalogue (RDC). ANDS call these kinds of catalogues ‘Metadata Stores’. This was not just about software, it was about taking the first steps to creating an organisation-wide culture of data management, along with the DC21 data capture project described below, which is still going.


The UWS library team led this project, and they will be providing a project summary, including lessons learned and benefits accrued for publication here soon. Thanks team! (Their report will name all the names that need to be named).

Ongoing

HIEv (née DC21)

Another ANDS-funded project, DC21, Data Capture for Climate Change and Energy Research is nearing completion.

There has been some solid progress with this one since Peter Bugeia from Intersect took over the project management late last year:

  • The software application has a new name: HIEv. The name is not an acronym. It’s pronounced ‘hive’. The HIE bit is a reference to the Hawkesbury Institute for the Environment.

  • It’s in production, gathering data from four major research facilities for use within the institute.

  • Version 1.8 was rolled out this week, with training for early-adopters in the institute to follow soon, to be delivered by Peter B and new institute data manager Gerard Devine.

The next steps are to do detailed real-life trials of two major workflows:

  1. Making sure facility data can be presented to researchers in a usable, cleaned-up form in a way that minimises redundant effort and ensures that everyone is working with the same citable data-sets.

  2. Working out how to enable researchers to create and publish research data and code that is as complete as possible in support of research publications.

Over the next few months Gerry Devine will work to get as much (appropriate) data as possible from the institute into the system, and gather requirements to feed into a business-case for a further phase of the project.

The code for HIEv/DC21 is available on github.

Enterprise Research Data Catalogue (‘Metadata Stores’ – MS23)

The Metadata Stores project is nearly complete. We see this as an extension of the Seeding the Commons project, which recently concluded. Like that project, this is as much about working with the research community to create new ways of working in an increasingly data-driven research landscape as it is about installing software. But install software we have – the library has implemented the open source ReDBOX research data management software, funded by ANDS and now used by more than a dozen Australian universities.

The work on the catalogue has always been seen as part of a larger effort at UWS: the Research Data Repository Project.

Research data repository (RDR)

The RDR is a key part of the eResearch strategy at UWS (we don’t have a formally endorsed strategy, mind – that’s what the new committee is there for). There are lots of ways to carve up ‘eResearch’, but we are working with a simple model underpinned by three ‘pillars’:

  1. Research Data Management.

  2. Research Computing (including all kinds of devices from puny smart phones and tablets to cloud servers and High Performance Computing (HPC)).

  3. eResearch Collaboration tools and services.

The raw infrastructure is only part of the picture but it is the foundation. At UWS the Research Data Repository Project is the current focus for building this infrastructure.


Figure 1 The eResearch model for UWS – by Peter Sefton & Sarah Chaloner

Project manager Toby O’Hara has driven the rollout of the RDR – including project managing the Research Data Catalogue and the first basic Research Data Storage services for working data. On the working data front we now have some dedicated research data storage that can be accessed in various ways:

  1. As ‘R Drive’ shares.

  2. Mounted directly to research applications as database storage.

  3. Linked to replicated file-management services, such as Dropbox.com. A group of early-adopters are testing a process for sharing their files with a UWS Research Data account that links Dropbox (and soon other services) with backed-up university-provided services.

Buying storage is simple enough, but in an organisation with several thousand users, making sure that the help-desk know how to turn on that storage for the right people, and help them use it, is far from trivial, and definitely not quick. We’re on the way, though.

Next up, the draft plan calls for:

  • Providing services for our researchers who use code-version-control systems. Git and Mercurial are the current favourites – the researchers who live by these are the poster-children for reproducible research, and

  • Developing formal research data management plans across all parts of the university.

  • A campaign to put in place data capture projects for as much strategically important research data as possible.

  • Establishing a link between working and archival storage via a project with the working title Crate It – Cr8it! – see the new projects below.

Provided, that is, that we can get the resources to keep going.

 

New projects

Human Communications Science Virtual Lab

The major new eResearch project at UWS is the Human Communications Science Virtual Laboratory. This is a NeCTAR-funded project with a total budget in the region of three million dollars, $1.4 million of which came from the Australian Government and the rest from a number of Australian institutions, led by UWS. The HCS vLab has its own website with:

  • A statement of the problem we’re attacking.

    THE PROBLEM OF

    a lack of awareness, access and proficiency in the use of the full range of corpora, tools and techniques available to researchers of the diverse disciplines that constitute the human communication science research field

  • A description of the project.

    The HCS virtual Laboratory (HCS vLab) will connect HCS researchers, their desks, computers, labs, and universities and so accelerate HCS research and produce emergent knowledge that comes from novel application of previously unshared tools to analyse previously difficult to access data sets. The HCS vLab infrastructure will overcome resource limitations of individual desktops; allow easy access to shared tools and data; and provide the guided use of workflow tools and options to allow researchers to cross disciplinary boundaries.

RDR / Research Data Catalogue Spin-off: Cr8it!

The Research Data Repository we’re building at UWS encompasses two kinds of data in the Research Data Storage (RDS) component – there’s working data which is fluid, and archival data which needs to be managed for the long-term (or however long is required by the data management plan for a particular project).

Cr8it is designed to tackle the problem that many organisations are reporting: ‘We bought a petabyte of storage, let people use it, and now that it’s full, we’re wondering what’s in all those files! What to keep?’

Cr8it will provide a web-view of research data files in a way that:

  • Makes it easy to see what there is in the working part of the Research Data Store.

  • Allows researchers to identify, describe and package data at various points in the research lifecycle to deposit end-of-project data sets or create published data for papers.