eResearch Tools day. Is CKAN aDORAble?

On Tuesday August 12th the UWS eResearch team had one of our irregular-but-should-be-regular tool-hacking days, looking again at the CKAN data repository software. There were three main aims:

  1. Evaluate the software as a working-data repository for an engineering lab and maybe an entire institute, similar to the way HIEv sits in the Hawkesbury Institute for the Environment.

  2. Evaluate the software as a generic research data management solution for groups wanting to capture data into a repository as part of their research. Does CKAN fit with our principles for eResearch Development and Selection? Joss Winn wrote about CKAN as a research data store a couple of years ago, explaining why they chose it at Lincoln, and there was a workshop last year in London which produced some requirements documents etc.

  3. Provide a learning opportunity for staff, giving them a chance to try new things and develop skills (such as using an API, picking up a bit of Python etc).

What happened?

David demoed CKAN and showed:

  • Simple map-based visualization using a spreadsheet of capital cities he found on the internet

  • Simple plotting of some maths-data

And then what?

We then broke up into small groups (mainly of size one, if we’re honest), to investigate different aspects of CKAN.

  • Katrina and Carmi: Looking at the ability to upload Excel files by ingesting some data.gov.au datasets. What can be done, what can’t? What happens with metadata?
  • David: Looking into the upload of a HIEv package/cr8it crate into CKAN. Can we automagically get the metadata out and stash it in CKAN? Can we represent the package’s file structure in CKAN?
  • Alf: Document this instance and preview infrastructure needs.
  • PeterS: Previews for Markdown and other files; getting stuff out of files; events/queues; RDF and URIs.
  • PeterB: TOA5 uploads from HIEv
  • Lloyd: POST an ADELTA record into CKAN.

So how did we do?

Well, we got data moving around via a number of methods – spreadsheets went in via the web interface, documents went in over the API, documents came out over the API.

We learnt the differences between CKAN’s structured and unstructured data. "Structured" data is essentially tabular data: if you’re bringing it in via a CSV or a spreadsheet then it’s structured. What this means is that it gets stored as a relational table within CKAN and in principle this means you can access particular rows. Unstructured data is anything else, and you can access all of a blob or none of it.
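
To make the difference concrete, here is a minimal sketch (the CKAN URL, API key and resource id are placeholders, assuming a stock CKAN 2.x instance): a resource that has been loaded into the DataStore can be queried row by row with the datastore_search API action, whereas an unstructured resource can only be fetched whole.

```python
import requests

CKAN = "http://ckan.example.edu"   # placeholder CKAN instance
API_KEY = "my-api-key"             # placeholder API key
RESOURCE_ID = "resource-uuid"      # placeholder resource id

# Structured ("DataStore") resource: individual rows can be queried.
rows = requests.get(
    CKAN + "/api/3/action/datastore_search",
    params={"resource_id": RESOURCE_ID, "limit": 5},
    headers={"Authorization": API_KEY},
).json()["result"]["records"]
for row in rows:
    print(row)

# Unstructured ("blob") resource: all you can do is fetch the whole file,
# via the URL that resource_show reports for it.
res = requests.get(
    CKAN + "/api/3/action/resource_show",
    params={"id": RESOURCE_ID},
    headers={"Authorization": API_KEY},
).json()["result"]
blob = requests.get(res["url"], headers={"Authorization": API_KEY})
with open("blob.bin", "wb") as f:
    f.write(blob.content)
```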

We found gists handy for passing code snippets and wee “how-to” texts between the team on Slack.

My CKAN day…

Peter B

We had a reasonably successful day. I found the upload of a file resource through the CKAN API (from Python) worked much more easily with the extra documentation. We had some problems with the security key, in that the API wouldn’t run for me or Peter S when using our own keys, but it all worked when we used each other’s – reason: unknown. From a Python script we were able to open a specially formatted CSV file (TOA5 format from Campbell Scientific, which has 2 additional rows of metadata at the top), decode the first 2 rows and turn the metadata into name/value pairs when we created the CKAN dataset. So this was fairly flexibly done. A lot of our HIE climate change data is formatted this way, which means we should be able to ingest records fairly readily as CSV.
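
A rough sketch of the approach (the CKAN URL and key are placeholders, and the interpretation of the two header rows is illustrative only – real TOA5 headers vary with logger configuration):

```python
import csv
import requests

CKAN = "http://ckan.example.edu"   # placeholder CKAN instance
API_KEY = "my-api-key"             # placeholder API key

def toa5_to_ckan_dataset(path, dataset_name):
    """Read the two metadata rows at the top of a TOA5-style CSV and create a
    CKAN dataset whose extras carry that metadata as name/value pairs."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        names = next(reader)    # first header row: metadata field names (illustrative)
        values = next(reader)   # second header row: the corresponding values
    extras = [{"key": k, "value": v} for k, v in zip(names, values)]

    resp = requests.post(
        CKAN + "/api/3/action/package_create",
        json={"name": dataset_name, "extras": extras},
        headers={"Authorization": API_KEY},
    )
    resp.raise_for_status()
    return resp.json()["result"]
```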

Alf

I wrote some short instructions (in a gist) on how to start up our CKAN instance.

Unfortunately the rest of the time produced more heat than light, as I read up on CKAN’s web-based previewing feature, which uses Recline.js as well as Data Proxy, but it is still a little unclear to me how it all ties together.

Peter B pointed out that extracting individual rows from datasets is possible if the dataset is kept in a database underneath CKAN rather than as a file "blob". So I did some reading and partial setup of the CKAN Data Storer Extension. The setup guide is aimed at someone with more Python experience than me, so I got trapped in "celery and pasta (paster) land" for most of the afternoon!

David

Initial success in dusting off my long-dormant Python skills and getting data in and out via the API was followed by losing a lot of time trying to extract the RDFa from the HIEv package’s HTML. Neither hand-crafted extraction nor Python’s [rdfadict](https://pypi.python.org/pypi/rdfadict) could get it all out (in fact, the library got nothing. Nothing!). The lesson here is to be sure that we put metadata in a place and a form from which we can get it out programmatically.

Notwithstanding that, CKAN had a lot going for it in terms of upload and access, but it wasn’t immediately clear how it would handle complex metadata within its data model.

Carmi

At Tools Day I learned, for the first time, how to create a new dataset item and upload a data file to it via the CKAN API using Python. That was the highlight for me. It was also interesting to see what is possible in terms of visualising data. I uploaded a few Excel spreadsheets and the graphing interface was very user-friendly. I would like to see it utilised for data visualisation on the Centre for the Development of Western Sydney’s website.

Petie

This time posting actual data to CKAN seemed easier – I am assuming the documentation must have improved. I managed to put together something that could create new datasets and attach new files – a potential denial of service attack against CKAN or a tool for testing its scalability. And at Peter B’s suggestion I worked on some very simple code to extract metadata and CSV from TOA5 files, as used by Campbell Scientific data loggers residing at the Hawkesbury Institute for the Environment.
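
A minimal sketch of that kind of loop (instance URL, key and sample file are placeholders – and something to point only at a test instance):

```python
import requests

CKAN = "http://ckan-test.example.edu"   # placeholder: a *test* CKAN instance
API_KEY = "my-api-key"                  # placeholder API key

# Create a batch of throw-away datasets, each with one small file attached,
# to see how the instance copes with sustained API traffic.
for i in range(100):
    pkg = requests.post(
        CKAN + "/api/3/action/package_create",
        json={"name": "load-test-%03d" % i},
        headers={"Authorization": API_KEY},
    ).json()["result"]

    with open("sample.csv", "rb") as f:
        requests.post(
            CKAN + "/api/3/action/resource_create",
            data={"package_id": pkg["id"], "name": "sample.csv"},
            files={"upload": f},
            headers={"Authorization": API_KEY},
        )
```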

The $64,000 Question: is CKAN up to it?

In general, yes: CKAN seems to be a reasonable platform for data management that aligns well with our principles.

It has the basic features we need:

  • APIs for getting stuff in and out and searching

  • A discovery interface with faceted search

  • Previews for different file types

There are some limitations.

  • Despite what it says on the website and what Joss Winn reports, it’s not really ‘linked-data-ready’

  • It does have extensible metadata, but there’s no formal support for recognized ‘proper’ metadata schemas, just name/value pairs

There are some questions still to explore:

  • How well will it scale? We can probe this easily enough by pumping a lot of data into it

  • How robust and transactional is the data store? If we have different people or processes trying to act on the same objects at the same time will it cope or collapse?

  • Can we use more sophisticated metadata? We might look at the ability to add an RDF file that contains richer metadata than the built-in fields. How hard would this be? Could we allow richer forms for filling out, say, MODS metadata?

  • Ditto for using URIs. How easy would it be to add real linked data support? Would a hack do? I.e. instead of storing plain name/value pairs, allow some conventions like name (URI)/value (URI) – a rough sketch of this idea follows the list. Again, how easy would it be to hack the user interface to support things like autocomplete against name authorities rather than collecting yet more strings?
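
A rough sketch of the kind of convention we have in mind – nothing CKAN supports natively, just a discipline applied to its existing extras: the key is a property URI and the value is a URI drawn from a name authority (the ORCID and vocabulary URI below are placeholders).

```python
# Hypothetical convention: extras whose keys are property URIs and whose
# values are URIs from a name authority, rather than free-text strings.
extras = [
    {"key": "http://purl.org/dc/terms/creator",
     "value": "http://orcid.org/0000-0000-0000-0000"},      # placeholder ORCID
    {"key": "http://purl.org/dc/terms/subject",
     "value": "http://example.org/vocab/plant-ecosystems"},  # placeholder vocabulary term
]
```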

Lessons learned

We didn’t talk to each other as much as we should have. This was possibly due to the venue – our offices – which meant people went to their desks. Next time we’ll be in a more interactive venue.

David was held up by the design of the data packages from HIEv – we need to revise the data packaging so that it has metadata in easy-to-use JSON as well as metadata embedded in RDFa.

Creative Commons Licence
eResearch Tools day. Is CKAN aDORAble? by members of the UWS eResearch team is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Research Data Repository Services Delivered in Stage One

The Research Data Repository over the last year has delivered an impressive array of infrastructure and services.

Services that exist now

Service / How the service is delivered
  • Service: A researcher can request a Research Shared Drive up to 1 TB, with multiple users and access anywhere on UWS campus. An FAQ is online.
    How delivered: The request can originate from the researcher or from eResearch, and then the ITS team provision the share in accordance with the support plan. The request form is online.
  • Service: A researcher can back up their git repository onto the Research Data Store.
    How delivered: The service is delivered ad hoc by the eResearch team.
  • Service: A researcher can request a virtual machine.
    How delivered: The request can originate from the researcher or from eResearch, and then the ITS team provision the virtual machine in accordance with the relevant SOP.
  • Service: A researcher can deposit their research data in the Research Data Catalogue.
    How delivered: There are two ways to initiate the request: by self-service using an online form, or in discussions with eResearch and the Library Research Services team. Once initiated, the Research Services Coordinator – Library follows Library procedure in creating a new collection record and storing the data collection (as applicable).
  • Service: Library systems can harvest metadata from UWS and web sources of truth on a regular basis.
    How delivered: This metadata is stored in the Research Data Catalogue and provides lookup for applications like ReDBox and HIEv. The service is delivered in accordance with Library procedures.
  • Service: A researcher can use the Data Management Plan Checklist.
    How delivered: This is self-service by obtaining a copy of the checklist online, with support from the eResearch team as needed.
  • Service: A researcher can create a Data Management Plan.
    How delivered: This is self-service by obtaining a copy of the checklist online, with support from the eResearch team as needed. The eResearch team can and do occasionally write Data Management Plans on behalf of researchers, using the same template.
  • Service: A researcher can find out about data management practices and services on the web.
    How delivered: This is self-service via reading website content and following links for more information, with assistance from the eResearch team.
 

External services that we are supporting

Service / How the service is delivered
  • Service: A researcher can obtain a NeCTAR virtual machine of up to 2 cores at a time for up to 3 months.
    How delivered: eResearch can assist with access and set-up.
  • Service: A researcher can apply for medium and large (high intensity) virtual machines from NeCTAR.
  • Service: A researcher can get a Cloudstor+ account through AARNET; this is cloud storage for research, located within Australia.
    How delivered: eResearch are actively promoting this service and seeking user evaluations of it.
  • Service: A ReDBox administrator can raise a bug or issue with QCIF for resolution.
    How delivered: QCIF provide support, with assistance from the eResearch team.
 

What infrastructure has been delivered

Infrastructure – Storage
  • 127 TB of high-quality disk for researchers and research-related uses has been deployed. This storage is highly flexible and extensible and can be utilised as SAN or NAS depending on the need.
      – Migration of all data from the old 70 TB SAN.
  • Established a new service, the Research Shared Drive (SIF share).
      – New FAQ/README with instructions for installation, and also best practices in data management.
      – New support plan through close coordination between eResearch and ITS.
      – 10 research teams are currently using the RDS.
  • Storage has been connected to a number of virtual machines for research specific projects and applications.
Collaborative Storage
  • Explored and trialled several collaborative storage solutions, including Oxygen Cloud, WOS cloud, SparkleShare, and OwnCloud.
  • Selected OwnCloud based on experience at other organisations (such as AARNET and Lincoln University in the UK).
  • A trial was conducted whereby a link was made between Dropbox and the Research Shared Drive. The team set up a Dropbox account which can receive a copy of a researcher’s Dropbox, and store that data on the same Researcher’s Shared Drive. This system is still in development stages.
  • A trial was conducted whereby a link was made between Source Code Repositories (version control systems) and the Research Data Store. The link is demonstrated by a UWS git server which clones public access git repositories. By way of example, we cloned the eResearch-apps repository.
Up Next:
  • Trial a collaborative storage option based on OwnCloud.
  • Establish a mechanism by which a user pushes their git repository to UWS storage.
  • Serve the needs of researchers who use other version control systems such as Mercurial and SubVersion.
Infrastructure – Compute
  • 4 servers have been provisioned for research use (2 existing from HIE and 2 provided through the RDR); this is the Research Cluster.
  • The Research Cluster comprises 160 processor cores and 1,024 GB of memory.
  • 6 VMs which had been created previously were successfully migrated onto the Research Cluster.
  • There are 9 virtual machines which have been created in the research cluster, with plans to migrate more virtual machines across from the School of Medicine and other schools and institutes.
  • We can provision up to approximately 40 ‘medium intensity’ virtual machines.
Up Next:
  • Create canned virtual machines which come ready-made with the tools needed to analyse data.
Infrastructure – Software
  • New packaging software was developed for research data, called CrateIt (Cr8it). Cr8it was started under two different approaches. The first approach was to leverage a toolset called The Fascinator, and the other approach was to incorporate new features into OwnCloud.
  • Document conversion, such as ePub generation, was ported into OwnCloud-Cr8it.
  • An automatic generation of a combined metadata catalogue record plus manifest was started. The manifest will be human and machine readable, leveraging work done by the HIEv (DC21) project.
Up Next:
  • Create a Cr8it trial and roll it out.
  • Flesh out what metadata record needs to be created by the Cr8it packaging process.
Research Data Catalogue
  • A simple form was developed that a researcher can use to indicate that they have a data set they would like to archive.
  • A pro forma questionnaire has been developed by the Research Services team at the Library. A process for including a new data set was also developed by the Library Research team.
  • 3 new procedure documents were created which formalised the ingest of metadata from RHESYS (University Research Management System) and from external sources, such as ReDBox wiki, NHMRC and ARC. Approximately 1,500 researchers and 500 projects are in the Research Data Catalogue available via lookup when a new data collection record is created.
  • New Research Data Catalogue entries (30+) were added to Research Data Australia, searchable by anyone with web access.
  • The ReDBox application was set up so that people who create data sets at UWS also have their unique details merged with an existing (or newly created) record in the National Library of Australia database, which is linked to any other data sets or publications which they have created in the same field or under the same name.
  • A new feature in ReDBox was added whereby an administrator can view the results of ingesting records about people and research projects. These results are presented in the form of ingest reports, describing what was ingested, modified, or removed, to support Quality Assurance going forward.
  • A ReDBox support agreement was negotiated with QCIF, which provides bug fixes and technical support until December 2014.
  • A new wizard for creating a data management plan inside the data catalogue is currently being trialled. The idea is that any data management plan which is created will be stored in the catalogue along with the data, and can be exported as a pdf if needed.
Services – Research Data Management
  • A new Data Management Plan Checklist was created.
  • A new Data Management Plan Template was created.
  • An additional page was added to the Office of Research Services pages, which included:
      – Data Management defined,
      – Data Management best practices,
      – Links to RDR services,
      – Links to external services and more information as applicable, and
      – Standard pro forma language that researchers can use to complete their research application forms.
  • Internal application forms were improved to ask researchers to explain how data management will be addressed, including:
      – the internal grant application for UWS-funded research, and
      – the application form to start a new external grant application through ORS.
  • eResearch interviewed researchers with live projects and created 3 Data Management Plans using the Data Management Plan Template, plans which have been provided to the researchers.
  • eResearch interviewed managers of research facilities and drafted 4 Data Management Plans thus far, which have been provided to the facility managers.
Up Next:
  • Finalise Data Management Plans for our research facilities. In addition eResearch is currently assisting with new shared drives for these facilities (this is really BAU but is within the scope of the project).
  • Deposit the Data Management Plans in the Research Data Catalogue.
 

4A Data Management: Acquiring, Acting-on, Archiving & Advertising research data at the University of Western Sydney

This is a presentation with speaker notes from the Open Repositories 2013 conference at Prince Edward Island in Canada, as presented by Peter Sefton, written with Peter Bugeia.

[Update 2013-07-25: Added missing link to Kangaroo video]

Creative Commons Licence
4a Data Management by Peter Sefton and Peter Bugeia is licensed under a Creative Commons Attribution 3.0 Unported License

Slide 1

Notes

Abstract

There has been significant Government investment in Australia in repository and eResearch infrastructure over the last several years, to provide all universities with an institutional repository for publications, and via the Australian National Data Service to encourage the creation of institution-wide Research Data Catalogues, and research Data Capture applications. Further rounds of funding have added physical data storage and cloud computing services. This presentation looks at an example of how these streams of money have been channeled together at the University of Western Sydney to create a joined-up vision for research data management across the institution and beyond, creating an environment where data may be used by research teams within and outside of the institution. Alongside of the technical services, we report on early work with researchers to create a culture of replicable use of data, towards the vision of truly reproducible research.

This presentation will show a proven end-to-end design for research data flows, starting from a research group, The Hawkesbury Institute for the Environment, where a large sensor network gathers data for use by institute researchers, in-situ, with data flowing-through to an institutional data repository and catalogue, and thence to Research Data Australia – a national data search engine. We also discuss a parallel workflow with a more generic focus – available to any researcher. We also report on work we have done to improve metadata capture at source, and to create infrastructure that will support the entire research data lifecycle. We include demonstrations of two innovations which have emerged from the associated project work: the first is of a new tool for researchers to find, organize, package and publish datasets; the second is of a new packaging format which has both human-readable and machine-readable components.


Slide 2

Notes

Some of the work we discuss here was funded by the Australian National Data Service. See:

Seeding the commons project to describe data sets at UWS and the Data catalogue project.

HIEv Data Capture at the Hawkesbury Institute for the Environment


The talk

Notes

We’ll use the four A’s to talk about some issues in data management.

We need a simple framework which covers it all, to capture how we work with research data from cradle to grave:

We need to Acquire the raw data and make it secure and available to be worked on.

We need to Act on the data: cleanse it while keeping track of how it was cleansed, and analyse it using tools that support our research, all while maintaining the data’s provenance.

We need to Archive the data from working storage to an archival store, making it citable.

We need to Advertise that the data exists so that others can discover it and use it confidently with simple access mechanisms and simple tools.

4A must work for

high-intensity research data such as that from gene sequences, sensor networks, astronomy, medical diagnostic equipment, etc.

the long tail of unstructured research data.


For example

Notes

In the presentation, Peter Sefton used the short video linked here as an ice-breaker.

If only data capture were as simple as catching a kangaroo in a shopping bag!


Australian Government Initiatives in Research Data Management

Notes

There have been several rounds of investment in (e)research infrastructure in Australia over the last decade, including substantial investments to get institutional publications repositories established.

Australian National Data Service (ANDS) $50M (link)

National eResearch Collaboration Tools and Resources (NeCTAR) project (link) $50M

Research Data Storage Infrastructure (RDSI) $50M (link)

Implemented to date:

National Research Data Catalogue – Research Data Australia

Standard approach to updating the Catalogue (OAI-PMH and rif-cs)

10+ Institutional Metadata Repositories implemented

120+ data capture applications implemented across 30+ research organisations

Upgrade of High Performance Computing infrastructure

Colocation of data storage and computing


Slide 6

Notes

UWS is a young (~20 years) university performing well above most of its contemporaries in research.


Slide 7

Notes

This slide by Prof Andrew Cheetham – the Deputy Vice Chancellor for Research shows that UWS performs very well at attracting competitive grant income from the Australian Research Council.


Slide 8

Notes

UWS is concentrating its research into flagship institutes – here we will be talking in more detail about HIE, our environmental institute, which does research cutting across different disciplines, spanning from the leaf level to the ecosystem level.


Slide 9

Notes


Slide 10

Notes

Intersect is the peak eResearch organisation in the state of NSW:

Intersect was formed in 2008 in response to research IT needs.

The term ‘eResearch’ is used to refer to the application of advanced information and communication technologies to the practice of research. It enhances existing research processes, making them more efficient and effective, and it enables new kinds of research processes. eResearch brings together the effective management and organisation of research data with computing infrastructure and software applications to enable research and to facilitate collaboration between researchers.

eResearch loosely translates to e-Science or cyberinfrastructure, depending on which part of the world you come from.

Intersect is a not for profit company which is owned by its members (see list on next page)

Intersect currently consists of 60 staff, with eResearch Analysts on-site at member organisations (this is unique in Australian eResearch).

Services include: data capture solutions / software development, high-end data storage infrastructure, research data management planning, high performance computing (Intersect administers its own supercomputing facility and provides its members with a share of Australia’s leading computing infrastructure at the Australian National University), virtual computing, consulting, training, and strategic advice.

UWS is a member of Intersect


Slide 11

Notes

These are Intersect’s members. Intersect also collaborates with other eResearch organisations throughout Australia.

The slide is a photo from the recent Hackfest event. This is an annual fun competition for software developers to use open government data in innovative ways. Intersect hosted the NSW chapter of the event.


eResearch @ UWS

Notes

The eResearch unit at UWS is a small team, currently reporting to the Deputy Vice Chancellor, Research. See our FAQ.


Slide 13

Notes

At UWS, we haven’t tried to drive change with top-down policy. Instead, we’ve taken a practical, project-based approach which has allowed a data architecture to evolve. The eResearch Roadmap calls for a series of data capture applications to be developed for data-intensive research, along with a generic application to cover the long tail of research data.

The 4A Vision

For the purposes of this presentation we will talk about the ‘4A’ approach to research data management – Acquire, Act, Archive and Advertise. The choice of different terms from the 2Rs Reuse and Reproduce of the conference theme is intended to throw a slightly different light on the same set of issues. The presentation will examine each of these ‘A’s in turn and explain how they have helped us to organize our thinking in developing a target technical data architecture and integrated data-related end-to-end business processes and services involving research technicians and support staff, researchers and their collaborators, library staff, information technology staff, office of research services, and external service providers such as the Australian National Data Service and the National Library of Australia. The presentation will also discuss how all of this relates to the research project life cycle and grant funding approval.

Acquiring the data

We are attacking data acquisition (known as Data Capture by the Australian National Data Service, ANDS 1) in two ways:

With discipline specific applications for key research groups. A number of these have been developed in Australia recently (for example MyTARDIS 2), we will talk about one developed at UWS. With ANDS funding, UWS is building an open source automated research data capture system (the HIEv) for the Hawkesbury Institute for the Environment to automatically gather time-series sensor data and other data from a number of field facilities and experiments, providing researchers and their authorised collaborators with easy self-service discovery and access to that data.

Generic services for data storage via simple file shares, integration with cloud storage (including Dropbox.com and other distributed file systems), and source-code repositories such as public and private GitHub and Bitbucket stores for working code and textual data.

Acting on data

The data Acquisition services described above are there in the first instance to allow researchers to use data. With our environmental researchers, we are developing techniques for developing reusable data sets which include raw data, commented scripts to clean the data (eg a comment “filter out known bad-days when the facility was not operating”) then re-organize it via resampling or other operations into useful ‘clean’ data that can be fed to models, plotted etc and used as the basis of publications. Demo: the presentation will include a live demonstration of using HIEv to work on data and create a data archive.
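
As a small illustration of the kind of cleaning script described above (a sketch only, in Python/pandas with made-up file names, column names and dates; much of the real work at HIE is done in R):

```python
import pandas as pd

# Load raw sensor output (file and column names are made up for illustration).
raw = pd.read_csv("facility_raw.csv", parse_dates=["timestamp"], index_col="timestamp")

# Filter out known bad days when the facility was not operating (placeholder dates).
bad_days = pd.to_datetime(["2013-02-11", "2013-02-12"])
clean = raw[~raw.index.normalize().isin(bad_days)]

# Resample to hourly means so the cleaned series can be fed to models or plotted.
hourly = clean.resample("1H").mean()
hourly.to_csv("facility_clean_hourly.csv")
```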

From action to archive

Having created both re-usable base data sets and publication-specific operations on data to create plots etc there are several workflows where various parties trigger deposit of finished, fixed, citable data into a repository. Our project team mapped out several scenarios where data are deposited with different actors and drivers including motivations that are both carrot (my data set will be cited) and stick (the funder/journal says I have to deposit). Services are being crafted to fit in with these identified workflows rather than build new things and assume “they will come”.

Archiving the data

The University of Western Sydney has established a Research Data Repository[i] (RDR), the central component of which is a Research Data Catalogue, running on the ReDBOX open source repository platform. While individual data acquisition applications such as HIEv are considered to have a finite lifespan, the RDR will provide on-going curation of important research datasets. This service is set up to harvest data sets from the working-data applications, including the HIEv data-acquisition application and the CrateIt data packaging service, using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
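
For readers unfamiliar with OAI-PMH, a harvest is just a sequence of HTTP requests; a minimal sketch of what the catalogue’s harvester does against a data source (the endpoint URL is a placeholder):

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder OAI-PMH endpoint exposed by a working-data application such as HIEv.
ENDPOINT = "http://hiev.example.edu/oai"

resp = requests.get(ENDPOINT, params={"verb": "ListRecords", "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)

OAI = "{http://www.openarchives.org/OAI/2.0/}"
for record in root.iter(OAI + "record"):
    header = record.find(OAI + "header")
    # Each record identifies a data package the catalogue can ingest.
    # (A real harvester would also follow resumptionToken paging.)
    print(header.findtext(OAI + "identifier"), header.findtext(OAI + "datestamp"))
```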

Advertising the data

As with Institutional Publications Repositories, one of the key functions of the Research Data Repository is to disseminate metadata about holdings to aggregation services and give data a web presence. Many Australian institutions are connected to the Research Data Australia discovery service 6, which harvests metadata via an ANDS-defined standard over the OAI-PMH harvesting protocol. There is so far no Google-Scholar-like service which is harvesting data about data sets via direct web crawling (that we know about), so there are no firm standards for how to embed data in a page, but we are tracking the developments of the Schema.org vocabulary, which is driven largely by Google and a group of companies which are Google’s peers, and the work described above on data packaging with RDFa metadata is intended to be consumed by direct crawlers. It is possible to unzip a CrateIt package and expose it to the web, thus creating a machine-readable entry-point to the data within the Zip/BagIt archive.

Looking to the future, the University is also considering plans for an over-arching discovery hub, which would bring together all metadata about research, including information on publications, people, and organisations.

Technical architecture

The following diagram shows the first end-to-end data capture to archiving pathways to be turned on at the University of Western Sydney, covering Acquisition and Action on data (use) and Archiving and Advertising of data for reuse. Note the inclusion of a name-authority service which is used to ensure that all metadata flowing through the system is unambiguous and linked-data-ready 7. The name authority is populated with data about people, grants and subject codes from databases within the research services section of the university and from community-maintained ontologies. A notable omission from the architecture is integration with the Institutional Publications Repository – we hope to be able to report on progress joining up that piece of the infrastructure via a Research Hub at Open Repositories 2014.

[i] Project materials refer to the repository as a project which includes both working and archival storage as well as some computing resources, drawing a line around ‘the repository’ that is larger than would be usual for a presentation at Open Repositories.


Slide 14

Notes

There are a number of major research facilities at HIE, here are two whole-tree chambers which allow control over temperature, moisture and atmospheric CO2.


Slide 15

Notes

This diagram shows the end-to-end data and application architecture which Intersect and UWS eResearch built to capture data from HIE sensors and other sources. Each of the columns roughly equates to one of the four As. Once data is packaged in the HIEv, it is stored in the Research Data Store and there is a corresponding record for it in the Research Data Catalogue. The data packaging format produced by the HIEv, along with the delivery protocol, are key to the architecture: the data packaging format (based on BagIt) is stand-alone from the HIEv and self-describing, and the delivery protocol (OAI-PMH) is well-defined and standards-based. These are discussed in more detail in later slides. When other data capture applications are developed at UWS, to integrate into and extend the architecture they will simply need to package data in the same format and produce and deliver the same metadata via the same delivery protocol as the HIEv.


Slide 16

Notes

This diagram shows how the four ‘A’s fit together for HIE. Acquisition and action are closely related – it is important to provide services which researchers actually want to use and to build in data publishing and packaging services rather than setting up an archive, and hoping they come to it with data.


Slide 17

Notes

The HIEv/DC21 application is available as open source:

Funded by ANDS

Developed by Intersect

Automated data capture

Ruby on Rails application

Agile development methodology

Went live in Jan 2013.

1200 files, 15 GB of RAW data, 25 users.

120 files auto-uploaded nightly, +1GB per week

Expected to reach 50,000 files in next couple of years

Now extended to include Eucface data

Possibly to be extended to include Genomic data (20TB per year)

Integrated with UWS data architecture

Supports the full 4 As – links Acquire to Act to Archive


Slide 18

Notes

Acting on data: our researchers are now starting to do work with the HIEv system; here’s an API client developed by Dr Remko Duursma for consuming HIEv data from R.


Slide 19

Notes

Acting on data: researchers can pull data either manually or via API calls and do work, such as this R plot.


From acting to archiving…

Notes

The following few slides show how a user can select some files…


Slide 21

Notes

… look at file metadata …


Slide 22

Notes

… add files to a cart …


Slide 23

Notes

… download the files in a zip package …


Slide 24

Notes

… inside the zip the files are structured using the bagit format …


Slide 25

Notes

… with a standalone README.html file containing all the metadata we know about the files and associated research context (experiments, facilities) …


Slide 26

Notes

… with detail about every file as per the HIEv application itself


Slide 27

Notes

… and embedded machine readable metadata using RDFa lite attributes


Slide 28

Notes

… the RDFa metadata describes the data-set as a graph.

Completed packages flow through to the Research Data Catalogue via an OAI-PMH feed, and there they are given a DOI so they can be cited. The hand-off between systems is important: once a DOI is issued the data set has to be kept indefinitely and must not be changed.


Slide 29

Notes

Advertising – data. This is a record about an experiment on Research Data Australia.


Slide 30

Notes

Acquiring the data – long tail.

We looked in some detail at how the HIEv data capture application works for environmental data – but what about researchers who are on the long tail, and who don’t have specific software applications for their group?

We are working on a similar Acquire and Act service that will operate with files, and trying to make it as useful and attractive as possible. Most research teams we talk to at UWS are using Dropbox or one of the other ‘Share, Sync, See’ services. Dropbox has limitations on what we can do with its APIs and does not play nicely with authentication schemes other than its own, so we are looking at building ‘Acquire and Act’ services using an open source alternative: OwnCloud.

Our application is known as Cr8it (Crate-it).


Slide 31

Notes

A number of techniques employed at UWS:

the “R” drive

research-project-oriented data shares

synchronisation with dropbox and owncloud

synchronisation with github and svn

References

1. Burton, A. & Treloar, A. Designing for Discovery and Re-Use: the ‘ANDS Data Sharing Verbs’ Approach to Service Decomposition. International Journal of Digital Curation 4, 44–56 (2009).

2. Androulakis, S. MyTARDIS and TARDIS: Managing the Lifecycle of Data from Generation to Publication. In eResearch Australasia 2010 (2010), at <http://ccaeducause1.caudit.edu.au/index.php/eraust/2010/paper/view/62>

3. Sefton, P. M. The Fascinator – Desktop eResearch and Flexible Portals (2009), at <https://smartech.gatech.edu/handle/1853/28483>

4. Kunze, J., Boyko, A., Vargas, B., Madden, L. & Littman, J. The BagIt File Packaging Format (V0.97), at <http://tools.ietf.org/html/draft-kunze-bagit-06>

5. W3C RDFa Working Group & others. RDFa Core 1.1 Recommendation (2012), at <http://www.w3.org/TR/rdfa-syntax/>

6. Wolski, M., Richardson, J. & Rebollo, R. Shared benefits from exposing research data. In 32nd Annual IATUL Conference (2011), at <http://iatul2011.bg.pw.edu.pl/proceedings/ft/Wolski_M.pdf>

7. Berners-Lee, T. Linked Data (2006), at <http://www.w3.org/DesignIssues/LinkedData.html>

Seeding the Commons Data Sharing Project Complete.

The federally funded component of the UWS Seeding the Commons data description and sharing project has been completed. A comprehensive description of the Project is available via a previous blog post. Descriptions of 21 UWS data collections are now available in Research Data Australia. Some of the collections are open access, some are available via mediated access (contact the researcher to discuss access conditions) and some are metadata (description) only. The collections are also represented in Trove, the National Library’s single search portal, and discoverable by Google, Google Scholar and other search engines. Data collections with a DOI will be indexed in the new Thomson Reuters Data Citation Index, allowing them to be formally cited in research papers. Congratulations to all participating researchers who now have their data/data description accessible to a vast audience of international scholars, including potential collaborators.

Lessons Learned

A shift in the culture of data sharing is required to ensure that data does not remain the lost output of research. Whilst some have embraced sharing, others still insist they ‘just don’t want to’ share their data. A concerted effort is required to raise the awareness of UWS researchers on the benefits of data sharing through a campaign of communication, education and engagement for all in the research data lifecycle. If you know of a data sharing success story don’t be shy, spread the word.

Data description is complex and can appear daunting until all the pieces fall into place. A cheat sheet is in development and available on request to assist. Once refined the sheet will be published.

What’s Next?

Researchers may self-submit data descriptions and/or small data sets into the UWS Research Data Catalogue. Library staff will complete the metadata, confirm the record with the submitter and make it available in Research Data Australia.

Researchers wanting to share data/descriptions but who are unsure about self-submission may contact Susan Robbins at s.robbins@uws.edu.au or 9852 5458.

UWS is currently working on Cr8it (pronounced Crate-it) – a web-based packaging application for research data. It will give users an organised view of their files and as much metadata as possible automatically extracted from the files. Cr8it will let researchers identify related objects and organise them into a data ‘package’, adding more metadata and context if required, such as associating a package with a research institute, facility or experiment. Researchers will then be able to send the packages to the Research Data Catalogue and eventually push them out to a variety of other destinations, such as blogs or discipline repositories. Cr8it is currently at the proof of concept stage. To try out this service or learn more, please contact eResearch@uws.edu.au.

Currently we are investigating ‘use cases’ for depositing data at various stages of the research life cycle (e.g. at the inception of a research idea, when applying for a grant, when it is funded, etc.). These are mentioned in a previous blog post.

What’s in it for Me?

The opportunity to stay ahead of the pack. Aside from the practical issues of data storage and preservation, you could increase opportunities for collaboration and the impact of your research globally.

To arrange storage space for working data, or a secure space to archive and preserve data, contact: Toby O’Hara from eResearch at t.ohara@uws.edu.au or 4736 0928

To arrange for your data collection to be described and reflected in Research Data Australia (and associated locations) contact:

Susan Robbins Research Coordinator (Library) at s.robbins@uws.edu.au or 9852 5458

‘Moving Forward’

A series of UWS data-related webinars and workshops will soon be available to assist anyone/everyone involved in the research data lifecycle. To be informed when they are scheduled, please email Susan Robbins at s.robbins@uws.edu.au.

“Piled Higher and Deeper” by Jorge Cham
www.phdcomics.com

No Such Thing as a Dumb Question

I’m working with external collaborators – can I give them access to our Research Data Repository?

This is being investigated as a priority, but is unavailable at present.

I trip over boxes of old interviews in my lounge room – can you take them, digitise them?

UWS Archives is able to store these for the duration of the ethics application – contact RAMS. After the ethics expiry date, the data collection will be evaluated to determine the next step.

I’m retiring and have 20 years worth of research data on floppy discs. Can I give them to you to digitise, preserve and archive?

At present we don’t offer this option, but you can self-submit them and (soon) utilise Cr8it (see above) to manage the collection. Assistance is available. Contact Toby O’Hara at t.ohara@uws.edu.au, x2928.

Data are the New Black: Data Sharing in the National/International Arena

A selection of data related activities occurring internationally.

CSIRO to embrace open access Hare, Julie. The Australian [Canberra, A.C.T] 11 July 2012: 31

THE CSIRO is making freely available 200,000 research papers dating back to the 1920s on its new, open-access repository. It is also creating a portal to contain most of the raw research data used by the organisation since its inception.

“It’s a massive job. We will eventually have 86 years of data in the repository,” said Jon Curran, CSIRO’s general manager of communications.

“We are anticipating this is where the world of science is heading.

“The mood is there. And we know the more visible the work the more excitement and energy that is generated.” …

GigaScience (http://www.gigasciencejournal.com/), an innovative new journal handling ‘big data’ from the entire spectrum of life sciences, has now been launched by BGI

Geoscience Data Journal – a new Wiley open access data journal

“It is becoming increasingly important that the data which underpins key findings should be made more available to allow for the further analysis and interpretation of those results,” said Mike Davis, Vice President and Managing Director, Life Sciences Wiley. “The ability of researchers to create and collect often huge new data sets has been growing rapidly in parallel with options for their storage and retrieval in a wide range of data repositories. We are launching the Geoscience Data Journal in response to these important developments.”

http://au.wiley.com/WileyCDA/PressRelease/pressReleaseId-104139.html

Hindawi Datasets International

Publishing a Dataset Paper in Datasets International is all about the underlying raw and tabular data that the author has obtained during their experiment. Every table or image should be accompanied with a full description of how this data has been obtained; for instance, if you provide us with a graph, you should provide us with the tabular data you have used to draw this graph.

Datasets should contain detailed explanation of the methodology and materials used in conducting the experiment/observation and no final results or conclusions. Accordingly, manuscripts should be submitted along with all the relevant data.

Wikidata aims to create a free knowledge base about the world that can be read and edited by humans and machines alike. It will provide data in all the languages of the Wikimedia projects, and allow for the central access to data in a similar vein as Wikimedia Commons does for multimedia files.

Google Scholar already contains citations to datasets represented by a DOI.

An example of a data citation in a reference list

Birgisdottir, L., and Thiede, J.R.N. (2002) Carbon and density analysis of sediment core PS1243-1. PANGAEA. doi:10.1594/PANGAEA.87536.

Cited in Jourabchi, P., L’Heureux, I., Meile, C., & Cappellen, P. V. (2010). Physical and chemical steady-state compaction in deep-sea sediments: Role of mineral reactions. Geochimica Et Cosmochimica Acta, 74(12), 3494-3513. Retrieved from www.scopus.com

Creative Commons License
Seeding the Commons Data Sharing Project Complete. by Susan Robbins is licensed under a Creative Commons Attribution 3.0 Unported License.

Research Data Repository (RDR) progress report, May 2013

The RDR project at UWS started in 2010 with the purchase of some storage infrastructure, and was expanded in scope in 2012, based on this scoping document. Work began in earnest in June 2012 when project manager Toby O’Hara joined the team. We set out with these broad principles in mind:

The repository will consist of two main components:

  1. A scalable storage service linked to a combination of local and cloud-based high performance computing. Some data may also reside in other, trusted storage systems such as national infrastructure or discipline repositories with suitable governance in place.

  2. A catalogue of research data for internal use in management, and external use in dissemination and collaboration.

But the project is about much more than supplying storage and computing. It is about creating an organisational capability and culture of managing research data throughout the research lifecycle. We aim:

  • To enable research in all disciplines at UWS to take place efficiently and effectively on existing and new data sets.

  • To enable the validation of research through appropriate management of data inputs and outputs.

  • For re-use in new research which will cite the creators of data sets at UWS.

  • For compliance with funder requirements and codes of practice.

Those two main components are now established. We have both working storage (RDS) and archival storage (RDR) now commissioned and working on a small scale. (Note that terminology on this project has changed a bit – the RDR used to refer to all the components but it became quite clumsy to talk about ‘the archival repository part of the broader Research Data Repository’).

Figure – Super-simple view of the Research Data Repository with the two main kinds of storage – Working vs Archival

On top of that simple view, we can show how the RDR sits with other systems.

Figure – RDR interaction with two other services. Dropbox.com integration is a simple one-way approach, while the HIEv data capture application interacts with both working and archival storage via the Catalogue

There are many, many ways that these services could be extended but we have identified three high priorities from consulting with UWS researchers, and talking to other eResearch teams, which we’ll talk about in more detail below:

  1. Adding support for distributed version control systems used by tech-savvy researchers to manage software code and documents.

  2. Adding more support for distributed file-systems like Dropbox, but with better support for data security, access control and the ability to add eResearch applications over the top of the storage.

  3. Dealing with the looming ‘feral file’ problem, where data storage tends to fill up and there is a lack of options for researchers to hand over data to an archival store.

Dealing with source-code and document version-control systems

There are two widely used distributed version control systems: git and Mercurial. Many researchers use these to manage program code and/or document sources for publications in text markup such as LaTeX and, increasingly, Markdown, via tools like knitr in the R environment. We are working to add support for this class of repository in our repository, which should be fairly straightforward, as the modern distributed code repositories support the key use-case by design. That is, they allow you to ‘push’ code changes to more than one repository, so a UWS member of a team that is already happily working with, say, Bitbucket could push repository changes to a UWS archival repository for safe keeping, as well as to the team repository. Why would they want to do this? It’s not about short term risk, but about having copies of data that are independent of service providers that might come and go in the medium to long term. And it’s about exactly the same use-cases for packaging data and depositing in an archival repository as with any other data project, when projects end, articles are published etc. More on this in a post soon.

Future file systems

The Dropbox.com file sync-and-share product is a clear winner in the distributed file-system stakes. It has a low-friction viral quality that lets it spread in ways that permeated and subverted our institutional networks and command-and-control structures. And it has an unparalleled ease of use[1]. But there are two major problems:

  1. There are some kinds of data for which one should NOT use Dropbox.com: the researcher has to decide if they are meeting ethical standards, funder requirements and layers of institutional policy.

  2. And while Dropbox.com has an API – an interface against which third parties can write software applications – it is severely limited for doing the kind of ‘bridging’ work we want to do between the RDS working-data store and the RDR archival store.

So, the fact that Dropbox.com is so popular, and so good, makes it clear that even if we can’t match it completely, we should be thinking about how to provide a similar service so research teams can:

  • Store stuff on all their devices and have it automatically synchronise between them, with some limits about re-sharing.

  • Invite others that they identify as collaborators to see the files. (No, that does not mean getting them to fill in and sign a form to apply for a university account, the way I have heard it described at a big university not far from here; it means I send you an invitation by email, you log in using something that (a) suits you and (b) works – for example, a Gmail account – and once I’m sure that you are you, the sharing starts. Yes, there are exceptions where we need higher levels of assurance, but for most collaborations too many barriers mean people will revert to Dropbox and smuggled USB drives.)

And, beyond what Dropbox.com can provide:

  • Store stuff in the right jurisdiction.

  • Allow eResearch tools, such as the one we cover next to access data via full-service machine interfaces (APIs).

There is a promising new application in this space now, run by AARNET, called Cloudstor+. This gives Australian researchers 100 GB of free storage, which can be expanded at low cost. It runs on the open source OwnCloud platform.

But note that there are many kinds of data that should NOT be placed in sharing-syncing services for various privacy and other legal reasons.

Creating a bridge between working file-storage and the archive.

We are now starting to hand out file-shares, which will, of course, fill up with files as researchers begin to take advantage of the storage space. But what will happen to those files when articles are published, projects and grants finish, research staff leave the institution? There are good reasons in all these situations to make sure that data are catalogued, and stuff is transferred to the Archival Store.

But it would be naïve to think that just because there are good reasons for these things to happen that they will. That’s why we have been working out how to encourage researchers to deposit data at various points in the existing research lifecycle – see our previous post on data management use-cases, where we look at how and, more importantly, why people might be motivated to catalogue and deposit data.

Some data will come to the catalogue via applications like HIEv – the environmental data capture application. At the Hawkesbury Institute for the Environment (which is where the HIE in the name comes from) data is captured by technical research infrastructure and routed automatically to HIEv, where institute staff and collaborators can work with it. When they use a data set and publish an article or create a data set for re-use then they can trigger the process of having it sent to archival storage and cataloguing.

But for data that is not coming through a data capture application, uncatalogued, ‘wild’ or ‘feral’ data we want to provide a way for research teams to:

  • Look at their file-share and see all their (file-based) stuff.

  • Select groups of things that belong together, by directory, by file-type, by a search query, or by picking them out manually.

  • Add metadata to contextualise and explain the files, to support future re-use, and to explain how the data supports published findings.

  • Publish/archive the data by sending to ReDBOX, the archival part of the overall Research Data Repository, where librarians will help optimise metadata and mind the data for the appropriate length of time.

Enter CrateIt (or Cr8it – that’s ‘Crate-it’), an application to enable a user to pack-and-label-and-send as just described. In this part of the RDR project Lloyd is writing an OwnCloud plugin which can be used to find, preview, describe, pack and send research data files from the working store to the Research Data Repository for archival storage (or, in the case of very large data sets, send links to the files).

We have written previously about a prototype application that does a lot of this already but the OwnCloud version is promising because it is integrated with OwnCloud’s existing sharing and replication services so Cr8it can take advantage of its access control services.
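
As an illustration of the packing step only (Cr8it’s actual implementation is an OwnCloud plugin and may work quite differently), here is a sketch using the Library of Congress bagit-python library, with placeholder paths and metadata:

```python
import os
import shutil
import bagit   # pip install bagit (Library of Congress bagit-python)

# Files the researcher has selected in the working store (placeholder paths).
selected = ["/rds/project-x/raw/run1.csv",
            "/rds/project-x/docs/methods.txt"]

# Stage copies of the selected files, then turn the directory into a BagIt bag
# carrying some descriptive metadata; the bag can then be zipped and handed to
# the Research Data Catalogue.
staging = "/tmp/project-x-crate"
os.makedirs(staging, exist_ok=True)
for path in selected:
    shutil.copy(path, staging)

bag = bagit.make_bag(staging, {
    "Source-Organization": "University of Western Sydney",
    "Contact-Name": "A. Researcher",                              # placeholder
    "External-Description": "End-of-project data for Project X",  # placeholder
})
bag.validate()
```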

What next?

Work is proceeding now on the three priorities mentioned above; integration with version control systems, file-sharing and synchronisation and the Cr8it application for corralling files.

Beyond that, the future is less certain; the roadmap for eResearch at UWS, which is now more or less complete but yet to be approved by the eResearch Steering Committee, calls for a steady roll-out of:

  • More data capture applications at more sites, including research institutes and research groups.

  • Developing institute and school level data management plans following the lead of the Hawkesbury Institute for the Environment.

  • Further integrating data management services into the research lifecycle.

  • Improved integration with computing resources and collaboration tools.

  • Incremental improvements and upgrades to all existing services.

    [1] For a quirky take on this, consider Les Orchard’s musing on how it treats him like he treats his pets. This is an interesting way to think about service provision:

    consider these pointers for being nice to animals:

    • Give them a reason to come to you. Don’t chase after and grab.

    • If they want to leave, let them. Don’t hold on and squeeze tight.

    • If you are allowed to pick them up, hold them gently yet offer enough support to make them feel safe.

    • Pay attention to their reactions, learn what kind of attention they like. This gives them a reason to come back when you let them leave.

    Les lives with bunnies; I live with a dog. With dogs you need to show them very explicitly where they rank in the family pack (i.e. below the humans). That’s not a strategy I’d recommend IT or eResearch staff take with your local institute director!

Creative Commons License
Research Data Repository (RDR) progress report, May 2013 by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

eResearch projects, quick update

[update 2013-04-09 – a couple of minor corrections]

This week the eResearch steering committee at the University of Western Sydney is meeting for the first time. We will be bringing the new committee up to speed with all the existing projects, and diving into detail on some key projects.

This is a very quick high-level overview of the status of our major projects, all of which have been reported-on here on the blog before, apart from the very newest.

Finished

The Seeding the Commons project was recently successfully completed. This project was funded by the Australian National Data Service (ANDS) to establish infrastructure for a Research Data Catalogue (RDC). ANDS call these kinds of catalogues ‘Metadata Stores’. This was not just about software, it was about taking the first steps to creating an organisation-wide culture of data management, along with the DC21 data capture project described below, which is still going.


The UWS library team led this project, and they will be providing a project summary, including lessons learned and benefits accrued for publication here soon. Thanks team! (Their report will name all the names that need to be named).

Ongoing

HIEv (née DC21)

Another ANDS-funded project, DC21, Data Capture for Climate Change and Energy Research is nearing completion.

There has been some solid progress with this one since Peter Bugeia from Intersect took over the project management late last year:

  • The software application has a new name: HIEv. The name is not an acronym. It’s pronounced ‘hive’. The HIE bit is a reference to the Hawkesbury Institute for the Environment.

  • It’s in production, gathering data from four major research facilities for use within the institute.

  • Version 1.8 was rolled out this week, with training for early-adopters in the institute to follow soon, to be delivered by Peter B and new institute data manager Gerard Devine.

The next steps are to do detailed real-life trials of two major workflows:

  1. Making sure facility data can be presented to researchers in a usable, cleaned-up form in a way that minimises redundant effort and ensures that everyone is working with the same citable data-sets.

  2. Working out how to enable researchers to create and publish research data and code that is as complete as possible in support of research publications.

Over the next few months Gerry Devine will work to get as much (appropriate) data as possible from the institute into the system, and gather requirements to feed into a business-case for a further phase of the project.

The code for HIEv/DC21 is available on github.

Enterprise Research Data Catalogue (‘Metadata Stores’ – MS23)

The Metadata Stores project is nearly complete. We see this as an extension of the Seeding the Commons project, which recently concluded. Like that project, this is as much about working with the research community to create new ways of working in an increasingly data-driven research landscape as it is about installing software. But install software we have – the library has implemented the open source ReDBOX research data management software, funded by ANDS and now used by more than a dozen Australian universities.

The work on the catalogue has always been seen as part of a larger effort at UWS: the Research Data Repository Project.

Research data repository (RDR)

The RDR is a key part of the eResearch strategy at UWS (we don’t have a formally endorsed strategy, mind – that’s what the new committee is there for). There are lots of ways to carve up ‘eResearch’, but we are working with a simple model underpinned by three ‘pillars’:

  1. Research Data Management.

  2. Research Computing (including all kinds of devices from puny smart phones and tablets to cloud servers and High Performance Computing (HPC)).

  3. eResearch Collaboration tools and services.

The raw infrastructure is only part of the picture but it is the foundation. At UWS the Research Data Repository Project is the current focus for building this infrastructure.


Figure 1 The eResearch model for UWS – by Peter Sefton & Sarah Chaloner

Project manager Toby O’Hara has driven the rollout of the RDR – including project managing the Research Data Catalogue and the first basic Research Data Storage services for working data. On the working data front we now have some dedicated research data storage that can be accessed in various ways:

  1. As ‘R Drive’ shares.

  2. Mounted directly to research applications as database storage.

  3. Linked to replicated file-management services, such as Dropbox.com. A group of early-adopters are testing a process for sharing their files with a UWS Research Data account that links Dropbox (and soon other services) with backed-up university-provided services.

Buying storage is simple enough, but in an organisation with several thousand users, making sure that the help-desk know how to turn on that storage for the right people, and help them use it, is far from trivial, and definitely not quick. We’re on the way, though.

Next up, the draft plan calls for:

  • Providing services for our researchers who use code-version-control systems. Git and Mercurial are the current favourites – the researchers who live by these are the poster-children for reproducible research, and

  • Developing formal research data management plans across all parts of the university.

  • A campaign to put in place data capture projects for as much strategically important research data as possible.

  • Establishing a link between working and archival storage via a project with the working title Crate It – Cr8it! – see the new projects below.

Provided, that is, that we can get the resources to keep going.

 

New projects

Human Communications Science Virtual Lab

The major new eResearch project at UWS is the Human Communications Science Virtual Laboratory. This is a NeCTAR-funded project with a total budget in the region of three million dollars, $1.4 million of which came from the Australian Government and the rest from a number of Australian institutions, led by UWS. The HCS vLab has its own website with:

  • A statement of the problem we’re attacking.

    THE PROBLEM OF

    a lack of awareness, access and proficiency in the use of the full range of corpora, tools and techniques available to researchers of the diverse disciplines that constitute the human communication science research field

  • A description of the project.

    The HCS virtual Laboratory (HCS vLab) will connect HCS researchers, their desks, computers, labs, and universities and so accelerate HCS research and produce emergent knowledge that comes from novel application of previously unshared tools to analyse previously difficult to access data sets. The HCS vLab infrastructure will overcome resource limitations of individual desktops; allow easy access to shared tools and data; and provide the guided use of workflow tools and options to allow researchers to cross disciplinary boundaries.

RDR / Research Data Catalogue Spin-off: Cr8it!

The Research Data Repository we’re building at UWS encompasses two kinds of data in the Research Data Storage (RDS) component – there’s working data which is fluid, and archival data which needs to be managed for the long-term (or however long is required by the data management plan for a particular project).

Cr8it is designed to tackle the problem that many organisations are reporting: ‘We bought a petabyte of storage, let people use it, and now that it’s full, we’re wondering what’s in all those files! What to keep?’

Cr8it will provide a web-view of research data files in a way that:

  • Makes it easy to see what there is in the working part of the Research Data Store.

  • Allows researchers to identify, describe and package data at various points in the research lifecycle to deposit end-of-project data sets or create published data for papers.