What’s in the CKAN?

Creative Commons License What’s in the CKAN? by Peter Sefton and Kim Heckenberg, photos by Andrew Leahy is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

On Tuesday 4 March 2014, the extended UWS eResearch Team and our friend Gerry Devine, the Data Manager at the Hawkesbury Institute for the Environment (HIE), met on the UWS Hawkesbury campus for the first of a planned series of ‘Tool Day’ exploration and evaluation sessions.

These days are an opportunity to explore eResearch applications, ideas and strategies that may directly benefit UWS researchers during the research life cycle. This particular day looked at a back-end eResearch infrastructure tool, but we will also be running researcher-focussed workshops and training sessions using the Research Bazaar (#resbaz) methodology being developed by Steve Manos, David Flanders and team at the University of Melbourne.

The first application on the list was CKAN, which stands for Comprehensive Knowledge Archive Network and is an open-source

data management system that makes data accessible – by providing tools to streamline publishing, sharing, finding and using data. CKAN is aimed at data publishers (national and regional governments, companies and organizations) wanting to make their data open and available. See more at: http://ckan.org/

We are interested in the potential for CKAN as a data capture and working-data repository solution. In terms of the AAAA data management model we’re developing at UWS, that covers the first two A’s:

  1. Acquiring data – CKAN can accept data both from web-uploads and via an API.
  2. Acting on data – CKAN has a discovery interface for finding data sets, simple access control via group-permissions and ways to deal with tabular, spreadsheet-ish data online. It looks like a reasonable general-purpose place to put all kinds of data but particularly CSV-type stuff such as time-series data sets, which CKAN can preview, and plot/graph.
  3. Archiving data – archiving at UWS is expected to be handled by the institutional Research Data Repository (RDR) or a discipline-specific repository, so we’re looking at how CKAN can be used to identify and describe data sets and post them to an appropriate archival repository.
  4. Advertising data – the default for disseminating research data in Australia is to make sure data collection descriptions are fed to Research Data Australia, and that any relevant discipline-specific discovery services are aware of the data too.

Joss Winn at Lincoln in the UK has explored CKAN for research data management. He says:

Before I go into more detail about why we think CKAN is suitable for academia, here are some of the feature highlights that we like:

  • Data entry via web UI, APIs or spreadsheet import
  • versioned metadata
  • configurable user roles and permissions
  • data previewing/visualisation
  • user extensible metadata fields
  • a license picker
  • quality assurance indicator
  • organisations, tags, collections, groups
  • unique IDs and cool URIs
  • comprehensive search features
  • geospatial features
  • social: comments, feeds, notifications, sharing, following, activity streams
  • data visualisation (tables, graphs, maps, images)
  • datastore (‘dynamic data’) + file store + catalogue
  • extensible through over 60 extensions and a rich API for all core features
  • can harvest metadata and is harvestable, too

You can take a tour or demo CKAN to get a better idea of its current features. The demo site is running the new/next UI design, too, which looks great.

To start exploring the basic I/O capabilities of the CKAN application, the team separated into groups to perform various tasks. Andrew/Alf’s job was to build an instance of the CKAN environment on a UWS virtual machine running CentOS. The task involved chasing down a current installation guide that actually works. This proved challenging, as the CentOS documentation was six months old. Andrew achieved his mission, and claims to have learned something.

Peter B and Gerry were tasked with uploading data through the CKAN API. We (naively) thought that we might be able to write a quick script to suck data out of HIEv, the working-data repository for Gerry’s institute, and push it to the test CKAN instance that Intersect have set up as part of the Research Data Storage Initiative (RDSI). Initial progress was promising, and Gerry and Peter managed to create data sets in CKAN, but getting a file, any file, uploaded into a data set proved beyond us on the day.
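For readers who want a feel for the "create a data set" step, here is a minimal Python sketch of the kind of request involved. It assumes the CKAN version 3 Action API (the `package_create` action, with the API key sent in an `Authorization` header); the server URL, API key and data set name are made up for illustration, and this is not the script we wrote on the day. It only builds the request and does not send it, and it does not attempt the file upload that defeated us.

```python
import json
import urllib.request


def build_package_create_request(base_url, api_key, name, title, notes=""):
    """Build (but do not send) a request that asks CKAN to create a data set.

    Assumes the CKAN v3 Action API; all argument values are supplied by the caller.
    """
    payload = {"name": name, "title": title, "notes": notes}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical server, key and data set name, for illustration only.
    req = build_package_create_request(
        "https://ckan.example.org",
        "my-api-key",
        "wtc4-sensor-data",
        "Whole tree chamber 4 sensor data",
    )
    print(req.get_method(), req.full_url)
    # To actually send it to a real server you would do something like:
    #   with urllib.request.urlopen(req) as resp:
    #       print(json.load(resp)["success"])
```

Keeping the request-building separate from the sending makes it easy to check what would go over the wire before pointing the script at a real server.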

Lloyd and Graham explored the PHP CKAN API library, which has not been updated for four years and is not very complete. The library came with a hard-coded URL for a CKAN site; that is, it was set up to always talk to the same CKAN server, whereas an API library would normally take the server as an argument. Lloyd has fixed that and will offer the fix back to the developer, if we get a chance to test it. At the moment, though, we don’t have much confidence in that code.
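To illustrate what "take the server as an argument" means, here is a small sketch. It is written in Python rather than PHP and is not Lloyd's fix or the library's code; the class name `CkanClient` and both server URLs are hypothetical, and the `/api/3/action/` path assumes the CKAN version 3 Action API.

```python
class CkanClient:
    """Hypothetical minimal client: the server is passed in, not hard-coded."""

    def __init__(self, base_url, api_key=None):
        # Strip any trailing slash so URLs are built consistently.
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key

    def action_url(self, action):
        """Return the URL for a named Action API call on this client's server."""
        return f"{self.base_url}/api/3/action/{action}"


if __name__ == "__main__":
    # The same code can now talk to a test server or a production one.
    test_site = CkanClient("https://test-ckan.example.org/")
    real_site = CkanClient("https://ckan.example.org")
    print(test_site.action_url("package_list"))
    print(real_site.action_url("package_list"))
```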

(By the following evening we had sorted out the API problems, which seemed to be as simple as our trying to use the latest API library against a not-so-new server, and Gerry was able to upload data files to data sets.)

Open Questions about CKAN:

  1. Are there good ways to package multiple DataSets together for deposit as a data collection?
  2. How can we follow linked-data principles and avoid using strings to describe things? We’d really like to be able to link data sets to their research context, as discussed on PT’s blog:

    Turns out Gerry has been working on describing the research context for his domain, the Hawkesbury Institute for the Environment. Gerry has a draft web site which describes the research context in some detail – all the background you’d like to have to make sense of a data file full of sensor data about life in whole tree chamber number four. It would be great if we could get the metadata in systems in HIEv pointing to this kind of online resource with statements like this:

    <this-file> generatedBy https://sites.google.com/site/hievuws/facilities/eucface
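One way to make a statement like that machine-readable is to write it as a subject-predicate-object triple in which all three parts are URIs rather than strings. The sketch below formats such a triple in N-Triples syntax. The file URI and the `generatedBy` predicate URI are placeholders invented for illustration (neither HIEv nor CKAN is known to define them); only the facility URL comes from the statement above.

```python
def n_triple(subject_uri, predicate_uri, object_uri):
    """Format one subject-predicate-object statement as an N-Triples line."""
    return f"<{subject_uri}> <{predicate_uri}> <{object_uri}> ."


if __name__ == "__main__":
    line = n_triple(
        # Hypothetical URI for a data file.
        "https://hiev.example.org/data_files/1234",
        # Hypothetical predicate; a real one would come from a shared vocabulary.
        "http://example.org/terms/generatedBy",
        # The facility page mentioned in the statement above.
        "https://sites.google.com/site/hievuws/facilities/eucface",
    )
    print(line)
```

Because the object is a URI, anyone reading the metadata can follow it to the description of the facility instead of guessing what a free-text label means.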

A couple of CKAN annoyances:

  1. It’s not great that the API talks about “Packages” while the user interface says “Data Sets”.
  2. Installation is a bit of a chore. As Andrew puts it, it’s “scary”: you follow a long set of steps and only at the end find out whether it works. The Ubuntu installation is a little bit more structured, but still, some way-points would be good.
  3. It seems odd that the default installation does not include a data store, so by default it is only a catalogue. This tripped us up when trying to use the API.
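A quick pre-flight check might have saved us some time on the data store problem. The sketch below assumes that the server's `status_show` action reports the CKAN version and the list of enabled plugins under `ckan_version` and `extensions`, and that the data store appears in that list as `datastore`. The sample response is invented for illustration, so check what your own server returns before relying on it.

```python
def summarise_status(status_response):
    """Summarise a parsed JSON response from CKAN's status_show action.

    Assumes the response has a "result" dict with "ckan_version" and
    "extensions" keys; missing keys are treated as unknown or empty.
    """
    result = status_response.get("result", {})
    extensions = result.get("extensions", [])
    return {
        "ckan_version": result.get("ckan_version"),
        "datastore_enabled": "datastore" in extensions,
    }


if __name__ == "__main__":
    # Invented sample response, shaped like the assumed status_show output.
    sample = {
        "success": True,
        "result": {"ckan_version": "2.2", "extensions": ["stats"]},
    }
    print(summarise_status(sample))
```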

This was our first try at an eResearch Tools Day. Here are some notes for ourselves:

  1. While going out to lunch at the Richmond Club was quintessentially Western Sydney and quite pleasant, it is probably better to eat on-site and not break the flow by all jumping in the eResearch van. Pizza, delivered, next time.
  2. We do want to invite other eResearch types and, where appropriate, some researchers to some of these days, but we want the first few to be with people we know well so we can refine the format. (As noted above, these are technically focussed days for technical people, all about learning basic infrastructure, not about research questions. There will be other venues for researcher collaboration.)
  3. It should not take ten days for us to blog about an event – next time we’ll appoint a communications officer.