eResearch Tools day. Is CKAN aDORAble?

On Tuesday August 12th the UWS eResearch team had one of our irregular-but-should-be-regular tool-hacking days, looking at the CKAN data repository software again. There were three main aims:

  1. Evaluate the software as a working-data repository for an engineering lab and maybe an entire institute, similar to the way HIEv sits in the Hawkesbury Institute for the Environment.

  2. Evaluate the software as a generic research data management solution for groups wanting to capture data into a repository as part of their research. Does CKAN fit with our principles for eResearch Development and Selection? Joss Winn wrote a couple of years ago about CKAN as a research data store and why they chose it at Lincoln, and there was a workshop last year in London which produced some requirements documents etc.

  3. Provide a learning opportunity for staff, giving them a chance to try new things and develop skills (such as using an API, picking up a bit of Python etc).

What happened?

David demoed CKAN and showed:

  • Simple map-based visualization using a spreadsheet of capital cities he found on the Internet

  • Simple plotting of some maths data

And then what?

We then broke up into small groups (mainly of size one, if we’re honest), to investigate different aspects of CKAN.

  • Katrina and Carmi: Looking at the options for uploading Excel files by ingesting some data.gov.au datasets. What can be done, what can’t? What happens with metadata?
  • David: Looking into the upload of a HIEv package/cr8it crate into CKAN. Can we automagically get the metadata out and stash it in CKAN? Can we represent the package’s file structure in CKAN?
  • Alf: Document this instance and preview infrastructure needs.
  • PeterS: previews for Markdown and other files; getting stuff out of files; events/queues; RDF and URIs.
  • PeterB: TOA5 uploads from HIEv
  • Lloyd: POST an ADELTA record into CKAN.

So how did we do?

Well, we got data moving around via a number of methods – spreadsheets went in via the web interface, documents went in over the API, documents came out over the API.

We learnt the differences between CKAN’s structured and unstructured data. "Structured" data is essentially tabular data: if you’re bringing it in via a CSV or a spreadsheet then it’s structured. This means it gets stored as a relational table within CKAN, so in principle you can access particular rows. Unstructured data is anything else, and you can access all of a blob or none of it.
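Row-level access to "structured" data goes through CKAN’s `datastore_search` action. A minimal stdlib-only sketch of building such a call is below; the site URL and resource id are placeholders, not anything from our instance.

```python
# Sketch of row-level access to "structured" (DataStore-backed) data via
# CKAN's action API. datastore_search is a real CKAN action; the site URL
# and resource id here are illustrative placeholders.
import json
import urllib.request

def datastore_search_request(site_url, resource_id, limit=5, offset=0):
    """Build a urllib Request for CKAN's datastore_search action."""
    payload = {"resource_id": resource_id, "limit": limit, "offset": offset}
    return urllib.request.Request(
        f"{site_url}/api/3/action/datastore_search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = datastore_search_request("http://ckan.example.org", "my-resource-id", limit=10)
# urllib.request.urlopen(req) would return JSON whose "result" -> "records"
# holds the requested rows - something a blob store simply cannot offer.
```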

We found gists handy for passing code snippets and wee “how-to” texts between the team on Slack.

My CKAN day…

Peter B

We had a reasonably successful day. I found that uploading a file resource through the CKAN API (from Python) worked much more easily with the extra documentation. We had some problems with the security key, in that the API wouldn’t run for me or Peter S when using our own keys, but it all worked when we used each other’s – reason: unknown. From a Python script we were able to open a specially formatted CSV file (TOA5 format from Campbell Scientific, which has two additional rows of metadata at the top), decode the first two rows and turn the metadata into name/value pairs when we created the CKAN dataset. So this was fairly flexibly done. A lot of our HIE climate-change data is formatted this way, which means we should be able to ingest records fairly readily as CSV.
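The TOA5 header handling described above could be sketched roughly as follows. The labels for the first ("environment") row follow Campbell Scientific’s TOA5 layout; treat this as an illustration rather than the exact script we ran, and the sample data is made up.

```python
# Rough sketch of decoding the two TOA5 header rows into name/value pairs.
# ENV_FIELDS follows the Campbell Scientific TOA5 environment line layout.
import csv
import io

ENV_FIELDS = ["format", "station_name", "logger_model", "serial_no",
              "os_version", "program_name", "program_signature", "table_name"]

def toa5_metadata(text):
    """Turn the first two TOA5 header rows into a metadata dict."""
    rows = list(csv.reader(io.StringIO(text)))
    meta = dict(zip(ENV_FIELDS, rows[0]))  # row 1: environment line
    meta["columns"] = rows[1]              # row 2: column names
    return meta

sample = ('"TOA5","HIE_Station","CR3000","1234","CR3000.Std","prog.cr3","42","Table1"\n'
          '"TIMESTAMP","RECORD","AirTemp","RH"\n'
          '"2014-08-12 00:00:00",0,18.2,55\n')
meta = toa5_metadata(sample)
# meta["station_name"] -> "HIE_Station"; meta["columns"][2] -> "AirTemp"
```

The resulting dict maps straight onto CKAN dataset fields or extras when creating the dataset.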

Alf

I wrote some short instructions (in a gist) on how to start up our CKAN instance.

Unfortunately the rest of the time was more heat than light: I read up on CKAN’s web-based previewing feature, which uses Recline.js as well as Data Proxy, but it is still a little unclear to me how it all ties together.

Peter B pointed out that extracting individual rows from datasets is possible if the dataset is kept in a database underneath CKAN rather than as a file "blob". So I did some reading and partial setup of the CKAN Data Storer Extension. The setup guide is aimed at someone with more Python experience than me, so I got trapped in "celery and pasta (paster) land" for most of the afternoon!

David

Initial success in dusting down my long-dormant Python skills and getting data in and out via the API was followed by losing a lot of time trying to extract the RDFa from the HIEv package’s HTML. Neither manual crufting nor Python’s RDFaDict (https://pypi.python.org/pypi/rdfadict) could get it all out (in fact, the library got nothing. Nothing!). The lesson here is to be sure that we put metadata in a place and a form from which we can get it out programmatically.

Notwithstanding that, CKAN had a lot going for it in terms of upload and access, but it wasn’t immediately clear how it would handle complex metadata within its data model.

Carmi

At Tools Day I learned to create a new dataset item plus upload a file with data to that item via the CKAN API using Python for the first time. That was the highlight for me. It was also interesting to see what is possible in terms of visualising data. I uploaded a few Excel spreadsheets and the graphing interface was very user-friendly. I would like to see it utilised for data visualisation in the Centre for the Development of Western Sydney’s website.

Petie

This time posting actual data to CKAN seemed easier – I am assuming the documentation must have improved. I managed to put together something that could create new datasets and attach new files – a potential denial-of-service attack against CKAN or a tool for testing its scalability. And at Peter B’s suggestion I worked on some very simple code to extract metadata and CSV from TOA5 files, as used by Campbell Scientific data loggers residing at the Hawkesbury Institute for the Environment.
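The create-dataset-then-attach-a-file flow mentioned above can be sketched with the stdlib alone. `package_create` and `resource_create` are real CKAN actions; the site URL, API key and dataset names below are placeholders.

```python
# Minimal sketch of creating a dataset and attaching a resource through
# CKAN's action API. Site, key and names are illustrative placeholders.
import json
import urllib.request

SITE = "http://ckan.example.org"
API_KEY = "my-api-key"  # each user's key, from their CKAN profile page

def action_request(action, payload):
    """Build an authorised POST request for a CKAN API action."""
    return urllib.request.Request(
        f"{SITE}/api/3/action/{action}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": API_KEY},
    )

create_ds = action_request("package_create",
                           {"name": "toa5-test", "title": "TOA5 test dataset"})
attach = action_request("resource_create",
                        {"package_id": "toa5-test", "url": "upload",
                         "name": "readings.csv", "format": "CSV"})
# urllib.request.urlopen(create_ds) etc. would perform the calls; looping
# over generated names is exactly the accidental load test described above.
```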

The $64,000 Question: is CKAN up to it?

In general, yes: CKAN seems to be a reasonable platform for data management that aligns well with our principles.

It has the basic features we need:

  • APIs for getting stuff in and out and searching

  • A discovery interface with faceted search

  • Previews for different file types

There are some limitations.

  • Despite what it says on the website and what Joss Winn reports, it’s not really ‘linked-data-ready’

  • It does have metadata, and that is extensible, but there’s no formal support for recognized ‘proper’ metadata schemas, just name/value pairs

There are some questions still to explore:

  • How well will it scale? We can probe this easily enough by pumping a lot of data into it

  • How robust and transactional is the data store? If we have different people or processes trying to act on the same objects at the same time will it cope or collapse?

  • Can we use more sophisticated metadata? We might look at the ability to add an RDF file that contains richer metadata than the built-in name/value pairs. How hard would this be? Could we allow richer forms for filling out, say, MODS metadata?

  • Ditto for using URIs. How easy would it be to add real linked-data support? Would a hack do? I.e., instead of storing name/value pairs, allow some conventions like name (URI)/value (URI). Again, how easy would it be to hack the user interface to support stuff like autocomplete using name authorities rather than collecting yet more strings?
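One way the name (URI)/value (URI) hack could look: keep using CKAN’s plain name/value extras, but adopt a convention that names are full URIs from known vocabularies. The URIs below are purely illustrative (the ORCID is a placeholder).

```python
# Illustrative convention: CKAN "extras" whose keys are vocabulary URIs
# and whose values are URIs where one exists. Not a CKAN feature - just
# a discipline layered on top of plain name/value pairs.
extras = [
    {"key": "http://purl.org/dc/terms/title",
     "value": "TOA5 climate readings"},
    {"key": "http://purl.org/dc/terms/creator",
     "value": "http://orcid.org/0000-0000-0000-0000"},  # placeholder ORCID
]

def extras_as_dict(extras):
    """Collapse CKAN-style extras into a simple URI-keyed mapping."""
    return {e["key"]: e["value"] for e in extras}

lookup = extras_as_dict(extras)
# lookup["http://purl.org/dc/terms/title"] -> "TOA5 climate readings"
```

A consumer that understands Dublin Core URIs could then interpret these extras without any CKAN schema support.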

Lessons learned

We didn’t talk to each other as much as we should have. This was possibly due to the venue – our offices – which meant people went to their desks. Next time we’ll be in a more interactive venue.

David was held up by the design of the data packages from HIEv – we need to revise the data packaging so that it has metadata in easy-to-use JSON as well as metadata embedded in RDFa.

Creative Commons Licence
eResearch Tools day. Is CKAN aDORAble? by members of the UWS eResearch team is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Principles for eResearch Systems’ Development and Selection

This blog post is a first attempt at a set of principles and best practices that we should strongly encourage for eResearch.

Summary manifesto

  • Thou Shalt Have No Data Without Metadata
    • RDF is best practice for Metadata
    • Use Metadata Standards where they exist
    • Use URIs rather than Scalars (eg Strings) as names
    • Name all data and metadata ASAP
  • Thou Shalt Separate Thy Concerns
    • Use "Small Pieces, Loosely Joined" in preference to big monolithic applications
    • Processes over Tools
    • Separate safe storage of Data-plus-metadata from processing, analysing, searching, viewing and other functions
    • Access all services via APIs
  • Data Is Everything[1], Everything Is[2] Data
    • Stuff collected, created and crafted in the process of research
    • Code, workflow and process descriptions
    • Publications and documentation
    … they’re all data, and hence, should not be without metadata.

No Data Without Metadata

Metadata tells you what the corresponding data actually is; without it we do not know what the data really means. We should capture metadata as soon as practical, preferably together with the data to which it applies.

Using URIs for subjects, predicates and (where applicable) values gives us precision and clarity. The semantics of ontologies are well defined, and the ability to refer to data objects via a globally unique, completely unambiguous reference will support the reuse of that data – one of the main pillars of eResearch. In general, Tim Berners-Lee’s Five Stars of Linked Open Data are relevant, but note that not all research data can or should be made available as open data, although linked data is better than non-linked data.

While we acknowledge that science data formats and instrument makers have their own metadata formats, as do the library community and agencies such as the Australian National Data Service, and that these may not be RDF- or Linked-Data-ready, we should use RDF and/or URIs as identifiers wherever possible. This includes storing metadata as RDF in our repositories. The abilities this gives us to link data and to search the metadata are too powerful to give up.

Separation Of Concerns

We strongly suspect that finding one single system which can do all things for all researchers is not going to happen. Instead, we believe that we should look to building ecosystems of collaborating systems, talking to each other over (preferably) standard APIs with each system doing specific tasks and doing them well.

Exposing services via well defined APIs gives several benefits:

  • workflow scripts can be developed to facilitate use of the services.
  • we can provide multiple implementations of a given function within the ecosystem, allowing users to choose one that gives them the facilities they need.
  • we can upgrade, or even completely change, services. As long as the implementations support the appropriate APIs, workflows should not be affected.
  • we can incorporate new systems into this ecosystem relatively easily, extending the range of services in the ecosystem

Tools are all well and good, and in an ideal world all our systems would work together to give us a shiny, synergistic whole. However, back on planet reality we have to be aware that there will be gaps between the tools. You know what? That can be OK. Small numbers of manual steps in a tool-chain don’t invalidate the process. Having said that, manual steps are potential sources of random mistakes and we should work towards minimising them.

Data Are Everything

Data includes inputs, results and physical specimens.

Metadata includes information about the whole context in which research is conducted: where, on what machines, with which chemicals, from which edition of the book, at what temperature the apparatus ran – anything that might influence the results.

At the core of eResearch practice is keeping data safe (remember: No Data Without Metadata). Different classes of data are safest in different homes, but ideally each data set or item should live in a repository, where:

  • It can be given a URI
  • It can be retrieved/accessed via a URI by those who should be allowed to see it, and not by those who should not
  • There are plans in place to make sure the URI resolves to something useful for as long as it is likely to be needed (which may be "as long as possible").

If we take the idea of separation of concerns seriously, then web views, search and query services are not part of the repository itself: indexes and web interfaces are separate concerns.

Types of repositories include:

  • Databases, where available
  • Digital object repositories that group together related files as data-sets, or items with common metadata
  • Code repositories for code and documentation
  • Publication and documentation repositories

Yes, we should add references to this document.

Creative Commons Licence
Principles for eResearch Systems’ Development and Selection by David Clarke & Peter Sefton is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

  1. Yes, pedants, we know: Data Are Everything

  2. This one we are sure about