Hey, does this Data taste funny?

Hey, does this Data taste funny? by Andrew Leahy is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

eResearch was invited to be part of the higher degree research (HDR) student orientation session this week. It was great to see all those keen & passionate minds about to set out on an epic journey!

We were slotted after Janette’s talk about Ethics approvals, which was all about understanding and managing risk. This made an easy segue into risk around data.

At which point a USB key – figuratively loaded with 3 years of research data and an almost-completed thesis, and spiked with a small amount of potassium permanganate – was unceremoniously dropped into a beer glass… ooooopppps!

So, Data Management. We know it’s deadly boring, but it’ll make you cry if you don’t get it right. Please think about it as you start planning your research.

The eResearch Data Management and Technology Planning page is a good place to start.

UWS students, refer to your green HDR handbook, page 47, and if you have any IT related questions please check the UWS MyITPortal.

Good Luck!

What’s in the CKAN?

What’s in the CKAN? by Peter Sefton and Kim Heckenberg, photos by Andrew Leahy is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


On Tuesday the 4th of March 2014, the extended UWS eResearch Team and our friend Gerry Devine, the Data Manager at the Hawkesbury Institute for the Environment (HIE), met on the UWS Hawkesbury campus to have the first of a planned series of ‘Tool Day’ exploration and evaluation sessions.

These days are an opportunity to explore various eResearch applications, ideas and strategies that may directly benefit UWS researchers during the research life cycle. This particular day looked at a back-end eResearch infrastructure tool, but we will also be running researcher-focussed workshops and training sessions, using the Research Bazaar (#resbaz) methodology being developed by Steve Manos, David Flanders and team at the University of Melbourne.


The first application on the list was CKAN, the Comprehensive Knowledge Archive Network, which describes itself as an open-source:

data management system that makes data accessible – by providing tools to streamline publishing, sharing, finding and using data. CKAN is aimed at data publishers (national and regional governments, companies and organizations) wanting to make their data open and available. See more at: http://ckan.org/

We are interested in the potential for CKAN as a data capture and working-data repository solution. In terms of the AAAA data management model we’re developing at UWS, that covers the first two A’s:

  1. Acquiring data – CKAN can accept data both from web uploads and via an API.
  2. Acting on data – CKAN has a discovery interface for finding data sets, simple access control via group permissions, and ways to deal with tabular, spreadsheet-ish data online. It looks like a reasonable general-purpose place to put all kinds of data, but particularly CSV-type stuff such as time-series data sets, which CKAN can preview and plot/graph.
  3. Archiving data – archiving at UWS is expected to be handled by the institutional Research Data Repository (RDR) or a discipline-specific repository, so we’re looking at how CKAN can be used to identify and describe data sets and post them to an appropriate archival repository.
  4. Advertising data – the default for disseminating research data in Australia is to make sure data collection descriptions are fed to Research Data Australia, along with making sure that any relevant discipline-specific discovery services are aware of the data too.

Joss Winn at Lincoln in the UK has explored CKAN for research data management. He says:

Before I go into more detail about why we think CKAN is suitable for academia, here are some of the feature highlights that we like:

  • Data entry via web UI, APIs or spreadsheet import
  • versioned metadata
  • configurable user roles and permissions
  • data previewing/visualisation
  • user extensible metadata fields
  • a license picker
  • quality assurance indicator
  • organisations, tags, collections, groups
  • unique IDs and cool URIs
  • comprehensive search features
  • geospatial features
  • social: comments, feeds, notifications, sharing, following, activity streams
  • data visualisation (tables, graphs, maps, images)
  • datastore (‘dynamic data’) + file store + catalogue
  • extensible through over 60 extensions and a rich API for all core features
  • can harvest metadata and is harvestable, too

You can take a tour or demo CKAN to get a better idea of its current features. The demo site is running the new/next UI design, too, which looks great.

To start exploring the basic I/O capabilities of the CKAN application, the team separated into groups to perform various tasks. Andrew/Alf’s job was to build an instance of the CKAN environment on a UWS virtual machine running CentOS. The task involved chasing down a current installation guide that actually works – this proved challenging, as the CentOS documentation was six months out of date. Andrew achieved his mission, and claims to have learned something.

Peter B and Gerry were tasked with uploading data through the CKAN API; we (naively) thought that we might be able to write a quick script to suck data out of HIEv, the working-data repository for Gerry’s institute and push it to the test CKAN instance that Intersect have set up as part of the Research Data Storage Initiative (RDSI). Initial progress was promising, and Gerry and Peter managed to create data sets in CKAN, but getting a file, any file, uploaded into a data set proved beyond us on the day.

Lloyd and Graham explored the PHP CKAN API library, which is four years past its last update and not very complete. The library came complete with a hard-coded URL for a CKAN site (meaning it was set up to always talk to the same CKAN server; normally an API library would take the server as an argument). Lloyd has fixed that and will offer it back to the developer, if we get a chance to test it. At the moment, though, we don’t have much confidence in that code.

(By the following evening we had sorted out the API problems which seemed to be as simple as us trying to use the latest API library against a not-so-new server, and Gerry was able to upload data files to data sets.)
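For anyone wanting to retrace our steps, the data-set-creation calls can be sketched against CKAN’s Action API (v3). This is only a sketch – the instance URL, API key and field values below are hypothetical placeholders, and the actual HTTP call is shown in a comment rather than executed:

```python
# Hypothetical CKAN instance and API key -- substitute your own.
CKAN_URL = "https://ckan.example.org"
API_KEY = "my-secret-api-key"

def action_url(action):
    """CKAN Action API (v3) endpoint for a given action, e.g.
    'package_create' -- note the API says 'package' where the
    web UI says 'data set'."""
    return f"{CKAN_URL}/api/3/action/{action}"

def package_payload(name, title, notes=""):
    """Minimal JSON body for package_create."""
    return {"name": name, "title": title, "notes": notes}

# Creating a data set is then a POST with the API key in the
# Authorization header, e.g. with the requests library:
#
#   requests.post(action_url("package_create"),
#                 json=package_payload("hiev-test", "HIEv test data"),
#                 headers={"Authorization": API_KEY})
#
# Uploading a file into the data set uses resource_create with a
# multipart 'upload' field -- this is the step that tripped us up
# on the day, since it needs the file store configured server-side.
```

The "package" vs "data set" terminology mismatch noted below bit us here: the endpoints are all named after packages.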

Open Questions about CKAN:

  1. Are there good ways to package multiple data sets together for deposit as a data collection?
  2. How can we follow linked-data principles and avoid using strings to describe things? We’d really like to be able to link data sets to their research context, as discussed on PT’s blog:

    Turns out Gerry has been working on describing the research context for his domain, the Hawkesbury Institute for the Environment. Gerry has a draft web site which describes the research context in some detail – all the background you’d like to have to make sense of a data file full of sensor data about life in whole tree chamber number four. It would be great if we could get the metadata systems in HIEv pointing to this kind of online resource with statements like this:

    <this-file> generatedBy <https://sites.google.com/site/hievuws/facilities/eucface>

A couple of CKAN annoyances:

  1. It’s not great that the API talks about “Packages” while the user interface says “Data Sets”.
  2. Installation is a bit of a chore – as Andrew puts it, it’s “scary”: you follow a long set of steps and only at the end find out whether it works. The Ubuntu installation is a little more structured, but still, some way-points would be good.
  3. It seems odd that the default installation does not include a data store, so by default it is only a catalogue; this tripped us up when trying to use the API.
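For reference, enabling the DataStore in CKAN at the time meant switching on the extension in the instance’s .ini config and pointing it at a second PostgreSQL database – a sketch, with illustrative database names and credentials:

```ini
## production.ini -- enable the DataStore extension
ckan.plugins = datastore

## read/write URLs for the DataStore database (credentials illustrative)
ckan.datastore.write_url = postgresql://ckan_default:pass@localhost/datastore_default
ckan.datastore.read_url = postgresql://datastore_default:pass@localhost/datastore_default
```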

This was our first try at an eResearch Tools Day; here are some notes for ourselves:

  1. While going out for lunch at the Richmond Club was quintessentially Western Sydney and quite pleasant, it is probably better to eat on-site and not break the flow by all jumping in the eResearch van. Pizza, delivered, next time.
  2. We do want to invite other eResearch types, and where appropriate some researchers, to some of these days, but we want the first few to be with people we know well so we can refine the format. (As noted above, these are technically focussed days for technical people, all about learning basic infrastructure, not about research questions; there will be other venues for researcher collaboration.)
  3. It should not take ten days for us to blog about an event – next time we’ll appoint a communications officer.

Linux Desktops for Research Software

We recently ran a hands-on workshop for genomics researchers to introduce them to a set of powerful graphical applications for genomics analysis. The research staff and HDR students at the workshop were primarily from the Hawkesbury Institute for the Environment (HIE) and the School of Science and Health.

Researchers who need to analyse genomic data tend to be big consumers of computing resources. HIE researchers are using a large underlying computing environment (for the nerds: 48 CPUs and 512 GB of memory). This system is configured to run many simultaneous Linux desktop sessions. Linux is the most common choice for high-performance computing – 97% of the Top500 supercomputers use Linux. For HIE we run Linux systems with a graphical desktop interface, which complements the Windows or Apple Mac desktop environments that many researchers are familiar with. The Linux desktops are accessed using an application called NoMachine (NX), which requires a client program to be installed on the local computer. There are clients for Windows, Mac OS X and Linux, and soon iOS/Android. We are using the open-source implementation of NX from OpenNX.

OpenNX login screen

Once logged in, users are presented with a familiar interface with menus, icons and a file explorer. The Linux desktop minimizes the need to jump into a Linux command line, which can be daunting to new users. The main applications installed for HIE researchers are graphical, windowed programs such as CLC Genomics Workbench and Geneious Pro.

CLC Genomics Workbench Desktop

There are many desktop applications that can be accessed from powerful Linux computing environments. If you are interested in making use of large computing environments, check if there is a Linux version of your application – you may be surprised!

ANDS Uber Dojo

“Develop something researchers will find Cool. You’ve got 3 hours. Begin!” were the instructions barked by Dave Flanders from ANDS.

I turned to my esteemed colleague, Dr Peter Sefton: “I thought you said this was going to be fun?”

“It is fun, you just don’t know it yet”, he smiled.

Peter and I were the UWS representatives at the “Uber Dojo: Advanced Black Belt Event for Tools & Data in the Cloud”, an ANDS event that invited 40 of Australia’s proven research cloud developers into a glass-lined workshop for 2 days of skill-sharing and development hacking on the Aussie Research Cloud.

But hang on, Uber what? Well…

Since Dave arrived on the scene, the ANDS developer gatherings have taken a step away from Revenge of the Nerds towards The Karate Kid (the 1984 original, not the atrocious sequels), with lead developers labelled “sensei” and receiving “dan stripes” based on their ability to deliver solutions. Small group training sessions are called “dojos”, where participants take turns as Hands and Brains. Hands is the only person allowed to use the keyboard, and Brains wrangles the group and directs the Hands. Meanwhile the senseis lead the group by posing questions, Yoda-style.

Now back to David’s Challenge, “Develop something researchers will find Cool.”

Peter and I were initially going to re-implement our whizz-bang Research Data Australia real-time exploration tool using some of the new techniques we’d learned on Day 1. But after getting a glimpse of super-quick VM deployment during our session on Chef with Steve Androulakis (Monash) and Tim Dettrick (UQ), we decided to work on an idea that Peter had been pondering with researchers at UWS around reproducible research.

In a nutshell, the idea was to bring together three things –

1. A dataset
2. Code or toolset
3. System configuration (something which will run 2. using 1.)

Bundle these together to allow a researcher to run up a short-term virtual research environment on the Nectar Cloud, where they could do some work – e.g. confirm output or modify the code – and, when finished, have the components placed safely back in their respective repositories and the VM instance shut down.

For our example use case, the three components were: 1. forest-based climate data from our good friends doing climate-change experiments at the Hawkesbury Institute for the Environment; 2. R code that manipulates and presents this data; and 3. a script that programmatically creates a Linux VM running RStudio, loads the data and code, and presents them to the researcher.
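Sketching the bundle as a machine-readable manifest makes the idea concrete. Everything below is hypothetical – the URLs and file names are illustrative placeholders, not the ones we used on the day:

```python
import json

def make_manifest(dataset_url, code_url, setup_script):
    """Bundle the three components -- data, code, and the system
    configuration that runs the code over the data -- so a
    provisioning tool could build the environment, run the work,
    and tear it all down again afterwards."""
    return {
        "dataset": dataset_url,   # 1. a data set
        "code": code_url,         # 2. code or toolset
        "setup": setup_script,    # 3. system configuration
    }

manifest = make_manifest(
    "https://hiev.example.org/data/whole-tree-chamber-4.csv",
    "https://github.com/example/hiev-analysis-r",
    "bootstrap-rstudio.sh",
)
print(json.dumps(manifest, indent=2))
```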

You can read much more about this in Peter’s blog post.

30 minutes before pens-down we had one successful end-to-end run under our belt. But, like all good tech demos, I managed to botch the Apache permissions on my laptop, which stopped us from demoing the entire shoe-string & boot-lace apparatus. Probably not a bad thing – without snapshots, the Linux R build takes over 5 minutes to complete. For a couple of old hands hacking on brand-new tools we probably did okay.

What did I take away?

A brand-new appreciation of ephemeral (aka cloud) computing resources. To date I’ve been treating the Nectar VMs much like our institutional VMs – as a precious resource to be curated and managed. At UWS creating a VM typically takes a couple of weeks from inception to login prompt. This means we build persistent, long-term, server-like solutions, which carry the long-term overhead of patching, maintaining, securing and sysadmin’ing.

Having a computing environment that appears when needed to do a specific piece of work and then goes away is a huge change. I just need to concentrate on the 3 critical components – the data, the code, and the instructions to build the environment. I no longer have to be concerned about administering lots of long-term computing environments.

Unfortunately, our current UWS processes aren’t geared to anything besides long-running persistent VMs. During the dojo’s challenge sessions we probably created and destroyed more VMs than I’ve submitted server deploy requests for in the last 3 years of eResearch at UWS. This was completely mind-boggling for me, and it means a re-think about how we plan for research computing.

What makes all this possible is being able to spin up VMs from an API and provision software onto the host systems using automated tools like Puppet and Chef. The Nectar Research Cloud implements OpenStack, and we used the EC2 API with Python’s boto library to programmatically create VMs to our specification.
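A rough sketch of the shape of those calls, assuming OpenStack’s EC2-compatible endpoint. The host, port, image ID and flavour below are typical-looking placeholders rather than real Nectar values, and the boto call itself is shown in a comment rather than executed:

```python
def ec2_endpoint(host, port=8773, path="/services/Cloud"):
    """Build the EC2-compatible endpoint URL that OpenStack exposes.
    The port and path here are common OpenStack defaults, not
    verified against any particular Nectar deployment."""
    return f"https://{host}:{port}{path}"

def instance_spec(image_id, instance_type, key_name):
    """The keyword arguments you would hand to boto's
    run_instances() call."""
    return {"image_id": image_id,
            "instance_type": instance_type,
            "key_name": key_name}

# With boto this looks roughly like (not executed here):
#   import boto
#   from boto.ec2.regioninfo import RegionInfo
#   conn = boto.connect_ec2(access_key, secret_key,
#                           region=RegionInfo(endpoint="nova.example.org"),
#                           port=8773, path="/services/Cloud")
#   conn.run_instances(**instance_spec("ami-00001", "m1.small", "my-key"))
```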

I also have a new appreciation of modern coding environments in Python and Ruby – and we need more skills & training in this area.

So, was Peter right? Did it turn out to be fun?

Most definitely.

Looking forward to the developers challenge and hackfest at the eResearch Australasia conference Sunday 28 Oct – Wednesday 31 Oct.

My week with Google

by Andrew “Alf” Leahy

Unlike academics, professional staff at universities don’t often have the opportunity to visit and work with colleagues overseas. Perhaps someone reading can explain why? This was my second working trip, and all I can say is MAKE IT HAPPEN!! You will come back with new ideas and expertise, new networks, and increased confidence in your own role & abilities.

Quick background: Liquid Galaxy started as a 20% project of Google engineers Jason Holt and Dan Barcay as a way to quickly check and visually demonstrate Google Street View imagery. The project was extended to include Google Earth, enabling Earth to be configured across multiple PCs in an immersive fashion. I’ve worked on similar technology in the past and immediately jumped on the project when it was made public in late 2010. I’ve been working on it, mostly in my own 20% time :), with the primary goal of using it as a tool for compelling visualisation of our research data. Along the way I’ve built a few systems using resources graciously supplied by the School of Computing & Mathematics. At UWS the project is called Wonderama, mainly because we have a broad range of uses in addition to running Google content.

On the back of my involvement with Liquid Galaxy I was invited to spend the last week of June as a guest of Google at the Googleplex in Mountain View, California and at the Google I/O developers conference in San Francisco. There were multiple purposes to the visit: to meet face-to-face with the other developers; lend a hand during the setup of a new Liquid Galaxy for the conference; be available to chat with other developers about Liquid Galaxy; and to generally share ideas and immerse ourselves in all things Liquid Galaxy!

I arrived Saturday June 23 at San Francisco Airport to be greeted by my sponsor, Google engineer Jason Holt. Andreu Ibanez, a fellow Liquid Galaxy builder from Lleida, Spain, arrived an hour later. Jason and his family very graciously put us up at their home in Mountain View. We had the option to stay at a hotel, but I knew that wouldn’t be as much fun! That afternoon Jason gave Andreu and me a personal tour of the Google campus, which included a couple of the Liquid Galaxy rigs – my first time with the real deal. This was a good way to get any Google ‘fanboi’ sight-seeing out of our systems!

 

On Sunday we went on a tour of some of the community “maker” spaces that I particularly wanted to see while I was in the Valley. These included Google’s Workshops, The Hacker Dojo, BioCurious, Sawdust Shop and TechShop. Along the way we also visited The Computer History Museum and the NASA AMES Exploration Center, ending up at Stanford Campus and dinner in Stanford. An amazing day experiencing technologies at all different scales.

Monday when we arrived we were met by a throng of new employees (affectionately known as Nooglers) queuing for security passes and waiting to be inducted. Ducking around a few corridors, Jason led us to his new supersized cockpit-design Liquid Galaxy, with 80-inch screens – just wow! The End Point guys were already there and keen to start dismantling (from End Point we had Ben, Matt, Kiel and Zed). Like Andreu, these were Liquid Galaxy folk I’d interacted with via email, Skype and in Hangouts over the last 18 months. Handshakes over, we began loading up for the move to Moscone convention center. Once the truck was loaded we had time for a Liquid Galaxy state-of-the-nation meeting over lunch at a Google Cafe (the food’s all free you know!). Then off to San Francisco to check out Moscone Center and start the re-building. Here we met Shane, who was responsible for the machining and construction of a new rig frame which was due to arrive from Tennessee. However, by Monday afternoon it seemed that the shipping company had misplaced the truck! So a few of us headed back to Mountain View to dismantle and load the old frame onto another truck to be delivered to Moscone that evening. We finished the day at the Google Store to pick up some schwag, and dinner with the Google crew at one of their renowned cafeterias until we were the last people there!

Tuesday, Andreu and I checked into our hotel in San Francisco and made our way two blocks over to Moscone. The day was spent putting the final touches on the Liquid Galaxy, preparing tours of the amazing new Earth 3D imagery, and registering for the conference. Here is a video of the build in progress, courtesy of Andreu. I had the opportunity to set up our UWS-built Kinect control for Liquid Galaxy, and the team and some other exhibitors got to have a play. We returned to our hotel about 2am! The new frame was still AWOL.

Wednesday was Day 1 of Google I/O. 6000 conference attendees all in one keynote hall was pretty awesome – it included sky divers!! I spent most of the day shmoozing with visitors to the Google Maps Sandbox where the Liquid Galaxy rig was located. I met lots of great people, including fellow Aussies from Atlassian and Telstra. At lunch we heard from a relieved Shane that the new rig frame had finally turned up from Tennessee, hurrah. But this was going to be another late night! The Google I/O After Hours dinner & party ran until 10pm at Moscone. While we waited I set up the Kinect controller again and we gave passers-by a chance to ‘surf the planet’ Liquid Galaxy-style. After everyone else was kicked out, we dismantled and rebuilt the rig onto the new frame. I left in the early hours so I would be in a reasonable state to talk with people bright and early on Thursday. Matt & Kiel didn’t finish until daylight.

Thursday Day 2 of Google I/O and a brand-new Liquid Galaxy rig to wow people! Check out a great video of the system with Peter Birch here. Peter personally thanked us all, “You guys make us look good!”.
Another great big keynote from Google, and I went back to shmoozing, err, I mean networking. Here’s video proof! I did force myself to attend the Google Compute Engine technical session that was announced that morning. One great thing about Google I/O is that pretty much all the sessions are available online, either streamed live or a few days after. That evening all the Developer Sandbox areas were cleared away, so we dismantled and packed the rig for shipping back to Google in Mountain View the next morning. Unfortunately, being distracted by packing, I completely missed the Google I/O Ignite Sessions, which I’d hoped to catch live. David Weekly from The Hacker Dojo gave a 5-minute Ignite talk at I/O.

Friday morning, after Andreu and I confounded the Starbucks counter staff with our accents, we travelled with the End Point crew to Mountain View and re-assembled the rig at the Googleplex. This was followed by lunch in a Google cafeteria (did I mention the food is fantastic and free?) and a meeting with a Google Earth PM about future Liquid Galaxy projects. Late afternoon we finally said goodbye to our Googler hosts, and I headed back with the End Point crew and Andreu smack into San Francisco’s Friday evening rush hour *doh*. That left us with just enough time for a walk to Chinatown and a sunset dinner with a view of Coit Tower.

The last two days I managed some touristy things – walked The Embarcadero and pier markets, lunch at Fisherman’s Wharf, picked up a Zipcar, over to Lombard St and across the Golden Gate, which was completely obscured by fog! Dropped into picturesque Sausalito, continued around the Bay to visit the UC Berkeley campus, then across the Bay Bridge back to the city. Sunday was more R&R, including a walk over to South Beach to watch the Spain v Italy UEFA Cup Final with a bunch of excitable Spaniards, Spain winning 4-nil! A pleasant walk up through Chinatown and finally off to the airport for the 14-hour flight home to Sydney.

And then it was over. Way too soon! :-(

Highlights: Meeting and working with Jason, Ben, Kiel, Matt and Andreu shoulder-to-shoulder! Spending time with Googlers in their ‘natural habitat’. Talking a helluva lot about Liquid Galaxy. Networking and connecting with dozens of people & projects and bringing them back to UWS, including: Ben @Harvard World Map project, Electronic Arts HTML5 gaming guys, Michael @Ubilabs, European OpenData vis projects, Dave @Keyhole founder, Steven @Center for Advanced Spatial Analysis UCL, Square Enix immersive gaming, Cara @NASA AMES Outreach, Peter-Haris-Mano @Google Maps & Earth, Jenifer @Google Oceans, Piotr @Google Art Project, Tanya @Google for Good, etc. etc.

PS. My trip mobile trail with placemarks courtesy of Google Latitude for loading into Google Earth.