University of Western Sydney Enterprise Research Data Catalogue Project

[This document is a lightly-edited version of an approved project proposal written by staff at the University of Western Sydney for the Australian National Data Service (ANDS) metadata stores funding stream – we are publishing it here to assist in collaborating with other universities on their Metadata Stores projects. Some ANDS boilerplate text and financial information have been removed, and links added to materials that add context.]

ANDS Project Description

for

Enterprise Research Data Catalogue

ANDS Project Code: MS23

Document Version 1.0

Prepared by Peter Sefton and Peter Bugeia

University of Western Sydney

6/12/2011


 Project Description

Organisation responsible for the project (Subcontractor)

University of Western Sydney

Organisation that will undertake the work (Sub-Subcontractor)

ABN or ACN

530 140 698 81

Name of  Contact Person

Peter Sefton

Complete address and contact details of Contact Person 

eResearch Capability Team

Office of the Pro Vice Chancellor (Research)

Academic and Research Division

University of Western Sydney

Campus : Penrith (Werrington North)

Building : AD

Room : AD.G.15

Locked Bag 1797

South Penrith NSW  2751

T: 61 2 4736 0072

F: 61 2 4736 0905

p.sefton@uws.edu.au

ANDS Program

Metadata Stores

Project Summary

This project adheres to NCRIS funding requirements.

Funded activities are limited to: installation, configuration and testing of software; manual creation of metadata (beyond that required for software specification and testing); scoping exercises or studies in the amount of research data available at an institution.

The project does not use NCRIS funds for the following activities:

  • purchasing of IT hardware for storage or any other purpose;

  • ongoing staffing;

  • “proof of concept” software development;

  • funding of work by parties based outside Australia.

Any software development will be made available as open source.

Funding Sought

<removed>

Proposed project timeframe

10 months

Name of the person responsible for contract administration

<removed>

Names and affiliations of all collaborators if any

University of Newcastle – Vicki Picasso.

Other collaborators will be identified during the course of the project.

Background

The University of Western Sydney is undertaking the early stages of an internally funded project to establish a Research Data Repository [link added] (RDR) and associated infrastructure to support it. This project is being led by the eResearch Unit with the participation of IT, the Library and the Office of Research Services. The repository will consist of:

  • scalable, managed file storage for both working and archived data; 

  • access to virtualized computing infrastructure so that researchers can run data analysis tasks;

  • a research data catalogue containing metadata about data at a collection level for code-compliance, strategic research management and discovery purposes.

The storage component of the RDR was established in 2010. The next steps are to design the architecture that links the storage to computing infrastructure and cataloguing applications. This architectural work will be undertaken by the eResearch Unit, IT, and the University Library.

UWS has a nascent research data catalogue which is being established under ANDS project SC20.

Throughout this document the ‘metadata store’ for research data will be referred to as the ‘Research Data Catalogue’ to emphasise its role in the institution using a term that should be understandable to all stakeholders.

2.  Aims and Objectives

Alignment with ANDS Objective

already

to be

no

To manage metadata about data collections held at the institution

(some progress on SC20)

X

To enable discovery and reuse of data collections held at the institution

X

To support strategic planning for research in the institution

X

To ensure high quality metadata

X

Overview of project

The proposed metadata stores work outlined in this document will contribute to the RDR project by implementing the research data catalogue (metadata store) in the institutional context, establishing data sources for parties and activities from research and library systems, and providing an expanded platform for describing collections.

This will be built into an integrated system for recording catalogue-descriptions of research data collections with a view to it becoming the institutional research data catalogue for the university. There is opportunity for it to be collaboratively built to fulfil a broader set of institutional requirements than just those of the University of Western Sydney.

The University has chosen the ReDBox application as the research data catalogue to fulfil functional requirements under SC20. This Metadata Stores project will explore how it can be expanded to be the basis of the University’s institutional research data catalogue, and seek alternative and additional software solutions if necessary. It is proposed to conduct this analysis in concert with other institutions using the same software and/or with similar requirements, so that any software developed or purchased has a broad user base.

Scope and boundaries

The project will focus on the following:

  • implementation of the core deliverables (D1-D6) suggested by ANDS, as none of these are fully established at UWS,

  • the establishment of workflows for identifying collections, and

  • the integration of data management planning into the broader research lifecycle.

The primary driver for this work is to establish a picture at UWS of where research data resides and to establish infrastructure for researchers to be able to store and describe their data for later re-use by themselves, their research teams and students, and more globally. This work will aim to meet UWS requirements for research management and practice as well as the ANDS goal of sharing collection descriptions.

The full scope of the final project will be refined and specified in Deliverable D15, Project Management Plan.

Dependencies

This project depends on the SC20 project to establish the basic application. This is considered low risk as the same application is now in production at both the University of Newcastle and at Flinders University.

Overall Approach

Strategy and methodology

This project will use an agile project methodology for software development tasks and for other tasks such as evaluation of data sources. The exact nature of the project will be developed with the project manager and team and documented in deliverable D15, Project Management Plan.

UWS is aiming to collaborate with other institutions that are using similar software and with similar approaches to research data in general. This will provide an opportunity to work together to specify and deliver new software features which meet a common need. We have identified one partner, the University of Newcastle, and will work with them to recruit more.

Technical issues

Some technical issues which have presented themselves in the formative stages of this project include:

  • The relationship between storage infrastructure and the metadata catalogue and how these should be linked. Some attention will be given to specifying this interface in DC21 and SC20.

  • The relationship between NLA party IDs, local IDs and the forthcoming ORCID system, and the interfaces to all of these systems. This issue will need to be investigated with ANDS and the ANDS community.

Internal Resources

The exact breakdown of the resources needed for this project is not yet known but it will be led by the eResearch Unit and will involve library staff in sourcing data collections.

External resources

It is not known at this stage if external resources will be engaged but it is highly likely that if software development is required, expressions of interest will be sought from QCIF (where ReDBox is currently maintained) and Intersect, the NSW eResearch service provider, and possibly via the internal teams of universities partnering in this work.

Stakeholders

The project steering committee will consist of representatives from:

  • The eResearch unit.

  • Research Services.

  • IT.

  • The Library.

  • Researchers from various disciplines, by invitation, as needed.

4. Project Deliverables

D1

A working feed of records describing Collections and associated Activities, Parties and Services to Research Data Australia, in the current version of RIF-CS (1.3), demonstrated to meet the quality requirements for RIF-CS records as set by ANDS. This feed will contain additional descriptive metadata for newly identified collections, over and above the feed established in SC20 and will be available for use by researchers in an expanded range of discipline areas as per D2. RIF-CS 1.3 support will require an upgrade to ReDBox. The new Research Data Catalogue is expected to import the contents of the SC20 metadata store.
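To make the shape of such a feed concrete, the following is a minimal sketch of serialising one collection description as a RIF-CS-style registry object, using only the Python standard library. The keys, group name and originating source URL are hypothetical, the element set is deliberately reduced, and the output has not been validated against the RIF-CS 1.3 schema. It does not reproduce how ReDBox itself generates its feed.

```python
import xml.etree.ElementTree as ET

# Namespace used by RIF-CS registry objects (verify against the schema
# version targeted by the feed before relying on it).
RIFCS_NS = "http://ands.org.au/standards/rif-cs/registryObjects"
ET.register_namespace("", RIFCS_NS)


def q(tag):
    """Qualify a tag name with the RIF-CS namespace."""
    return "{%s}%s" % (RIFCS_NS, tag)


def build_collection_record(key, group, source, title, description, party_keys):
    """Return a <registryObjects> element holding one collection description."""
    root = ET.Element(q("registryObjects"))
    obj = ET.SubElement(root, q("registryObject"), {"group": group})
    ET.SubElement(obj, q("key")).text = key
    ET.SubElement(obj, q("originatingSource")).text = source

    coll = ET.SubElement(obj, q("collection"), {"type": "dataset"})
    name = ET.SubElement(coll, q("name"), {"type": "primary"})
    ET.SubElement(name, q("namePart")).text = title
    ET.SubElement(coll, q("description"), {"type": "brief"}).text = description

    # Link the collection to the parties (researchers, groups) that manage it.
    for party_key in party_keys:
        rel = ET.SubElement(coll, q("relatedObject"))
        ET.SubElement(rel, q("key")).text = party_key
        ET.SubElement(rel, q("relation"), {"type": "isManagedBy"})
    return root


# All values below are made up for illustration.
record = build_collection_record(
    key="example.edu.au/collection/1",
    group="Example University",
    source="http://catalogue.example.edu.au",
    title="Example rainfall observations",
    description="Daily rainfall readings from a hypothetical field site.",
    party_keys=["example.edu.au/party/1", "example.edu.au/party/2"],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

A real feed would carry many such registry objects, including separate party, activity and service records for the keys referenced here.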

D2

A feed of collections from at least three distinct Faculties (or equivalent organisational units) within the institution to Research Data Australia.

UWS is in the process of establishing 5 new flagship research institutes in addition to 10 existing Schools.  Priority will be given to collections sourced from the institutes, which represent a broad range of disciplines, under criteria based on those used in SC20. The most established of these include:

  • Hawkesbury Institute for the Environment (Climate Science)

  • Institute for Culture and Society.

  • MARCS Institute for Brain and Behaviour.*

  • Civionics* (Civionics is a discipline concerned with the use of electronic devices for monitoring civil engineering infrastructure)

*These are currently research centres in the process of becoming fully-fledged institutes.

D3

Demonstrated alignment of metadata records about Parties with an institutional name authority (HR or Library), with the authoritative form of the name sourced external to the metadata store, and with new researcher descriptions added to the metadata through regular updates from the name authority.

Party information will be sourced from the software system used by Research Services for administering UWS research, grants and projects. This will be integrated with the Research Data Catalogue via a name authority system with an automatic update. Party IDs will be minted using the local UWS Handle server.
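As an illustration of what a regular update from a name authority could involve, here is a minimal sketch in Python. The record fields, the `sync_parties` function and the `mint_handle` callback are all hypothetical; in particular, `mint_handle` stands in for a call to a Handle server and the `hdl:0000/` prefix is a placeholder, not a real UWS prefix.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class PartyRecord:
    local_id: str     # identifier used by the name authority (hypothetical)
    name: str         # authoritative form of the name
    handle: str = ""  # persistent identifier, minted once when the party is first seen


def sync_parties(
    catalogue: Dict[str, PartyRecord],
    authority_feed: Iterable[PartyRecord],
    mint_handle: Callable[[str], str],
) -> List[str]:
    """Apply one update from the name authority and return the IDs that changed."""
    changed = []
    for entry in authority_feed:
        existing = catalogue.get(entry.local_id)
        if existing is None:
            # New researcher: add a description and mint an identifier for it.
            catalogue[entry.local_id] = PartyRecord(
                entry.local_id, entry.name, mint_handle(entry.local_id)
            )
            changed.append(entry.local_id)
        elif existing.name != entry.name:
            # The authority's form of the name wins; the identifier is kept.
            existing.name = entry.name
            changed.append(entry.local_id)
    return changed


# Made-up data for illustration only.
catalogue = {"s001": PartyRecord("s001", "Smith, J", "hdl:0000/s001")}
feed = [PartyRecord("s001", "Smith, Jane"), PartyRecord("s002", "Nguyen, Minh")]
changed = sync_parties(catalogue, feed, lambda local_id: "hdl:0000/" + local_id)
print(changed)
```

Keeping the authoritative name outside the catalogue and overwriting the local copy on each update is one way to satisfy the requirement that the name be sourced externally. Deletions and merges of parties are not handled in this sketch.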

D4

Demonstrated alignment of metadata records about Parties with the ARDC Party Infrastructure Project, with researcher descriptions contributed to the NLA, and with People Australia identifiers for researchers recorded against researchers.

The project will evaluate the different options for feeding data to the NLA, choosing between a feed to ANDS in RIF-CS format or to the NLA, and if the latter, choosing which metadata format to use, either RIF-CS or EAC-CPF. The project will also investigate a solution for importing or aligning local IDs with NLA IDs and how to interoperate with the global ORCID system when it comes online.

D5

Demonstrated alignment of metadata records about Activities with institutional and external sources of truth (Research Office, ARC and NHMRC grant registries), with the authoritative description of the Activity sourced external to the metadata store, and with new research projects added to the metadata through regular updates from the sources of truth.

This deliverable will use the same data sources and processes as D3, with the addition of processes to import globally defined IDs for activities, such as ARC grants, with a process for aligning these with local views of the same data.

D6

Demonstrated workflow for registering new Collections in the university; this can include automated update, or semi-automated (notification-based).

This project will explore the following workflows for data collection registration, with the community of ReDBox user-organisations:

  • The existing library-mediated registration process established in SC20 with data-interviews informing curated descriptions.

  • Automated feeds from data capture systems, feeding into template records which have been curated as in the point above by the library. This will be piloted in the DC21 project.

  • A new system that will combine applying for data storage and creating a data management plan into a single form, so that describing and capturing data becomes part of institutional processes.

  • A system that allows researchers to capture and view data in the RDR-managed storage system or on local storage, and to curate it into collections, both by manually selecting items, and by rule (such as a metadata query or by location). This will have a plugin architecture to allow it to be adapted for different disciplines and file types and build on the integration work between DC21 and SC20.
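The last workflow above, curating items into a collection both manually and by rule, could be modelled along the following lines. This is a sketch under stated assumptions: the `DataItem` and `Collection` classes, the rule helpers and the example paths are all invented for illustration and do not describe any existing ReDBox or RDR interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class DataItem:
    path: str
    metadata: Dict[str, str] = field(default_factory=dict)


# A rule is any function that decides whether an item belongs to a collection.
Rule = Callable[[DataItem], bool]


def by_location(prefix: str) -> Rule:
    """Select items stored under a given path prefix."""
    return lambda item: item.path.startswith(prefix)


def by_metadata(key: str, value: str) -> Rule:
    """Select items whose metadata has the given key set to the given value."""
    return lambda item: item.metadata.get(key) == value


@dataclass
class Collection:
    title: str
    manual_paths: Set[str] = field(default_factory=set)  # items picked by hand
    rules: List[Rule] = field(default_factory=list)      # items picked by rule

    def members(self, items: List[DataItem]) -> List[DataItem]:
        """Return items selected manually or matched by at least one rule."""
        return [
            item
            for item in items
            if item.path in self.manual_paths or any(rule(item) for rule in self.rules)
        ]


# Made-up storage listing for illustration only.
items = [
    DataItem("/rdr/hie/plot1.csv", {"instrument": "logger"}),
    DataItem("/rdr/marcs/session1.wav", {"instrument": "microphone"}),
    DataItem("/local/notes.txt"),
]
collection = Collection(
    "Example field data",
    manual_paths={"/local/notes.txt"},
    rules=[by_location("/rdr/hie/")],
)
selected = [item.path for item in collection.members(items)]
print(selected)
```

Treating rules as plain functions is one simple way to leave room for discipline-specific plugins, since a plugin only needs to supply further rule functions.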

D7

A software system to realise deliverables D1–D6 (and D8, D13–D14 if applicable), with robust storage and management of metadata.

The starting point will be the software system used to implement SC20, the ANDS-funded ReDBox application. We will aim to undertake this work in concert with other institutions and evaluate the most appropriate way to create the new functionality, either by extending ReDBox or by using other systems.


Optional Deliverables

If your institution has already implemented some of the foregoing deliverables at an institutional level, ANDS expects that you will also include some of the following optional deliverables:

D8

Demonstrated ability to manage the following aspects of the collection lifecycle through recording and exposing relevant metadata related to:

  • D8.1 embargo dates for collections, where applicable

  • D8.2 current online location of collection (on internal store or external store)

  • D8.3 current offline location of collection

  • D8.4 intellectual property rights (licensing, restrictions on reuse)

  • D8.5 retention policy (disposal date, deposit date)

  • D8.6 policy framework (data management plan relevant, ethics clearance forms relevant)

Many of these functions are delivered by the ReDBox application out of the box; the implementation will make sure that they are adopted at UWS.
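For readers who want a concrete picture of the D8 lifecycle metadata, a minimal sketch follows. The field names are hypothetical and are not taken from ReDBox; the sketch only shows one way the D8.1 to D8.6 items could be recorded together and queried.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CollectionLifecycle:
    embargo_until: Optional[date] = None         # D8.1
    online_location: Optional[str] = None        # D8.2
    offline_location: Optional[str] = None       # D8.3
    licence: Optional[str] = None                # D8.4
    deposit_date: Optional[date] = None          # D8.5
    disposal_date: Optional[date] = None         # D8.5
    data_management_plan: Optional[str] = None   # D8.6

    def is_embargoed(self, today: date) -> bool:
        """True while the embargo date lies in the future."""
        return self.embargo_until is not None and today < self.embargo_until

    def is_due_for_disposal(self, today: date) -> bool:
        """True once the disposal date has been reached."""
        return self.disposal_date is not None and today >= self.disposal_date


# Made-up values for illustration only.
lifecycle = CollectionLifecycle(
    embargo_until=date(2013, 1, 1),
    online_location="https://rdr.example.edu.au/collections/1",
    licence="CC BY",
    deposit_date=date(2012, 6, 30),
    disposal_date=date(2020, 6, 30),
)
print(lifecycle.is_embargoed(date(2012, 12, 31)))
```

Whether a collection is released on the embargo date itself or the day after is a policy decision; this sketch assumes release on the date.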

D9

A public researcher or research profile portal, exposing publishable metadata about the research data being held at the institution.

Not a priority.

D10

Demonstrated ability to feed a selected subset of the collection records relating to a particular discipline to a discipline registry, following the metadata schema and conventions of that registry.

Not a priority.

D11

Demonstrated ability to manage the following aspects of the collection lifecycle through recording and exposing relevant metadata:

  • citation requirements (authoritative identifiers, including DOI, preferred citation format)
  • citation tracking of collections
  • audit information (refer to publications audit)
  • proprietary tools and formats used in collecting the collection

    Not a priority.

    D12

    Strategic reporting on contents and coverage of metadata store for internal use

    This is a key area for informing the establishment of a Research Data Repository and the organisational cultural environment in which it will exist. This project will aim to produce reports that can be used to track the growth of the RDR, via the Research Data Catalogue.

    D13

    Storage and exposure for discovery of object level metadata, and alignment of object level metadata with collection metadata (i.e. ability to navigate from object metadata to collection metadata; update of object metadata aligned with update of collection metadata)

    Not a priority.

    D14

    Storage and management of technical metadata for object and collection reuse, including software and equipment descriptions, methodology, and data interpretation

    Not a priority.

    Procedural Deliverables

    D15

    Project Management Plan, using the ANDS template, specifying the details of the planned activity, with risks, schedules, etc

    D16

    Progress Reports, using ANDS templates

    D17

    Final Report, using ANDS templates

    D18

    Deposit of any software (including stylesheets and schemata) developed in the project for achieving other deliverables, and that can be (usefully) used outside the institution, in either Google Code or SourceForge, including:

  • a Google code comment and tag or SourceForge summary and tag containing the text “ANDS-funded”
  • Developer manuals where applicable, to facilitate reuse
  • Deployment manuals to facilitate external deployment
  • User manuals to facilitate use

    D19

    A source code report, if any software is developed and publicly deposited under D18

    D20

    A User Acceptance Test online survey

    5. Assumptions, Constraints, Dependencies and Risks

    Assumptions

    Constraints

    Dependencies

    Risks*

    Staffing

    UWS will be able to provide staff to inform the project and recruit a project manager.

    The usual constraints of working in a university.

    This project depends on the RDR project, which is not yet established, but does have a budget.

    Project management and data librarian staff cannot be sourced.

    Organisational

    The RDR project will continue to develop, and storage will be available to researchers via some kind of easy-to-use application process.

    UWS project management and governance processes must be followed.

    This depends on the ITS budget.

    RDR storage does not come online.

    Technical

    The scope of the technical work is yet to be established – there are no indications that insurmountable challenges will arise.

    External Suppliers

    Software development can be sourced from QCIF or Intersect

    Legal/Ethical

    Other

    Researchers have limited time to participate.

    Early work on SC20 is finding that sourcing data collections is difficult

    Collections will be hard to source. (Mitigation: try to provide services that are of high value to researchers and collect metadata as a gateway to their provision, e.g. through the process of filling out applications for storage.)


    * – Where Risks have been identified, briefly outline your mitigation strategy.

    6. Stakeholder Analysis

    Stakeholder | Interest / stake | Importance
    eResearch Unit | Lead agency | High
    Library | Business owner for the Research Data Catalogue – operational responsibility for data curation. | High
    Research Services | Custodians of the ancillary data about parties and activities which support the RDC. | High
    Information Technology Services | Implementer / supplier of storage infrastructure and environment for the RDR | High

    7. Project Management

    Project Team, Roles and Responsibilities

    Role

    % EFT

    Responsibilities

    Recruitment required? (yes/no)

    In-kind contribution or ANDS funded?

    Project Manager

    50

  • Deliver the project to ANDS expectations.

  • Assume responsibility and accountability for each Deliverable.

  • Monitor and report to ANDS on project progress.

  • Advise ANDS if project appears to be in danger of non-delivery.

    yes

    ANDS funded

    Project steering committee

    ?

    Exact composition TBA.

    [Steering committee now established – chaired by a representative of the office of the Pro Vice Chancellor Research, has representatives from ITS, Library, Office of Research Services and eResearch.]

    In Kind

    Data librarians

    50%

  • Source data collections

  • Curate data descriptions


    In Kind

    eResearch team

    10%

  • Write policy and procedures for data management in the context of the RDR and RDC

  • Report to ANDS on project governance [fixed typo] issues


    In Kind

    8. Budget

    <removed>

    9. Exit and Sustainability Plans

    <This section was not filled in>

    10. Milestones for Payment

    Amount

    Indicative Timing

    Milestone

    25%
    <removed>

    Day One (1)

  • Contract execution

    25%
    <removed>

    Agreed project start date + eight (8) weeks

  • D15
    Project Management Plan, using the ANDS template, specifying the details of the planned activity, with risks, schedules, etc

  • D16
    Progress Report, using ANDS templates

    25%
    <removed>

    Agreed project start date + 30 weeks

  • D16
    Progress Report, using ANDS templates

  • D1
    A working feed of records describing Collections and associated Activities, Parties and Services to Research Data Australia, in the current version of RIF-CS (1.3), demonstrated to meet the quality requirements for RIF-CS records as set by ANDS

    25%
    <removed>

    52 weeks
    (Completion)

  • [D2–D7 mandatory deliverables]

  • [any optional deliverables, including D8–D14 where applicable]

    • D17
      Final Report, using ANDS templates

    • D18
      Deposit of any software (including stylesheets and schemata) developed in the project for achieving other deliverables, and that can be (usefully) used outside the institution, in an open source repository such as Google Code, SourceForge or GitHub:

  • a comment, summary or tag containing the text “ANDS-funded”

  • developer manuals where applicable, to facilitate reuse

  • deployment manuals to facilitate external deployment

  • user manuals to facilitate use.

    • D19
      A source code report, if any software is developed and publicly deposited under D18

    • D20
      A User Acceptance Test online survey

    11. Glossary of Terms

    Term

    Definition

    Collection

    A collection describes a grouping of physical or digital items of interest to the research community, particularly research data sets or physical collections of research materials.

    Activity

    An activity is an undertaking or process related to the creation, update, or maintenance of a collection.

    Party

    A party is a person or group related to an activity, to the creation, update, or maintenance of a collection, or to the provision of a service.

    Parties add to the discoverability of collections and add valuable contextual information, including assisting with determination of value for a collection. A party could be either a

  • group:  one or more persons acting as a family, group, association, partnership, corporation, institution or agency.
  • person:  a human being; or an identity (or role) assumed by one or more human beings.

    Appendix A. Check list of metadata store functionality

    The purpose of this background check is to determine the scope of the project by structuring an analysis of your institution’s data management readiness, and to provide a check list that reflects the functionality of an effective data collection infrastructure. Completion of the checklist is not mandatory, but may well be useful to your institution.

    Yes

    No

    Developing

    Does your institution have a Data Management Policy?

    X

    Is your institution able to automatically aggregate metadata about data collections from various areas/units within your institution?

    X

    Is any of this metadata exposed for discovery through a discipline portal?

    X

    Is any of this metadata exposed for discovery through an institutional portal?

    X

    Is any of this metadata exposed for discovery through Research Data Australia?

    X

    Are you able to expose and manage metadata about data collections at an object level? (Individual data objects; data collection methods; sample information; etc.)

    X

    Do you manipulate metadata descriptions aggregated from various areas of the institution, in order to align them with an institutional metadata standard?

    X

    Does your institution’s metadata conform or map to RIF-CS?

    X

    Does your institution’s metadata use controlled vocabularies?

    X

    Is your institution’s metadata integrated with institutional sources of truth (e.g. HR for researchers, Research Office for grants)?

    X

    Is your institution’s metadata integrated with national sources of truth (e.g. NLA Party, ARC/NHMRC grants registry)?

    X

    Do you have a process for registering new data collections as they are created?

    X

    When it comes to the core attributes of data collections required for effective data management, are you able to manage the following:

    Yes

    No

    Developing

    embargo dates for collections, where applicable?

    X

    current online location of collection (whether internal store or external store)?

    X

    current offline location of collection?

    X

    intellectual property rights – licensing, restrictions on reuse?

    X

    retention policy e.g. disposal date, deposit date?

    X

    policy framework  e.g. data management plan, ethics clearance forms?

    X