In characterizing the term Research Object, the call for proposals for Research Object 2018 uses the phrase “multi-part research outcomes with their context”. The DataCrate specification1 is a research data packaging and dissemination specification designed to capture exactly that: outcomes (and also inputs) and context.
DataCrate specifies how to gather together data in such a way that it can (a) be packaged via zip, tar, a disc image, a multi-part package or (b) be hosted on a web server or file share for inspection by potential users and/or used directly on High Performance Computing systems or otherwise accessed and analyzed.
DataCrates can contain any kind of data, and the contextual information may include, but is not limited to, data about the people, software and equipment used in the research as well as supporting documents such as publications, funding agreements or README files.
The DataCrate specification grew out of two generations of previous data packaging work. The first implementation was in the HIEv system at Western Sydney University [1], a data capture system for environmental science data that captures data files produced by sensor networks and allows manual or API-based upload of other files. Using a web interface, researchers can select files to export, for example to support a research article.
The requirements/principles for the HIEv system were presented at eResearch Australasia:

1. The packaging format should not be data-format-sensitive.
2. The packaging format should not be research domain specific.
3. The packaging format should not be technology or platform specific.
4. The data package should contain as much contextual information as possible.
5. Metadata should be easily human and machine-readable.
6. The package format should contain self-checking and verification features.
7. The metadata format should be compatible with the semantic web by using URIs as names for things, including metadata terms*.
8. Requirement 7 implies using Linked Data [2], but the project should not attempt to define and manage its own ontologies, for reasons of sustainability*.
9. A data package should be able to be displayed on the web, implying that the human-readable metadata in requirement 5 should be in HTML*.

*The last three requirements or principles were not explicit in the presentation but were discussed during the development, and proved increasingly important in the development of the current DataCrate specification.
The data packaging in HIEv used the BagIt [3] packaging spec to cover requirement 6; BagIt also doesn’t get in the way of any of the other requirements. The main innovation in HIEv’s packaging was to add an HTML file that covered requirements 4 (as much context as possible) and 5 (human and machine readable metadata). To do this, HIEv produces a summary of the context, with information about the facilities used – their name, nature and location – and technical details about the payload files, thus satisfying requirement 5. Using RDFa [4] to embed metadata in the HTML file gave both human (HTML) and machine (RDFa) views of the data.
Cr8it [5] was another early implementation that used the same basic idea for data packages in the OwnCloud file sharing application.
The first two DataCrate proto-implementations had no guidelines for what metadata to use beyond what was hard-wired into each code-base, so there was no hope of easy interoperability or safe extensibility, and there were no repositories into which data could be published. However, we had good feedback about the concept from the eResearch community and from the very limited number of researchers exposed to the systems, so in 2016, when UTS began work on a new Research Data Management service [6], we decided to properly specify a data packaging format that met the above requirements, and the DataCrate standard was born.
A team based at UTS, with some external collaborators, started a process to work out (a) whether an existing standard had emerged since the HIEv work that met the requirements; (b) if not, which RDF vocabularies we should use; and (c) the mechanics of organizing the files in the packages. At this point our specification had matured and we were looking to find or build a data packaging format which met the following DataCrate requirements: metadata in JSON-LD format using well-documented ontologies / vocabularies with coverage for who created it, what is it, where did the work take place, and what in the world is it about, for both the package/dataset and the file level. These requirements could all be accomplished by an update of the HIEv data package, but it was important to make sure we were not re-inventing something that had been done elsewhere.
We were not able to find any general-purpose packaging specification with anything like the HTML+RDFa index that HIEv data packages have, allowing for human and machine readable metadata. Using BagIt plus extra files worked well in our initial implementations, so unless a better alternative surfaced, the remaining decisions were around formalizing metadata standards.
BagIt, which had been used in HIEv and Cr8it, is an obvious standard on which to base a research data packaging format: it is widely used in the research data community, there is cross-platform (requirement 3) tooling available, and it covers the integrity aspects of packaging data.
Two data packaging options were identified as mature enough to be evaluated: Frictionless Data Packages and Research Objects.
Frictionless Data Packages [7], which use a simple JSON format as a manifest, have roughly equivalent packaging features to BagIt, with checksum features built in. In their favor, Frictionless Data Packages have the ability to describe the headers in tabular data files. However, they do not meet requirement 7 of having linked-data metadata, so while the JSON metadata is technically machine readable, in that it is simple to parse, it is not easy to relate to the semantic web as it does not use linked-data standards. Further, the terms are defined locally to the specification, without URIs. It is also unclear how to extend the specification in a standardized way, contrasting with linked-data approaches which automatically allow extension by the use of URIs.
As an example, the specification2 does not give a single way to describe temporal coverage:
Adherence to the specification does not imply that additional, non-specified properties cannot be used: a descriptor MAY include any number of properties in additional to those described as required and optional properties. For example, if you were storing time series data and wanted to list the temporal coverage of the data in the Data Package you could add a property temporal (cf Dublin Core):
"temporal": {
  "name": "19th Century",
  "start": "1800-01-01",
  "end": "1899-12-31"
}
This flexibility enables specific communities to extend Data Packages as appropriate for the data they manage. As an example, the Tabular Data Package specification extends Data Package to the case where all the data is tabular and stored in CSV. (https://github.com/frictionlessdata/specs/blob/0860ecd6bbb7685425e6493165c9b1a1c91eb16b/specs/data-package.md)
This laissez-faire extension mechanism in Frictionless Data Packages has appeal, but it is likely to result in a proliferation of highly divergent, non-standardized metadata, jeopardizing interoperability. DataCrate aims to encourage common behaviors and to facilitate metadata interchange by using JSON-LD and by specifying widely used metadata standards. So, using the temporal coverage example, DataCrate uses schema.org’s temporalCoverage property. The selection of schema.org certainly raised some eyebrows in the library world and it is far from perfect, but its adoption by major search engines led us to believe that it would achieve a ubiquity that would stand DataCrate in good stead. Indeed, this choice means that as services such as Google’s Dataset Search roll out, DataCrates will be indexed, whereas the semantics of Frictionless Data packages are less likely to be indexed in detail. See the details provided by Google3. For example, the root Dataset description in the DataCrate sample dataset includes a temporalCoverage property:
{
  "@id": "https://doi.org/10.5281/zenodo.1009240",
  "@type": "Dataset",
  <...>
  "name": "Sample dataset for DataCrate v1.0",
  "publisher": {
    "@id": "http://uts.edu.au"
  },
  "temporalCoverage": "2017"
}
The ability to resolve URIs is important if the CATALOG.html file is to be useful to humans. Many URIs in scientific and other ontologies do not resolve to a web page but to an ontology file, which is unhelpful for people trying to understand what metadata terms mean, so in DataCrate we introduced a mechanism for mapping non-human-friendly URIs to useful ones.
For example, the URI http://purl.org/ontology/bibo/interviewee from the [BIBO] ontology resolves to an ontology file which downloads as bibo.php. A user would then have to locate the definition, embedded in an rdfs:comment element in that file:
<rdfs:comment xml:lang="en">
An agent that is interviewed
by another agent.
</rdfs:comment>
By contrast, http://neologism.ecs.soton.ac.uk/bibo.html#interviewee is human-readable; the DataCrate solution is to map the “official”, machine-readable URI to the more useful one using schema:sameAs, and to link the term “interviewee” in the DataCrate Website to the useful URL, for the use of humans. The HTML in the DataCrate Website is self-documenting.
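A minimal, hypothetical sketch of this mapping in DataCrate-style JSON-LD (the exact layout of a real CATALOG.json may differ, and the interview and person identifiers are invented):

{
  "@context": {
    "interviewee": "http://purl.org/ontology/bibo/interviewee",
    "sameAs": "http://schema.org/sameAs",
    "name": "http://schema.org/name"
  },
  "@graph": [
    {
      "@id": "http://purl.org/ontology/bibo/interviewee",
      "sameAs": { "@id": "http://neologism.ecs.soton.ac.uk/bibo.html#interviewee" }
    },
    {
      "@id": "#interview-01",
      "interviewee": { "@id": "#person-1" }
    },
    { "@id": "#person-1", "name": "A. Person" }
  ]
}

A tool generating the DataCrate Website can then follow the sameAs link and use the human-readable page as the hyperlink target for the term.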
The other main alternative evaluated was the Research Object Bundle specification [8].
Rather than BagIt, the original version of Research Object Bundle used the Zip-based Universal Container Format - a format for which the documentation now seems to be unavailable from Adobe. It does not have integrity features such as checksums but there is a version of Research Object which uses BagIt4.
RO BagIt does use Linked-Data and for that reason was given careful consideration as a base-format for DataCrate. However, there were some implementation details that we thought would make it harder for tool-makers (including the core team at UTS); the use of “aggregations” and “annotations” introduces two extra layers of abstraction for describing resources.
For example, using the sample from the bagit-ro tool, there is a section in the manifest that lists aggregated files:
"aggregates": [
{ "uri": "../data/numbers.csv",
"mediatype": "text/csv"
}, ..
]
And a separate place to describe annotations on those files:
"annotations": [
{
"about": "../data/numbers.csv",
"content": "annotations/numbers.jsonld",
"createdBy": {
"name": "Stian Soiland-Reyes",
"orcid": "http://orcid.org/0000-0001-9842-9718"
}
}
]
The actual description of the numbers.csv file resides in annotations/numbers.jsonld:
{ "\@context": { "\@vocab": "http://purl.org/dc/terms/",
"dcmi": "http://purl.org/dc/dcmitype/Dataset"},
"\@id": "../../data/numbers.csv",
"\@type": "dcmi:Dataset",
"title": "CSV files of beverage consumption",
"description": "A CSV file <...>"
}
In our judgement, the level of indirection and number of files involved in the Research Object approach were not suitable for DataCrate as the implementation cost for tool makers would be too high. In making this choice we forfeited the benefits of being able to make assertions about the provenance of annotations as distinct resources, and the more intellectually satisfying abstractions about aggregations offered by ORE.
We settled on an approach which used:
- BagIt for packaging and integrity checking.
- A machine-readable catalogue file in flattened JSON-LD, mainly using schema.org terms.
- A human-readable CATALOG.html file (the DataCrate Website) generated from the JSON-LD.
- A DataCite.xml file containing a data citation (a text version of which is prominent in the HTML file if it exists).

Uses for DataCrates planned at UTS include:
- Making DataCrates available for download from a public website, both packaged as Zip files and expanded so that users can peruse the DataCrate Website via the CATALOG.html file and access individual files.
- Using the DataCrate with additional metadata for archiving and preserving data (in a project to begin in 2019).
- Using the DataCrate format to allow exchange of data between systems, for example sending data from a repository in a university facility such as the OMERO microscopy repository [9] to a git project system like GitLab.
- Automatically detecting metadata in DataCrates which are uploaded to our research-integrity-driven Research Data Management system [6].
The initial version of DataCrate (v0.1) was developed in 2017. V0.1 persisted with HTML+RDFa for human and machine readability, but this was cumbersome to generate and was removed at the suggestion of Eoghan Ó Carragáin5 in favor of an approach where the human-centred HTML page is generated from a machine-readable JSON-LD file rather than the other way around.
Prior to selecting schema.org, we evaluated a variety of standards, including Dublin Core [10], which is very limited in coverage, and DCAT [11], which is more complete for describing datasets at the top level but silent on the issue of describing files or other contextual entities and relationships between them. We discovered that schema.org has the widest range of terms needed to describe “who, what, where” metadata for datasets. Over the course of more than a year we have worked through the process of creating DataCrates for data from a variety of disciplines and found that schema.org has the coverage to meet the requirement.
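As a minimal, hypothetical sketch (not drawn from any of the examples below; the names, identifiers and exact context layout are invented), a flattened schema.org description covering who, what and where at both the dataset and the file level might look like this:

{
  "@context": "http://schema.org/",
  "@graph": [
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Example field survey",
      "creator": { "@id": "#researcher-1" },
      "contentLocation": { "@id": "#site-1" },
      "about": { "@id": "#soil-moisture" },
      "hasPart": { "@id": "readings.csv" }
    },
    {
      "@id": "readings.csv",
      "@type": "MediaObject",
      "name": "Sensor readings",
      "encodingFormat": "text/csv"
    },
    { "@id": "#researcher-1", "@type": "Person", "name": "A. Researcher" },
    { "@id": "#site-1", "@type": "Place", "name": "Field site 1" },
    { "@id": "#soil-moisture", "@type": "Thing", "name": "Soil moisture" }
  ]
}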
Some examples are:
- Data relating to the IDRC-funded project (described in https://doi.org/10.3897/rio.2.e8880) to examine data management policies and implementation for development funders6. This example shows the use of the schema.org sameAs property to provide a human-readable gloss for the key ‘interviewee’.
- Some Matlab code[^Matlab code] that supports a research article. This is a good illustration of how common names like Lu, J can be disambiguated using [ORCID] identifiers.
- Victoria Arch7 contains data from a 3D survey of a cave, conducted using a Lidar scanner mounted on a drone. This example demonstrates data provenance, via the use of schema.org CreateAction properties that show that files result from observation-actions on the object of study (the cave).
- Some clinical trial data9 showing how researcher affiliations can be modeled using linked data. See an example of a Person with four affiliations; a minimal sketch of this kind of entry follows this list.
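A hypothetical sketch of a single node from a flattened @graph, showing the sort of Person entry these examples use (the ORCID here is a placeholder and the affiliations are invented):

{
  "@id": "https://orcid.org/0000-0000-0000-0000",
  "@type": "Person",
  "name": "Lu, J",
  "affiliation": [
    { "@id": "https://example.edu/faculty-of-science" },
    { "@id": "https://example.edu/graduate-school-of-health" }
  ]
}

Each affiliation @id would point to an Organization node elsewhere in the same @graph.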
For describing datasets such as exported content from digital object repository systems, DataCrate uses the Portland Common Data Model (PCDM) [12], a simple ontology for describing nested Collections of Objects, with Objects having Files. See an example of a complete social-history repository which has been exported to DataCrate using PCDM to model its structure10.
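A minimal, hypothetical sketch of this nesting in JSON-LD, assuming the PCDM terms are mapped in the @context (the identifiers are invented):

{
  "@context": {
    "Collection": "http://pcdm.org/models#Collection",
    "Object": "http://pcdm.org/models#Object",
    "File": "http://pcdm.org/models#File",
    "hasMember": "http://pcdm.org/models#hasMember",
    "hasFile": "http://pcdm.org/models#hasFile"
  },
  "@graph": [
    {
      "@id": "#collection-1",
      "@type": "Collection",
      "hasMember": { "@id": "#object-1" }
    },
    {
      "@id": "#object-1",
      "@type": "Object",
      "hasFile": { "@id": "files/interview-01.wav" }
    },
    { "@id": "files/interview-01.wav", "@type": "File" }
  ]
}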
There are a number of tools for DataCrate in development.
At the University of Technology Sydney [6], the Provisioner is an open framework for integrating good research data management practices into everyday research workflows. It provides a user-facing research data management planning tool which allows researchers to describe and publish datasets and create and share workspaces in different research apps such as lab notebooks, code repositories (where data is included by-reference), survey tools and collection management tools. DataCrates are used as an interchange format to move data between the different research apps, and as an ingest, archive and publication format. Lightweight adaptors coded against each research app’s native API allow export and import of DataCrates, which are then used to move data from one app to another, while recording a provenance history in the DataCrates’ metadata. Examples of DataCrates moving through the research lifecycle will be provided.
HIEv DataCrate - At the Hawkesbury Institute for the Environment at Western Sydney University, HIEv harvests a wide range of environmental data (and associated file-level metadata) from automated sensor networks, as well as analysed datasets generated by researchers. Leveraging built-in APIs within HIEv, a new packaging function has been developed11, allowing selected datasets to be identified and packaged in the DataCrate standard, complete with metadata automatically exported from the HIEv metadata holdings into the JSON-LD format. Going forward, this will allow datasets within HIEv to be published regularly and in an automated fashion, in a format that will increase their potential for reuse.
MIF DataCrate - The Microbial Imaging Facility (MIF) is part of the ithree institute and School of Life Sciences at the University of Technology Sydney. It supports 100 active researchers by providing training, access and support for ten microscopes. There is a broad range of equipment, covering four major brands and ranging from basic optical microscopes to cutting-edge super-resolution instruments. Handling the images and their associated metadata from the array of instruments is made possible through the efforts of the Open Microscopy Environment (OME). Specifically, the MIF extensively uses:
- Bioformats software [13], which enables a wide range of proprietary image formats to be read.
- OME-XML [14], a schema that describes how microscopes are set up.
- OMERO [15], client-server software for storing, sharing, viewing, organizing and analyzing microscopy images.
While the OME data model comprehensively covers the metadata associated with images, it is necessary to extend it to include facility-specific and broader experimental metadata. Several tools have been developed to integrate OMERO with DataCrate, including scripts for:
- adding external metadata and managing internal metadata in OMERO12
- integrating imaging data with other experimental data within a DataCrate
- identity management across systems
CalcyteJS14 is a command-line tool for packaging data into DataCrates, being developed at the University of Technology Sydney, which allows researchers to describe any data set via spreadsheets which the tool auto-creates in a directory tree.
Omeka DataCrate Tools15 is a Python tool to export data from Omeka Classic repositories into the DataCrate format and to import them into Omeka S.
DSpace to DataCrate16 is an early-stage Node.js tool for extracting repository data from DSpace repositories into DataCrate format.
There are difficulties in using discipline-specific linked data. To illustrate this we will look at a case study from the Microbial Imaging Facility at UTS. The goal is to package microscope-captured images with detailed provenance information, such as the microscope model, lens, light source, and filters used to make an image, as the most important requirement is that a lab technician can re-set a microscope to the same configuration as was used to capture an image-set.
There is a well-established standard for recording this information, the Open Microscopy Environment (OME) Data Model in XML [16], but to find DataCrate-friendly URIs for classes and properties we tried searching in BioPortal [17]. It became apparent that trying to pick URIs would involve choosing between, and using terms from, several ontologies to stitch together a JSON-LD @context; this approach would have involved a lot of detailed investigation and maintenance.
Instead, the solution we have adopted in DataCrate is to create a web page for each element and attribute in the OME schema and to use those in combination with schema.org.
Firstly, we make sure that the most important information is captured using the schema.org CreateAction class; that is, the act of creation is recorded with links to the instruments used.
Each piece of equipment has its own URI, resolving to a page, and is listed in the DataCrate flattened JSON-LD with as much detail as possible. Each scientific instrument will have at least the @type IndividualProduct, but may also have additional types and properties from the OME schema, via the pages we created, and could also include URIs from other ontologies if that will aid understanding for users, or discovery. An example is available in the DataCrate sample17 dataset.
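A minimal, hypothetical sketch of this pattern (the identifiers, equipment and file names are invented, and a real DataCrate would carry much more instrument detail from the OME schema pages):

{
  "@context": "http://schema.org/",
  "@graph": [
    {
      "@id": "#capture-session-1",
      "@type": "CreateAction",
      "name": "Image capture, sample 42",
      "agent": { "@id": "#technician-1" },
      "instrument": { "@id": "https://example.edu/equipment/confocal-1" },
      "object": { "@id": "#sample-42" },
      "result": { "@id": "images/sample-42_z-stack.ome.tiff" }
    },
    {
      "@id": "https://example.edu/equipment/confocal-1",
      "@type": "IndividualProduct",
      "name": "Confocal microscope 1"
    },
    { "@id": "#technician-1", "@type": "Person", "name": "A. Technician" },
    { "@id": "#sample-42", "@type": "Thing", "name": "Sample 42" },
    {
      "@id": "images/sample-42_z-stack.ome.tiff",
      "@type": "MediaObject",
      "encodingFormat": "image/tiff"
    }
  ]
}

Because the instrument, the person and the resulting file are all ordinary nodes in the flattened graph, a technician can follow a CreateAction from any image file back to the equipment used to produce it.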
There is a distinct lack of simple-to-use JSON-LD tools for programmers. While there are libraries for JavaScript and Python (the languages the authors use most frequently) to do high-level operations such as normalizing, framing or flattening JSON-LD documents, we have not found libraries with utility functions such as the ability to resolve context keys, which is a non-trivial programming task, or to traverse the JSON-LD @graph. Thus, the DataCrate specification mandates that the JSON-LD must be organized in a particular, predictable way (flattened) and that the @context is a simple mapping of key-to-URI with no indirection.
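A minimal sketch of the sort of layout this mandate implies, assuming a handful of schema.org terms (every key in the @context maps directly to a URI and every entity is a top-level node in @graph):

{
  "@context": {
    "name": "http://schema.org/name",
    "creator": "http://schema.org/creator",
    "Dataset": "http://schema.org/Dataset",
    "Person": "http://schema.org/Person"
  },
  "@graph": [
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Example dataset",
      "creator": { "@id": "#researcher-1" }
    },
    {
      "@id": "#researcher-1",
      "@type": "Person",
      "name": "A. Researcher"
    }
  ]
}

With this layout a consumer can index the graph by @id and resolve any key with a single context lookup, without needing a full JSON-LD processing library.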
DataCrate has just reached version one as of the fourth quarter of 2018, and has not yet been widely used, so it is impossible to evaluate its success. Anecdotal feedback from conferences and researchers has been positive, but we will need to evaluate the usability and utility of the system when significant numbers of packages have been distributed and used.
In conclusion, we offer some proposals for further discussion and possible action.
- An effort is needed to produce libraries to make consuming JSON-LD easy for developers: libraries for (at least) Python, JavaScript and R that can load arbitrary JSON-LD, resolve its context and vocabularies, and provide an index and graph traversal.
- More work should be undertaken to align DataCrate with preservation workflows.
- The DataCrate specification should be extended to allow for the description of the contents of files, such as column headers in tabular data formats.
As specification developers, we note that working with schema.org and JSON-LD has been a straightforward process (lack of library support notwithstanding): it is easy to find and use schema.org terms, and the DataCrate approach of using flattened JSON-LD with a simplified context has been productive for development. Our recommendation to research communities that want to embrace linked data and the benefits of the human-readable metadata in DataCrates, but that have no established, usable ontology, is to explore the idea of creating a schema.org external extension for their discipline or technical specialization.
[1] P. Sefton and P. Bugeia, “Introducing next year’s model, the data-crate; applied standards for data-set packaging,” in eResearch Australasia 2013, 2013[Online]. Available: http://eresearchau.files.wordpress.com/2013/08/eresau2013_submission_57.pdf.
[2] T. Berners-Lee, Linked data, 2006. [Online]. Available: http://www.w3.org/DesignIssues/LinkedData.html.
[3] J. Kunze, A. Boyko, B. Vargas, L. Madden, and J. Littman, “The BagIt File Packaging Format (V0.97).” [Online]. Available: http://tools.ietf.org/html/draft-kunze-bagit-06. [Accessed: 01-Mar-2013].
[4] “RDFa,” Wikipedia, the free encyclopedia, Feb. 2014[Online]. Available: http://en.wikipedia.org/w/index.php?title=RDFa&oldid=592600280. [Accessed: 26-Feb-2014].
[5] P. Sefton, P. Bugeia, and V. Picasso, “Pick, Package and Publish research data: Cr8it and Of The Web,” in eResearch Australasia 2014, 2014[Online]. Available: http://eresearchau.files.wordpress.com/2014/07/eresau2014_submission_30.pdf.
[6] L. Wheeler, S. Wise, and P. Sefton, “End-to-End Research Data Management for the Responsible Conduct of Research at the University of Technology Sydney,” 2018[Online]. Available: https://eresearch.uts.edu.au/2018/07/04/APRI_2018_provisioner.htm. [Accessed: 10-Jul-2018].
[7] “The Frictionless Data Field Guide.” [Online]. Available: https://frictionlessdata.io/specs/data-package/. [Accessed: 26-Sep-2018].
[8] S. Soiland-Reyes, M. Gamble, and R. Haines, “Research Object Bundle 1.0,” Specification, researchobject.org, 2014.
[9] “OMERO: Flexible, model-driven data management for experimental biology,” Nature Methods. [Online]. Available: https://www.nature.com/articles/nmeth.1896. [Accessed: 13-Jul-2018].
[10] J. Kunze and T. Baker, “The Dublin Core Metadata Element Set,” 2007[Online]. Available: http://www.rfc-editor.org/info/rfc5013. [Accessed: 12-Jul-2018].
[11] F. Maali, J. Erickson, and P. Archer, “Data catalog vocabulary (DCAT),” W3C Recommendation, vol. 16, 2014[Online]. Available: https://www.w3.org/TR/2014/REC-vocab-dcat-20140116/.
[12] “Portland Common Data Model.” [Online]. Available: https://pcdm.org/2016/04/18/models. [Accessed: 12-Jul-2018].
[13] M. Linkert, C. T. Rueden, C. Allan, J.-M. Burel, W. Moore, A. Patterson, B. Loranger, J. Moore, C. Neves, D. MacDonald, A. Tarkowska, C. Sticco, E. Hill, M. Rossner, K. W. Eliceiri, and J. R. Swedlow, “Metadata matters: Access to image data in the real world,” The Journal of Cell Biology, vol. 189, no. 5, pp. 777–782, 2010[Online]. Available: http://jcb.rupress.org/content/189/5/777.
[14] I. G. Goldberg, C. Allan, J.-M. Burel, D. Creager, A. Falconi, H. Hochheiser, J. Johnston, J. Mellen, P. K. Sorger, and J. R. Swedlow, “The open microscopy environment (ome) data model and xml file: Open tools for informatics and quantitative analysis in biological imaging,” Genome Biology, vol. 6, no. 5, p. R47, May 2005[Online]. Available: https://doi.org/10.1186/gb-2005-6-5-r47.
[15] C. Allan, J.-M. Burel, J. Moore, C. Blackburn, M. Linkert, S. Loynton, D. MacDonald, W. J. Moore, C. Neves, A. Patterson, and others, “OMERO: Flexible, model-driven data management for experimental biology,” Nature methods, vol. 9, no. 3, p. 245, 2012.
[16] I. G. Goldberg, C. Allan, J.-M. Burel, D. Creager, A. Falconi, H. Hochheiser, J. Johnston, J. Mellen, P. K. Sorger, and J. R. Swedlow, “The Open Microscopy Environment (OME) Data Model and XML file: Open tools for informatics and quantitative analysis in biological imaging,” Genome Biology, vol. 6, no. 5, p. R47, May 2005[Online]. Available: https://doi.org/10.1186/gb-2005-6-5-r47. [Accessed: 26-Sep-2018].
[17] P. L. Whetzel, N. F. Noy, N. H. Shah, P. R. Alexander, C. Nyulas, T. Tudorache, and M. A. Musen, “BioPortal: Enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications,” Nucleic Acids Research, vol. 39, no. suppl_2, pp. W541–W545, Jul. 2011[Online]. Available: https://academic.oup.com/nar/article/39/suppl_2/W541/2507188. [Accessed: 26-Sep-2018].
https://github.com/UTS-eResearch/DataCrate/↩
https://frictionlessdata.io/specs/data-package/↩
https://developers.google.com/search/docs/data-types/dataset↩
https://github.com/ResearchObject/bagit-ro↩
https://github.com/UTS-eResearch/DataCrate/issues/14↩
https://data.research.uts.edu.au/examples/v1.0/Data_Package-IDRC_Opportunities_and_Challenges_Open_Research_Strategies↩
[^Matlab code]: https://data.research.uts.edu.au/examples/v1.0/GTM/↩
https://data.research.uts.edu.au/examples/v1.0/Victoria_Arch_pub/↩
https://data.research.uts.edu.au/examples/v1.0/sample/↩
https://data.research.uts.edu.au/examples/v1.0/timluckett/↩
https://data.research.uts.edu.au/examples/v1.0/farms_to_freeways/↩
https://github.com/gdevine/hiev_DataCrate↩
https://github.com/evenhuis/omero-user-scripts↩
https://code.research.uts.edu.au/MIF/Workflows/omero-DataCrate↩
https://code.research.uts.edu.au/eresearch/CalcyteJS↩
https://github.com/UTS-eResearch/omeka-DataCrate-tools↩
https://github.com/UTS-eResearch/DataCrate-dspace-tools↩
https://data.research.uts.edu.au/examples/v1.0/sample/↩