Uploading Datasets to DataONE

2022-06-10

Introduction

This document describes how to use the dataone R package to upload data to DataONE, and how to perform maintenance operations on the data after upload.

The dataone R package provides methods to enable R scripts to interact with DataONE Coordinating Nodes (CN) and Member Nodes (MN), to search for, download, upload and update data and metadata. The dataone R package takes care of the details of calling the corresponding DataONE web service on a DataONE node. For example, the dataone createObject R method calls the DataONE web service MNStorage.create() that uploads a dataset to a DataONE MN.

Before uploading any data to a DataONE MN, it is necessary to obtain a DataONE user identity that will be provided with each request to upload or update data. The method that DataONE uses to achieve this is known as user identity authentication, and requires that an authentication token, which is a character string, be provided during upload. The process to obtain this token is described in the DataONE Federation vignette, in the section DataONE User Authentication With Tokens, which is viewable with the R command vignette("dataone-overview"). (Note: DataONE originally used X.509 certificates for authentication, which are still supported.)

Uploading A Package Using uploadDataPackage

Datasets and metadata can be uploaded individually or as a collection. Such a collection, whether contained in local R objects or existing on a DataONE repository, will be informally referred to as a package or ‘data package’. Figure 1. is a diagram of a typical DataONE package showing a metadata file that describes, or documents the data granules that the package contains.

The steps necessary to to prepare and upload a package to DataONE using the uploadDataPackage method will be shown. A complete script that uses these steps is shown here, with detailed explanations of the steps following.

library(dataone)
library(datapack)
library(uuid)

dp <- new("DataPackage")

emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone")
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)
dp <- addMember(dp, metadataObj)

sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData) 
dp <- addMember(dp, sourceObj, metadataObj)

progFile <- system.file("extdata/filterObs.R", package="dataone")
progObj <- new("DataObject", format="application/R", filename=progFile, mediaType="text/x-rsrc")
dp <- addMember(dp, progObj, metadataObj)

outputData <- system.file("extdata/Strix-occidentalis-obs.csv", package="dataone")
outputObj <- new("DataObject", format="text/csv", filename=outputData) 
dp <- addMember(dp, outputObj, metadataObj)

myAccessRules <- data.frame(subject="https://orcid.org/0000-0002-2192-403X", permission="changePermission") 
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- uploadDataPackage(d1c, dp, public=TRUE, accessRules=myAccessRules, quiet=FALSE)

This particular package contains the R script filterObs.R, the input file OwlNightj.csv that was read by the script and the output file Strix-occidentalis-obs.csv that was created by the R script, which was run at a previous time.

The following sections describe each line of the above script in detail.

1. Create a DataPackage object.

In order to use uploadDataPackage, it is necessary to prepare an R DataPackage object which is a container for the set of files that will be included in the package. The following commands load the required libraries and creates an empty DataPackage object that will be added to later:

library(dataone)
library(datapack)
library(uuid)

dp <- new("DataPackage")

When using the uploadDataPackage method, data structures that are required by DataONE are created, configured and uploaded automatically with the package. These data structures include a ResourceMap that details the contents of the package, and SystemMetadata objects that contain DataONE system information for each of the science datasets and associated science metadata.

A dataone DataObject is a container that holds both the data bytes and the system information for a metadata file, data or other type of file. A DataObject is created for each file that will be included in a DataPackage.

2. Prepare a metadata file that will describe the files in the package

The next step is to prepare a metadata file that will describe the science datasets and other files in the package. The most common metadata format used in the DataONE network is the Ecological Metadata Langauge (EML). Other supported formats include FGDC, ISO 19115 and others. Additional information about EML is available at https://knb.ecoinformatics.org/#external//emlparser/docs/index.html.

Detailed directions regarding authoring metadata documents are outside the scope of this document.

DataONE requires that any file uploaded to a member node have a unique identifier associated with it.

When a DataObject is created, a unique identifier is generated for the DataObject if one is not specified using the id parameter. This automatically generated identifier has the format “urn:uuid:”, for example “urn:uuid:c3443142-6260-4ea5-aaa1-1114981e04ad”.

The following commands create the DataObject for the science metadata, using an automatically generated identifier:

emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone")
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile)

Now add the metadata object to the DataPackage:

dp <- addMember(dp, metadataObj)

Files are considered members of a package when they are enumerated and described by a metadata file, and a relationship between the metadata and data object is explicitly stated.

DataONE (and the dataone R package) has adopted the package guidelines detailed by the DataONE package implementation. In this specification, the relationship that links a metadata object and a science object is CiTO (Citation Typing Ontology) documents.

This relationship between the science metadata and data objects will be added to the DataPackage automatically for each data object as it is added to the DataPackage, if the metadata object is first added, then referenced as the DataObjects are added.

As the metadata object has already been added, it can be referenced as each DataObject is added.

dp <- addMember(dp, sourceObj, metadataObj)

Since metadataObj is included as the third argument here, the CiTO documents relationship will automatically be added between metadataObj and sourceObj.

Alternatively, this relationship between the metadata and science objects can be made explicitly using the insertRelationship() method:

dp <- addMember(dp, metadataObj)
dp <- addMember(dp, sourceObj)
dp <- insertRelationship(dp, getIdentifier(metadataObj), getIdentifier(sourceObj))

Note that the relationship type, using the insertRelationship() predicate argument does not have to be specified in this case, as the CiTO documents relationship is the default value for insertRelationship.

3. Create and add a DataObject for each data file

A DataObject must be created for each metadata file, data file or any other type of file that will be included in the package.

A dataone SystemMetadata R object will be created automatically and stored in each DataObject. The information from the SystemMetadata R object will be used by DataONE to maintain low level information about the dataset, such as the access policy, the user identity of the rightsholder (the user identity that can modify access the dataset), which Member Nodes it can be replicated to, etc.

The example below creates a DataObject for a science dataset:

sourceData <- system.file("extdata/OwlNightj.csv", package="dataone")
sourceObj <- new("DataObject", format="text/csv", filename=sourceData) 
dp <- addMember(dp, sourceObj, metadataObj)

An optional user argument can be specified when creating a DataObject, which will be used to set the DataONE submitter and rightsholder of the dataset when it is uploaded. The rightsholder is granted all access privileges to the object.

If user is not specified for a DataObject, then the submitter and rightsholder for an object will automatically be set, when the object is uploaded to DataONE, to the DataONE user that created the authentication token or X.509 certificate.

Now DataObjects for an R script and for a file created by the R script will be created:

progFile <- system.file("extdata/filterObs.R", package="dataone")
progObj <- new("DataObject", format="application/R", filename=progFile, mediaType="text/x-rsrc")
dp <- addMember(dp, progObj, mo=metadataObj)

outputData <- system.file("extdata/Strix-occidentalis-obs.csv", package="dataone")
outputObj <- new("DataObject", format="text/csv", filename=outputData) 
dp <- addMember(dp, outputObj, mo=metadataObj)

Important Note: files previously uploaded to DataONE can be added to the DataPackage, for more information see the section “Adding Existing DataONE Files To A Package”.

4. Determine what access your data and metadata should have

DataONE provides a mechanism that allows data submitters to control access to their data.

The levels of access available to objects in DataONE are “read”, “write”, and “changePermission”.

The “read” permission allows a user the ability to view the content of a DataONE object. The “write” permission allows a user the ability to change the content of an object via update services. The “changePermission” permission allows the ability to change the access policy for an object and includes both read and write permissions.

The access rules that are added to DataObjects in a DataPackage will determine the access that is granted to users accessing the package after it is uploaded to DataONE.

Each of these permissions can be granted to a single user, a group of users, or the special public user which means all users.

Each object in DataONE can have one or more access rules that control the access of that object. The complete set of access rules for an object is referred to as its access policy.

Access rules can be added to each DataObject individually after it has been created.

Alternatively, access rules can be specified for all package members when a package is uploaded using uploadDataPackage. This method is shown at the end of this section.

To grant read permission to all users:

sourceObj <- setPublicAccess(sourceObj)

Individual access rules to be added for a DataONE user identity can also be added to the access policy.

Access rules are added to a DataObject using the addAccessRule method. The following access rule will grant the user with the ORCID https://orcid.org/0000-0002-2192-403X changePermission access to the dataset:

myAccessRules <- data.frame(subject="https://orcid.org/0000-0002-2192-403X", permission="changePermission") 
sourceObj <- addAccessRule(sourceObj, myAccessRules)

DataONE user identities and user authentication are described in section DataONE User Authentication in the vignette dataone-overview (to view this vignette, type this command in the R console: vignette("dataone-overview"))

5. Upload the DataPackage

When all DataObjects have been added to the DataPackage, call the uploadDataPackage method to upload the entire DataPackage.

As mentioned previous, as an alternative to adding access rules to each DataObject individually before adding it to the DataPackage, the access rules can be specified once when the package is uploaded to DataONE. For example, to add public access to every object in the package, and add the custom access rule show above, the public and accessRules arguments are used when calling updateDataPackage:

d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- uploadDataPackage(d1c, dp, public=TRUE, accessRules=myAccessRules, quiet=FALSE)
message(sprintf("Uploaded package with identifier: %s", packageId))

(Note that the example uses a DataONE test environment STAGING, and not the production environment.)

After uploadDataPackage has been called successfully, the package can be viewed on the member node, searched for using the DataONE search facility. Note that if objects in DataONE are not publicly readable, and the authenticated user performing the search isn’t granted access in an object’s access policy, then the objects will not be viewable or discoverable via the search facility for that user.

Using a Digital Object Identifier with a package.

A Digital Object Identifier (DOI) may be assigned to the metadata DataObject, using the generateIdentifier method:

cn <- CNode("STAGING")
mn <- getMNode(cn, "urn:node:mnStageUCSB2")
doi <- generateIdentifier(mn, "DOI")
metadataObj <- new("DataObject", id=doi, format="eml://ecoinformatics.org/eml-2.1.1", file=sampleEML)

The generateIdentifier method requests that the DataONE MN generate a properly formatted DOI.

Adding Existing DataONE Files To A Package

Files that have previously been uploaded to DataONE can be added to a new DataPackage. New DataObjects can be created by downloading an existing DataONE item:

dataObj <- getDataObject(d1c, id="urn:uuid:1234", lazyLoad=T, limit="1TB")
dp <- addMember(dp, dataObj, mo=metadatqObj)

The above example creates a new DataObject by downloading the existing item from DataONE that has the identifier “urn:uuid:1234” and using the ’d1c` variable previously defined. This DataObject is then added to the package. In this example, the optional “lazyLoad” and “limit” parameters are used. These parameters cause getDataObject to only download the system information (i.e. access rules, format information) and not the data bytes associated with this identifier. Using these parameters can be helpful if the file associated with this identifier is very large. For the purpose of adding this DataONE identifier to a DataPackage, the data bytes are not required.

Reusing DataONE items this way results in one copy of the item existing in DataONE but being included in multiple DataONE packages.

Uploading Individual Data And Metadata Files

A single data or metadata file can be uploaded to a DataONE MN using the createObject method. When uploading a single file using this method, additional information must be supplied to DataONE that controls how DataONE interacts with the uploaded file. This additional information is stored in DataONE as a system metadata object and contains information such as who can access or update the file, how many copies of the file should be maintained, whether the file has been superseded by another object, etc. The system metadata information that will be uploaded to DataONE is collected and stored in an R object type datapack::SystemMetadata, as shown below:

library(digest)
# Create a system metadata object for a data file. 
# Just for demonstration purposes, create a temporary data file.
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
format <- "text/csv"
size <- file.info(csvfile)$size
sha256 <- digest(csvfile, algo="sha256", serialize=FALSE, file=TRUE)
# Generate a unique identifier for the dataset
pid <- sprintf("urn:uuid:%s", UUIDgenerate())
sysmeta <- new("SystemMetadata", identifier=pid, formatId=format, size=size, checksum=sha256)
sysmeta <- addAccessRule(sysmeta, "public", "read")

Alternatively, the system metadata could have been created with a seriesId. The seriesId is explained in the dataone_overview vignette. The following example shows the creation of a SystemMetadata object using the optional seriesId:

# Create a system metadata object for a data file. 
# Just for demonstration purposes, create a temporary data file.
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
format <- "text/csv"
size <- file.info(csvfile)$size
sha256 <- digest(csvfile, algo="sha256", serialize=FALSE, file=TRUE)
# Generate a unique identifier for the dataset
pid <- sprintf("urn:uuid:%s", UUIDgenerate())
# The seriesId can be any unique character string.
seriesId <- sprintf("urn:uuid:%s", UUIDgenerate())
sysmeta <- new("SystemMetadata", identifier=pid, formatId=format, size=size, checksum=sha256,  seriesId=seriesId)

A unique identifier must be specified for each system metadata, whether or not a seriesId is used.

The dataset can now be uploaded to DataONE with the associated system metadata:

cn <- CNode("STAGING")
mn <- getMNode(cn, "urn:node:mnStageUCSB2")
response <- createObject(mn, pid, csvfile, sysmeta)

Note that for this example, the DataONE test environment STAGING is used, and not the production environment.

Maintaining Uploaded Datasets

After data has been uploaded to DataONE, maintenance operations can be performed on these objects using the methods described in the following sections.

Update the DataONE system metadata for an object (MNode: updateSystemMetadata)

The system metadata can be updated for an object in DataONE without updating the data bytes of the object itself. For example, if an object was only readable by the data submitter, the access policy for an object can be updated to enable public access. System metadata is updated for an object using the updateSystemMetadata method:

The following example first downloads the current system metadata for the pid from the previous example, then updates the access policy that will be applied to the object and uploads the new system metadata to DataONE so that the changes will be applied:

cn <- CNode("STAGING")
mn <- getMNode(cn, "urn:node:mnStageUCSB2")
sysmeta <- getSystemMetadata(mn, pid)
sysmeta <- addAccessRule(sysmeta, "public", "read")
status <- updateSystemMetadata(mn, pid, sysmeta)

Note that updating an object in DataONE requires the proper access. For example, for the identifier shown above (urn:uuid:17d61d5a-061a-4778-9cdf-4e14751aaddc), only the DataONE user identity that is the rightsholder, or another user identity that has been granted write access by the rightsholder will be able to update the object on the DataONE member node. Running the previous example without the proper authentication will produce an error.

Replace an object with a newer version (MNode: updateObject)

The updateObject updates an existing object by creating a new object identified by a new PID on the Member Node. The new object replaces and obsoletes the old object. An obsoleted object in DataONE does not appear in search results, however it is still available for download if the identifier is known.

# Update object from previous example with a new version
updateid <- sprintf("urn:uuid:%s", UUIDgenerate())
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
size <- file.info(csvfile)$size
sha256 <- digest(csvfile, algo="sha256", serialize=FALSE, file=TRUE)
# Start with the old object's sysmeta, then modify it to match
# the new object. We could have also created a sysmeta from scratch.
sysmeta <- getSystemMetadata(mn, pid)
sysmeta@identifier <- updateid
sysmeta@size <- size
sysmeta@checksum <- sha256
sysmeta@obsoletes <- pid
# Now update the object on the member node.
response <- updateObject(mn, pid, csvfile, updateid, sysmeta)
# Get the new, updated sysmeta and check it to ensure that the update
# worked, i.e. "obsoletes" is the old pid that was replaced by the update.
updsysmeta <- getSystemMetadata(mn, updateid)
updsysmeta@obsoletes

The Member Node will mark the object as being obsolete by setting a property in the system metadata on the object being replaced. An object marked as obsolete will not appear in search results, however, such an object is still available for download if the PID is known.