R packages that interact with EML

Introduction

EML (Ecological Metadata Language; M. Jones et al. 2019) is an XML (Extensible Markup Language) vocabulary, readable by both machines and humans, that is designed to facilitate data documentation (i.e. metadata) and so ease open data sharing. XML is a markup language related to HTML, but it was designed to tag document content and allows validation against specific schemas. XML is well supported in many programming languages and can be readily translated to other formats (McCartney and Jones 2002).

EML consists of “modules” (with separate schemas) that allow the description of data attributes (e.g. spatial, temporal, taxonomic extent). EML defines several resource types (e.g. “dataset”) that inherit a common set of elements or “tags” which facilitate discovery of metadata. One of the biggest advantages of EML is that it is extensible: additional modules can be easily integrated (McCartney and Jones 2002).

GBIF (the Global Biodiversity Information Facility), which is an international network and research infrastructure, provides open access to data about biodiversity. The GBIF metadata profile is based on EML and as such anyone interacting with GBIF needs to be able to read, write or visualise data in the EML format.

A serious barrier to data sharing is the learning curve and time cost needed to produce metadata. EML is a complex set of elements (many hundreds are available), most of which are not relevant to any single dataset (McCartney and Jones 2002). Writing directly in EML can be a technical barrier for many researchers who, in ecology, may only have experience in the R language, which has a very different syntax and data structure. The challenge is, therefore, to make filling in metadata as easy as possible for all researchers regardless of their data type.

Several R packages have functions to read, write or interact with EML files and here we aim to identify those that will be useful for researchers working with ecological data.

Aim

To identify R packages and functions that can read, write or interact/visualise data in the EML format

Objectives
  1. To review the available packages on CRAN and on GitHub that have functions to interact with EML
  2. To identify the main functions from these packages that allow reading, writing and visualising data in EML format

Methods

We searched for R packages on CRAN (The Comprehensive R Archive Network - https://cran.r-project.org/) using the search function which is provided by Google. We used the search term “Ecological Metadata Language site:r-project.org”. We repeated the search for “Ecological Metadata Language” on GitHub and identified repositories which used the R language. We identified all packages returned through our search that were on CRAN or on GitHub and that consisted of functions that are aimed at interacting (either reading, writing or viewing) with the Ecological Metadata Language (EML).

Results

The search on Google (via CRAN) returned 25 hits. Seven of these were R packages. The search on GitHub returned 15 repositories, of which 6 were in the R language. Two of these returned errors when we attempted to install them in RStudio using the devtools::install_github() function, and two have been archived. We identified 9 R packages (6 from CRAN and 3 from GitHub) that were active in the last two years and explicitly stated that their functions were used for interacting with (either reading, writing or viewing) Ecological Metadata Language (EML). Since the original Google search, several additional packages have become available on GitHub and are now also included in the following list. One of these, NPSdataverse, is a wrapper package that will install and load multiple EML-related packages including EML, EMLassemblyline, EMLeditor, DPchecker, and NPSutils. In addition, as of 2023-11-29, the package rbefdata is no longer available via CRAN and is not being maintained on GitHub.

Package | Description
EML | Work with Ecological Metadata Language (‘EML’) files. ‘EML’ is a widely used metadata standard in the ecological and environmental sciences, described in Jones et al. (2006), doi:10.1146/annurev.ecolsys.37.091305.110031.
emld | This is a utility for transforming Ecological Metadata Language (‘EML’) files into ‘JSON-LD’ and back into ‘EML.’ Doing so creates a list-based representation of ‘EML’ in R, so that ‘EML’ data can easily be manipulated using standard ‘R’ tools. This makes this package an effective backend for other ‘R’-based tools working with ‘EML.’ By abstracting away the complexity of ‘XML’ Schema, developers can build around native ‘R’ list objects and not have to worry about satisfying many of the additional constraints set by the schema (such as element ordering, which is handled automatically). Additionally, the ‘JSON-LD’ representation enables the use of developer-friendly ‘JSON’ parsing and serialization that may facilitate the use of ‘EML’ in contexts outside of ‘R,’ as well as the informatics-friendly serializations such as ‘RDF’ and ‘SPARQL’ queries.
RNeXML | Provides access to phyloinformatic data in ‘NeXML’ format. The package should add new functionality to R such as the possibility to manipulate ‘NeXML’ objects in more various and refined way and compatibility with ‘ape’ objects.
datapack | Provides a flexible container to transport and manipulate complex sets of data. These data may consist of multiple data files and associated meta data and ancillary files. Individual data objects have associated system level meta data, and data files are linked together using the OAI-ORE standard resource map which describes the relationships between the files. The OAI-ORE standard is described at https://www.openarchives.org/ore/. Data packages can be serialized and transported as structured files that have been created following the BagIt specification. The BagIt specification is described at https://tools.ietf.org/html/draft-kunze-bagit-08.
geometa | Provides facilities to handle reading and writing of geographic metadata defined with OGC/ISO 19115, 19119 and 19110 geographic information metadata standards, and encoded using the ISO 19139 (XML) standard. It includes also a facility to check the validity of ISO 19139 XML encoded metadata.
MetaEgress | Functions to create Ecological Metadata Language (EML) XML documents from metadata stored in a metadata database designed by the Long Term Ecological Research Network.
EMLassemblyline | For scientists and data managers to create high quality EML metadata for dataset publication. EMLassemblyline is optimized for automating recurring publications (timeseries or data derived from timeseries sources) but works well for “one-off” publications, especially through the MetaShARK interface. EMLassemblyline prioritizes automated metadata extraction from data objects to minimize required human effort and encourages EML best practices to make publications Findable, Accessible, Interoperable, and Reusable.
pkEML | Use this package to convert an existing corpus of Ecological Metadata Language (EML) documents into table form, plus identify and consolidate re-used metadata elements. Existing metadata can then be imported or migrated into a relational database system with re-usable metadata elements (with “primary keys” hence pkEML). This package is developed with the intention to help established LTER sites import their EML corpus into LTER-core-metabase. See https://github.com/lter/LTER-core-metabase. However please use to whatever use you may find, and
LivingNorwayR | The package provides a workflow for creating a Darwin Core standard-compliant data archive (“a data package”). This facilitates FAIR (Findable, Accessible, Interoperable, Reusable; https://www.go-fair.org/fair-principles/) data sharing and uploading of Darwin Core archives (data packages) to repositories such as GBIF. The Living Norway package also provides tools for the processing and manipulation of metadata associated with Darwin Core archives and for the import and export of metadata according to the EML (Ecological Metadata Language; https://eml.ecoinformatics.org/) standard.
NPSdataverse | Loads a suite of R packages for creating and manipulating data packages including interacting with DataStore.
EMLeditor | This package will be of most use to the U.S. National Park Service data scientists and managers seeking to generate EML-formatted metadata for datapackages. EML-formatted .xml files are typically constructed using EDI’s EMLassemblyline package and then imported as an R-object using the EML package. EMLeditor allows the user to view the contents of the R object and add/edit aspects of metadata crucial for publication in the U.S. National Park Service DataStore repository. For instance, a user can view and edit a DOI, a link to a DRR, Park Unit connections, information about Confidential Unclassified Information (CUI), and more. EMLeditor allows the user to write a mockup of a README.txt to preview what the README automatically generated by DataStore upon upload will look like.
DPchecker | Allows the user (and reviewer) to check a data package and test whether it meets the congruence standards set forth by NPS for upload to DataStore as a datapackage.
NPSutils | NPSutils is a collection of functions for interacting with NPS DataStore repository.

emld package

The emld package (Boettiger 2019) is closely related to the EML package (Boettiger and Jones 2021) and its functions are heavily relied upon in the latest version of EML.

emld has an advantage over EML when working with large, highly nested EML files: it can flatten EML into common R formats (nested lists) that can be manipulated with standard R tools.

# Example from https://github.com/ropensci/emld
f <- system.file("extdata/example.xml", package="emld")
eml <- emld::as_emld(f)
eml$dataset$title
## [1] "Data from Cedar Creek LTER on productivity and species richness\n  for use in a workshop titled \"An Analysis of the Relationship between\n  Productivity and Diversity using Experimental Results from the Long-Term\n  Ecological Research Network\" held at NCEAS in September 1996."
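
Because an emld object is an ordinary nested list, base R functions are enough to flatten or search it. The sketch below uses a small hand-built list (illustrative content, standing in for the result of emld::as_emld()) rather than a real EML file, and shows one possible way to collapse it into a named character vector with unlist().

```r
# A small nested list standing in for an emld object (illustrative content only).
eml_like <- list(
  dataset = list(
    title = "The biggest fish in the sea",
    creator = list(
      individualName = list(givenName = "Joe", surName = "Bloggs")
    )
  )
)

# unlist() flattens the nesting; element names are joined with "." so each
# value keeps the path that led to it.
flat <- unlist(eml_like)
names(flat)

# Look up every entry whose path ends in "surName".
flat[grepl("surName$", names(flat))]
```

The same approach should work on a list read from a file, although very large documents may be easier to explore one branch at a time.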

emld objects are nested lists, so to write EML with emld you can simply create a list.

# Example from https://github.com/ropensci/emld

me <- list(individualName = list(givenName = "Joe", surName = "Bloggs"))

# system and packageId sit at the top level, alongside dataset
eml <- list(
  dataset = list(
    title = "The biggest fish in the sea",
    contact = me,
    creator = me),
  system = "doi",
  packageId = "10.xxx")

ex.xml <- tempfile("ex", fileext = ".xml") # use your preferred file path

emld::as_xml(eml, ex.xml)

emld::as_xml() reorders the list if necessary to ensure that it matches the EML required format. You can use emld::eml_validate() to check that the EML has been written correctly and is a valid EML file.

emld::eml_validate(ex.xml)
## [1] TRUE
## attr(,"errors")
## character(0)
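
For comparison, here is a sketch of what validation is expected to report when required elements are missing. The list below deliberately omits the creator and contact elements, which the EML schema requires for a dataset; the object and file names are illustrative.

```r
library(emld)

# A deliberately incomplete record: no creator and no contact.
incomplete <- list(
  dataset = list(title = "The biggest fish in the sea"),
  system = "doi",
  packageId = "10.xxx")

bad.xml <- tempfile("bad", fileext = ".xml")
as_xml(incomplete, bad.xml)

# The result should be FALSE, with schema messages in the "errors" attribute.
result <- eml_validate(bad.xml)
attr(result, "errors")
```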

EML package

The EML (Boettiger and Jones 2021) package uses emld (Boettiger 2019) as a basis for many of its functions.

The read_eml() function reads EML files into the R session.

f <- system.file("extdata", "example.xml", package = "emld")
eml <- EML::read_eml(f)
eml$dataset$title
## [1] "Data from Cedar Creek LTER on productivity and species richness\n  for use in a workshop titled \"An Analysis of the Relationship between\n  Productivity and Diversity using Experimental Results from the Long-Term\n  Ecological Research Network\" held at NCEAS in September 1996."

Several functions (with the prefix set_) allow the user to import attributes of the dataset (e.g. the methods) from Word or Markdown documents and merge them into a single EML file (see https://docs.ropensci.org/EML/ for a detailed tutorial).
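
As a small illustration of the set_ helpers, the sketch below builds a coverage element with EML::set_coverage(). The dates, species name, coordinates and altitudes are placeholder values for illustration, not taken from a real dataset.

```r
library(EML)

# Describe the temporal, taxonomic and geographic extent of a dataset.
# All values are placeholders for illustration.
coverage <- set_coverage(
  beginDate = "2012-06-01",
  endDate = "2013-12-31",
  sci_names = "Sarracenia purpurea",
  geographicDescription = "Hypothetical study site",
  westBoundingCoordinate = -122.44,
  eastBoundingCoordinate = -117.15,
  northBoundingCoordinate = 37.38,
  southBoundingCoordinate = 30.00,
  altitudeMinimum = 160,
  altitudeMaximum = 330,
  altitudeUnits = "meter")

# The result is a nested list that can be placed in the dataset element,
# e.g. list(dataset = list(title = ..., coverage = coverage)).
names(coverage)
```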

EML has several functions (with the prefix get_) which allow a user to extract elements of an EML file.

f <- system.file("tests", emld::eml_version(), 
  "eml-datasetWithAttributelevelMethods.xml", package = "emld")
eml <- EML::read_eml(f)
EML::get_attributes(eml$dataset$dataTable$attributeList)
## $attributes
##    id...1 attributeName                    attributeLabel
## 1   att.1           fld                             Field
## 2   att.2          year                              year
## 3   att.3            sr                  Species Richness
## 4   att.4        pctcov                     percent cover
## 5   att.5       avesr91 Average Species Richness for 1991
## 6    <NA>       avesr92 Average Species Richness for 1992
## 7    <NA>       avesr93 Average Species Richness for 1993
## 8    <NA>       avesr94 Average Species Richness for 1994
## 9    <NA>       avesr95 Average Species Richness for 1995
## 10   <NA>       avesr96 Average Species Richness for 1996
## 11   <NA>        MeanSR             mean species richness
## 12 att.14          time                              Time
##                                              attributeDefinition storageType
## 1                   Field where the data was collected\n              string
## 2                      The year the data was collected\n               gYear
## 3                                       Species richness for CDR       float
## 4                The percent ground cover on the field\n               float
## 5   The average species richness for the field in 1991\n               float
## 6   The average species richness for the field in 1992\n               float
## 7   The average species richness for the field in 1993\n               float
## 8   The average species richness for the field in 1994\n               float
## 9   The average species richness for the field in 1995\n               float
## 10  The average species richness for the field in 1996\n               float
## 11         the mean species richness from 1991 to 1996\n               float
## 12 The time of day for this observation, 24 hour clock\n                time
##    id...6            definition measurementScale         domain formatString
## 1    nd.1 Valid names of fields          nominal     textDomain         <NA>
## 2    <NA>                  <NA>         dateTime dateTimeDomain         YYYY
## 3    <NA>                  <NA>         interval  numericDomain         <NA>
## 4    <NA>                  <NA>            ratio  numericDomain         <NA>
## 5    <NA>                  <NA>            ratio  numericDomain         <NA>
## 6    <NA>                  <NA>            ratio  numericDomain         <NA>
## 7    <NA>                  <NA>            ratio  numericDomain         <NA>
## 8    <NA>                  <NA>            ratio  numericDomain         <NA>
## 9    <NA>                  <NA>            ratio  numericDomain         <NA>
## 10   <NA>                  <NA>            ratio  numericDomain         <NA>
## 11   <NA>                  <NA>            ratio  numericDomain         <NA>
## 12   <NA>                  <NA>         dateTime dateTimeDomain   hh:mm:ss.s
##    dateTimePrecision id...8    minimum          unit precision numberType
## 1               <NA>   <NA>       <NA>          <NA>      <NA>       <NA>
## 2                  1   dd.2       1944          <NA>      <NA>       <NA>
## 3               <NA>   nd.3          0 dimensionless       0.5       real
## 4               <NA>   nd.4          0 dimensionless       0.1       real
## 5               <NA>   nd.5          0 dimensionless       0.1       real
## 6               <NA>   <NA>       <NA> dimensionless       0.1       <NA>
## 7               <NA>   <NA>       <NA> dimensionless       0.1       <NA>
## 8               <NA>   <NA>       <NA> dimensionless       0.1       <NA>
## 9               <NA>   <NA>       <NA> dimensionless       0.1       <NA>
## 10              <NA>   <NA>       <NA> dimensionless       0.1       <NA>
## 11              <NA>   <NA>       <NA> dimensionless       0.1       <NA>
## 12               0.1   dd.3 15:00:00.0          <NA>      <NA>       <NA>
##    exclusive...10 exclusive...12    maximum     id exclusive...9 exclusive...11
## 1            <NA>           <NA>       <NA>   <NA>          <NA>           <NA>
## 2            <NA>           <NA>       <NA>   <NA>          <NA>           <NA>
## 3            <NA>           <NA>       <NA>   <NA>          <NA>           <NA>
## 4            true           true        100   <NA>          <NA>           <NA>
## 5            <NA>           <NA>       <NA>   <NA>          <NA>           <NA>
## 6            <NA>           <NA>       <NA>  att.6          <NA>           <NA>
## 7            <NA>           <NA>       <NA>  att.7          <NA>           <NA>
## 8            <NA>           <NA>       <NA>  att.8          <NA>           <NA>
## 9            <NA>           <NA>       <NA>  att.9          <NA>           <NA>
## 10           <NA>           <NA>       <NA> att.10          <NA>           <NA>
## 11           <NA>           <NA>       <NA> att.11          <NA>           <NA>
## 12           <NA>           <NA> 19:00:00.0   <NA>          true           true
## 
## $factors
## NULL

RNeXML package

RNeXML (Boettiger et al. 2016) is a package for reading and writing phylogenetic, character and trait metadata focussed on taxonomy. The package stands parallel to EML and emld, using similar functions, but is focused on a different (but related) standard, NeXML, rather than EML per se.

datapack package

The datapack package (M. B. Jones and Slaughter 2020) provides functions for collating multiple data and metadata objects of different types into a bundle that can be transported and loaded using a single composite file. It is primarily meant as a container to bundle together files for transport to or from DataONE data repositories. Metadata in the form of EML can be attached to the data bundle.

## Members:
## 
## filename       format    mediaType  size     identifier          modified local 
## sample-eml.xml eml...1.0 NA         5990     urn:uuid...c33ce2ba n        y     
## 
## Package identifier: NA
## RightsHolder: NA
## 
## 
## This package does not contain any provenance relationships.
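
The listing above is the kind of summary datapack prints for a package with a single EML member. A minimal sketch of how such a package might be assembled is shown below. It reuses the example file shipped with emld (rather than the sample-eml.xml file named in the listing), and the format identifier is an assumption about that file's EML version.

```r
library(datapack)

# Locate an example EML file (the emld example used earlier in this article).
emlFile <- system.file("extdata/example.xml", package = "emld")

# Wrap the metadata file in a DataObject and add it to a new DataPackage.
# The format string is an assumed identifier for EML 2.2.0.
dp <- new("DataPackage")
metadataObj <- new("DataObject",
                   format = "https://eml.ecoinformatics.org/eml-2.2.0",
                   filename = emlFile)
dp <- addMember(dp, metadataObj)

# Printing the package lists its members.
dp
```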

rbefdata package

This package does not appear to be available on CRAN or GitHub as of 2023-11-29.

Similarly to the datapack package, rbefdata (Pfaff, Nadrowski, and Man 2013) links to a data repository, in this case the BEF data portal (https://fundiv.befdata.biow.uni-leipzig.de/). The function rbefdata::bef.portal.get.metadata() extracts the metadata from a file that the user has downloaded from the BEF data portal.

geometa package

geometa (Blondel 2021) provides functions for reading and writing geographic metadata. It lists EML and emld as suggested packages, and it can convert metadata to and from EML using the geometa::convert_metadata() function.

MetaEgress package

MetaEgress (Nguyen and Kui 2021) is designed to create EML from metadata stored in the Long Term Ecological Research (LTER) core metabase (https://github.com/lter/LTER-core-metabase). Validation of EML is done through emld::eml_validate().

EMLassemblyline package

EMLassemblyline (Smith 2021) is a metadata builder that has functions to automatically extract metadata. The package is centered around a “data package”, that is, a collection of data objects and metadata. The templating functions generate metadata templates that the user then fills in. EMLassemblyline also has a “living data” set of functions that allow ongoing data collection to be published at regular intervals. See https://ediorg.github.io/EMLassemblyline/index.html for examples of the workflow.

library(EMLassemblyline)
make_eml(
  path = "./metadata_templates",
  data.path = "./data_objects",
  eml.path = "./eml",
  dataset.title = "Sphagnum and Vascular Plant Decomposition under Increasing Nitrogen Additions",
  temporal.coverage = c("2014-05-01", "2015-10-31"),
  maintenance.description = "Completed: No updates to these data are expected",
  data.table = c("decomp.csv", "nitrogen.csv"),
  data.table.description = c("Decomposition data", "Nitrogen data"),
  other.entity = c("ancillary_data.zip", "processing_and_analysis.R"),
  other.entity.description = c("Ancillary data", "Data processing and analysis script"),
  user.id = "myid",
  user.domain = "EDI",
  package.id = "edi.260.1")

MetaInbase (pkEML) package

MetaInbase (c)) 2021) has functions to convert EML to tables. It appears to be in an early stage of development. The package has since been renamed pkEML.

MetaShARK package

MetaShARK (Metadata Shiny Automated Resource & Knowledge; https://github.com/earnaud/MetaShARK-v2) is an R Shiny app that allows the user to get information about EML and to fill in metadata for datasets according to this standard. It provides a “user-friendly” interface for researchers who are not familiar with the EML standard.

LivingNorwayR

LivingNorwayR addresses the EML metadata in a slightly different way to the other packages. It uses RMarkdown with specific tagging functions in order to “knit” together the EML document. The package is focused on creating Darwin Core compliant data archives (“data packages”) that can be stored locally or shared with, for example, GBIF. A worked example is available as a vignette.

NPSdataverse

NPSdataverse is a wrapper package that will install a suite of packages for generating, editing, checking, up/downloading (to/from the NPS DataStore) and accessing EML. NPSdataverse will install and load the following packages: EML, EMLassemblyline, QCkit, EMLeditor, DPchecker, NPSutils. Of these, QCkit is the only package that does not involve EML in one way or another.

EMLeditor

EMLeditor is a package aimed at editing EML. The EMLeditor package contains a number of “get_” and “set_” class functions for retrieving and editing EML objects in R. Installing EMLeditor will also install a sample .rmd script, accessible via RStudio, for generating EML via a combination of EML, EMLassemblyline, and EMLeditor. One major goal of EMLeditor is to make editing EML fast and easy without needing to call the make_eml() function from EMLassemblyline. Another goal is to add EML elements that may be specific to National Park Service staff, collaborators, or partners. The package will also allow authenticated users to generate references on the NPS DataStore and upload data sets/metadata to DataStore.

DPchecker

DPchecker is a package aimed at checking whether a data package, consisting of several flat files in .csv format and a single EML file (*_metadata.xml), is ready to be uploaded to the NPS DataStore. Although many of the checks are NPS specific, DPchecker also performs a number of more general checks for internal consistency in the EML file as well as congruence between the EML file and the data files.

NPSutils

NPSutils is primarily a public-facing package aimed at accessing data packages on the NPS DataStore. NPSutils contains functions to download data packages (.csv files and an EML file), load the data and metadata into R, and load the data and EML metadata into dashboard visualizers such as Power BI.

Relationship between packages

By looking at the dependency relationships between the packages we can get an understanding of how they are structured and how similar each package is. From the figure and table below you can see that jsonlite and xml2 are core packages that underpin many of the packages that deal with EML in R. jsonlite helps to convert the list-like structure of EML into different R classes. xml2 allows the user to read XML files (EML is a type of XML) into R as a list and also to write EML as an XML structured file. Three of the packages use the emld package as a basis (most use the validation function), and from the network plot you can see that all the packages except EMLassemblyline are quite closely related in terms of their underlying dependencies. A large number of the dependencies of EMLassemblyline are not shared by the other packages.
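
One way to explore such overlaps yourself is to count how often each dependency appears across a set of packages. The sketch below uses tools::package_dependencies() against locally installed packages. The helper name shared_dependencies() is hypothetical, and packages that are not installed are silently skipped, so the result depends on your own library and is not a reproduction of the analysis described here.

```r
# Count how many of the given packages depend on (or import) each dependency.
shared_dependencies <- function(pkgs, db = utils::installed.packages()) {
  pkgs <- intersect(pkgs, rownames(db))   # keep only installed packages
  if (length(pkgs) == 0) return(integer(0))
  deps <- tools::package_dependencies(pkgs, db = db,
                                      which = c("Depends", "Imports"))
  all_deps <- as.character(unlist(deps))
  sort(table(all_deps), decreasing = TRUE)
}

# Packages discussed in this review (only those installed are counted).
shared_dependencies(c("EML", "emld", "datapack", "geometa", "RNeXML"))
```

Dependencies with the highest counts are the ones shared by the most packages. Setting recursive = TRUE in tools::package_dependencies() would extend the count to indirect dependencies.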

Which package to use?

Many of the packages we have found are focused on single data repositories. These packages might be useful for those specific tasks; however, the main packages for creating EML are {emld} and {EML}, as these are the basis for many of the functions used in the other packages. The {LivingNorwayR} package uses RMarkdown as a tool to create EML metadata and might be preferable for those with some experience with RMarkdown documents.

References

Blondel, Emmanuel. 2021. Geometa: Tools for Reading and Writing ISO/OGC Geographic Metadata. https://CRAN.R-project.org/package=geometa.
Boettiger, Carl. 2019. “Ecological Metadata as Linked Data.” The Journal of Open Source Software 4 (34): 1276. https://doi.org/10.21105/joss.01276.
Boettiger, Carl, Scott Chamberlain, Rutger Vos, and Hilmar Lapp. 2016. “RNeXML: A Package for Reading and Writing Richly Annotated Phylogenetic, Character, and Trait Data in R.” Methods in Ecology and Evolution 7: 352–57. https://doi.org/10.1111/2041-210X.12469.
Boettiger, Carl, and Matthew B. Jones. 2021. EML: Read and Write Ecological Metadata Language Files. https://CRAN.R-project.org/package=EML.
c)). 2021. MetaInbase: Convert Corpus of Ecological Metadata Language (EML) Documents to Table Form.
Jones, Matthew B., and Peter Slaughter. 2020. datapack: A Flexible Container to Transport and Manipulate Data and Associated Resources. https://doi.org/10.5063/F1QV3JGM.
Jones, Matthew, Margaret O’Brien, Bryce Mecum, Carl Boettiger, Mark Schildhauer, Mitchell Maier, Timothy Whiteaker, Stevan Earl, and Steven Chong. 2019. “Ecological Metadata Language Version 2.2.0.” https://doi.org/10.5063/f11834t2.
McCartney, P, and M Jones. 2002. “Using XML-Encoded Metadata as a Basis for Advanced Information Systems for Ecological Research.” In Proc. 6th World Multiconference Systemics, Cybernetics and Informatics, 7:379–84.
Nguyen, An T., and Li Kui. 2021. MetaEgress: Create Ecological Metadata Language from LTER Core Metabase.
Pfaff, Claas-Thido, Karin Nadrowski, and Xingxing Man. 2013. rbefdata: BEFdata R Package. https://CRAN.R-project.org/package=rbefdata.
Smith, Colin. 2021. EMLassemblyline: A Tool Kit for Building EML Metadata Workflows. https://github.com/EDIorg/EMLassemblyline.