October 11, 2018

Semantic approaches to enable USGS data to be FAIR (Findable, Accessible, Interoperable, Reusable)

We used the list of FAIR Principles at https://www.go-fair.org/fair-principles/, which includes links to explanations (links in the left column). Notes on the discussion are arranged by principle number:

F1. (Meta)data are assigned a globally unique and persistent identifier

Discussion focused on whether metadata need identifiers separate from those of the data. In some cases the DOI for a data release might be considered the metadata DOI, or the IGSN for a sample might point to the metadata. A separate identifier for the metadata would have two advantages: it would provide a way to discover the most current version of the metadata when the metadata is separated from the data landing page, and it would be useful for managing collections of metadata.

F2. Data are described with rich metadata

Do USGS scientific communities have specific guidelines for "rich metadata"? The CSDGM Biological Profile is one. Such guidelines are provided for data submitted to NWIS or GenBank. In many cases, this guidance comes from reviewers. The CDI project on content specifications for ISO metadata is also addressing this question.

F3. Metadata clearly and explicitly include the identifier of the data they describe

When data repositories assign identifiers, are they inserted in the metadata? The DOI could go into the online linkage.
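A check for F3 could be automated. The sketch below assumes a CSDGM-style XML record in which the online linkage is the onlink element inside the citation. The record fragment and the DOI https://doi.org/10.5066/EXAMPLE are made-up placeholders, and doi_in_online_linkage is a hypothetical helper name, not part of any USGS tool.

```python
import xml.etree.ElementTree as ET

# Made-up CSDGM-style fragment; the title and DOI are placeholders.
RECORD = """
<metadata>
  <idinfo>
    <citation>
      <citeinfo>
        <title>Example data release</title>
        <onlink>https://doi.org/10.5066/EXAMPLE</onlink>
      </citeinfo>
    </citation>
  </idinfo>
</metadata>
"""

def doi_in_online_linkage(xml_text, doi_url):
    """Return True if any <onlink> under the citation holds the given DOI URL."""
    root = ET.fromstring(xml_text)
    links = root.findall("./idinfo/citation/citeinfo/onlink")
    return any((link.text or "").strip() == doi_url for link in links)

print(doi_in_online_linkage(RECORD, "https://doi.org/10.5066/EXAMPLE"))
```

A repository could run a check like this after minting a DOI, to confirm that the identifier was actually written back into the metadata.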

F4. (Meta)data are registered or indexed in a searchable resource

The Science Data Catalog takes care of this!

A1. (Meta)data are retrievable by their identifier using a standardised communications protocol

Covered by the Public Access Policy and requirements for Trusted Digital Repositories.

A2. Metadata are accessible, even when the data are no longer available

A new USGS policy is needed. The metadata could continue to be provided on the landing page associated with the DOI.

I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

One model: NWISWeb provides RDF snippets in the XML metadata that include URIs for detailed descriptions.

I2. (Meta)data use vocabularies that follow FAIR principles

USGS vocabularies have many of the features needed to follow FAIR principles. The problem is institutional commitment to a stable Web home, which URI stability requires. We also need cross-vocabulary mapping, especially for interdisciplinary science and conversations.

** This is an important topic for promoting data interoperability in USGS.

I3. (Meta)data include qualified references to other (meta)data

We need vocabularies and placement standards for specifying relationships with the items in references. This will be possible in ISO metadata. ScienceBase specifies these relationships on landing pages, but that information needs to be pulled into the metadata for stand-alone use.

** This is an important topic. We could start by looking at DataCite specifications, the ScienceBase term list, and the ADIwg term list.
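As a starting point for discussion, a qualified reference can be thought of as an identifier plus a relationship term drawn from a controlled list. The sketch below uses a small subset of the relationType values from the DataCite Metadata Schema. The class name QualifiedReference and the identifier 10.5066/EXAMPLE are hypothetical, and the ScienceBase and ADIwg term lists are not represented here.

```python
from dataclasses import dataclass

# A small subset of DataCite relationType values; the full list is in the
# DataCite Metadata Schema.
RELATION_TYPES = {
    "Cites", "IsCitedBy", "IsSupplementTo", "IsSupplementedBy",
    "IsPartOf", "HasPart", "IsDerivedFrom", "IsSourceOf",
}

@dataclass(frozen=True)
class QualifiedReference:
    """A reference plus the relationship that qualifies it."""
    relation_type: str
    identifier: str
    identifier_type: str = "DOI"

    def __post_init__(self):
        # Reject relationship terms that are not in the controlled list.
        if self.relation_type not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.relation_type}")

# Placeholder identifier, for illustration only.
ref = QualifiedReference("IsDerivedFrom", "10.5066/EXAMPLE")
print(ref)
```

Validating the relationship term when the reference is created is what makes it "qualified": a free-text relationship would be accepted by a plain string field but rejected here.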

R1.1. (Meta)data are released with a clear and accessible data usage license

We have use constraints, but nothing called a license. We tend to assume people know they can use our data, but even saying “Public Domain” doesn’t specify what it means. The CDI Software group is discussing this. Does ISO have a way to specify a license? Does Lisa Zolly know more about what is possible?

** This is a thorny issue.

R1.2. (Meta)data are associated with detailed provenance

Provenance can be a huge amount of information. There is a Provenance Challenge in ESIP (linked below).

** This is an important topic to address.


Here are additional links from community members:

Link related to R1.2 Provenance: http://www.esipfed.org/community-prov-challenge 

 


September 13, 2018

FY18 has been a quiet year for the Semantic Web Working Group. Do we still need the Semantic Web Working Group? If so, what do we need it to do?

A small group talked about the history and status of the working group, and concluded that we have achieved our original purpose of learning about semantic web technologies. We are ready to start assisting USGS in taking semantic approaches that will improve data management and integration. Improving USGS use of controlled vocabularies is an obvious place to start, and we agreed to encourage Peter Schweitzer to meet with us to identify opportunities.

For the FY19 CDI proposal cycle, we discussed proposing to host a USGS-wide workshop to design a road map to make USGS data more consistent with the FAIR Data Principles – not just focusing on integrating data to support a particular use, but improving our data practices so that all USGS data is findable, accessible, interoperable, and re-usable for multiple unanticipated uses.

Initial work on this proposal will happen over email.


June 14, 2018

We explored Loterre together.

Inist-CNRS (France) is rolling out Loterre (Linked open terminology resources), a multidisciplinary linked-data platform to make scientific terminology sources available on the web.

To facilitate exchange and interoperability, it is built on a triplestore* and provides resource retrieval and downloading together with a query API.

Loterre is not restricted to the terminology resources Inist-CNRS produces. It has also been designed as an open platform to host terminological data from other data producers.

Inist-CNRS also provides technological support for those who need to convert resources into SKOS/RDF format.

 --------------------------

*A triplestore is a database designed specifically for storing and retrieving RDF (Resource Description Framework) data.
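To make the footnote concrete, here is a toy illustration of the idea behind a triplestore: statements are stored as (subject, predicate, object) triples and retrieved by pattern. This is only a sketch and says nothing about how Loterre is implemented. The term names are made up and abbreviated; only skos:prefLabel and skos:broader correspond to real SKOS properties.

```python
# Toy in-memory triple store: a set of (subject, predicate, object) tuples.
# Subjects are shortened, made-up names, not real URIs.
triples = {
    ("term:bathymetry", "skos:prefLabel", "Bathymetry"),
    ("term:bathymetry", "skos:broader", "term:elevation"),
    ("term:elevation", "skos:prefLabel", "Elevation"),
}

def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# List every preferred label in the store.
for triple in match(triples, p="skos:prefLabel"):
    print(triple)
```

A real triplestore answers the same kind of pattern query, usually through SPARQL, over far larger collections.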

See Loterre: https://www.loterre.fr

Contact: contact-terminologietal@inist.fr

Stay tuned to Loterre on Twitter: https://twitter.com/INIST_Loterre


May 10, 2018

News sharing and discussion.

Notes on topics discussed:

April 12, 2018

Demonstrating the use of permanent identifiers in linked data

Example:

Data Categories for Marine Planning (DCMP), a vocabulary of 90 terms (https://pubs.er.usgs.gov/publication/ofr20151046)
Triple store of DCMP terms (created by Rensselaer Polytechnic Institute): https://nocv.tw.rpi.edu/elda/api/vocab/dcmp/terms

In the human-readable interface of the triple store, each of the 90 DCMP terms has its own URL:

https://nocv.tw.rpi.edu/elda/api/vocab/dcmp/term/assessments
https://nocv.tw.rpi.edu/elda/api/vocab/dcmp/term/bathymetry-and-elevation
https://nocv.tw.rpi.edu/elda/api/vocab/dcmp/term/biodiversity
https://nocv.tw.rpi.edu/elda/api/vocab/dcmp/term/biological-occurrence
https://nocv.tw.rpi.edu/elda/api/vocab/dcmp/term/biological-production
.
.
.
etc.

... but the linked data files themselves (RDF/XML, N3, JSON, and other formats) use w3id permanent identifiers (https://w3id.org/) that point to those URLs:

https://w3id.org/national-ocean-council/api/vocab/dcmp/term/assessments
https://w3id.org/national-ocean-council/api/vocab/dcmp/term/bathymetry-and-elevation
https://w3id.org/national-ocean-council/api/vocab/dcmp/term/biodiversity
https://w3id.org/national-ocean-council/api/vocab/dcmp/term/biological-occurrence
https://w3id.org/national-ocean-council/api/vocab/dcmp/term/biological-production
.
.
.
etc.


DEMONSTRATION: Creating a new set of permanent identifiers using the PURL system (https://archive.org/services/purl/). Leslie Hsu kindly offered the CDI PURL "sandbox" account for the demo.

In the demo, Alan created one "partial" PURL: http://purl.org/gs-cdi-sandbox/dcmp/term/ that redirects to the target URLs beginning with: https://nocv.tw.rpi.edu/elda/api/vocab/dcmp/term/ (see screenshot, below)

... and the PURL resolver redirects to each of the 90 individual terms:

http://purl.org/gs-cdi-sandbox/dcmp/term/assessments
http://purl.org/gs-cdi-sandbox/dcmp/term/bathymetry-and-elevation
http://purl.org/gs-cdi-sandbox/dcmp/term/biodiversity
http://purl.org/gs-cdi-sandbox/dcmp/term/biological-occurrence
http://purl.org/gs-cdi-sandbox/dcmp/term/biological-production
.
.
.
etc.
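The effect of a partial PURL can be sketched in a few lines. This mimics the observable behavior only (prefix in, prefix out) and is not how the PURL service itself is implemented. The two prefixes come from the demo; the function name resolve_partial is made up for illustration.

```python
# Prefixes taken from the demo above.
PURL_PREFIX = "http://purl.org/gs-cdi-sandbox/dcmp/term/"
TARGET_PREFIX = "https://nocv.tw.rpi.edu/elda/api/vocab/dcmp/term/"

def resolve_partial(purl, purl_prefix=PURL_PREFIX, target_prefix=TARGET_PREFIX):
    """Map a PURL under the partial prefix to its target URL."""
    if not purl.startswith(purl_prefix):
        raise ValueError(f"not under {purl_prefix}: {purl}")
    # Whatever follows the registered prefix is carried over unchanged.
    return target_prefix + purl[len(purl_prefix):]

print(resolve_partial("http://purl.org/gs-cdi-sandbox/dcmp/term/biodiversity"))
```

Because only the prefix is registered, a single entry covers all 90 terms, and pointing the whole vocabulary at a new host means changing one target prefix.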

Why bother? In this example, we might want to export the DCMP linked data from the RPI triple store and import it into a USGS triple store. If we did that, we would need to replace the RPI-controlled permanent identifiers with USGS-controlled ones.

Another example of why it's advisable to use permanent identifiers in linked data: Alan and Fran are co-authors of an OFR that provides a shapefile of the spatial extents (as simple polygons) of 300 undersea features (https://pubs.er.usgs.gov/publication/ofr20141040). Version 1.1 of the OFR, to be published soon, will also include a linked-data representation of the polygons (in WKT). For now, the PURLs in the linked data will point to the OFR, but if USGS establishes a permanent triple store for linked data, we may want to change the targets of the PURLs without having to revise the linked data files themselves.
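For readers unfamiliar with WKT in linked data, the sketch below builds one N-Triples statement that attaches a WKT polygon to a PURL. It assumes the GeoSPARQL asWKT property and wktLiteral datatype, which the notes do not confirm the OFR uses. The PURL and the coordinates are made up, and both function names are hypothetical.

```python
# Assumption: GeoSPARQL terms are used for the geometry; the OFR may differ.
GEO_AS_WKT = "http://www.opengis.net/ont/geosparql#asWKT"
GEO_WKT_LITERAL = "http://www.opengis.net/ont/geosparql#wktLiteral"

def polygon_wkt(ring):
    """Build a WKT POLYGON from a list of (lon, lat) pairs, closing the ring if needed."""
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON(({coords}))"

def wkt_triple(geometry_uri, ring):
    """Return one N-Triples line linking a geometry URI to its WKT polygon."""
    return f'<{geometry_uri}> <{GEO_AS_WKT}> "{polygon_wkt(ring)}"^^<{GEO_WKT_LITERAL}> .'

# Hypothetical PURL and made-up coordinates.
geometry = "http://purl.org/example/feature/sample-canyon/geometry"
print(wkt_triple(geometry, [(-70.0, 40.0), (-69.0, 40.0), (-69.0, 41.0), (-70.0, 41.0)]))
```

Since the subject is a PURL, the statement itself never has to change if the data later move to a different host; only the PURL target does.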

We may continue this discussion at a later SWWG meeting.


January 11, 2018

Planned agenda:

  1. Experiment with Google Hangouts
  2. Continuing last month's discussions.

Attending: Alan Allwardt, Peter Schweitzer, Ken Bagstad, Fran Lightsom, Andy LaMotte, Dave Govoni, Matt Arsenault, Mike Ierardi, Dave Coyle

CDI proposal discussion: What would be the benefit of doing the CDI project that would make it worthwhile to jump through the hoops? What would we do with the CDI money and where would we get the matching funds? Some of us need funding for our time to participate in a project. The rest of us could provide our time as matching.

Getting organized for the respondent panel at the Feb. CDI meeting: Ken will get a preview of the presentation. Fran will recruit more panel members and invite the panel to a preliminary discussion around Feb. 1.

Google Hangout worked pretty well, after some fumbling around at the beginning of the call.