
This Forum post is the parent page for additional questions related to the July 13th CDI Monthly Meeting presentation, "Implementing Controlled Vocabulary Services in USGS," by Fran Lightsom, Peter Schweitzer, and Alan Allwardt (USGS).

CDI members can view the full set of slides on the July Monthly Meeting page.

Here's a selected slide that points to more information on the Controlled Vocabulary Services.

The Q&A from the meeting is copied here from the July Monthly Meeting page.

Cassandra: How can this vocabulary technology be applied to entity and attribute fields in metadata?

Peter: The Entity and Attribute section that you are talking about describes the cell values in a table of data. You want to use consistent terms in your data, not just in your metadata. You could build a data editor that has an interface to the vocabulary services, and then set up a vocabulary that contains the allowed values for those fields. Another way to handle this is to put your vocabulary on a server and run a checking service that reports which values in a database do not match the terms from the vocabulary server. So vocabularies should be used in metadata for keywords, but there is no reason they shouldn’t also be used in the data. Standardization benefits the data as well.
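
As an illustration of the checking approach Peter describes, here is a minimal Python sketch. It is not the USGS service: the vocabulary terms, the column values, and the function names are all hypothetical, and a real checker would fetch the terms from a vocabulary server and read the values from a database.

```python
# Hypothetical vocabulary; in practice these terms would come from a vocabulary service.
CONTROLLED_TERMS = {"sand", "silt", "clay", "gravel"}


def normalize(value):
    """Trim whitespace and lowercase so trivial differences do not count as mismatches."""
    return value.strip().lower()


def find_unmatched(values, vocabulary):
    """Return the distinct values, in first-seen order, that are not in the vocabulary."""
    allowed = {normalize(term) for term in vocabulary}
    unmatched = []
    seen = set()
    for value in values:
        key = normalize(value)
        if key not in allowed and key not in seen:
            seen.add(key)
            unmatched.append(value)
    return unmatched


# Hypothetical attribute column from a data table.
sediment_column = ["Sand", "silt ", "mud", "clay", "Mud", "cobble"]
print(find_unmatched(sediment_column, CONTROLLED_TERMS))
```

The values reported by `find_unmatched` are candidates for correction in the data or for discussion with the vocabulary's editors.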

Question: Where are we in USGS in terms of CSDGM vs ISO?

Peter: Most people in USGS are using CSDGM, mostly because ISO is drastically more complicated. My own feeling is that to make more effective use of ISO we need interfaces for ISO. There are a lot of similarities between the standards, so it is really just a question of interfaces. The part of the metadata that we are talking about here is the keywords, and those fields are used in a very similar way in both standards, so implementing vocabulary services should be very similar.
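
To make the keyword fields concrete, the sketch below reads theme keywords from a CSDGM record using Python's standard library. The element names `themekt` (the keyword thesaurus) and `themekey` (the keyword) come from the CSDGM standard; the sample record, its values, and the function name are invented for illustration. ISO metadata likewise pairs each group of keywords with a thesaurus citation, which is why the same vocabulary service could feed either format.

```python
import xml.etree.ElementTree as ET

# A hypothetical, heavily trimmed CSDGM record; only the keyword section is shown.
CSDGM_SAMPLE = """
<metadata>
  <idinfo>
    <keywords>
      <theme>
        <themekt>USGS Thesaurus</themekt>
        <themekey>geochemistry</themekey>
        <themekey>mineral deposits</themekey>
      </theme>
      <theme>
        <themekt>None</themekt>
        <themekey>my local term</themekey>
      </theme>
    </keywords>
  </idinfo>
</metadata>
"""


def theme_keywords(xml_text):
    """Map each keyword thesaurus name (themekt) to its list of keywords (themekey)."""
    root = ET.fromstring(xml_text)
    result = {}
    for theme in root.findall("./idinfo/keywords/theme"):
        thesaurus = (theme.findtext("themekt") or "").strip()
        keys = [k.text.strip() for k in theme.findall("themekey") if k.text]
        result.setdefault(thesaurus, []).extend(keys)
    return result


for thesaurus, keys in theme_keywords(CSDGM_SAMPLE).items():
    print(thesaurus, "->", keys)
```

Grouping keywords by thesaurus makes it straightforward to check each group against the vocabulary it claims to come from.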

Leslie: There is an upcoming CDI presentation on a project that is working on an ISO metadata tool. That should be in the next few months, so stay tuned.

Viv Hutchison: If you read the policy for the data management requirements, in the metadata chapter, it doesn’t specify one or the other. You can use either standard. USGS can poise itself to move in the ISO direction. Otherwise the CSDGM standard can be used.

Alan: Also, note that we need to put metadata into the Science Data Catalog, which currently cannot handle ISO.

Viv: We will be working on accepting ISO metadata in the SDC in the future.

Alan: In the Coastal and Marine Geology Program, we have time series data that uses ISO metadata.

Ra’ad: Can she [Fran] elaborate on the need for a business model for the vocabulary?  And how will this model be developed?

Fran: Right now the vocabulary services are living on a mineral resources data server that Peter takes care of. We need to make sure that they don’t depend on one person or one program. No clue how to go forward with that. Hopefully, someone will join our working group and help us figure that out.

Peter: The services are hosted in two places: on the mineral resources site and on the Science Topics site on the USGS home page. This is really a question that all of CDI should think about: when you work across the org chart, there are questions about where you put your stuff, because it doesn’t fit within one of the mission areas. We talk about collaboration across disciplines and mission areas, but when you do this you need to figure out where to put it. We’ve talked with WRET about how to handle this, but we haven’t come to a conclusion yet.



11 Comments

  1. Question from metadata novice:

    Do you have any suggestions of best practices for which vocabularies to use?

    If I were writing metadata for the first time, I would be confused which vocabulary to use since there are multiple vocabularies available. Do I start with the USGS Thesaurus and if I find something there, stop at that, or should I be using multiple vocabularies?

    1. The USGS Thesaurus is a good place to start for data and information produced by USGS.  It was developed to make it easier for people outside the USGS to find scientific information without having to know our organizational structure.  It was implemented on the USGS home page through the Science Topics catalog, a collection of web resources categorized by topic, that was developed and maintained for eleven years.  See https://www2.usgs.gov/science/about/ for more information about it.  It was specifically designed to be browsable on the web: it is a single hierarchy, and from any concept the number of narrower concepts is generally small, to facilitate visual scanning.

      The USGS Thesaurus generally contains types of things, not names of things.  Consequently it does not contain geographic areas or formal biological, lithologic, stratigraphic, or chemical names.  Other vocabularies should be used for those.

      We did not include types of geographic features in the thesaurus because another vocabulary existed that contained those types, and it was built in the same way as the USGS Thesaurus: the Alexandria Digital Library Feature Type Thesaurus.

      In addition to the Science Topics catalog, similar interfaces using the USGS Thesaurus are at Mineral Resources Online Data Catalog and the USGS Geoscience Data Catalog.

      The editorial group managing the USGS Thesaurus is interested in learning about concepts present in your information that are not represented in the thesaurus.  Let me know what you don't see there, and we will be happy to discuss changes.

      1. Unknown User (aallwardt@usgs.gov)

        The core editorial group for the USGS Thesaurus currently consists of Peter Schweitzer (the ringleader), Alan Allwardt, Dave Govoni, Lisa Zolly, and Fran Lightsom. Others have come and gone, often contributing to special projects (such as coordinating WRET tags with the USGS Thesaurus).

        1. And the history of people who have participated in this work is included in our FAQ list.

  2. If a term is marked as "non-preferred," how strongly does that mean that it should be changed to the preferred term? How did it come to be non-preferred? Who makes these decisions? Is the idea that a tool might use the services to suggest the preferred name?

    1. Non-preferred terms exist to simplify the presentation of the concepts in a thesaurus, and to make it easier to search for topics by entering text.  They aren't deprecated words, so no official opprobrium is intended by this idea.  But in scientific work, it's common for some concept to be referred to using a variety of related words, so we choose one as the label we show, and the others are these "non-preferred" terms, sometimes called "lead-in" terms or "use-for" terms.  These might be variations in spelling or wording, and they might even be concepts that are different in detail but not different enough for the purpose of helping people find information resources.

      The editorial group managing the USGS Thesaurus decides what texts to associate with concepts, and these do sometimes change.  We're open to discussions about how concepts are arranged and how texts are attached to concepts in the thesaurus.

  3. Not sure if this addresses the sustainability question, but could there be any connection between what was presented here and the "Community Ontology Repository" being developed at ESIP (http://cor.esipfed.org/)? Recently, the EarthCube Semantics working group decided that a useful next step would probably be to "enter" all of their semantics resources into that repository (recent EarthCube tech committee meeting notes) (ESIP-related slides right after the intro slides here).

    Part of my ignorance is not really knowing the differences between Vocabularies and Ontologies and how to tell the difference. (help?) Sorry if it's completely irrelevant. The contact there at ESIP was Tom Narock though. 

    1. Unknown User (aallwardt@usgs.gov)

      The Marine Metadata Interoperability (MMI) Guides discuss controlled vocabularies (CV) here: https://marinemetadata.org/guides/vocabs. Of particular interest is the child page that presents a classification of controlled vocabularies (https://marinemetadata.org/guides/vocabs/voctypes), ranging from simple, flat controlled vocabularies (like authority files and glossaries) to complex, relational controlled vocabularies (like thesauri and ontologies). Two things to bear in mind: First, the CSDGM metadata standard refers to ALL controlled vocabularies as "keyword thesauri" which is, strictly speaking, using the term "thesaurus" incorrectly. Second, most of us are pretty sloppy in how we use CV terminology when speaking.

      In the semantic web world, a controlled vocabulary expressed in SKOS is probably a thesaurus, whereas a controlled vocabulary expressed in OWL is probably an ontology.

      Also note that the ESIP page you cite is hosted by MMI, which has its own Ontology Registry and Repository here: http://mmisw.org/

      Another example is the NERC vocabulary server hosted by the British Oceanographic Data Centre (http://www.bodc.ac.uk/products/web_services/vocab/) which has ingested controlled vocabularies from a variety of sources (BODC, SeaDataNet, GCMD, ICAN) and provided a set of web services for accessing them. It is this second aspect (providing web services) that distinguishes the NERC server from the ESIP and MMI registries. The model we presented in today's talk for our Objective 1 (see also the Manifesto, figure 3, p. 12) is closer to NERC than to ESIP or MMI because we want to 1) provide web services for the vocabularies that we host and 2) also provide a way to access web services for vocabularies hosted by other agencies (we mentioned GCMD, for instance).

    2. We are not the only people doing work like this.  And other groups might have good ways to store or provide this type of information.  What's important, in my view, is that we make the vocabularies available, we tend them and grow them and we make effective use of them in our metadata and in our web interfaces.  It's probably valuable for us to provide our information in multiple formats (JSON, XML, RDF, HTML).  I might be open to working with outside groups if their systems accommodate our vocabularies (need flexibility, not strict SKOS) and provide the methods of access that we need, but my first priority (my only priority, really) is to make our scientific information easier to find, get, and use.

      It's just my somewhat irreverent opinion, and I'd love to be proven wrong, but I've found that the more different organizations get involved in a coordinating effort like this, the harder it is to understand and use anything they might create.  Instead, the value of such meetings is often in learning what some specific people are doing (Bob Arko, for example, is quite practical in his approaches).

      1. Unknown User (aallwardt@usgs.gov)

        To elaborate: Bob Arko (Lamont) is instrumental in the Rolling Deck to Repository (R2R) effort (http://www.rvdata.us), which has developed controlled vocabularies for describing "underway" data (i.e., the ship leaves its instruments turned on as it travels from site to site). Their vocabularies are here: http://www.rvdata.us/voc

  4. I do think we may need to have some open discussion or training in which we help people understand how they should expect to use controlled vocabularies.  Generally the ideal arrangement is that the category terms are reviewed and revised by people who have a strong understanding of the vocabularies and how they are used by the systems that process the information "downstream", and who have a broad and deep enough scientific knowledge that they won't make too many category mistakes.  Ideally the keywords provided by metadata authors are subject to revision at a later stage by people who can see the entire collection of metadata–authors of scientific publications are typically too narrowly focused in their interests to have the proper viewpoint from which to do that work.  Not that their input is to be ignored or discarded, but it should be reviewed with the collection in mind.

    Also people need to learn the difference between identifiers and category terms–that's often a stumbling block.
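
As a footnote to the discussion of non-preferred terms in comment 2, and to the question of whether a tool might suggest the preferred name, here is a minimal Python sketch of that lookup. It is an illustration only: the concepts and terms are invented rather than taken from the USGS Thesaurus, the function names are hypothetical, and a real tool would query the vocabulary services instead of a hard-coded list.

```python
# Hypothetical thesaurus fragment: each concept has one preferred label
# and zero or more non-preferred ("lead-in" or "use-for") terms.
CONCEPTS = [
    {"preferred": "earthquakes", "non_preferred": ["seismic events", "temblors"]},
    {"preferred": "ground water", "non_preferred": ["groundwater", "subsurface water"]},
]


def build_lookup(concepts):
    """Index every label, preferred or not, by its lowercase form."""
    lookup = {}
    for concept in concepts:
        preferred = concept["preferred"]
        lookup[preferred.lower()] = preferred
        for term in concept["non_preferred"]:
            lookup[term.lower()] = preferred
    return lookup


def suggest_preferred(text, lookup):
    """Return the preferred label for the text, or None if it matches no concept."""
    return lookup.get(text.strip().lower())


lookup = build_lookup(CONCEPTS)
print(suggest_preferred("Groundwater", lookup))
print(suggest_preferred("volcanoes", lookup))
```

Because non-preferred terms point to a concept rather than marking a word as wrong, the lookup only suggests a label; the author's original wording can still be kept for searching.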