Semantic Technologies for Integrating USGS Data
Semantic Web technologies are a promising approach for integrating data from multiple USGS data systems to address interdisciplinary scientific questions. The proposed project will test and demonstrate this approach by integrating data from five USGS data systems into an information foundation appropriate for research on aquatic habitats. Such a task faces numerous challenges, particularly integrating data that vary in format, characteristics, and the meanings of data terms. Theoretical and applied solutions to these problems have been proposed by an emerging community that is building semantic technology (Berners-Lee and others 2001).
Semantic technology is based on a data model that describes data subjects and their relations to other entities specifically and unambiguously. Individual nodes of information are linked to each other according to formal semantic rules provided by an ontology (see Noy and McGuinness 2001). Any part of these automatically created networks of knowledge can be accessed, so information users can query and customize the data. These functions serve to integrate data more precisely and to convert information from one form to another, and thus allow a richer context of meaning to develop around the data. When connected over the Internet, these networks are often called the Semantic Web or linked data.
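The data model described above can be illustrated with a minimal sketch in plain Python. It assumes that facts are stored as subject-predicate-object triples and that the only ontology rule is a transitive "subClassOf" relation. All identifiers (reach42, BrookTrout, hasObservedSpecies, and so on) are hypothetical and are not drawn from any USGS data system or published ontology.

```python
# Each fact is a (subject, predicate, object) triple.
Triple = tuple[str, str, str]

# Hypothetical facts, for illustration only.
TRIPLES: set[Triple] = {
    ("BrookTrout", "subClassOf", "Fish"),
    ("Fish", "subClassOf", "AquaticOrganism"),
    ("reach42", "type", "StreamReach"),
    ("reach42", "hasObservedSpecies", "BrookTrout"),
    ("reach42", "hasMeanTempC", "11.5"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def superclasses(triples, cls):
    """Follow subClassOf links transitively (a very small ontology rule)."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for _, _, parent in match(triples, s=current, p="subClassOf"):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def reaches_with(triples, cls):
    """Find reaches whose observed species is cls or a subclass of cls."""
    result = set()
    for reach, _, species in match(triples, p="hasObservedSpecies"):
        if species == cls or cls in superclasses(triples, species):
            result.add(reach)
    return sorted(result)
```

In this sketch, a query for reaches with any "Fish" returns reach42 even though the stored fact mentions only "BrookTrout", because the subclass rule supplies the missing link. A production system would more likely rely on established standards such as RDF, OWL, and SPARQL than on hand-written matching.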
This proposal aims to develop and test the semantic approach to data integration by focusing on the problem of fish habitat modeling. Effective prediction of the abundance of particular species at particular locations is a primary objective of both ecology and natural resource management. Better knowledge of fish ecology and habitat requirements and improved tools for assessment and planning are needed to help conserve and rehabilitate fish populations throughout their native range. USGS scientists working on the National Fish Habitat Action Plan (http://www.fishhabitat.org) and aquatic aspects of the GAP Analysis Program (http://gapanalysis.usgs.gov) have these goals: (1) develop empirical species-habitat models that effectively predict the potential of specific stream reaches as habitats for important fish species, (2) describe the predicted distribution of habitats of various qualities, and (3) compare predictions with observed fish abundances. The resulting models, data, and tools will help managers assess the status of their stream habitat resources and prioritize conservation efforts. Evaluation of the model structure and predicted habitat distribution will also provide insight into the suite of conditions that best support important fish species and how those conditions vary within and between watersheds. Currently, the research is conducted by discovering and collecting data, converting the data to compatible formats, and using GIS to combine the data and create a model. We propose to investigate whether semantic techniques could automate and expedite the data discovery and integration, producing an information foundation for project scientists.
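The manual conversion step described above might be automated along the following lines. This is a sketch only, assuming that each source system's field names can be mapped to a shared vocabulary term together with a unit conversion. The source names (system_a, system_b), field names, and shared terms are hypothetical and do not correspond to actual USGS systems or vocabularies.

```python
def fahrenheit_to_celsius(value):
    """Convert a Fahrenheit temperature to Celsius."""
    return (float(value) - 32.0) * 5.0 / 9.0


# Hypothetical mapping: source field -> (shared term, conversion function).
TERM_MAP = {
    "system_a": {
        "site": ("siteId", str),
        "wtemp_f": ("waterTemperatureC", fahrenheit_to_celsius),
    },
    "system_b": {
        "station_id": ("siteId", str),
        "temp_c": ("waterTemperatureC", float),
    },
}


def harmonize(source, record):
    """Rewrite one source record using the shared terms and units."""
    out = {}
    for field, value in record.items():
        if field in TERM_MAP[source]:
            term, convert = TERM_MAP[source][field]
            out[term] = convert(value)
    return out


def integrate(records):
    """Group harmonized temperature values by shared site identifier."""
    merged = {}
    for source, record in records:
        harmonized = harmonize(source, record)
        site = harmonized["siteId"]
        merged.setdefault(site, []).append(harmonized["waterTemperatureC"])
    return merged
```

For example, calling integrate on one record from each hypothetical system that describes the same site yields a single entry with both temperatures expressed in degrees Celsius. In a semantic system the mapping table would typically be expressed in an ontology rather than hard-coded, so that new sources can be added without changing the integration logic.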
The proposed semantic demonstration project will produce an information foundation for fish habitat research that will be a “mashup” of data from multiple USGS data systems that are fragmented among the former USGS Divisions:
The proposed approach to semantic system development follows prototypes being implemented for Data.gov by researchers from Rensselaer Polytechnic Institute and Stanford University (see http://www.data.gov/semantic). The approach is iterative, with the stages diagrammed in Fig. 1.
Fig. 1. Stages in the prototype development.
We propose to complete one cycle of the iterative process by undertaking the following tasks:
1. Access points for querying integrated use case data sets
If successful, the prototype will bring insights to the USGS science community regarding advantages and disadvantages of using semantic technology for scientific monitoring, modeling, and research.
Consulting Costs: $4,000
--Fees for leadership at a workshop and preparation.
Project Team Costs: $12,000
Hardware + System Admin Time: in-kind from CSAS
Total Costs = $16,000
Berners-Lee, T., Hendler, J., and Lassila, O. 2001. The Semantic Web. Scientific American, May 17, 2001, available online at http://www.scientificamerican.com/article.cfm?id=the-semantic-web
Brady, S.R., Sinha, A.K., and Gundersen, L.C., editors. 2006. Geoinformatics 2006 - Abstracts. U.S. Geological Survey Scientific Investigations Report 2006-5201, 60 p. Section 1 (p. 1-5) has a number of abstracts on semantics and ontologies for geosciences.
Noy, N., and McGuinness, D.L. 2001. Ontology development 101: A guide to creating your first ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, available online at http://www.ksl.stanford.edu/KSL_Abstracts/KSL-01-05.html
Sinha, A.K., Malik, Z., Rezgui, A., Barnes, C.G., Lin, K., Heiken, G., Thomas, W.A., Gundersen, L.C., Raskin, R., Jackson, I., Fox, P., McGuinness, D., Seber, D., and Zimmerman, H. 2010. Geoinformatics: Transforming data to knowledge for geosciences. GSA Today, v. 20, no. 12, p. 4-10, available online at http://www.geosociety.org/gsatoday/archive/20/12/article/i1052-5173-20-12-4.htm