
Continuing our exploration of 2019's CDI-funded projects, June's monthly meeting included updates on projects involving extending ScienceBase's current capabilities to aid disaster risk reduction, coupling hydrologic models with data services, and standardizing and making available 40 years of biosurveillance data.

For more information, questions and answers from the presentation, and a recording of the meeting, please visit the CDI wiki. 

Screenshot of a beta version of ScienceBase where an option to publish all files to ScienceBase appears.

Extending ScienceBase for Disaster Risk Reduction - Joe Bard, USGS 

The Kilauea volcano eruption in 2018 revealed a need for near real-time data updates for emergency response efforts. During the eruption, Bard and his team created lava flow update maps to inform decision-making, using email to share data updates. This method proved flawed, creating versioning problems with data files and limiting the ability to share updates with all team members at the same time.

ScienceBase has emerged as an alternative way to share data for use by emergency response workers. When GIS data is uploaded to ScienceBase, web services are automatically created. Web services are a type of software that facilitates computer-to-computer interaction over a network. Users don't need to download data to access it; instead, it can be easily accessed programmatically. Additionally, data updates can be automatically propagated through web services to avoid versioning issues. However, use of ScienceBase during the Kilauea volcano crisis ran into unforeseen reliability issues related to hosting on USGS servers and an overload of simultaneous connections.
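
As a rough illustration of what programmatic access to such a service can look like, the sketch below requests features from an OGC Web Feature Service using Python's requests library. The endpoint URL and layer name are placeholders, not the project's actual services.

```python
# Minimal sketch: reading GIS data from a hosted OGC web service programmatically,
# without downloading files first. The service URL and layer name below are
# placeholders -- substitute the WMS/WFS endpoint listed on the ScienceBase item.
import requests

SERVICE_URL = "https://example.usgs.gov/geoserver/ows"  # hypothetical endpoint

params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "GetFeature",
    "typeNames": "workspace:lava_flow_extent",  # assumed layer name
    "outputFormat": "application/json",
}

response = requests.get(SERVICE_URL, params=params, timeout=60)
response.raise_for_status()
features = response.json()["features"]
print(f"Retrieved {len(features)} features from the live service")
```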

This project explores a cloud-based GeoServer instance hosted on Amazon Web Services (AWS) to which users can publish geospatial services. This approach is more resilient to simultaneous connections and takes advantage of load balancing and auto-scaling. It also opens the possibility of dedicated GeoServer instances tailored to a team's needs. ScienceBase is currently working on a function to publish data directly to S3.

A related Python tool for downloading data from the internet and posting it to ScienceBase, using ASH3D as an example, is available on GitLab for USGS users.
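
For readers curious what such a workflow involves, here is a minimal sketch of the general pattern (download a file, then attach it to a ScienceBase item) using the sciencebasepy library. The URL, item ID, and credentials are placeholders, and the actual GitLab tool may differ.

```python
# Minimal sketch of the kind of workflow the tool automates: fetch a file from
# the web and attach it to a ScienceBase item with the sciencebasepy library.
# The URL, item ID, and credentials below are placeholders, not project values.
import requests
import sciencebasepy

DATA_URL = "https://example.usgs.gov/ash3d/latest_forecast.zip"  # placeholder
ITEM_ID = "0123456789abcdef01234567"                             # placeholder
LOCAL_FILE = "latest_forecast.zip"

# Download the data product to a local file.
resp = requests.get(DATA_URL, timeout=120)
resp.raise_for_status()
with open(LOCAL_FILE, "wb") as f:
    f.write(resp.content)

# Log in and upload the file to an existing ScienceBase item.
sb = sciencebasepy.SbSession()
sb.login("username@usgs.gov", "password")  # or sb.loginc("username@usgs.gov") to be prompted
item = sb.get_item(ITEM_ID)
sb.upload_file_to_item(item, LOCAL_FILE)
sb.logout()
```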

Next steps for this project include finalizing cloud hosting service deployment and configuration settings, checking load balancing and quantifying performance, exploring the setup of multiple GeoServer instances in the cloud, evaluating load-balancing technologies (e.g., CloudFront), and ensuring all workflows are possible using the ScienceBase Python library.

Presentation slide explaining the concept of a modeling sandbox.

Coupling Hydrologic Models with Data Services in an Interoperable Modeling Framework - Rich McDonald, USGS  

Integrated modeling is an important component of USGS priority plans. The goal of this project is to use an existing and mature modeling framework to test a Modeling and Prediction Collaborative Environment "sandbox" that can be used to couple hydrology and other environmental simulation models with data and analyses. 

Modeling frameworks are founded on the idea of component models. A model component encapsulates a set of related functions into a usable form. For example, wrapping a model in a Basic Model Interface (BMI) means that, no matter what the underlying language is, the model can be made available as a Python component.
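
As a hedged illustration of the idea, the sketch below shows a toy reservoir component exposed through BMI-style methods (initialize, update, get_value, set_value, finalize). The class name, variable names, and "physics" are invented for illustration; the project's real components wrap PRMS code and follow the full BMI specification.

```python
# Toy water-storage reservoir exposed through BMI-style methods. Illustrative
# only: a real component would implement the full BMI and delegate to the
# underlying (possibly Fortran) model code.
import numpy as np


class ToyReservoirBmi:
    """Toy reservoir component with a BMI-like interface."""

    def initialize(self, config_file):
        # A real component would read model settings from config_file;
        # here we just set up state.
        self.storage = np.zeros(1)
        self.outflow = np.zeros(1)
        self.time = 0.0
        self.dt = 1.0  # daily time step

    def update(self):
        # Placeholder physics: release 10% of storage each step.
        self.outflow = 0.1 * self.storage
        self.storage -= self.outflow
        self.time += self.dt

    def get_value(self, name):
        if name == "outflow":
            return self.outflow.copy()
        raise KeyError(name)

    def set_value(self, name, value):
        if name == "inflow":
            self.storage = self.storage + np.asarray(value)
        else:
            raise KeyError(name)

    def finalize(self):
        pass
```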

To test the CSDMS modeling framework, the team took PRMS (the Precipitation-Runoff Modeling System), broke it down into its four reservoirs (surface, soil, groundwater, and streamflow), and wrapped each in a BMI. They then re-coupled the components back together, with the expectation that a user could likewise couple PRMS with other models.
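
Conceptually, re-coupling then amounts to stepping the components in lockstep and passing one component's output in as another's input. The sketch below does this with two instances of the toy component above; it is not the project's actual PRMS coupling code.

```python
# Sketch of re-coupling two BMI-style components: at each time step, the output
# of the upstream reservoir is fed into the downstream one. Component and
# variable names are illustrative, not the actual PRMS BMI components.
surface = ToyReservoirBmi()
soil = ToyReservoirBmi()

surface.initialize("surface.cfg")
soil.initialize("soil.cfg")

for _ in range(365):                        # one year of daily steps
    surface.update()
    outflow = surface.get_value("outflow")  # water leaving the surface reservoir
    soil.set_value("inflow", outflow)       # becomes inflow to the soil reservoir
    soil.update()

surface.finalize()
soil.finalize()
```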

See the meeting recording for a demonstration of the tool. Note the model run-time interaction during the demo; you'll also see that PRMS is written in Fortran but is being run from Python. Code for this project is available on GitHub.

Presentation slide of the interface of the Wildlife Health Information Sharing Partnership event reporting system, abbreviated WHISPers.

Transforming Biosurveillance by Standardizing and Serving 40 Years of Wildlife Disease Data - Neil Baertlein, USGS 

Did you know that over 70% of emerging infectious diseases originate in wildlife? The National Wildlife Health Center (NWHC) has been dedicated to wildlife health since 1975. Its biosurveillance work has included lead poisoning, West Nile virus, avian influenza, white-nose syndrome, and SARS-CoV-2.

NWHC has become a major repository for wildlife health data. To manage this data, it uses WHISPers (Wildlife Health Information Sharing Partnership event reporting system) and a LIMS (laboratory information management system). WHISPers is a portal for lab-verified biosurveillance events that supports collaboration with various state and federal partners, as well as some international partners such as Canada.

There is a need to leverage NWHC data to inform the public, scientists, and decision makers, but substantial barriers stand in the way of this goal:

  1. Data is not FAIR (findable, accessible, interoperable, and reusable) 
  2. There are nearly 200 datasets in use 
  3. Data is not easy to find 
  4. Data exists in various file formats 
  5. There is limited to no documentation for data 

As a result, this project has formulated a five-step process for making NWHC data FAIR:

  1. Definition: create a definition for each dataset. NWHC created a template that captures information such as the users responsible for the data, the file type, and where the data is stored; a data dictionary was also created (see the sketch after this list). 
  2. Classification: provide meaning and context for the data. In this step, NWHC classifies relationships with other datasets and databases and identifies inconsistencies in the data. 
  3. Prioritization: identify high-priority datasets. High-priority datasets are those NWHC needs to keep using down the road or that are currently high-impact; non-priority datasets can be archived. 
  4. Cleansing: the next step for high-priority datasets. This includes fixing data errors and standardizing data. 
  5. Migration: map and migrate the cleansed data.

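To make step 1 concrete, here is an illustrative sketch of what a dataset inventory record and a data-dictionary entry might look like in Python. The field names and values are invented examples based on the description above, not NWHC's actual template.

```python
# Illustrative sketch of a dataset inventory record and a data-dictionary entry
# in the spirit of step 1 (Definition). The fields are examples inferred from
# the description above, not NWHC's actual template.
dataset_record = {
    "dataset_name": "avian_influenza_screening_results",  # hypothetical name
    "responsible_users": ["lab technician", "principal investigator"],
    "file_type": "Microsoft Access database (.accdb)",
    "storage_location": "//nwhc-share/diagnostics/ai/",   # placeholder path
    "documentation": "none on file",
    "priority": "high",  # assigned later, during step 3 (Prioritization)
}

data_dictionary_entry = {
    "field_name": "collection_date",
    "definition": "Date the specimen was collected in the field",
    "data_type": "date (YYYY-MM-DD)",
    "allowed_values": "1975-01-01 to present",
}

print(dataset_record["dataset_name"], "->", dataset_record["priority"])
```
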
To put this five-step process into effect, NWHC hired two dedicated student service contractors to work on the project. Interviews with lab technicians, scientists, and principal investigators were conducted to gather input and identify high-priority datasets. Dedicated staff also documented datasets, organized that documentation, and began cleansing high-priority datasets by fixing errors and standardizing data. At the time of the presentation, 130 datasets were ready for archiving and cleansing.

There have been some challenges along the way. Training the staff responsible for making NWHC data FAIR and easier to work with has been a substantial time investment. The work is labor- and time-intensive, and some datasets do not have any documentation readily available. The current databases in use were built with limited knowledge of database design. Finally, methodology varies across laboratories, field teams, and individuals.

The project team shared several takeaways. Moving forward, data collectors need to think through data collection methods and documentation more thoroughly. Some questions a data collector may ask about their process are: Is it FAIR? Are my methods standardized? How is the data collected now, and how will it be collected in the future? Documenting the process and management of data collection and compilation is also important.

--   
All CDI Blog Posts