Blog from July, 2020

Artificial Intelligence/Machine Learning, 6/9 - Tallgrass Supercomputer for AI/ML

Natalya Rapstine presented "USGS Tallgrass Supercomputer 101 for AI/ML," an overview of the new USGS Tallgrass supercomputer, which is designed to support machine learning and deep learning workflows at scale, and of deep learning software and tools for data science workflows. Natalya's slides covered the software stack that supports deep learning, including PyTorch, Keras, and TensorFlow. She then illustrated the capabilities with the "Hello World!" example of deep learning: the MNIST Database of Handwritten Digits.
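
As a rough sketch of that "Hello World!" workflow (illustrative only, not the demo code from the presentation), a minimal Keras classifier for the MNIST digits might look like this; the layer sizes and epoch count are arbitrary choices.

    # Minimal MNIST example with Keras/TensorFlow (illustrative hyperparameters).
    import tensorflow as tf

    # Load the MNIST handwritten-digit images and scale pixel values to [0, 1].
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # A small fully connected network: flatten each 28x28 image, one hidden layer, softmax output.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    model.fit(x_train, y_train, epochs=5)       # train
    model.evaluate(x_test, y_test, verbose=2)   # report test-set accuracy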

Slides, a recording, and many more links to resources are available at the AI/ML Meeting Notes page.

ASCII art for the Tallgrass supercomputer

Data Management, 6/8 - Data Curation Network - extending the research data toolkit

Guests from the Data Curation Network (Lisa Johnston, Wendy Kozlowski, Hannah Hadley, and Liza Coburn) presented on their recent work.

CURATED stands for: Check files/code; Understand the data; Request missing info or changes; Augment metadata; Transform file formats; Evaluate for FAIRness; Document curation activities.

Checklists and primers related to these topics for specific file formats are available at: https://datacurationnetwork.org/resources/

Also of interest is an Excel Archival Tool, which programmatically converts Microsoft Excel files into open-source formats suitable for long-term archival, including .csv, .png, .txt, and .html: https://github.com/mcgrory/ExcelArchivalTool
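
As a rough sketch of the same idea in Python (not the Excel Archival Tool's own code), each sheet of a workbook could be exported to CSV with pandas; the file name below is a placeholder.

    # Illustrative sketch: export every sheet of an Excel workbook to CSV for archiving.
    # Requires pandas with the openpyxl engine; "workbook.xlsx" is a placeholder file name.
    import pandas as pd

    sheets = pd.read_excel("workbook.xlsx", sheet_name=None)   # dict of sheet name -> DataFrame
    for name, frame in sheets.items():
        frame.to_csv(f"workbook_{name}.csv", index=False)      # one open-format CSV per sheet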


Data Curation Network infographic at https://datacurationnetwork.org/resources/

DevOps, 6/2 - Elevation data processing at scale

Josh Trahern, Project Manager of the NGTOC Elevation Systems Team, led a discussion titled "Elevation Data Processing At Scale - Deploying Open Source GeoTools Using Docker, Kubernetes and Jenkins CI/CD Pipelines."

The presentation highlighted the Lev8 (pronounced "elevate," and doing petabyte-scale processing of DEMs) and QCR web applications produced by the Elevation team. Production Ops uses these tools to generate the National Elevation Dataset (NED). The NED is a compilation of data from a variety of existing high-precision sources, such as lidar data, contour maps, the USGS DEM collection, and SRTM, combined into a seamless dataset designed to cover the entire United States.

The team is moving away from proprietary software and taking ownership of the code base to avoid fitting a square peg into a round hole, working toward 100% automation and 100% documentation, and moving to a Linux environment, all while the system remains operational.

See the recording on the DevOps Meetings page.

Fire Science, 6/16 - NLCD Rangeland Fractional Component Time-Series: Development and Applications

Paul Steblein gave a Fire Update, and Matthew Rigge (EROS) presented on "NLCD Rangeland Fractional Component Time-Series: Development and Applications."

The Fire Science coordinators and CDI staff are working on syncing content between the internal OneDrive and the CDI wiki. Contact Paul at psteblein@usgs.gov if you have any questions about the group.

Metadata Reviewers, 6/1 - What is important to metadata reviewers

The Metadata Reviewers group had a discussion about what matters to them when reviewing metadata. Some themes were making USGS data as findable and reusable as possible, avoiding unnecessary complexity, and making metadata easier to write.

See more notes on the discussion at their Meetings wiki page.

Open Innovation, 6/18 and 6/19 - Paperwork reduction and Community-based water quality monitoring

On June 18, the topic was "Tackling the Paperwork Reduction Act (PRA) in the Age of Social Media and Web-based Interactive Technology." Three Information Collection Clearance Officers from DOI (Jeff Parrillo), USGS (James Sayer), and FWS (Madonna Baucum) explained the basics of the Paperwork Reduction Act (PRA), discussed how the PRA applies to crowdsourcing, citizen science, and prize competition activities, and participated in a Q&A discussion with the audience. More information is available on the Open Innovation wiki.

On June 19, Ryan Toohey and Nicole Herman-Mercer presented on "Indigenous Observation Network (ION): Community-Based Water Quality Monitoring Project." ION, a community-based project, was initiated by the Yukon River Inter-Tribal Watershed Council (YRITWC) and USGS. Capitalizing on existing USGS monitoring and research infrastructure and supplementing USGS collected data, ION investigates changes in surface water geochemistry and active layer dynamics throughout the Yukon River Basin. More information is available on the Open Innovation wiki.

Risk, 6/18 - Funded Project Reports

This was "round 1" of final project presentations from the FY19 Risk RFP awardees. Please see the list below for presenters - each one is about 10-12 minutes in length. PIs from each project provided a project overview, a description of their team, accomplishments, deliverables, and lessons learned.

  • Quantifying Rock Fall Hazard and Risk to Roadways in National Parks: Yosemite National Park Pilot Project, Brian Collins, Geology, Minerals, Energy, and Geophysics Science Center
  • The State of Our Coasts: Coastal Change Hazards Stakeholder Engagement & User Need Assessment, Juliette Finzi-Hart, Pacific Coastal and Marine Science Center
  • Re-visiting Bsal risk: how 3 years of pathogen surveillance, research, and regulatory action change our understanding of invasion risk of the exotic amphibian pathogen Batrachochytrium salamandrivorans, Dan Grear, National Wildlife Health Center
  • Communications of risk - uranium in groundwater in northeastern Washington state, Sue Kahle, Washington Water Science Center

See more at the Risk community of practice wiki page.

Tech Stack, 6/11 - ESIP Collaboration Infrastructure 2.0

In June, the joint Tech Stack and ESIP IT&I meeting hosted three presentations:

  • Ike Hecht (WikiWorks) on the ESIP Wiki: upgrading the ESIP wiki from MediaWiki v1.19 to v1.34.
  • Lucas Cioffi (QiQoChat lead developer) on the technical side of QiQoChat: using QiQoChat to bring together asynchronous workspaces with virtual conferences and meetings.
  • Sheila Rabun (ORCID US Community Specialist) on the ORCID API: becoming an ORCID member to gain access to ORCID API keys and integrate ORCID authentication into the wiki.

See more at the IT&I meetings page.

Software Dev, 6/25 - Serverless!

Carl Schroedl presented on "Using Serverless and GitLab CI/CD to Continuously Deliver AWS Step Functions." See: https://aws.amazon.com/lambda
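
As a generic illustration of one of the serverless building blocks involved (not the code from the presentation), a Python AWS Lambda handler that a Step Functions state machine might invoke as a task could look like the following; the event fields and return keys are hypothetical.

    # Illustrative AWS Lambda handler for a single Step Functions task.
    # The "site_id" input field and the returned keys are hypothetical.
    def lambda_handler(event, context):
        site_id = event.get("site_id", "unknown")
        # ...do the real work here: fetch data, run a computation, write results...
        # Step Functions passes this return value as input to the next state.
        return {"site_id": site_id, "status": "processed"}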

Notes and more links:

-- 
All CDI Blog Posts 

Continuing our exploration of CDI's 2019 funded projects, June's monthly meeting included updates on projects involving extending ScienceBase's current capabilities to aid disaster risk reduction, coupling hydrologic models with data services, and standardizing and making available 40 years of biosurveillance data.

For more information, questions and answers from the presentation, and a recording of the meeting, please visit the CDI wiki. 

Screenshot of a beta version of ScienceBase where an option to publish all files to ScienceBase appears.

Extending ScienceBase for Disaster Risk Reduction - Joe Bard, USGS 

The Kilauea volcano eruption in 2018 revealed a need for near real-time data updates for emergency response efforts. During the eruption, Bard and his team created lava flow update maps to inform decision-making, using email to share data updates. This method proved to be flawed, causing issues with versioning of data files and limitations on sharing with all team members at the same time. 

ScienceBase has emerged as an alternative way to share data for use by emergency response workers. When GIS data is uploaded to ScienceBase, web services are automatically created. Web services are a type of software that facilitates computer-to-computer interaction over a network. Users don't need to download data to access it; instead, it can be accessed programmatically. Additionally, data updates can be propagated automatically through web services, avoiding versioning issues. However, use of ScienceBase during the Kilauea volcano crisis ran into unforeseen reliability issues related to hosting on USGS servers and an overload of simultaneous connections.
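
As a minimal sketch of that kind of programmatic access (not the project's code), a ScienceBase item's JSON record can be requested over HTTP; the item ID below is a placeholder, and the field names reflect typical ScienceBase item JSON rather than a guaranteed schema.

    # Illustrative sketch: read a ScienceBase item's metadata and service links programmatically.
    import requests

    item_id = "PLACEHOLDER_ITEM_ID"   # hypothetical ScienceBase item identifier
    resp = requests.get(f"https://www.sciencebase.gov/catalog/item/{item_id}",
                        params={"format": "json"}, timeout=30)
    resp.raise_for_status()
    item = resp.json()

    print(item.get("title"))
    # Web service and download links, when present, are typically listed under "distributionLinks".
    for link in item.get("distributionLinks", []):
        print(link.get("title"), link.get("uri"))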

This project explores a cloud-based instance of GeoServer hosted on AWS, to which users can publish geospatial services. This method is more resilient to simultaneous connections and takes load balancing and auto-scaling into account. It also opens the possibility of dedicated GeoServer instances based on a team's needs. ScienceBase is currently working on a function to publish data directly to S3.

A related Python tool for downloading data from the internet and posting it to ScienceBase, using ASH3D as an example, is available on GitLab for USGS users.

Next steps for this project include finalizing cloud hosting service deployment and configuration settings, checking load balancing and quantifying performance, exploring setup of multiple GeoServer instances in the cloud, evaluating load balancing technologies (e.g., CloudFront), and ensuring all workflows are possible using the ScienceBase Python library.

Presentation slide explaining the concept of a modeling sandbox.

Coupling Hydrologic Models with Data Services in an Interoperable Modeling Framework - Rich McDonald, USGS  

Integrated modeling is an important component of USGS priority plans. The goal of this project is to use an existing and mature modeling framework to test a Modeling and Prediction Collaborative Environment "sandbox" that can be used to couple hydrology and other environmental simulation models with data and analyses. 

Modeling frameworks are founded on the idea of component models. Model components encapsulate a set of related functions into a usable form. For example, going through a Basic Model Interface (BMI) means that no matter what the underlying language is, the model component can be made available as a Python component. 

To test the CSDMS modeling framework, the team took PRMS (the Precipitation-Runoff Modeling System), broke it down into its four reservoirs (surface, soil, groundwater, and streamflow), and wrapped each in a BMI. They then re-coupled them back together. The expectation is that a user could then couple PRMS with other models.
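
As a loose illustration of how such components fit together (a simplified stand-in for the full BMI specification, not the project's actual code), each reservoir can be exposed through a small wrapper class with standard methods, and a coupled run then becomes a loop that steps the components and passes values between them. The class, variable, and configuration-file names below are hypothetical.

    # Simplified BMI-style wrapper; the real BMI spec defines many more methods
    # (get_component_name, update_until, grid metadata, etc.).
    class ReservoirBMI:
        """Hypothetical wrapper exposing one PRMS-like reservoir to Python."""

        def initialize(self, config_file):
            # Read parameters and set initial state. The underlying model could be
            # Fortran or C code called through language bindings.
            self.time = 0.0
            self.state = {"inflow": 0.0, "outflow": 0.0}

        def update(self):
            # Advance the component one time step (placeholder physics).
            self.state["outflow"] = 0.5 * self.state["inflow"]
            self.time += 1.0

        def get_value(self, name):
            return self.state[name]

        def set_value(self, name, value):
            self.state[name] = value

        def finalize(self):
            pass

    # Re-coupling two wrapped components: step them together and exchange values.
    surface = ReservoirBMI()
    soil = ReservoirBMI()
    surface.initialize("surface.cfg")   # hypothetical configuration files
    soil.initialize("soil.cfg")

    for _ in range(365):                # one year of daily time steps
        surface.update()
        soil.set_value("inflow", surface.get_value("outflow"))
        soil.update()

    surface.finalize()
    soil.finalize()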

See the meeting recording for a demonstration of the tool. Note the run-time interaction with the model during the demo. You'll also see that PRMS is written in Fortran but is being run from Python. Code for this project is available on GitHub.

Presentation slide of the interface of the Wildlife Health Information Sharing Partnership event reporting system, abbreviated WHISPers

Transforming Biosurveillance by Standardizing and Serving 40 Years of Wildlife Disease Data - Neil Baertlein, USGS 

Did you know that over 70% of emerging infectious diseases originate in wildlife? The National Wildlife Health Center (NWHC) has been dedicated to wildlife health since 1975. Biosurveillance efforts the NWHC has been involved in include lead poisoning, West Nile virus, avian influenza, white-nose syndrome, and SARS-CoV-2.

NWHC has become a major data repository for wildlife health data. To manage this data, WHISPers (Wildlife Health Information Sharing Partnership event reporting system) and LIMS (laboratory information management system) are used. WHISPers is a portal for biosurveillance data in which events are lab-verified; it allows collaboration with various state and federal partners, as well as some international partners such as Canada.

There is a need to leverage NWHC data to inform the public, scientists, and decision makers, but substantial barriers stand in the way of this goal: 

  1. Data is not FAIR (findable, accessible, interoperable, and reusable) 
  2. There are nearly 200 datasets in use 
  3. Data is not easy to find 
  4. Data exists in various file formats 
  5. There is limited to no documentation for data 

As a result, this project has formulated a five-step process for making NWHC data FAIR: 

  1. Definition: create a definition. NWHC created a template that captures information such as the users responsible for the data, the file type, and where the data is stored. A data dictionary was also created (a rough sketch of such a record follows this list).
  2. Classification: provide meaning and context for the data. In this step, NWHC classifies relationships with other datasets and databases and identifies inconsistencies in the data.
  3. Prioritization: identify high-priority datasets. High-priority datasets are ones that NWHC needs to continue to use down the road or that are currently high-impact; non-priority datasets can be archived.
  4. Cleansing: the next step for high-priority datasets. This includes fixing data errors and standardizing the data.
  5. Migrating: map and migrate the cleansed data.
To put this five-step process into effect, NWHC hired two dedicated student service contractors to work on the project. Interviews with lab technicians, scientists, and principal investigators were conducted to gather input and identify high-priority datasets. Dedicated staff also documented datasets, organized that documentation, and began cleansing high-priority datasets by fixing errors and standardizing data. At the time of this presentation, 130 datasets were ready for archiving and cleansing.

There have been some challenges along the way. Training the staff responsible for making NWHC data FAIR and easier to work with has been a substantial time investment. The work is labor- and time-intensive, and some datasets do not have any documentation readily available. The current databases in use were built with limited knowledge of database design. Finally, there is variation in laboratory methodology, field methodology, and practices between individuals and teams.

The project team shared several takeaways. Moving forward, data collectors need to think through data collection methods and documentation more thoroughly. Questions a data collector might ask about their process include: Is it FAIR? Are my methods standardized? How is the data collected now, and how will it be collected in the future? Documenting the process and management of data collection and compilation is also important.

--   
All CDI Blog Posts