Natalya Rapstine presented "USGS Tallgrass Supercomputer 101 for AI/ML," an overview of the new USGS Tallgrass supercomputer, which is designed to support machine learning and deep learning workflows at scale, along with the software and tools that support those workflows. Natalya's slides covered the deep learning software stack, including PyTorch, Keras, and TensorFlow. She then illustrated the capabilities with the "Hello World!" example of deep learning - the MNIST Database of Handwritten Digits.
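The talk's MNIST examples used deep learning frameworks like Keras, but the basic shape of the workflow (load labeled images, fit a model, classify a new image) can be sketched with no dependencies at all. The tiny 5x3 "digit" bitmaps and nearest-centroid model below are invented for illustration, not taken from the presentation:

```python
# Library-free stand-in for the MNIST "Hello World": classify tiny
# 5x3 bitmaps of the digits 0 and 1 with a nearest-centroid model.
ZERO = [1,1,1,
        1,0,1,
        1,0,1,
        1,0,1,
        1,1,1]
ONE  = [0,1,0,
        1,1,0,
        0,1,0,
        0,1,0,
        1,1,1]

def centroid(samples):
    """Mean pixel value at each position across the samples."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def nearest(x, centroids):
    """Label whose centroid has the smallest squared distance to x."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# "Training": build one centroid per class from clean examples.
model = {0: centroid([ZERO]), 1: centroid([ONE])}

# "Evaluation": a noisy zero (one pixel flipped) should still classify as 0.
noisy_zero = list(ZERO)
noisy_zero[4] = 1  # flip one interior background pixel
print(nearest(noisy_zero, model))  # -> 0
```

A real MNIST model replaces the centroid with a trained neural network, but the load/train/evaluate loop is the same.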
Many more resource links are in the slides and recording, available at AI/ML Meeting Notes.
ASCII art for the Tallgrass supercomputer
Guests from the Data Curation Network, Lisa Johnston, Wendy Kozlowski, Hannah Hadley, and Liza Coburn presented on their recent work.
CURATED stands for: Check files/code; Understand the data; Request missing info or changes; Augment metadata; Transform file formats; Evaluate for FAIRness; Document curation activities.
Checklists and primers related to these topics for specific file formats are available at: https://datacurationnetwork.org/resources/
Also of interest is the Excel Archival Tool, which programmatically converts Microsoft Excel files into open formats suitable for long-term archiving, including .csv, .png, .txt, and .html: https://github.com/mcgrory/ExcelArchivalTool
Data Curation Network infographic at https://datacurationnetwork.org/resources/
Josh Trahern, Project Manager of the NGTOC Elevation Systems Team, led a discussion titled "Elevation Data Processing At Scale - Deploying Open Source GeoTools Using Docker, Kubernetes and Jenkins CI/CD Pipelines."
The presentation highlighted the Lev8 (pronounced "elevate," and doing petabyte-scale processing of DEMs) and QCR web applications produced by the Elevation team. These tools are used by Production Ops to generate the National Elevation Dataset (NED). NED is a compilation of data from a variety of existing high-precision sources, such as lidar data, contour maps, the USGS DEM collection, SRTM, and others, combined into a seamless dataset designed to provide continuous coverage of all United States territory.
The team is moving away from proprietary software and taking ownership of the code base - to stop trying to fit a square peg into a round hole. They are working toward 100% automation and 100% documentation and migrating to a Linux environment, making all of these changes while the system remains operational.
See the recording on the DevOps Meetings page.
Fire Update: Paul Steblein gave a fire update, and Matthew Rigge (EROS) presented on "NLCD Rangeland Fractional Component Time-Series: Development and Applications."
The Metadata Reviewers group had a discussion about what matters to them when reviewing metadata. Some themes were making USGS data as findable and reusable as possible, avoiding unnecessary complexity, and making metadata easier to write.
See more notes on the discussion at their Meetings wiki page.
On June 18, the topic was "Tackling the Paperwork Reduction Act (PRA) in the Age of Social Media and Web-based Interactive Technology." Three Information Collection Clearance Officers from DOI (Jeff Parrillo), USGS (James Sayer), and FWS (Madonna Baucum) explained the basics of the Paperwork Reduction Act (PRA), discussed how the PRA applies to crowdsourcing, citizen science, and prize competition activities, and participated in a Q&A discussion with the audience. More information on the Open Innovation wiki.
On June 19, Ryan Toohey and Nicole Herman-Mercer presented on "Indigenous Observation Network (ION): Community-Based Water Quality Monitoring Project." ION, a community-based project, was initiated by the Yukon River Inter-Tribal Watershed Council (YRITWC) and USGS. Capitalizing on existing USGS monitoring and research infrastructure and supplementing USGS collected data, ION investigates changes in surface water geochemistry and active layer dynamics throughout the Yukon River Basin. More information on the Open Innovation wiki.
This was "round 1" of final project presentations from the FY19 Risk RFP awardees. Please see the list below for presenters - each one is about 10-12 minutes in length. PIs from each project provided a project overview, a description of their team, accomplishments, deliverables, and lessons learned.
See more at the Risk community of practice wiki page.
In June, the joint Tech Stack and ESIP IT&I meeting hosted three presentations:
Sheila Rabun, ORCID US Community Specialist, presented on the ORCID API, including becoming an ORCID member to gain access to ORCID API keys and integrating ORCID authentication into the wiki.
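For orientation, ORCID also exposes a public API that does not require member credentials. The sketch below only builds (does not send) a request for a public ORCID record, using ORCID's documented example iD; the member API discussed in the talk additionally requires OAuth client credentials, so check ORCID's current documentation before integrating:

```python
import urllib.request

def public_record_request(orcid_id: str) -> urllib.request.Request:
    """Build a GET request for a researcher's public ORCID record."""
    url = f"https://pub.orcid.org/v3.0/{orcid_id}/record"
    # Asking for JSON instead of the default XML representation.
    return urllib.request.Request(url, headers={"Accept": "application/json"})

# 0000-0002-1825-0097 is ORCID's documented example iD.
req = public_record_request("0000-0002-1825-0097")
print(req.full_url)
```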
Carl Schroedl presented on "Using Serverless and GitLab CI/CD to Continuously Deliver AWS Step Functions." See: https://aws.amazon.com/lambda
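A Step Functions workflow is described in the Amazon States Language (ASL), a JSON document that CI/CD tooling such as Serverless can deploy. The two-state definition below is a generic illustration, not Carl's pipeline; the state names and Lambda ARNs are placeholders:

```python
import json

# Minimal Amazon States Language definition: two Lambda-backed tasks
# run in sequence. ARNs and state names are illustrative placeholders.
state_machine = {
    "Comment": "Two-step pipeline: fetch data, then process it",
    "StartAt": "FetchData",
    "States": {
        "FetchData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:fetch",
            "Next": "ProcessData",
        },
        "ProcessData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process",
            "End": True,
        },
    },
}

# A deployment pipeline ultimately uploads this JSON document to AWS.
definition = json.dumps(state_machine, indent=2)
```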
Notes and more links:
Continuing our exploration of the 2019 CDI-funded projects, June's monthly meeting included updates on projects involving extending ScienceBase's current capabilities to aid disaster risk reduction, coupling hydrologic models with data services, and standardizing and making available 40 years of biosurveillance data.
For more information, questions and answers from the presentation, and a recording of the meeting, please visit the CDI wiki.
The Kilauea volcano eruption in 2018 revealed a need for near real-time data updates for emergency response efforts. During the eruption, Bard and his team created lava flow update maps to inform decision-making, using email to share data updates. This method proved flawed, causing versioning issues with data files and limiting the ability to share updates with all team members at the same time.
ScienceBase has emerged as an alternative way to share data for use by emergency response workers. When GIS data is uploaded to ScienceBase, web services are automatically created. Web services are a type of software that facilitates computer-to-computer interaction over a network. Users don't need to download data to access it; instead, it can be easily accessed programmatically. Additionally, data updates can be automatically propagated through web services, avoiding versioning issues. However, use of ScienceBase during the Kilauea volcano crisis met unforeseen issues around reliability, related to hosting on the USGS server and an overload of simultaneous connections.
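Programmatic access to such services usually means constructing an OGC web service request rather than downloading files. The sketch below builds a standard WMS GetMap URL; the endpoint and layer name are placeholders, not actual ScienceBase services, but the query parameters follow the OGC WMS 1.3.0 convention:

```python
from urllib.parse import urlencode

def wms_getmap_url(base_url, layer, bbox, width=512, height=512):
    """Build a WMS 1.3.0 GetMap URL; the client never downloads raw data."""
    params = {
        "service": "WMS",
        "version": "1.3.0",
        "request": "GetMap",
        "layers": layer,
        "bbox": ",".join(str(v) for v in bbox),  # min lat/lon, max lat/lon
        "width": width,
        "height": height,
        "crs": "EPSG:4326",
        "format": "image/png",
    }
    return f"{base_url}?{urlencode(params)}"

url = wms_getmap_url(
    "https://example.usgs.gov/geoserver/wms",  # placeholder endpoint
    "lava_flow_extent",                        # hypothetical layer name
    (19.3, -155.1, 19.5, -154.9),              # area near Kilauea
)
```

Because clients always request through the same URL, a server-side data update propagates to everyone immediately, which is the versioning benefit described above.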
This project explores a cloud-based instance of GeoServer on AWS, backed by S3, to which users can publish geospatial services. This approach is more resilient to simultaneous connections and supports load balancing and auto-scaling. It also opens the possibility of dedicated GeoServer instances tailored to a team's needs. ScienceBase is currently working on a function to publish data directly to S3.
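Publishing to a GeoServer instance is typically done through its REST API. As a hedged sketch (the host, credentials, and workspace name below are placeholders, and only the request object is built, nothing is sent), creating a workspace looks roughly like:

```python
import base64
import json
import urllib.request

def create_workspace_request(host, user, password, workspace):
    """Build (not send) a POST that creates a GeoServer workspace.

    The /geoserver/rest/workspaces path follows GeoServer's REST
    conventions; everything else here is a placeholder.
    """
    body = json.dumps({"workspace": {"name": workspace}}).encode()
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{host}/geoserver/rest/workspaces",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )

req = create_workspace_request(
    "https://example-cloud-host", "admin", "changeme", "hazards"
)
```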
A related Python tool that downloads data from the internet and posts it to ScienceBase, using ASH3D as an example, is available on GitLab for USGS users.
Next steps for this project include finalizing cloud hosting service deployment and configuration settings, checking load balancing and quantifying performance, exploring setup of multiple GeoServer instances in the cloud, evaluating load-balancing technologies (e.g., CloudFront), and ensuring all workflows are possible using a ScienceBase Python library.
Integrated modeling is an important component of USGS priority plans. The goal of this project is to use an existing and mature modeling framework to test a Modeling and Prediction Collaborative Environment "sandbox" that can be used to couple hydrology and other environmental simulation models with data and analyses.
Modeling frameworks are founded on the idea of component models. Model components encapsulate a set of related functions into a usable form. For example, wrapping a model in a Basic Model Interface (BMI) means that, no matter what the underlying language is, the model component can be made available as a Python component.
To test the CSDMS modeling framework, the team took PRMS (the Precipitation-Runoff Modeling System), broke it down into its four reservoirs (surface, soil, groundwater, and streamflow), wrapped each in a BMI, and then re-coupled them. The expectation is that users could then couple PRMS with other models.
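The coupling idea above can be sketched in a few lines. The toy components below loosely follow CSDMS BMI method names (initialize/update/get_value/set_value/finalize), but the two "reservoirs" and their numbers are invented for illustration, not the actual PRMS decomposition:

```python
class SurfaceReservoir:
    """Produces infiltration from a fixed precipitation input."""
    def initialize(self):
        self.infiltration = 0.0
    def update(self):
        self.infiltration = 10.0 * 0.4  # 40% of 10 mm precip infiltrates
    def get_value(self, name):
        assert name == "infiltration"
        return self.infiltration
    def finalize(self):
        pass

class SoilReservoir:
    """Accumulates storage from whatever inflow it is handed."""
    def initialize(self):
        self.storage = 0.0
        self.inflow = 0.0
    def set_value(self, name, value):
        assert name == "inflow"
        self.inflow = value
    def update(self):
        self.storage += self.inflow
    def get_value(self, name):
        assert name == "storage"
        return self.storage
    def finalize(self):
        pass

# Coupling: a driver moves values between components each time step;
# neither component knows about the other.
surface, soil = SurfaceReservoir(), SoilReservoir()
surface.initialize(); soil.initialize()
for _ in range(3):
    surface.update()
    soil.set_value("inflow", surface.get_value("infiltration"))
    soil.update()
print(soil.get_value("storage"))  # -> 12.0 (3 steps x 4.0 mm)
```

Because the driver only talks to the interface, a Fortran model wrapped in a BMI (like PRMS in the demo) can be swapped in behind the same calls.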
See the meeting recording for demonstration of the tool. You may note the model run-time interaction during the demo. You'll also see that PRMS is in Fortran, but is being run in Python. Code for this project is available on GitHub.
Did you know that over 70% of emerging infectious diseases originate in wildlife? The National Wildlife Health Center (NWHC) has been dedicated to wildlife health since 1975. NWHC biosurveillance efforts have included lead poisoning, West Nile virus, avian influenza, white-nose syndrome, and SARS-CoV-2.
NWHC has become a major data repository for wildlife health data. To manage this data, WHISPers (Wildlife Health Information Sharing Partnership event reporting system) and LIMS (laboratory information management system) are utilized. WHISPers is a portal for biosurveillance data in which events are lab-verified; it allows collaboration with various state and federal partners, as well as some international partners, such as Canada.
There is a need to leverage NWHC data to inform public, scientists, and decision makers, but substantial barriers stand in the way of this goal:
As a result, this project has formulated a five-step process for making NWHC data FAIR:
To put this five-step process into effect, NWHC hired two dedicated student service contractors to work on the project. Interviews with lab technicians, scientists, and principal investigators were conducted to gather input and identify high-priority datasets. Dedicated staff also documented datasets, organized that documentation, and began cleansing high-priority datasets by fixing errors and standardizing data. At the time of this presentation, 130 datasets were ready for archiving and cleansing.
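As a hedged illustration of the kind of cleansing and standardization step described above (the field names and controlled vocabulary here are invented, not NWHC's actual schema), a typical fix maps free-text species entries onto a standard list:

```python
import csv
import io

# Hypothetical controlled vocabulary mapping free-text entries to
# standard species names.
SPECIES_ALIASES = {
    "mallard duck": "Mallard",
    "mallard": "Mallard",
    "canada goose": "Canada Goose",
}

def standardize(raw_csv):
    """Trim whitespace and map species names to the controlled vocabulary."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        species = row["species"].strip().lower()
        rows.append({
            "species": SPECIES_ALIASES.get(species, row["species"].strip()),
            "count": int(row["count"]),
        })
    return rows

raw = "species,count\n Mallard Duck ,12\ncanada goose,3\n"
clean = standardize(raw)
```

Even a small script like this makes the labor-intensive part visible: the alias table itself has to be built by hand from interviews and old records.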
There have been some challenges along the way. Training the staff responsible for making NWHC data FAIR and easier to work with has been a substantial time investment. The work is labor- and time-intensive, and some datasets do not have any documentation readily available. The current databases in use were built with limited knowledge of database design. Finally, methodology varies across laboratories, field protocols, and individual staff or teams.
The project team shared several takeaways. Moving forward, data collectors need to think through data collection methods and documentation more thoroughly. Some questions a data collector may ask about their process are: Is it FAIR? Are my methods standardized? How is the data collected now, and how will it be collected in the future? Documenting the process and management of data collection and compilation is also important.