
For August's CDI Monthly Meeting, we heard a presentation on integrating short-term climate forecast into a restoration management support tool, and had our first session of the CDI Pop-Up Lab. 

For more information, questions and answers from the presentation, and a recording of the meeting, please visit the CDI wiki. 

News & Announcements 

Look out for Machine Learning Mondays: a weekly course on image analysis using machine learning. Dan Buscombe will offer this course covering image recognition, object recognition, image segmentation, and semi-supervised image classification. The course is targeted at USGS employees and contractors working in satellite and aerial imaging, image analysis, geospatial analysis, and machine learning software development. The course is available only to those with a USGS email, at no charge. Experience with Python and the command line interface is recommended as a prerequisite.  

Integrating short-term climate forecast into a restoration management support tool - Caitlin Andrews, USGS 


Alternate text: slide summarizing the goals of a short-term soil moisture forecaster and resulting example heat maps of the U.S. 

The goal of this FY19 project is to create a link between data and how it can be used in a management context. Climate forecasts are typically spatially or temporally coarse, while managers need temporally fine, site-specific data. For example, the success of seeding and planting relies on the short-term monthly and seasonal climate that occurs immediately after seeding and planting. There is a 90% failure rate for seeding and planting in the western U.S. 

The project facilitates the link between climate data/knowledge and management needs by creating a short-term moisture forecaster application. In the western U.S., water is a limiting factor and drought is a natural part of the ecosystem, expected to be exacerbated further in the coming years. For managers, seeding/planting and drought are connected, and managers need more information on the climate forecast for the period after they seed or plant. Climate knowledge for this use case generates probabilities on whether conditions will be hotter or colder and drier or wetter. This is coarse information that needs translation before managers can use it. 

The SOILWAT2 model is essentially a translation tool: the user provides the model with information on a specific site (climate, vegetation, soil), and the model outputs probabilities on where water moves on a daily basis and measurements of soil moisture at different depths. The National Weather Service provides one prediction for each of 102 regions for a time period, but this multi-month forecast data is very coarse. 

 The application team is currently developing code to synthesize short term climate predictions to a finer temporal and spatial scale in order to derive a better soil moisture model. 

Spatially and temporally refining this data was a challenge. A Jupyter Notebook that details the steps the project team took is available to USGS employees: 

A quick summary of the process: 

  1. Gather a historical record of site-specific data from GridMET (1980-yesterday) 
  2. Generate samples of what the future will look like (30 future realizations) 
  3. Apply each future realization to the years in the historical record. This is how future anomalies are integrated with historical patterns. 
  4. The result is 900 climate futures 
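The cross product in steps 2-4 can be sketched in Python. This is a toy illustration only: the anomaly model, values, and function name are invented, and the real workflow derives its realizations from NWS forecast probabilities and the GridMET historical record.

```python
import random

def make_climate_futures(historical, n_realizations=30, seed=42):
    """Cross every future realization (here, a simple temperature
    anomaly) with every historical year, producing
    n_realizations * len(historical) climate futures.
    """
    rng = random.Random(seed)
    # Step 2: sample future realizations (invented Gaussian anomalies, degC).
    realizations = [rng.gauss(1.0, 0.5) for _ in range(n_realizations)]
    # Step 3: apply each anomaly to each historical year.
    return [
        {"year": year, "anomaly": anom, "temp": temp + anom}
        for anom in realizations
        for year, temp in historical.items()
    ]

# 30 historical years x 30 realizations -> 900 climate futures (step 4).
historical = {1980 + i: 10.0 + 0.1 * i for i in range(30)}
futures = make_climate_futures(historical)
print(len(futures))  # 900
```

The same crossing logic applies regardless of how the realizations are generated; only the sampling step would change.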

This process produces an example output that is explained in detail in the meeting recording (log in to Confluence to access). The application will be integrated into the Land Treatment Exploration Tool (LTET), a Bureau of Land Management and USGS collaboration intended for managers planning restoration projects. 

CDI Pop-up Lab: Q&A with the CDI community 

Alternate text: slide showing the information on cloud-optimized GeoTIFFs summarized below, as well as a map and code snippet. 

Cloud optimized files and new transfer options  - Theo Barnhart and Drew Ignizio 

The CDI project Theo Barnhart is working on this year involves generating a set of continuous basin characteristics for the entire contiguous U.S., resulting in many very large GeoTIFFs. The need arose for a solution with the following characteristics: geospatial format, easy to generate, good compression, stand-alone, and no server to maintain for data access. Cloud-Optimized GeoTIFFs and Rasterio were identified through a trial-and-error process of working through examples using Jupyter Notebooks. 

Drew Ignizio is working on an approach for handling large files from the ScienceBase side. What is a Cloud-Optimized GeoTIFF (COG) and why is it useful? In the previous approach, a user would download a 240 GB file from where it is stored in an S3 bucket, then work with the data locally. With COG, users can avoid downloading data, instead accessing the file in place. COG enables users to publish to a public S3 bucket and connect to the COG through a Jupyter Notebook. COGs can also be read directly from a viewer. 
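The reason a COG can be "accessed in place" is that its internal tiling and index let a client fetch only the byte ranges it needs; in practice, GDAL/rasterio issue HTTP Range requests against the S3 object. The stdlib-only toy below sketches that access pattern with an invented header layout, just to show why a 24-byte read can replace a whole-file download:

```python
import io

def read_tile(fileobj, offset, size):
    """Fetch one tile's bytes without reading the whole file.
    Over HTTP, a COG reader issues the equivalent Range request
    (bytes=offset-(offset+size-1)) against the remote object.
    """
    fileobj.seek(offset)
    return fileobj.read(size)

# Toy "file": a header listing (offset, size) per tile, then tile data.
tiles = [b"tile-A" * 4, b"tile-B" * 4, b"tile-C" * 4]
header, body, base = [], b"", 64  # pretend the header fills 64 bytes
for t in tiles:
    header.append((base + len(body), len(t)))
    body += t
blob = io.BytesIO(b"\x00" * base + body)

# A client wanting only the second tile reads 24 bytes, not the file.
offset, size = header[1]
print(read_tile(blob, offset, size))  # b'tile-Btile-Btile-Btile-B'
```

A real COG reader does the same thing, except the "header" is the GeoTIFF's tile index at the front of the file and the reads go over the network.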

Irregular meshes for data operations - quadtrees  - Thomas Rapstine 

While mapping out ground failure for a project in Alaska, an issue was identified with the diversity and variety of data inputs. The inputs to models can differ in many ways. They can be: 

  • Grids, points, polygons, lines and more 
  • Categorical, physical, or temporal 
  • With their own notion of uncertainty, or not 
  • Pulled from a global or local raster 

How can we structure diverse datasets in a way that enables robust, calculable integration and evaluation? Rapstine proposed using a multi-scale, hierarchical data structure to represent data at varying scales: a representation that allows grids of multiple resolutions to be put together (a quadtree). A quadtree recursively divides a region into squares, so a quadtree mesh (built with the Python package discretize) yields finer representation in the areas that need it. 
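To make the idea concrete, here is a minimal point-region quadtree in plain Python. This is a sketch of the data structure itself, not the discretize package's API: a cell subdivides whenever it holds more than `capacity` points, so dense regions get a finer mesh while sparse regions stay coarse.

```python
class QuadTree:
    """Minimal point-region quadtree for illustration."""

    def __init__(self, x, y, size, capacity=1, depth=0, max_depth=8):
        self.x, self.y, self.size = x, y, size
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.points = []
        self.children = None  # None means this cell is a leaf

    def insert(self, px, py):
        if not (self.x <= px < self.x + self.size and
                self.y <= py < self.y + self.size):
            return False  # point lies outside this cell
        if self.children is None:
            if len(self.points) < self.capacity or self.depth == self.max_depth:
                self.points.append((px, py))
                return True
            self._subdivide()
        return any(c.insert(px, py) for c in self.children)

    def _subdivide(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx, self.y + dy, half, self.capacity,
                     self.depth + 1, self.max_depth)
            for dx in (0, half) for dy in (0, half)
        ]
        # Push existing points down into the new child cells.
        for p in self.points:
            any(c.insert(*p) for c in self.children)
        self.points = []

    def leaf_count(self):
        if self.children is None:
            return 1
        return sum(c.leaf_count() for c in self.children)

tree = QuadTree(0, 0, 1.0)
for p in [(0.1, 0.1), (0.12, 0.11), (0.9, 0.9)]:
    tree.insert(*p)
print(tree.leaf_count())  # more leaf cells cluster around (0.1, 0.1)
```

The two nearby points force repeated subdivision of one corner, while the isolated point leaves the rest of the domain coarse; this is exactly the multi-resolution behavior that makes quadtrees attractive for mixing global rasters with dense local data.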

Questions for the CDI community: 

  1. How are others solving these data integration issues? 
  2. Any other solution recommendations other than quadtrees? 
  3. Thoughts on using quadtrees for solving these challenges? 
  4. Are you using quadtrees? What packages would you recommend? 

See the slides on the wiki for more info, and reach out if you have an answer to these questions or would like further discussion. 

Streamstats - Kitty Kolb 

StreamStats is a USGS website that allows users to delineate watersheds for an area of interest, built to be used by civil engineers to design highway bridges and culverts. Kolb wanted to answer these questions: What's the biggest flood I can expect in a given year? How do we get information on ungaged areas? Answering them requires a GIS system that can calculate things quickly and efficiently. 

StreamStats is built on ArcHydro, SSHydro, and Leaflet. StreamStats provides an image of your watershed and a report, with an option to download the watershed outline and table. StreamStats Training docs and webinars, as well as classes on ArcHydro are useful in learning how to harness this tool. 

Speaking Git 

"At today's Metadata Reviewers meeting, I had the feeling that many of us were discovering that we need to know what these Git terms mean: main branch, fork, issue, tag." 

Some places to start: 

18F: How do I speak Git(hub) 

Git(hub) Glossary 

GS-Software Microsoft Team 

USGS Software Management Website 

All CDI Blog Posts  

CDI collaboration areas bring us focused information and tools to help us work with our data. See all collaboration areas and how to join. 

Slides from Dan Beckman's presentation to the Software Development Cluster, where he discussed the creation of synthetic data for training artificial intelligence algorithms.

Data Management, 8/10 - Department of Interior Records Management Repository and Data Exit Story Time on "the data they left behind"

Lynda Speck and Jim Nagode from the U.S. Bureau of Reclamation presented on their records and document management cloud solution, eERDMS. Tara Bell, Robin Tillitt, and Sue Kemp shared experiences on "Departing Scientists and the Data They Left Behind." Recording and other resources at the wiki meeting page.

DevOps, 8/4 - EPA Data Management and Analytics Platform DevOps

Dave Smith from the Environmental Protection Agency presented on "EPA Data Management and Analytics Platform DevOps." Included in the discussion was - How to get to DevSecOps? (How to add Security to Development and Operations.) "Security as usual breaks DevOps automation." Recording and slides available on the DevOps meeting page.

Fire Science, 8/18 - Department of Interior Wildland Fire Information & Technology Strategy

Roshelle Pederson from the Dept of Interior Office of Wildland Fire presented on the Wildland Fire Information & Technology Strategy. The discussion included the role of USGS research and successful paths to integrate research information, data, and tools in fire management information systems. Join the Fire Science mailing list here.

Metadata Reviewers CoP, 8/3 - Metadata for public release of legacy data

Tara Bell, Matt Arsenault, and Sofia Dabrowski led a discussion on metadata for public release of legacy data for which full documentation is not available.

Risk, 8/11-8/13 - Annual Risk Meeting

The Risk Community of Practice held their Annual Risk Meeting virtually, from August 11-13. The meeting agenda included a keynote on "An evaluation of the risk of SARS-CoV2 transmission from humans to bats" by Mike Runge, a session with the EarthMAP project management team, presentations from FY19 Risk Proposal Awardees, a risk analysis panel discussion, virtual networking, and sessions on engaging diverse stakeholders and tools for virtual stakeholder meetings. To join the Risk Research and Applications Community of Practice, visit

Usability, 8/19 - Human-Centered Approach and Usability

Jamie Albrecht from Impact360 Alliance presented on Inclusive Problem-Solving to Reduce Natural Hazard Impacts & Disaster Risk. Inclusive problem-solving is Impact360’s process to bring together natural hazard researchers and practitioners to solve wicked problems. Several contributing foundational frameworks on the topics of mutual gains, joint fact finding, systems thinking, design thinking, social innovation, and equity-centered community design were introduced for consideration. Notes, slides, and recording are accessible on the meeting page.

Software Dev, 8/27 - Synthetic data and build process for AI imagery and deep learning methods

Dan Beckman presented on "Synthetic data and build process for AI imagery and deep learning methods." He described a solution for the challenge of not having enough training data, using synthetic stand-in data to make the volume of data needed. Dan referenced some code he used from Adam Kelly and here is a related medium post. Read the post to follow up on the statement "I’ve found, from both researching and experimenting, that one of the biggest challenges facing AI researchers today is the lack of correctly annotated data to train their algorithms." Software Development Cluster wiki page.

All CDI Blog Posts 

CDI collaboration areas bring us focused information and tools to help us work with our data. See all collaboration areas and how to join. 

Screenshots from the USGS COVID-19 Case Finder and Viz Palette - two resources discussed at the July Data Viz call.

Artificial Intelligence/Machine Learning,  7/14 - Gage Cam - computer vision for water surface elevation

Daniel Beckman presented on Gage Cam, a low-cost, custom-built wireless web camera paired with a custom deep learning algorithm that allows for a computer vision method to measure water surface elevation (stage). Daniel's slides also cover additional topics including U-Nets, synthetic data, algorithms for text, suggested books on deep learning, and more!

Slides and recording at the AI/ML Meeting Notes page.

Data Management, 7/13 - Collections management Informational Memo and Center-level collection management plans

Lindsay Powers presented on a new Collections Management Instructional Memo (IM CSS 2019-01) and associated website, released last August, providing policy and guidance for the management of scientific working collections.

Brian Buczkowski, from the Woods Hole Coastal and Marine Science Center, presented on Center-level collection management plans, which can help ensure that these samples and specimens continue to have value as assets to the public and scientific community.

Slides and recording can be found on the meeting notes page.

Data Visualization, 7/2 - Kickoff meeting COVID-19 Case Finder

Chuck Hansen from the California Water Science Center presented on the COVID-19 Case Finder, built on Tableau. The app allows a USGS employee planning a trip to get COVID information on their destination, with preloaded USGS facilities and gage sites. A conversation on color maps ensued, sharing tools like Viz Palette, which enables you to import your own color schemes and see what they look like under different types of color vision deficiency.

The Data Visualization group plans to hold quarterly calls. See more at their wiki page.

Fire Science, 7/21 - Climate-fire science synthesis

As fire continued to increase in July, Paul Steblein and Rachel Loehman led the Fire Science Community of Practice call. After a fire update from Paul, Madeleine Rubenstein from the Climate Adaptation Science Centers presented on a workplan to conduct a synthesis of climate-fire science.

Join the Fire Science mailing list here.

Metadata Reviewers, 7/6 - Metadata for software and code

Eric Martinez joined the Metadata Reviewers group to chat about different types of code releases, different options for code repositories at USGS, code.json documentation, and more. He shared some links including the USGS Software Management website and the code.json schema, where controlled vocabularies can be found (search for 'enum' for enumerated lists).

See more notes on the Metadata Reviewers meeting notes page.

Model Catalog Working Group - Scientific model categorization and finding information about USGS models

A working group that is advising on the development of a new USGS Model Catalog was briefed (by email) on the sources used for populating the initial model catalog and asked about categorization of models by type and action. Project updates can be seen on this wiki page. Anyone interested in contributing to the direction of the model catalog can find out more on the working group home page, subscribe to the mailing list, and get in touch with the point of contact, Leslie Hsu.

Risk CoP, 7/16 - Project presentations from the FY19 Risk RFP awardees (Round 2)

Four speakers gave final Risk project presentations on the topics of the global copper supply disruption from earthquakes (Kishore Jaiswal), how scientific research affects policy and earthquake preparedness (Sara McBride), the Hazard Exposure Analyst Tool (HEAT) (Jason Sherba), and ecological forecasts for risk management (Jake Weltzin and Alyssa Rosemartin).

See more at the Risk CoP meeting notes page (sign in as a CDI member to view).

Semantic Web, 7/9 - the Semantic Zoo

A group from the Semantic Web WG discussed the article "The Semantic Zoo - Smart Data Hubs, Knowledge Graphs, and Data Catalogs." This led to a discussion on the basic question of "How do we get data cleaned up so that many different places can use it?"

Usability Resource Review, 7/15 - Mobile UX Design Principles and Best Practices

Sophie Hou posted a resource review on Mobile UX Design Principles and Best Practices. The resource addresses topics like creating a seamless experience across devices, allowing for personalization, good onboarding practices, using established gestures, mobile layout design, focusing on speed, minimizing data input, and more.

See the full review and summary on the resource review wiki page.

All CDI Blog Posts 

For July's CDI Monthly Meeting, we heard two presentations: one on science data management within USGS, and another on the NGA's new mobile and web applications for field data collection! 

For more information, questions and answers from the presentation, and a recording of the meeting, please visit the CDI wiki. 

It's July 2020: Do you know what your data are doing? SAS Science Data Management: Contributing to USGS progress in management of scientific data - Viv Hutchison, USGS

Overview of CSS-SAS-SDM 

Most of us have probably heard of data management, but why should we take care to manage our scientific data well? Good data management increases reproducibility and integrity for Earth science data. As such, it's important that data is FAIR (findable, accessible, interoperable, and reusable) and well-maintained. 

The Science Data Management (SDM) branch within Science Analytics and Synthesis (SAS) leverages tools and expertise for good data management, and encourages community engagement around this topic. SDM has made strides towards better data management and measuring impact of data. 

ScienceBase (SB), an online digital repository for science data, became a trusted digital repository (TDR) in 2017, meeting rigorous standards to attain this status. Many journals require that data accompanying an article be made publicly available, and ScienceBase is an easy way to meet this requirement. The ScienceBase Data Release (SBDR) Tool, which allows scientists to easily complete a data release, connects seamlessly to other USGS tools such as the DOI Tool, IPDS, and the Science Data Catalog (SDC). The SBDR Tool can also be customized to reflect a science center's specific workflow. Currently, 92 USGS science centers use SB for data release. The TDR has seen a steady increase in usage over time and is now accommodating approximately 1,000 data releases per year. The upcoming SBDR Summary Dashboard will share data release metrics by science center, program, region, and mission area. For help with ScienceBase data release, see the instructions page, or contact the team with other questions. 

SDM has strengthened the connection between USGS publications and supporting data. The team worked with the USGS Pubs Warehouse to collect information on known primary publications related to a data release, then added those related publications to the ScienceBase landing page and the DOI Tool. This link has proven useful for letting data authors know how others are using their data, and for understanding some impacts of the data. In a similar vein, SDM uses xDD (previously GeoDeepDive) to track references to USGS data DOIs, with plans to display these data citations on ScienceBase landing pages in the future.  

Citation of data is an emerging practice, but many data releases in ScienceBase have seen multiple instances of reuse in subsequent scientific research. For example, data for a Geospatial Fabric data release has been cited or reused by seventeen publications. Another data release on the global distribution of minerals was cited in U.S. public policy on critical minerals. 

Other projects in the works are aimed at analyzing USGS data for reuse. A recent "state of the data" project, consisting of analyzing 165 random data release samples against several established data maturity matrices, aims to determine how mature and FAIR USGS data contained in ScienceBase is, and to document an assessment methodology that is scalable to other bureau data repositories and platforms. 

SDM has undertaken initiatives in the past few years to make data easier to work with, access, and publish. USGS Data Tools, a Python wrapper around a set of system APIs, is one such effort. It creates a bridge between various systems (DOI Tool, Pubs Warehouse, BASIS+, Metadata Parser, SBDR), making data management easier and more intuitive. Other systems have also recently gained connections; for example, the SBDR Tool now contains an option to auto-fill information from IPDS. 

The USGS Model Catalog is another recent project spearheaded by SDM. The goals for the Catalog are to increase discovery and awareness of scientific models and link models to their related literature, code, data, and other resources. The Model Catalog effort is informing practices that will allow the latest information for models to be dynamically updated. CDI is currently assisting by gathering input from modelers across the Bureau; contact Leslie Hsu for more information. 

So, in sum, what are your data doing? 

  1. Your data are contributing to USGS successes in open and FAIR data 
  2. Citations to your data are being tracked 
  3. Data are connected between ScienceBase and your publication in Pubs Warehouse 
  4. Your data are accessible to the scientific community! 

For more information, see the USGS Data Management Website. 

Field Data Collection using NGA's free and open source Mobile Awareness GEOINT Environment (MAGE) and MapCache Mobile Apps - Ben Foster and Justin Connelly, National Geospatial-Intelligence Agency 

Overview of the MAGE mobile application. 

The MAGE and MapCache mobile applications are open source field data collection apps designed to reach a wide audience, even those without GIS experience. The GitHub repository for these applications can be found here: 

Please see the meeting recording for a live demonstration. Some highlights from the demonstration: 

  1. The mobile application allows users to upload geo-located video and photo observations. 
  2. The mobile app also allows creation of lines and polygons. 
  3. Information added will be visible to other team members on the app. 
  4. The web application has a similar interface, but with more robust features. 

To join the CDI Event on MAGE (for government employees): (1) Request an account from NGA's Protected Internet Exchange (PiX), using your government email address for registration. (2) Once you have an account on PiX, send an email requesting to be added to the "MAGE USGS CDI" event.

All CDI Blog Posts 

Artificial Intelligence/Machine Learning, 6/9 - Tallgrass Supercomputer for AI/ML

Natalya Rapstine presented "USGS Tallgrass Supercomputer 101 for AI/ML," an overview of the new USGS Tallgrass supercomputer designed to support machine learning and deep learning workflows at scale, and deep learning software and tools for data science workflows. Natalya's slides covered the software stack that supports Deep Learning, including PyTorch, Keras, and TensorFlow. She then illustrated the capabilities with the "Hello World!" example of Deep Learning - the MNIST Database of Handwritten Digits.

See many more links to resources in the Slides and recording available at AI/ML Meeting Notes

ASCII art for the Tallgrass supercomputer

Data Management, 6/8 - Data Curation Network - extending the research data toolkit

Guests from the Data Curation Network, Lisa Johnston, Wendy Kozlowski, Hannah Hadley, and Liza Coburn presented on their recent work.

CURATED stands for: Check files/code; Understand the data; Request missing info or changes; Augment metadata; Transform file formats; Evaluate for FAIRness; Document curation activities.

Checklists and primers related to these topics for specific file formats are available at:

Also of interest is an Excel Archival Tool, which programmatically converts Microsoft Excel files into open-source formats suitable for long-term archival, including .csv, .png, .txt, and .html:

Data Curation Network infographic at

DevOps, 6/2 - Elevation data processing at scale

Josh Trahern, Project Manager of the NGTOC Elevation Systems Team, led a discussion titled "Elevation Data Processing At Scale - Deploying Open Source GeoTools Using Docker, Kubernetes and Jenkins CI/CD Pipelines."

The presentation highlighted the Lev8 (pronounced "elevate," and doing petabyte-scale processing of DEMs) and QCR web applications, produced by the Elevation team. These tools are used by Production Ops to generate the National Elevation Dataset (NED). The NED is a compilation of data from a variety of existing high-precision sources, such as lidar data, contour maps, the USGS DEM collection, and SRTM, combined into a seamless dataset designed to cover all U.S. territory continuously.

The team is moving away from proprietary software and taking ownership of the code base, to avoid trying to fit a square peg into a round hole. They are working toward 100% automation and 100% documentation and moving to a Linux environment, all while the system remains operational.

See the recording on the DevOps Meetings page.

Fire Science, 6/16 - NLCD Rangeland Fractional Component Time-Series: Development and Applications

Fire Update: Paul Steblein gave a Fire Update and Matthew Rigge, (EROS) – presented on "NLCD Rangeland Fractional Component Time-Series: Development and Applications."

The Fire Science coordinators and CDI staff are working on syncing content on the internal OneDrive and the CDI wiki, contact Paul at if you have any questions about the group.

Metadata Reviewers, 6/1 - What is important to metadata reviewers

The Metadata Reviewers group had a discussion about what matters to them when reviewing metadata. Some themes were making USGS data as findable and reusable as possible, avoiding unnecessary complexity, and making metadata easier to write.

See more notes on the discussion at their Meetings wiki page.

Open Innovation, 6/18 and 6/19 - Paperwork reduction and Community-based water quality monitoring

On June 18, the topic was "Tackling the Paperwork Reduction Act (PRA) in the Age of Social Media and Web-based Interactive Technology." Three Information Collection Clearance Officers from DOI (Jeff Parrillo), USGS (James Sayer), and FWS (Madonna Baucum) explained the basics of the Paperwork Reduction Act (PRA), discussed how the PRA applies to crowdsourcing, citizen science, and prize competition activities, and participated in a Q&A discussion with the audience. More information on the Open Innovation wiki.

On June 19, Ryan Toohey and Nicole Herman-Mercer presented on "Indigenous Observation Network (ION): Community-Based Water Quality Monitoring Project." ION, a community-based project, was initiated by the Yukon River Inter-Tribal Watershed Council (YRITWC) and USGS. Capitalizing on existing USGS monitoring and research infrastructure and supplementing USGS collected data, ION investigates changes in surface water geochemistry and active layer dynamics throughout the Yukon River Basin. More information on the Open Innovation wiki.

Risk, 6/18 - Funded Project Reports

This was "round 1" of final project presentations from the FY19 Risk RFP awardees. Please see the list below for presenters - each one is about 10-12 minutes in length. PIs from each project provided a project overview, a description of their team, accomplishments, deliverables, and lessons learned.

  • Quantifying Rock Fall Hazard and Risk to Roadways in National Parks: Yosemite National Park Pilot Project, Brian Collins, Geology, Minerals, Energy, and Geophysics Science Center
  • The State of Our Coasts: Coastal Change Hazards Stakeholder Engagement & User Need Assessment, Juliette Finzi-Hart, Pacific Coastal and Marine Science Center
  • Re-visiting Bsal risk: how 3 years of pathogen surveillance, research, and regulatory action change our understanding of invasion risk of the exotic amphibian pathogen Batrochochytrium salamandrivorans, Dan Grear, National Wildlife Health Center
  • Communications of risk - uranium in groundwater in northeastern Washington state, Sue Kahle, Washington Water Science Center

See more at the Risk community of practice wiki page.

Tech Stack, 6/11 - ESIP Collaboration Infrastructure 2.0

In June the joint Tech Stack and ESIP IT&I meeting hosted three presentations

Ike Hecht (WikiWorks) on the ESIP wiki: upgrading the ESIP wiki's MediaWiki from v1.19 to v1.34.

Lucas Cioffi (QiQoChat lead developer) on the technical side of QiQoChat: utilizing QiqoChat to bring together asynchronous workspaces with virtual conferences and meetings.

Sheila Rabun (ORCID US Community Specialist) on the ORCID API: becoming an ORCID member to gain access to ORCID API keys and integrate ORCID authentication into the wiki.

See more at the IT&I meetings page

Software Dev, 6/25 - Serverless!

Carl Schroedl presented on "Using Serverless and GitLab CI/CD to Continuously Deliver AWS Step Functions." See:

Notes and more links:

All CDI Blog Posts 

Continuing our exploration of 2019's CDI funded projects, June's monthly meeting included updates on projects involving extending ScienceBase's current capabilities to aid disaster risk reduction, coupling hydrologic models with data services, and standardizing and making available 40 years of biosurveillance data. 

For more information, questions and answers from the presentation, and a recording of the meeting, please visit the CDI wiki. 

Screenshot of a beta version of ScienceBase where an option to publish all files to ScienceBase appears.

Extending ScienceBase for Disaster Risk Reduction - Joe Bard, USGS 

The Kilauea volcano eruption in 2018 revealed a need for near real-time data updates for emergency response efforts. During the eruption, Bard and his team created lava flow update maps to inform decision-making, using email to share data updates. This method proved to be flawed, causing issues with versioning of data files and limitations on sharing with all team members at the same time. 

ScienceBase has emerged as an alternative way to share data for use by emergency response workers. When GIS data is uploaded to ScienceBase, web services are automatically created. Web services are software that facilitates computer-to-computer interaction over a network. Users don't need to download data to access it; instead it can be accessed programmatically. Additionally, data updates can be propagated automatically through web services, avoiding versioning issues. However, use of ScienceBase during the Kilauea volcano crisis met unforeseen reliability issues related to hosting on the USGS server and an overload of simultaneous connections. 

This project explores a cloud-based instance of Geoserver on the AWS S3 platform wherein the user can publish geospatial services to this cloud-based server. This method is more resilient to simultaneous connections and takes into account load-balancing and auto-scaling. It also opens the possibility of dedicated Geoserver instances based on a team's needs. ScienceBase is currently working on a function to publish data directly to S3. 
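As a sketch of what publishing geospatial services enables, the snippet below builds a standard OGC WFS 2.0 GetFeature request that a script or web map could fetch instead of downloading a full dataset. The GeoServer host and layer name are invented for illustration; they are not the project's actual endpoints.

```python
from urllib.parse import urlencode

# Hypothetical GeoServer endpoint and layer name, for illustration only.
GEOSERVER = "https://example-geoserver.usgs.gov/geoserver/wfs"

def wfs_getfeature_url(layer, bbox, srs="EPSG:4326"):
    """Build an OGC WFS 2.0 GetFeature request for one layer,
    clipped to a bounding box and returned as GeoJSON.
    """
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": layer,
        "outputFormat": "application/json",
        "bbox": ",".join(str(v) for v in bbox) + "," + srs,
    }
    return GEOSERVER + "?" + urlencode(params)

# Example: request a (hypothetical) lava-flow layer over part of Hawaii.
url = wfs_getfeature_url("hazards:lava_flow_extent",
                         (-155.6, 19.3, -154.8, 19.7))
print(url)
```

Because the service answers each such request on demand, every client always sees the latest data, which is the versioning benefit described above.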

A related Python tool for downloading data from the internet and posting on ScienceBase using ASH3D as an example is available on GitLab for USGS users.

Next steps for this project include finalizing cloud hosting service deployment and configuration settings, checking load balancing and quantifying performance, exploring set-up of multiple Geoserver instances in the cloud, evaluating load balancing technologies (e.g., Cloudfront), and ensuring all workflows are possible using a SB Python library. 

Presentation slide explaining the concept of a modeling sandbox.

Coupling Hydrologic Models with Data Services in an Interoperable Modeling Framework - Rich McDonald, USGS  

Integrated modeling is an important component of USGS priority plans. The goal of this project is to use an existing and mature modeling framework to test a Modeling and Prediction Collaborative Environment "sandbox" that can be used to couple hydrology and other environmental simulation models with data and analyses. 

Modeling frameworks are founded on the idea of component models. Model components encapsulate a set of related functions into a usable form. For example, going through a Basic Model Interface (BMI) means that no matter what the underlying language is, the model component can be made available as a Python component. 

To test the CSDMS modeling framework, the team took the PRMS (Precipitation-Runoff Modeling System) modeling system and broke it down into its 4 reservoirs (surface, soil, groundwater, and streamflow) and wrapped them in a BMI. They then re-coupled them back together. The expectation is that the user could then couple PRMS with other models. 
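
The idea of wrapping components in a BMI and re-coupling them can be sketched in Python. This is not the actual PRMS code; the two toy "reservoirs" and their variable names are invented for illustration, but the initialize/update/get_value/set_value methods mirror the core of the real Basic Model Interface.

```python
class SurfaceComponent:
    """Toy 'surface reservoir' exposing a minimal BMI-like interface."""
    def initialize(self, config=None):
        self.runoff = 0.0
    def update(self):
        self.runoff += 1.5          # stand-in for real surface physics
    def get_value(self, name):
        assert name == "runoff"
        return self.runoff

class SoilComponent:
    """Toy 'soil reservoir' that accepts inflow from another component."""
    def initialize(self, config=None):
        self.moisture = 10.0
    def set_value(self, name, value):
        assert name == "inflow"
        self.inflow = value
    def update(self):
        self.moisture += 0.5 * self.inflow   # stand-in for infiltration

# Coupling: the framework moves values between components each time step,
# regardless of what language each component is implemented in underneath.
surface, soil = SurfaceComponent(), SoilComponent()
surface.initialize(); soil.initialize()
for _ in range(3):
    surface.update()
    soil.set_value("inflow", surface.get_value("runoff"))
    soil.update()
print(round(soil.moisture, 2))
```

The same pattern is what lets a Fortran model like PRMS appear as a Python component: only the BMI methods are exposed, so the caller never touches the underlying language.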

See the meeting recording for a demonstration of the tool. You may note the model run-time interaction during the demo. You'll also see that PRMS is written in Fortran but is being run from Python. Code for this project is available on GitHub. 

Presentation slide of the interface of the Wildlife Health Information Sharing Partnership event reporting system, abbreviated WHISPers

Transforming Biosurveillance by Standardizing and Serving 40 Years of Wildlife Disease Data - Neil Baertlein, USGS 

Did you know that over 70% of emerging infectious diseases originate in wildlife? The National Wildlife Health Center (NWHC) has been dedicated to wildlife health since 1975. Biosurveillance efforts the NWHC has been involved in include lead poisoning, West Nile virus, avian influenza, white-nose syndrome, and SARS-CoV-2. 

NWHC has become a major data repository for wildlife health data. To manage this data, WHISPers (Wildlife Health Information Sharing Partnership event reporting system) and LIMS (laboratory information management system) are utilized. WHISPers is a portal for biosurveillance data in which events are lab verified and the portal allows collaboration with various state and federal partners, as well as some international partners, such as Canada. 

There is a need to leverage NWHC data to inform the public, scientists, and decision makers, but substantial barriers stand in the way of this goal: 

  1. Data is not FAIR (findable, accessible, interoperable, and reusable) 
  2. There are nearly 200 datasets in use 
  3. Data is not easy to find 
  4. Data exists in various file formats 
  5. There is limited to no documentation for data 

As a result, this project has formulated a five-step process for making NWHC data FAIR: 

  1. Definition: create a definition for each dataset. 
    • NWHC created a template that captures information such as the users responsible for the data, the file type of the data, and where the data is stored. A data dictionary was also created. 
  2. Classification: provide meaning and context for data. 
    • In this step, NWHC classifies relationships with other datasets and databases, and identifies inconsistencies in the data. 
  3. Prioritization: identify high-priority datasets. 
    • High-priority datasets are ones that NWHC needs to continue using down the road or that are currently high-impact. Non-priority datasets can be archived. 
  4. Cleansing: the next step for high-priority datasets. 
    • Includes fixing data errors and standardizing data. 
  5. Migrating: map and migrate the cleansed data. 
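
A minimal way to picture the first three steps is a small inventory record per dataset. The fields below are guesses at the kind of information such a template might capture, not NWHC's actual template:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    # Step 1 (Definition): who owns it, what format, where it lives
    name: str
    steward: str
    file_type: str
    location: str
    # Step 2 (Classification): context and links to related data
    related_datasets: list = field(default_factory=list)
    # Step 3 (Prioritization): drives whether it is cleansed or archived
    high_priority: bool = False

def triage(records):
    """Split an inventory into datasets to cleanse vs. datasets to archive."""
    cleanse = [r for r in records if r.high_priority]
    archive = [r for r in records if not r.high_priority]
    return cleanse, archive

# Hypothetical inventory entries:
inventory = [
    DatasetRecord("avian_flu_cases", "lab A", "csv", "share/flu", high_priority=True),
    DatasetRecord("legacy_necropsy", "lab B", "xls", "share/old"),
]
to_cleanse, to_archive = triage(inventory)
print(len(to_cleanse), len(to_archive))
```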

To put this five-step process into effect, NWHC hired two dedicated student service contractors to work on the project. Interviews with lab technicians, scientists, and principal investigators were conducted to gather input and identify high-priority datasets. Dedicated staff also documented datasets, organized said documentation, and began cleansing high-priority datasets by fixing errors and standardizing data. At the time of this presentation, 130 datasets were ready for archiving and cleansing. 

There have been some challenges faced during this process so far. Training the staff responsible for making NWHC data FAIR and easier to work with has been a substantial time investment. The work is labor- and time-intensive, and some datasets do not have any documentation readily available. The current databases in use were built with limited knowledge of database design. Finally, there are variations in laboratory methodology and field methodology, both between individuals and between teams. 

The project team is able to share several takeaways. Moving forward, data collectors need to think through data collection methods and documentation more thoroughly. Some questions a data collector may ask about their process are: Is it FAIR? Are my methods standardized? How is the data collected now, and how will it be collected in the future? Documenting the process and management of data collection and compilation is also important. 

All CDI Blog Posts   

CDI's May monthly meeting included updates on CDI projects focusing on FAIR data, a grassland productivity forecast, and animal movement visualization. 

For more information, questions and answers from the presentation, and a recording of the meeting, please visit the CDI wiki. 

Building a Roadmap for Making Data FAIR in the U.S. Geological Survey, Fran Lightsom, USGS 

Fran Lightsom presented on the process of building a roadmap for making USGS data FAIR. FAIR stands for Findable, Accessible, Interoperable and Reusable and has become a popular way for organizations to improve the value and usefulness of data products. 

To begin building a roadmap for FAIR data, the project team conducted a survey of data producers, collected use cases of projects that integrate data, hosted a workshop on September 9-11, 2019, and drafted a report and list of recommendations. The workshop produced about 100 discrete recommendations, with 14 deemed essential, 38 important, and 44 useful. 

Some broad thoughts that came out of the workshop included the assertion that open science requires extension of FAIR beyond data to samples, methods, software, and tools; a less-explored application of FAIR. Implementing recommendations would be the responsibility of many groups, and would require input from representatives of these groups. There may be a place for CDI to step in and coordinate in the future, as this effort continues. 

Further objectives coming out of this effort include increasing use of globally unique persistent identifiers (especially for physical samples and software), developing policy, researching best practices, creating support tools, enabling creation of digital products that are interoperable and usable by making use of existing standards, and improving interoperability through coordinated creation of shared vocabularies and ontologies. 

An opportunity for CDI to view and provide feedback for the FAIR roadmap is upcoming. 

Implementing a Grassland Productivity Forecast Tool for the U.S. Southwest, Sasha Reed, USGS 

Grass-Cast is a CDI-funded project that is focused on producing near-term forecasts of grassland productivity for the U.S. Southwest. The goal of the project is to bring together different kinds of data in order to provide upcoming growing season forecasts, updated every 2 weeks. This work started in the Great Plains to provide information about seasonal outlooks to ranchers. 

So, why are grasslands important? Grasslands provide a critical amount of ecosystem services. They are one of the largest single providers of agro-ecological services in the U.S., and they supply important habitat and food for wildlife. Productivity of grasslands helps to determine fire regimes and how much carbon moves from the atmosphere into the grass and soil. Dust reduction and air quality problems can also be considered from a grassland productivity perspective. 

Near-term productivity forecasts for grasslands can provide information to stakeholders on cattle stocking rates, where and how to allocate resources towards fire management, and rates of carbon sequestration. Grasslands are notably responsive to subtle changes in the environment and climate, and thus, they vary from year to year, making productivity predictions difficult. 


The diagram above outlines the process that informs Grass-Cast for the Great Plains, but the project team wants to expand to include the Southwest region. The Southwest region differs from the Great Plains in that it does not have the same homogeneous coverage of grasses, meaning that bare ground is often exposed, complicating the interpretation of remotely sensed data. The Southwest also has a more varied mix of vegetation types, including cacti and shrubs, which needs to be differentiated from grass cover. 

The Grass-Cast team aimed to keep the overarching process used in the Great Plains Grass-Cast but adjust the methods to work effectively in the Southwest. First, the team evaluated different satellite indices for estimating grassland productivity, in the hopes that some might better address the challenges of the Southwest. They found that the previously utilized NDVI (normalized difference vegetation index) greenness index worked well in many places in the Southwest, but not as well in others. 

These results supported trying newer remote sensing approaches that don't rely on a greenness index, such as SIF (solar-induced fluorescence). SIF is a different way of looking at plant activity that uses plant physiology to monitor how electrons move through the photosynthetic chain. The Southwest differs from the Great Plains in that the dry environment means plants can be green but not very active, making the relationship between greenness and productivity more challenging. Additionally, many Southwestern grasslands have two growing seasons, spring and summer, which presents a temporal challenge. Other remote sensing methods examined were NIRv (near-infrared reflectance of vegetation), a greenness index that hones in specifically on the green parts of remotely sensed pixels, and SATVI (soil-adjusted total vegetation index), which takes soil brightness into account. 
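
For reference, the indices discussed above are simple band arithmetic on surface reflectances. NDVI and NIRv below are the standard formulations; the SATVI expression follows the commonly cited soil-adjusted form and should be checked against the exact variant the team used.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index."""
    return (nir - red) / (nir + red)

def nirv(nir, red):
    """Near-infrared reflectance of vegetation: NDVI scaled by NIR reflectance."""
    return ndvi(nir, red) * nir

def satvi(red, swir1, swir2, L=0.5):
    """Soil-adjusted total vegetation index (commonly cited form; verify
    against the exact variant used in the study)."""
    return ((swir1 - red) / (swir1 + red + L)) * (1 + L) - swir2 / 2

# Example with plausible (invented) surface reflectances:
print(round(ndvi(0.5, 0.1), 3), round(nirv(0.5, 0.1), 3))
```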

The team compared results from these different indices using eddy covariance data and found that neither SIF nor NDVI provided good results. However, NIRv and SATVI did a good job of predicting grassland productivity for the Southwest, and there is some promise in SIF as a proxy for capturing the timing of the growing season. 

Grass-Cast now plans to incorporate data for the Southwest (Arizona and New Mexico) into the current tool. Ultimately, the team wants to integrate across these different methods and go beyond Arizona and New Mexico. There is a lot of room for collaboration; stay tuned for upcoming workshops and seminars. 

Grass-Cast is available here. 

A generic web application to visualize and understand movements of tagged animals, Ben Letcher, USGS 

Tracking and tagging data on individual animals provides key information about movements, habitat use, interactions and population dynamics, and there is a lot of this type of data currently available. For example, the Movebank database currently has 2 billion observations. Tracking data is expensive and requires time and effort to collect; TAME (tagged animal movement explorer) aims to help maximize the value of this data and make it easier to interact with these complex data. 

TAME is a data exploration tool in the form of a web application, based on open source libraries. The TAME team's goal is to make TAME as easy to use as possible, and to allow for interaction and exploration of tagging data. Currently, TAME features include: 

  1. Four introduction videos 
  2. A user account system where users can upload their own data, with an option to publish and/or share 
  3. Ability to map observations to color, size, or outline 
  4. Ability to select individuals or select by area, with multiple area selections available 
  5. Ability to cross filter where users can filter any one variable, or multiple variables, and output a movie/time series of the data. 
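
Cross filtering (item 5) amounts to applying an independent predicate per variable and keeping only the rows that pass all of them. This is a toy version over tagged-animal observations, with made-up fields and values, not TAME's actual implementation:

```python
# Hypothetical tagged-animal observations.
observations = [
    {"tag": "A01", "species": "brook trout", "length_mm": 150, "month": 5},
    {"tag": "A02", "species": "brook trout", "length_mm": 210, "month": 7},
    {"tag": "B07", "species": "atlantic salmon", "length_mm": 310, "month": 5},
]

def cross_filter(rows, **predicates):
    """Keep rows satisfying every active filter; each keyword names one variable."""
    return [
        r for r in rows
        if all(test(r[var]) for var, test in predicates.items())
    ]

# Filter on two variables at once: species and month.
spring_trout = cross_filter(
    observations,
    species=lambda s: s == "brook trout",
    month=lambda m: m in (4, 5, 6),
)
print([r["tag"] for r in spring_trout])
```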

 Screenshot of a slide showing features of the web application TAME

See the monthly meeting recording on the wiki page for a live demonstration of TAME, or explore for yourself on the TAME website. 

Ben Letcher is excited to explore a podcast or video series centered on animal movement stories – please reach out to him if you have experience in this area! 

All CDI Blog Posts  

Highlight images from the May 2020 Collaboration Area topics, from left to right: The User Experience Honeycomb (source) (Usability), interfacing with hydrologic data with HydroShare (Tech Stack), machine learning Train and Tune steps covered by SageMaker (AIML). 

CDI collaboration areas bring us focused information and tools to help us work with our data. Do you have an idea for a topic that you want to learn about or present to a group? Get in touch with us to coordinate! - Leslie

Artificial Intelligence / Machine Learning, 5/12 - SageMaker for machine learning models 

Amazon Web Services personnel and USGS scientists presented on SageMaker and an example of its use at the USGS Volcano Science Center. SageMaker provides the ability to build, train, and deploy machine learning models quickly. Phil Dawson of USGS showed an application to the continuous seismic data that is collected at all USGS volcanic observatories, and how to apply the models even though "every volcano speaks a different dialect" (the seismic energy looks different).  

The recording is posted at the meeting wiki page. 

Data Management, 5/11 - records management for electronic records 

Chris Bartlett presented on how records management is moving more aggressively to electronic records management, and the ripple of changes this brings. She discussed what this means in relation to our records (including data), our processes, and expectations. 

Slides and recording are posted at the meeting wiki page. 

Fire Science, 5/19 - scaling up tree-ring fire history 

The Fire Science Community of Practice heard the monthly fire update, discussion about fire science communications, and a science presentation from Ellis Margolis on Scaling up tree-ring fire history: from trees to the continent and seasons to centuries. 

Contact Paul Steblein or Rachel Loehman for more information. Future meeting dates are listed on the Fire Science wiki page. 

Metadata Reviewers, 5/4 - data publication versus research publication 

The group discussed the question "What type of information (in the metadata) is necessary for a data publication vs. a research publication?" In addition, links were shared about an ongoing discussion on metadata for software and code. 

See more notes on the discussion at their Meetings wiki page. 

Risk, 5/12 - communicating hazard and risk science 

The Risk community of practice hosted a panel discussion on communicating hazard and risk science. The speakers were Sara McBride (USGS), Kerry Milch (Temple University), and Nanciann Regalado (Dept of Interior, US Fish and Wildlife Service). Each speaker shared news on some of their recent projects and lessons learned on the job. Projects discussed included ShakeAlert and aftershock forecasts, the USGS circular "Communicating Hazards – A Social Science Review to Meet U.S. Geological Survey Needs", and the Deepwater Horizon Oil Spill Natural Resource Damage Assessment Trustee Council.  

See more at the Risk community of practice wiki page. 

Semantic Web, 5/14 - concept maps for modeling traceable adaptation, mitigation, and response plans 

Brian Wee presented on an experiment to use concept maps for documenting science-informed, data-driven workflows for climate-related adaptation, mitigation, and response planning. The ESIP wiki page on the concept map repository describes how concept maps can be used to describe your own data-to-decisions narrative, as a just-in-time (i.e. as needed) educational resource, to provide context awareness about where you fit in the big picture, and to experiment with ideas for context-aware knowledge discovery. 

See a link to the slides and recording at the Semantic Web meetings page. 

Software Dev, 5/28 - data warehousing and ETL pipelines 

May's topic was data warehousing and ETL (Extract, Transform, Load) pipelines. Cassandra Ladino presented on the use of Amazon Web Services (AWS) Redshift Data Warehouse as applied to the USGS Configuration Management Committee. Jeremy Newson presented on ETL pipelines using AWS Glue. 
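
Whatever the hosting (AWS Glue, Redshift, or otherwise), an ETL pipeline boils down to the three stages its name lists. A minimal in-memory sketch, with invented record fields:

```python
def extract():
    """Pull raw records from a source system (hard-coded here for illustration)."""
    return ["2020-05-01,gauge_7,1.34", "2020-05-01,gauge_9,"]

def transform(raw_rows):
    """Parse, clean, and drop records that fail validation."""
    out = []
    for row in raw_rows:
        date, site, value = row.split(",")
        if value:                      # skip rows with missing measurements
            out.append({"date": date, "site": site, "value": float(value)})
    return out

def load(records, warehouse):
    """Append cleaned records to the destination table."""
    warehouse.setdefault("measurements", []).extend(records)

warehouse = {}
load(transform(extract()), warehouse)
print(len(warehouse["measurements"]))
```

In a managed service like Glue, each stage maps onto a job step and the "warehouse" is a real table, but the shape of the pipeline is the same.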

See more at the Software Dev wiki meetings page. 

Tech Stack, 5/14 - HydroShare for sharing hydrologic resources 

The joint CDI Tech Stack and ESIP IT&I Tech Dive hosted a presentation on CUAHSI HydroShare by Jerad Bales, Anthony Castronova, and Jeff Horsburgh. HydroShare is a platform for sharing hydrologic resources (data, models, model instances, geographic coverages, etc.), enabling the scientific community to more easily and freely share products, including the data, models, and workflow scripts used to create scientific publications. 

Slides and recording on the joint CDI Tech Stack and ESIP IT&I webinars on the ESIP page. 

Usability, 5/20 - usability and trust 

A resource review was posted on the topic of how usability and interface influence user experience, including credibility and use. "The resource highlights that user interface and credibility influence user experience because design elements can impact whether users trust and believe what is being presented or delivered to them." 

See more of the group's activity and resources on the Usability wiki page. 

More CDI Blog Posts 

The CDI Collaboration Areas are keeping me busy. You can get to all of these groups and sign up for mailing lists on the CDI Collaboration Area wiki page.

From upper left corner, clockwise: DevOps: image from Tidelift website; SoftwareDev: logo for uvicorn; Risk: Impact360 worksheet; AI/ML: image from AI/ML DELTA presentation; Semantic Web: image from Garijo and Poveda-Villalón; Open Innovation: image from OI wiki page; Tech Stack: image from Unidata gateway webpage; Usability: image from Sayer's Paperwork Reduction Act presentation

4/6 Metadata Reviewers - revision or release information in titles

In April the Metadata Reviewers group dove into a question about including the date of a revision or release in the title of the data release. Doing so would help to distinguish between different versions of a data release. After much discussion the group concluded that two metadata records should not have the same title in their citation elements.

See more notes on the discussion at their Meetings wiki page.

4/7 DevOps - managed open source with Tidelift

The DevOps group heard a presentation from Tidelift. Tidelift partners with open source maintainers in order to support application development teams. This saves time and reduces risk when using open source packages to build applications.

See the recording and slides on the DevOps Meeting page. If you are interested in using Tidelift for a USGS application, get in touch with Derek Masaki. If you'd like a presentation from Tidelift, contact Melanie Gonglach.

4/9 Semantic Web - implementing FAIR vocabularies and ontologies

The group discussed "Best Practices for Implementing FAIR Vocabularies and Ontologies on the Web" by Daniel Garijo and María Poveda-Villalón. The discussion focused on sections 2 and 3 of the paper, URIs (uniform resource identifiers) and Documentation. The group recognized that implementation of the best practices in the paper (for example, stable, permanent identifiers) would depend not only on semantic specialists, but also on those who set policy for the USGS network. This point was communicated to the group that is working on enabling FAIR practices in the USGS.

See more at the Semantic Web meetings page.

4/9 Tech Stack - Unidata Science Gateway

Julien Chastang presented on the Unidata Science Gateway. Unidata is exploring cloud computing technologies in the context of accessing, analyzing, and visualizing geoscience data. From the abstract: "With the aid of open-source cloud computing projects such as OpenStack, Docker, and JupyterHub, we deploy a variety of scientific computing resources on Jetstream for our scientific community. These systems can be leveraged with data-proximate Jupyter notebooks, and remote visualization clients such as the Unidata Integrated Data Viewer (IDV) and AWIPS CAVE."

Slides and recording on the joint CDI Tech Stack and ESIP IT&I webinars on the ESIP page.

4/13 CDI Data Management - changes to the USGS Science Data Catalog

Lisa Zolly presented on changes coming with the USGS Science Data Catalog version 3. Today, the Science Data Catalog has more than 21,000 metadata records. To serve its human and machine stakeholders, a number of changes are planned to address the changing landscape of federal data policy, the substantial growth of the catalog, improvement of workflows, improvement of usability, and more robust reporting and metrics.

Slides and recording are posted at the meeting wiki page.

4/14 Artificial Intelligence / Machine Learning - fine scale mapping of water features at the national scale

Jack Eggleston (USGS), John Stock (USGS), and Michael Furlong (NASA) presented on "Fine scale mapping of water features at the national scale using machine learning analysis of high-resolution satellite images: Application of the new AI-ML natural resource software - DELTA." The availability of high-resolution satellite imagery, combined with machine learning analysis to rapidly process the satellite imagery, provides the USGS with a new capability to map natural resources at the national scale.

The recording is posted at the meeting wiki page.

4/15 Usability - how the Paperwork Reduction Act affects usability studies

James Sayer presented on the Paperwork Reduction Act (PRA) and Usability Testing. The PRA is designed to protect the public from inappropriate data collection. All agencies have their own PRA procedures, so implementation in other agencies won't necessarily translate to USGS implementation. James reviewed Fast Track procedures and exclusions. His advice included to start early in thinking about PRA in your usability work, and to talk to your ICCO (Information Collection Clearance Officer) if you have any questions.

The slides, notes, and recording are posted on the meeting wiki page. Do you have more questions? Contact James.

4/16 Risk - Product evaluation/testing and integrating solutions into strategy

The Risk Community of Practice April meeting was part 3 of a series of training webinars provided by Impact360 Alliance on human-centered design thinking and inclusive problem solving. Emphasis was given to the tools for product evaluation/testing ("[Re]Solve") and integrating solutions into strategy ("[Re]Integrate"). Worksheets were provided to "Create and Test a Solution in Three Acts." A follow-up session on April 23 discussed examples of the worksheets.

Access the slides and recording, and handouts at the Risk Meetings page (must log in as a CDI member, join here if you're not a member yet).

4/17 Ignite Open Innovation - Open Innovation and COVID-19

April was Citizen Science Month! At the Open Innovation meeting, Sophia B Liu (USGS Open Innovation Lead) provided an overview of the various open innovation efforts inside and outside of government that have emerged in response to COVID-19. She also discussed The Opportunity Project Earth Sprint and proposed Problem Statements.

See more information and list of COVID-19 sites at the meeting wiki page.

4/21 Fire Science - stakeholder input on USGS Fire Science

James Meldrum and Ned Molder of the USGS Fort Collins Science Center presented on an analysis of stakeholder input on USGS fire science communication and outreach, science priorities, and critical science needs. The group also heard updates on the USGS Fire Science strategy and recent fire activity, and held a discussion on "How is COVID-19 affecting your fire science?"

Contact Paul Steblein or Rachel Loehman for more information.

4/23 Software Dev - FastAPI

The Software Dev cluster had Brandon Serna and Jeremy Fee present about their work using FastAPI with some comparisons to Flask. I am not a developer so I will summarize by pasting some links, tag lines, and interesting things I heard.

Recommended resources.

I'm going to take a little bit of space to list some of the things I Googled while listening to this call, because to me these descriptions (and some of the logos) are fascinating. It would be fun to do a tagline-logo-name matching game.

  1. FastAPI, FastAPI framework, high performance, easy to learn, fast to code, ready for production
  2. Flask: web development, one drop at a time
  3. Hot reloading <- this sounds very exciting, and according to the internet it is "The idea behind hot reloading is to keep the app running and to inject new versions of the files that you edited at runtime. This way, you don't lose any of your state which is especially useful if you are tweaking the UI"
  4. Uvicorn: The lightning-fast ASGI server
  5. Cookiecutter Better Project Templates
  6. Gunicorn: Gunicorn 'Green Unicorn' is a Python WSGI HTTP Server for UNIX. It's a pre-fork worker model. The Gunicorn server is broadly compatible with various web frameworks, simply implemented, light on server resources, and fairly speedy
  7. Pyenv: pyenv lets you easily switch between multiple versions of Python. It's simple, unobtrusive, and follows the UNIX tradition of single-purpose tools that do one thing well
  8. Pipenv: Pipenv is a tool that aims to bring the best of all packaging worlds (bundler, composer, npm, cargo, yarn, etc.) to the Python world. Windows is a first-class citizen, in our world
  9. Hypercorn: Hypercorn is an ASGI web server based on the sans-io hyper, h11, h2, and wsproto libraries and inspired by Gunicorn

See more at the Software Dev wiki meetings page.

More CDI Blog Posts

We continued our exploration of 2019's CDI funded projects in April's monthly meeting with presentations on the Climate Scenarios Toolbox, developing cloud computing capability for camera image velocity gaging, and integrating environmental DNA (eDNA) data into the USGS Nonindigenous Aquatic Species database. 

For more information, questions and answers from the presentation, and a recording of the meeting, please visit the CDI wiki. 

Open-source and open-workflow Climate Scenarios Toolbox for adaptation planning 

Aparna Bamzai-Dodson, USGS, presented on the Climate Scenarios Toolbox (now renamed the Climate Futures Toolbox!), an open-source tool that helps users formulate future climate scenarios for adaptation planning. Scenario planning is a way to consider the range of possible outcomes by using projections based on climate data to develop usually 3-5 plausible, divergent future scenarios (e.g., hot and dry; moderately hot with no precipitation change; warm and wet). Resource managers and scientists can use these scenarios to help predict the effects of climate change and select appropriate adaptation strategies. However, climate projection data can be difficult to discover, access, and use, involving multiple global climate model repositories, downscaling techniques, and file formats. The Climate Futures Toolbox aims to take the pain out of working with climate data.

Collection of photos of people collaborating around climate scenarios and adaptation planning graphs.

The creators of the Toolbox wanted a way to make working with climate data easier by lowering the barrier to entry, automating common tasks, and reducing the potential for errors. The Climate Futures Toolbox uses a seamless R code workflow to ingest historic and projected climate data and generate summary statistics and customizable graphics. Users are able to contribute open code to the Toolbox as well, building on its existing capabilities and empowering a larger user community. The Climate Futures Toolbox was created in collaboration with University of Colorado-Boulder's Earth Lab, the U.S. Fish and Wildlife Service, and the National Park Service. 
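
The Toolbox itself is an R package; purely to illustrate the kind of summary statistic it automates, here is a language-agnostic sketch in Python with made-up projection values, computing each scenario's change from a historical baseline:

```python
# Hypothetical mid-century projections for one site (deg C, mm/yr).
historical = {"temp": 14.0, "precip": 400.0}
projections = {
    "hot and dry":          {"temp": 18.5, "precip": 330.0},
    "moderate, no change":  {"temp": 16.0, "precip": 400.0},
    "warm and wet":         {"temp": 15.5, "precip": 470.0},
}

def deltas(baseline, scenarios):
    """Change from baseline for each scenario and variable, a common
    summary used to pick divergent scenarios for planning."""
    return {
        name: {var: round(vals[var] - baseline[var], 1) for var in baseline}
        for name, vals in scenarios.items()
    }

summary = deltas(historical, projections)
print(summary["hot and dry"])
```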

CDI members are encouraged to become engaged in the Toolbox by installing and using it, providing feedback on issues, and contributing code to the package. Since April's monthly meeting, the project has developed further and undergone renaming, so this is a rapidly evolving endeavor. 

Develop Cloud Computing Capability at Streamgages using Amazon Web Services GreenGrass IoT Framework for Camera Image Velocity Gaging 

Frank Engel at the USGS Texas Water Science Center presented next on a CDI project involving non-contact stream gaging within a cloud computing framework. 

Measuring stream flow is an important aspect of USGS' work in the Water Mission Area, and stream gaging, a way to measure water quantity, is a technique with which many scientists are familiar. However, it is sometimes difficult to obtain measurements with traditional stream gaging, such as during floods or when measurement points are unsafe or unreachable. Additionally, post-flood measurement methods can be expensive and less accurate. 

To get around these issues, scientists have developed non-contact methods to measure water quantity. For example, cameras are used to view a flooding river, and the footage can produce a velocity measurement after processing and other analysis steps. This is a complicated method that requires many steps and extensive training. Thus, the goal of this project is to make the process work automatically using cloud computing and IoT. 

The first step required building a cloud infrastructure, with the help of Cloud Hosting Solutions (CHS). This involves connecting the edge computing (camera and Raspberry Pi footage of a stream) to an Amazon Web Services (AWS) IoT system and depositing camera footage and derivative products into an S3 bucket. The code for this portion of the project is in a preliminary GitLab repository that is projected to be published as part of the long-term project. The team is also still working toward building the infrastructure through to data serving and dissemination. 

Workflow for getting streamflow data into a cloud computing system.

Other successes accomplished with this project so far include auto-provisioning (transmitting location and metadata) of edge computing systems to the cloud; establishing global actions (data is transmitted to the cloud framework and can roll into automated processing, like extracting video into frames); and building automated time-lapse computation. 
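
Auto-provisioning, in the sense used above, is the edge device announcing itself to the cloud with its location and metadata. A standard-library-only sketch of what such a payload might look like (the station ID and field names are invented):

```python
import json
import time

def provisioning_payload(station_id, lat, lon, camera_model):
    """Metadata an edge device might publish to the IoT broker on startup."""
    return {
        "station_id": station_id,
        "location": {"lat": lat, "lon": lon},
        "camera": camera_model,
        "registered_at": int(time.time()),
        "capabilities": ["video_capture", "frame_extraction"],
    }

# In the real system this JSON would be published over MQTT to AWS IoT;
# here we just show that the payload round-trips cleanly.
message = json.dumps(provisioning_payload("TX-0042", 30.27, -97.74, "rpi-cam-v2"))
print(json.loads(message)["station_id"])
```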

Engel and the project team have taken away a couple of lessons from their experience with this project: first, cloud computing knowledge takes a lot of work and time to acquire; and second, in the short term, it can be difficult to establish a scope that encompasses the needs and wants of all stakeholders. 

Establishing standards and integrating environmental DNA (eDNA) data into the USGS Nonindigenous Aquatic Species database 

Jason Ferrante with the Wetland and Aquatic Research Center discussed his team's project on establishing standards for eDNA data in the USGS Nonindigenous Aquatic Species database (NAS). 

eDNA is genetic material released by an organism into its environment, such as skin, blood, saliva, or feces. By collecting water, soil, and air samples, scientists can detect the presence of a species with eDNA. Ferrante's project aims to combine the traditional specimen sightings already available in the NAS with eDNA detections for a more complete distribution record and improved response time to new invasions. 

There is currently a need for an open, centralized eDNA database. eDNA data is currently scattered among manuscripts and reports, and thus not easily retrievable via web searches. Additionally, there are no databases dedicated to Aquatic Invasive Species (AIS), which are the species of interest for this project. A centralized, national AIS viewer will allow vetting and integration of data from federal, academic, and other sources, increase data accessibility, and improve coordination of research and management activities. 

In order to successfully create a centralized AIS viewer, community standards need to be established so that data can be checked for quality and validity, especially within the FAIR data framework (Findable, Accessible, Interoperable, and Reusable). To establish community standards and successfully integrate eDNA into NAS, the project team accomplished several objectives: 

List of steps taken in integrating eDNA data into the Nonindigenous Aquatic Species Database

1) Experimental Standards 

  • Collated best standards and practices for sampling design and collection, laboratory processing, and data analysis in an eDNA literature review 

2) Stakeholder Backing 

  • Gathered a group of five other prominent/active eDNA researchers within DOI to discuss standards and vetting process 
  • Teleconferences to gain consensus 
  • Plan to produce a white paper 

3) Integration into NAS 

  • Pre-submission form about eDNA scientists' design and methodology in order to vet data 
  • Prototype web viewer (see meeting recording for more; must be logged into CDI wiki) 

Some challenges faced during the project included gaining consensus on the questions for the pre-submission form; staying organized and in communication; and meeting the needs of managers and researchers. Ferrante and the project team would love to follow up with CDI for help developing new tools that use eDNA data across databases to inform management, and for feedback on an upcoming manuscript about the project's process. 

The CDI Monthly Meeting in March focused on three 2019 CDI-funded projects: a national-scale map of sinkhole subsidence susceptibility; a project that will allow collection of near real-time eDNA surveillance of invasive species or pathogens; and an overview of SEINed - a tool for Screening and Evaluating Invasive and Non-native Data.

Subsidence Susceptibility Map for the Conterminous U.S. - Jeanne Jones, USGS 

Jeanne Jones at the Western Geographic Science Center shared progress on the creation of a Subsidence Susceptibility Map for the conterminous U.S. that will identify hotspots for sinkholes and areas susceptible to developing sinkholes. Sinkholes can pose major issues by focusing contaminated and/or polluted surface water into groundwater and creating instability in the foundations of buildings and roads. As such, a consistent map for the identification of sinkhole hotspots is vital in order to anticipate and manage risks. 

The goals of this project include creating the first nationwide digital dataset of sinkhole hotspots, incorporating this dataset into the SHIRA (CDI) Risk map for use by DOI emergency agencies, and providing access to the dataset for external use by emergency managers, land use planners, and public works agencies. To meet these goals, the project team used The National Map, the National Hydrography Dataset (NHD), the Yeti supercomputer, and other data.

Jeanne shared challenges and solutions involved with several steps of the process. For instance, data had to be screened visually and manually in order to identify gaps in spatial coverage, and to screen out wetlands, open water, urban areas, and other non-karst landscape features. 

Jeanne also posed a question to the CDI: How does flow accumulation processing with DEMs compare across ArcPy, TauDEM, and RichDEM in terms of speed, consistency of results, and maximum raster size for high-performance computing? You can respond to her at
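For context on that question, a toy D8 flow accumulation in plain NumPy shows what ArcPy, TauDEM, and RichDEM each compute at vastly larger scales: every cell drains to its steepest downslope neighbor, and accumulation counts the cells upstream.

```python
import numpy as np

def d8_flow_accumulation(dem):
    """Toy D8 flow accumulation: each cell drains to its steepest
    downslope neighbor; accumulation counts contributing cells."""
    rows, cols = dem.shape
    acc = np.ones_like(dem, dtype=float)       # each cell contributes itself
    order = np.argsort(dem, axis=None)[::-1]   # process cells high to low
    for flat in order:
        r, c = divmod(int(flat), cols)
        best, target = 0.0, None
        for dr in (-1, 0, 1):                  # find steepest downslope neighbor
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                    drop = dem[r, c] - dem[rr, cc]
                    if drop > best:
                        best, target = drop, (rr, cc)
        if target:                             # pass accumulated flow downstream
            acc[target] += acc[r, c]
    return acc

dem = np.array([[3., 2., 3.],
                [2., 1., 2.],
                [3., 2., 0.]])
acc = d8_flow_accumulation(dem)   # all 9 cells drain to the corner outlet
```

The production libraries differ mainly in how they parallelize and tile exactly this computation, which is why speed and maximum raster size vary across them.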

High-Resolution, Interagency Biosurveillance of Threatened Surface Waters in the United States - Sara Eldridge and Elliott Barnhart, USGS

The project presented by Elliott Barnhart tackles the problem of rapid detection and prediction of biological hazards. USGS collects a massive amount of near-real-time data with stream gauges, but the analysis of this data can take much longer. To solve this problem, the project team created a cloud-hosted digital database that combines all the collected data, and can easily incorporate eDNA and other data streams into models that indicate the presence or absence of organisms. 
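A presence/absence indicator from an eDNA data stream can be as simple as a replicate-threshold rule. The thresholds below are invented for illustration and are not the project's actual model:

```python
def edna_detection(replicate_copies, loq=10.0, min_positive=2):
    """Hypothetical detection rule: call a species 'present' when at
    least `min_positive` qPCR replicates meet or exceed the limit of
    quantification (loq). All thresholds here are illustrative."""
    positives = sum(1 for c in replicate_copies if c >= loq)
    return positives >= min_positive

print(edna_detection([0.0, 14.2, 22.7]))  # → True (two replicates positive)
print(edna_detection([0.0, 14.2, 3.1]))   # → False (only one positive)
```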

Challenges faced during the course of this project included creating effective quality control filters for funneling in data from multiple sources, and linking the benefits and capabilities of several different systems (like the MBARI Environmental Sample Processor, Department of Energy Systems Biology Knowledgebase, and more). 

National Public Screening Tool for Invasive and Non-native Aquatic Species Data - Wesley Daniel, USGS

The Nonindigenous Aquatic Species (NAS) database is the central repository for spatially referenced accounts of introduced aquatic species. NAS tracks over 1,290 aquatic species and stores over 600,000 observations from across the U.S., spanning from the 1800s to the present. The SEINeD tool was developed to solve the problem: how does the NAS database get non-native occurrence data from groups not focused on invasive species? The SEINeD tool allows stakeholders to upload a biological dataset (fish, invertebrates, plants, etc.) collected anywhere in the conterminous U.S., Alaska, Hawaii, or a U.S. territory that can then be screened for invasive or non-native aquatic species occurrences. 

The SEINeD tool helps to filter out inaccuracies due to incorrect taxa and spatial identifications before checking the indigenous status of the species against the sighting location. The tool flags non-native species that are exotic (from other countries/continents) and non-native species from within the U.S. (for example, rainbow trout, native to the west coast, sighted on the east coast). The data is then enhanced with the addition of spatial information like hydrologic unit codes (HUCs) and returned to the user. The user can then submit the enhanced/corrected CSV to the NAS program.  
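The screening logic can be sketched as a lookup of each species' native range against the sighting location. The species-to-HUC table here is a tiny, partly invented stand-in for the NAS reference data:

```python
# Hypothetical native-range lookup: species -> set of HUC2 region codes.
NATIVE_HUCS = {
    "Oncorhynchus mykiss": {"17", "18"},  # rainbow trout: Pacific coast HUCs
}

def screen_occurrences(records):
    """Flag occurrences that are exotic or out of native range.

    Each record is (species, huc2). Species with no U.S. native range on
    file are flagged 'exotic'; species sighted outside their native
    range are flagged 'non-native'. Illustrative sketch only.
    """
    flagged = []
    for species, huc2 in records:
        native_range = NATIVE_HUCS.get(species)
        if native_range is None:
            flagged.append((species, huc2, "exotic"))
        elif huc2 not in native_range:
            flagged.append((species, huc2, "non-native"))
    return flagged

records = [
    ("Oncorhynchus mykiss", "02"),  # rainbow trout on the east coast
    ("Oncorhynchus mykiss", "17"),  # within native range: not flagged
    ("Cyprinus carpio", "07"),      # common carp: exotic
]
flagged = screen_occurrences(records)
```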

The SEINeD tool launches May 4th. Watch the NAS website for updates. 

See the recording and slides at the March Monthly Meeting page.

More CDI Blog Posts

You can get to all of these groups and sign up for mailing lists on the CDI Collaboration Area wiki page.

3/2/20 Metadata Reviewers - sharing metadata creation practices for data and software

The Metadata Reviewers group discussed (1) At your office, how much do you create a single metadata record for? Individual data files, items in a database, collections of data, whole data releases, or what? (2) What about metadata for software or code? How can we prepare to think together about that, maybe on our next phone call? Should we invite a speaker? Bring in reference materials? Bring in good examples?

Link highlight: Data Management Training resources on the USGS Data Management Website

See other notes, links, and take-aways on the Metadata Reviewers Meetings Page! Contact: Lightsom, Frances L.

Training relevant to metadata creation is on the USGS Data Management Website!

3/3/20 Fire Science Update

At the March Fire Science Community of Practice meeting, Paul Steblein gave a fire science update. An FY2021 Request for Proposals will be issued this summer from the Joint Fire Science Program. Anna Stull presented on fire deployment requirements. Geoff Plumlee presented on the Department of Defense (DoD) Joint Artificial Intelligence Center (JAIC), Humanitarian Assistance and Disaster Relief (HADR) National Mission Initiative (NMI). During and after a fire, what can USGS do to link topography, mineralogy, debris flows, revegetation, and invasive species? Rachel Loehman presented on the status and next steps of the USGS Fire Science Strategic Plan, and she is a new co-lead with Paul for the Fire Science CoP!

Past activity can be viewed at the Fire Science wiki page. Contact: Steblein, Paul Francis 

Recent wildland fire activity is shared at the Fire Science Community of Practice Meetings. 

3/9/20 Data Management - How do we convince them?

The Data Management Working Group again welcomed Science Gateways Community Institute trainers Claire Stirm and Juliana Casavan to host a working session on Communicating Data Value Propositions to Scientists. I found the concepts to be useful for communicating anything in general when trying to get "buy-in." Although some of us may be uncomfortable with the term "marketing," I think we may relate to the lessons of verbiage, graphics, actions, and strategies for trying to get buy-in for whatever we are working on.

Slides and recording are posted at the wiki meeting page. Contacts: Langseth, Madison Lee and Hutchison, Vivian B.

Slide from the "Communicating Data Value Propositions to Scientists" presentation.

3/12/20 Semantic Web - biodiversity in the world's oceans

Summary provided by Lightsom, Frances L.

Topic:  A practical example of semantic technology in action: assessing the status of biodiversity in the world’s oceans

Sky Bristol (USGS Core Science Systems) presented a big use case, in which semantic standards are needed to enable integration of multiple large datasets of ocean biological and ecological observations to understand effects of human activities on ocean ecosystems, as well as the sustainability of human uses of ocean resources. Several groups are working on ontologies that provide standard terms for the biota, ecosystem components, and relationships. A next step will be normalizing all the data with ontologies. This is an opportunity to assist with real-world semantic web work.

See more at the Semantic Web meetings page.

A slide from Sky Bristol's presentation on a practical example of semantic technology - biodiversity in the world's oceans.

3/12/20 Tech Stack - provision of rapid response during Australian bushfires

Following the ESIP Theme "Putting Data to Work," the Tech Stack group had a presentation on the Discrete Global Grid System's use during the Australian bushfires.

From the abstract: The devastation caused by the Australian Bushfires highlighted the need for a new approach for rapid data integration. The total burnt area during Autumn-Summer 2019-2020 is 72,000 square miles, which is equivalent to half of Montana, or the areas of North Dakota and Delaware combined. Rapid response in provision of information on areas affected by the bushfires was required to support evaluation of the impact, and also planning the recovery process and support for families, businesses, and the environment. This presentation will discuss application of the Discrete Global Grid System (DGGS) in bringing together diverse complex information from multiple sources to support the response process. 
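The integration idea behind a DGGS is that every dataset is indexed to the same cell identifiers, so cross-dataset joins become key matches. A true DGGS (e.g., rHEALPix or H3) uses hierarchical, roughly equal-area cells; the simplified equal-angle grid below only illustrates the join-key idea:

```python
def cell_id(lat, lon, res_deg):
    """Index a point into a simple equal-angle grid cell.

    A real DGGS uses hierarchical, roughly equal-area cells; this fixed
    lat/lon grid is only an illustration of shared cell identifiers."""
    row = int((lat + 90) // res_deg)
    col = int((lon + 180) // res_deg)
    return f"r{res_deg}_{row}_{col}"

# two observations from different datasets land in the same 1-degree cell
a = cell_id(-33.86, 151.21, 1)  # fire-perimeter point (illustrative coords)
b = cell_id(-33.42, 151.34, 1)  # nearby burnt-area observation
# a == b: both share cell "r1_56_331", so their attributes can be joined
```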

Slides and recording on the joint CDI Tech Stack and ESIP IT&I webinars on the ESIP page. Contact: Blodgett, David L.

Slide from the DGGS and Australian bushfires presentation.

3/16/20 Ignite Open Innovation Forum - volcanic hazards and problem-based learning

Jefferson Chang presented on Using Volcanic Hazards in Hawai‘i as a STEM Platform in Problem-Based Learning.

From the abstract: We use emerging technology to empower youth in a problem-based learning approach during a summer-long course. With guidance from HVO scientists, students essentially adopt the hazards mission of the USGS. Students not only aid in the volcano monitoring efforts on Hawai‘i Island, but also (1) take ownership of their own learning, (2) increase their capacity in STEM, and (3) engage the local community and address its needs.

See more at the Open Innovation wiki site. Contact: Liu, Sophia

Slide from Jefferson Chang's presentation at the Ignite Open Innovation Forum.

3/17 - 3/18/20 ICEMM - Interagency Collaborative for Environmental Modeling and Monitoring public meeting - Integrated Modeling, Monitoring, and Working with Nature.

The ICEMM group held its 2020 Public Meeting (March 17-18, 2020) at USGS Headquarters, Reston, VA.  The theme was Integrated Modeling, Monitoring, and Working with Nature.

Selected presentation titles: Engineering with nature for sustainable systems; Building smarter water systems through improved sensors, autonomy, and data processing; Black swans, disappearing lakes, and the societal value of integrated modeling and monitoring; Integrated water prediction at the USGS; Next generation integrated modeling of water availability in the Delaware River Basin and beyond.

Links: Agenda.  Abstracts and Biographical Information. Description of the meeting.  

Please contact Glynn, Pierre D by email for further information or questions.

Slide from Branko Kerkez's presentation on Building smarter water systems.

3/18/20 Usability Resource Review Posting - using web analytics to inform how our web pages and tools are being used

From Sophie Hou:

For March, I have prepared a resource review to address questions relating to the “How can we use Google Analytics DataStudio to inform how our online tools are used?” topic posted to the forum.

Using analytics to inform how our web pages/tools are being used

Please note that although Google Analytics is referenced in the original question and in my resource review, there are other options. If you have used tools other than Google Analytics, could you please share the information/experience through the usability listserv?

Behavior flow snapshot from the resource: How to use the behavior flow report to improve your webpage user experience.

3/19/20 Risk Community of Practice - situation assessment, stakeholder alignment, prototyping, and strategic planning

From the Risk CoP wiki: This was part 2 of a series of training webinars provided by Impact360 Alliance on human-centered design thinking and inclusive problem solving. During this webinar, participants took a deeper dive into six of the tools from Impact360's Toolkit360. Toolkit360 includes six tools for collaboratively understanding wicked problems and six tools for collaboratively generating strategies to solve wicked problems. The twelve tools bridge the "problem space" and "solution space" using situation assessment, stakeholder alignment, prototyping, and strategic planning.

Access the slides and recording, and handouts at the Risk Meetings page (must log in as a CDI member, join here if you're not a member yet). Contact: Ludwig, Kristin A.

I appreciated the clear steps for the different tools in the toolkit, for example for [Re]Assess.

3/26/20 Software Development - using the cloud to support geophysical research

Kirstie Haynie, a Mendenhall post-doc at the USGS, is exploring how to use the cloud to support geophysical research. She presented from a geophysicist's point of view, running the Slab2 model in the cloud. With the goal of operationalizing Slab2, she demonstrated the process of using CloudFormation templates for reproducibility, automation, and long-term success.

Get to more resources on the Software Dev Cluster wiki page. Contacts: Ladino, Cassandra C.; Guy, Michelle; Newson, Jeremy K.

Subduction zones via Slab2 at the software dev cluster meeting!

More CDI Blog Posts

Highlights from the "Advancing FAIR and Go FAIR in the U.S." Workshop

I attended the "Advancing FAIR and Go FAIR in the U.S." workshop in February; the workshop covered topics on how to establish and promote FAIR culture and capabilities within a community. Many of the discussions were synergistic with the CDI activities, so I wanted to share some key points from the workshop with the CDI community. - Sophie Hou

(Logo from the Go FAIR Initiative)

Workshop Info 

Title: Advancing FAIR and Go FAIR in the U.S.  

Date: February 24th to 27th, 2020 

Location: Atlanta, Georgia 

Goals: 
  • Facilitate development of a community of practice for FAIR awareness and capacity-building in the US 
  • Improve understanding of FAIR technologies, and how to teach this to others 
  • Preparation for teaching or supporting FAIR data management and policies for researchers, local institutions, professional organizations, and others 



Overall Summary: 

  • The workshop highlighted that advancing FAIR requires communal effort. 
  • In order to "FAIRify," it is important for a community to first determine its scope, goals, and objectives. 


Key Notes: 

  • FAIR is an acronym for Findable, Accessible, Interoperable, and Reusable. 
  • Typical challenges that a community could face when working on FAIR include:
    • Knowledge gap
    • Institutional inertia
    • Community relationship building
    • Expanding FAIR capacity
    • Determining the best way to adapt and adopt available FAIR resources
  • The ultimate goal of enabling FAIR is to allow both humans and machines (especially machines) to use digital resources, so that analytics and re-use can be optimized.
    • According to the Go FAIR Initiative, FAIR can also be understood as Fully AI Ready. In other words, machines are able to know what the digital resources mean. Additionally, the digital resources are as distributed/open as possible, but can also be as central/closed as needed.
  • Implementation of FAIR can be challenging because many concepts in the principles are multifaceted (including social, resource, and technical considerations).
  • In order to advance FAIR, it is important to first establish a good (common) understanding of the FAIR principles.
  • FAIR requires technical and disciplinary resources, but it also requires community support.
    • When implementing FAIR, we need to review choices and accept challenges, e.g., determine who our "community" is and what is specific to our "community".
    • FAIR is not a “standard”. The local community context is important and necessary.
  • The Go FAIR Initiative offers a seven-step "FAIRification" process. 
  • Options for conducting a FAIR event/activity with one's community include:
    • Multiple day, experts convening, tutorial/webinar, conference, unconference, hackathon, symposium, sprint, posters, etc.
  • Participants of a FAIR event/activity might have the following expectations:
    • Share best practices/resources/learn new skills
    • Tackle a problem
    • Learn new concepts/skills
    • Use FAIR as a theme to track for other topics
    • Collaborate to create a resource to be shared
    • And more!
  • Once a community has established its version of FAIR, it is important to connect with other communities. Convergence with different communities is key to growing FAIR. 

CDI's February meeting featured a discussion on the value of CDI to you, and a deep dive into Pangeo.

Pangeo: A flexible open-source framework for scalable, data-proximate analysis and visualization

Rich Signell, a Research Oceanographer at the Coastal and Marine Science Center in Woods Hole and member of the Pangeo Steering Council, presented an overview of Pangeo and examples of its use in several different types of USGS workflows. The Pangeo framework is deployed by Cloud Hosting Solutions (CHS) and funded by EarthMAP as a new form of cloud-based model data analysis. Community-driven, flexible, and collaborative, Pangeo is slowly building out a set of tools with a common philosophy. In one example, Rich used a Pangeo Jupyter Notebook to process a dataset in one minute that had previously taken two weeks. Cloud costs, skills, cloud-optimized data, and Pangeo development are issues that are currently being addressed.
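The scalability Rich demonstrated comes from chunked, data-proximate computation. Stripped of dask and xarray (which automate and parallelize this pattern in a Pangeo deployment), the underlying idea looks like the sketch below:

```python
import numpy as np

def chunked_mean(values, chunk=4):
    """Compute a mean one chunk at a time, so only one block of data is
    in memory at once. dask/xarray automate this chunking and run the
    blocks in parallel, next to the data, in a Pangeo deployment."""
    total, count = 0.0, 0
    for start in range(0, len(values), chunk):
        block = values[start:start + chunk]  # in practice: one lazy chunk
        total += block.sum()
        count += block.size
    return total / count

data = np.arange(10, dtype=float)
assert chunked_mean(data) == data.mean()  # same answer, bounded memory
```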

For more:

Pangeo and Landsat in the Cloud

Renee Pieschke, a Technical Specialist for the Technical Services Support Contract at the Earth Resources Observation and Science Center in Sioux Falls, SD, continued our Pangeo focus with some information on Landsat in the cloud. Renee and her team are looking toward a spring release of Collection 2 data, which will exponentially increase the amount of data available. Level 2 processing will be required for the Collection 2 data (trying to get close to what the scene would look like at ground level by removing disturbances, clouds, etc.).

The LandsatLook upgrade uses a cloud-native infrastructure and a cloud-optimized GeoTIFF format. It uses new SpatioTemporal Asset Catalog metadata to programmatically access the data. The new LandsatLook can filter pixels with a QA band so that any clouds, shadows, snow, ice, or water are removed to produce the best possible image.
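QA-band filtering is bitmask arithmetic: each quality condition occupies a bit, and a pixel is kept only if none of the unwanted bits are set. The bit positions below are invented for illustration; the real Landsat QA band defines its own layout:

```python
import numpy as np

# Hypothetical bit positions; the actual Landsat QA band assigns its own.
CLOUD, SHADOW, SNOW, WATER = 1 << 3, 1 << 4, 1 << 5, 1 << 7
MASK = CLOUD | SHADOW | SNOW | WATER

def clear_pixels(qa_band):
    """True where none of the masked conditions are set."""
    return (qa_band & MASK) == 0

qa = np.array([0, CLOUD, SHADOW | WATER, 2])  # bit 1 alone is not masked
print(clear_pixels(qa))  # → [ True False False  True]
```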

The SpatioTemporal Asset Catalog (STAC) was developed to help standardize metadata across the entire geospatial data provider community, using a simple JSON structure. It normalizes common names, simplifies the development of third-party applications, and helps enable querying in Pangeo. Another in-progress goal is making Landsat data available in the cloud, which involves converting the data to the cloud-optimized GeoTIFF format; such data is already fueling the backend of LandsatLook.
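A STAC Item is simply a GeoJSON Feature with a few required STAC fields. This minimal example uses placeholder values, not a real Landsat scene:

```python
import json

# Minimal STAC Item: a GeoJSON Feature plus STAC fields. All values are
# placeholders for illustration, not a real Landsat scene.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-001",
    "geometry": {"type": "Point", "coordinates": [-96.7, 43.5]},
    "bbox": [-96.8, 43.4, -96.6, 43.6],
    "properties": {"datetime": "2020-03-01T17:00:00Z"},
    "assets": {
        "red": {
            "href": "https://example.com/scene/B4.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
    "links": [],
}
serialized = json.dumps(item)  # what a STAC catalog would serve
```

Because every provider publishes the same structure, a client can search and load assets without provider-specific code.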

USGS users can access Pangeo and some test notebooks through and code.usgs. More information is available on the meeting slides.

Why is CDI valuable to you? Why do you participate?

A poll was administered to participants to see what the value of CDI is to them. Some responses are below.

"I like to hear about (and share) the cool work folks are doing throughout the USGS! The Communities are valuable because they allow folks to share innovative research and discuss ways we can do so while following Department, Bureau, Mission Area policy."
"CDI provides relevant, useful, and timely data management related issues, projects, and tools."
"I learn about new technology applications and learn of colleagues I might collaborate with."
"The CDI helps me to get my work done in my daily job! I find the people who are part of the CDI are amazing to interact with - they are engaged, enthusiastic, and interested in making things better at USGS. CDI has made me feel like I am more in touch with the USGS - there is so much going on in this Bureau, and CDI keeps me informed and makes me feel like I am part of something bigger than just my daily job."
"Demonstrate that best practices in data sci/software/etc. is important to colleagues."
"Diverse community, wide range of experience and expertise."

More information, including notes, links, slides and video recordings on the meeting, are available here.

January's monthly meeting covered how to evaluate web applications and better understand how they are working for users, and explored well-established strategies for USGS crowdsourcing, citizen science, and prize competition projects. 

Application Evaluation: How to get to a Portfolio of Mission Effective Applications 

Nicole Herman-Mercer, a social scientist in the Decision Support Branch of the Water Resources Mission Area's Integrated Information Dissemination Division, presented on how to evaluate web applications based on use, value, impact, and reach, as defined below. 


Use 

Definition: take, hold, view, and/or deploy the data/application as a means of accomplishing or achieving something. 

  • How many people use this application? 
  • How many are new users? 
  • How many are returning users? 
  • Are users finding what they need through this site/application? 

Herman-Mercer used Google Analytics to answer some of these questions. Google Analytics provided information such as total daily visits, visits through time, what pages users are visiting and how they're getting there (links from another website, search, or direct visits), how often they're visiting, how many repeat visits occur, and how long users spend on individual pages. 


Value 

Definition: The importance, worth, and/or usefulness of the application to the user(s) 

  • How willing are users to pay for the application? 
  • How important is this application to the user's work and/or life? 
  • What/how large would the impact of the loss of this application be to the user? 

To estimate the value of selected applications to users, an electronic survey was sent to internal water enterprise staff, which asked respondents to indicate which applications they used for work, and then to answer a series of questions about those applications. Questions attempted to pinpoint how important applications were to users, and how affected their work would be should the application be decommissioned. 


Impact 

Definition: The effect the application has on science, policy, or emergency management 

  • How many scientific journal articles use this application? 
  • Is this application relevant for policy decisions? 
  • Do emergency managers use this application? 

Publish or Perish software for text mining was used to get at some of these data points. Publish or Perish searches a variety of sources (Google Scholar, Scopus, Web of Science, etc.) and returns any citations that applications are getting. Attempts to search for policy document citations have proven more difficult, and were not factored into this evaluation as a result. 


Reach 

Definition: How broadly the application reaches across the country and into society 

  • Where are users? (Geographically) 
  • Who are users? (Scientists? Academia? Government?) 

Google Analytics was again used to gather visits by state, which was then compared with the state population to get an idea of use. These analytics could also identify which networks users are on, i.e., .usgs, .gov, or .edu. Finally, an expert survey was deployed, surveying users who developed the application or currently manage it to get a sense of who the experts think the intended and actual audience is. 
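Comparing visits with state population amounts to a per-capita normalization; the counts below are made up for illustration:

```python
def visits_per_capita(visits, population, per=100_000):
    """Normalize raw visit counts by state population.

    All numbers in this example are invented for illustration."""
    return {state: visits[state] / population[state] * per
            for state in visits}

visits = {"CO": 12_000, "WY": 1_500}
population = {"CO": 5_800_000, "WY": 580_000}
rates = visits_per_capita(visits, population)
# WY's much smaller raw count is actually a higher per-capita rate
```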

Contact Nicole at for a detailed report on the full evaluation. 

Herman-Mercer's team was inspired by Landsat Imagery Use Case studies. 

USGS Open Innovation Strategy for Crowdsourcing, Citizen Science, and Competitions 

Sophia Liu, an Innovation Specialist at the USGS Science and Decisions Center in Reston, VA, as well as the USGS Crowdsourcing and Citizen Science Coordinator and Co-Chair of the Federal Community of Practice for Crowdsourcing and Citizen Science, presented an overview of well-established USGS crowdsourcing, citizen science, and prize competition projects. 

Citizen science, crowdsourcing, and competitions are all considered by Liu to be types of open innovation. Definitions of these terms are as follows: 

  • Citizen science: public participation or collaboration with professional scientists requesting voluntary contributions to any part of the scientific research process to enhance science. 
  • Crowdsourcing: a way to quickly obtain services, ideas, or content from a large group of people, often through simple and repeatable micro tasks. 
  • Competitions: challenges that use prize incentives to spur a broad range of innovative ideas or solutions to a well-defined problem. 

A popular example of citizen science/crowdsourcing is citizen seismology or public reports of earthquakes, like Did You Feel It? 

Liu has documented about 44 USGS crowdsourcing and citizen science projects, and 19 USGS prize competitions. Some examples of open innovation projects and information sources are listed here: 

Participants during the presentation were asked to use a Mentimeter poll to answer short questions and provide feedback on the talk. 

Sophia is looking for representatives from across all USGS mission areas, regions, and science support offices interested in giving feedback on the guidance, catalog, toolkit, and policies she is developing for the USGS Open Innovation Strategy. Feedback can be provided by joining the USGS Open Innovation Strategy Teams Site or emailing her at 

See the recording and slides at the meeting page.