
June 9, 2021 - Mapping with machine learning, AI for fish, and continuous grids of basin characteristics

The Community for Data Integration (CDI) meetings are held the 2nd Wednesday of each month from 11:00 a.m. to 12:30 p.m. Eastern Time.

Connection information

Connection information is sent to the CDI mailing list
Join Microsoft Teams Meeting
+1 719-733-3211   United States, Pueblo (Toll)
Conference ID: 522 981 927#


Meeting Recording and Slides

Recordings and slides are available to CDI Members approximately 24 hours after the completion of the meeting.

These are the publicly available materials. Log in to view all the meeting resources. If you would like to become a member of CDI, join at https://listserv.usgs.gov/mailman/listinfo/cdi-all.

During the call, you can ask and up-vote questions at slido.com, event code #CDIJUN.

Agenda (in Eastern time)

11:00 am Welcome and Opening Announcements

11:15 am Collaboration Area Announcements

11:25 am Building a framework to compute continuous grids of basin characteristics for the conterminous United States - Theodore Barnhart, USGS

11:40 am Using machine learning to map topographic-soil & densely-patterned sub-surface agricultural drainage (tile drains) from satellite imagery - Tanja Williamson, USGS

11:55 am Enabling AI for citizen science in fish ecology - Nathaniel Hitt, USGS

12:30 pm  Adjourn

Abstracts

Building a framework to compute continuous grids of basin characteristics for the conterminous United States

Theodore Barnhart, USGS

The proposed work will create a seamless pilot dataset of continuous basin characteristics (for example upstream average precipitation, elevation, or dominant land cover type) for the conterminous United States. Basin characteristic data are necessary for training or parameterizing statistical, machine learning, and physical models, and for making predictions across the landscape, particularly in areas where there are no observations. The pilot dataset will be accessible to the public via an interactive map and Web-based query service. The pilot dataset, USGS software used to produce it, and a publication on the processing methods will be generated. This work represents a substantial addition to USGS data services by delivering a suite of basin characteristics at every location in the conterminous United States, codifying the software and processing techniques needed to produce such a dataset, and delivering the data in both human and machine-readable formats.

Using machine learning to map topographic-soil & densely-patterned sub-surface agricultural drainage (tile drains) from satellite imagery

Tanja Williamson, USGS

In the mid-1800s, tile drains were installed in the poorly drained soils of topographic lows to protect cropland during wet conditions; consequently, estimates of tile-drain location have been based on soil series. Most tile drains are in the Midwest; however, each state has farms with tile drains, and tile-drain density has increased in the last decade. Where tile drains quickly remove water from fields, groundwater and stream water interaction can change, affecting water availability and flooding. Nutrients and sediment can travel quickly to streams through tile drains, contributing to harmful algal blooms and hypoxia in large water bodies. Tile drains are below the soil surface, about 1 m deep, but their location can be visible in satellite imagery as patterns in soil or plant color. We will develop a machine-learning approach to: (1) identify satellite imagery with visible tile drains; and (2) differentiate topographic-soil tiles from the densely-patterned tile that extends to new areas.

Enabling AI for citizen science in fish ecology

Nathaniel Hitt, USGS

Artificial Intelligence (AI) is revolutionizing ecology and conservation by enabling species recognition from photos and videos. Our project evaluates the capacity to expand AI for individual fish recognition for population assessment. The success of this effort would facilitate fisheries analysis at an unprecedented scale by engaging anglers and citizen scientists in imagery collection. This project is one of the first attempts to apply AI towards fish population assessment with citizen science.

Highlights

  1. Williamson and team presented on their process for creating a machine learning model to map tile drains from satellite imagery. A data release, Jupyter Notebooks, and a publication on this project are forthcoming.
  2. Barnhart presented on his work creating Flow-Conditioned Parameter Grids (FCPGs): data release (https://doi.org/10.5066/P9HUWM6Q) and software release (https://doi.org/10.5066/P9W8UZ47) available now.
  3. Hitt shared his findings on deep learning for individual fish recognition using convolutional neural networks; see slides for provisional software release.

Notes

Welcome and Opening Announcements

  1. Questions from CDI workshop for leadership
    1. As USGS increasingly produces more big-data science, is the agency looking at the carbon footprint of its own data centers and making them sustainable?
      1. Tim Quinn:
        1. We are working on it, but could do more.
        2. Looking into improving server utilization (65% or better utilization as a good goal) and conserving those resources.
        3. Looking into virtualization and how to continue this in the future.
        4. More to come!
      2. Kevin Gallagher
        1. Especially in the last 5 years, there have been requirements to develop the carbon footprint of USGS. Big emphasis on efficiency of power and water systems. Statistics show that the amount of power USGS consumes has been going down over the years.
        2. The EROS data center is open for business. Researchers can work with the EROS center to do remote processing for their center, reducing the number of data centers consuming resources.
    2. In what ways will the USGS adjust to maintain its rigorous science practices under the pressure of quickly providing actionable science?
      1. Kevin Gallagher:
        1. The external community loves our science; gold standard in many fields. However, they think it takes way too long. Always discussing how to get data out faster. Have to be careful in accelerating science because of needed rigorous science practices. Provisional data has been an option over the years; provisional data may change, but can be released more quickly. Our Fundamental Science Practices advisory community is open for participation. This group is always looking at FSP practices and evolving them. For example, there has been more of an emphasis in developing FSP guidance for data in the last five years. Want to maintain integrity of science, but also find ways to accelerate our science. In emergency situations or where the secretary needs information to make decisions, we have developed processes to get that data out quicker.
      2. Tim Quinn
        1. David Applegate has helped to speed up grants processing, which helps with speeding up the data release process. 

Collaboration Area Announcements

For full announcements, see slides above.

  1. Imagery
    1. Next meeting: June 18th
    2. Join the Imagery Data collab area Teams channel!
    3. Check out the recording of the Online Imagery Data Storage Session, and the Mural board brainstorm on discussing Imagery ideas, needs and priorities.
  2. Semantic Web
    1. Next meeting: June 10th; Wrap-up of Semantic Web 101 session at the CDI workshop
  3. Metadata Reviewers
    1. Talked about consistent placement of disclaimers in metadata records and actions repositories can take to include USGS thesaurus keywords
  4. Geomorphology
    1. Next meeting: June 22; Elevation Derived Hydrography
  5. Usability
    1. Next meeting: June 16; user needs/requirements analysis and definitions demo: mapping techniques (e.g. empathy mapping and journey mapping)
  6. Risk
    1. Next meeting: June 17; Hazards, Race and Social Justice Speaker series
    2. Annual meeting Aug 17-19
  7. Tech Stack
    1. Next event: June 10; USGS Hydro Network Linked Data Index Tools
  8. DevOps
    1. Next event: August 3; RedHat OpenShift
  9. Data Management
    1. Next event: June 14; Data release and section 508 compliance
  10. DataViz
    1. Next event: July 8; Mike Freeman on Observable as a platform for data exploration & visualization with USGS data
  11. eDNA
    1. Working on a 'how to video' for using eDNA wiki and exploring functionality of MS Teams to make things more accessible
  12. Inland/Coastal Bathymetry
    1. Next meeting: June 29

Using machine learning to map topographic-soil & densely-patterned sub-surface agricultural drainage (tile drains) from satellite imagery - Tanja Williamson, USGS

  1. Why it matters
    1. Information needs
      1. Image Analysis
        1. Total area
        2. Location
        3. Type
          1. Topographic-soil
          2. Densely-patterned
    2. Recent shift from targeting the lowest/wettest part of the landscape to targeting the entire field.
  2. First step was putting together a library of landscapes
    1. Positive images and negative examples
    2. Two different types of tiles (subsets showing topographic-soil tiles and patterned tiles)
    3. The concept of publishing a static version of the library has been a challenge.
  3. Traced and trained the model 
    1. Ground Truth in Amazon Web Services SageMaker allowed collaboration on tracings.
  4. General Workflow
    1. Download Imagery
      1. High Resolution
      2. Multiple Dates
      3. Move to S3 bucket for storage
    2. Split Imagery
      1. Uses geospatial osGeo library on Amazon SageMaker instance (Amazon Web Services were not easily available, and osGeo is reproducible)
      2. File IO handled within Amazon EFS
      3. Metadata exported to XML
      4. Uses a Jupyter Notebook
      5. Used the elastic file system and kept metadata separate as XML, so everything is tracked
    3. Run Model
      1. Model files are stored on EFS with SageMaker
      2. Output sent to EFS
    4. Reconstruct the Images
      1. Files are rejoined together using saved metadata
      2. Output is saved to S3 bucket
  5. Modeling magic
    1. Used UNET - convolutional neural network for Image Segmentation
      1. Uses something called a spatial filter, which scans through the image looking for interesting features; it records a response, then moves to the next section, ultimately producing a new image. Combining filters can find very complex features.
      2. Hundreds of thousands of filters working together, originally set to random.
      3. Manually traced images are used to train the model and to evaluate how well it does. Then, based on the performance, the filter values can be adjusted until the manual aspect is no longer needed.
      4. Many machine learning methods need hundreds of thousands of images; the human team could not produce that volume fast enough. Instead, existing images were transformed (rotated, flipped, etc.), and the model sees each transformation as a new image.
    2. Dealing with uncertainty in the modeling results
      1. Can calculate certainty based on manual tracings compared with model results
      2. We don't have 'true' groundtruth data
      3. Comparing what humans saw in images to what ML model saw
        1. Will interpretation be the same as person?
        2. Does it indicate different types of tile?
        3. Did it avoid key pitfalls (waterways and field lines)?
    3. Planned products
      1. Data Release
        1. Training libraries
        2. Aggregation of patches
          1. YES or NO for tile presence (most recent)
      2. Jupyter Notebooks with example workflows
      3. Journal manuscript discussing methods (in progress)
      4. Potential future work
        1. topographic-soil vs densely-patterned (binary)
        2. Time: how has the tile drainage changed?
        3. What months provide best capture?
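The spatial-filter and augmentation ideas in the notes above can be sketched in a few lines of NumPy. This is an illustrative toy, not the project's UNET code: the 1x2 edge kernel and the `augment` helper are hypothetical stand-ins for the hundreds of thousands of learned filters and the rotate/flip transformations described above.

```python
import numpy as np

def apply_filter(image, kernel):
    """Slide a spatial filter over an image and record a response at
    each position (valid convolution, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def augment(image):
    """Generate extra training examples from one image by flipping and
    rotating it; the model treats each variant as a new image."""
    return [image, np.fliplr(image), np.flipud(image),
            np.rot90(image, 1), np.rot90(image, 2), np.rot90(image, 3)]

# A horizontal-gradient filter responds strongly where pixel values
# change left-to-right, e.g. along a linear feature in imagery.
image = np.zeros((5, 5))
image[:, 3:] = 1.0                     # bright region on the right
edge_filter = np.array([[-1.0, 1.0]])  # 1x2 gradient kernel
response = apply_filter(image, edge_filter)
print(response.max())  # strongest response sits at the brightness edge
```

A UNET stacks many such filters in layers and learns their values from the traced training data, rather than using hand-set kernels like the one above.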

Building a framework to compute continuous grids of basin characteristics for the conterminous United States - Theodore Barnhart, USGS

  1. Basin characteristics: what they are and why they're useful
    1. A basin characteristic is a metric of a watershed.
      1. Mean slope
      2. Mean precipitation
    2. Can be used to parameterize statistical or machine learning models to predict streamflow statistics such as the 100-year flood.
  2. What is a Flow-Conditioned Parameter Grid (FCPG)? 
    1. A grid where each cell in the grid represents the upstream average of the parameter of interest.
    2. Can be generated with nearly any GIS software.
  3. How are FCPGs made?
    1. Need two datasets:
      1. Flow direction grid (derived from a digital elevation model; every cell depicts the direction a drop of water would flow)
      2. Parameter grid (mean annual precipitation)
      3. Optional: watershed boundary data set (informs how you can cascade values from upstream hydrologic regions to downstream hydrologic regions).
    2. Input parameter data are reprojected and resampled to the same grid as the flow direction raster.
    3. Use a flow accumulation algorithm to accumulate the flow direction grid
      1. Accumulated area and accumulated upstream parameter value grids are generated from the flow direction grid and resampled parameter grid.
    4. Then divide the accumulated parameter values by the accumulated upstream area values to produce a grid of mean upstream parameter values.
  4. How are FCPGs useful?
    1. Rapidly parameterize machine learning, statistical, and mechanistic hydrologic (or other!) models
      1. Once you get to ~500 points, it makes more sense to pre-compute values
    2. As watershed count increases, the delineation and zonal statistics approach takes increasingly long, while query time for FCPGs remains relatively constant.
  5. The CONUS Pilot FCPG Dataset
    1. Produced a suite of basin characteristics (elevation, slope, latitude, min air temperature, max air temperature, etc.)
    2. Available on ScienceBase: https://doi.org/10.5066/P9HUWM6Q
  6. Accessing FCPG Data
    1. Web-based query service
    2. Queries FCPGs hosted on USGS ScienceBase via an HTTP query and a Lambda function
    3. Query watershed pour points directly from FCPGs via Cloud Optimized GeoTiffs (COG).
  7. Build your own FCPGs
    1. Software release for building one: https://doi.org/10.5066/P9W8UZ47
  8. Conclusions
    1. FCPGs provide access to hydrologically conditioned data without zonal statistics.
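The recipe in item 3 above reduces to: accumulate the parameter along flow directions, accumulate drainage area, and divide. A minimal NumPy sketch on a toy one-dimensional flow path (every cell drains to its right-hand neighbor, so cell i's upstream set is cells 0..i); this illustrates the arithmetic only, not the released FCPG software.

```python
import numpy as np

def fcpg_1d(parameter):
    """Toy 1-D flow-conditioned parameter grid.

    Step 1: accumulate upstream area (here, upstream cell count).
    Step 2: accumulate the upstream parameter values.
    Step 3: divide accumulated parameter by accumulated area to get
    the mean upstream parameter value at every cell.
    """
    acc_area = np.arange(1, len(parameter) + 1, dtype=float)
    acc_param = np.cumsum(parameter, dtype=float)
    return acc_param / acc_area

# Mean annual precipitation (mm) along a flow path, headwater -> outlet.
precip = np.array([1200.0, 1000.0, 800.0, 600.0])
fcpg = fcpg_1d(precip)
print(fcpg)  # [1200. 1100. 1000.  900.]
```

On a real 2-D grid the upstream sets come from a D8 flow-direction raster and a flow-accumulation algorithm, but each FCPG cell still holds the same quantity: accumulated parameter divided by accumulated area.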

Enabling AI for citizen science in fish ecology - Nathaniel Hitt, USGS

  1. Envisioning a way for researchers to get data that we need in a way that engages the public. Focus today on deep learning for individual fish recognition using convolutional neural networks.
  2. iNaturalist and others use similar techniques to identify species from an image, but not individuals (fish species vs. individual fish).
  3. Problem(s) addressed:
    1. Need new methods for population and trend analysis; we have too few observations to detect trends
    2. Species ID apps are becoming widely available but individual ID has not been developed for fisheries. 
    3. Public engagement - new approaches are needed for open science
  4. What was done
    1. Collected training images for individual brook trout in an experimental stream laboratory and in-situ
    2. Developed annotated imagery database for deep learning models
    3. Developed Python code
    4. Improved model performance
  5. Brook Trout
    1. Native cold water fish in eastern U.S. and the mountains of Appalachia.
    2. Main thing is the spotting pattern - could be a fingerprint/individual ID
  6. Held fish in an Experimental Stream Laboratory environment, photographed under controlled conditions to develop training data
    1. Created annotated imagery database
  7. Accuracy of model
    1. Getting up to ~80% accuracy for test data
    2. Limited by data size
  8. Spot patterns
    1. Essentially a fingerprint for brook trout at least
  9. Challenges
    1. Limited training data size
    2. Optimization: 1000s of permutations possible
    3. New methods needed to classify individuals not in the training dataset
    4. Requires additional work to extract spot patterns and link CNN to web platform
  10. Looking ahead:
    1. Applications for disease surveillance
    2. Tournament mark-recap experiments
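One common way to approach the open-set problem noted in the challenges (classifying individuals not in the training dataset) is to compare embedding vectors against a catalog of known individuals instead of predicting fixed classes. The sketch below is a hypothetical illustration of that idea; the fish IDs, spot-pattern embeddings, and similarity threshold are made up and are not the project's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two spot-pattern embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_individual(query, catalog, threshold=0.9):
    """Compare a query embedding against a catalog of known fish and
    return the best match, or None if nothing clears the threshold
    (i.e., the fish is treated as a new, unseen individual)."""
    best_id, best_sim = None, threshold
    for fish_id, emb in catalog.items():
        sim = cosine_similarity(query, emb)
        if sim > best_sim:
            best_id, best_sim = fish_id, sim
    return best_id

# Hypothetical catalog of spot-pattern embeddings for two known fish.
catalog = {
    "fish_001": np.array([0.9, 0.1, 0.3]),
    "fish_002": np.array([0.1, 0.8, 0.5]),
}
known = match_individual(np.array([0.88, 0.12, 0.31]), catalog)
unknown = match_individual(np.array([0.0, 0.0, 1.0]), catalog)
print(known, unknown)  # fish_001 None
```

Under this scheme a fish whose embedding matches nothing in the catalog is enrolled as a new individual rather than forced into an existing class, which is the behavior the open-set question asks about.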

Questions

  1. If the image library were continuously updated, what would be an ideal update frequency? Twice a year? Monthly?
    1. Tanja Williamson: Monthly, bi-monthly, or every 6 months: look at crowd-sourced images and accept them into the library.
  2. Are you able to get metrics on the use of the web-based query service? Or did you get feedback from any users on improvement?
    1. The more traditional approach would be the usual StreamStats approach (which could take 30 seconds or so). Not a web developer, so no exact metrics. Once the lambda is up and running, a query takes about 5 seconds.
  3. Can the model assign an "ID number" to a fish it has not seen before?

    1. Nathaniel Hitt: No answer yet; optimistic that there could be a neural prediction space. In the wild, most individuals will not be in the training dataset.