Peter Burkholder, a senior innovation specialist from 18F, was the guest presenter. 18F builds effective, user-centric digital services focused on the interaction between government and the people and businesses it serves. Peter is a DevOps engineer who has worked to develop cloud.gov and implement DevOps practices at 18F. He is also a geophysicist who previously worked at IRIS PASSCAL. His presentation covered best practices and technical implementation of automated infrastructure, resilient cloud operations, and continuous delivery pipelines.
Peter’s favorite 18F tools include
Viv Hutchison and Madison Langseth led a discussion that included a brief overview of the CDI DMWG session at the in-person meeting in June, data manager position descriptions for USGS, and contributed slides from working group members about data management staffing at their USGS science centers. In response to the poll question “If you consider yourself to be a data manager for your center, what is your current position description title?,” there were 19 different responses!
See slides and recording at the DMWG July meeting wiki page.
Josh Bradley and Dennis Walworth presented on the Open Source Metadata Toolkit, which was supported by the CDI from 2014-2015 (see project page on ScienceBase) and is still going strong!
The CDI Tech Stack group meets jointly with the ESIP IT&I group - access the slides and recording at the ESIP Tech Dive page.
The July meeting of the Fire Science Community of Practice (one of CDI’s newest collaboration areas) covered several topics. Mark Miller provided a short community update presentation. Josh Picotte gave a science talk describing the LANDFIRE remap effort that is currently underway. LF Remap is designed to produce vegetation and fuels data that inform wildland fire and ecological decision support systems. Sheila Murphy gave a second talk called "Arsenic and old mines - Wildfire remobilizes historical mining waste." Other relevant files from July are included on the meeting page, such as a Menlo Park lecture on USGS Fire Science that was given by Paul Steblein earlier in the month.
This month’s Fire Science CoP summary was provided by Mark Miller. See slides and other materials at the July 16 Fire Science Community of Practice Meeting page.
The Risk Community of Practice July meeting was "live" from the first Risk CoP meeting in Golden, CO. On July 18, 2019, after some brief announcements, the group heard short presentations from the PIs of the FY19 Risk CoP funded projects:
Title slide from Jaiswal, Nassar, et al. Risk project - Assessing the risk of global copper supply disruption from earthquakes.
Stakeholder engagement is an important piece of the USGS Risk Plan. But what does it mean to engage with stakeholders? What does co-production mean? What tools are used for engaging stakeholders and over what timelines during the course of a project? What types of challenges arise during stakeholder engagement? What are some of the surprising considerations to keep in mind while working with stakeholders?
This special session was live from the Risk CoP meeting in Golden, CO and featured a panel discussion on stakeholder engagement. Panelists answered the following questions: 1) What does stakeholder engagement mean to you? What does co-production mean to you? 2) When, during the course of a project, do you engage your stakeholders? 3) Describe three tools you use for engaging stakeholders? 4) Can you give an example of a challenge you have faced in doing stakeholder engagement and how you overcame these challenges? (Paperwork Reduction Act, protected information, confidentiality issues) 5) What are some surprising considerations to keep in mind when doing stakeholder engagement? (e.g., inclusivity, ethics, manner of approach).
This month’s Risk CoP summary was provided by Kris Ludwig.
Recordings are available on the Risk CoP meetings page (log in required).
Anyone following this blog may notice that I am making an effort to get up to speed to the present day, but am still a little bit behind. I still have great optimism about catching up, and these posts may help you reminisce about the summer.
At the July 10, 2019 CDI Monthly Meeting, we heard a proposal for ways to increase reusability of USGS datasets, and presentations from two map-based visualization and analysis tools. In addition, Kevin Gallagher reported on demographics, presentation materials, and take-aways from the CDI Workshop “From Big Data to Smart Data” that was held in June 2019 in Boulder, CO.
Responses to the CDI post-workshop survey showing the varied job descriptions in our community.
Richie Erickson presented a Scientist’s Challenge in exploring the use of Jupyter Notebooks to increase reusability of USGS datasets. He is focusing on smaller, project-level datasets that require explanation of disciplinary expertise and statistical analyses. To learn more, you can get in contact with Richie Erickson at firstname.lastname@example.org. See his slides here.
Image of the CDI-funded Online Landslide Inventory.
Ben Mirus’s presentation on a new national landslide inventory highlighted important considerations when integrating incomplete and disparate data. State boundaries often showed mismatches in data quantity or quality. Other topics of CDI interest included defining confidence metrics for the landslides, deciding on dataset update frequency, putting data releases through internal review, best practices for viewing heterogeneous data, identifying areas that need better data collection, and links from our science to governmental policy. Read more at Landslide Risks Highlighted in New Online Tool. This project is an FY18 CDI Funded Project, with more information at its ScienceBase page.
Example of US Topo map with National park boundary and water data.
Elizabeth McCartney and Greg Matthews’ presentation on the National Digital Trails Network showed a system that takes existing trails and then uses an algorithm to identify and evaluate potential connections between trail systems using data like land type (owner), slope, and hydrography/river crossings. If you are interested in learning more, you can contact the team at any of the following addresses: email@example.com, firstname.lastname@example.org, email@example.com.
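To make the idea of evaluating potential trail connections concrete, here is a minimal sketch of one common approach, a least-cost path over a grid. This is an assumption for illustration only and is not the project's actual algorithm. The cost grid and its values are hypothetical; in a real analysis each cell's cost might be derived from slope, land ownership, and river crossings.

```python
import heapq


def least_cost_path(cost, start, goal):
    """Dijkstra's algorithm on a 4-connected grid.

    cost[r][c] is the (hypothetical) cost of stepping into cell (r, c).
    Returns (total_cost, path), where path is a list of (row, col) cells.
    """
    rows, cols = len(cost), len(cost[0])
    best = {start: 0.0}   # cheapest known cost to reach each cell
    prev = {}             # back-pointers used to rebuild the path
    queue = [(0.0, start)]
    while queue:
        dist, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if dist > best.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_dist = dist + cost[nr][nc]
                if new_dist < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_dist
                    prev[(nr, nc)] = cell
                    heapq.heappush(queue, (new_dist, (nr, nc)))
    if goal not in best:
        return float("inf"), []
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return best[goal], path[::-1]


# Hypothetical cost surface: 1 = easy terrain, 9 = steep slope or river crossing.
grid = [
    [1, 1, 9],
    [9, 1, 9],
    [9, 1, 1],
]
total, path = least_cost_path(grid, (0, 0), (2, 2))
print(total, path)
```

A candidate connection between two trail systems could then be ranked by its total cost, with lower-cost routes being more promising.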
The recording of the meeting is available at the monthly meeting page if you are signed in as a CDI member.
In addition to the meetings described below, several collaboration areas met during the face-to-face CDI Workshop in Boulder, CO, June 3-7!
Chris Gorgolewski presented on “Google Dataset Search: Facilitating data discovery in an open ecosystem."
Talk description: There are thousands of data repositories on the Web, providing access to millions of datasets. In this talk, I will discuss recently launched Google Dataset Search, which provides search capabilities over potentially all dataset repositories on the Web. I will talk about the open ecosystem for describing and citing datasets that we hope to encourage and the technical details on how we went about building Dataset Search. Finally, I will highlight research challenges in building a vibrant, heterogeneous, and open ecosystem where data becomes a first-class citizen.
Related links: https://toolbox.google.com/datasetsearch (Accessible when not signed in with a Dept of Interior Google account), https://www.blog.google/products/search/making-it-easier-discover-datasets/
Slide from Chris Gorgolewski's talk on Google Dataset Search.
The recording can be found on the ESIP Tech Dive meetings page. Dave Blodgett and Rich Signell are the group leads.
Pete Doucette provided a review of recent AI/ML-related Strategic Science Planning at USGS. This included thoughts captured from the recent USGS 21st Century Science Workshop (May 2019) at the National Conservation Training Center, and the CDI Workshop in Boulder, CO (June 2019).
The recording can be found on the AI/ML Meetings page. Pete Doucette and JC Nelson are the group leads.
The Semantic Web Working Group's June discussion centered on persistent identifiers for metadata records and vocabularies that are consistent with the FAIR principles. The group identified next steps on persistent identifiers for metadata records (could DataCite DOIs be used?) and next steps for achieving FAIR vocabularies (persistent identifiers for keywords, which is related to encouraging or requiring keywords that are from online vocabularies, and will be a step toward interoperability of vocabularies through use of ontologies.)
Text contributed by Fran Lightsom, SWWG lead! See more at the SWWG meeting notes page.
The group heard an engineer's perspective on risk from Nico Luco, who discussed the Earthquake Hazards Program's “Engineering and Risk” project, which contributes to delivering information for building codes and risk assessments. Next, Nate Wood provided an overview of the "Strategic Hazard Identification and Risk Assessment (SHIRA) on DOI Resources" Project, including an introduction to the DOI Risk Map, related data resources, and a relative threat matrix currently in development. This month’s summary is contributed by Risk CoP co-lead Kris Ludwig!
See more at the Risk Community of Practice Meetings page.
Related publication: Wood, N., Pennaz, A., Ludwig, K., Jones, J., Henry, K, Sherba, J., Ng, P., Marineau, J., and Juskie, J., 2019, Assessing hazards and risks at the Department of the Interior—A workshop report: U.S. Geological Survey Circular 1453, 42 p., https://doi.org/10.3133/cir1453.
The Software Development Cluster welcomed new cluster co-lead Jeremy Newson, and reminded participants that the USGS Software Management Website is up and running at https://www.usgs.gov/products/software/software-management/.
At the June meeting, the cluster reviewed the many related sessions at the CDI workshop, including Software Release Q&A, the Software Development Cluster Breakout Session, a Software Release Practicum, and a Software Birds-of-a-Feather Lunch. Discussions in those sessions included considerations and ideas for cross-USGS collaboration, institutional support, and software developer career paths at the USGS.
Some ideas and take-aways from the discussions include:
Full notes can be found at the workshop Slides, Recordings, and Notes page (if you log in as a CDI member). Cassandra Ladino, Michelle Guy, and Jeremy Newson are the cluster leads.
The May 8, 2019 CDI Monthly Meeting featured two CDI project teams and a presentation about NSF-funded lidar data management capabilities.
Hans Vraga presented on the motivation and technical details of an Ice Jam Hazard website and reporting system. The cloud-first system demonstrated use of the latest cloud technologies in a USGS mobile-friendly application. Hans is part of the Web Informatics and Mapping (WIM) team, which develops web-based tools that support USGS science and other federal science initiatives. You can see some of their other projects here: https://wim.usgs.gov/i/projects/
Jess Walker presented on her experience in developing a workflow for lidar processing and analysis in the cloud for USGS datasets. Working with the USGS Cloud Hosting Solutions team, she searched for solutions for processing and analyzing smaller-size (long-tail) lidar datasets using software like Entwine (https://entwine.io/) and Potree (http://potree.org/).
Chris Crosby from UNAVCO showed how OpenTopography (https://opentopography.org/) facilitates community access to high-resolution, Earth science-oriented, topography data, and related tools and resources. He also described upload and archiving for small to moderate sized topographic datasets in the Community Dataspace.
The OpenTopography Tool Registry provides a community populated clearinghouse of software, utilities, and tools oriented towards high-resolution topography data (e.g. collected with lidar technology) handling, processing, and analysis.
Here’s a roundup of recent CDI collaboration area topics from the month of May!
VeeAnn Cross and Peter Schweitzer reviewed use of keywords in the USGS Science Data Catalog. Choosing good keywords is an important part of creating a USGS data release, and there is an opportunity to work together to better align the terms being used. One tip is to make sure there are USGS Thesaurus and ISO terms being used, and not to make up keywords that are not part of one of the suggested vocabularies.
Here’s more guidance on keywords and suggested vocabularies from our trusty Data Management website:
Recent post on the Metadata Reviewers forum: Data Dictionaries as a standalone product?
Announcement: If you are using code.usgs.gov, the official source code archive for USGS, there is a collaborator request form and a public repository request process. Contact firstname.lastname@example.org for these resources.
Kyle Goodwin from GitLab presented on “GitLab as a web-based DevOps lifecycle tool.” GitLab is not GitHub, although both use git, a distributed version-control system for tracking changes in source code during software development. GitLab is a platform for a DevOps-driven software development lifecycle. GitLab is the official source code archive for USGS (https://code.usgs.gov/public/) and provides a toolchain that includes project management, software repo, CI/CD (continuous integration/continuous deployment), metrics, and monitoring. Some groups use GitHub for repository features but GitLab for CI/CD. https://about.gitlab.com/devops-tools
Did you know that members of the Semantic Web Working Group created the buzzword bingo sheets for the CDI workshop? Thank you to Peter Schweitzer and Fran Lightsom for coordinating. Do you hear the following during CDI presentations? Leverage, game changer, compliance, smart data, portal, authoritative, internet of things, analytics, best practices, takeaway, innovative, crowdsourcing, buy-in, stakeholder, framework, workflow, carrot and stick, pitch deck, quick win, modernization, cultural change...
A couple example bingo sheets:
Chris Holmes presented on the SpatioTemporal Asset Catalog (STAC) specification, an emerging standard to make it easier to find geospatial information. It aims to enable a cloud-native geospatial future by providing a common layer of metadata for search and discovery, while playing well with the web and existing geospatial standards.
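As a rough illustration of what a "common layer of metadata" looks like, the sketch below builds a minimal STAC Item, which is a GeoJSON Feature with a few extra fields. The item id, bounding box, date, and asset URL are hypothetical, and the version string is an assumption (it is not the version that was current at the time of the talk). Consult the specification for the authoritative list of required fields.

```python
import json


def make_stac_item(item_id, bbox, datetime_utc, asset_href):
    """Build a minimal STAC Item as a plain dict (a GeoJSON Feature)."""
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",  # assumed version string
        "id": item_id,
        "bbox": [west, south, east, north],
        # Footprint polygon derived from the bounding box (closed ring).
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        # "datetime" is the core searchable property of an Item.
        "properties": {"datetime": datetime_utc},
        "links": [],
        # Assets point at the actual data files.
        "assets": {
            "data": {
                "href": asset_href,
                "type": "image/tiff; application=geotiff",
            }
        },
    }


# Hypothetical values, for illustration only.
item = make_stac_item(
    "example-scene-001",
    (-105.3, 39.9, -105.1, 40.1),
    "2019-05-01T00:00:00Z",
    "https://example.com/example-scene-001.tif",
)
print(json.dumps(item, indent=2))
```

Because the result is ordinary JSON, it can be served as a static file or indexed by a search API, which is part of what lets STAC play well with the web.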
Find out more about STAC at http://stacspec.org/
Chris Holmes is also involved with the Radiant Earth Foundation, which you can read more about here: Creating a Machine Learning Commons for Global Development
The Tech Stack working group meets jointly with the ESIP Information Technology and Interoperability Committee, and is led by Dave Blodgett and Rich Signell.
Tamar Norkin and Ricardo McClees-Funinan presented on “Behind the Scenes at ScienceBase: How Data Release happens in your USGS Trusted Digital Repository (TDR).”
They highlighted the ScienceBase Data Release (SBDR) Tool, a very handy way to start your USGS data release. The SBDR Tool can be found here: https://www.sciencebase.gov/datarelease.
For data release questions and requests, you can get in touch with the data release team at email@example.com.
After filling out the ScienceBase Data Release Tool form, you get a new landing page in ScienceBase, a reserved Digital Object Identifier, and an email with instructions!
Chris Barber from USGS EROS presented on XGBoost in Continuous Change Detection and Classification (CCDC). Chris explained how XGBoost (eXtreme Gradient Boosting, an open-source software library that provides a gradient boosting framework) improved the efficiency and accuracy of segment classification and land cover extraction for LCMAP (Land Change Monitoring, Assessment, Projection). Chris gave a good introduction to the concepts of decision trees, decision tree ensembles, and boosted trees. However, he urged us to remember that there is no substitute for appropriate training data. Email Chris for his extensive reference list - firstname.lastname@example.org.
Peter Esselman, USGS Great Lakes Science Center, also presented on Deep learning to quantify benthic habitat.
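For readers new to boosted trees, here is a minimal from-scratch sketch of the core idea: each new tree (here a one-split "stump") is fit to the residual errors of the ensemble so far. This is not XGBoost itself, which adds regularization, second-order gradient information, and many engineering optimizations, and it is not the LCMAP workflow. The data and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    mean_all = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], mean_all, mean_all)  # fallback: no valid split
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for squared loss: fit each stump to current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


def mean_squared_error(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical training data: a noisy step function.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 1.1, 4.8, 5.1, 5.0, 4.9]

model = boost(xs, ys)
baseline_mse = mean_squared_error([sum(ys) / len(ys)] * len(ys), ys)
boosted_mse = mean_squared_error([predict(model, x) for x in xs], ys)
print(baseline_mse, boosted_mse)
```

Note that the sketch only measures error on its own training data. As Chris emphasized, how well a boosted model generalizes depends on having appropriate training data.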
From Peter Esselman's talk - image showing various tools that are the future of Great Lakes science.
The Citizen-Centered Innovation group discussed the final draft of the OSTP Report on Prizes and Citizen Science Projects, the USGS Open Innovation Strategy, and the DOI Generic Information Collection Request. They also highlighted relevant upcoming seminars and events in the broader Federal sphere.
Sara McBride of the Earthquake Science Center presented on Social Science 101: a Primer. Some conclusions: Social science is a big field with a lot of disciplines, each examining the human experience with its own unique lens. Doing it well requires years of study; therefore, DIY social science is not recommended. There are a number of social scientists within the USGS: reach out and ask us questions!
Some related resources
Wilkins EJ, Miller HM, Tilak E, Schuster RM (2018) Communicating information on nature-related topics: Preferred information channels and trust in sources. PLoS ONE 13(12): e0209013. https://doi.org/10.1371/journal.pone.0209013
https://my.usgs.gov/hd/: HDgov is a multi-agency website for all things human dimensions of natural resources. Here you can access a variety of resources to assist you in your work.
Dale Cox presented on SAFRR (Science Application for Risk Reduction) Projects and Scenarios for Risk Reduction. Dale has been involved in many scenario projects and is in the process of looking back and evaluating some of the scenarios. What is a scenario? Principles of a scenario: a single, large, but plausible event that we need to be ready for; integration across many disciplines; use of the best hazard science; consensus among leading experts; a study created with community partners; and results presented in products that fit the user, not the scientist.
Some related resources:
USGS Earthquake Scenario Map: https://earthquake.usgs.gov/scenarios/related.php
Slides explaining the components of a scenario and the Science Application for Risk Reduction (SAFRR) scenarios.
The Risk group also announced its inaugural RFP Awards!
Chris Merkes from UMESC presented on Choosing the right eDNA assay: Developing standards for Limit of Detection and Limit of Quantification. This work is planned for release soon in a new environmental DNA journal.
A resource discussed: https://github.com/cmerkes/qPCR_LOD_Calc
Merkes CM, Klymus KE, Allison MJ, Goldberg C, Helbing CC, Hunter ME, Jackson CA, Lance RF, Mangan AM, Monroe EM, Piaggio AJ, Stokdyk JP, Wilson CC, Richter C. (2019) Generic qPCR Limit of Detection (LOD) / Limit of Quantification (LOQ) calculator. R Script. Available at: https://github.com/cmerkes/qPCR_LOD_Calc. DOI: https://doi.org/10.5066/P9GT00GB.
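As a simplified illustration of the limit of detection concept, the sketch below assumes one commonly used definition: the lowest standard concentration at which at least 95% of replicates are detected, with every higher concentration also meeting that rate. That definition is an assumption here and this is not the method implemented in the R script cited above. The replicate data and function names are hypothetical.

```python
def detection_rates(replicates):
    """Map each standard concentration to its fraction of detected replicates."""
    return {conc: sum(hits) / len(hits) for conc, hits in replicates.items()}


def discrete_lod(replicates, min_rate=0.95):
    """Lowest concentration where it and all higher ones meet min_rate.

    Returns None if even the highest concentration falls short.
    """
    rates = detection_rates(replicates)
    lod = None
    for conc in sorted(rates, reverse=True):
        if rates[conc] < min_rate:
            break
        lod = conc
    return lod


# Hypothetical standard curve: copies per reaction -> detected (True/False)
# for each of 20 replicates.
standards = {
    1000: [True] * 20,
    100: [True] * 20,
    10: [True] * 19 + [False],
    1: [True] * 12 + [False] * 8,
}

print(detection_rates(standards))
print(discrete_lod(standards))
```

A discrete threshold like this can only return one of the tested concentrations, which is one reason curve-fitting approaches are often preferred for reporting.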
Slide from Chris Merkes' talk illustrating Limit of Detection and Limit of Quantification.
The Software Development Cluster discussed Docker basics for code development.
More notes and links are in their Meeting Notes (accessible to DOI users).
All CDI Collaboration Areas may be browsed on the CDI wiki.
Summary extracted from notes of Fran Lightsom, lead of the Metadata Reviewers group:
Sheryn Olson demonstrated the metadata collecting system used by MonitoringResources.org to encourage discussion of how it might be simpler and easier to use, as well as good ideas that the rest of us can copy. MonitoringResources.org is part of the Pacific Northwest Aquatic Monitoring Partnership (PNAMP) and uses the metadata to provide an index of monitoring activities, especially the ecology of streams of the U.S. Pacific Northwest, and the procedures, protocols, and monitoring designs that are in use.
View more notes and the presentation slides on the Metadata Reviewers Meetings page.
Summary provided by Derek Masaki, co-lead of the DevOps group:
Presenters: Kevin Portanova, Director of IT for Public and Indian Housing, and Mel Hurley, DevOps Manager. The presentation provided an overview of the shift that HUD is taking away from traditional on-premise IT operations toward cloud-focused DevOps. Kevin and Mel took us through their process of re-organizing a contractor based IT environment, re-factoring their development process, and creating a Federal employee centric staff oriented toward Agile and a DevOps workflow in the Microsoft Azure environment.
See the slides on the DevOps Meetings page.
The DMWG heard two presentations, first from John Faundeen and Natalie Latysh about “Becoming a USGS Trusted Digital Repository,” and second from Viv Hutchison and John Faundeen on “Progress on a USGS Data Manager Position Description Series.”
The slides and recording are posted on the meeting page.
John Karabaic presented on Pachyderm, a data science platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance. Read the docs here: http://docs.pachyderm.io/en/latest/index.html
Tech Stack calls are joint with the ESIP Interoperability and Technology Tech Dive Webinars. You can review the recording here.
Kevin Lafferty, senior ecologist at Western Ecological Research Center, presented on White Shark eDNA. In recent work he has been refining methods to get better data from white shark eDNA. Kevin is based in Santa Barbara, CA, and surely made many people jealous while describing data collection with instruments on paddle boards.
View the recording on the Bioinformatics Meetings page.
Kevin is looking for new collaborations within USGS and you can email him at firstname.lastname@example.org if interested. (Remember: data collection with instruments on paddle boards.)
Sophia Liu led a discussion covering many topics, including the OSTP Draft Report to Congress for the Crowdsourcing and Citizen Science Act, a Dept of the Interior Generic Information Collection Request, the USGS Open Innovation Strategy, and the CitizenScience.gov Website, including USGS CCS Projects. She also covered past and upcoming events like the Citizen Science Association (CSA) Conference (March 13-17, 2019), the Federal Crowdsourcing Webinar - Episode 1: Citizen Science, and upcoming Federal Crowdsourcing Webinars that can currently be found on this page: https://digital.gov/events/. Sophia’s use of Mentimeter added a great element of interactivity to the meeting. See more on the group wiki page.
Kris Ludwig and Dave Ramsey lead the Risk CoP and hosted a call with presentations about the benefits of communities of practice (Leslie Hsu, CDI Coordinator) and user engagement in the development of ShakeCast (Dave Wald, Seismologist).
With respect to user engagement, Dave shared several titles that present “logical approaches for bringing products to users,” including The Power of Habit, Contagious, To Sell is Human, Nudge, Made to Stick, Diffusion of Innovations, and The Undoing Project. Book club, anyone?
View the presentations and recording on the Risk Meetings page.
Reads related to user engagement recommended by David Wald.
Cassandra Ladino led a discussion on building connections, inspired by this Better Scientific Software post: Building Connections and Community within an Institution.
The group had recently fielded a question about desktop installers, and the challenges of code signing. An internal site on application and script signing was shared. Some group members were also of the opinion that providing a method to install your application using Anaconda (on all OSs) was adequate.
A huge thanks to the three CDI Project teams who presented at our April Monthly Meeting.
Caitlin Andrews, a landscape ecologist in the Southwest Biological Science Center, explained how she used R Shiny and Amazon Web Services to create an interactive online front-end for a proven model of ecosystem water balance, SOILWAT2. This tool helps to predict and understand site-specific risk of future drought. Lots of lessons here for people who want to make user-friendly online tools out of more traditional scientific models within the USGS IT ecosystem. Code repository at https://github.com/DrylandEcology
Matt Neilson, a fishery biologist and co-lead for the Nonindigenous Aquatic Species Database program, delivered the line of the day: We are living in a machine-readable world. His project uses natural language processing and the xDD (eXtract Dark Data, formerly GeoDeepDive) literature database to improve, modernize, and greatly increase the efficiency of literature review. For people who used to walk to the library and photocopy stuff (and record radio songs on cassettes and dial with rotary phones), this is strange, but I will attempt to evolve with the times. See more information, like code repositories, in the Related External Resources links on the project's ScienceBase page.
Jon Warrick, research geologist in the Coastal/Marine Hazards and Resources Program, described the software tools, resources, and training workshops developed to allow USGS scientists to apply deep learning to remotely sensed imagery and better understand natural hazards and habitats. The 2 in-person workshops on these tools held in 2018 were able to accommodate only a fraction of the interested applicants. The CDI hopes to be able to provide more trainings like this to help build deep learning expertise and capacity in the USGS. See more at https://github.com/dbuscombe-usgs/cdi_dl_workshop and https://github.com/dbuscombe-usgs/dl_tools.
Log in to see the meeting recording and slides at the meeting page.
Cassandra Ladino led a brainstorming session for topics that could be discussed within the Software Development cluster, using sli.do to collect ideas and trello to organize them. Some ideas included: code.usgs.gov - what is it, who should use it and when; Using US Web Design System in USGS web sites; Docker training for distributing scientific software; Python APIs using Swagger and/or Flask; How to grow grassroots development efforts to enterprise systems; Creating a community of practice for unit testing code so that it can be easily reviewed by anyone in the software dev community; Should there be separation between scientific software and web development software discussions? (pros and cons). Lots of exciting topics!
Risk Community of Practice leads Kris Ludwig and Dave Ramsey introduced the new Risk Community of Practice, reviewed the USGS Risk Plan and implementation plans for FY19, and announced the FY19 Risk RFP. The purpose of the group is to
build connections across centers, programs, mission areas
create a central point of contact for USGS risk research and applications
identify needs and opportunities to benefit the community
generate project ideas
share resources, expertise
Besides the Risk Plan, another recent publication mentioned was Assessing Hazards and Risks at the Department of the Interior—A Workshop Report, by Nate Wood, Alice Pennaz, Kristin Ludwig, Jeanne Jones, Kevin Henry, Jason Sherba, Peter Ng, and others.
Mattia Almansi from Johns Hopkins University presented on Integrating SciServer and OceanSpy. OceanSpy is an open-source and user-friendly Python package that enables scientists and interested amateurs to use ocean model data sets with out-of-the-box analysis tools. OceanSpy builds on software packages developed by the Pangeo community (in particular xarray, dask, and xgcm). OceanSpy accelerates and facilitates exploration (including visualization) of terascale data. (Adapted from the presentation abstract.)
See more, including a link to the recorded session, on the group presentation website, hosted by ESIP - the Earth Science Information Partners. TSWG contacts are Dave Blodgett and Rich Signell.
The Semantic Web Working Group held a discussion about Semantic Web elements at the upcoming CDI Workshop. Ken Bagstad mentioned the breakout session he is co-leading at the workshop, which will include semantics in the context of predictive modelling, intersecting with artificial intelligence and machine learning. Other topics included FAIR (findability, accessibility, interoperability, and reusability) in machine- and human-readable contexts and the importance of standard data dictionaries.
What I learned at the AI/ML group call:
USGS is setting up a new machine for AI; it is named Tallgrass after an NPS park in Kansas
Projected timeline for the set up: mid April - Tallgrass Installation; Early May - friendly testing; early June - general availability.
Reminder of what GPUs are vs. CPUs
AI for Ecosystem Services: What if our data and models could talk to one another, and decision makers could use scientific information to more quickly and reliably answer questions about today’s most urgent problems? Find out more at http://www.integratedmodelling.org
JC pointed out some activity on the AI/ML forum and encouraged members to post
Group leads reminded members to contribute to a spreadsheet for collecting USGS AI/ML project descriptions to communicate to USGS leadership.
You should think of this image whenever we mention the Tallgrass infrastructure. (from the NPS Tallgrass Prairie website)
Cassandra Ladino led the working group in a discussion of topics to be discussed at the CDI Workshop or at future DMWG meetings. Some ideas for further discussion included:
Data Management Plans - streamlining process from DMP to publishing; enforcing; hosting
QMS (Quality Management System for USGS labs) integration with data management and records management
Metadata for the National Digital Catalog
More information and guidance on USGS Software Release
UAS (Unmanned Aircraft Systems/AKA Drone) data
Data sharing agreements
Martin Folkoff, lead DevOps engineer at Booz Allen Hamilton, provided a technical overview of the DevOps environment he has designed and the CI/CD (continuous integration/continuous deployment) pipeline employed by his teams at BAH. He provided a look at the tools he uses to orchestrate his production environments.
The Metadata Reviewers Community of Practice will be hosting a breakout session at the CDI Workshop to provide guidance for data and metadata review, and tips and tricks for data and metadata authors. Virtual participation is planned.
The ISO Content Specs project will be hosting workshop sessions on Thursday and Friday at the CDI Workshop. The sessions will focus on collecting requirements for metadata specification modules, most likely modules for experimental data, computational data, and observational data. To learn more, contact Dennis Walworth, Fran Lightsom, or Lisa Zolly.
At the March 13, 2019 monthly meeting, CDI’s executive sponsor Kevin Gallagher talked about the theme of this year’s CDI workshop: From Big Data to Smart Data - this concerns turning our huge volumes of diverse data into usable, actionable, integratable, or “smart” data. Registration for the workshop (June 4-7, 2019 in Boulder, CO) is open and can be found on the workshop wiki page.
We heard presentations from three FY18 CDI Funded Projects:
Wesley Daniel presented on the Nonindigenous Aquatic Species Alert Risk Mapper and reported that the team will be posting a write-up of their challenges transitioning to ArcGIS Pro as part of their outcomes. See more accomplishments on their ScienceBase page.
Dennis Walworth and Fran Lightsom presented on the Transition to ISO metadata project and reported that the project team will host several activities at the CDI workshop; they are looking for users to test their interface. They are using the previously-funded mdEditor application (ScienceBase page) in their work.
Nate Wood and Jeanne Jones presented on the Department of Interior Risk and CDI Risk Map. They reported many links that are available for Department of Interior users to test out, including data description, codebase, the risk map, GeoServer, and the API. CDI members, go to the meeting page and log in to view their slides - links are on the last slide.
The DOI Risk Workshop Report is out! Wood, N., Pennaz, A., Ludwig, K., Jones, J., Henry, K, Sherba, J., Ng, P., Marineau, J., and Juskie, J., 2019, Assessing hazards and risks at the Department of the Interior—A workshop report: U.S. Geological Survey Circular 1453, 42 p., https://doi.org/10.3133/cir1453.
Hans Vraga from the Web Informatics and Mapping Program (WIM, wim.usgs.gov) gave an overview of the group, of which he is the Project Manager. WIM is a web development shop that has cooperators from both within and outside of the USGS. Some of their products include a SPARROW model output visualizer, StreamStats, and a WHISPers wildlife event reporting system (coming soon).
As you can imagine, their expertise is in high demand. In cooperators, they look for scientific or subject matter expertise that complements their group’s technical expertise, a cooperator who acts as an active product owner, a focus on development with minimal time spent on operations, and projects with fast turnaround times. Check out their website or contact Hans Vraga for more information.
In February, the group had two major questions come up for discussion - these were passed along to the appropriate committees and officials for guidance and answers were produced quickly!
First: Is there updated guidance on the volume of data necessary to trigger a separate data release? (As opposed to a table in a publication.) Short answer: Having the data in the paper is ok - however, if data is big enough to be moved into a supplemental section of the paper, it has to be a USGS data release.
Second: How should authors reference data that is not publicly available when writing a manuscript? Short answer: there is updated guidance on the FSP “Guide to Data Releases” page for data that are not available at the time of publication, or that have limited availability owing to restrictions, in the section Data Associated with a Publication.
John Stock of the USGS Innovation Center joined to talk about opportunities for postdoctoral research, future workshops, and future discussions related to AI/ML in the USGS. The joint USGS-NASA postdoctoral fellowships are now posted: https://geography.wr.usgs.gov/InnovationCenter/fellowship.html
Pete Doucette presented a talk, “Ruminations on AI and Land Imaging.” He included a great intro on the difference between the AI and machine learning of decades ago and the capabilities available now (e.g., neural networks versus DEEP neural networks). Several land imaging projects and datasets at the USGS are becoming more “analysis-ready” for data science, for predictive analytics, and for informing decisions. For example, see “Continuous change detection and classification of land cover using all available Landsat data” (Zhu and Woodcock, 2014).
A major theme was the need for the combination of disciplinary expertise and AI/ML expertise, essentially team science, in order to reach the full potential of AI/ML. (See the NAS report Enhancing the Effectiveness of Team Science.)
A White House Fact Sheet on “Accelerating America’s Leadership in Artificial Intelligence” was shared with the group by Mona Khalil and Leah Colasuonno.
A few slides from Pete Doucette's talk on AI and Land Imaging.
Cassandra Ladino stepped in to lead the February Semantic Web Working Group discussion, which focused on the theme of FAIR (Findable, Accessible, Interoperable, Reusable) in USGS. The group discussed ideas for a proposed FAIR Workshop, including the topic of new approaches and technologies to further enhance FAIRness at USGS. See the meeting notes for more resources and references.
The joint ESIP Tech Dive - CDI Tech Stack presentation was on “Cloud Native Geoprocessing of Earth Observation Satellite Data with Pangeo,” by Scott Henderson, University of Washington. “The integration of new technologies with several high-level Python packages are enabling Cloud-native workflows and circumvent the bottleneck of downloading large amounts of data.”
Aptly summarized: “If that doesn’t get people excited I don’t know what will,” said Rich Signell, co-chair of the Tech Stack Group.
Screenshot from a demo linked to the post "Cloud Native Geoprocessing of Earth Observation Satellite Data with Pangeo."
The latest monthly eDNA webinar, organized by Scott Cornman, was on CALeDNA (California Environmental DNA) and was given by Rachel Meyer of UCLA. CALeDNA capitalizes on the enthusiasm of citizen scientists by providing kits for collecting data in the field. Data collectors also take iNaturalist observations for benchmarking. The data are provided online for the public to identify patterns, and are also used for academic research on topics like phylogenetic diversity and functional diversity.
CALeDNA used the Kobo toolbox to build their data collection form; they found it to be the most robust platform for cell phone data collection. https://www.kobotoolbox.org/
Ranacapa is an R package developed so that non-specialists without a community ecology background can generate the relevant plots. “Ranacapa: An R package and Shiny web app to explore environmental DNA data with exploratory statistics and interactive visualizations” https://f1000research.com/articles/7-1734/v1
Check out one of their case studies and the data visualizations available! https://data.ucedna.com/research_projects/pillar-point
A few slides from Rachel Meyer's talk on the California eDNA program.
The Software Development Cluster hosted a discussion on Cloud and Big Data in the Cloud. Cassandra Ladino started off the discussion with a presentation on Cloud and Big Data, including a summary of resources she has been using to learn more. There is information in the notes on how to sign up for a USGS Cloud Hosting Solutions Sandbox.
Our first monthly meeting of 2019 was on February 13, and we heard about forward-looking water research tools, new outputs to help resource managers deal with invasive species, and information about how to get the most out of the upcoming June CDI workshop. View the recording and slides on the February 13 Monthly Meeting page.
Tony Castronova of CUAHSI (Consortium of Universities for the Advancement of Hydrologic Science, Inc.) gave an overview of HydroShare and CUAHSI-JupyterHub, tools that help researchers develop, save, and share water research workflows. This gave a cool perspective on tools that use USGS water data and complement existing USGS tools. CUAHSI has a large education component, including plentiful cyberseminar presentations that address topics of interest overlapping with the CDI!
HydroShare workflow at https://www.hydroshare.org/
Jake Weltzin opened a series of CDI funded project presentations that will occur in the next few months, presenting on “Workflows to support integrated predictive science capacity: Forecasting invasive species for natural resource planning and risk assessment.” In addition to the daily map forecasts and other outputs about invasive insect activity, the project team is working on a report that will outline their experiences with stakeholder engagement.
Finally, Madison Langseth and I gave some of the latest information about how everyone can benefit from the upcoming CDI Workshop in Boulder, June 4-7. Right now we are focusing on getting community members to submit and comment on session ideas by the end of February, so that we can organize the topics in early March. Also, we are working on stepping up our game for virtual participation and interactive content that will help members meet and connect with each other.
Join us on March 13 for our next monthly meeting and more presentations from CDI community selected and supported projects!
The Metadata Reviewers group met and continued to share resources for effective metadata review. Among the topics that they discussed:
Guidelines for metadata review (google doc link available to Dept of Interior users) that were first discussed in November, led by Tamar Norkin.
Sharing other Data Release Guidance resources, of which there were many but a few include:
The group also has recent Q&A posted on their Metadata Discussion Forum.
The DMWG had three very informative presentations in December!
Kelly Haberstroh – Updates about the Publications Warehouse
Dennis Walworth – Updates on ISO for USGS: content specifications and current status of ADIwg
Lisa Zolly – Updates to the Digital Object Identifier Tool
JC Nelson hosted the first AI/ML CDI call and discussed plans for the group. Over the next several months, we will hear from different researchers around the USGS who are incorporating AI/ML techniques into their work. The group will also be a forum for questions for practitioners, such as one asked by Michelle Guy: Are people doing AI/ML work in the cloud, on local GPU hardware, or another option?
The group will stay in touch with another USGS effort focused on AI and image processing. This group was initiated in the Ecosystems Mission Area and is led by Mona Khalil. Mona held calls on 12/18 and 12/19 that focused on hearing about current activities and resources for AI and image processing.
Both groups mentioned the dl_tools lectures and toolboxes that were developed and presented by Daniel Buscombe and others with support from the CDI!
dl_tools method schematic from https://dbuscombe-usgs.github.io/dl_tools
The Tech Stack group invited Ian Rose (University of California, Berkeley) to demo “Developing JupyterLab Extensions.” Ian took us through a live demo of the process of building a JupyterLab extension. “In fact, the whole of JupyterLab itself is simply a collection of extensions that are no more powerful or privileged than any custom extension.”
For fun: From the JupyterLab documentation: Let's Make an xkcd JupyterLab Extension
Visit the joint Tech Stack and ESIP Tech Dive webinar page to see the next few months of topics!
Image from the JupyterLab documentation: https://jupyterlab.readthedocs.io/en/stable/developer/xkcd_extension_tutorial.html
Another month and another group of topics - stay informed!
The group, led by Tamar Norkin, had a discussion on the Guidelines for Metadata Review. They discussed ways to improve the usability of the document as an actual checklist, and what information would be good to include, such as “tips and tricks” for metadata reviewers. Looks like a great resource for anyone who is called upon to review metadata!
In addition to regular updates on the USGS Git Hosting Platform and the USGS Software Management website, in November the DevOps group heard about recent Recreation.gov activities from Shums Hoda and Martin Folkoff of Booz Allen Hamilton.
Recreation.gov is a gateway to discover America's Outdoors and more, a place for trip planning, information sharing and reservations with information from 12 federal Participating Partners.
The website is at https://www.recreation.gov. API documentation of the RESTful services for the Recreation Information Database are at https://ridb.recreation.gov/docs. Other topics covered included microservices and domain driven design, and high level architecture.
What's the tech behind reserving your campsites at recreation.gov?
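For readers who want to try the RESTful services mentioned above, here is a minimal sketch of how a request to the Recreation Information Database could be assembled in Python. It only builds the request and does not send it. The base path, the "facilities" resource, the "query" and "limit" parameters, and the "apikey" header are assumptions for illustration, and the helper name build_ridb_request is hypothetical; check all of them against the documentation at https://ridb.recreation.gov/docs before use.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base path for the RIDB REST API; confirm in the official docs.
RIDB_BASE_URL = "https://ridb.recreation.gov/api/v1"


def build_ridb_request(resource, api_key, **params):
    """Build (but do not send) a GET request for a RIDB resource.

    `resource` and the keyword parameters are illustrative; valid names
    are whatever the RIDB documentation lists.
    """
    url = f"{RIDB_BASE_URL}/{resource}"
    # Sort parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    if query:
        url = f"{url}?{query}"
    # Assumption: the API key is supplied in an "apikey" request header.
    headers = {"apikey": api_key, "Accept": "application/json"}
    return Request(url, headers=headers)


# Hypothetical search for campgrounds, limited to five results.
request = build_ridb_request(
    "facilities", "YOUR_API_KEY", query="campground", limit=5
)
print(request.full_url)
```

To actually fetch results, the request object could be passed to urllib.request.urlopen with a real API key, and the JSON body parsed with the json module.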
Martin Durant (Anaconda) presented on "Intake: Lightweight tools for loading and sharing data in data science projects"
Intake has a nice tag line: “Taking the pain out of data access and distribution”
Intake is a set of free open-source Python tools that help load data from a variety of formats into familiar containers like Pandas dataframes, Xarray datasets, and more. Boilerplate data loading code can be transformed into reusable Intake plugins. Datasets can be described for easy reuse and sharing using Intake catalog files. Martin gave an overview of Intake and demonstrated its use via Jupyter Notebooks. You can check out the video here.
Austen Thomas presented data on a backpack-style eDNA acquisition device, including aspects of flow regulation and filter pore size. Austen also presented data on the performance of a field test for specific targets relative to conventional laboratory approaches. A paper describing some of these results is available here:
Thomas, A. C., Howard, J., Nguyen, P. L., Seimon, T. A., & Goldberg, C. S. (2018). ANDe™: A fully integrated environmental DNA sampling system. Methods in Ecology and Evolution, v. 9(6), 1379-1385. https://doi.org/10.1111/2041-210X.12994
The group had a discussion about what's happening with the FAIR Principles (here is just one explanatory website about FAIR), the CDI Proposal Process, and the CDI 2019 Workshop (June 4-7, 2019 in Boulder, CO).
In November, we heard more about the CDI Request for Proposals and commenting and voting in this year’s process. The proposals process is one of the major ways that we are able to share our ideas and comments as a community of practice. We are using new tools this year, and so far the commenting on our wiki and the voting through SimplyVoting seem to be working. All CDI members should have received a ballot on November 30 and the deadline to vote is Friday, December 14 at midnight!
USGS Director Reilly dropped by to talk about Artificial Intelligence and Machine Learning and opportunities for the USGS to capitalize on these techniques. JC Nelson and Pete Doucette will be leading a new CDI Collaboration Area in Artificial Intelligence and Machine Learning, and they are having their first meeting on December 11; more details are on the group’s wiki page.
Rob Dollison from the National Geospatial Program presented on “The new 3D Elevation Program Lidar Products and Elevation Services from the National Map.” The National Map has a new web presence, map service notifications, and several viewers to browse the data, including the National Map Viewer, Elevation Viewer, and a Lidar explorer. They are moving to a system where you don’t need to download large volumes to your local drives; instead, basic visualization, analysis, and extraction functions are available through services on an open platform.
Annie Burgess from ESIP spoke about ESIP Lab Opportunities - funding from the Earth Science Information Partners and ways that CDI members could participate. Their community and goals are very similar to the CDI, but within a larger context of other agencies and institutions. The latest ESIP Lab round closes on December 18. Check out previous projects and outputs on their webpage.
ESIP Lab - facilitating pathways for 'data people' to engage with critical developer communities.
We're taking a break from monthly meetings in December and will see you on January 9, 2019!
At the October 10, 2018 CDI monthly meeting, we heard about ongoing projects that could help us with our spatial data workflow, share solutions for the challenges of integrating incomplete and disparate data, and allow us to test and use technologies for storing and managing large volumes of data.
First, Kevin Gallagher gave us a preview of the FY19 CDI Request for Proposals themes - Biosurveillance of emerging invasive species and health threats, building national datasets, reusing previously funded CDI outputs, and enabling FAIR (Findable, Accessible, Interoperable, Reusable) data. The official Request for Proposals was released the following week and you can see the details here: https://my.usgs.gov/confluence/display/cdi/2019+Proposals
The deadline for 2-page statements of interest is November 16, 2018!
Next, I had a brief Q&A with Sky Bristol about building a spatiotemporal feature registry. This is a concept about designing and building a system for usable and repeatable processes that use spatial features. Sky is looking for feedback on how such a system can be built broadly to benefit many people. I hope to have more Q&A with CDI members and their projects in the future!
Ben Mirus from the Geologic Hazards Science Center presented on Assembling a National Scale Map of Landslide Inventories from Incomplete and Disparate Spatial Data. His presentation raised two topics to explore further with CDI: which other disciplines work with similarly incomplete and disparate data (for example, species occurrence), and what theory exists for quantitatively analyzing such data (for example, a dataset that is a mix of point locations and polygons of landslide scars).
Previous landslide compilation.
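To make the point-and-polygon example concrete, here is a minimal sketch of one common first step when combining mixed geometries: reducing every record to a representative point while keeping track of where it came from. This is not the method used in the landslide inventory work. The record layout, field names, and function names are hypothetical, and coordinates are assumed to be planar (already projected), not latitude and longitude.

```python
def polygon_centroid_and_area(vertices):
    """Return ((cx, cy), area) for a simple polygon using the shoelace formula.

    `vertices` is a list of (x, y) tuples in order around the ring,
    without repeating the first vertex at the end.
    """
    twice_area = 0.0
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # The centroid denominator is 6 * signed area, i.e. 3 * twice_area.
    return (cx / (3 * twice_area), cy / (3 * twice_area)), abs(twice_area) / 2


def to_representative_point(record):
    """Map a point or polygon record to a common point-based row."""
    if record["type"] == "point":
        x, y = record["coords"]
        area = None  # Points carry no footprint information.
    elif record["type"] == "polygon":
        (x, y), area = polygon_centroid_and_area(record["coords"])
    else:
        raise ValueError(f"unsupported geometry type: {record['type']}")
    return {"id": record["id"], "x": x, "y": y,
            "area": area, "source_type": record["type"]}


# Hypothetical inventory mixing a point location and a mapped scar.
inventory = [
    {"id": "A", "type": "point", "coords": (3.0, 4.0)},
    {"id": "B", "type": "polygon",
     "coords": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]},
]
unified = [to_representative_point(r) for r in inventory]
for row in unified:
    print(row)
```

Keeping source_type and area alongside each point lets later analysis treat the two kinds of records differently instead of pretending they are equally precise. Note that the centroid of a concave polygon can fall outside the polygon itself, so a real workflow may prefer a different representative point.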
Matt Davis, from the Advanced Research Computing group, presented on A Cost Effective Approach to Scientific Data Storage and Management: BlackPearl and Globus. This presentation was exciting because we often get questions about how we in the USGS are supposed to meet data release requirements for large volumes of data, or even share them within a group of researchers. Here, "large" means files much bigger than 10 GB. Matt let us know that YES, there are new options for storing and managing large data that are available to USGS researchers now (in beta). To get started, contact email@example.com and tell the Advanced Research Computing team about your data needs.
An image from Matt Davis' presentation.