Here’s a roundup of recent CDI collaboration area topics from the month of May!
VeeAnn Cross and Peter Schweitzer reviewed the use of keywords in the USGS Science Data Catalog. Choosing good keywords is an important part of creating a USGS data release, and there is an opportunity to work together to better align the terms being used. One tip is to make sure USGS Thesaurus and ISO terms are used, and not to make up keywords that are not part of one of the suggested vocabularies.
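As an illustration of that tip, here is a minimal Python sketch of checking a data release's keywords against suggested vocabularies. The term lists are tiny hand-typed placeholders (they are not pulled from the live USGS Thesaurus or the full ISO 19115 topic category list), and the function names are hypothetical.

```python
# Illustrative placeholder excerpts only; a real check would load the full
# term lists from each vocabulary's published source.
CONTROLLED_VOCABULARIES = {
    "USGS Thesaurus (sample)": {"water quality", "land use change"},
    "ISO 19115 Topic Category (sample)": {"biota", "environment", "inlandWaters", "oceans"},
}


def check_keywords(keywords):
    """Map each keyword to the vocabularies that contain it (case-insensitive)."""
    result = {}
    for keyword in keywords:
        normalized = keyword.strip().lower()
        result[keyword] = [
            name
            for name, terms in CONTROLLED_VOCABULARIES.items()
            if normalized in {term.lower() for term in terms}
        ]
    return result


def unmatched(keywords):
    """Return the keywords that appear in none of the vocabularies."""
    return [kw for kw, vocabs in check_keywords(keywords).items() if not vocabs]


if __name__ == "__main__":
    # Hypothetical keywords from a draft metadata record.
    print(unmatched(["Water Quality", "inlandWaters", "my cool data"]))
```

Any keyword that comes back unmatched is a candidate for replacement with a term from one of the suggested vocabularies.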
Here’s more guidance on keywords and suggested vocabularies from our trusty Data Management website:
Recent post on the Metadata Reviewers forum: Data Dictionaries as a standalone product?
Announcement: If you are using code.usgs.gov, the official source code archive for USGS, there is a collaborator request form and a public repository request process. Contact email@example.com for these resources.
Kyle Goodwin from GitLab presented on “GitLab as a web-based DevOps lifecycle tool.” GitLab is not GitHub, although they both use git, a distributed version-control system for tracking changes in source code during software development. GitLab is a platform for a DevOps-driven software development lifecycle. The official source code archive for USGS (https://code.usgs.gov/public/) runs on GitLab, which provides a toolchain that includes project management, software repo, CI/CD (continuous integration/continuous deployment), metrics, and monitoring. Some groups use GitHub for repository features but GitLab for CI/CD. https://about.gitlab.com/devops-tools
Did you know that members of the Semantic Web Working Group created the buzzword bingo sheets for the CDI workshop? Thank you to Peter Schweitzer and Fran Lightsom for coordinating. Do you hear the following during CDI presentations? Leverage, game changer, compliance, smart data, portal, authoritative, internet of things, analytics, best practices, takeaway, innovative, crowdsourcing, buy-in, stakeholder, framework, workflow, carrot and stick, pitch deck, quick win, modernization, cultural change...
A couple example bingo sheets:
Chris Holmes presented on the SpatioTemporal Asset Catalog (STAC) specification, an emerging standard to make it easier to find geospatial information. It aims to enable a cloud-native geospatial future by providing a common layer of metadata for search and discovery, while playing well with the web and existing geospatial standards.
Find out more about STAC at http://stacspec.org/
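To make the "common layer of metadata" idea concrete, here is a rough Python sketch of a STAC-style Item (a GeoJSON Feature with a datetime and links to assets) and a simple space-and-time filter over a list of Items. The field names follow the published Item layout as best understood here; the version string, IDs, and URLs are assumptions for illustration, and a real catalog would be searched through a STAC API or client library rather than a hand-written loop.

```python
import json
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def make_item(item_id, bbox, when, asset_href):
    """Build a minimal STAC-style Item: a GeoJSON Feature plus time and assets."""
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",  # assumed version string, for illustration
        "id": item_id,
        "bbox": [west, south, east, north],
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "properties": {"datetime": when.strftime(TIME_FORMAT)},
        "links": [],
        "assets": {"data": {"href": asset_href}},
    }


def bboxes_intersect(a, b):
    """True when two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(items, bbox, start, end):
    """Return the IDs of Items that overlap bbox and fall within [start, end]."""
    hits = []
    for item in items:
        when = datetime.strptime(item["properties"]["datetime"], TIME_FORMAT)
        when = when.replace(tzinfo=timezone.utc)
        if bboxes_intersect(item["bbox"], bbox) and start <= when <= end:
            hits.append(item["id"])
    return hits


if __name__ == "__main__":
    # Hypothetical scene over Colorado; the URL is a placeholder.
    item = make_item(
        "scene-a", (-106.0, 39.0, -105.0, 40.0),
        datetime(2019, 5, 1, tzinfo=timezone.utc), "https://example.com/a.tif",
    )
    print(json.dumps(item, indent=2))
```

Because every Item carries the same small set of fields, the same search logic works regardless of which sensor or provider the data came from.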
Chris Holmes is also involved with the Radiant Earth Foundation, which you can read more about here: Creating a Machine Learning Commons for Global Development
The Tech Stack working group meets jointly with the ESIP Information Technology and Interoperability Committee, and is led by Dave Blodgett and Rich Signell.
Tamar Norkin and Ricardo McClees-Funinan presented on “Behind the Scenes at ScienceBase: How Data Release happens in your USGS Trusted Digital Repository (TDR).”
They highlighted the ScienceBase Data Release (SBDR) Tool, a very handy way to start your USGS data release. The SBDR Tool can be found here: https://www.sciencebase.gov/datarelease.
For data release questions and requests, you can get in touch with the data release team at firstname.lastname@example.org.
After filling out the ScienceBase Data Release Tool form, you get a new landing page in ScienceBase, a reserved Digital Object Identifier, and an email with instructions!
Chris Barber from USGS EROS presented on XGBoost in Continuous Change Detection and Classification (CCDC). Chris explained how XGBoost (eXtreme Gradient Boosting, an open-source software library that provides a gradient boosting framework) improved the efficiency and accuracy of segment classification and land cover extraction for LCMAP (Land Change Monitoring, Assessment, and Projection). Chris gave a good introduction to the concepts of decision trees, decision tree ensembles, and boosted trees. However, he urged us to remember that there is no substitute for appropriate training data. Email Chris for his extensive reference list - email@example.com.
Peter Esselman of the USGS Great Lakes Science Center also presented, on “Deep learning to quantify benthic habitat.”
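For readers new to boosted trees, here is a toy sketch of the idea in plain Python: each round fits a one-split decision tree (a "stump") to the errors left by the previous rounds, then adds a damped version of it to the ensemble. This is not XGBoost itself and not the LCMAP workflow; XGBoost adds regularization, second-order gradient information, and much faster tree construction. The data and function names are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on residuals."""
    best = None
    # Skip the largest value so the right-hand side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for squared error: each stump is fit to current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


if __name__ == "__main__":
    # Made-up one-feature training data with a jump between x=3 and x=4.
    xs = [1, 2, 3, 4, 5, 6]
    ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
    model = fit_boosted(xs, ys)
    print([round(predict(model, x), 2) for x in xs])
```

The sketch also shows why training data matter so much: the ensemble can only chase the residuals of the examples it is given.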
From Peter Esselman's talk - image showing various tools that are the future of Great Lakes science.
The Citizen-Centered Innovation group discussed the final draft of the OSTP Report on Prizes and Citizen Science Projects, the USGS Open Innovation Strategy, and the DOI Generic Information Collection Request. They also highlighted relevant upcoming seminars and events in the broader Federal sphere.
Sara McBride of the Earthquake Science Center presented on Social Science 101: a Primer. Some conclusions: Social science is a big field with a lot of disciplines, each examining the human experience through its own unique lens. Doing it well requires years of study, so DIY social science is not recommended. There are a number of social scientists within the USGS: reach out and ask us questions!
Some related resources:
Wilkins EJ, Miller HM, Tilak E, Schuster RM (2018) Communicating information on nature-related topics: Preferred information channels and trust in sources. PLoS ONE 13(12): e0209013. https://doi.org/10.1371/journal.pone.0209013
https://my.usgs.gov/hd/: HDgov is a multi-agency website for all things human dimensions of natural resources. Here you can access a variety of resources to assist you in your work.
Dale Cox presented on SAFRR (Science Application for Risk Reduction) Projects and Scenarios for Risk Reduction. Dale has been involved in many scenario projects and is in the process of looking back and evaluating some of the scenarios. What is a scenario? Principles of a scenario: focus on a single, large, but plausible event that we need to be ready for; integrate across many disciplines; use the best hazard science; reach consensus among leading experts; create the study with community partners; and present results in products that fit the user, not the scientist.
Some related resources:
USGS Earthquake Scenario Map: https://earthquake.usgs.gov/scenarios/related.php
Slides explaining the components of a scenario and the Science Application for Risk Reduction (SAFRR) scenarios.
The Risk group also announced its inaugural RFP Awards!
Chris Merkes from UMESC presented on Choosing the right eDNA assay: Developing standards for Limit of Detection and Limit of Quantification. This work is planned for release soon in a new environmental DNA journal.
A resource discussed: https://github.com/cmerkes/qPCR_LOD_Calc
Merkes CM, Klymus KE, Allison MJ, Goldberg C, Helbing CC, Hunter ME, Jackson CA, Lance RF, Mangan AM, Monroe EM, Piaggio AJ, Stokdyk JP, Wilson CC, Richter C. (2019) Generic qPCR Limit of Detection (LOD) / Limit of Quantification (LOQ) calculator. R Script. Available at: https://github.com/cmerkes/qPCR_LOD_Calc. DOI: https://doi.org/10.5066/P9GT00GB.
Slide from Chris Merkes' talk illustrating Limit of Detection and Limit of Quantification.
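As a rough illustration of the two limits, here is a short Python sketch using one simple discrete-threshold convention. The thresholds are assumptions chosen for this example and are not necessarily what the talk or the linked R script uses (the R calculator may rely on curve fitting instead): the LOD is taken as the lowest standard concentration where at least 95% of replicates amplify, and the LOQ as the lowest concentration where the coefficient of variation of the measured copies stays at or below 35%. The data and function names are hypothetical.

```python
from statistics import mean, stdev


def detection_rate(replicates):
    """Fraction of replicates that amplified (None means no detection)."""
    return sum(1 for r in replicates if r is not None) / len(replicates)


def coefficient_of_variation(replicates):
    """Standard deviation divided by mean of the detected replicates."""
    values = [r for r in replicates if r is not None]
    if len(values) < 2:
        return float("inf")
    return stdev(values) / mean(values)


def lowest_passing(standards, passes):
    """Walk from the highest concentration down; stop at the first failure."""
    result = None
    for concentration in sorted(standards, reverse=True):
        if not passes(standards[concentration]):
            break
        result = concentration
    return result


def limit_of_detection(standards, min_rate=0.95):
    # min_rate is an assumed threshold for this sketch.
    return lowest_passing(standards, lambda reps: detection_rate(reps) >= min_rate)


def limit_of_quantification(standards, min_rate=0.95, max_cv=0.35):
    # max_cv is an assumed threshold for this sketch.
    return lowest_passing(
        standards,
        lambda reps: detection_rate(reps) >= min_rate
        and coefficient_of_variation(reps) <= max_cv,
    )


if __name__ == "__main__":
    # Hypothetical standard curve: nominal copies per reaction mapped to
    # measured copies in each replicate.
    standards = {
        1000: [980.0, 1010.0, 1005.0, 995.0],
        100: [90.0, 110.0, 105.0, 95.0],
        10: [4.0, 16.0, 9.0, 12.0],
        1: [1.2, None, None, 0.7],
    }
    print("LOD:", limit_of_detection(standards))
    print("LOQ:", limit_of_quantification(standards))
```

With only four replicates per level, a 95% detection threshold effectively requires every replicate to amplify; a real assay validation would use many more replicates.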
The Software Development Cluster discussed Docker basics for code development.
More notes and links are in their Meeting Notes (accessible to DOI users).
All CDI Collaboration Areas may be browsed on the CDI wiki.
Summary extracted from notes of Fran Lightsom, lead of the Metadata Reviewers group:
Sheryn Olson demonstrated the metadata collecting system used by MonitoringResources.org to encourage discussion of how it might be simpler and easier to use, as well as good ideas that the rest of us can copy. MonitoringResources.org is part of the Pacific Northwest Aquatic Monitoring Partnership (PNAMP) and uses the metadata to provide an index of monitoring activities, especially the ecology of streams of the U.S. Pacific Northwest, and the procedures, protocols, and monitoring designs that are in use.
View more notes and the presentation slides on the Metadata Reviewers Meetings page.
Summary provided by Derek Masaki, co-lead of the DevOps group:
Presenters: Kevin Portanova, Director of IT for Public and Indian Housing, and Mel Hurley, DevOps Manager. The presentation provided an overview of the shift that HUD is making away from traditional on-premises IT operations toward cloud-focused DevOps. Kevin and Mel took us through their process of reorganizing a contractor-based IT environment, refactoring their development process, and creating a Federal-employee-centric staff oriented toward Agile and a DevOps workflow in the Microsoft Azure environment.
See the slides on the DevOps Meetings page.
The DMWG heard two presentations, first from John Faundeen and Natalie Latysh about “Becoming a USGS Trusted Digital Repository,” and second from Viv Hutchison and John Faundeen on “Progress on a USGS Data Manager Position Description Series.”
The slides and recording are posted on the meeting page.
John Karabaic presented on Pachyderm, a data science platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance. Read the docs here: http://docs.pachyderm.io/en/latest/index.html
Tech Stack calls are joint with the ESIP Interoperability and Technology Tech Dive Webinars. You can review the recording here.
Kevin Lafferty, senior ecologist at Western Ecological Research Center, presented on White Shark eDNA. In recent work he has been refining methods to get better data from white shark eDNA. Kevin is based in Santa Barbara, CA, and surely made many people jealous while describing data collection with instruments on paddle boards.
View the recording on the Bioinformatics Meetings page.
Kevin is looking for new collaborations within USGS and you can email him at firstname.lastname@example.org if interested. (Remember: data collection with instruments on paddle boards.)
Sophia Liu led a discussion covering many topics, including the OSTP Draft Report to Congress for the Crowdsourcing and Citizen Science Act, a Dept of the Interior Generic Information Collection Request, the USGS Open Innovation Strategy, and the CitizenScience.gov website, including USGS CCS projects. The group also covered past and upcoming events such as the Citizen Science Association (CSA) Conference (March 13-17, 2019), the Federal Crowdsourcing Webinar - Episode 1: Citizen Science, and upcoming Federal Crowdsourcing Webinars, which can currently be found on this page: https://digital.gov/events/. Sophia’s use of Mentimeter added a great element of interactivity to the meeting. See more on the group wiki page.
Kris Ludwig and Dave Ramsey lead the Risk CoP and hosted a call with presentations about the benefits of communities of practice (Leslie Hsu, CDI Coordinator) and user engagement in the development of ShakeCast (Dave Wald, Seismologist).
With respect to user engagement, Dave shared several titles that present “logical approaches for bringing products to users,” including The Power of Habit, Contagious, To Sell is Human, Nudge, Made to Stick, Diffusion of Innovations, and The Undoing Project. Book club, anyone?
View the presentations and recording on the Risk Meetings page.
Reads related to user engagement recommended by David Wald.
Cassandra Ladino led a discussion on building connections, inspired by this Better Scientific Software post: Building Connections and Community within an Institution.
The group had recently fielded a question about desktop installers, and the challenges of code signing. An internal site on application and script signing was shared. Some group members were also of the opinion that providing a method to install your application using Anaconda (on all OSs) was adequate.
A huge thanks to the three CDI Project teams who presented at our April Monthly Meeting.
Caitlin Andrews, a landscape ecologist in the Southwest Biological Science Center, explained how she used R Shiny and Amazon Web Services to create an interactive online front end for a proven model of ecosystem water balance, SOILWAT2. This tool helps to predict and understand site-specific risk of future drought. Lots of lessons here for people who want to make user-friendly online tools out of more traditional scientific models within the USGS IT ecosystem. Code repository at https://github.com/DrylandEcology
Matt Neilson, a fishery biologist and co-lead for the Nonindigenous Aquatic Species Database program, delivered the line of the day: We are living in a machine-readable world. His project uses natural language processing and the xDD (eXtract Dark Data, formerly GeoDeepDive) literature database to improve, modernize, and greatly increase the efficiency of literature review. For people who used to walk to the library and photocopy stuff (and record radio songs on cassettes and dial with rotary phones), this is strange, but I will attempt to evolve with the times. See more information, like code repositories, in the Related External Resources links on the project's ScienceBase page.
Jon Warrick, research geologist in the Coastal/Marine Hazards and Resources Program, described the software tools, resources, and training workshops developed to allow USGS scientists to apply deep learning to remotely sensed imagery and better understand natural hazards and habitats. The two in-person workshops on these tools held in 2018 were able to accommodate only a fraction of the interested applicants. The CDI hopes to be able to provide more trainings like this to help build deep learning expertise and capacity in the USGS. See more at https://github.com/dbuscombe-usgs/cdi_dl_workshop and https://github.com/dbuscombe-usgs/dl_tools.
Log in to see the meeting recording and slides at the meeting page.
Cassandra Ladino led a brainstorming session for topics that could be discussed within the Software Development cluster, using sli.do to collect ideas and Trello to organize them. Some ideas included: code.usgs.gov - what is it, who should use it and when; Using US Web Design System in USGS web sites; Docker training for distributing scientific software; Python APIs using Swagger and/or Flask; How to grow grassroots development efforts to enterprise systems; Creating a community of practice for unit testing code so that it can be easily reviewed by anyone in the software dev community; Should there be separation between scientific software and web development software discussions? (pros and cons). Lots of exciting topics!
Risk Community of Practice leads Kris Ludwig and Dave Ramsey introduced the new Risk Community of Practice, reviewed the USGS Risk Plan and implementation plans for FY19, and announced the FY19 Risk RFP. The purpose of the group is to:
build connections across centers, programs, mission areas
create a central point of contact for USGS risk research and applications
identify needs and opportunities to benefit the community
generate project ideas
share resources, expertise
Besides the Risk Plan, another recent publication mentioned was Assessing Hazards and Risks at the Department of the Interior—A Workshop Report, by Nate Wood, Alice Pennaz, Kristin Ludwig, Jeanne Jones, Kevin Henry, Jason Sherba, Peter Ng, and others.
Mattia Almansi from Johns Hopkins University presented on Integrating SciServer and OceanSpy. OceanSpy is an open-source and user-friendly Python package that enables scientists and interested amateurs to use ocean model data sets with out-of-the-box analysis tools. OceanSpy builds on software packages developed by the Pangeo community (in particular xarray, dask, and xgcm). OceanSpy accelerates and facilitates exploration (including visualization) of terascale data. (Adapted from the presentation abstract.)
See more, including a link to the recorded session, on the group presentation website, hosted by ESIP - the Earth Science Information Partners. TSWG contacts are Dave Blodgett and Rich Signell.
The Semantic Web Working Group held a discussion about Semantic Web elements at the upcoming CDI Workshop. Ken Bagstad mentioned the breakout session he is co-leading at the workshop, which will include semantics in the context of predictive modelling, intersecting with artificial intelligence and machine learning. Other topics included FAIR (findability, accessibility, interoperability, and reusability) in machine- and human-readable contexts and the importance of standard data dictionaries.
What I learned at the AI/ML group call:
USGS is setting up a new machine for AI; it is named Tallgrass, after the Tallgrass Prairie NPS park in Kansas
Projected timeline for the setup: mid-April - Tallgrass installation; early May - friendly testing; early June - general availability.
Reminder of what GPUs are vs. CPUs
AI for Ecosystem Services: What if our data and models could talk to one another, and decision makers could use scientific information to more quickly and reliably answer questions about today’s most urgent problems? Find out more at http://www.integratedmodelling.org
JC pointed out some activity on the AI/ML forum and encouraged members to post
Group leads reminded members to contribute to a spreadsheet for collecting USGS AI/ML project descriptions to communicate to USGS leadership.
You should think of this image whenever we mention the Tallgrass infrastructure. (from the NPS Tallgrass Prairie website)
Cassandra Ladino led the working group in a discussion of topics to be discussed at the CDI Workshop or at future DMWG meetings. Some ideas for further discussion included:
Data Management Plans - streamlining process from DMP to publishing; enforcing; hosting
QMS (Quality Management System for USGS labs) integration with data management and records management
Metadata for the National Digital Catalog
More information and guidance on USGS Software Release
UAS (Unmanned Aircraft Systems/AKA Drone) data
Data sharing agreements
Martin Folkoff, lead DevOps engineer at Booz Allen Hamilton, provided a technical overview of the DevOps environment he has designed and the CI/CD (continuous integration/continuous deployment) pipeline employed by his teams at BAH. He provided a look at the tools he uses to orchestrate his production environments.
The Metadata Reviewers Community of Practice will be hosting a breakout session at the CDI Workshop to provide guidance for data and metadata review, and tips and tricks for data and metadata authors. Virtual participation is planned.
The ISO Content Specs project will be hosting workshop sessions on Thursday and Friday at the CDI Workshop. The sessions will focus on collecting requirements for metadata specification modules, most likely modules for experimental data, computational data, and observational data. To learn more, contact Dennis Walworth, Fran Lightsom, or Lisa Zolly.
At the March 13, 2019 monthly meeting, CDI’s executive sponsor Kevin Gallagher talked about the theme of this year’s CDI workshop: From Big Data to Smart Data - this concerns turning our huge volumes of diverse data into usable, actionable, integratable, or “smart” data. Registration for the workshop (June 4-7, 2019 in Boulder, CO) is open and can be found on the workshop wiki page.
We heard presentations from three FY18 CDI Funded Projects:
Wesley Daniel presented on the Nonindigenous Aquatic Species Alert Risk Mapper and reported that the team will be posting a write-up of their challenges transitioning to ArcGIS Pro as part of their outcomes. See more accomplishments on their ScienceBase page.
Dennis Walworth and Fran Lightsom presented on the Transition to ISO metadata project and reported that the project team will host several activities at the CDI workshop; they are looking for users to test their interface. They are using the previously funded mdEditor application (ScienceBase page) in their work.
Nate Wood and Jeanne Jones presented on the Department of Interior Risk and CDI Risk Map. They reported many links that are available for Department of Interior users to test out, including data description, codebase, the risk map, GeoServer, and the API. CDI members, go to the meeting page and log in to view their slides - links are on the last slide.
The DOI Risk Workshop Report is out! Wood, N., Pennaz, A., Ludwig, K., Jones, J., Henry, K, Sherba, J., Ng, P., Marineau, J., and Juskie, J., 2019, Assessing hazards and risks at the Department of the Interior—A workshop report: U.S. Geological Survey Circular 1453, 42 p., https://doi.org/10.3133/cir1453.
Hans Vraga from the Web Informatics and Mapping Program (WIM, wim.usgs.gov) gave an overview of the group, of which he is the Project Manager. WIM is a web development shop that has cooperators from both within and outside of the USGS. Some of their products include a SPARROW model output visualizer, StreamStats, and a WHISPers wildlife event reporting system (coming soon).
As you can imagine, their expertise is in high demand. Things they look for in cooperators include a match of scientific/subject matter expertise to complement their group’s technical expertise, the cooperator as an active product owner, focusing on development and minimizing time for operations, and fast turnaround time projects. Check out their website or contact Hans Vraga for more information.
In February, the group had two major questions come up for discussion - these were passed along to the appropriate committees and officials for guidance and answers were produced quickly!
First: Is there updated guidance on the volume of data necessary to trigger a separate data release? (As opposed to a table in a publication.) Short answer: Having the data in the paper is ok - however, if data is big enough to be moved into a supplemental section of the paper, it has to be a USGS data release.
Second: How should authors reference data that is not publicly available when writing a manuscript? Short answer: there is updated guidance on the FSP “Guide to Data Releases” page for data that are not available at the time of publication, or that have limited availability owing to restrictions, in the section Data Associated with a Publication.
John Stock of the USGS Innovation Center joined to talk about some opportunities available for postdoctoral research, future workshops, and future discussions related to AI/ML in the USGS. The joint USGS-NASA postdoctoral fellowships are now posted: https://geography.wr.usgs.gov/InnovationCenter/fellowship.html
Pete Doucette presented a talk, “Ruminations on AI and Land Imaging.” He included a great intro on the difference between the AI and machine learning of decades ago versus the capabilities now (e.g. neural networks versus DEEP neural networks). Several land imaging projects and datasets at the USGS are becoming more “analysis-ready” for data science, predictive analytics, and informing decisions. For example, see “Continuous change detection and classification of land cover using all available Landsat data” (Zhu and Woodcock, 2014).
A major theme was the need for the combination of disciplinary expertise and AI/ML expertise, essentially team science, in order to reach the full potential of AI/ML. (See the NAS report Enhancing the Effectiveness of Team Science.)
A White House Fact Sheet on “Accelerating America’s Leadership in Artificial Intelligence” was shared with the group by Mona Khalil and Leah Colasuonno.
A few slides from Pete Doucette's talk on AI and Land Imaging.
Cassandra Ladino stepped in to lead the February Semantic Web Working Group discussion, which focused on the theme of FAIR (Findable, Accessible, Interoperable, Reusable) in USGS. The group discussed ideas for a proposed FAIR Workshop, including the topic of new approaches and technologies to further enhance FAIRness at USGS. See the meeting notes for more resources and references.
The joint ESIP Tech Dive - CDI Tech Stack presentation was on “Cloud Native Geoprocessing of Earth Observation Satellite Data with Pangeo,” by Scott Henderson, University of Washington. “The integration of new technologies with several high-level Python packages are enabling Cloud-native workflows and circumvent the bottleneck of downloading large amounts of data.”
Aptly summarized: “If that doesn’t get people excited I don’t know what will,” said Rich Signell, co-chair of the Tech Stack Group.
Screenshot from a demo linked to the post "Cloud Native Geoprocessing of Earth Observation Satellite Data with Pangeo."
The latest of the monthly eDNA webinars organized by Scott Cornman was on CALeDNA (California Environmental DNA), by Rachel Meyer of UCLA. CALeDNA capitalizes on the enthusiasm of citizen scientists - they provide kits for collection of data in the field. Data collectors also take iNaturalist observations for benchmarking. The data are provided online for the public to identify patterns, and are also used for academic research on topics like phylogenetic diversity and functional diversity.
CALeDNA used the Kobo toolbox to build their data collection form; they found it to be the most robust platform for cell phone data collection. https://www.kobotoolbox.org/
rANACAPA - an R package developed so that non-specialists without community ecology background can generate the relevant plots. “Ranacapa: An R package and Shiny web app to explore environmental DNA data with exploratory statistics and interactive visualizations” https://f1000research.com/articles/7-1734/v1
Check out one of their case studies and the data visualizations available! https://data.ucedna.com/research_projects/pillar-point
A few slides from Rachel Meyer's talk on the California eDNA program.
The Software Development Cluster hosted a discussion on Cloud and Big Data in the Cloud. Cassandra Ladino started off the discussion with a presentation on Cloud and Big Data, including a summary of resources she has been using to learn more. There is information in the notes on how to sign up for a USGS Cloud Hosting Solutions Sandbox.
Our first monthly meeting of 2019 was on February 13, and we heard about forward-looking water research tools, new outputs to help resource managers deal with invasive species, and information about how to get the most out of the upcoming June CDI workshop. View the recording and slides on the February 13 Monthly Meeting page.
Tony Castronova of CUAHSI (Consortium of Universities for the Advancement of Hydrologic Science, Inc.) gave an overview of HydroShare and CUAHSI-JupyterHub, tools that help researchers to develop, save, and share water research workflows. This gave a cool perspective on tools that use USGS water data and complement existing USGS tools. CUAHSI has a large education component, including plentiful cyberseminar presentations that address topics of interest overlapping with the CDI!
HydroShare workflow at https://www.hydroshare.org/
Jake Weltzin opened a series of CDI funded project presentations that will occur in the next few months, presenting on “Workflows to support integrated predictive science capacity: Forecasting invasive species for natural resource planning and risk assessment.” In addition to the daily map forecasts and other outputs about invasive insect activity, the project team is working on a report that will outline their experiences with stakeholder engagement.
Finally, Madison Langseth and I gave some of the latest information about how everyone can benefit from the upcoming CDI Workshop in Boulder, June 4-7. Right now we are focusing on getting community members to submit and comment on session ideas by the end of February, so that we can organize the topics in early March. Also, we are working on stepping up our game for virtual participation and interactive content that will help members meet and connect with each other.
Join us on March 13 for our next monthly meeting and more presentations from CDI community selected and supported projects!
The Metadata Reviewers group met and continued to share resources for effective metadata review. Among the topics that they discussed:
Guidelines for metadata review (google doc link available to Dept of Interior users) that were first discussed in November, led by Tamar Norkin.
Sharing other Data Release Guidance resources; there were many, and a few include:
The group also has recent Q&A posted on their Metadata Discussion Forum.
The DMWG had three very informative presentations in December!
Kelly Haberstroh – Updates about the Publications Warehouse
Dennis Walworth – Updates on ISO for USGS: content specifications and current status of ADIwg
Lisa Zolly – Updates to the Digital Object Identifier Tool
JC Nelson hosted the first AI/ML CDI call and discussed plans for the group. Over the next several months, we will hear from different researchers around the USGS who are incorporating AI/ML techniques into their work. The group will also be a forum for questions for practitioners, such as one asked by Michelle Guy: Are people doing AI/ML work in the cloud, on local GPU hardware, or another option?
The group will stay in touch with another USGS effort focused on AI and image processing. This group was initiated in the Ecosystems Mission Area and is led by Mona Khalil. Mona held calls on 12/18 and 12/19 that focused on hearing about current activities and resources for AI and image processing.
Both groups mentioned the dl_tools lectures and toolboxes that were developed and presented by Daniel Buscombe and others with support from the CDI!
dl_tools method schematic from https://dbuscombe-usgs.github.io/dl_tools
The Tech Stack group invited Ian Rose (University of California, Berkeley) to demo “Developing JupyterLab Extensions.” Ian took us through a live demo of the process of building a JupyterLab extension. “In fact, the whole of JupyterLab itself is simply a collection of extensions that are no more powerful or privileged than any custom extension.”
For fun: From the JupyterLab documentation: Let's Make an xkcd JupyterLab Extension
Visit the joint Tech Stack and ESIP Tech Dive webinar page to see the next few months of topics!
Image from the JupyterLab documentation: https://jupyterlab.readthedocs.io/en/stable/developer/xkcd_extension_tutorial.html
Another month and another group of topics - stay informed!
The group, led by Tamar Norkin, had a discussion on the Guidelines for Metadata Review. They discussed ways to improve the usability of the document as an actual checklist, and what information would be good to include, such as “tips and tricks” for metadata reviewers. Looks like a great resource for anyone who is called upon to review metadata!
In addition to regular updates on the USGS Git Hosting Platform and the USGS Software Management website, in November the DevOps group heard about recent Recreation.gov activities from Shums Hoda and Martin Folkoff of Booz Allen Hamilton.
Recreation.gov is a gateway to discover America's Outdoors and more, a place for trip planning, information sharing and reservations with information from 12 federal Participating Partners.
The website is at https://www.recreation.gov. API documentation of the RESTful services for the Recreation Information Database is at https://ridb.recreation.gov/docs. Other topics covered included microservices, domain-driven design, and high-level architecture.
What's the tech behind reserving your campsites at recreation.gov?
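For anyone curious what calling a RESTful service like this looks like, here is a hedged Python sketch that builds (but does not send) a search request. The base URL, the `/facilities` path, the `query` and `limit` parameters, and the `apikey` header are assumptions based on a reading of the public documentation and should be confirmed at https://ridb.recreation.gov/docs before use; the function name and key value are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL; confirm against https://ridb.recreation.gov/docs.
RIDB_BASE_URL = "https://ridb.recreation.gov/api/v1"


def build_facility_search(api_key, query, limit=5):
    """Build a GET request for a facility search (endpoint and params assumed)."""
    params = urlencode({"query": query, "limit": limit})
    url = f"{RIDB_BASE_URL}/facilities?{params}"
    # The API key header name is an assumption for this sketch.
    return Request(url, headers={"apikey": api_key, "accept": "application/json"})


if __name__ == "__main__":
    request = build_facility_search("YOUR-API-KEY", "yosemite")
    print(request.get_method(), request.full_url)
```

Sending the request would then be a matter of passing it to `urllib.request.urlopen` and parsing the JSON body, once a real API key is in hand.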
Martin Durant (Anaconda) presented on "Intake: Lightweight tools for loading and sharing data in data science projects"
Intake has a nice tag line: “Taking the pain out of data access and distribution”
Intake is a set of free open-source Python tools that help load data from a variety of formats into familiar containers like Pandas dataframes, Xarray datasets, and more. Boilerplate data loading code can be transformed into reusable Intake plugins. Datasets can be described for easy reuse and sharing using Intake catalog files. Martin gave an overview of Intake and demonstrated its use via Jupyter Notebooks. You can check out the video here.
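The catalog idea can be illustrated with a small standard-library sketch. This is not Intake's actual API: the `CATALOG` dictionary, the `DRIVERS` table, the `read` function, and the gauge values are all hypothetical stand-ins, meant only to show how describing a dataset once lets people load it by name instead of copying boilerplate loading code.

```python
import csv
import io

# Hypothetical in-memory "catalog": each entry describes a dataset and names
# the driver that knows how to load it. The data values are made up.
CATALOG = {
    "stream_gauges": {
        "description": "Example gauge readings (made-up values)",
        "driver": "csv",
        "data": "site,flow_cfs\nA,12.5\nB,7.0\n",
    },
}


def load_csv(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


# Map driver names to loader functions, a stand-in for reusable plugins.
DRIVERS = {"csv": load_csv}


def read(name):
    """Look up a dataset by name and load it with the driver its entry names."""
    entry = CATALOG[name]
    return DRIVERS[entry["driver"]](entry["data"])


rows = read("stream_gauges")
print(rows[0]["site"], rows[0]["flow_cfs"])
```

In this sketch, supporting a new format means registering one more loader in `DRIVERS`; the code that calls `read` does not change.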
Austen Thomas presented data on a backpack-style eDNA acquisition device, including aspects of flow regulation and filter pore size. Austen also presented data on the performance of a field test for specific targets relative to conventional laboratory approaches. A paper describing some of these results is available here:
Thomas, A. C., Howard, J., Nguyen, P. L., Seimon, T. A., & Goldberg, C. S. (2018). ANDe™: A fully integrated environmental DNA sampling system. Methods in Ecology and Evolution, v. 9(6), 1379-1385. https://doi.org/10.1111/2041-210X.12994
The group had a discussion about what's happening with the FAIR Principles (here is just one explanatory website about FAIR), the CDI Proposal Process, and the CDI 2019 Workshop (June 4-7, 2019 in Boulder, CO).
In November, we heard more about the CDI Request for Proposals and commenting and voting in this year’s process. The proposals process is one of the major ways that we are able to share our ideas and comments as a community of practice. We are using new tools this year, and so far the commenting on our wiki and the voting through SimplyVoting seem to be working. All CDI members should have received a ballot on November 30 and the deadline to vote is Friday, December 14 at midnight!
USGS Director Reilly dropped by to talk about Artificial Intelligence and Machine Learning and opportunities for the USGS to capitalize on these techniques. JC Nelson and Pete Doucette will be leading a new CDI Collaboration Area in Artificial Intelligence and Machine Learning, and they are having their first meeting on December 11. More details are on the group’s wiki page.
Rob Dollison from the National Geospatial Program presented on “The new 3D Elevation Program Lidar Products and Elevation Services from the National Map.” The National Map has a new web presence, map service notifications, and several viewers to browse the data, including the National Map Viewer, Elevation Viewer, and a Lidar explorer. They are moving to a system where you don’t need to download large volumes to your local drives. Instead, basic visualization, analysis, and extraction functions are available through services on an open platform.
Annie Burgess from ESIP spoke about ESIP Lab Opportunities - funding from the Earth Science Information Partners and ways that CDI members could participate. Their community and goals are very similar to the CDI, but within a larger context of other agencies and institutions. The latest ESIP Lab round closes on December 18. Check out previous projects and outputs on their webpage.
ESIP Lab - facilitating pathways for 'data people' to engage with critical developer communities.
We're taking a break from monthly meetings in December and will see you on January 9, 2019!
At the October 10, 2018 CDI monthly meeting, we heard about ongoing projects that could help us with our spatial data workflow, share solutions for the challenges of integrating incomplete and disparate data, and allow us to test and use technologies for storing and managing large volumes of data.
First, Kevin Gallagher gave us a preview of the FY19 CDI Request for Proposals themes - Biosurveillance of emerging invasive species and health threats, building national datasets, reusing previously funded CDI outputs, and enabling FAIR (Findable, Accessible, Interoperable, Reusable) data. The official Request for Proposals was released the following week and you can see the details here: https://my.usgs.gov/confluence/display/cdi/2019+Proposals
The deadline for 2-page statements of interest is November 16, 2018!
Next, I had a brief Q&A with Sky Bristol about building a spatiotemporal feature registry. This is a concept about designing and building a system for usable and repeatable processes that use spatial features. Sky is looking for feedback on how such a system can be built broadly to benefit many people. I hope to have more Q&A with CDI members and their projects in the future!
Ben Mirus from the Geologic Hazards Science Center presented on Assembling a National Scale Map of Landslide Inventories from Incomplete and Disparate Spatial Data. From his presentation, some topics that came up to explore further with CDI are: figuring out what other types of disciplinary data have this type of incomplete and disparate data (for example, species occurrence), and what is the theory about quantitatively analyzing incomplete and disparate data (for example, a dataset that is a mix of point locations and polygons of landslide scars).
Previous landslide compilation.
Matt Davis, from the Advanced Research Computing group, presented on A Cost Effective Approach to Scientific Data Storage and Management: BlackPearl and Globus. This presentation was exciting because we often get questions about how we in the USGS are supposed to meet data release requirements for large volumes of data, or even share them within a group of researchers. In this case, “large” means files much bigger than 10 GB. Matt let us know that YES, there are new options for storing and managing large data that are available to USGS researchers now (in beta). To get started, contact email@example.com and tell the Advanced Research Computing team about your data needs.
An image from Matt Davis' presentation.
Looks like October brought back collaboration area activity in full swing. Here are October’s topics and discussions in reverse chronological order!
The Data Management Working Group held a special session - Wade Bishop of University of Tennessee presented his findings on a data fitness-for-use study. In his study he asked participants to consider a recent example of when they searched for data and decided if it was fit for them to (re)use. Then he asked questions related to each of the elements in the FAIR data framework (Findable, Accessible, Interoperable, Reusable). Wade provided many fine puns on “FAIR” (if that is FAIR to say) and quotes such as “Deciding if data is fit for reuse is kind of like thumping on a melon or smelling bread before you buy it.” (Maybe you had to be there?) Participant quotes provided interesting insights, such as the metadata-data disconnect - do people understand how metadata and keywords are helping them to discover or use data? Perhaps if data providers do a good enough job of making data FAIR, data consumers will not even notice; they will just happily reuse the data. Slides can be found on the DMWG meeting page.
The Software Development Cluster discussed a draft Git migration plan (link accessible by Dept of Int) for USGS. Last June, an announcement about the USGS Git Platform (link accessible on the USGS network) was distributed. Members of the Software Development Cluster are providing information to help USGS code repository owners meet the requirements in the announcement. Note that the plan is still in early draft and open to suggestions. The contact for the plan is Eric Martinez, firstname.lastname@example.org.
The Subduction Zone Focus Group posted notes from their October meeting, summarizing ongoing projects, new members, and other opportunities. Topics included land-level changes along the Olympic Peninsula, SZ4D Research Coordination Networks, a Cascadia Recurrence database, a Mendenhall Fellowship focused on Cascadia landslides now being advertised, automated turbidite analysis, tsunamis, and recent papers and reports from the M(agnitude)9 project.
Snapshot of a data compilation for a Cascadia 3D seismic model, summary of the locations of 34 individual controlled-source wide-angle seismic imaging experiments dating to the 1960s. (T. Brocher)
The Bioinformatics Community of Practice had a discussion about the newly released CDI Request for Proposals, including what is in scope, how to meet the 30% in-kind match, and how the two-phase selection process works. Notes can be found on the RFP Collaboration forum.
The Tech Stack group didn’t have a live meeting, but Sky Bristol made a video demonstrating some of the concepts behind a SpatioTemporal Feature Registry. The group was encouraged to ask questions about the video using our wiki page. Further discussion is at the ESIP-hosted IdeaScale ideation page.
The Semantic Web working group discussed semantic approaches to enable USGS data to be FAIR (Findable, Accessible, Interoperable, Reusable). They used the list of FAIR Principles at https://www.go-fair.org/fair-principles/, which includes links to explanations. Notes can be viewed at their meeting page.
The eDNA community of practice created a page sharing recent example data releases for environmental DNA.
In FY19, DevOps will consolidate to one meeting per month with both Project Management and SysAd/Developer Topics. Sarah Battani from Develop Intelligence gave an introduction to their DevOps Academy training opportunities.
The group took stock of the state of USGS metadata, including challenges and needs.
Fran set up a wiki page at https://my.usgs.gov/confluence/display/cdi/Metadata+Reviewers+Training+Collection as a place to share resources on Metadata Reviewers training.
Learn more at the CDI Collaboration Area Page.
View the CDI Calendar to see upcoming meetings.
Data Management Working Group, 9/10/18 - connecting existing data assets and our new USGS websites
Semantic Web Working Group, 9/13/18 - potential future plans with the USGS Thesaurus and with FAIR data
Citizen-Centered Innovation Monthly Meeting, 9/17/18 - the upcoming Crowdsourcing and Citizen Science Report to Congress
Lance Everette presented information to help update data managers and web masters on how to use new tools that are available for connecting the Science Data Catalog, ScienceBase, and the new USGS websites.
Raad Saleh (email@example.com) from EROS sought ideas and examples of processes for transitioning research data to operational data.
Presentations given by Lance Everette and Raad Saleh can be found on the meeting page.
A small group talked about the history, status, and future plans of the working group. Some potential future plans included working on the USGS Thesaurus, as presented at the September CDI Monthly Meeting, and activities to make USGS data more consistent with the FAIR Data Principles – not just focusing on integrating data to support a particular use, but improving our data practices so that all USGS data is findable, accessible, interoperable, and re-usable for multiple unanticipated uses. Contact Fran Lightsom (firstname.lastname@example.org) if you are interested in participating. More information at the SWWG Meetings page.
Sophia Liu led the monthly meeting, addressing questions or issues that people had about the Crowdsourcing and Citizen Science Report to Congress. She is working on getting the USGS contribution reviewed as it gets closer to the deadline - the final report will be submitted in January 2019 or later. Let Sophia (email@example.com) know if you would like to schedule a meeting to discuss your report in more detail before she begins the review process.
At the September 12, 2018 CDI Monthly Meeting, topics included sedimentary geology data, online Python training, the CDI request for proposals, a spatiotemporal feature registry challenge, STEP-UP student opportunities at the USGS, Bayesian networks, and the USGS Thesaurus. View the recording, slides, Q&A, and highlighted links on the meeting page.
September’s Scientist’s Challenge came from Anjali Fernandes at University of Connecticut - “do you know of an open access database that offers archival of outcrop scans (geo-referenced point clouds) & surfaces mapped on said scans, as well as geo-referenced grain-size distributions, geochemical analyses, sedimentary facies descriptions, etc.?” Initial answers include OpenTopography, Safaridb, and resources at virtualoutcrop.com.
After August’s successful foray into online learning with DataCamp’s Git tutorial, we’re going to try the Introduction to Python for Data Science module next. It is about a 4-hour commitment and I will send reminders during the period from October 3 to October 24. Read more here and sign up here.
We’ve updated the 2019 Proposals wiki space in preparation for the next round of CDI project ideas!
Sky Bristol presented a challenge in finding the appropriate and best sources for spatial features including boundaries, identifiers, and associated information. Read more and add your ideas at the ESIP-hosted IdeaScale site.
Sue Kemp presented on the experience of working with a STEP-UP student who worked remotely on a legacy data management challenge - the SageMap site. If you think your center has a STEP-UP opportunity for a student, you can submit it at this Google Form.
Erika Lentz presented some lessons learned through the ongoing conversion of a probabilistic modeling framework from proprietary to freely available open-source software. The project goal is to create a portable interactive web interface to demonstrate how interdisciplinary USGS science and models can be transformed into an approachable format for decision-makers, such as those making decisions about impacts of sea level rise.
Peter Schweitzer presented on the USGS Thesaurus: what it is, how you can use it, and how you can improve it. The USGS Thesaurus is an important resource that helps us to categorize, browse, and compare the data and science at USGS by using a controlled vocabulary. It is incorporated into multiple USGS data management tools, and is accessible here: https://www2.usgs.gov/science/about/.
Peter described opportunities to correct, refine, and extend Thesaurus concepts; create cross-walks to other controlled vocabularies; build more web services and application interfaces; and help other people use this resource effectively. The presentation led to an extensive Q&A which can be found on the meeting page. Contact Peter (firstname.lastname@example.org) if you are interested in learning more.
8/7/18 DevOps: Meet the Software Development Cluster, Migrating to Amazon Web Services at Cal Poly
8/9/18 Tech Stack: EarthSim Lightweight Python Tools
8/13/18 Data Management: Preserve
8/15/18 Citizen-Centered Innovation: Report to Congress
8/30/18 Software Development: A Deeper Dive into Git
Project Management Sync
Michelle Guy gave an overview of the CDI Software Development Cluster activities. Sharing information across different CDI collaboration areas is a great way to learn from related, but separate, groups of expertise. DevOps expressed an interest in being more informed of other CDI activities and we shared the CDI Calendar.
SysAd and Developer Sync
Paul Jurasin, Theresa May, and Ben Butler of California Polytechnic State University shared their experiences from their institution’s migration to Amazon Web Services. They stressed the importance of introducing enough training to accompany new tools, the importance of putting people first and keeping them informed during major institutional shifts in technology, and the importance of acknowledging different skills, values, and priorities of different groups of people (such as developers, systems people, and infrastructure people). Thanks to the presentation team for sharing their experiences in a major enterprise migration to the cloud.
A slide from the Cal Poly team's presentation on migration to Amazon Web Services.
Dharhas Pothina, from the US Army Engineer Research and Development Center, presented on EarthSim: lightweight Python tools for environmental simulation. EarthSim provides a set of tools that can easily be reconfigured and repurposed as needed to rapidly solve specific emerging issues. Interacting with and visualizing data in the browser makes it easier to deliver products to customers, and users can run the tools locally or on HPC.
EarthSim is a website and GitHub repo, a place to try things out and see examples. http://earthsim.pyviz.org/
See the recording on the joint CDI Tech Stack / ESIP Tech Dive website.
The Data Management Working Group focused on the “preserve” theme in August, with two presentations.
Chris Bartlett, USGS Records Officer and Chief, Information Management Branch, presented on the relationship between Records Management and Science Data.
Larry Reedy, Records Disposition Coordinator, presented on the NARA ARCIS system to submit scientific records to Federal Records Centers.
See the slides and recording on the DMWG meeting page.
A slide on records management areas from Chris Bartlett's talk.
Topics discussed at the August Citizen-Centered Innovation call included:
Overview of Report to Congress for the Crowdsourcing and Citizen Science Act (15 U.S.C. 3724)
Updated CitizenScience.gov Website
Announcements: Upcoming Conferences and Meetings
Contact Sophia Liu, email@example.com, for more details.
George Rolston from USGS Cloud Hosting Solutions shared his knowledge and enthusiasm for Git, in particular, different Git branching strategies.
He shared the following resources:
The recording is accessible from the Software Development meetings wiki page.