The Metadata Reviewers group met and continued to share resources for effective metadata review. Among the topics that they discussed:
Guidelines for metadata review (google doc link available to Dept of Interior users) that were first discussed in November, led by Tamar Norkin.
Sharing other Data Release Guidance resources, of which there were many but a few include:
The group also has recent Q&A posted on their Metadata Discussion Forum.
The DMWG had three very informative presentations in December!
Kelly Haberstroh – Updates about the Publications Warehouse
Dennis Walworth – Updates on ISO for USGS: content specifications and current status of ADIwg
Lisa Zolly – Updates to the Digital Object Identifier Tool
John C. Nelson hosted the first AI/ML CDI call and discussed plans for the group. Over the next several months, we will hear from different researchers around the USGS who are incorporating AI/ML techniques into their work. The group will also be a forum for questions for practitioners, such as one asked by Michelle Guy: Are people doing AI/ML work in the cloud, on local GPU hardware, or another option?
The group will stay in touch with another USGS effort focused on AI and image processing. This group was initiated in the Ecosystems Mission Area and is led by Mona Khalil. Mona held calls on 12/18 and 12/19 that focused on hearing about current activities and resources for AI and image processing.
Both groups mentioned the dl_tools lectures and toolboxes that were developed and presented by Daniel D. Buscombe and others with support from the CDI!
dl_tools method schematic from https://dbuscombe-usgs.github.io/dl_tools
The Tech Stack group invited Ian Rose (University of California, Berkeley) to demo “Developing JupyterLab Extensions.” Ian took us through a live demo of the process of building a JupyterLab extension. “In fact, the whole of JupyterLab itself is simply a collection of extensions that are no more powerful or privileged than any custom extension.”
For fun: From the JupyterLab documentation: Let's Make an xkcd JupyterLab Extension
Visit the joint Tech Stack and ESIP Tech Dive webinar page to see the next few months of topics!
Image from the JupyterLab documentation: https://jupyterlab.readthedocs.io/en/stable/developer/xkcd_extension_tutorial.html
Another month and another group of topics - stay informed!
The group, led by Tamar Norkin, had a discussion on the Guidelines for Metadata Review. They discussed ways to improve the usability of the document as an actual checklist, and what information would be good to include, such as “tips and tricks” for metadata reviewers. Looks like a great resource for anyone who is called upon to review metadata!
In addition to regular updates on the USGS Git Hosting Platform and the USGS Software Management website, in November the DevOps group heard about recent Recreation.gov activities from Shums Hoda and Martin Folkoff of Booz Allen Hamilton.
Recreation.gov is a gateway to discover America's Outdoors and more, a place for trip planning, information sharing and reservations with information from 12 federal Participating Partners.
The website is at https://www.recreation.gov. API documentation of the RESTful services for the Recreation Information Database are at https://ridb.recreation.gov/docs. Other topics covered included microservices and domain driven design, and high level architecture.
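As a rough illustration of what calling a RESTful service like this can look like, here is a minimal Python sketch that builds (but does not send) a search request. The `/api/v1/facilities` path, the `query` and `limit` parameters, and the `apikey` header are assumptions rather than details from the presentation, and `build_facilities_request` is a hypothetical helper, so check the API documentation linked above before relying on any of them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL for the RIDB API; confirm against the official docs.
RIDB_BASE = "https://ridb.recreation.gov/api/v1"


def build_facilities_request(api_key, query, limit=5):
    """Build (without sending) a GET request for a facility search.

    Hypothetical helper: the endpoint, parameter names, and header name
    are assumptions, not details taken from the DevOps presentation.
    """
    params = urlencode({"query": query, "limit": limit})
    url = f"{RIDB_BASE}/facilities?{params}"
    headers = {"apikey": api_key, "accept": "application/json"}
    return Request(url, headers=headers)


# "YOUR_API_KEY" is a placeholder; no network call is made here.
request = build_facilities_request("YOUR_API_KEY", "Yosemite")
print(request.full_url)
```

To actually send the request, you could pass it to `urllib.request.urlopen` and parse the response with `json.load`, assuming you have a valid key.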
What's the tech behind reserving your campsites at recreation.gov?
Martin Durant (Anaconda) presented on "Intake: Lightweight tools for loading and sharing data in data science projects"
Intake has a nice tag line: “Taking the pain out of data access and distribution”
Intake is a set of free, open-source Python tools that help load data from a variety of formats into familiar containers like Pandas dataframes, Xarray datasets, and more. Boilerplate data loading code can be transformed into reusable Intake plugins. Datasets can be described for easy reuse and sharing using Intake catalog files. Martin gave an overview of Intake and demonstrated its use via Jupyter Notebooks. You can check out the video here.
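To make the catalog idea concrete, here is a toy sketch using only the Python standard library. It is not Intake's API: `ToyCatalog`, `READERS`, and the entry fields (`description`, `format`, `path`) are hypothetical names invented for illustration. The point is only that a catalog entry describes a dataset once, so loading code does not have to be rewritten in every notebook.

```python
import csv
import json
import tempfile
from pathlib import Path


def read_csv(path):
    """Load a CSV file as a list of row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def read_json(path):
    """Load a JSON file into Python objects."""
    with open(path) as handle:
        return json.load(handle)


# Hypothetical registry mapping a format name to a reader function.
READERS = {"csv": read_csv, "json": read_json}


class ToyCatalog:
    """A toy stand-in for a data catalog (not Intake's real classes)."""

    def __init__(self, entries):
        self.entries = entries

    def describe(self, name):
        return self.entries[name]["description"]

    def load(self, name):
        entry = self.entries[name]
        return READERS[entry["format"]](entry["path"])


# Demo with a throwaway CSV file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sites.csv"
    csv_path.write_text("site,depth_m\nA,1.5\nB,2.0\n")
    catalog = ToyCatalog(
        {
            "sites": {
                "description": "Example sampling sites (made-up data)",
                "format": "csv",
                "path": csv_path,
            }
        }
    )
    rows = catalog.load("sites")
    print(catalog.describe("sites"), rows)
```

In Intake itself, the dataset descriptions live in catalog files and the loaders return containers such as dataframes; see the Intake documentation for the real interfaces.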
Austen Thomas presented data on a backpack-style eDNA acquisition device, including aspects of flow regulation and filter pore size. Austen also presented data on the performance of a field test for specific targets relative to conventional laboratory approaches. A paper describing some of these results is available here:
Thomas, A. C., Howard, J., Nguyen, P. L., Seimon, T. A., & Goldberg, C. S. (2018). ANDe™: A fully integrated environmental DNA sampling system. Methods in Ecology and Evolution, v. 9(6), 1379-1385. https://doi.org/10.1111/2041-210X.12994
The group had a discussion about what's happening with the FAIR Principles (here is just one explanatory website about FAIR), the CDI Proposal Process, and the CDI 2019 Workshop (June 4-7, 2019 in Boulder, CO).
In November, we heard more about the CDI Request for Proposals and commenting and voting in this year’s process. The proposals process is one of the major ways that we are able to share our ideas and comments as a community of practice. We are using new tools this year, and so far the commenting on our wiki and the voting through SimplyVoting seems to be working. All CDI members should have received a ballot on November 30 and the deadline to vote is Friday, December 14 at midnight!
USGS Director Reilly dropped by to talk about Artificial Intelligence and Machine Learning and opportunities for the USGS to capitalize on these techniques. JC Nelson and Pete Doucette will be leading a new CDI Collaboration Area in Artificial Intelligence and Machine Learning, and they are having their first meeting on December 11; more details are on the group’s wiki page.
Rob Dollison from the National Geospatial Program presented on “The new 3D Elevation Program Lidar Products and Elevation Services from the National Map.” The National Map has a new web presence, map service notifications, and several viewers to browse the data, including the National Map Viewer, Elevation Viewer, and a Lidar explorer. They are moving to a system where you don’t need to download large volumes to your local drives; instead, basic visualization, analysis, and extraction functions are available through services on an open platform.
Annie Burgess from ESIP spoke about ESIP Lab Opportunities - funding from the Earth Science Information Partners and ways that CDI members could participate. Their community and goals are very similar to the CDI, but within a larger context of other agencies and institutions. The latest ESIP Lab round closes on December 18. Check out previous projects and outputs on their webpage.
ESIP Lab - facilitating pathways for 'data people' to engage with critical developer communities.
We're taking a break from monthly meetings in December and will see you on January 9, 2019!
At the October 10, 2018 CDI monthly meeting, we heard about ongoing projects that could help us with our spatial data workflow, share solutions for the challenges of integrating incomplete and disparate data, and allow us to test and use technologies for storing and managing large volumes of data.
First, Kevin Gallagher gave us a preview of the FY19 CDI Request for Proposals themes - Biosurveillance of emerging invasive species and health threats, building national datasets, reusing previously funded CDI outputs, and enabling FAIR (Findable, Accessible, Interoperable, Reusable) data. The official Request for Proposals was released the following week and you can see the details here: https://my.usgs.gov/confluence/display/cdi/2019+Proposals
The deadline for 2-page statements of interest is November 16, 2018!
Next, I had a brief Q&A with Sky Bristol about building a spatiotemporal feature registry. This is a concept about designing and building a system for usable and repeatable processes that use spatial features. Sky is looking for feedback on how such a system can be built broadly to benefit many people. I hope to have more Q&A with CDI members and their projects in the future!
Ben Mirus from the Geologic Hazards Science Center presented on Assembling a National Scale Map of Landslide Inventories from Incomplete and Disparate Spatial Data. From his presentation, some topics that came up to explore further with CDI are: figuring out what other types of disciplinary data have this type of incomplete and disparate data (for example, species occurrence), and what is the theory about quantitatively analyzing incomplete and disparate data (for example, a dataset that is a mix of point locations and polygons of landslide scars).
Previous landslide compilation.
Matt Davis, from the Advanced Research Computing group, presented on A Cost Effective Approach to Scientific Data Storage and Management: BlackPearl and Globus. This presentation was exciting because we often get questions about how we in the USGS are supposed to meet data release requirements for large volumes of data, or even share them within a group of researchers. Here, “large” means files much greater than 10 GB. Matt let us know that YES, there are new options for storing and managing large data that are available to USGS researchers now (in beta). To get started, contact firstname.lastname@example.org and tell the Advanced Research Computing team about your data needs.
An image from Matt Davis' presentation.
Looks like October brought back collaboration area activity in full swing. Here are October’s topics and discussions in reverse chronological order!
The Data Management Working group held a special session - Wade Bishop of University of Tennessee presented his findings on a data fitness-for-use study. In his study he asked participants to consider a recent example of when they searched for data and decided if it was fit for them to (re)use. Then he asked questions related to each of the elements in the FAIR data framework (Findable, Accessible, Interoperable, Reusable). Wade provided many fine puns on “FAIR” (if that is FAIR to say) and quotes such as “Deciding if data is fit for reuse is kind of like thumping on a melon or smelling bread before you buy it.” (Maybe you had to be there?) Participant quotes provided interesting insights, such as the metadata-data disconnect - do people understand how metadata and keywords are helping them to discover or use data? Perhaps if data providers do such a good job in making data FAIR, the data consumers will not even notice, they will just happily reuse the data. Slides can be found on the DMWG meeting page.
The Software Development Cluster discussed a draft Git migration plan (link accessible by Dept of Int) for USGS. Last June, an announcement about the USGS Git Platform (link accessible on the USGS network) was distributed. Members of the Software Development Cluster are providing information to help USGS code repository owners meet the requirements on the announcement. Note that the plan is still in early draft and open to suggestions. The contact for the plan is Eric Martinez, email@example.com.
The Subduction Zone Focus Group posted notes from their October meeting, summarizing ongoing projects, new members, and other opportunities. Topics included land-level changes along the Olympic Peninsula, SZ4D Research Coordination Networks, a Cascadia Recurrence database, a Mendenhall Fellowship focused on Cascadia landslides now being advertised, automated turbidite analysis, tsunamis, and recent papers and reports from the M(agnitude)9 project.
Snapshot of a data compilation for a Cascadia 3D seismic model, summary of the locations of 34 individual controlled-source wide-angle seismic imaging experiments dating to the 1960s. (T. Brocher)
The Bioinformatics Community of Practice had a discussion about the newly released CDI Request for Proposals, including what is in scope, how to meet the 30% in-kind match, and how the two-phase selection process works. Notes can be found on the RFP Collaboration forum.
The Tech Stack group didn’t have a live meeting, but Sky Bristol made a video demonstrating some of the concepts behind a SpatioTemporal Feature Registry. The group was encouraged to ask questions about the video using our wiki page. Further discussion is at the ESIP-hosted IdeaScale ideation page.
The Semantic Web working group discussed semantic approaches to enable USGS data to be FAIR (Findable, Accessible, Interoperable, Reusable). They used the list of FAIR Principles at https://www.go-fair.org/fair-principles/, which includes links to explanations. Notes can be viewed at their meeting page.
The eDNA community of practice created a page sharing recent example data releases for environmental DNA.
In FY19, DevOps will consolidate to one meeting per month with both Project Management and SysAd/Developer Topics. Sarah Battani from Develop Intelligence gave an introduction to their DevOps Academy training opportunities.
The group took stock of the state of USGS metadata, including challenges and needs.
Fran set up a wiki page at https://my.usgs.gov/confluence/display/cdi/Metadata+Reviewers+Training+Collection as a place to share resources on Metadata Reviewers training.
Learn more at the CDI Collaboration Area Page.
View the CDI Calendar to see upcoming meetings.
Data Management Working Group, 9/10/18 - connecting existing data assets and our new USGS websites
Semantic Web Working Group, 9/13/18 - potential future plans with the USGS Thesaurus and with FAIR data
Citizen-Centered Innovation Monthly Meeting, 9/17/18 - the upcoming Crowdsourcing and Citizen Science Report to Congress
Lance Everette presented information to help update data managers and web masters on how to use new tools that are available for connecting the Science Data Catalog, ScienceBase, and the new USGS websites.
Raad Saleh (firstname.lastname@example.org) from EROS sought ideas and examples of processes for transitioning research data to operational data.
Presentations given by Lance Everette and Raad Saleh can be found at the meeting page.
A small group talked about the history, status, and future plans of the working group. Some potential future plans included working on the USGS Thesaurus, as presented at the September CDI Monthly Meeting, and activities to make USGS data more consistent with the FAIR Data Principles – not just focusing on integrating data to support a particular use, but improving our data practices so that all USGS data is findable, accessible, interoperable, and re-usable for multiple unanticipated uses. Contact Fran Lightsom (email@example.com) if you are interested in participating. More information at the SWWG Meetings page.
Sophia Liu led the monthly meeting, addressing questions or issues that people had about the Crowdsourcing and Citizen Science Report to Congress. She is working on getting the USGS contribution reviewed as it gets closer to the deadline - the final report will be submitted in January 2019 or later. Let Sophia (firstname.lastname@example.org) know if you would like to schedule a meeting to discuss your report in more detail before she begins the review process.
At the September 12, 2018 CDI Monthly Meeting, topics included sedimentary geology data, online python training, the CDI request for proposals, a spatiotemporal feature registry challenge, STEP-UP student opportunities at the USGS, Bayesian networks, and the USGS Thesaurus. View the recording, slides, Q&A, and highlighted links on the meeting page.
September’s Scientist’s Challenge came from Anjali Fernandes at University of Connecticut - “do you know of an open access database that offers archival of outcrop scans (geo-referenced point clouds) & surfaces mapped on said scans, as well as geo-referenced grain-size distributions, geochemical analyses, sedimentary facies descriptions, etc.?” Initial answers include OpenTopography, Safaridb, and resources at virtualoutcrop.com.
After August’s successful foray into online learning with DataCamp’s Git tutorial, we’re going to try the Introduction to Python for Data Science module next. It is about a 4-hour commitment and I will send reminders during the period October 3 to October 24. Read more here and sign up here.
We’ve updated the 2019 Proposals wiki space in preparation for the next round of CDI project ideas!
Sky Bristol presented a challenge in finding the appropriate and best sources for spatial features including boundaries, identifiers, and associated information. Read more and add your ideas at the ESIP-hosted IdeaScale site.
Sue Kemp presented on the experience working with a STEP-UP student to remotely work on a legacy data management challenge - the SageMap site. If you think your center has a STEP-UP opportunity for a student, you can submit it at this Google Form.
Erika Lentz presented some lessons learned through the ongoing conversion of a probabilistic modeling framework from proprietary to freely available open-source software. The project goal is to create a portable interactive web-interface to demonstrate how interdisciplinary USGS science and models can be transformed into an approachable format for decision-makers, such as those making decisions about impacts of sea level rise.
Peter Schweitzer presented on the USGS Thesaurus: what it is, how you can use it, and how you can improve it. The USGS Thesaurus is an important resource that helps us to categorize, browse, and compare the data and science at USGS by using a controlled vocabulary. It is incorporated into multiple USGS data management tools, and is accessible here: https://www2.usgs.gov/science/about/.
Peter described opportunities to correct, refine, and extend Thesaurus concepts; create cross-walks to other controlled vocabularies; build more web services and application interfaces; and help other people use this resource effectively. The presentation led to an extensive Q&A which can be found on the meeting page. Contact Peter (email@example.com) if you are interested in learning more.
8/7/18 DevOps: Meet the Software Development Cluster, Migrating to Amazon Web Services at Cal Poly
8/9/18 Tech Stack: EarthSim Lightweight Python Tools
8/13/18 Data Management: Preserve
8/15/18 Citizen-Centered Innovation: Report to Congress
8/30/18 Software Development: A Deeper Dive into Git
Project Management Sync
Michelle Guy gave an overview of the CDI Software Development Cluster activities. Sharing information across different CDI collaboration areas is a great way to learn from related, but separate, groups of expertise. DevOps expressed an interest in being more informed of other CDI activities and we shared the CDI Calendar.
SysAd and Developer Sync
Paul Jurasin, Theresa May, and Ben Butler, of California Polytechnic State University shared their experiences from their institution’s migration to Amazon Web Services. They stressed the importance of introducing enough training to accompany new tools, the importance of putting people first and keeping them informed during major institutional shifts in technology, and the importance of acknowledging different skills, values, and priorities of different groups of people (such as developers, systems people, and infrastructure people.) Thanks to the presentation team for sharing their experiences in a major enterprise migration to the cloud.
A slide from the Cal Poly team's presentation on migration to Amazon Web Services.
Dharhas Pothina, from the US Army Engineer Research and Development Center, presented on EarthSim: lightweight python tools for environmental simulation. EarthSim provides a set of tools that can easily be reconfigured and repurposed as needed to rapidly solve specific emerging issues. By interacting and visualizing data in the browser, it is easier to deliver products to customers, and allows users to run the tools locally or on HPC.
EarthSim is a website and github repo, a place to try things out and see examples. http://earthsim.pyviz.org/
See the recording on the joint CDI Tech Stack / ESIP Tech Dive website.
The Data Management Working Group focused on the “preserve” theme in August, with two presentations.
Chris Bartlett, USGS Records Officer and Chief, Information Management Branch, presented on the relationship between Records Management and Science Data.
Larry Reedy, Records Disposition Coordinator, presented on the NARA ARCIS system to submit scientific records to Federal Records Centers.
See the slides and recording on the DMWG meeting page.
A slide on records management areas from Chris Bartlett's talk.
Topics discussed at the August Citizen-Centered Innovation call included
Overview of Report to Congress for the Crowdsourcing and Citizen Science Act (15 U.S.C. 3724)
Updated CitizenScience.gov Website
Announcements: Upcoming Conferences and Meetings
Contact Sophia Liu, firstname.lastname@example.org, for more details.
George Rolston from USGS Cloud Hosting Solutions shared his knowledge and enthusiasm for Git, in particular, different Git branching strategies.
He shared the following resources:
The recording is accessible from the Software Development meetings wiki page.
At our last virtual monthly meeting on August 8, 2018, we heard about the upcoming Community for Data Integration Request for Proposals, opportunities for group learning about Git, and recent data-related activities at the USGS Office of Enterprise Information.
CDI sponsor Kevin T. Gallagher thanked the current CDI Project Teams that shared their progress at the Summer Earth Science Information Partners (ESIP) meeting.
Participants at the CDI Session at the ESIP Summer Meeting in July 2018: Supporting integrated and predictive science: Community for Data Integration focus on risk assessment.
Kevin also reminded us that the next CDI Request for Proposals will be happening soon, hopefully September! Like last year, Kevin and Tim Quinn will select a theme or themes for us to work on together as a community. Kevin stressed that the CDI is aiming to help develop the capacity of the entire USGS for data integration and management through this proposals process, and therefore we should always keep an eye out for project outcomes that may be relevant to our own work. Selecting projects with wide applicability is a priority. Stay tuned for more info.
I announced that in the next month, I will complete the DataCamp Git for Data Science module, and I invite anyone interested to join me. Announcing this goal to the entire CDI is the best shot I have at actually doing it. This is a new CDI experiment in group learning. Read more at this wiki page and sign up to get weekly reminders and updates on my progress - this way we will complete the 4 hour module together.
Tim Quinn (Chief, USGS Office of Enterprise Information) and Nancy Sternberg (Senior Advisor for Strategic Planning, USGS OEI) presented their vision of Component Architecture for Integrated Science and some recent workshops that have helped them to identify next steps for implementation. The OEI oversees a tremendous amount of data and hardware in the USGS, and is working hard to modernize the entire system. In the past year, they have held workshops on High Performance Computing and High Throughput Computing, Data Storage, and Sensor Networks, to guide their plans. They also gave us insight into the data storage strategy for the next 5 years and beyond. Tim, Nancy, and Paul Exter welcomed feedback from the community on their presentation.
Insights into a strategy for storing our increasing data volumes.
Meeting recording and slides are posted on the August 2018 Meeting page.
In July, the Metadata Reviewers Community took a look at the newly proposed metadata requirements for the old "legacy" data sets that have been traditionally released on USGS web sites. The USGS Web Re-engineering Team ("WRET") is planning on using these metadata to display legacy data. Lisa Zolly introduced the topic and led the discussion.
July’s theme was Metadata! Ben Wheeler gave a brief update on Science Data Catalog (SDC), an important part of the USGS Public Access Plan.
Colin Talbert provided an overview and demo of the Metadata Wizard (version 2.0!), a tool for creating robust metadata.
JC Nelson led a discussion on data release issues, and recent developments in guidelines for data sharing agreements and software release policies and issues. Pete Ruhl is currently compiling examples of eDNA (environmental DNA)-related data releases. He can be contacted at email@example.com.
Visit the eDNA wiki page.
The group began discussions on some upcoming federal activities related to crowdsourcing, citizen science, and prizes & challenge competitions as well as upcoming meetings in DC related to crowdsourcing and citizen science. The group will discuss the same topic at next month's meeting on Wednesday, August 15 with more updated materials.
A report to Congress about federal crowdsourcing and citizen science activities is due this winter, at the end of 2018. This report is required by the American Innovation and Competitiveness Act (of which the Crowdsourcing and Citizen Science Act is a component).
A draft of the Form for Collecting CCS Projects from the White House Office of Science and Technology Policy (OSTP) and the Science and Technology Policy Institute (STPI).
For more information, contact Sophia Liu, firstname.lastname@example.org.
Carl Schroedl from the USGS Water Mission Area gave us an in-depth look at the Git Fork and Feature Branch Workflow. This method works well for his group, which incorporates code reviews in their workflow.
Read more at this blog post on using the Fork-and-Branch Git Workflow
There are infinitely many possible workflows in Git, so adjust one and make it work for you. Atlassian is a good resource for learning about different workflows: https://www.atlassian.com/git/tutorials/comparing-workflows
If this workflow seemed too complicated for your purposes, you could simplify by not doing branches under a fork, or possibly not doing forks, but this takes away from the motivation and the benefits when getting code reviews.
Q: What about branch naming conventions? A: You can use an issue code/identifier. This can help you trace back to details on the issue or motivation for the code changes.
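The branch-naming answer and the overall flow can be sketched in Python. Both helpers below are hypothetical, the `ISSUE-42` identifier and repository URLs are placeholders, and the command list is one common reading of a fork-and-branch workflow, not necessarily the exact steps used by Carl's group.

```python
import re


def branch_name(issue_id, summary, max_len=50):
    """Build a branch name that starts with an issue identifier.

    Hypothetical convention: "<issue id>-<lowercase-hyphenated-summary>".
    """
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"{issue_id}-{slug}"[:max_len].rstrip("-")


def fork_and_branch_commands(fork_url, upstream_url, branch):
    """List the git commands for one pass through the workflow.

    Everything after the clone is meant to be run inside the cloned
    directory; editing and committing happen before the final push.
    """
    return [
        ["git", "clone", fork_url],                          # clone your fork
        ["git", "remote", "add", "upstream", upstream_url],  # track the original repo
        ["git", "fetch", "upstream"],                        # pick up upstream changes
        ["git", "checkout", "-b", branch],                   # start the feature branch
        ["git", "push", "-u", "origin", branch],             # publish the branch to your fork
    ]


name = branch_name("ISSUE-42", "Fix metadata parser!")
for command in fork_and_branch_commands(
    "https://example.org/me/project.git",    # placeholder fork URL
    "https://example.org/team/project.git",  # placeholder upstream URL
    name,
):
    print(" ".join(command))
```

After the push, the code review step is a pull (or merge) request from the fork's branch back to the upstream repository, and the issue identifier in the branch name points reviewers to the motivation for the change.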
At the Community for Data Integration July 11, 2018 monthly meeting, we heard about two programs in the USGS, the STEP-UP program and the Cloud Hosting Solutions program.
Chris Hammond told us about the STEP-UP (Secondary Transition to Employment Program - USGS Partnership) program. STEP-UP provides employment training to young adults (ages 18-22) with cognitive and other disabilities. Despite these disabilities, they may be highly competent at certain tasks, for example data preparation tasks. The overview explained how the program works and described several success stories. The CDI has recently heard from a number of research groups that are looking for solutions to migrating legacy data or websites, so we introduce the STEP-UP program as a possible solution to investigate further. To learn more, get in touch with Chris at email@example.com.
We also heard an update on the latest services provided by USGS Cloud Hosting Solutions (CHS) from Jennifer Erxleben and Harry House. Cloud Hosting Solutions (CHS) is the required, supported, secure Cloud offering for USGS Science Centers and mission programs. Jennifer and Harry told us about CHS managed services, CHS custom services, and the sandbox environment. They also went over some example projects and costs. After the presentation, we had an active 30-minute long Q&A session, which is documented on the meeting page.
The best way to get started with CHS is to email firstname.lastname@example.org.
Things learned in the writing of this blog post: What is SNP, how do you pronounce SNP, and what USGS research involves SNPs? What is cloud.gov? Where can I find an up-to-date list of USGS science centers that are used in ScienceBase and the USGS Science Data Catalog?
Ray Obuch provided an overview of the new Department of the Interior Metadata Implementation Guide, available at https://doi.org/10.3133/tm16A1.
The group also examined the proposed FAIR metrics, mentioned at a recent CDI Monthly Meeting, which have a lot to do with metadata. (FAIR stands for findable, accessible, interoperable, reusable.)
Peter Schweitzer shared this link about a similar but different way of thinking about the problem, "5 Star Open Data," https://5stardata.info/en/.
At the DevOps Project Management Sync meeting, topics included
An update from USGS Cloud Hosting Solutions (CHS)
An update on the USGS Software Management website, which is under development (Cassandra Ladino, USGS)
A whirlwind tour of cloud.gov and the default DevOps pipeline to deploy applications to it (Andrew Burnes, 18F). Cloud.gov is a secure, fully compliant Platform as a Service (PaaS), built specifically for government work. Find out more at What is cloud.gov?
At the DevOps SysAd/Dev Sync, Dan Pilone of Element84 presented on Supporting NASA’s Earth Observing System Data and Information System (EOSDIS). “NASA's Earth Observing System Data and Information System (EOSDIS) is working towards a vision of a cloud-based, highly-flexible, ingest, archive, management, and distribution system for its ever-growing and evolving data holdings. This effort is emerging from its prototype stages and is poised to make a huge impact on how NASA manages and disseminates its nearly 30PBs Earth science data as that grows to over 300PBs in the coming years. This talk outlines the motivation for this work, presents the achievements and hurdles of the past 18 months and charts a course for the future expansion of NASA’s cloud based EOSDIS.”
Drew Ignizio presented on the data source list that is used in ScienceBase and the USGS Science Data Catalog. One way this data source list is used is to attribute official USGS data releases to their related USGS science center (the data source). As USGS science centers merge or otherwise change name, having an up-to-date authoritative list is important, not just for ScienceBase and Science Data Catalog, but for linking many other systems in the USGS. The list that Drew previewed during the talk can be accessed at https://www.sciencebase.gov/directory/organizations?displayHints=SDC_List.
Ray Obuch (USGS) provided an overview of the new Department of the Interior Metadata Implementation Guide, available at https://doi.org/10.3133/tm16A1.
Tim Crone of Lamont-Doherty Earth Observatory presented on Analysis of Massive Underwater Video Data in the Cloud using Pangeo.
Summary: An open-source environment for parallel analysis of massive (100TB) image data in the Cloud is now available via the Pangeo environment, which allows you to apply the power of the Python ecosystem from your browser. Technologies include JupyterHub, Kubernetes, Docker, and Dask distributed. Learn more about Pangeo at https://pangeo-data.github.io.
The Bioinformatics Group conducted a survey and had a discussion to gauge interest in various genome analysis topics and outlets for technical exchange. There was interest in analyzing SNP datasets and sharing knowledge about current SNP projects. See meeting notes here.
Perhaps you are wondering, “What is an SNP, and how does USGS use SNPs?”
SNP stands for single nucleotide polymorphism, and is pronounced “snip” (plural “snips”).
An SNP is a variation in a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%). (https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism).
SNPs can generate biological variation between two members of a species. Those differences can in turn influence a variety of traits such as appearance, disease susceptibility or response to drugs. (https://www.23andme.com/gen101/snps/)
USGS studies SNPs of many species; a quick Pubs Warehouse search returns recent studies on steelhead trout, wolves, prairie falcon, fungus, salmonid, and the Florida panther. See more at https://pubs.er.usgs.gov/search?q=SNPs
SNPs explanation from https://www.23andme.com/gen101/snps/
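To make the definition above concrete, here is a minimal Python sketch that scans a handful of aligned sequences and reports the positions where at least two different nucleotides each exceed a frequency threshold. The function name find_snps, the toy sequences, and the 30% threshold are all hypothetical and chosen only for illustration (the definition above uses about 1%, which requires far more than five individuals to be meaningful). Real SNP studies work from sequencing reads and specialized variant-calling tools, not from code like this.

```python
from collections import Counter


def find_snps(sequences, min_freq=0.01):
    """Return {position: {nucleotide: frequency}} for every position where
    at least two nucleotides each occur with frequency above min_freq.

    Hypothetical helper for illustration; assumes the sequences are already
    aligned and of equal length.
    """
    if not sequences:
        return {}
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    n = len(sequences)
    snps = {}
    for pos in range(length):
        # Count how often each nucleotide appears at this position.
        counts = Counter(seq[pos] for seq in sequences)
        # Keep only the variants that are common enough in the population.
        common = {base: c / n for base, c in counts.items() if c / n > min_freq}
        if len(common) >= 2:
            snps[pos] = common
    return snps


# Toy "population" of five aligned sequences (made-up data).
population = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC", "ACCTAT"]

# With a 30% threshold, position 2 (G in three individuals, C in two) counts
# as an SNP, while position 5 (T in only one individual) does not.
snps = find_snps(population, min_freq=0.3)
print(snps)
```

Lowering min_freq makes rarer variants count as SNPs; in this toy data set a lower threshold would also flag position 5.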
The USGS has a large and active community using Geographic Information Systems tools, including Esri ArcGIS, QGIS, Python, R, GDAL, and many others. Recently, the CDI has become more involved in co-hosting presentations and activities to discuss GIS technology, enterprise solutions, and challenges.
Shane Wright, Roland Viger, and Andy Lamotte are a few of the USGS folks who are helping to coordinate this community. (Thanks!)
After gauging community interest last winter, the CDI recently helped to host a two-part series on ArcGIS Pro.
James Sill from Esri demonstrated capabilities of the new ArcGIS Pro in two presentations.
The CDI is also happy to help promote the meetings of the Alaska GIS and Data Science Webinar (contact: Evan Thoms).
You can access further information, presentation recordings, and the GIS Forum at the CDI GIS Focus Group wiki page.
June’s Python for Data Management Training Series reached over 380 participants, addressing the topics of Working with Local Files, Batch Creating and Updating Metadata, and Automation with PySB (Python tools for working with ScienceBase).
Drew Ignizio and Madison Langseth of Core Science Analytics, Synthesis, and Library led the three 1.5-hour sessions for participants of varying levels of Python experience. The training made use of the Jupyter notebook and Python bundle that ships with the USGS Metadata Wizard 2.0. Especially helpful were the example Jupyter notebooks and data files that were supplied as course resources, which allowed participants to execute code in real time and keep a copy of the code for future modification and use.
I attended the first two sessions and am getting ready to watch the recording of the third session, and I know I am not alone in telling Drew and Madison: Thanks! Great job! Very helpful! Good work! YOU WERE AMAZING! (Because those are all direct quotes from the feedback form.)
After these short sessions, I feel that I have the knowledge I need to get started with Jupyter notebooks for my own purposes.
If you missed them, you can download the course resources and watch the recordings on the course wiki site: Python for Data Management.
Behind the scenes with our excellent instructors for the Python for Data Management Training Series:
At the June 13, 2018 Monthly Meeting, CDI had a data visualization extravaganza.
Jordan Read, the chief of the Water Mission Area’s Data Science Branch, spoke on “Amplifying USGS science with timely and digestible data visualizations.” Data show that web visualizations can reach a far larger audience than formal reports. Jordan gave us a view into the process of creating time-sensitive visualizations, such as those that address incoming hurricanes. He also noted that methods that allow reproducibility are key for efficiency, and that there are benefits of communicating more frequently between different mission areas in the USGS about data visualization techniques and projects.
A team from the USGS Western Geographic Science Center presented on “Data visualization for science: comparing 3 dashboard building software packages.” Despite an ill-timed power outage at the Menlo office, Kevin Henry and Jason Sherba (and Jeff Peters in spirit) told us about their experiences with Tableau, ArcGIS Online, and PowerBI in visualizing data for hazard exposure analysis. They told us about the pros and cons, summarized in a slide that may “live on in infamy” (shown below). They stressed that the best platform will depend on your specific case, and encouraged further sharing of people’s experiences with data visualization platforms.
Other news from the monthly meeting:
To help the process of formalizing USGS data sharing agreement guidelines, JC Nelson asked for your examples of when you needed to sign data sharing agreements with another agency. Contribute here.
We polled you on tracks and topics for the 2019 CDI Workshop (June 4-7, 2019 in Boulder, CO). Top three tracks (voted by the CDI distribution list): Data visualization, data management, and data science! See more results at the monthly meeting page.
See the full meeting notes, slides, and recording at the monthly meeting page.
Disclaimer: Any use of trade, product, or firm names in this publication is for descriptive purposes only and does not imply endorsement by the U.S. Government.