Data Management Working Group, 9/10/18 - connecting existing data assets and our new USGS websites
Semantic Web Working Group, 9/13/18 - potential future plans with the USGS Thesaurus and with FAIR data
Citizen-Centered Innovation Monthly Meeting, 9/17/18 - the upcoming Crowdsourcing and Citizen Science Report to Congress
Lance Everette presented information to help update data managers and web masters on how to use new tools that are available for connecting the Science Data Catalog, ScienceBase, and the new USGS websites.
Raad Saleh (firstname.lastname@example.org) from EROS sought ideas and examples of processes for transitioning research data to operational data.
Presentations given by Lance Everette and Raad Saleh can be found at the meeting page.
A small group talked about the history, status, and future plans of the working group. Some potential future plans included working on the USGS Thesaurus, as presented at the September CDI Monthly Meeting, and activities to make USGS data more consistent with the FAIR Data Principles – not just focusing on integrating data to support a particular use, but improving our data practices so that all USGS data is findable, accessible, interoperable, and re-usable for multiple unanticipated uses. Contact Fran Lightsom (email@example.com) if you are interested in participating. More information at the SWWG Meetings page.
Sophia Liu led the monthly meeting, addressing questions or issues that people had about the Crowdsourcing and Citizen Science Report to Congress. She is working on getting the USGS contribution reviewed as it gets closer to the deadline - the final report will be submitted in January 2019 or later. Let Sophia (firstname.lastname@example.org) know if you would like to schedule a meeting to discuss your report in more detail before she begins the review process.
Learn more at the CDI Collaboration Area Page.
View the CDI Calendar to see upcoming meetings.
At the September 12, 2018 CDI Monthly Meeting, topics included sedimentary geology data, online Python training, the CDI request for proposals, a spatiotemporal feature registry challenge, STEP-UP student opportunities at the USGS, Bayesian networks, and the USGS Thesaurus. View the recording, slides, Q&A, and highlighted links on the meeting page.
September’s Scientist’s Challenge came from Anjali Fernandes at University of Connecticut - “do you know of an open access database that offers archival of outcrop scans (geo-referenced point clouds) & surfaces mapped on said scans, as well as geo-referenced grain-size distributions, geochemical analyses, sedimentary facies descriptions, etc.?” Initial answers include OpenTopography, Safaridb, and resources at virtualoutcrop.com.
After August’s successful foray into online learning with DataCamp’s Git tutorial, we’re going to try the Introduction to Python for Data Science module next. It is about a 4-hour commitment, and I will send reminders during the period October 3 to October 24. Read more here and sign up here.
We’ve updated the 2019 Proposals wiki space in preparation for the next round of CDI project ideas!
Sky Bristol presented a challenge in finding the appropriate and best sources for spatial features including boundaries, identifiers, and associated information. Read more and add your ideas at the ESIP-hosted IdeaScale site.
Sue Kemp presented on the experience working with a STEP-UP student to remotely work on a legacy data management challenge - the SageMap site. If you think your center has a STEP-UP opportunity for a student, you can submit it at this Google Form.
Erika Lentz presented some lessons learned through the ongoing conversion of a probabilistic modeling framework from proprietary to freely available open-source software. The project goal is to create a portable interactive web-interface to demonstrate how interdisciplinary USGS science and models can be transformed into an approachable format for decision-makers, such as those making decisions about impacts of sea level rise.
Peter Schweitzer presented on the USGS Thesaurus: what it is, how you can use it, and how you can improve it. The USGS Thesaurus is an important resource that helps us to categorize, browse, and compare the data and science at USGS by using a controlled vocabulary. It is incorporated into multiple USGS data management tools, and is accessible here: https://www2.usgs.gov/science/about/.
Peter described opportunities to correct, refine, and extend Thesaurus concepts; create cross-walks to other controlled vocabularies; build more web services and application interfaces; and help other people use this resource effectively. The presentation led to an extensive Q&A which can be found on the meeting page. Contact Peter (email@example.com) if you are interested in learning more.
8/7/18 DevOps: Meet the Software Development Cluster, Migrating to Amazon Web Services at Cal Poly
8/9/18 Tech Stack: EarthSim Lightweight Python Tools
8/13/18 Data Management: Preserve
8/15/18 Citizen-Centered Innovation: Report to Congress
8/30/18 Software Development: A Deeper Dive into Git
Project Management Sync
Michelle Guy gave an overview of the CDI Software Development Cluster activities. Sharing information across different CDI collaboration areas is a great way to learn from related, but separate, groups of expertise. DevOps expressed an interest in being more informed of other CDI activities and we shared the CDI Calendar.
SysAd and Developer Sync
Paul Jurasin, Theresa May, and Ben Butler of California Polytechnic State University shared their experiences from their institution’s migration to Amazon Web Services. They stressed the importance of introducing enough training to accompany new tools, the importance of putting people first and keeping them informed during major institutional shifts in technology, and the importance of acknowledging different skills, values, and priorities of different groups of people (such as developers, systems people, and infrastructure people). Thanks to the presentation team for sharing their experiences in a major enterprise migration to the cloud.
A slide from the Cal Poly team's presentation on migration to Amazon Web Services.
Dharhas Pothina, from the US Army Engineer Research and Development Center, presented on EarthSim: lightweight Python tools for environmental simulation. EarthSim provides a set of tools that can easily be reconfigured and repurposed as needed to rapidly solve specific emerging issues. Interacting with and visualizing data in the browser makes it easier to deliver products to customers, and users can run the tools locally or on HPC.
EarthSim is a website and GitHub repo, a place to try things out and see examples. http://earthsim.pyviz.org/
See the recording on the joint CDI Tech Stack / ESIP Tech Dive website.
The Data Management Working Group focused on the “preserve” theme in August, with two presentations.
Chris Bartlett, USGS Records Officer and Chief, Information Management Branch, presented on the relationship between Records Management and Science Data.
Larry Reedy, Records Disposition Coordinator, presented on the NARA ARCIS system to submit scientific records to Federal Records Centers.
See the slides and recording on the DMWG meeting page.
A slide on records management areas from Chris Bartlett's talk.
Topics discussed at the August Citizen-Centered Innovation call included
Overview of Report to Congress for the Crowdsourcing and Citizen Science Act (15 U.S.C. 3724)
Updated CitizenScience.gov Website
Announcements: Upcoming Conferences and Meetings
Contact Sophia Liu, firstname.lastname@example.org, for more details.
George Rolston from USGS Cloud Hosting Solutions shared his knowledge and enthusiasm for Git, in particular, different Git branching strategies.
He shared the following resources:
The recording is accessible from the Software Development meetings wiki page.
At our last virtual monthly meeting on August 8, 2018, we heard about the upcoming Community for Data Integration Request for Proposals, opportunities for group learning about Git, and recent data-related activities at the USGS Office of Enterprise Information.
CDI sponsor Kevin T. Gallagher thanked the current CDI Project Teams that shared their progress at the Summer Earth Science Information Partners (ESIP) meeting.
Participants at the CDI Session at the ESIP Summer Meeting in July 2018: Supporting integrated and predictive science: Community for Data Integration focus on risk assessment.
Kevin also reminded us that the next CDI Request for Proposals will be happening soon, hopefully September! Like last year, Kevin and Tim Quinn will select a theme or themes for us to work on together as a community. Kevin stressed that the CDI is aiming to help develop the capacity of the entire USGS for data integration and management through this proposals process, and therefore we should always keep an eye out for project outcomes that may be relevant to our own work. Selecting projects with wide applicability is a priority. Stay tuned for more info.
I announced that in the next month, I will complete the DataCamp Git for Data Science module, and I invite anyone interested to join me. Announcing this goal to the entire CDI is the best shot I have at actually doing it. This is a new CDI experiment in group learning. Read more at this wiki page and sign up to get weekly reminders and updates on my progress - this way we will complete the 4 hour module together.
Tim Quinn (Chief, USGS Office of Enterprise Information) and Nancy Sternberg (Senior Advisor for Strategic Planning, USGS OEI) presented their vision of Component Architecture for Integrated Science and some recent workshops that have helped them to identify next steps for implementation. The OEI oversees a tremendous amount of data and hardware in the USGS, and is working hard to modernize the entire system. In the past year, they have held workshops on High Performance Computing and High Throughput Computing, Data Storage, and Sensor Networks, to guide their plans. They also gave us insight into the data storage strategy for the next 5 years and beyond. Tim, Nancy, and Paul Exter welcomed feedback from the community on their presentation.
Insights into a strategy for storing our increasing data volumes.
Meeting recording and slides are posted on the August 2018 Meeting page.
In July, the Metadata Reviewers Community took a look at the newly proposed metadata requirements for the old "legacy" data sets that have been traditionally released on USGS web sites. The USGS Web Re-engineering Team ("WRET") is planning on using these metadata to display legacy data. Lisa Zolly introduced the topic and led the discussion.
July’s theme was Metadata! Ben Wheeler gave a brief update on Science Data Catalog (SDC), an important part of the USGS Public Access Plan.
Colin Talbert provided an overview and demo of the Metadata Wizard (version 2.0!), a tool for creating robust metadata.
JC Nelson led a discussion on data release issues, and recent developments in guidelines for data sharing agreements and software release policies and issues. Pete Ruhl is currently compiling examples of eDNA (environmental DNA)-related data releases. He can be contacted at email@example.com.
Visit the eDNA wiki page.
The group began discussions on some upcoming federal activities related to crowdsourcing, citizen science, and prizes & challenge competitions as well as upcoming meetings in DC related to crowdsourcing and citizen science. The group will discuss the same topic at next month's meeting on Wednesday, August 15 with more updated materials.
A report to Congress about federal crowdsourcing and citizen science activities is due this winter, at the end of 2018. This report is required by the American Innovation and Competitiveness Act (of which the Crowdsourcing and Citizen Science Act is a component).
A draft of the Form for Collecting CCS Projects from the White House Office of Science and Technology Policy (OSTP) and the Science and Technology Policy Institute (STPI).
For more information, contact Sophia Liu, firstname.lastname@example.org.
Carl Schroedl from the USGS Water Mission Area gave us an in-depth look at the Git Fork and Feature Branch Workflow. This method works well for his group, which incorporates code reviews in their workflow.
Read more at this blog post on using the Fork-and-Branch Git Workflow
There are countless possible workflows in Git, so adjust this one and make it work for you. Atlassian is a good resource for learning about different workflows: https://www.atlassian.com/git/tutorials/comparing-workflows
If this workflow seems too complicated for your purposes, you could simplify it by not using branches under a fork, or possibly by not using forks at all, but doing so takes away from the motivation for the workflow and its benefits for code reviews.
Q: What about branch naming conventions? A: You can use an issue code/identifier. This can help you trace back to details on the issue or motivation for the code changes.
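As a rough illustration of the branch-naming answer above, the following Python sketch builds a branch name from an issue identifier and a short summary. The helper name branch_name, the "ISSUE-123" style identifier, and the "identifier-summary" layout are all assumptions made for this example; they are not a convention described in the presentation.

```python
import re


def branch_name(issue_id: str, summary: str, max_len: int = 50) -> str:
    """Build a branch name such as 'ISSUE-123-fix-date-parsing'.

    Hypothetical helper: the 'identifier-summary' layout is an assumed
    convention, chosen only to show how an issue id can lead the name.
    """
    # Lowercase the summary and collapse every run of characters that is
    # not a letter or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    name = f"{issue_id}-{slug}" if slug else issue_id
    # Trim long names, then drop any trailing hyphen left by the cut.
    return name[:max_len].rstrip("-")


if __name__ == "__main__":
    print(branch_name("ISSUE-123", "Fix date parsing in loader"))
```

Putting the identifier first keeps related branches grouped together in listings and makes it easy to trace a branch back to the issue that motivated it.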
At the Community for Data Integration July 11, 2018 monthly meeting, we heard about two programs in the USGS, the STEP-UP program and the Cloud Hosting Solutions program.
Chris Hammond told us about the STEP-UP (Secondary Transition to Employment Program - USGS Partnership) program. STEP-UP provides employment training to young adults (ages 18-22) with cognitive and other disabilities. Despite these disabilities, they may be highly competent at certain tasks, for example data preparation tasks. The overview explained how the program works and described several success stories. The CDI has recently heard from a number of research groups that are looking for solutions for migrating legacy data or websites, so we introduced the STEP-UP program as a possible solution to investigate further. To learn more, get in touch with Chris at email@example.com.
We also heard an update on the latest services provided by USGS Cloud Hosting Solutions (CHS) from Jennifer Erxleben and Harry House. Cloud Hosting Solutions (CHS) is the required, supported, secure Cloud offering for USGS Science Centers and mission programs. Jennifer and Harry told us about CHS managed services, CHS custom services, and the sandbox environment. They also went over some example projects and costs. After the presentation, we had an active 30-minute long Q&A session, which is documented on the meeting page.
The best way to get started with CHS is to email firstname.lastname@example.org.
Things learned in the writing of this blog post: What is SNP, how do you pronounce SNP, and what USGS research involves SNPs? What is cloud.gov? Where can I find an up-to-date list of USGS science centers that are used in ScienceBase and the USGS Science Data Catalog?
Ray Obuch provided an overview of the new Department of the Interior Metadata Implementation Guide, available at https://doi.org/10.3133/tm16A1.
The group also examined the proposed FAIR metrics, mentioned at a recent CDI Monthly Meeting, which have a lot to do with metadata. (FAIR stands for findable, accessible, interoperable, reusable.)
Peter Schweitzer shared this link about a similar but different way of thinking about the problem, "5 Star Open Data," https://5stardata.info/en/.
At the DevOps Project Management Sync meeting, topics included
An update from USGS Cloud Hosting Solutions (CHS)
An update on the USGS Software Management website, which is under development (Cassandra Ladino, USGS)
A whirlwind tour of cloud.gov and the default DevOps pipeline to deploy applications to it (Andrew Burnes, 18F). Cloud.gov is a secure, fully compliant Platform as a Service (PaaS), built specifically for government work. Find out more at What is cloud.gov?
At the DevOps SysAd/Dev Sync, Dan Pilone of Element84 presented on Supporting NASA’s Earth Observing System Data and Information System (EOSDIS). “NASA's Earth Observing System Data and Information System (EOSDIS) is working towards a vision of a cloud-based, highly-flexible, ingest, archive, management, and distribution system for its ever-growing and evolving data holdings. This effort is emerging from its prototype stages and is poised to make a huge impact on how NASA manages and disseminates its nearly 30PBs Earth science data as that grows to over 300PBs in the coming years. This talk outlines the motivation for this work, presents the achievements and hurdles of the past 18 months and charts a course for the future expansion of NASA’s cloud based EOSDIS.”
Drew Ignizio presented on the data source list that is used in ScienceBase and the USGS Science Data Catalog. One way this data source list is used is to attribute official USGS data releases to their related USGS science center (the data source). As USGS science centers merge or otherwise change name, having an up-to-date authoritative list is important, not just for ScienceBase and Science Data Catalog, but for linking many other systems in the USGS. The list that Drew previewed during the talk can be accessed at https://www.sciencebase.gov/directory/organizations?displayHints=SDC_List.
Ray Obuch (USGS) provided an overview of the new Department of the Interior Metadata Implementation Guide, available at https://doi.org/10.3133/tm16A1.
Tim Crone of Lamont-Doherty Earth Observatory presented on Analysis of Massive Underwater Video Data in the Cloud using Pangeo.
Summary: An open-source environment for parallel analysis of massive (100TB) image data in the Cloud is now available via the Pangeo environment, which allows you to apply the power of the Python ecosystem from your browser. Technologies include JupyterHub, Kubernetes, Docker, and Dask distributed. Learn more about Pangeo at https://pangeo-data.github.io.
The Bioinformatics Group conducted a survey and had a discussion to gauge interest in various genome analysis topics and outlets for technical exchange. There was interest in analyzing SNP datasets and sharing knowledge about current SNP projects. See meeting notes here.
Perhaps you are wondering, “What is an SNP, and how does USGS use SNPs?”
SNP stands for single nucleotide polymorphism, and is pronounced “snip” (plural “snips”).
An SNP is a variation in a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%). (https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism).
SNPs can generate biological variation between two members of a species. Those differences can in turn influence a variety of traits such as appearance, disease susceptibility or response to drugs. (https://www.23andme.com/gen101/snps/)
USGS studies SNPs of many species. A quick Pubs Warehouse search returns recent studies on steelhead trout, wolves, prairie falcon, fungus, salmonid, and the Florida panther. See more at https://pubs.er.usgs.gov/search?q=SNPs
SNPs explanation from https://www.23andme.com/gen101/snps/
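To make the definition above concrete, here is a minimal Python sketch that scans a set of aligned sequences and reports the positions where at least two nucleotides are each present above a frequency threshold (1% by default, following the rule of thumb quoted above). The function name find_snps and the toy sequences are made up for illustration; real SNP calling works from sequencing reads and quality scores and is considerably more involved.

```python
from collections import Counter


def find_snps(sequences, min_freq=0.01):
    """Return {position: {allele: frequency}} for variable sites.

    A site is reported when at least two alleles are each present at a
    frequency above min_freq. Assumes the sequences are already aligned
    and of equal length (a simplification for this sketch).
    """
    if not sequences:
        return {}
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    total = len(sequences)
    snps = {}
    for pos in range(length):
        # Count how often each nucleotide appears at this position.
        counts = Counter(seq[pos] for seq in sequences)
        common = {a: n / total for a, n in counts.items() if n / total > min_freq}
        if len(common) >= 2:
            snps[pos] = common
    return snps


if __name__ == "__main__":
    # Made-up toy sample: four individuals, one variable site at index 2.
    sample = ["ACGTA", "ACGTA", "ACTTA", "ACTTA"]
    print(find_snps(sample))
```

The frequency threshold is what separates a polymorphism from a rare variant: a nucleotide seen in only one of 200 individuals (0.5%) would not be reported at the default setting.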
The USGS has a large and active community using Geographic Information Systems tools, including Esri ArcGIS, QGIS, Python, R, gdal, and many others. Recently, the CDI has become more involved in co-hosting presentations and activities to discuss GIS technology, enterprise solutions, and challenges.
Shane Wright, Roland Viger, and Andy Lamotte are a few of the USGS folks that are helping to coordinate this community. (Thanks!)
After gauging community interest last winter, the CDI recently helped to host a two-part series on ArcGIS Pro.
James Sill from Esri demonstrated capabilities of the new ArcGIS Pro in two presentations.
The CDI is also happy to help promote the meetings of the Alaska GIS and Data Science Webinar (contact: Evan Thoms).
You can access further information, presentation recordings, and the GIS Forum at the CDI GIS Focus Group wiki page.
June’s Python for Data Management Training Series reached over 380 participants, addressing the topics of Working with Local Files, Batch Creating and Updating Metadata, and Automation with PySB (Python tools for working with ScienceBase).
Drew Ignizio and Madison Langseth of Core Science Analytics, Synthesis, and Library, led the three 1.5 hour sessions for participants of varying levels of Python experience. The training made use of the Jupyter notebook and Python bundle that ships with the USGS Metadata Wizard 2.0. Especially helpful were the example Jupyter notebooks and data files that were supplied as course resources, allowing participants to execute code in real time, and have a copy of the code for future modification and use.
I attended the first two sessions and am getting ready to watch the recording of the third session, and I know I am not alone in telling Drew and Madison: Thanks! Great job! Very helpful! Good work! YOU WERE AMAZING! (Because those are all direct quotes from the feedback form.)
After these short sessions, I feel that I have the knowledge I need to get started with Jupyter notebooks for my own purposes.
If you missed them, you can download the course resources and watch the recordings on the course wiki site: Python for Data Management.
Behind the scenes with our excellent instructors for the Python for Data Management Training Series:
At the June 13, 2018 Monthly Meeting, CDI had a data visualization extravaganza.
Jordan Read, the chief of the Water Mission Area’s Data Science Branch, spoke on “Amplifying USGS science with timely and digestible data visualizations.” Data show that web visualizations can reach a far larger audience than formal reports. Jordan gave us a view into the process of creating time-sensitive visualizations, such as those that address incoming hurricanes. He also noted that methods that allow reproducibility are key for efficiency, and that there are benefits of communicating more frequently between different mission areas in the USGS about data visualization techniques and projects.
A team from the USGS Western Geographic Science Center presented on “Data visualization for science: comparing 3 dashboard building software packages.” Despite an ill-timed power outage at the Menlo office, Kevin Henry and Jason Sherba (and Jeff Peters in spirit) told us about their experiences with Tableau, ArcGIS Online, and PowerBI in visualizing data for hazard exposure analysis. They told us about the pros and cons, summarized in a slide that may “live on in infamy” (shown below). They stressed that the best platform will depend on your specific case, and encouraged further sharing of people’s experiences with data visualization platforms.
Other news from the monthly meeting:
To help the process of formalizing USGS data sharing agreement guidelines, JC Nelson asked for your examples of when you needed to sign data sharing agreements with another agency. Contribute here.
We polled you on tracks and topics for the 2019 CDI Workshop (June 4-7, 2019 in Boulder, CO). Top three tracks (voted by the CDI distribution list): Data visualization, data management, and data science! See more results at the monthly meeting page.
See the full meeting notes, slides, and recording at the monthly meeting page.
Disclaimer: Any use of trade, product, or firm names in this publication is for descriptive purposes only and does not imply endorsement by the U.S. Government.
USGS DevOps 5/1/2018
If you are interested in updates on Git and different task management systems like Remedy and JIRA, this is the group to watch. Eric Martinez from USGS kept us up to date with Git, and two presenters from Tasktop (a provider of software integration solutions) presented at the May 1st meeting about integrating USGS IT and System Development teams by sharing information from their ticketing systems. See the DevOps wiki space for more information.
Metadata Reviewers Community of Practice 5/7/2018
The group discussed the Genetics Guide to Data Release and Associated Data Dictionary, which was compiled by Barbara Pierson at the Alaska Science Center. See the Metadata Reviewers Meetings page for more information.
Semantic Web Working Group 5/10/2018
The group shared news and ideas, such as using the new Quality Management System (QMS) for USGS as a good starting example for developing a data dictionary database, a blog post about the Semantic Web by Ken Bagstad, and a competency framework for professional development in the use of linked data at http://explore.dublincore.net/. See the Semantic Web working group meetings page.
Tech Stack Working Group 5/10/2018
What’s new with the Network Common Data Form - Climate and Forecast?
"NetCDF-CF Advances - Simple Geometries, Swaths, and Groups" was the May topic for the Tech Stack group. Speakers were Dave Blodgett (USGS), Tim Whiteaker (UT Austin), Aleksander Jelanek (HDF Group) and Daniel Lee (EUMETSAT). Simple geometry (points, lines, and polygons) has now been accepted as part of the Open Geospatial Consortium’s NetCDF-CF specification. This is a major enhancement to a widely used standard whose utility has previously been limited to time-series of point or (raster) coverage data only. Advances on Groups and Swaths were also presented. Exciting! See the Tech Stack and ESIP Tech Dive meeting page.
Data Management Working Group 5/14/2018
Data management to support integrative, FAIR, multidisciplinary modeling: Lessons from the last decade and paths forward - Ken Bagstad, USGS.
Since 2007, the Artificial Intelligence for Ecosystem Services (ARIES) project has been developing an open-source software package, modeling language, and data repository to enable integrated, multidisciplinary environmental and Earth systems modeling (more details here http://www.integratedmodelling.org/). See the Data Management Working Group May Meeting page.
Software Development Cluster 5/31/2018
Software Licensing Aspects, Leon Foks, USGS
There are many ways to license software as open source. Using his own example of software developed for USGS, Leon Foks walked us through what he learned about placing proper licenses on USGS-produced software that may have some special considerations. It is much more interesting than just “Use CC0”. USGS software products must be in the public domain, meaning that copyright is waived. However, it is best practice to apply an approved Open Source license to the software. Usually it is advised that we apply the CC0 license. However, we learned that in some cases, it is necessary and possible to release code with a dual license, with different licenses (e.g., CC0, MIT, BSD, LGPL, GPL) applying to different parts of the code. (You would explain the intricacies of the licensing in the README.md). Maybe I’m weird, but this presentation blew my mind. See the recording on the Software Development Cluster meetings page, and those with access to the USGS GitLab instance can view an .ipynb at https://code.usgs.gov/nfoks1/Software_Licensing.
At the May 9, 2018 CDI Monthly Meeting, we heard from three FY17 CDI Funded Projects:
(Brian Reichert, Fort Collins Science Center and Becca Scully, PNAMP). The NABat database (North American Bat database), the NABat web portal, and MonitoringResources.org are now linked with two-way APIs, which will help collaborators to coordinate their sampling efforts.
(Mark Wiltermuth, Northern Prairie Wildlife Research Center). The ScienceCache application has been extended to use a more flexible data model, allowing more types of mobile app data collection. The testing phase will come later this year; contact Mark Wiltermuth (email@example.com) if interested.
(John Young, Leetown Science Center). A project on deriving vegetation metrics from lidar data provides 10m and 25m products for use by stakeholders such as staff at Shenandoah National Park.
A little bit more information about USGS lidar in the cloud from Jason Stoker:
There was a question from someone about the status of lidar in the cloud. We are in the process of replicating all lidar point cloud data that we serve on FTP in S3 as well. Our plan is to have all ~10 Trillion lidar points (110+ TB) that we have in archive on S3 by the end of the year. I believe we have a little more than half out there now, and growing.
Due to the large volumes of data (and small budgets) we only provide lidar point cloud data as a 'requester pays' option in S3. It is still free to access via FTP.
A web page with instructions can be found here:
We also announced a save-the-date for the next CDI Workshop: June 4-7, 2019 in Boulder, CO!
See the meeting recording, slides, and more details at the May Monthly Meeting Page!
Highlights from the last month of CDI Collaboration Area activity:
The group took a look at the Genetics Guide to Data Release and Associated Data Dictionary, which was spearheaded by Bobbi Pierson, Alaska Science Center Geneticist, and the Genetics Metadata Working Group. They found it to be a great resource for those that need to author genetics metadata under USGS guidelines. More meeting information.
In April, the DevOps Project Manager Sync had several topics:
Update from USGS Cloud Hosting Solutions
Software management website update (Cassandra Ladino)
Zero Trust Networking: What is it? (internal link) (Tom Van Dreser)
Overview of Cloud activities at Cal Poly (Paul Jurasin)
The Zero Trust Model: Should we be taking the information security advice from Congressmen? Drawbridge Network, November 2016
Lance Everette and Tara Bell presented in the theme of “Preserve”: Taking action against USGS legacy data challenges. See recording and slides at the meeting page.
Alan Allwardt demonstrated the creation of a new set of persistent identifiers using the PURL system (https://archive.org/services/purl/). He used the example of the Data Categories of Marine Planning vocabulary, and described other use cases. (Notes)
Jeremy Fischer from Indiana University presented on "Jetstream: A free national science and engineering cloud environment on XSEDE." (video)
The GIS Community of Practice hosted a webinar on the ArcGIS Desktop to ArcGIS Pro Transition. James Sill and Stephen Zahniser of Esri gave an overview of the user interface and architecture and a demo. We received over 50 questions and comments during the presentation via sli.do and chat, and we’re working on getting the Q&A up on our wiki. Recording is available at the meeting page.
At the CDI monthly meetings, our goal is to bring you tools and information to help you do your daily work.
On April 11, 2018, we started with a review of the Reproducible Notebook Series, started in October 2017. The series has been showcasing different examples of reproducible and executable online notebooks. These notebooks are cast as the successor to the traditional scientific paper in a recent Atlantic article that has been making the rounds: The Scientific Paper is Obsolete.
April’s reproducible notebook installment: OBIS (Ocean Biogeographic Information System) and R - Filipe Fernandes, SECOORA/IOOS (Southeast Coastal Ocean Observing Regional Association/Integrated Ocean Observing System). Filipe’s presentation used the Jupyter nbviewer, creating a presentation directly from the notebook! He showed how to connect sea turtle observation points to create possible migration paths in the Atlantic Ocean.
Screenshot from Filipe's notebook, plotting and connecting sea turtle observations.
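The idea of connecting observation points into possible paths can be sketched in a few lines. The Python example below is not Filipe's notebook (which used R and OBIS data); it is a simplified illustration that assumes each record carries an animal identifier, an ISO 8601 date, and a latitude and longitude, all of which are made-up values here. It orders each animal's observations by date and sums the great-circle distance between consecutive points.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def build_paths(observations):
    """Group (animal_id, iso_date, lat, lon) records into ordered paths."""
    by_animal = defaultdict(list)
    for animal_id, date, lat, lon in observations:
        by_animal[animal_id].append((date, lat, lon))
    # ISO 8601 date strings sort chronologically as plain text.
    return {aid: sorted(points) for aid, points in by_animal.items()}


def path_length_km(path):
    """Sum the distances between consecutive points of one path."""
    return sum(
        haversine_km(a[1], a[2], b[1], b[2]) for a, b in zip(path, path[1:])
    )


if __name__ == "__main__":
    # Hypothetical records, not real turtle observations.
    records = [
        ("t1", "2018-03-01", 30.0, -80.0),
        ("t1", "2018-01-15", 25.0, -79.0),
        ("t2", "2018-02-01", 10.0, -60.0),
    ]
    for animal, path in build_paths(records).items():
        print(animal, len(path), "points,", round(path_length_km(path)), "km")
```

Straight lines between sightings are only a possible path; an animal can travel much farther between two observations than the connecting segment suggests.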
Taxa Taxi: An automated process for using citizen science data to facilitate biodiversity monitoring (Erin Boydston and Toni Lyn Morelli)
iNaturalist citizen science observations are helping researchers understand biodiversity monitoring (after some automated data processing). iNaturalist got a thumbs up from a meeting participant as a neat mobile app to take on your hikes.
USGS Data at Risk: Expanding Legacy Data Inventory and Preservation Strategies (Lance Everette and Tara Bell)
Rescuing legacy data at the USGS remains a Herculean effort. The Legacy Data Inventory Reporting System (LDIRS) and its evaluation criteria can help the USGS address this need.
Web Mapping Application for a Historical Geologic Field Photo Collection (Sarah Nagorsen and Jason Sherba)
Need guidance for proper documentation and publication of geolocated photo collections? See the CDI-funded project on a web mapping application for photo collections.
Some highlights from March 2018 CDI collaboration area activity:
The group discussed a CDI proposal to create specifications for USGS data products so that ISO standard metadata records can be created in tools like the ADIwg metadata toolkit (mdEditor, mdTools). (Update: funded). The group also got a sneak peek at the new Data Dictionary page on the USGS Data Management Website (Update: published).
Brian Fox shared a cloud training resources wiki page.
Ross Wickman gave an update from Cloud Hosting Solutions (CHS).
Eric Martinez gave a presentation entitled Software Inventory, What it is, how it's made, and how you can make it better (internal link). More info: https://sourcecode.cio.gov/
"Zarr: A simple, open, scalable solution for big NetCDF/HDF data on the Cloud": Alistair Miles, University of Oxford. The motivation, current status, and future plans for Zarr were discussed, along with a demo of basic functionality and an analogy between virtual machines and cows. (link to video)
Capturing your processing and analysis workflow in R - Alison Appling. Alison introduced tools in R for dealing with reproducibility of analysis, size and complexity of analysis, collaboration on analysis, and dissemination. (Just a sampling of tools: remake, drake, googledrive, sbtools, whisker). (slides)
R tools for modern data analyses
The group discussed potential activities for future conference calls. The group also maintains links to eDNA talks being hosted outside of the CDI on their wiki page.
Chris Johnson presented on USGS EDGE (Equipment Development Grade Evaluation): What is it, how does it apply to you, and why you may be interested in participating. You can access the recording on their meetings page if you are logged in.