Highlights from the last month of CDI Collaboration Area activity:
The group took a look at the Genetics Guide to Data Release and Associated Data Dictionary, which was spearheaded by Bobbi Pierson, Alaska Science Center Geneticist, and the Genetics Metadata Working Group. They found it to be a great resource for those who need to author genetics metadata under USGS guidelines. More meeting information.
In April, the DevOps Project Manager Sync had several topics:
Update from USGS Cloud Hosting Solutions
Software management website update (Cassandra Ladino)
Zero Trust Networking: What is it? (internal link) (Tom Van Dreser)
Overview of Cloud activities at Cal Poly (Paul Jurasin)
The Zero Trust Model: Should we be taking the information security advice from Congressmen? Drawbridge Network, November 2016
Lance Everette and Tara Bell presented in the theme of “Preserve”: Taking action against USGS legacy data challenges. See recording and slides at the meeting page.
Alan Allwardt demonstrated the creation of a new set of persistent identifiers using the PURL system (https://archive.org/services/purl/). He used the example of the Data Categories of Marine Planning vocabulary, and described other use cases. (Notes)
Jeremy Fischer from Indiana University presented on "Jetstream: A free national science and engineering cloud environment on XSEDE." (video)
The GIS Community of Practice hosted a webinar on the ArcGIS Desktop to ArcGIS Pro Transition. James Sill and Stephen Zahniser of Esri gave an overview of the user interface and architecture and a demo. We received over 50 questions and comments during the presentation via sli.do and chat, and we’re working on getting the Q&A up on our wiki. A recording is available at the meeting page.
At the CDI monthly meetings, our goal is to bring you tools and information to help you do your daily work.
On April 11, 2018, we started with a review of the Reproducible Notebook Series, which began in October 2017. The series has been showcasing different examples of reproducible and executable online notebooks. These notebooks are cast as the successor to the traditional scientific paper in a recent Atlantic article that has been making the rounds: The Scientific Paper is Obsolete.
April’s reproducible notebook installment: OBIS (Ocean Biogeographic Information System) and R - Filipe Fernandes, SECOORA/IOOS (Southeast Coastal Ocean Observing Regional Association/Integrated Ocean Observing System). Filipe’s presentation used the Jupyter nbviewer, creating a presentation directly from the notebook! He showed how to connect sea turtle observation points to create possible migration paths in the Atlantic Ocean.
Screenshot from Filipe's notebook, plotting and connecting sea turtle observations.
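For readers who want a feel for the "connect the points" idea, here is a minimal sketch in Python. It is not Filipe's notebook code: the observations are made-up placeholders rather than OBIS records, and the helper names (haversine_km, build_path) are my own. It sorts observations by date, links consecutive points, and reports the great-circle length of each leg.

```python
import math
from datetime import date

# Made-up observations (date, latitude, longitude); not real OBIS records.
observations = [
    (date(2017, 6, 1), 30.0, -75.0),
    (date(2017, 3, 1), 25.0, -80.0),
    (date(2017, 9, 1), 35.0, -70.0),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def build_path(points):
    """Sort points by date and return (start_date, end_date, km) per leg."""
    ordered = sorted(points, key=lambda p: p[0])
    legs = []
    for start, end in zip(ordered, ordered[1:]):
        distance = haversine_km(start[1], start[2], end[1], end[2])
        legs.append((start[0], end[0], distance))
    return legs


path_legs = build_path(observations)
for start_date, end_date, km in path_legs:
    print(f"{start_date} -> {end_date}: {km:.0f} km")
```

A straight line between two sightings is only a possible path, of course, which is why the presentation described these as possible migration paths.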
Taxa Taxi: An automated process for using citizen science data to facilitate biodiversity monitoring (Erin Boydston and Toni Lyn Morelli)
iNaturalist citizen science observations are helping researchers understand biodiversity monitoring (after some automated data processing). iNaturalist got a thumbs up from a meeting participant as a neat mobile app to take on your hikes.
USGS Data at Risk: Expanding Legacy Data Inventory and Preservation Strategies (Lance Everette and Tara Bell)
Rescuing legacy data at the USGS remains a Herculean effort. The Legacy Data Inventory Reporting System (LDIRS) and its evaluation criteria can help the USGS address this need.
Web Mapping Application for a Historical Geologic Field Photo Collection (Sarah Nagorsen and Jason Sherba)
Need guidance for proper documentation and publication of geolocated photo collections? See the CDI-funded project on a web mapping application for photo collections.
Some highlights from March 2018 CDI collaboration area activity:
The group discussed a CDI proposal to create specifications for USGS data products so that ISO standard metadata records can be created in tools like the ADIwg metadata toolkit (mdEditor, mdTools). (Update: funded). The group also got a sneak peek at the new Data Dictionary page on the USGS Data Management Website (Update: published).
Brian Fox shared a cloud training resources wiki page.
Ross Wickman gave an update from Cloud Hosting Solutions (CHS).
Eric Martinez gave a presentation entitled Software Inventory, What it is, how it's made, and how you can make it better (internal link). More info: https://sourcecode.cio.gov/
"Zarr: A simple, open, scalable solution for big NetCDF/HDF data on the Cloud": Alistair Miles, University of Oxford. The motivation, current status and future plans for Zarr were discussed, along with a demo of basic functionality and an analogy between virtual machines and cows. (link to video)
Capturing your processing and analysis workflow in R - Alison Appling. Alison introduced tools in R for dealing with reproducibility of analysis, size and complexity of analysis, collaboration on analysis, and dissemination. (Just a sampling of tools: remake, drake, googledrive, sbtools, whisker). (slides)
R tools for modern data analyses
The group discussed potential activities for future conference calls. The group also maintains links to eDNA talks being hosted outside of the CDI on their wiki page.
Chris Johnson presented on USGS EDGE (Equipment Development Grade Evaluation): What is it, how does it apply to you, and why you may be interested in participating. You can access the recording on their meetings page if you are logged in.
We continued learning about the FAIR Data Principles - I and R stand for Interoperable and Reusable. Awareness of these principles is growing within the CDI.
Cheryl Morris gave the opening announcements, displaying the 18 CDI Proposals that moved to Phase 2, shown around the CDI Science Support Framework. For teams that did not advance to Phase 2, we always welcome further discussion about how to better frame projects with CDI principles (and we’re not just saying that). She also reminded us that the FY19 Request for Proposals is not too far away, and encouraged groups to start the discussion.
FY18 proposals advancing to full proposal stage, around the CDI Science Support Framework.
Group Announcements - A USGS Software Management Website is being planned and Cassandra Ladino (firstname.lastname@example.org) is looking for volunteers to help with the design - this includes everyone from the individual scientist developing software to large development teams. See all announcements.
Kristin Ludwig briefed us on “Science for a Risky World: A USGS Plan for Risk Research & Applications,” giving us more information about efforts around the USGS that we can join.
There were four CDI funded project presentations from last year, sharing their findings regarding making data more accessible, high throughput computing and docker containers, benefits and limitations of using Tableau for USGS data, and new technologies that allow us to “do science” in the cloud.
We circulated a very brief survey for feedback on the community voting phase of this year’s CDI Request for Proposals. Please take it if you haven’t already!
We’re trying out a new “Highlights” section on the Monthly Meeting pages that will list major links and resources presented at the meeting. These will be posted well before my blog posts!
Tsunami Evacuation Tableau app: https://geography.wr.usgs.gov/science/vulnerability/oahuEvacDashboard.html
USGS Data Life Cycle in the Cloud: https://github.com/USGS-CMG/data-life-cycle-cloud
The frequency of exciting CDI collaboration area meetings is far greater than the frequency of my writing about them. Here are some highlights from the past two months:
The DevOps group has started a Cloud Training Resources page. (Example: Amazon Web Services "What is Cloud Computing?") If you find other training opportunities, please let Brian Fox (email@example.com) know, so that he can add them to the list.
Online platforms for data analysis have arrived, as illustrated by the recent Tech Stack/Tech Dive presentations. The webinar page has links and recordings for The Pangeo Project (an open-source big data science platform), and the National Data Service Labs Workbench (a scalable platform for research data access, education, and training).
The Data Management Working Group has covered several topics including: Publishing metadata to the Science Data Catalog and Data Management Challenges (Jan 2018); Tidy data, Biological Analysis Packages, and Volunteered Geographic Information (Feb 2018).
Bonus: Read the original Tidy Data paper (Wickham, Journal of Statistical Software, 2014). Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
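To make the tidy structure concrete, here is a small sketch in plain Python (no libraries). The site names, years, and counts are invented for illustration, and to_tidy is a hypothetical helper of my own. It reshapes a "wide" table, where the year variable is spread across column headers, into tidy rows where each variable is a column and each observation is a row.

```python
# Hypothetical "wide" table: one row per site, one column per year.
wide_rows = [
    {"site": "A", "2016": 12, "2017": 15},
    {"site": "B", "2016": 7, "2017": 9},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy


tidy_rows = to_tidy(wide_rows, "site", "year", "count")
for record in tidy_rows:
    print(record)
```

In the tidy version, filtering by year or grouping by site works the same way for every column, which is the practical payoff the paper argues for.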
In February, the Semantic Web Working Group tuned in to a National Park Service presentation on the use of linked data to protect cultural heritage resources in the national parks from climate change, using the Digital Index of North American Archaeology.
In February, the Bioinformatics group covered All things microbiome. Many different groups within the USGS have some element of microbiome research - check out the USGS Fact Sheet on microbiome research for more information.
Open Source Coffee Talks decided to combine forces with the Software Development Cluster as of February 2018. The Software Development cluster discussed the topics of the USGS HPC/HTC Workshop, the USGS EDGE (Equipment Development Grade Evaluation) program (more info), and 508 Compliance (IT Accessibility) for websites and web applications. In March, Chris Johnson will give a presentation with further information on the EDGE Program.
The Citizen-Centered Innovation group held its inaugural call on February 21, 2018. Anyone interested in crowdsourcing, citizen science, civic hacking, and challenge & prize competitions is encouraged to join. Contact Sophia B Liu at firstname.lastname@example.org for more information.
The eDNA Community of Practice held its inaugural call on January 16, 2018. Calls will be held every other month in the same time slot as the Bioinformatics group (3rd Tuesday from 2-3p ET). Contact Pete Ruhl (email@example.com) for more information.
The USGS GIS community followed up on their inaugural call with a message about next steps. You can reply to this short form to log your interest in future talks and topics, including ArcGIS Pro, Serving GIS Data with ESRI, Open-source GIS topics, GIS on the cloud, and Global mapper. You can also suggest a new topic!
Whew. My next goal: Update the blog with collaboration area news in less than two months!
Kyle Enns and Cristiana Falvo from the USGS gave a presentation on "Using Python to Bring Geophysical Data to the Surface", showing the CDI another example of a way to officially share Python scripts for reproducibility. Kyle and Cristiana also shared the documents they use for Pre-review quality control, Releasing accessible Python code, and their Technical peer review checklist (log in at the meeting page to view).
The feature presentation was "Semantic web for scientific information: streamlining how we write, find, link, and reuse data and models" by Ferdinando Villa of the Basque Centre for Climate Change. After describing the challenge of data and model integration and reuse, and a project he is working on to address the problem (The Integrated Modelling Partnership, www.integratedmodelling.org), he invited us all to come join in the adventure of working together in partnership to build an integrated information landscape! You can contact him at firstname.lastname@example.org.
Ferdinando's presentation was followed by a panel discussion on the semantic web and the USGS with Ken Bagstad, Dalia Varanka, Julia Moriarty, and hosted by myself. It was clear that we need more time to learn from each other about the challenges and opportunities of the semantic web!
You can view the recording and slides on the February 2018 monthly meeting page.
Kevin Gallagher opened the January 10, 2018 meeting by announcing the CDI FY18 Request for Proposals, which was released on December 18, 2017. This year, there is a topical focus on Risk Assessment and Hazards Vulnerability.
In our Reproducible Notebook Series, Chris Sherwood presented his experience with officially publishing a Jupyter Notebook on code.usgs.gov as part of an official code release. You can see the finished product at https://code.usgs.gov/usgs/whcmsc-rdc/tree/v1.0.
Our reproducible notebook series highlights repeatable, executable, and documented methods in Jupyter Notebooks.
Brian May, who manages the USGS FOIA (Freedom of Information Act) program, presented "How the Freedom of Information Act impacts Data." Did you know that the USGS receives and processes over 200 FOIA requests a year? Brian’s talk touched on some of the more routine questions posed to the FOIA program; however, he is happy to provide more detailed trainings or discussions on the topic to a smaller group. You can reach him at email@example.com.
Finally, I gave a brief overview of the CDI Request for Proposals Process: Past, Present, and Future. This presentation was an opportunity for me to emphasize some of the unique features of our RFP, such as the community commenting and voting, encouragement to make new connections and discuss and promote in-progress ideas, and the benefits of participating in the voting process. Your vote counts, and as CDI members, you are registered voters!
The CDI has funded over 80 projects since 2010.
All slides and the recording are available on the January 10, 2018 Meeting Page.
The USGS GIS Community had a discussion on 12/19/17 about how GIS users and enthusiasts at USGS can share information and tools as a community. The importance of the topic was illustrated by the fact that the call was so well attended that we ran out of phone lines (sorry about that - recording linked below.) Shane Wright and Roland Viger led the discussion, including the current state of USGS Enterprise GIS Help. CDI helped to facilitate the call.
Participants answered polls about what open source GIS tools they use, what technical support mechanisms seemed most promising, and what are the most important needs of the GIS community over the next 5 years. This was the start of a community of practice that will help to communicate and advance GIS capabilities at the USGS. To get involved in the conversation, contact Shane (firstname.lastname@example.org) or Roland (email@example.com).
This post rounds out the 2017 CDI Collaboration Area Activity. It's been such a full year, I'm looking forward to more great topics in 2018!
Some of these topics do not really lend themselves to images, but we must have an image. So here is last month's ball of CDI Collaboration Area words:
The group discussed goals to help guide how this group could collaborate and benefit from each other (in order of priority and likelihood):
Share awareness of what is going on (software efforts, tool exploration, best practices, metadata standards)
Share lessons learned
Share configurations (software, tools, architectures, ...)
Share data, services, and/or maybe even code
Let the group leads, Michelle Guy (mguy) and Blake Draper (bdraper), know if you have specific topics or goals you’d like to see addressed. Software Development Cluster Page
The group talked about a specific field in the USGS data release metadata: that pesky data quality information. The Data Quality field is challenging because many metadata creators and reviewers are not sure what to put there, and many times there is no useful content in that field. Madison Langseth brought up a current effort to compile Data Quality Documentation Examples. See the rest of the discussion at the Metadata Reviewers page.
DevOps had three presentations across its Project Management and SysAd and Developer calls:
SCAPE (Secure Cloud Analytic Processing Environment): A Framework for adaptable and secure analysis of streaming data. (Ginny Cevasco - Booz Allen Hamilton)
GHSC (Geologic Hazards Science Center) experience with an Agile Contract (Lynda Lastowka, USGS). Shared link on agile contracts in government.
CHS (Cloud Hosting Solutions) Cloudfront/WAF service (Jonathan Russo - CHS)
The focus was the Data Management Theme: Acquire. Brian Reece spoke on the topic "Data integration, fiscal accountability, and the 'business of science.'" He presented an evolving suite of web services and procedures that improve the ability to access and integrate data from Bureau systems such as BASIS+ (used to track projects and financial info), FBMS (tracks agreements and sales), and IPDS (used to track publications). Data Management WG page
"Mini-Hack-Session: Developing and extending Jupyter Widgets": Jason Grout, Bloomberg. Jason walked through the thought and technical processes involved with developing new widget capability. See the recording. Tech Stack WG page.
The December Open Source topic was code inventories and metadata. Eric Martinez has been working on leveraging open APIs to aggregate code.json files from individual USGS projects into a software inventory compatible with code.gov. Eric was unavailable at this month's call, so Cian Dawson volunteered to talk about the Water Mission Area activities and the Software IM instead. The Software IM is currently under heavy revision by the Fundamental Science Practices Advisory Committee and any feedback is welcome. (See details at the first comment on this page.) Open Source Coffee Talks page.
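As a rough illustration of the aggregation idea, here is a minimal Python sketch. It is not Eric's implementation. It assumes each project's code.json is a JSON document with a "releases" list whose entries carry "name" and "repositoryURL" fields; treat those field names, the example.org URLs, and the build_inventory helper as assumptions for illustration, and check the code.gov schema for the real requirements.

```python
import json

# Hypothetical per-project code.json documents, inlined as strings so the
# sketch runs without network access. Field names are assumptions.
project_documents = [
    '{"releases": [{"name": "project-a", "repositoryURL": "https://example.org/a"}]}',
    '{"releases": [{"name": "project-b", "repositoryURL": "https://example.org/b"},'
    ' {"name": "project-a", "repositoryURL": "https://example.org/a"}]}',
]


def build_inventory(documents, agency):
    """Merge the releases from several code.json documents into one inventory."""
    seen = set()
    releases = []
    for text in documents:
        for release in json.loads(text).get("releases", []):
            key = release.get("repositoryURL")
            if key is not None and key in seen:
                continue  # skip a release already listed by another project
            seen.add(key)
            releases.append(release)
    # Sort by name so the combined inventory is stable between runs.
    releases.sort(key=lambda r: r.get("name", ""))
    return {"agency": agency, "releases": releases}


inventory = build_inventory(project_documents, agency="EXAMPLE")
print(json.dumps(inventory, indent=2))
```

In practice the documents would be fetched from each project's repository rather than inlined, and the combined file would need whatever additional fields the inventory schema requires.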
Here's another installment of all the topics being explored in the CDI Collaboration Areas. I'll get up to date yet!
The Software Development group discussed how people use GitHub or other version control systems, including how release schedules are set and when in the development cycle releases begin. Eric Martinez led this conversation with a presentation on how the GHSC (Geologic Hazards Science Center) is using GitLab. Slides available to Dept. of Interior users.
Examples of using GitLab
The Metadata Reviewers group had earlier decided to learn together about different types of specialized metadata. Pai and Erika shared examples of using the Biological Data Profile for data from Sea Otter Surveys. (See Western Ecological Research Center Approved Data Releases) Read more.
The DevOps meetings continue to bring us explanations of new and evolving capabilities available to groups in the USGS, as well as opportunities for me to learn new acronyms.
Announcing CHS CDN/WAF Service (Cloud Hosting Solutions) (Content Delivery Network) (Web Application Firewall) (Jonathan Russo). This is a managed service intended for people who have a public-facing, internally hosted site and want to use Cloudfront.
GIT Hosting and Version Control (George Rolston). George presented code.chs.usgs.gov and gitlab-ci which is currently running and available for use. If you are not aware of what gitlab-ci is, it is a great time to learn how you can automate your builds with nothing more than a commit to master on code.chs.usgs.gov (CI = Continuous Integration)
This is the place you can go to learn about user stories for a USGS triple store, picking a system of persistent identifiers for linked data components, and choosing between 303 URIs and hash URIs. We are all learning together!
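If the 303-versus-hash question is new to you, this small Python sketch shows the mechanical difference. The example.org identifiers and the /id/ to /doc/ mapping are hypothetical conventions chosen for illustration, not anything the group has adopted. With a hash URI, the client drops the fragment and fetches the containing document; with a 303 URI, the server answers a request for the thing's identifier with a "303 See Other" redirect to a document about it (simulated here, with no real HTTP involved).

```python
from urllib.parse import urldefrag

# Hypothetical identifiers for the same vocabulary term.
HASH_URI = "https://example.org/vocab#Estuary"
SLASH_URI = "https://example.org/id/Estuary"


def resolve_hash_uri(uri):
    """A client drops the fragment before making the HTTP request."""
    document_url, fragment = urldefrag(uri)
    return document_url, fragment


def resolve_303_uri(uri):
    """Simulate a server answering 303 See Other with a document URL.

    The /id/ -> /doc/ mapping is an assumed convention for this sketch.
    """
    return 303, uri.replace("/id/", "/doc/", 1)


print(resolve_hash_uri(HASH_URI))
print(resolve_303_uri(SLASH_URI))
```

The trade-off follows from the mechanics: hash URIs need no server configuration but return the whole document for every term, while 303 URIs cost an extra round trip and can point each term at its own description.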
"Jupyter Widgets": Jason Grout, Bloomberg. Jupyter Widgets (aka ipywidgets) enables building interactive GUIs for Python code using standard form controls (sliders, dropdowns, textboxes, etc.), as well as providing a framework for building complex interactive controls such as interactive 2d graphs, 3d graphics, maps, and more.
The focus was the Data Management Theme: Plan, and the group welcomed speakers on three topics:
Guidance on how to release USGS model output files – Fran Lightsom
Examples of building data management plans as code – Sky Bristol
Data Management activities in the Water Mission Area – Linda Debrewer
See the slides at the DMWG meeting page.
Estimating Software Development Tasks. Discussion: "What approach has worked to best determine when a (software development) task will be completed on time, within scope, and within budget? Single point estimating? Three Point Estimating? Story Point Estimating? 50%-90% Estimating? Padding your initial thought by a factor of 2,4,8 estimating?" The group discussed these options and also created a new #projectmanagement slack channel on USGS slack. (If you do not have a Slack account, email Paul Moreland (firstname.lastname@example.org) and he will get you set up.)
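For anyone who has not met three-point estimating, here is a minimal Python sketch of one common variant, the PERT weighting, which averages an optimistic, most likely, and pessimistic guess with the most likely value counted four times. The function name and the example numbers are my own, and this is only one of the approaches the group discussed, not a recommendation from the call.

```python
def three_point_estimate(optimistic, most_likely, pessimistic):
    """Return (expected, standard_deviation) using the PERT weighting."""
    if not optimistic <= most_likely <= pessimistic:
        raise ValueError("expected optimistic <= most_likely <= pessimistic")
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    standard_deviation = (pessimistic - optimistic) / 6
    return expected, standard_deviation


# Hypothetical task: 3 days best case, 5 days most likely, 13 days worst case.
expected_days, spread_days = three_point_estimate(3, 5, 13)
print(f"expected: {expected_days:.1f} days, spread: {spread_days:.1f} days")
```

Note how the long pessimistic tail pulls the expected value above the most likely guess, which is roughly what "padding your initial thought" tries to do by feel.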
Learn more about the group at their wiki page.
Announcements included some teasers for the FY18 CDI Request for Proposals. We hope to release the guidance for the proposals process in December; in the meantime, you can check out the current proposals page to prepare.
Sophia Liu, who is on Mission Assignment to FEMA, presented on Leveraging Crowdsourcing in FEMA-led Response Efforts.
Colin Talbert showed us some awesome notebook capabilities in the Reproducible Notebook Series: Notebooks as a Data Management Superpower. These included examples of batch metadata propagation and upload, and a way to visualize a summary of a Science Center’s records in the USGS internal publication system.
Demo of an app that shows a timeline and different status for publications in the USGS internal publication system.
Michelle Guy presented on National Earthquake Information Center: Overview of real-time data acquisition, processing, and archive.
NEIC data flow.
Lynda Lastowka presented on National Earthquake Information Center: Data-first concept for presentation and delivery.
The data-first approach (providing quality data that can be then used in a variety of ways by users) supports minimum viable products and early adopters. Data is presented for both human and programmatic users.
See more at https://earthquake.usgs.gov/
Highlights from Q/A:
I’m a bit behind on showing off all of the different topics being explored in CDI, but here is the next installment!
Wait, what’s a collaboration area? Collaboration Area is just our new term that includes both our familiar working groups and other groups with different communication formats (like Slack and Google Hangouts). We won’t get upset if you keep saying Working Group.
You can always email email@example.com for more information on a particular group or to request to join a group’s announcement list. Also, let us know if we missed your activity!
The group is planning metadata training for the USGS. They discussed the results of their metadata training priority survey, typical metadata shortcomings, and ideal activities and outcomes from the viewpoint of metadata reviewers. Contact: Fran Lightsom. Read more!
Open Shift Demo (Chuck Svoboda, OpenShift Practice Lead, Public Sector)
A quick baseline on DevOps and containerized platforms, an overview of OpenShift, and how Red Hat solutions enable DevOps through a trusted software supply chain. The presenter also compared and contrasted capabilities/features between OpenShift and PCF (Pivotal Cloud Foundry).
What's OpenShift? Develop, Deploy, and Manage Your Containers
Automating ESRI Services with Jenkins (Robert Djurasaj)
Presentation on using Jenkins to automate complex workflows for delivering and publishing latest data and service updates.
What is Jenkins? A self-contained, open source automation server which can be used to automate all sorts of tasks related to building, testing, and delivering or deploying software.
Access the recordings from the DevOps Page. Contact: Brian Fox
The group looked at the user stories developed in September and discussed paths forward. They discussed the concept of a data dictionary element database: It provides descriptions that you could use in metadata for data fields for the things you have measured, or are going to measure.
SWWG Meetings. Contact: Fran Lightsom
"Research Workspace: A web-based tool for data sharing, documentation, analysis, and publication": Rob Bochenek, Axiom Data Science.
Research Workspace is a web-based tool designed to support collaborative science and data management tasks throughout the data lifecycle.
See the video. Contact: Rich Signell
Managing Data with Partners – Donn Holmes (Western Ecological Research Center; San Diego Field Station)
Data Management Planning in NCCWSC – Emily Fort (NCCWSC; Reston)
Presentations included discussion of basic questions for the data management planning stage: Who are the users? What direction is the content flowing? What security is needed?
Meeting page. Contacts: Viv Hutchison and Cassandra Ladino
The group heard about ongoing efforts led by the Alaska Science Center to provide templates and guidance for genetics data release. Speaker: Barbara (Bobbi) Pierson.
Meeting notes. Contacts: Robert (Scott) Cornman, Denise Akob, Chris Kellogg.
The group didn’t hold a meeting in October, but tried out the Tricider app for voting on new ideas each month, setting reminders and deadlines, and providing a cleaner presentation of ideas. Tricider doesn't require authentication to vote or suggest ideas.
Learn more about the Open Source group. Contact: Cassandra Ladino
At the October 11, 2017 Monthly Meeting, we had our first episode of the Reproducible Notebook Series. These notebooks, rather than being college-ruled and spiral bound, are a web-based interactive computing platform where you can execute blocks of code and view results. In these segments we find examples of notebooks doing useful things (e.g., accessing data from a database and visualizing them) and give you a demo.
Rich Signell demonstrated his Dust Bowl Notebook. At the link you can click on the .ipynb file to see the notebook, or click the "launch binder" button to execute the notebook!
Daniel Pearson, Joe Vrabel, and Ramona Neafie from the Texas Water Science Center presented "From API to Apps: USGS Texas Water Science Center Web Development Approaches and Analytics."
Find out more about their group's products, including the Texas Water Dashboard, Water-On-the-Go, and Graphing Water Information System (GWIS) at https://webapps.usgs.gov/.
Check out the CDI calendar for future group meetings. Groups are open to all.
September’s DMWG topics were:
DMWG updates and introduction to data management for integrated science, Cassandra Ladino, USGS
Overview of DMBOK v2 (Data Management Body of Knowledge), Lowell Fryman, Collibra
See more at their meeting page.
Updated Knowledge areas in the Data Management Body of Knowledge v.2.
Topics discussed at the DevOps group calls on September 12 included:
Cloud Hosting Solutions Docker Managed Service (Jonathan Russo)
Cloud Hosting Solutions Overview and Road to a Test/Dev Environment (Courtney Owens, Eric Larson, Emma Sirr)
Terraform (Ivan Fetch)
Automating SSL Certificate Creation (Shawn Noble, WMA)
This month SWWG developed user stories for at least two potential future projects: a permanent USGS triple store and a USGS database of data dictionary elements. Next steps: clean up the stories, identify interested development team members and real customers, look for resources.
In September the group heard a presentation on JupyterHub and JupyterLab Developments by Brian Granger of Cal Poly.
JupyterLab will one day replace the current Jupyter Notebook interface. It’s in alpha preview.
Real-time markdown updates
Automatic block detection
Ability to work with a .csv file of 1.3 million rows (smooth scrolling)
Drag and drop cells from one notebook to another
“Hide all code” in a notebook
Single document mode: cmd-shift-enter to FOCUS on one document
Extension: integration with Google Drive - double click to open, will have full real-time editing options of Google Drive
The Software Development Cluster had its first meeting with an informal "coffee-talk" style of a gathering via webex and phone on September 28. Everyone was welcome to bring questions and topics of interest on anything related to software development and operations at the USGS.
Topics covered included:
Software repository requirements and recommendations
Software releases and DOIs
Credit for original authors, especially as we work in public domain? We can request, encourage, we cannot enforce.
Updates on Code.usgs.gov
Q: What's a "collaboration area"?
A: "Collaboration area" is just a broader way to describe all of the subgroups in CDI that have formed around member interests. The groups have a wide range of goals and meeting styles, so we are refraining from calling them all "working groups." However, if you are used to thinking of all of the CDI subgroups as working groups, this is essentially what collaboration areas are.
The September 13, 2017 CDI monthly meeting happened in the wake of major hurricanes and earthquakes, and in the midst of a severe wildfire season in the western U.S. We heard from USGS speakers who lead efforts and applications that work with hazards data.
Joan Gomberg presented on the USGS plan to reduce risk where tectonic plates collide, and posed the question of how her group could engage with the CDI. (See the USGS circular.) We are now exploring the best way to help facilitate their group in the CDI Earth Science Themes Working Group.
Elizabeth Lile and Jodi Riegle gave an overview of the GeoMAC wildfire application, a multi-agency project that displays near real-time information on fire perimeters.
Blake Draper demoed the USGS Flood Event Viewer which allows users in the public to access flood data associated with events like specific hurricanes.
In the opening Scientist’s Challenge, we brought up the topic of getting started with reproducible notebooks and R Shiny apps, two tools that are helping to improve the way processes are documented and visualizations are shared! We welcome suggestions on specific topics about these tools - make a note on our forum or send an email to firstname.lastname@example.org.
In our interactive segment, we heard from the audience about our preferred learning style for new tools. There wasn’t an overwhelming winner: most respondents preferred to learn in a group and then work alone, a close second was the group that preferred hands-on sessions with experts, and a smaller group preferred to learn on their own. We’ll try to offer a variety of methods when providing training resources. It’s always great to hear from the community!