August 11, 2021 - Hazardous substance data and Communicating about data
The Community for Data Integration (CDI) meetings are held the 2nd Wednesday of each month from 11:00 a.m. to 12:30 p.m. Eastern Time.
Connection information is sent to the CDI mailing list.
Meeting Recording and Slides
Recordings and slides are available to CDI Members approximately 24 hours after the completion of the meeting.
These are the publicly available materials; log in to view the meeting resources. If you would like to become a member of CDI, join at https://listserv.usgs.gov/mailman/listinfo/cdi-all.
During the call, you can ask and up-vote questions at slido.com, event code #CDIAUG.
Agenda (in Eastern time)
11:00 am Welcome and Opening Announcements
11:15 am Collaboration Area Announcements
11:25 am Data Integration activities in the Superfund Research Program - Michelle Heacock, NIH
11:50 am CDI Pop-Up Lab - communicating about data
Data science mentors - Sheree Watson, USGS
Paths to computational fluency - Richie Erickson, USGS (Link to article)
Communicating between data managers and researchers - Madison Langseth, USGS
Creating map services from a data release - Daniel Wieferich, USGS
Other questions from the community
12:30 pm Adjourn
Data Integration activities in the Superfund Research Program
Michelle Heacock, National Institutes of Health
The NIEHS (National Institute of Environmental Health Sciences) Hazardous Substance Basic Research and Training Program (Superfund Research Program [SRP]) provides practical, scientific solutions to protect health, the environment, and communities. As part of NIEHS, an Institute of the National Institutes of Health, SRP works to learn more about ways to protect the public from exposure to hazardous substances, such as industrial solvents, arsenic, lead, and mercury. These and other toxic substances are found in contaminated water, soil, and air at hazardous waste sites throughout the United States. SRP funds university-based grants on basic biological, environmental, and engineering processes to find real and practical solutions to exposures to hazardous substances. These grants include Multiproject Centers, which are required to include a Data Management and Analysis Core (DMAC) to support the management and integration of data assets. The DMACs are intended to foster and enable the interoperability of data across the Center's projects and cores to accelerate the impact of the Center's research.
Michelle Heacock is a health science administrator who oversees Superfund Research Program (SRP) grants that span basic molecular mechanisms of biological responses from exposures to hazardous substances, movement of hazardous substances through environmental media, detection technologies, and remediation approaches. Michelle received her doctorate from Texas A&M University in College Station, Texas, for her work on the interplay between DNA repair proteins and telomeres. Her postdoctoral work was conducted at NIEHS, where she studied the DNA repair pathway known as base excision repair. Her research focused on understanding the causes of cellular toxicity induced by DNA-damaging agents.
Welcome and Opening Announcements
- New Resources
- Glosario - A Data Science glossary (and template for creating your own glossary) https://carpentries.org/blog/2020/07/announcing-glosario/.
- A newly released book that covers a lot of skills for “doing science” https://carpentries.org/blog/2021/07/pyrse-book/
- Kevin Gallagher comments
- CDI Request for Proposals Preview
- In about a month, we'll be releasing guidance on the process; think of your ideas for proposals now!
- The process includes Statement of Interest submissions, lightning presentations, and community commenting and voting; highly supported statements will receive an invitation to submit full proposals.
- Looking for collaborative products, projects that normally wouldn't be funded through a specific program area, etc.
- Wiki page about CDI Proposals: Proposals
- Question from the community:
- Has there been a push to modernize our suite of standard Position Descriptions at USGS to reflect more contemporary positions? Most are decades old.
- Tim Quinn: Excellent and very timely question. There's been tremendous interest across USGS in this topic. We have established a USGS Human Capital Transformation team, and there's a lot of work going on right now. They're looking to make recommendations to improve our processes for hiring, including position descriptions, especially for higher grade position descriptions. This team is seeking input from the workforce on these efforts: https://atthecore.usgs.gov/node/7038 (must be on VPN, USGS only). Questions include: How can communication procedures/processes between Human Resources and hiring managers be improved? Are there any tools, templates, charts, etc., that you are aware of that you would like to see developed to better track HR activities? Any experience developing such processes? What workflow changes would you suggest to streamline HR actions and processes? Full form is here: https://forms.office.com/Pages/ResponsePage.aspx?id=urWTBhhLe02TQfMvQApUlHlnx72eFtRNo7eGLyNtSS5UQjJEWkEwOFVWQUdVVDBQVzVSQlRGRzdMWSQlQCN0PWcu (USGS only).
- Kevin Gallagher: This has been a big issue among the ELT. Wondering if we do a good job of communicating complexity of positions when writing descriptions. Things like computer modeling, artificial intelligence, and data science are new, and what I'd like to do is to encourage the CDI community to connect with other organizations and share what you learn about position descriptions. We'd like to use these as a benchmark and to think about how USGS might improve. A working group would also be welcome, for making recommendations to Tim and Kevin.
- How can USGS do better on building capacity among existing employees to better harness new tools?
- Kevin Gallagher: Continuous learning and training on new tools is how we sharpen our saw and our skills. USGS is well-respected as a brand - there's no one better at hydrology, volcanology, etc. To maintain that edge, we have to focus on the state of the art and on learning new tools. The CDI is a big part of this. This is a place where we encourage members to share what training they need, informally share resources/courses among themselves, and engage with external organizations to offer training (Software and Data Carpentries trainings, for example). I encourage you to use CDI to express what tools and resources are needed. If you're not attending CDI meetings regularly, you should be. With a working group to identify the tools and training that are most popular, we could consider ways to offer group training opportunities across USGS. RGE and EDGE are also options outside of GS positions.
- Tim Quinn on Future of Work
- IT leadership have been having conversations on things that would help for the future we envision for the workforce. CDI has been at the forefront for using tools to help with collaboration. Some priorities:
- Increasing bandwidth.
- With the tools we have (Teams, etc.) - are we using these to the fullest extent? Are there trainings or better communication practices that we could employ?
- An inter-bureau team is looking at other collaborative tools.
- Making it easier for people to have equal access to data centers and other tools.
- What additional software or collaboration tools should we be considering in the future?
Collaboration Area Announcements
- Software Development
- Next event: August 26th, discussion on data accessibility and integration
- Data Management
- Next event: September 13th, Data series publication from SPN
- Data Viz
- Next event: September 16th, Beyond Bars and Box Plots - Chart alternatives and how to create and style them with ggplot2
- Next event: August 25th, Demo of wireframing/prototyping
- Special presentation: September 2nd, ESIP Disaster Lifecycle Cluster, Usability considerations and applications for decision making in disaster response
- Next event: Annual meeting August 17-19
- Check out the agenda here: https://my.usgs.gov/confluence/x/bhBpKg
- Metadata Reviewers
- Usual meeting on September 6 cancelled due to Labor Day
- Next event: October 4
- Semantic Web
- Next event: September 9, continuing a learning project on semantic annotation of models
- Next event: October (TBD) - meeting quarterly
- Get involved with the Teams channel
Data Integration activities in the Superfund Research Program - Michelle Heacock, NIH
- What is the Superfund Research Program?
- Organization Structure
- Part of the Department of Health and Human Services, National Institutes of Health
- Funds research on understanding fundamental knowledge of living systems
- Superfund adds the aspect of detection and remediation of hazardous substances
- Superfund is under SARA - Superfund Amendments and Reauthorization Act
- Superfund Research Program, Center Structure
- Problem-based, solution-oriented research
- Shared Goals with USGS
- Exposure pathways, health risks, are shared research areas
- Data Sharing Activities
- Data sharing is important in NIH.
- Strategic plan includes topics around data infrastructure, modernized data ecosystem, data management analytics and tools, workforce development, stewardship and sustainability
- Data Science / Sharing Activities
- Examples for External Use Cases
- Advantages are that center research is directed toward common research goals and that centers are structurally and scientifically integrated
- Challenges are data science expertise, vocabularies, ontologies, cost, etc.
- External Use Cases
- Wanted to address real-world, science-driven research
- To integrate two or more data streams
- Address barriers to Interoperability and Reuse
- All working towards increasing FAIRness of data
- Plant-microbiome-metal interactions - Can we create broad access to gain a mechanistic understanding of plant-microbiome-metal interactions that mediate plant survival and phytostabilization of toxic metals in metal-contaminated mine soils?
- Spectrum of readiness for data interoperability
- Proprietary data was a problem - data format and analyses are different depending on which instrument was used for data collection
- Checked in with projects throughout
- Webinars to check on progress and offer solutions
- Annual meeting - building a sustainable community of practice (the 'after')
- Mini workshop: Themes discussed
- Dedicated event for anyone working in data science: what does it take to build a sustainable community of practice?
- Wanted to share tools, valued trainings, had repository issues, time and cost burdens for preparing data
- Lessons learned
- Need for tool development, need for metadata standards, reliable data repositories
- Need to provide training for everyone involved (even mid and late career people)
- Need to develop a list of training topics
- Implementing recommendations
- Establishing working groups
- Providing researchers with centralized resource center
CDI Pop-Up Lab - communicating about data
A session where CDI members can come with their questions, challenges, and solutions.
- Data science mentors - Sheree Watson, USGS
- The Youth and Education in Science (YES) program is looking to recruit USGS scientists to mentor a high school student in a new capstone earth system science course in Frederick, MD
- Mentors would meet virtually with a student weekly/bi-weekly for 30-60 minutes
- If interested, email email@example.com
- Paths to computational fluency - Richie Erickson, USGS (Link to article)
- We noticed that there are a lot of people who teach themselves how to become computer scientists, and thought a paper could bring together a bunch of ideas on how to do this. Computational fluency vs. literacy: Fluency means understanding the basics of computer science and using scripts to automate small tasks; literacy means doing simple tasks without needing to know the infrastructure behind them, like emailing and using Excel.
- If you don't know where to start, the second half might be most relevant for CDI
- We started off wanting to demonstrate how to use Jupyter Notebooks, but what came out was a set of basic skills that could help almost anyone, like using a terminal or version control software.
- It used to be that, with Jupyter Notebooks, you would need to download and install Jupyter and Conda, but Pangeo makes it so users can 'dabble' more easily without setting up a complex environment. Customized R environments have been a problem with Pangeo, but a team is working through this.
- Communicating between data managers and researchers - Madison Langseth, USGS
- Session on how to talk to your data manager
- Polls revealed that communication was rated in the poor-to-adequate range rather than the good-to-great range
- Session included Story Corps conversations, and communication strategies heard in those conversations
- Communication methods for data managers:
- New employee on-boarding
- Introductory communication (email, Teams chat)
- Monthly coffee chats (events to provide training, describe new/updated policies)
- Office Hours - reserving time on a calendar for scientists to come by and chat, open door policy
- Project kickoff meetings - attend or organize kickoff meetings for new projects that include many different members of the team
- Regular check-ins
- Resources to improve communications for data managers:
- Microsoft Teams
- SharePoint Site
- Data Release Guides/Checklists
- Metadata templates
- Other insights for data managers
- Get center management involved
- Ask center director/branch chiefs to encourage researchers to participate in coffee chats & office hours; get assistance from them for on-boarding, project kickoff meetings, etc.
- Be clear about roles (i.e., a data manager can't be both a project's sole data manager and a reviewer for the data release)
- Ask for help
- DMWG Teams forum
- Ecosystems and Water Mission Area Data Managers forums
- Be flexible
- Try different communication methods
- Communication methods for researchers
- Find out who your data manager is before you need one. If you don't have one, ask if there's one outside of your center that you can contact.
- Reach out early and often
- Attend office hours or coffee chats
- Ask a lot of questions
- Plan ahead
- Insights for researchers
- Use identifiers in communications (like the IPDS number)
- Be flexible
- Leverage other available resources
- Promote data management
- Creating map services from a data release - Daniel Wieferich, USGS
- The need: services to support visualization (primary) and attribute querying of national species distribution and range maps for 271 fish species
- The ask: looking for colleagues to brainstorm options and pros and cons for:
- Hosting solutions (e.g. AGOL, GeoServer, ArcGIS server, Cloud-Hosting solutions, etc.)
- Service Types (e.g. vector types, feature services, map services)
- Data structure (e.g. minimize duplicate storage of geospatial data)
- The data:
- Range data
- Geospatial units WBD HUC8
- Tabular file w/row for each HUC8
- Distribution data
- Geospatial units NHDPlusV2.1 Stream Reach
- Tabular file w/row for each stream reach within range of each species
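One way to think about the "minimize duplicate storage" question above is to keep each geospatial unit's geometry once and join the per-species tabular rows to it by key at query time. The following is only a minimal sketch of that idea, not the project's actual design: the HUC8 codes, species names, placeholder geometry strings, and the helper functions `build_species_index` and `species_range` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical geometry store: one entry per HUC8 unit.
# Placeholder strings stand in for real polygon geometries.
huc8_geometries = {
    "01010001": "POLYGON_A",
    "01010002": "POLYGON_B",
    "01010003": "POLYGON_C",
}

# Hypothetical range table: one row per (species, HUC8) pair,
# mirroring the "tabular file w/row for each HUC8" described above.
range_rows = [
    {"species": "species_1", "huc8": "01010001"},
    {"species": "species_1", "huc8": "01010002"},
    {"species": "species_2", "huc8": "01010002"},
    {"species": "species_2", "huc8": "01010003"},
]


def build_species_index(rows):
    """Group HUC8 codes by species so each lookup avoids a full table scan."""
    index = defaultdict(list)
    for row in rows:
        index[row["species"]].append(row["huc8"])
    return dict(index)


def species_range(species, index, geometries):
    """Return (huc8, geometry) pairs for one species, joined by key at query time."""
    return [
        (huc8, geometries[huc8])
        for huc8 in index.get(species, [])
        if huc8 in geometries
    ]


species_index = build_species_index(range_rows)
```

In this sketch the geometry for a shared unit such as "01010002" is stored once, even though two species reference it; a real service would apply the same join against the WBD HUC8 or NHDPlusV2.1 layers rather than an in-memory dictionary.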
- Heacock presentation
- What kind of backgrounds do your "data scientists" have? We have fewer employees that "recognize" they are data scientists (yet).
- Librarians, database experts, complementary folks like software engineers. The more you include, the more budget you need.
- Are there conflicts/confusion between NIH data policy and the researchers' university data policies?
- Just trying to get ahead of that as much as we can. One thing groups have asked for is examples of data management plans. We recognize that some people are going to be funded by different ICs within the NIH. How can we have the requirements align with those orgs without being too burdensome? Also don't want people to have ten different data management plans. Wondering how to best do that in this evolving situation.
- Are you connected into the Earth Science Information Partners (ESIP)? I didn't see any mention but many of these topics are talked about there.
- We did put our feelers out to do that, and are working out the best way to do this.