Bioinformatics Community Call 03/21/2017
Topic: Data Release and Data Management for Bioinformatics data
Posted questions: https://my.usgs.gov/confluence/display/cdi/Questions+related+to+Data+Releases+and+Data+Management
Denise Akob introduced the call and then turned it over to JC Nelson to start us off. Scott Cornman and Barbara (Bobbi) Pierson will also chime in on topics.
JC Nelson started by going through the posted questions.
Q1: What are acceptable data repositories for sequence data (Denise)?
JC: Most use NCBI. It is a trusted repository.
Discussion: Not controversial that data are stored in NCBI as an archive. What is questioned is whether depositing in NCBI is sufficient for a data release, or whether additional actions are required (e.g., a parallel USGS data release, ScienceBase, etc.).
- Distinction between archive and release.
- Denise’s BAO approved NCBI as both archive and release.
- Chris Kellogg described how her center does it: OK to submit to NCBI, but first produce a parallel USGS data release that includes a website on a local server to host the raw data files, plus FGDC metadata in addition to the MIMARKS metadata for NCBI. At her center, data releases are approved at the level of center director, so the BAO never sees or knows about them.
- JC has been working with Ecosystems to standardize a data release to NCBI. Chris will follow up with him on how to implement this at her center.
- Denise: Would be helpful to have guidance at a higher level than center to define things like archive vs. release.
- JC is working with associate directors of each mission area to try and do just that.
- JC: Should have an archive of your project data locally even when it is in NCBI. A lot of people are moving to using NARA (National Archives and Records Administration) and the Cloud for archive storage. Hoping to generate better documentation about what people are doing; will post it on the data management website.
Q2: Has anyone used a Laboratory Information Management System (LIMS) (Neil)?
- JC: Had a discussion with Tim Quinn’s group at OEI about the development or procurement of a LIMS. Tim’s group would oversee an enterprise LIMS. Ecosystems is looking into what this would cost. They are soliciting information about use; if you have used a LIMS, let them know. Potential for implementation early next fiscal year?
- Yesha Shrestha at Reston Stable Isotope Lab is looking into one
- Ohio Microbiology Lab has LIMS but not using it for sequences
- Liz Milano: Genetics lab has been looking into LIMS to keep track of samples in freezers, and later PCR and sequence plates
- In most cases, people who looked into them found them to be prohibitively expensive and stopped right there.
- Mostly people have been considering them for keeping track of samples, freezer inventory, etc.
- Is it even feasible to have an enterprise LIMS that is interoperable between centers and mission areas (e.g., is there a system flexible enough to handle that)?
- Would make it easier to collaborate and work across groups
- Maybe possible to purchase at the Bureau level and then each center maintains a local copy that can be customized to a degree to meet their particular needs
- Scott: LIMS would make documentation easier and more robust, but people don’t want to have to fight a system (e.g., to enter workflows) if the enterprise system doesn’t fit their work
Q3: With very large data sets, is there anything I should know before attempting to upload an Excel or txt file (Jennifer)?
- JC: ScienceBase has a data upload limit of 10 GB per file. Can upload more than that in total (e.g., 20 GB of data) as long as each individual file is below the 10 GB limit. ScienceBase uses FGDC metadata. There is a USGS Online Metadata Editor: https://www1.usgs.gov/csas/ome/
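The per-file limit above can be checked before starting an upload. The following is a minimal sketch, not an official ScienceBase tool: the function names are hypothetical, and it assumes "10 GB" means 10 * 1024**3 bytes, which the notes do not specify.

```python
from pathlib import Path

# Assumption: the 10 GB per-file limit from the notes, read as 10 * 1024**3 bytes.
PER_FILE_LIMIT_BYTES = 10 * 1024**3


def files_over_limit(sizes, limit=PER_FILE_LIMIT_BYTES):
    """Return the names of files whose size in bytes is at or above the limit."""
    return sorted(name for name, size in sizes.items() if size >= limit)


def sizes_in_directory(directory):
    """Map each regular file under `directory` to its size in bytes."""
    return {
        str(path): path.stat().st_size
        for path in Path(directory).rglob("*")
        if path.is_file()
    }
```

For example, files_over_limit(sizes_in_directory("my_release")) would list any file in a hypothetical my_release folder that needs to be split or compressed before upload. The total size of the release is deliberately not checked, since only the per-file limit was mentioned on the call.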
Q4: Could we discuss releasing “original OTUs” vs. “filtered OTUs” (Bobbi)?
- JC: The objective of data releases is to capture data associated with a publication. So if you are putting out filtered OTUs in your paper, you can release filtered. This leaves you free to release subsequent data in later papers. However, at the end of the project, if there are data that haven’t been published, you still have to release them (e.g., when the BASIS number retires, ~5 years).
- Denise: What should be released? Would rather include raw data than processed data.
- Katharine Coykendall: If you release raw data, then everybody starts out at the same place. Processed data can be useful too, but it’s a lot of information; e.g., she made a SILVA database to analyze her sequences and it’s 20 GB by itself.
- Adam Mumford: One solution would be to deposit raw reads to NCBI and release the code (e.g., workflow) as a component of the data release.
- Chris: First 2 data releases included raw sequence data, workflow, and all processed files produced by QIIME during the analyses. Her center’s data management team decided the processed files were too cumbersome and has decreed all future data releases will only include raw files. She negotiated to include workflow with raw files and also sometimes includes workflow as a supplemental file with journal article to be safe.
- Adam: Agrees that workflow (commented code) is important, but intermediate files not so much.
- Bitbucket can be used to make code available to the public. [For code you’ve written yourself]
- Bobbi: FGDC has logical consistency, lineage of data, etc. Metadata could handle describing those process steps. It could be possible to reproduce entire workflow in metadata. Is it possible to standardize this (data dictionary)?
- A data dictionary tries to predefine the fields for metadata to save you from having to define everything; cuts down the time to create a USGS metadata document. An effort is being made in Ecosystems to include NCBI metadata fields; feedback is wanted on whether the definitions are appropriate. JC will share it and we can discuss.
- Denise: Bacteria and archaea are not part of taxonomy in USGS metadata editor. JC is getting it added shortly.
Q5: What are obligations for data review of a data release (Katharine)?
- JC: A data release requires data review and metadata review (can be the same person). Up to you if you want a subject matter expert. Looking for someone who can say whether the data seem appropriate. There is a checklist on the data management website of what to look for. This review is separate from manuscript review.
- Is it the obligation of the reviewer to go through code and run it? Typically not. Chris gave examples: For her first two data releases she had Carrie Givens go through the processed files to check that they open (if possible to open) and confirm they look like appropriate QIIME output files. Now only releasing raw data, but SFF files can’t be viewed by humans. Her data management team decided to have her also post the first human-readable files, i.e., FASTA files. Reviewers confirm the files open and that the information inside looks like DNA sequences. Can’t really do more than that.
- Question about getting data review from another center when the files are so huge.
- Send them on a hard drive (several have done this)
- Could upload file to Yeti (shared memory) or Alces Flight (when ready) so people could do reviews across centers. Denise/Scott will follow up with Courtney Owen about this as a reason to support Alces Flight.
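The review step described above (confirm that a FASTA file opens and that its contents look like DNA sequences) could be partly automated. This is a minimal sketch of one possible check, not the procedure any center uses: the function name is hypothetical, and the allowed alphabet (A, C, G, T, N, and the gap character) is an assumption that would need widening for files containing IUPAC ambiguity codes.

```python
def looks_like_dna_fasta(lines, allowed=frozenset("ACGTN-")):
    """Return True if the lines form FASTA records with DNA-like sequence text.

    Requires at least one '>' header line, at least one sequence line, and
    only characters from `allowed` (case-insensitive) on sequence lines.
    """
    saw_header = False
    saw_sequence = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            saw_header = True
            continue
        if not saw_header:
            return False  # sequence text appears before any header
        if not set(line.upper()) <= allowed:
            return False  # unexpected character on a sequence line
        saw_sequence = True
    return saw_header and saw_sequence
```

A reviewer could pass an open file handle directly, for example: with open("reads.fasta") as handle: print(looks_like_dna_fasta(handle)), where reads.fasta is a placeholder file name. The check reads one line at a time, so it should also work on very large files, but it only confirms the file's general shape and says nothing about the quality of the sequences.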
Q6: What are people doing with eDNA raw data (Damian Menning)?
- No different from any other high throughput sample; people are uploading to NCBI.
Topics for next call:
More details for Yeti. Being used in Reston and Alaska. Denise will reach out to them.