The Best Practices (BP) Focus Group was formed in early FY 2011 to compile a suite of best practices, lessons learned, and learning opportunities, regarding data management. The goal is to organize this information and make it available through a website or portal. At our initial meetings it was decided that the best way forward would be to develop or adopt a data life cycle model that accurately reflects how USGS science data does or should travel through its life. This model will serve as the conceptual foundation upon which our data management best practices can be organized, aligned, explained, understood, and promoted.
“As the government looks to its plan for open government through the development of tools such as Data.gov, it is important to integrate these tools into the overall federal architecture and project lifecycle.” -- Harnessing the Power of Digital Data: Taking the Next Step. Science Data Management (SDM) for Government Agencies: report from the Workshop to Improve SDM held June 29 – July 1, 2010, Washington, DC.
The following diagram embodies the BP focus group's consensus view of the USGS Science Data Life Cycle Model's key components, relationships, basic workflow, and suitable visual representation. Questions or comments about any aspect of the model are welcome and should be added to the bottom of this wiki page.
★★★ The model concepts, definitions, and many of the links to resources presented here have been incorporated into the USGS Data Management website. ★★★
download full resolution graphic
The diagram above has 6 Stages of the Research Life Cycle, and also horizontal bars representing Cross-Cutting elements that apply at each of the Stages.
Below, the Cross-cutting elements are described first, then the main model Stages.
Step | Definition | Note | Examples/Resources |
---|---|---|---|
DESCRIBE | Provision of (a) locational, temporal, topical, quality, process, administrative, and other descriptive information about the data (as metadata) and (b) other documentation (as narrative) describing processes, methods, tools, best practices, etc., related to the data. | The key distinction between these two definitions is that metadata, in the standard sense of "data about data", formally describes various key attributes of each data element or collection of elements, while documentation makes reference to data in the context of their use in specific systems, applications, settings. Documentation also includes ancillary materials (e.g., field notes) from which metadata can be derived. In the former sense, its 'all about the data'; in the latter, its 'all about the use'. Moreover, metadata is usually collected, stored, and presented in standard formats and structures designed to be used most effectively by automated systems (e.g., for discovery or representation), while documentation in the broader sense is oriented toward description or narrative designed to be easily understood by a human reader. |
|
MANAGE QUALITY | Protocols and methods used to ensure that data are properly collected, handled, processed, used, and maintained at all stages of the Science data life cycle. | Commonly referred to as "QA/QC" (Quality Assurance/Quality Control). |
|
BACKUP & SECURE | Actions or steps taken to protect data from accidental data loss, corruption, and unauthorized access. | Includes making additional copies of data files or databases that can be used to restore the original data or for recovery of earlier instances of the data. |
|
Step | Definition | Note | Examples/Resources |
---|---|---|---|
PLAN | A documented sequence of intended actions to identify and secure resources and gather, maintain, secure and utilize data holdings. | Includes preparation of a Data Management Plan (DMP) as part of the overall project plan and preparation and documentation (e.g., as project-level metadata) of information from the project proposal or program. Also includes procurement of funding and identification of technical and staff resources for full life cycle DM. |
|
ACQUIRE | The series of actions for collecting or adding to the data holdings. | Includes automated collection (e.g., of sensor-derived data), the manual recording of empirical observations, and obtaining existing data from other sources. |
|
PROCESS | A series of actions or steps performed on data to verify, organize, transform, integrate, and extract data in an appropriate output form for subsequent use. | Includes data files and content organization, and data synthesis or integration, format transformations. May include calibration activities (of sensors and other field and laboratory instrumentation). |
|
ANALYZE | A series of actions and methods performed on data that help describe facts, detect patterns, develop explanations and test hypotheses. | Includes data quality assurance, statistical data analysis, modeling, and interpretation of analysis results. |
|
PRESERVE | Actions and procedures to keep data for some period of time; to set data aside for future use. | Includes data archiving and/or data submission to a data repository. |
|
PUBLISH / SHARE | To prepare and issue, or disseminate, the final data products of the research or program activity. | Medium and agent independent; may include, e.g., "raw" data, data/metadata "packages", derivative materials, etc.; |
|
Element ⬇ | PLAN | ACQUIRE | PROCESS | ANALYZE | PRESERVE | PUBLISH / SHARE |
---|---|---|---|---|---|---|
1. Documentation, sensu lato | ☟See §1.1 & §1.2, below for details | ☟See §1.1 & §1.2, below for details | ☟See §1.1 & §1.2, below for details | ☟See §1.1 & §1.2, below for details | ☟See §1.1 & §1.2, below for details | ☟See §1.1 & §1.2, below for details |
1.1. Formal/Structured Metadata
| Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) - http://www.fgdc.gov/metadata/geospatial-metadata-standards#csdgm - Section 1, Identification Information | FGDC-CSDGM - Section 1 Identification information, Section 2 Data Quality | FGDC-CSDGM - Section 1 Identification Information, Section 2 Data Quality, Section 3 Spatial Data Organization, Section 4 Spatial Reference, Section 5 Entity and Attributes | FGDC-CSDGM - Section 2 Data Quality, Section 5 Entity and Attributes | FGDC-CSDGM - Section 6 Distribution Information, Section 7 Metadata Reference | FGDC-CSDGM - Section 6 Distribution Information, Section 7 Metadata Reference |
1.2. Other Documentary Artifacts | Project proposal, data collection plan (e.g., field campaign, selection of sensors, methods) | Field records (e.g., notes, photographs); instrument & sensor manuals; data collection forms, notebooks, logs; protocol deviations; chain of custody documentation | File directory listing; software manuals; database schema; data transformation procedures; data sheets, e.g., Form 9-1861, Report on Referred Fossils | Quality assurance / quality control procedures; statistical and modeling software manuals; data analysis procedures; modeling and simulation procedures | All documentation needed to understand or provide context to the final products/records or to permit replication or retesting of results need to be preserved. This information is often as important as the data products/records themselves. |
|
2. QA/QC | Come up with a strategy to help limit or eliminate errors; define and enforce standards (codes, measurement units, formats, number of significant digits used, etc.); assign person responsible for ensuring QA/QC) | Ensure the quality of the data before collecting it, monitor it during collection to ensure quality control; | Check for errors of omission (ex: data not recorded) or errors of commission (incorrect or inaccurate data entered); use programs to read data back; have data verified by two different people; minimize the number of times data is entered; process electronically to limit data entry errors | Perform statistical summaries; check for missing, impossible, or anomalous values; ensure data matches column and row headers; use filters to that will flag data outside of established max/min values; look for outliers using graphs and maps | Use checksums to compare data sets and ensure no alterations were introduced; preserve data digitally; document data; limit rights to data so that unwanted edits aren't made; use digital object identifiers (DOI); encrypt data when necessary | Use digital object identifiers (DOI); ensure the data sets have appropriate metadata; clearly communicate the quality of data, any restrictictions or limitations |
(1) Metadata Standards and Resources: In addition to the CSDGM, other metadata standards and profiles have been or may be applied as appropriate to Science domain or purpose-specific applications and services. A preliminary collection of resources providing or explaining these standards includes:
Simplified version of the "swim lanes" approach for consolidating information and ideas.
Element ⬇ | PLAN | ACQUIRE | PROCESS | ANALYZE | PRESERVE | PUBLISH / SHARE |
---|---|---|---|---|---|---|
1. Feedback Loops | From previous projects & possibly from each successive element | Primarily to Plan, Analyze (to determine scope, methodology) & Preserve | To each element | Primarily to Plan, Acquire, Process | Feedback interaction will occur | Plan & Preserve |
2. Roles & Responsibilities | Project Lead | Project Lead and research team (e.g., researcher, research assistant, field technician or other support staff) | Team researcher, research assistant, technician or other support staff; data manager) | Lead or designee (e.g., team researcher, research assistant, data analyst) | Lead or designee (e.g., project or organization data manager; archivist) | Project Lead or designee |
3. Standards & Policies | Data Management Plan (DMP); Project planning and funding requirements | Program-specific guidelines and standards, e.g., Guidelines and Standard Procedures for Continuous Water-Quality Monitors: Station Operation, Record Computation, and Data Reporting -- http://pubs.usgs.gov/tm/2006/tm1D3/ | Guided by accepted scientific standards, methods and protocols, in accordance with USGS Fundamental Science Practices | Guided by accepted scientific standards, methods and protocols, in accordance with USGS Fundamental Science Practices | In accordance with the USGS Records Schedules established for all science data | Per National Archives and Records Administration (NARA), DOI & USGS Records Management requirements; Fundamental Science Practices (FSP); USGS Publishing policy |
4. Data Products | Data collection / acquisition plan, data management plan | Empirical (field) observation, measurement – field notes, forms, data collection files; automated data acquired from sensors – instrument / sensor raw data output files; data gathered from other sources – unprocessed data files | Cleaned data files; organized (structured) data files (e.g. database, spreadsheet);transformed, combined or derived data fields and datasets | Data summaries and analysis results; models; simulations, other laboratory procedures; interpreted result; new or revised procedures | Data files and associated documentation available in data repository and/or data archive | Accessible databases/data sets, software applications (e.g., online databases, maps, charts, graphs, and other visualizations), interpretive reports and manuscripts, poster and oral presentations, procedural and other technical documentation |
5. Tools | BASIS+; | Discipline and/or program-specific: e.g., USGS National Streamflow Information Program – streamgage electronic data recorder; mobile device, e.g., smartphone, data collection apps | Extract, Transform, Load (ETL) tools (e.g., Google Refine) | Statistical software (e.g., R, SAS, SPSS, MINITAB, SYSTAT); programming / computing software (e.g., MATLAB, Mathematica), Excel ad-ins | Discipline-specific: e.g., biological data - Darwin Core Archive (DwC-A) Assistant – http://tools.gbif.org/dwca-assistant/ | Information Product Data System (IPDS) – http://internal.usgs.gov/publishing/ipds.html |
These elements address the human resources, technical infrastructure, and practices necessary to implement a comprehensive Data Management Plan spanning the entire Science Data Life Cycle.
Element ⬇ | PLAN | ACQUIRE | PROCESS | ANALYZE | PRESERVE | PUBLISH / SHARE |
---|---|---|---|---|---|---|
1. Data Management Activities |
|
|
|
|
|
|
1.1 Best Practices |
|
|
|
| Data Products | "Publication reviews should include data usability." (per Peter Scheitzer, see attached one-pager)
|
2. Support Services & Infrastructure |
|
|
|
|
| |
2.1 Knowledge Sharing & Management |
|
|
|
|
| Survey Manual (e.g., FSP-related policies) |
Some introductory text needed here to explain purpose, information being captured in linked page below vis-à-vis use cases and use case development.
Acronyms | Meaning | Link |
---|---|---|
CSDGM | Content Standard for Digital Geospatial Metadata | |
DLCM | Data Life Cycle Management |
|
DMP | Data Management Plan | |
ETL | Extract, Transform, Load | |
FGDC | Federal Geographic Data Committee | |
FSP | Fundamental Science Practices | |
IPDS | Information Product Data System | |
NARA | National Archives and Records Administration | |
SDM | Science Data Management |
|
SPN | Science Publishing Network |
Data Lifecycle Models and Concepts (.pdf)
| Come up with a strategy to help limit or eliminate errors; ensure the quality of the data before collecting it, monitor it during collection to ensure quality control; define and enforce standards (codes, measurement units, formats, number of significan digits used, etc.); assign person responsible for ensuring QA/QC) |
3 Comments
Unknown User (faundeen@usgs.gov)
Foe discussion regarding "Roles," consider endorsing a higher-level concept, one that does not need to be embedded in the model or explanatory text. Such as ...
=============
The Project Lead has the overall responsibility of ensuring that all phases of the scientific data life cycle are addressed.
The Project Lead does not have to personally perform all of those functions, however.
Govoni, David L.
Some Useful Resources for Metadata Standards
Please add 'em as you find 'em.
Blodgett, David L.
This is a really impressive piece of work. All in all a great look of the scientific data life cycle cast in the mold of the USGS. Just a few comments/critiques.
In the top level of organization, I think of data discovery and evaluation as super key and potentially time consuming components of acquisition. I'm not sure the "process" and "analyze" steps are separate. As we move toward an increasingly service oriented data infrastructure the way these steps look in the project life cycle will change. As we transition to a more service oriented data infrastructure, the quality assurance "evaluation" will take place before data acquisition. We have services (like map services) that allow people to ask their own questions of data prior to acquiring it. We will also have services that complete most "processing" as a component of data acquisition. In the end, these components are quite grey and over defining them (which I think they are now) is to the detriment of the model. Consider "Data Discovery", "Data Retrieval", "Scientific Analysis".
Why isn't ISO metadata considered? National and International partners are migrating to this standard. It represents tha highest level abstraction available for metadata and tooling is being developed to support interdisciplinary sharing of information based on ISO. The best resource I know of for this is: https://geo-ide.noaa.gov/wiki/
There needs to be some explanatory material preceding all these "model component matrices" to assist people in interpreting the content in them. I am not really sure what I should be taking away from them.
The feedback loops could be explained better and illustrated. There are a few very prominant ones that maybe should be pulled out? Metadata creation and improvement, data access evaluation and interpretation...
Thats all for now, will try and spend some more time with this later.