
Introduction

The Best Practices (BP) Focus Group was formed in early FY 2011 to compile a suite of best practices, lessons learned, and learning opportunities regarding data management. The goal is to organize this information and make it available through a website or portal. At our initial meetings, it was decided that the best way forward would be to develop or adopt a data life cycle model that accurately reflects how USGS science data do or should move through their life cycle. This model will serve as the conceptual foundation upon which our data management best practices can be organized, aligned, explained, understood, and promoted.

“As the government looks to its plan for open government through the development of tools such as Data.gov, it is important to integrate these tools into the overall federal architecture and project lifecycle.”  -- Harnessing the Power of Digital Data: Taking the Next Step. Science Data Management (SDM) for Government Agencies: report from the Workshop to Improve SDM held June 29 – July 1, 2010, Washington, DC.

The Model


The following diagram embodies the BP focus group's consensus view of the USGS Science Data Life Cycle Model's key components, relationships, basic workflow, and suitable visual representation. Questions or comments about any aspect of the model are welcome and should be added to the bottom of this wiki page.


The model concepts, definitions, and many of the links to resources presented here have been incorporated into the USGS Data Management website.

 

Level 1:  Basic Project-Level Data Workflow View

Component Definitions

The diagram above shows six stages of the research life cycle, along with horizontal bars representing cross-cutting elements that apply at every stage.
The cross-cutting elements are described first below, followed by the primary model stages.
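
One way to read this structure is as six ordered stages, each paired with the same three cross-cutting elements. The following Python sketch illustrates that reading only. The class names and the `checklist` helper are assumptions made for this example and are not part of the USGS model or of any USGS software.

```python
from enum import Enum
from itertools import product


class Stage(Enum):
    """The six primary stages, in the order used by the model tables."""
    PLAN = 1
    ACQUIRE = 2
    PROCESS = 3
    ANALYZE = 4
    PRESERVE = 5
    PUBLISH_SHARE = 6


class CrossCutting(Enum):
    """Elements that apply at every stage."""
    DESCRIBE = "Metadata, Documentation"
    MANAGE_QUALITY = "QA/QC"
    BACKUP_SECURE = "Backup & Secure"


def checklist():
    """Pair every stage with every cross-cutting element (hypothetical helper)."""
    return list(product(Stage, CrossCutting))


pairs = checklist()
print(len(pairs))  # 6 stages x 3 elements = 18
```

A project could walk such a list to confirm that description, quality management, and backup have each been considered at every stage.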

Cross-cutting Model Elements

Step

Definition

Note

Examples/Resources

DESCRIBE
(Metadata, Documentation)

Provision of (a) locational, temporal, topical, quality, process, administrative, and other descriptive information about the data (as metadata) and (b) other documentation (as narrative) describing processes, methods, tools, best practices, etc., related to the data.

The key distinction between these two definitions is that metadata, in the standard sense of "data about data", formally describes various key attributes of each data element or collection of elements, while documentation makes reference to data in the context of their use in specific systems, applications, and settings. Documentation also includes ancillary materials (e.g., field notes) from which metadata can be derived. In the former sense, it's 'all about the data'; in the latter, it's 'all about the use'. Moreover, metadata is usually collected, stored, and presented in standard formats and structures designed to be used most effectively by automated systems (e.g., for discovery or representation), while documentation in the broader sense is oriented toward description or narrative designed to be easily understood by a human reader.

MANAGE QUALITY

Protocols and methods used to ensure that data are properly collected, handled, processed, used, and maintained at all stages of the Science data life cycle.

Commonly referred to as "QA/QC" (Quality Assurance/Quality Control).

BACKUP & SECURE

Actions or steps taken to protect data from accidental data loss, corruption, and unauthorized access.

Includes making additional copies of data files or databases that can be used to restore the original data or for recovery of earlier instances of the data.
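
As an illustration of the BACKUP & SECURE element, the sketch below copies a file and then confirms that the copy is byte-identical by comparing SHA-256 digests. It is a minimal example under stated assumptions: the function names, the file name, and the directory layout are hypothetical, and a real backup procedure would also cover off-site copies and access control.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy matches the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also tries to preserve timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target


# Demonstration with a throwaway directory and a hypothetical data file.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "observations.csv"
    original.write_text("site,value\nA,1.5\n")
    copy = backup_file(original, Path(tmp) / "backup")
    print(copy.name)
```

Storing the digest alongside the backup would also allow later checks that neither copy has been altered.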

Primary Model Stages

Step

Definition

Note

Examples/Resources

PLAN

A documented sequence of intended actions to identify and secure resources and gather, maintain, secure and utilize data holdings.

Includes preparation of a Data Management Plan (DMP) as part of the overall project plan and preparation and documentation (e.g., as project-level metadata) of information from the project proposal or program. Also includes procurement of funding and identification of technical and staff resources for full life cycle DM.

  • DataONE DMP Tool (data management plan creation template) - https://dmp.cdlib.org/
  • Methods listed in project proposal related to data (e.g., type of data to be collected, protocols to be used for data collection, processing and analysis)
  • U.C. DMPTool - https://dmp.cdlib.org/ allows researchers to input the content they would like to include in a data management plan and then generates that plan for them.
  • Managing Your Data - https://www.lib.umn.edu/datamanagement/DMP The U. of Minnesota library has created Managing Your Data, which guides researchers in the creation of data management plans. Inspired by the NSF data management mandate for grant proposals, the program provides best practices for sharing and finding data, preservation and archiving, copyright and ethics, and other areas.

ACQUIRE

The series of actions for collecting or adding to the data holdings.

Includes automated collection (e.g., of sensor-derived data), the manual recording of empirical observations, and obtaining existing data from other sources.

PROCESS

A series of actions or steps performed on data to verify, organize, transform, integrate, and extract data in an appropriate output form for subsequent use.

Includes organization of data files and content, data synthesis or integration, and format transformations. May include calibration activities (of sensors and other field and laboratory instrumentation).

ANALYZE

A series of actions and methods performed on data that help describe facts, detect patterns, develop explanations and test hypotheses.

Includes data quality assurance, statistical data analysis, modeling, and interpretation of analysis results.

PRESERVE

Actions and procedures to keep data for some period of time; to set data aside for future use.

Includes data archiving and/or data submission to a data repository.

A primary goal for USGS is to preserve well-organized and documented datasets that support research interpretations and that can be re-used by others; all research publications should be supported by associated, accessible datasets.

PUBLISH / SHARE

To prepare and issue, or disseminate, the final data products of the research or program activity.

Medium and agent independent; may include, e.g., "raw" data, data/metadata "packages", derivative materials, etc.
Transfer may occur via automated or non-automated mechanisms in any combination, e.g., human-human, human-machine, machine-machine.

 

Model Components Matrix

Cross-Cutting Elements

Element ⬇

PLAN

ACQUIRE

PROCESS

ANALYZE

PRESERVE

PUBLISH / SHARE

1.  Documentation, sensu lato
(Formal/structured metadata; informal field/laboratory notes and logs; manuals; database schema, data models; other project documentation)

☟See §1.1 & §1.2, below for details

☟See §1.1 & §1.2, below for details

☟See §1.1 & §1.2, below for details

☟See §1.1 & §1.2, below for details

☟See §1.1 & §1.2, below for details

☟See §1.1 & §1.2, below for details

1.1.  Formal/Structured Metadata
(Locational, temporal, topical, quality, process, provenance, administrative, and other descriptive information)

 

Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) - http://www.fgdc.gov/metadata/geospatial-metadata-standards#csdgm - Section 1, Identification Information

FGDC-CSDGM - Section 1 Identification information, Section 2 Data Quality

FGDC-CSDGM - Section 1 Identification Information, Section 2 Data Quality, Section 3 Spatial Data Organization, Section 4 Spatial Reference, Section 5 Entity and Attributes

FGDC-CSDGM - Section 2 Data Quality, Section 5 Entity and Attributes

FGDC-CSDGM - Section 6 Distribution Information, Section 7 Metadata Reference

FGDC-CSDGM - Section 6 Distribution Information, Section 7 Metadata Reference

1.2.  Other Documentary Artifacts
(Observations, methodological notes, data sheets, project proposals, plans, and other tangible byproducts)

Project proposal, data collection plan (e.g., field campaign, selection of sensors, methods)

Field records (e.g., notes, photographs); instrument & sensor manuals; data collection forms, notebooks, logs; protocol deviations; chain of custody documentation

File directory listing; software manuals; database schema; data transformation procedures; data sheets, e.g., Form 9-1861, Report on Referred Fossils

Quality assurance / quality control procedures; statistical and modeling software manuals; data analysis procedures; modeling and simulation procedures

All documentation needed to understand or provide context to the final products/records or to permit replication or retesting of results needs to be preserved. This information is often as important as the data products/records themselves.

 

2.  QA/QC
(Quality Assurance/Quality Control)
(Protocols and methods used to ensure that data are properly collected, handled, processed, used, and maintained at all stages of the Science data life cycle)

Come up with a strategy to help limit or eliminate errors; define and enforce standards (codes, measurement units, formats, number of significant digits used, etc.); assign person responsible for ensuring QA/QC

Ensure the quality of the data before collecting it; monitor it during collection to ensure quality control

Check for errors of omission (e.g., data not recorded) or errors of commission (incorrect or inaccurate data entered); use programs to read data back; have data verified by two different people; minimize the number of times data is entered; process electronically to limit data entry errors

Perform statistical summaries; check for missing, impossible, or anomalous values; ensure data matches column and row headers; use filters that will flag data outside of established max/min values; look for outliers using graphs and maps

Use checksums to compare data sets and ensure no alterations were introduced; preserve data digitally; document data; limit rights to data so that unwanted edits aren't made; use digital object identifiers (DOI); encrypt data when necessary

Use digital object identifiers (DOI); ensure the data sets have appropriate metadata; clearly communicate the quality of the data and any restrictions or limitations
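
Several of the QA/QC checks listed above (missing values, impossible values, and filters on established max/min limits) lend themselves to simple automation. The following Python sketch shows one possible form. The field names, the limits, and the sample records are invented for illustration; actual limits would come from the project's own QA/QC plan.

```python
import math
from statistics import mean

# Hypothetical limits for illustration only.
LIMITS = {"water_temp_c": (-5.0, 40.0), "ph": (0.0, 14.0)}


def flag_record(record, limits=LIMITS):
    """Return a list of problems found in one record (a dict of field -> value)."""
    problems = []
    for field, (low, high) in limits.items():
        value = record.get(field)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append(f"{field}: missing")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


records = [
    {"water_temp_c": 12.3, "ph": 7.1},
    {"water_temp_c": 120.0, "ph": 7.4},  # impossible temperature
    {"water_temp_c": 11.8, "ph": None},  # error of omission
]

for index, record in enumerate(records):
    for problem in flag_record(record):
        print(f"record {index}: {problem}")

# A simple statistical summary over records that passed every check.
valid_temps = [r["water_temp_c"] for r in records if not flag_record(r)]
print(f"mean temperature of fully valid records: {mean(valid_temps):.2f}")
```

Flagged records are reported rather than dropped, so that a person can decide whether each value is an entry error or a genuine anomaly.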

Cross-Cutting Elements Notes

(1)  Metadata Standards and Resources:  In addition to the CSDGM, other metadata standards and profiles have been or may be applied as appropriate to Science domain or purpose-specific applications and services.  A preliminary collection of resources providing or explaining these standards includes:


Primary Elements


Simplified version of the "swim lanes" approach for consolidating information and ideas.

Element ⬇

PLAN

ACQUIRE

PROCESS

ANALYZE

PRESERVE

PUBLISH / SHARE

1.  Feedback Loops
(Interconnected processes or dependencies within the DLCM)

From previous projects & possibly from each successive element

Primarily to Plan, Analyze (to determine scope, methodology) & Preserve

To each element

Primarily to Plan, Acquire, Process & Preserve

Feedback interaction will occur especially during the Plan, Process, Acquire & Analyze phases.

Plan & Preserve

2.  Roles & Responsibilities
(Lead person(s) for each component of the DLCM)

Project Lead

Project Lead and research team (e.g., researcher, research assistant, field technician or other support staff)

Team researcher, research assistant, technician or other support staff; data manager

Lead or designee (e.g., team researcher, research assistant, data analyst)

Lead or designee (e.g., project or organization data manager; archivist)

Project Lead or designee

3. Standards & Policies
(Applicable standards and policies required by law or organizational mandate)

Data Management Plan (DMP); Project planning and funding requirements

Program-specific guidelines and standards, e.g., Guidelines and Standard Procedures for Continuous Water-Quality Monitors: Station Operation, Record Computation, and Data Reporting -- http://pubs.usgs.gov/tm/2006/tm1D3/

Guided by accepted scientific standards, methods and protocols, in accordance with USGS Fundamental Science Practices

Guided by accepted scientific standards, methods and protocols, in accordance with USGS Fundamental Science Practices

In accordance with the USGS Records Schedules established for all science data
in the Agency --
http://internal.usgs.gov/gio/irm/fmref2.html 

The ISO Records Management Standard, ISO 15489-1:2001(E), is also an excellent guidance document. This standard uses the concepts of a) Authenticity, b) Reliability, c) Integrity, and d) Usability in determining the value of records.

Per National Archives and Records Administration (NARA), DOI & USGS Records Management requirements; Fundamental Science Practices (FSP); USGS Publishing policy

4.  Data Products
(Source data and related process and product artifacts)

Data collection / acquisition plan, data management plan

Empirical (field) observation, measurement – field notes, forms, data collection files; automated data acquired from sensors – instrument / sensor raw data output files; data gathered from other sources – unprocessed data files

Cleaned data files; organized (structured) data files (e.g., database, spreadsheet); transformed, combined or derived data fields and datasets

Data summaries and analysis results; models; simulations, other laboratory procedures; interpreted result; new or revised procedures

Data files and associated documentation available in data repository and/or data archive

Accessible databases/data sets, software applications (e.g., online databases, maps, charts, graphs, and other visualizations), interpretive reports and manuscripts, poster and oral presentations, procedural and other technical documentation

5.  Tools
(Software applications & services to support each top-level component)

BASIS+;
Desktop metadata, e.g., Metadata Parser (MP) - http://www.fgdc.gov/dataandservices/getmeta;
DMP template (e.g., http://dataconservancy.org/dataManagementPlans)

Discipline and/or program-specific: e.g., USGS National Streamflow Information Program – streamgage electronic data recorder; mobile device, e.g., smartphone, data collection apps

Extract, Transform, Load (ETL) tools (e.g., Google Refine)

Statistical software (e.g., R, SAS, SPSS, MINITAB, SYSTAT); programming / computing software (e.g., MATLAB, Mathematica), Excel add-ins

Discipline-specific: e.g., biological data - Darwin Core Archive (DwC-A) Assistant – http://tools.gbif.org/dwca-assistant/

Information Product Data System (IPDS) – http://internal.usgs.gov/publishing/ipds.html


Secondary (Related) Elements


These elements address the human resources, technical infrastructure, and practices necessary to implement a comprehensive Data Management Plan spanning the entire Science Data Life Cycle.

Element ⬇

PLAN

ACQUIRE

PROCESS

ANALYZE

PRESERVE

PUBLISH / SHARE

1.  Data Management Activities
(Specific data management activities necessary to successfully address each DLCM component)

 

 

 

 

 

 

1.1  Best Practices
(Recommended best practices associated with each data management activity)

 

 

 

 

Data Products
    o  All final USGS data products/records (including some physical samples) need to be adequately preserved for as long as the applicable Agency Records Schedules dictate - http://internal.usgs.gov/gio/irm/fmref2.html;
    o  Preservation planning needs to be part of the process to address media obsolescence issues;
    o  For digital media this can mean migrations every 3-5 years;
    o  Format evolutions also need to be addressed.

Data Objects/Digital Media
    o  Migrate data to new media every 3-5 years. Ensure multiple copies exist including off-site;
    o  Migrating science data from one medium to another is now a requirement for all science projects.
    o  A good reference point as to when this should occur comes from the Consultative Committee for Space Data Systems, which defines "Long-Term" as "A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing user community, on the information being held in a repository. This period extends into the indefinite future." (http://www.ccsds.org/CCSDS/documents/650x0b1.pdf);
    o  A good practice to follow is to note when media are written and then project no more than 3-5 years in the future when the media, including hard disks, should be evaluated for media migration.
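
The practice of noting when media are written and projecting a review date can be sketched in a few lines of Python. This example assumes the lower bound of the 3-5 year window as the default; the function names are hypothetical and the interval should be adjusted to the applicable records schedule.

```python
from datetime import date

REVIEW_AFTER_YEARS = 3  # assumed default: lower bound of the 3-5 year window


def migration_review_date(written_on: date, years: int = REVIEW_AFTER_YEARS) -> date:
    """Return the date by which media written on written_on should be evaluated."""
    try:
        return written_on.replace(year=written_on.year + years)
    except ValueError:
        # written_on was 29 February and the target year is not a leap year
        return written_on.replace(year=written_on.year + years, day=28)


def is_due(written_on: date, today: date) -> bool:
    """True once the review date for the media has been reached."""
    return today >= migration_review_date(written_on)


print(migration_review_date(date(2012, 4, 19)))
print(is_due(date(2012, 4, 19), date(2014, 1, 1)))
```

Recording the written-on date for every copy, including off-site copies, lets the same check be run across an entire media inventory.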

Documentation
    o  It is vitally important that the characteristics and contextual information be preserved for all final data products/records.
    o  Metadata must be generated and preserved with the same attention given to the data itself.

"Publication reviews should include data usability." (per Peter Schweitzer, see attached one-pager)

As part of interpretive reports, provide, as appropriate:

  1. Stable links (e.g., Digital Object Identifiers or DOIs) back to relevant "source data" or other information to enable replication of analyses/experiments, evaluation of methods, or verification of interpretations; and
  2. Structured, machine-parsable versions of published data tables (e.g., a CSV file corresponding to a table distributed in PDF format.)
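
As a sketch of item 2, the Python example below writes a small table to CSV text and reads it back to confirm that it parses. The column names and values are invented stand-ins for a table that might appear in a PDF report; they are not taken from any USGS publication.

```python
import csv
import io

# Hypothetical table contents, standing in for a table published in a PDF report.
header = ["site_id", "sample_date", "discharge_cfs"]
rows = [
    ["S-001", "2011-09-22", 152.4],
    ["S-002", "2011-09-22", 87.0],
]


def table_to_csv(header, rows) -> str:
    """Serialize a table to CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = table_to_csv(header, rows)
print(csv_text)

# Reading the text back confirms that it is machine-parsable.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[0]["site_id"])
```

A single header row, one record per line, and ISO 8601 dates are choices made here to keep the file easy for other software to read.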

2.  Support Services & Infrastructure
(Bureau/program-level services & processes, including people, hardware, software, documentation, learning services, help desk, etc., that enable/support each DLCM component)

 

 

 

 

 

Science Publishing Network (SPN);
Data.gov

2.1  Knowledge Sharing & Management
(Preservation and sharing of best practices and other information relating to the DLCM through training, communities of practice, documentation, knowledgebases, etc.)

 

 

 

 

 

Survey Manual (e.g., FSP-related policies)

Use Case Development


Some introductory text needed here to explain purpose, information being captured in linked page below vis-à-vis use cases and use case development.

Acronyms

  • CSDGM - Content Standard for Digital Geospatial Metadata - http://www.fgdc.gov/metadata/csdgm/
  • DLCM - Data Life Cycle Management
  • DMP - Data Management Plan - http://dataconservancy.org/dataManagementPlans
  • ETL - Extract, Transform, Load - http://en.wikipedia.org/wiki/Extract,_transform,_load
  • FGDC - Federal Geographic Data Committee - http://www.fgdc.gov/
  • FSP - Fundamental Science Practices - http://internal.usgs.gov/fsp/ and http://www.usgs.gov/fsp/
  • IPDS - Information Product Data System - http://internal.usgs.gov/publishing/ipds.html
  • NARA - National Archives and Records Administration - http://www.archives.gov/
  • SDM - Science Data Management
  • SPN - Science Publishing Network - http://internal.usgs.gov/publishing/

References

Data Lifecycle Models and Concepts (.pdf)

Attachments

  File Modified
PDF File Final CDI_poster_Large3x4groupmods.pdf Final 'Write-On' Poster Sep 22, 2011 by Benson, Abigail L.
Microsoft Word Document Data Life Cycle Model Proposal.docx FY12 Proposal Sep 22, 2011 by Benson, Abigail L.
Microsoft Word Document S3 Paper.docx S3 Science Support Service Proposal Sep 22, 2011 by Benson, Abigail L.
Microsoft Word Document USGS Data Lifecycle Model.docx Sep 22, 2011 by Benson, Abigail L.
PDF File DM-Life-Cycle-Model-OAIS-v3.pdf Sep 22, 2011 by Benson, Abigail L.
Microsoft Word Document Data Lifecycle Models and Concepts v7.docx DM Lifecycle Model v7 Sep 22, 2011 by Benson, Abigail L.
PDF File Data-Life-Cycle-Model-Alternatives F-G.pdf Model graphics variants F & G Oct 05, 2011 by Govoni, David L.
PNG File Scientific-Data-Life-Cycle-Model-F-small.png Model version "F" selected by vote of DMWG-BP focus group Oct 11, 2011 by Govoni, David L.
Microsoft Excel Sheet 2011-07-27_Model-discussion_ST-edit1c.xls Original 'swim lanes' doc by Steve Tessler Nov 09, 2011 by Faundeen, John L.
PDF File Publication reviews should include data usability.pdf Peter Schweitzer's one-pager. Jan 24, 2012 by Govoni, David L.
Microsoft Word Document Data Management Working Group Comments.docx Comments from DMWG 1-9-12 Mtg. Jan 31, 2012 by Faundeen, John L.
PDF File Glossary of Metadata Standards (Riley 2010).pdf Riley, Jenn, 2010. "Glossary of Metadata Standards". http://www.dlib.indiana.edu/~jenlrile/metadatamap/. Accessed 01-04-2012 Feb 02, 2012 by Govoni, David L.
Microsoft Word Document Data Lifecycle Models and Concepts v9.docx Data Life Cycle Models and Concepts, Version 1 (CEOS.WGISS.DSIG.TN01 Sept. 2011) Feb 09, 2012 by Govoni, David L.
PNG File Data-Life-Cycle-Model-G.png Model version "G" with comments of 02/09/2012 applied Feb 09, 2012 by Govoni, David L.
Microsoft Word Document Data Lifecycle Models and Concepts v11.docx Data Life Cycle Models v.11 Mar 13, 2012 by Faundeen, John L.
Microsoft Word Document Data Lifecycle Models and Concepts v12.docx Version 12 Apr 04, 2012 by Faundeen, John L.
Microsoft Word Document Data Lifecycle Models and Concepts v13.docx Version 13 Apr 19, 2012 by Govoni, David L.
JPEG File SDLC Level Two Roles - FINAL.jpg Sep 13, 2012 by Hutchison, Vivian B.
PDF File Data Lifecycle Models and Concepts_sm.pdf Mar 07, 2013 by Chang, Michelle Y.


3 Comments

  1. For discussion regarding "Roles," consider endorsing a higher-level concept, one that does not need to be embedded in the model or explanatory text.  Such as ...

    =============

    The Project Lead has the overall responsibility of ensuring that all phases of the scientific data life cycle are addressed.

    The Project Lead does not have to personally perform all of those functions, however.

  2. Some Useful Resources for Metadata Standards

    Please add 'em as you find 'em.

  3. This is a really impressive piece of work. All in all a great look at the scientific data life cycle cast in the mold of the USGS. Just a few comments/critiques.

    In the top level of organization, I think of data discovery and evaluation as super key and potentially time-consuming components of acquisition. I'm not sure the "process" and "analyze" steps are separate. As we move toward an increasingly service-oriented data infrastructure the way these steps look in the project life cycle will change. As we transition to a more service-oriented data infrastructure, the quality assurance "evaluation" will take place before data acquisition. We have services (like map services) that allow people to ask their own questions of data prior to acquiring it. We will also have services that complete most "processing" as a component of data acquisition. In the end, these components are quite grey and over-defining them (which I think they are now) is to the detriment of the model. Consider "Data Discovery", "Data Retrieval", "Scientific Analysis".

    Why isn't ISO metadata considered? National and international partners are migrating to this standard. It represents the highest level abstraction available for metadata and tooling is being developed to support interdisciplinary sharing of information based on ISO. The best resource I know of for this is: https://geo-ide.noaa.gov/wiki/

    There needs to be some explanatory material preceding all these "model component matrices" to assist people in interpreting the content in them. I am not really sure what I should be taking away from them.

    The feedback loops could be explained better and illustrated. There are a few very prominent ones that maybe should be pulled out? Metadata creation and improvement, data access evaluation and interpretation...

    That's all for now, will try and spend some more time with this later.