Scientists conducting data analyses face a number of challenges in today’s data-rich world, including efficiently handling the growing size and complexity of data, collaborating with team members who have varying levels of data expertise, and meeting requirements for disseminating models, data, and metadata. While more members of the scientific community are embracing scripting languages to handle data inputs and outputs, the increasing complexity of analyses (e.g., multiple data sources and access patterns, large data) can strain the usefulness of basic scripting workflows. At the Water Mission Area Integrated Information Dissemination Division, the Data Science team has been exploring additional tools to achieve complete reproducibility, increase efficiency, bolster collaboration, and promote scalability in these complex situations. Features that have helped shape these robust workflows include modular and reproducible scripts, shared caching for intermediate data products, built-in fault tolerance for network transactions, and explicit capture of the dependencies among inputs, processing steps, and outputs. While we have implemented these features using GNU Make and related R packages (remake, drake), the concepts we have learned can be applied in other languages. These practices can be taken one step further by integrating the tools with high-throughput computing to create powerful and manageable systems for data analysis.
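As an illustration of how the workflow features named above (dependency capture, caching of intermediate products, and fault tolerance for network transactions) might carry over to another language, here is a minimal Python sketch. It is not the team's implementation and does not reproduce the API of GNU Make, remake, or drake. Every name in it (`Pipeline`, `target`, `make`, `with_retries`, `flaky_fetch`) is hypothetical, the in-memory dictionary only stands in for a shared cache, and the network failure is simulated.

```python
import hashlib
import json
import time


def with_retries(func, attempts=3, base_delay=0.01):
    """Call func(); on a network-style error, back off and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError:  # ConnectionError and TimeoutError are subclasses
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


class Pipeline:
    """Hypothetical make-like runner: rebuilds a target only when needed."""

    def __init__(self):
        self.targets = {}  # name -> (command, names of dependencies)
        self.cache = {}    # name -> (fingerprint of inputs, cached value)
        self.built = []    # log of the targets that were actually rebuilt

    def target(self, name, command, depends=()):
        """Register a step; re-registering a step discards its cached value."""
        self.targets[name] = (command, tuple(depends))
        self.cache.pop(name, None)

    def make(self, name):
        """Build dependencies first, then reuse or rebuild this target."""
        command, depends = self.targets[name]
        inputs = [self.make(dep) for dep in depends]
        # Fingerprint the input values so unchanged inputs mean no rebuild.
        fingerprint = hashlib.sha256(json.dumps(inputs).encode()).hexdigest()
        cached = self.cache.get(name)
        if cached is not None and cached[0] == fingerprint:
            return cached[1]
        value = command(*inputs)
        self.cache[name] = (fingerprint, value)
        self.built.append(name)
        return value


# Simulated data download that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return [1.0, 2.0, 3.0, 4.0]


pipeline = Pipeline()
pipeline.target("raw", lambda: with_retries(flaky_fetch))
pipeline.target("clean", lambda raw: [x for x in raw if x > 1.0], depends=["raw"])
pipeline.target("mean", lambda clean: sum(clean) / len(clean), depends=["clean"])

first = pipeline.make("mean")    # builds raw, clean, mean
second = pipeline.make("mean")   # everything is served from the cache
built_after_second = list(pipeline.built)

# Changing one step rebuilds only that step and what depends on it.
pipeline.target("clean", lambda raw: [x for x in raw if x > 2.0], depends=["raw"])
third = pipeline.make("mean")    # raw is reused; clean and mean are rebuilt
```

In this sketch the second call to `make` does no work, and editing the `clean` step leaves the downloaded `raw` data untouched, so the simulated network is contacted only during the first build. A real workflow would also need to persist the cache to shared storage and detect changes in the code of each step, both of which are omitted here.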
Science Support Framework Category: Data Management
Author(s): Lindsay Platt (firstname.lastname@example.org) - WMA Data Science