
I'm working on an online database of gravity data, which consists of text files with all of the metadata (location, etc.) along with the measured gravity value. It's proven pretty easy to write a Python script that synchronizes our local database (thousands of text files stored in a specific directory structure) with ScienceBase (one item per station, multiple text files per item). SB works well for this, and I believe it lets us meet FSP requirements for Open Data. But it's not very "discoverable." What I would like, and intend to submit a CDI proposal for, is a secondary "wrapper," probably in the form of a web page, that digests the text files stored in ScienceBase and presents them in some sort of map- and time-based view. In other words, SB would remain the official data of record, but the secondary interface would let the user select stations within a particular area and/or time period and either retrieve the text files or just the pertinent info (gravity) from them.
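
For concreteness, a minimal sketch of that kind of sync script (assuming the sciencebasepy package; the parent item ID, credentials, and directory layout are placeholders, not the actual script):

# Sketch only: one ScienceBase item per station, text files attached to it.
# Assumes sciencebasepy; PARENT_ID and the "archive" layout are hypothetical.
import os

import sciencebasepy

PARENT_ID = "PLACEHOLDER_PARENT_ITEM_ID"  # hypothetical parent item

sb = sciencebasepy.SbSession()
sb.login("user@usgs.gov", "password")  # credentials are placeholders

for station in sorted(os.listdir("archive")):
    station_dir = os.path.join("archive", station)
    if not os.path.isdir(station_dir):
        continue
    # Find an existing item for this station under the parent, or create one.
    results = sb.find_items({"parentId": PARENT_ID, "q": station})
    if results.get("items"):
        item = results["items"][0]
    else:
        item = sb.create_item({"parentId": PARENT_ID, "title": station})
    # Upload (or re-upload) the station's text files to the item.
    for fname in os.listdir(station_dir):
        sb.upload_file_to_item(item, os.path.join(station_dir, fname))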

Although my interest is gravity, I think this is a more general problem, and what we really need is a template and/or example of how to do it. Maybe GIS services would be an option. Anyone interested in going in on the proposal?


8 Comments

  1. We also have gravity data waiting in the wings for publication, headed to ScienceBase. What are your thoughts on using PACES (http://research.utep.edu/default.aspx?tabid=37229) as a secondary interface to make the data more "discoverable"?

    1. ... gravity measurements vs. gravity base stations might preclude usage?


    2. Thanks for mentioning PACES; it's at least been around long enough that I don't expect it to go away soon. Correct me if I'm wrong, but is downloading a CSV file the only way to see what data are available? I've only used it before as a starting point for "geologic" (e.g., depth-to-bedrock) studies, but I guess it doesn't have all the bells and whistles I was looking for.


      Our data are a little different, being time-series of absolute and relative-gravity data. More like discrete GW measurements than other geophysical data, but also not totally suitable for NWIS.


      Another existing option is AGrav, a European effort.

      We could easily store our absolute-gravity data there, and apparently a newer, better version is coming out soon. I think the biggest drawback is that it is a non-USGS product (and not all that widely used).


      What type of gravity data are you publishing? Relative data? I'm very interested in the topic, esp. what exactly to publish (gravity differences? only network adjustment results?). Here is a recent ScienceBase gravity data release:

      http://dx.doi.org/10.5066/F7SQ8XHX

  2. Would text-file time series be something that would work? Each file is for a specific station and a specific data type (e.g., geophone), and covers a time range. The interface would still be useful for selecting location, time, data type...

    1. Yes, possibly. I think there will need to be some intermediate data type between the data text files (which are output by software, shouldn't be hand-edited, and are the "official" data), and whatever web interface allows access. Maybe that's a geodatabase, or a NetCDF file, or a "real" SQL-type database. Do you know of any examples of sites "Built on ScienceBase" that provide time-series data? As far as I can tell, the web services offered by SB don't deal well with time series (maybe the NetCDF extension does?)
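
      For example, a rough sketch of working against the plain catalog API (the parent item ID is a placeholder, and the "Start" date type on the items is an assumption), with the time filtering falling to the client:

      # Sketch: list child items via the ScienceBase catalog API, then filter
      # by time client-side, since the catalog search itself has no native
      # time-series support. PARENT_ID and the "Start" date type are assumed.
      import requests

      PARENT_ID = "PLACEHOLDER_PARENT_ITEM_ID"
      resp = requests.get(
          "https://www.sciencebase.gov/catalog/items",
          params={"format": "json", "parentId": PARENT_ID, "max": 100,
                  "fields": "title,dates,spatial"},
          timeout=30,
      )
      items = resp.json()["items"]

      def in_window(item, start="2015-01-01", end="2016-12-31"):
          # Keep items whose (assumed) "Start" date falls in the window.
          for d in item.get("dates", []):
              if d.get("type") == "Start":
                  return start <= d.get("dateString", "") <= end
          return False

      selected = [i for i in items if in_window(i)]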

      Thank you-

      1. I believe the ScienceBase Development Team has looked into extensions for time-series data. I'm going to share this thread with Ignizio, Drew A. He might be able to provide more information about whether this is something ScienceBase may be able to handle in the future.

  3. In my experience, there isn't anything inherently special about the type of file that is needed to store time-series information. Generally, this can be intrinsically represented in the data by way of its structure (file names or storage organization) or by a field within the data itself that captures time information associated with a feature or record event. Tables, HDFs, Rasters, Shapefiles, Geodatabases, NetCDF, and text files can all be used to store this type of data. The "best" approach is probably one that balances efficiency between how the info is natively captured, how it will be used, storage optimization, and perhaps clarity or ease of pulling out certain information from the master set.

    Kennedy, Jeffrey R. I think you are correct that the best way to tackle something like this while utilizing ScienceBase would probably be to have another 'layer' (an application or other website) that connects to content in ScienceBase and applies additional querying or display functionality. To a certain degree, you have the decision of applying some processing to your data before you post it or after. By that, I mean you could aggregate your text into a well-structured tabular format, or perhaps a geospatial dataset, post that content to ScienceBase, and programmatically interact with it from the item(s) in which it is stored. Alternatively, you could have a set of items that host content 'as is', and then pull that content into another tool or site, re-assemble or transform it there, and work against it.

    Probably, for performance purposes, you would benefit from getting things cleaned and assembled into the structure your application anticipates sooner rather than later (if a tool or app has to pull content and transform it every time it runs, that's likely to be slow).
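
    For example, a sketch of the aggregate-before-posting option (the whitespace-delimited layout and the 'datetime' column name are assumptions about the instrument output):

    # Sketch: aggregate per-station text files into one tidy table before
    # posting to ScienceBase. Column names and delimiter are assumptions.
    import glob

    import pandas as pd

    frames = []
    for path in glob.glob("archive/*/*.txt"):
        df = pd.read_csv(path, sep=r"\s+", parse_dates=["datetime"])
        df["source_file"] = path  # keep a pointer back to the official file
        frames.append(df)

    table = pd.concat(frames, ignore_index=True)
    table.to_csv("gravity_master.csv", index=False)  # post this to ScienceBase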

    It might be helpful to think about the following:

    Where will this downstream application reside?

    What is the level of use you are anticipating? 

    What types of queries and display options do you intend to support? Is this information all captured in your data recordings in a standardized manner?

    How familiar are you or your dev team with interacting with different data types? ScienceBase could support a couple of different approaches (a sketch of the second follows this list):

      1. You simply host the files on ScienceBase, and your tool pulls them down (or syncs regularly) and then re-assembles things on your end to support your needs.

      2. You host the content as a GIS dataset (a shapefile that you perhaps update regularly? or a set of GIS datasets?) on ScienceBase and allow ScienceBase to create spatial services for the records. You could then interact with the spatial services directly to consume and display data, querying by fields you've included that capture time information.
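
    A sketch of consuming such a spatial service for approach 2 (the URL pattern, layer name, and 'obs_date' attribute are assumptions; the item's page lists the real service endpoint):

    # Sketch: pull features from the WFS that ScienceBase stands up for an
    # uploaded shapefile. ITEM_ID, the typeName, and the "obs_date" attribute
    # are placeholders; check the item's service links for the real values.
    import requests

    ITEM_ID = "PLACEHOLDER_ITEM_ID"
    wfs_url = f"https://www.sciencebase.gov/catalogMaps/mapping/ows/{ITEM_ID}"
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": f"sb:{ITEM_ID}",  # assumed layer name
        "outputFormat": "application/json",  # if the server supports JSON
        "bbox": "-115.0,35.0,-113.0,37.0,EPSG:4326",  # area of interest
    }
    features = requests.get(wfs_url, params=params, timeout=60).json()["features"]

    # Time filtering happens client-side, against an attribute assumed to exist.
    recent = [f for f in features
              if f["properties"].get("obs_date", "") >= "2016-01-01"]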

    I can't think of a specific example that uses time-series data, but there are a suite of other apps and tools that do something like one of the two approaches above. In some cases, 'features/records' have been entered very granularly into ScienceBase items, and the whole set of these records can be queried as items via the ScienceBase API, but this has limits to how well it realistically scales, plus the downside of having lots of items in ScienceBase that may not be very informative or useful on their own. This isn't a recommended use in most cases, but it really depends on the total number of recordings/features and the project.

    I think your goals here are going to be to structure a workflow (with files and a process) that is manageable for you and your team given your familiarity and skill sets, that will be performant in your downstream app, and that is as efficient as possible (you're probably going to have to massage and restructure your data somehow; the goal is to do that as efficiently as possible to meet your needs).

    I don't know if the NetCDF file format has some options that make it inherently better for time series, but the drawback there is that we don't currently offer services for NetCDF files (although we're looking at trying to add these). So you could host your data in that format, but your process would be pulling those files down from ScienceBase rather than connecting to a service built from them in ScienceBase.
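
    For what it's worth, a sketch of that pull-down-and-open pattern with NetCDF (variable and coordinate names are illustrative only; xarray with a NetCDF backend is assumed):

    # Sketch: station time series in a single NetCDF file, via xarray.
    # Names are illustrative; this is not a CF-convention-checked design.
    import numpy as np
    import pandas as pd
    import xarray as xr

    times = pd.date_range("2015-01-01", periods=4, freq="MS")
    ds = xr.Dataset(
        {"gravity": (("station", "time"), np.random.rand(2, 4))},
        coords={"station": ["SITE_A", "SITE_B"], "time": times},
    )
    ds.to_netcdf("gravity_timeseries.nc")

    # A downstream app would download the file from ScienceBase, then:
    subset = xr.open_dataset("gravity_timeseries.nc").sel(
        time=slice("2015-02-01", "2015-04-30"))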


    1. Ignizio, Drew A., thank you for the helpful reply. I like your idea of generating shapefiles and building on the ScienceBase spatial services. I have a draft gravity archive here:

      https://www.sciencebase.gov/catalog/item/56e301cae4b0f59b85d3a346

      All of the files there are synced with our local archive (text files in a particular directory structure) using a Python script and the REST API. It should be easy enough, each time SB is synced with the local archive, to programmatically generate a shapefile with the data. I was thinking of doing this just so the station locations could be shown in ScienceBase (individual stations can be shown using the 'representationalPoint' attribute, but not all of the stations in a particular study-area item).

      But, as long as I'm making the shapefiles programmatically, I might as well include the data as well as the locations. Then the user could download individual text files, or a shapefile for a particular station or study area (or the whole archive), or interact with a separate map-based webpage built on the spatial services that ScienceBase creates when the shapefiles are uploaded. As long as I can get that latter part to work, there should be no need for another intermediate (e.g., SQL) database.
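
      A sketch of how that programmatic shapefile generation might look (geopandas is assumed; the column names read from the text files are placeholders):

      # Sketch: regenerate a station shapefile each time the archive syncs.
      # The station/latitude/longitude/gravity columns are assumptions.
      import glob

      import geopandas as gpd
      import pandas as pd

      rows = []
      for path in glob.glob("archive/*/*.txt"):
          df = pd.read_csv(path, sep=r"\s+")
          rows.append({
              "station": df["station"].iloc[0],
              "lat": df["latitude"].iloc[0],
              "lon": df["longitude"].iloc[0],
              "n_obs": len(df),                  # carry some data along,
              "g_last": df["gravity"].iloc[-1],  # not just the location
          })

      stations = pd.DataFrame(rows)
      gdf = gpd.GeoDataFrame(
          stations,
          geometry=gpd.points_from_xy(stations.lon, stations.lat),
          crs="EPSG:4326",
      )
      gdf.to_file("gravity_stations.shp")  # then upload to the ScienceBase item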