Downloading large subsets of Geo Data Portal data using nccopy

While the Geo Data Portal helps gain access to time series or small subsets of gridded time series data, sometimes very large subsets are required for a particular analysis or visualization. In summer 2015, a modification to the OPeNDAP subset algorithm will be released that limits the size of NetCDF subsets to about 500 MB. For reference, that is about 65 million grid cells, which could be 6500 time steps of a 100x100 grid for one variable, or any other combination of X*Y*T*Var. This limit is being put in place to help decrease the number of failed requests for large subsets.
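As a rough check on whether a request is likely to run into the limit, the cell count can be multiplied out ahead of time. The sketch below is illustrative only: the function name and the threshold constant are our own, the threshold is simply the 65 million figure quoted above, and the server's exact cutoff may differ.

```python
# Approximate cell limit quoted above (about 500 MB); treat it as a rough guide only.
APPROX_CELL_LIMIT = 65_000_000

def subset_cells(nx, ny, nt, nvars=1):
    """Number of grid cells in a subset: nx * ny cells, nt time steps, nvars variables."""
    return nx * ny * nt * nvars

cells = subset_cells(100, 100, 6500)
print(cells, cells > APPROX_CELL_LIMIT)
```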

Fortunately, there is a method for making very large subsets of data locally that scales up to multiple gigabyte files. While these instructions are not going to teach you everything you need to know to use this method, they present the basics that will get you started learning how to get these data subsets successfully. 

The NetCDF C libraries, available from Unidata, offer a command line utility called nccopy. It is a relatively basic utility that has some very powerful capabilities. If you work with large NetCDF files, you should have nccopy and the other NetCDF-C command line utilities, such as ncdump, installed and be familiar with them. If you aren't, we highly recommend spending some time getting them installed and learning their basic functions.

nccopy is typically used to copy a NetCDF file from one format to another, changing the way the data is stored internally but not the way it looks structurally from the outside. In short, it shuffles bits into different orders, chunks the file up into different packages, and compresses the file in different ways. That is useful, but nccopy can also write a copy of a 'virtual' NetCDF file specified by a URI (note that this is a URI, not a URL, meaning it will not resolve in your browser).

If you dig into a dataset on the Geo Data Portal or another THREDDS server, say PRISM, you will find a URL labeled 'Data URL'. The page you are on (go to the PRISM link for an example) probably ends in '.html' and begins with the data URL. Note that if you just copy and paste the Data URL into a new browser window without doing anything else on the '.html' page, you will get an error because you haven't asked for anything yet. If you start clicking the check boxes on that page you will see content put into the URL in the Data URL box. The content has the form dataURLBase?variable[index:stride:index],variable[index:stride:index]. 
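The variable[index:stride:index] pattern is easy to generate programmatically. The following is a minimal sketch; the helper name constraint and its input layout are our own invention, not part of any OPeNDAP library.

```python
def constraint(variables):
    """Build an OPeNDAP constraint from {name: [(start, stride, stop), ...]}.

    Each variable gets one [start:stride:stop] group per dimension.
    """
    parts = []
    for name, ranges in variables.items():
        hyperslab = "".join(f"[{a}:{s}:{b}]" for a, s, b in ranges)
        parts.append(name + hyperslab)
    return ",".join(parts)

ce = constraint({
    "lon": [(0, 1, 1404)],
    "lat": [(0, 1, 620)],
    "time": [(0, 1, 1427)],
})
print(ce)
```

The result is the part that follows the '?' after the Data URL.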

On the PRISM page you might generate a URL that looks like:[0:1:1404],lat[0:1:620],time[0:1:1427] which, when you go to it, still gives you an error. This is because you still haven't specified what format you want. If you click the 'Get ASCII' button, it will open a new tab with this URL:[0:1:1404],lat[0:1:620],time[0:1:1427] which has .ascii added before the ? and returns an ASCII text version of the lon, lat, and time variables. Following this basic pattern, long lists of variables with very large index ranges can be described.
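To make the URL structure concrete, here is a small sketch that joins the three pieces. The base URL is a placeholder we made up (substitute the Data URL shown on the dataset page), and request_url is a hypothetical helper.

```python
# Hypothetical base Data URL; replace it with the one from the dataset page.
DATA_URL = "https://example.org/thredds/dodsC/prism"
CONSTRAINT = "lon[0:1:1404],lat[0:1:620],time[0:1:1427]"

def request_url(data_url, constraint, suffix=""):
    """Join the Data URL, an optional response suffix such as '.ascii', and the constraint."""
    return f"{data_url}{suffix}?{constraint}"

# Without a suffix no response format is requested, which is the error case described above.
print(request_url(DATA_URL, CONSTRAINT))
# With '.ascii' inserted before the '?', the server is asked for an ASCII text response.
print(request_url(DATA_URL, CONSTRAINT, ".ascii"))
```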

The problem here is that if you ask a server to generate a multiple gigabyte response like this, it will throw an error saying you've asked for too much data. To get around this, we can use nccopy, which decomposes the very large data specification and requests the remote data in chunks, assembling a single file (or multiple files, depending on how you want to do it) locally.
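If you would rather end up with several smaller files, one option is to split the time index range yourself and run one nccopy command per block. This is a sketch of that idea only; the helper time_blocks and the block size of 500 are arbitrary choices of ours, and nccopy does not require you to do this.

```python
def time_blocks(first, last, block):
    """Split the inclusive index range first..last into inclusive (start, stop) blocks."""
    return [(s, min(s + block - 1, last)) for s in range(first, last + 1, block)]

# One constraint per block; each could be passed to its own nccopy command.
for start, stop in time_blocks(0, 1427, 500):
    print(f"ppt[{start}:1:{stop}][0:1:620][0:1:1404]")
```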

To get this PRISM example using nccopy, the command would be:

nccopy example 1
nccopy -u[0:1:1404],lat[0:1:620],time[0:1:1427]

We could further add a variable and get the whole precipitation variable like:

nccopy example 2
nccopy -u[0:1:1404],lat[0:1:620],time[0:1:1427],ppt[0:1:1427][0:1:620][0:1:1404]

This is a pretty big file though. Because the index ranges are inclusive, the ppt array is 1428*621*1405, or about 1.25 billion cells! If we look at the bottom of the page, we can see that ppt is stored as 32-bit integers, so the data volume would be about 4.6GB uncompressed. So let's look at compression.
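The size estimate can be reproduced with a few lines of arithmetic. Note that OPeNDAP index ranges are inclusive, so [0:1:1427] selects 1428 values. The helper name n_inclusive is ours.

```python
def n_inclusive(start, stop, stride=1):
    """Number of indices selected by an inclusive OPeNDAP range [start:stride:stop]."""
    return (stop - start) // stride + 1

cells = n_inclusive(0, 1427) * n_inclusive(0, 620) * n_inclusive(0, 1404)
size_bytes = cells * 4  # ppt is stored as 32-bit (4-byte) integers
print(cells, size_bytes, round(size_bytes / 2**30, 2))  # cells, bytes, GiB
```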

nccopy lets us write data in NetCDF-4 format, which is built on HDF5 and can store far more than 4.6GB of data. But we should also consider disk space and read/write performance, so writing the data compressed in small pieces that can be read quickly is a good idea. We could do something like:

nccopy compression example
nccopy -k 3 -d 1 -c time/1,lat/30,lon/30 -u[0:1:1404],lat[0:1:620],time[0:1:1427],ppt[0:1:1427][0:1:620][0:1:1404]

-k 3 will make sure we get a NetCDF-4 file, -d 1 will set the deflation to level 1, and -c time/1,lat/30,lon/30 will set the chunk sizes so that each time step is split into chunks of 30x30 cells (roughly 1,000 chunks per time step for this grid) that can each be compressed individually.
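The number of chunks per time step follows from the grid and chunk dimensions. A quick sketch, using the 621 x 1405 grid of this dataset (the function name is ours):

```python
import math

def chunks_per_step(nlat, nlon, chunk_lat, chunk_lon):
    """Chunks needed to tile one time step; partial chunks at the edges count as whole."""
    return math.ceil(nlat / chunk_lat) * math.ceil(nlon / chunk_lon)

print(chunks_per_step(621, 1405, 30, 30))
print(30 * 30 * 4)  # uncompressed bytes in one 1 x 30 x 30 chunk of 32-bit integers
```

Smaller chunks make small spatial reads cheaper but increase the number of chunks the library has to track, so the right size depends on how the file will be read.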

To get a feel for this, try running it on a very short time period:

nccopy small compression example
nccopy -k 3 -d 1 -c time/1,lat/30,lon/30 -u[0:1:1404],lat[0:1:620],time[0:1:1],ppt[0:1:1][0:1:620][0:1:1404]
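If you script these downloads, it can help to assemble the command as an argument list so the URI is never mangled by the shell (the characters [, ], and ? are shell glob characters). The sketch below only builds and prints the command. The base URL and the output file name are placeholders we chose for illustration.

```python
import shlex

# Hypothetical values: replace with the real Data URL and your preferred output file.
DATA_URL = "https://example.org/thredds/dodsC/prism"
CONSTRAINT = "lon[0:1:1404],lat[0:1:620],time[0:1:1],ppt[0:1:1][0:1:620][0:1:1404]"
OUTPUT = "prism_ppt.nc"

command = [
    "nccopy",
    "-k", "3",                     # NetCDF-4 output
    "-d", "1",                     # deflate level 1
    "-c", "time/1,lat/30,lon/30",  # chunk length along each dimension
    "-u",                          # convert unlimited dimensions to fixed size
    f"{DATA_URL}?{CONSTRAINT}",    # input: OPeNDAP URI with constraint
    OUTPUT,                        # output file
]

# shlex.join quotes the URI so the printed line is safe to paste into a shell.
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```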

If we then run ncdump on the output file, we can see that the ChunkSizes we asked for are correct:

ncdump example
ncdump -sh # -s shows special virtual attributes (such as _ChunkSizes), -h shows only the header information for the file
netcdf prism_ppt {
	lat = 621 ;
	lon = 1405 ;
	time = 2 ;
	float lon(lon) ;
		lon:long_name = "Longitude" ;
		lon:units = "degrees_east" ;
		lon:_Storage = "chunked" ;
		lon:_ChunkSizes = 1405 ;
		lon:_DeflateLevel = 1 ;
	float lat(lat) ;
		lat:long_name = "Latitude" ;
		lat:units = "degrees_north" ;
		lat:_Storage = "chunked" ;
		lat:_ChunkSizes = 621 ;
		lat:_DeflateLevel = 1 ;
	float time(time) ;
		time:calendar = "standard" ;
		time:bounds = "time_bnds" ;
		time:units = "days since 1870-01-01 00:00:00" ;
		time:_Storage = "chunked" ;
		time:_ChunkSizes = 2 ;
		time:_DeflateLevel = 1 ;
	int ppt(time, lat, lon) ;
		ppt:units = "mm/month" ;
		ppt:scale_factor = 0.01 ;
		ppt:add_offset = 0. ;
		ppt:long_name = "Mean monthly precipitation" ;
		ppt:_FillValue = -9999 ;
		ppt:_Storage = "chunked" ;
		ppt:_ChunkSizes = 1, 30, 30 ;
		ppt:_DeflateLevel = 1 ;
		ppt:_Endianness = "little" ;
// global attributes:
		:Conventions = "CF-1.4" ;
		:acknowledgment = "PRISM Climate Group, Oregon State University,, Accessed October 2014." ;
		:Metadata_Conventions = "Unidata Dataset Discovery v1.0" ;
		:title = "Parameter-elevation Regressions on Independent Slopes Model Monthly Climate Data for the Continental United States. October 2014 Snapshot" ;
		:summary = " This dataset was created using the PRISM (Parameter-elevation Regressions on Independent Slopes Model) climate mapping system, developed by Dr. Christopher Daly, PRISM Climate Group director. PRISM is a unique knowledge-based system that uses point measurements of precipitation, temperature, and other climatic factors to produce continuous, digital grid estimates of monthly, yearly, and event-based climatic parameters. Continuously updated, this unique analytical tool incorporates point data, a digital elevation model, and expert knowledge of complex climatic extremes, including rain shadows, coastal effects, and temperature inversions. PRISM data sets are recognized world-wide as the highest-quality spatial climate data sets currently available. PRISM is the USDA\'s official climatological data. " ;
		:keywords = "Atmospheric Temperature, Air Temperature Atmosphere, Precipitation, Rain, Maximum Daily Temperature, Minimum  Daily Temperature" ;
		:keywords_vocabulary = "GCMD Science Keywords" ;
		:id = "prism/thredds/" ;
		:naming_authority = "" ;
		:cdm_data_type = "Grid" ;
		:creator_name = "Christopher Daley" ;
		:creator_email = "" ;
		:publisher_name = "PRISM Climate Group" ;
		:publisher_url = "" ;
		:geospatial_lat_min = "24" ;
		:geospatial_lat_max = "53" ;
		:geospatial_lon_min = "-125" ;
		:geospatial_lon_max = "-67" ;
		:time_coverage_start = "1895-01-01T00:00" ;
		:time_coverage_end = "2013-12-01T00:00" ;
		:time_coverage_resolution = "Monthly" ;
		:license = "Freely Available: The PRISM Climate Group, Oregon State University retains rights to ownership of the data and information." ;
		:authors = "PRISM Climate Group" ;
		:institution = "Oregon State University" ;
		:_Format = "netCDF-4" ;
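When checking many files, the chunk sizes can be pulled out of the ncdump -sh text automatically. The following is a minimal sketch that assumes the header text has already been captured as a string; chunk_sizes is a hypothetical helper and the sample is a few lines copied from the output above.

```python
import re

def chunk_sizes(header_text):
    """Map variable name -> tuple of chunk sizes found in `ncdump -sh` output."""
    pattern = re.compile(r"^\s*(\w+):_ChunkSizes = ([\d, ]+?) ;", re.MULTILINE)
    return {
        name: tuple(int(n) for n in sizes.split(","))
        for name, sizes in pattern.findall(header_text)
    }

sample = """
		lon:_ChunkSizes = 1405 ;
		ppt:_Storage = "chunked" ;
		ppt:_ChunkSizes = 1, 30, 30 ;
"""
print(chunk_sizes(sample))
```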

To show how big an impact compression can have: if we run the same command but leave off the -k, -d, and -c flags, the output is 6.7MB rather than 2MB compressed!
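The uncompressed figure is consistent with the size of the ppt array alone, as this back-of-the-envelope check suggests (it ignores the small coordinate variables and file metadata):

```python
cells = 2 * 621 * 1405   # time x lat x lon for the two-step example
raw_bytes = cells * 4    # 32-bit integers
print(raw_bytes, round(raw_bytes / 2**20, 1))  # bytes, MiB before compression
```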

So that should show the basics of using nccopy, but it doesn't help with the problem of which indices correspond to which time steps or where in space. For that, we can use the Geo Data Portal. With the limitation on copied file size, we have implemented an error message that includes the OPeNDAP URI that specifies the subset of data that the OPeNDAP subset algorithm would have created. All the variables, including coordinate variables and grid_mappings, and the OPeNDAP indices that correspond to the time range and spatial domain of the area of interest provided to the Geo Data Portal request are listed in a URI that can be passed directly to an nccopy command as summarized above.
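For a quick sanity check of the time indices in such a URI, the PRISM metadata shown earlier (monthly resolution, coverage from 1895-01 to 2013-12) suggests a simple mapping. This sketch assumes one time step per month with no gaps, which you should verify against the time coordinate variable before relying on it. The function name is ours.

```python
def month_index(year, month, start_year=1895, start_month=1):
    """Zero-based time index of a month, assuming one step per month with no gaps."""
    return (year - start_year) * 12 + (month - start_month)

print(month_index(1895, 1), month_index(2013, 12))
```

Under that assumption, the first and last months map to indices 0 and 1427, which matches the time[0:1:1427] range used in the examples.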

For such large files, it is also worth putting some thought into chunking, file format, and deflation, but this functionality does the heavy lifting of determining which data indices cover the place and time period you need.

If you make a request that exceeds about 500 MB in response file size, you will get an error message including this URI. Please let us know if you run across examples that seem wrong, as this is very new and, as yet, lightly used functionality.

Finally, please contact us for clarifications and requested additions to this nccopy how-to.
