
Introduction

TRIP Hook code: https://github.com/triphook/watershedtools

NAWQA Code: Processing Landscape Variables to NHDPlus Version 2.1

MSU Code: Included in the following publication.  https://springerplus.springeropen.com/articles/10.1186/2193-1801-3-589

USGS Biogeographic Characterization Branch Code:  Python implementation of MSU code: https://my.usgs.gov/bitbucket/projects/BIOLAB/repos/upstreamnetworkaggregation/browse?at=refs%2Fheads%2FAggUpCSV

StreamCat Code: https://github.com/USEPA/StreamCat

StreamCat Data: ftp://newftp.epa.gov/EPADataCommons/ORD/NHDPlusLandscapeAttributes/StreamCat/WelcomePage.html

 

Algorithm Strategy

We would like to use this wiki space to document the rules and NHDPlus attributes that drive each accumulation algorithm.

Proposed Methodology

  • Call (Fri Nov 6, 2015, 2pm – 3pm EST) to discuss accumulation comparisons. The idea discussed so far is to compare precip and area, using precip values from the NHDPlusV2 EROM tables and generating the allocations and accumulations using our respective methods and the precip raster used by NHDPlus.  We can also compare our watershed areas with those contained in NHDPlus V2.  From email discussions:
     
    Talking with Scott, it sounded like the best first step would be to pick a simple landscape metric like precip, where we are all using a pre-defined landscape raster with a specific projection, resolution, etc.  From there, we can:
    1)      Each run our ‘allocation’ approach
    It's been several years since I have had to mess with this, and I have mostly used AML in the past (way faster than arcpy), so I am going to assume your groups have a better approach for this now.
    2)      Each run our ‘accumulation’ approach
    I think running accumulation off the same set of allocated values will help us identify why differences in accumulation are occurring. 
    3)      Compare times we get for these (given this will be on different machines for now)
        I think this comparison is good to know but at this point I think we should focus on comparing output and using those findings to discuss what "we want the code to do", then focus on speed.  (e.g. if a process doesn't address braids right it doesn't really matter how fast it is).  We should be running this on the same # of variables if we do any comparison here.
    4)      Produce agreed on standard output files for comparison purposes (I’m thinking at a minimum files with COMID and watershed-level area weighted results, likely as files by hydroregion) 
    I think we should aim to compare at the national scale if we can.  If we are just looking at the COMID and 1 or 2 associated variables, these files won't be too big.
    5)      Each of us can run comparisons on the three files produced and then compare notes on any discrepancies we find.
    6)      Precipitation is a good one since we can additionally compare with NHDPlus V21 accumulation of precip
    I agree on precip; area is also an easy one.  Should we try comparing a variable for each of minimum, maximum, area-weighted average, and sum?
  • Marc Weber went ahead and put the zip file he got from Cindy McKay on an ftp account (ftp to: scienceftp2.epa.gov)
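The comparison in step 5 could be as simple as joining each group's COMID-indexed output files and flagging disagreements. A minimal sketch with made-up values (the data frames, column names, and tolerance are hypothetical placeholders, not any group's actual output format):

```r
# Hypothetical sketch of step 5: compare two groups' standardized output files.
# In practice each data frame would come from read.csv() on an agreed-upon file.
epa  <- data.frame(COMID = c(101, 102, 103), PrecipWs = c(812.4, 799.0, 1001.2))
usgs <- data.frame(COMID = c(101, 102, 103), PrecipWs = c(812.4, 805.5, 1001.2))

# Join on COMID and flag discrepancies beyond a small tolerance
both <- merge(epa, usgs, by = "COMID", suffixes = c(".epa", ".usgs"))
both$diff <- both$PrecipWs.epa - both$PrecipWs.usgs
outliers <- both[abs(both$diff) > 1, ]
print(outliers)  # COMID 102 differs by -6.5
```

Any COMIDs that survive the tolerance filter are the discrepancies worth comparing notes on.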

     

EPA USGS Precip Allocation Check

16 November, 2015

EPA / USGS Precip Allocation Comparisons

EPA / USGS Percent Full Allocation Comparisons

 

We compared mean precipitation values derived from our allocation process to those derived by USGS. To compare values, we plotted allocated precipitation values in the plots below (including a red 1:1 line). In addition, we plotted the difference between EPA and USGS allocations of precipitation against the average of the EPA and USGS allocated precipitation values. This second plot helped to identify outliers, i.e., catchments with potential errors. The precipitation raster that we used in this comparison was obtained from Horizon Systems Corp. and was the same raster used to generate the NHDPlusV2 precipitation tables. This raster was developed by combining 30-yr (1971-2000) PRISM precipitation normals with 1-km climate rasters of Canada and Mexico (see NHDPlusV2 User Guide, p. 72 for details).

 

R Code for comparisons:

 

library(knitr)

# Set paths
USGSpath = "L:/Priv/CORFiles/Geospatial_Library/Data/Project/SSWR1.1B/USGSCollaboration/Wieczorek_USGS/"
EPAPath = "L:/Priv/CORFiles/Geospatial_Library/Data/Project/SSWR1.1B/USGSCollaboration/Precip_Allocation/"

# Set regions
reg <- c("NE","MA","SA","SA","SA","GL","MS","MS","MS","MS","SR","MS","MS","MS","TX","RG","CO","CO","GB","PN","CA")
hydro_reg <- c("01","02","03S","03N","03W","04","05","06","07","08","09","10L","10U","11","12","13","14","15","16","17","18")

for (k in 1:21){
  EPAPrecip = read.csv(paste0(EPAPath, "precip_", hydro_reg[k], ".csv"))
  USGSPrecip = read.csv(paste0(USGSpath, "PRECIP7100_PU", hydro_reg[k], ".txt"))
  EPAPrecip = EPAPrecip[c(1,3)]
  names(EPAPrecip)[2] = "PrecipMeanEPA"
  USGSPrecip = USGSPrecip[c(1,2)]
  names(USGSPrecip)[2] = "PrecipMeanUSGS"
  BothPrecip = EPAPrecip
  BothPrecip$PrecipMeanUSGS = USGSPrecip$PrecipMeanUSGS[match(BothPrecip$COMID, USGSPrecip$COMID)]

  # BothPrecip$PrecipDiff = abs(BothPrecip[,3] - BothPrecip[,2])  # Subtract USGS precip from EPA precip

  par(pty="s", mfrow=c(1,2), mai=c(0.5, 1, 1, 0.2))
  plot(BothPrecip$PrecipMeanEPA ~ BothPrecip$PrecipMeanUSGS, pch = 20, cex.lab=0.75,
       ylab = "EPA Allocated precip (cm*100)",
       xlab = "USGS Allocated precip (cm*100)",
       main = paste0("HydroRegion: ", hydro_reg[k]), cex.main=1.5)
  abline(0, 1, col=2)

  # plot(Precip$PrecipDiff ~ Precip$PrecipMean, pch = 20, cex.lab=0.75,
  #      ylab = "NHD-StreamCat pairwise precip diff. (mm)",
  #      xlab = "NHD-StreamCat pairwise precip means (mm)")
}

 

Catchment Allocation One to One Plots:

 

R Code for Percent Full Allocation Comparisons:

 

library(knitr)

# Set paths
USGSpath = "L:/Priv/CORFiles/Geospatial_Library/Data/Project/SSWR1.1B/USGSCollaboration/Wieczorek_USGS/"
EPAPath = "L:/Priv/CORFiles/Geospatial_Library/Data/Project/SSWR1.1B/USGSCollaboration/Precip_Allocation/"

# Set regions
reg <- c("NE","MA","SA","SA","SA","GL","MS","MS","MS","MS","SR","MS","MS","MS","TX","RG","CO","CO","GB","PN","CA")
hydro_reg <- c("01","02","03S","03N","03W","04","05","06","07","08","09","10L","10U","11","12","13","14","15","16","17","18")

for (k in 1:21){
  EPAPrecip = read.csv(paste0(EPAPath, "precip_", hydro_reg[k], ".csv"))
  USGSPrecip = read.csv(paste0(USGSpath, "PRECIP7100_PU", hydro_reg[k], ".txt"))
  EPAPrecip = EPAPrecip[c(1,4)]
  names(EPAPrecip)[2] = "PctFullEPA"
  USGSPrecip = USGSPrecip[c(1,5)]
  USGSPrecip[,2] = 100 - USGSPrecip[,2]
  names(USGSPrecip)[2] = "PctFullUSGS"
  BothPrecip = EPAPrecip
  BothPrecip$PctFullUSGS = USGSPrecip$PctFullUSGS[match(BothPrecip$COMID, USGSPrecip$COMID)]

  # BothPrecip$PrecipDiff = abs(BothPrecip[,3] - BothPrecip[,2])  # Subtract USGS precip from EPA precip

  par(pty="s", mfrow=c(1,2), mai=c(0.5, 1, 1, 0.2))
  plot(BothPrecip$PctFullEPA ~ BothPrecip$PctFullUSGS, pch = 20, cex.lab=0.75,
       ylab = "EPA Allocated Percent Full",
       xlab = "USGS Allocated Percent Full",
       main = paste0("HydroRegion: ", hydro_reg[k]), cex.main=1.5)
  abline(0, 1, col=2)

  # plot(Precip$PrecipDiff ~ Precip$PrecipMean, pch = 20, cex.lab=0.75,
  #      ylab = "NHD-StreamCat pairwise precip diff. (mm)",
  #      xlab = "NHD-StreamCat pairwise precip means (mm)")
}

 

Catchment Source Coverage Plots:


Proposed Methodology for Accumulation Comparisons

  • Initial testing should consist of verification of "full upstream" watershed areas, which are defined as...? what exactly does verification mean? that different groups come up with the same accumulated areas for a given set of flowlines?
  • NAWQA defines upstream watersheds two ways:
    • 1) Flow diversions: defined using the divergence-routed cumulative drainage area approach.  This process apportions an accumulated amount to multiple paths at a diversion according to a user-defined fraction coded with each reach.  Typically, the fraction is based on the fraction of flow going down each diverted path.  For NHDPlus version 2, all flow is assigned to the main path and zero flow is assigned to non-main paths, so the apportionment is an all-or-nothing approach to routing.  This method of apportionment means that when diverted paths reconnect, there will be no double counting of accumulated values upstream from the original diversion point.
    • 2) Total Accumulations: upstream accumulations are based on from and to nodes in the reach network.  This method of apportionment also ensures there will be no double counting of accumulated values upstream from the original diversion point.
  • Precipitation. 
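The divergence-routed apportionment described above can be sketched in a few lines, assuming a toy reach table with from/to nodes and a diversion fraction (here called DivFrac; with NHDPlusV2's all-or-nothing routing it is 1 on the main path and 0 otherwise). All names and values here are illustrative, not the actual NAWQA implementation:

```r
# Hypothetical sketch of divergence-routed accumulation (not the NAWQA code).
# Each reach carries a local value; at a diversion, the upstream accumulated
# amount is apportioned by DivFrac (main path = 1, non-main path = 0).
net <- data.frame(
  COMID    = c(1, 2, 3, 4),
  FromNode = c(10, 20, 20, 30),   # reaches 2 and 3 diverge at node 20
  ToNode   = c(20, 30, 30, 40),   # and reconnect at node 30
  DivFrac  = c(1, 1, 0, 1),       # reach 3 is the non-main (zero-flow) path
  local    = c(5, 2, 2, 1)
)

# Rows are assumed to be in topological (upstream-to-downstream) order.
node_acc <- setNames(rep(0, 4), c("10", "20", "30", "40"))
acc <- numeric(nrow(net))
for (i in seq_len(nrow(net))) {
  r <- net[i, ]
  acc[i] <- r$DivFrac * node_acc[as.character(r$FromNode)] + r$local
  node_acc[as.character(r$ToNode)] <- node_acc[as.character(r$ToNode)] + acc[i]
}
net$acc <- acc
print(net$acc)
```

Because reach 3 receives a zero fraction at the diversion, the value upstream of node 20 is counted only once when the paths reconnect at node 30: reach 4 accumulates 10 (= 5 + 2 + 2 + 1), with no double counting.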

Summaries to be tested include

  • maximum upstream value
  • minimum upstream value
  • area-weighted average
  • others?
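As a concrete example of these summaries, given each upstream catchment's mean value and area, the maximum, minimum, and area-weighted average can be computed directly (a minimal sketch with made-up values and hypothetical column names):

```r
# Hypothetical sketch: summaries over an upstream catchment set.
upstream <- data.frame(
  COMID     = c(1, 2, 3),
  area_sqkm = c(10, 30, 60),
  precip_mm = c(800, 900, 1000)
)

max_up  <- max(upstream$precip_mm)   # maximum upstream value
min_up  <- min(upstream$precip_mm)   # minimum upstream value
# Area-weighted average: weight each catchment mean by its area
aw_mean <- sum(upstream$precip_mm * upstream$area_sqkm) / sum(upstream$area_sqkm)
print(aw_mean)  # 950
```

The area-weighted average (950) differs from the simple mean (900) because the wettest catchment is also the largest, which is exactly why the weighting matters for comparisons.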

The US Environmental Protection Agency (EPA) has included a raster (precip.tif) that was created by Cindy McKay and others at Horizon Systems Corp (HSC). This dataset was used by the HSC team to create the accumulated precipitation values distributed with the NHDPlusV2 and was developed by combining 30-yr. PRISM precipitation normals with Mexican and Canadian climate data. The html file, PrecipitationQACheck.html, provides an example of how we compared our results to those of NHDPlusV2 and we propose using this approach to compare results among groups.

  • Daniel comments: I think this is a good approach but I would recommend we compare results for datasets that are using different types of summaries as well. 

Notes from Previous Work

EPA (Weber):  We have found that full upstream watershed areas are the best way to check accumulation accuracy. It is possible to get very similar results when comparing precipitation accumulations, for example, for the wrong reasons because of spatial auto-correlation of rain patterns. We compared our accumulated areas (wasn't precip being used?) against those distributed with the NHDPlusV2. Through these comparisons, we were able to diagnose several problems in our accumulation process, especially at inter-HydroRegion borders. In addition, we identified errors in the NHDPlusV2 accumulations that we were able to bring to the attention of Cindy McKay of Horizon Systems Corp (HSC). The file AccumulationsChecks.html shows an example of how we compared our watershed areas to those of the NHDPlusV2.
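An area check of this kind could be set up by joining our accumulated areas to the NHDPlusV2 flowline attribute TotDASqKM (total upstream drainage area) by COMID. A minimal sketch with made-up values (our column name WsAreaSqKm and the 1% threshold are hypothetical):

```r
# Hypothetical sketch: check our accumulated areas against NHDPlusV2 totals.
ours <- data.frame(COMID = c(101, 102), WsAreaSqKm = c(12.5, 340.2))
nhd  <- data.frame(COMID = c(101, 102), TotDASqKM  = c(12.5, 355.0))

chk <- merge(ours, nhd, by = "COMID")
chk$pct_diff <- 100 * (chk$WsAreaSqKm - chk$TotDASqKM) / chk$TotDASqKM

# Flag catchments differing by more than 1% for manual inspection
flagged <- chk[abs(chk$pct_diff) > 1, ]
print(flagged$COMID)
```

Flagged COMIDs can then be inspected individually; in past work, many such discrepancies traced back to braided networks or inter-HydroRegion borders.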

USGS Aquatic Gap Analysis Program (AGAP) and the National Fish Habitat Partnership (NFHP) (Wieferich):  We too have found that full upstream watershed areas are the best way to check accumulation accuracy.  In previous work we used the upstream watershed areas to verify results from our accumulation as well.  Similar to the EPA group, we were able to fine-tune our code and also found a number of COMIDs within the NHDPlusV1 with incorrect network areas, mostly due to braided networks.  After finding incorrect values in the NHDPlusV1, we also performed a number of manual spot-checks to evaluate our accuracy.  We then used data from the NLCD to further verify our calculations (how? or do the particulars of this matter?).  Since our group discussions occurred, we have updated our code and database to run on the NHDPlusV2 and have started verification using the upstream area values.

USGS NAWQA (Wieczorek): has lists of accumulated basin areas, indexed by COMID, that can be used to compare with the work of others.

Ease of use and access

Should we compare how easy each tool is to use? If so, what metrics should we use?

  • compiled (e.g., Fortran or C) vs. interpreted (e.g., Python or SAS)
  • free software vs. license (e.g., Python vs. SAS/ArcGIS, R spatial libraries vs ArcGIS, specific Python modules such as numpy, gdal, pysal vs arcpy and needing ArcGIS license)
  • specialized hardware environment (e.g., Yeti HPC) vs. standard configurations (e.g., Linux, Windows)

  • Daniel: My opinion is that no matter what, this work should result in an SOP or similar documentation for our final process so the complexity should be a lower priority, unless it proves to add additional manual processing time.

Speed Benchmarking: HPC vs. non-HPC

This is really the last step. Not necessarily the least important, but we first want to make sure that we are all producing the same numbers even if our algorithms are different.

Proposed Methodology 

  • Compile a list of run times for each tool in at least 2 (possibly more) environments
    • Desired analyses
      • accumulated area
        • this is probably the core analysis of this proposal
      • accumulated characteristics, like mean areal precip
        • This creates questions, like should the tool just produce the sum, average, or range of whatever the attribute is?
        • This could test engineering for memory management of large datasets, like the PRISM precip.tif
          • dynamic: so, does this mean that zonalstats are being computed on the fly or are these pre-processed into a table, indexed by COMID, that is input to the tool? 
            • Roland thinks this is out of scope. Mike agrees.
          • pre-processed:
          • Daniel comments: Should we specify the number of variables being used?  I think we should run at least a dozen variables through at one time and this number should be consistent across the....
        • Roland comments: maybe attribute accumulation should be reserved for a second comparison study.
    • Computing platforms
      • USGS High Performance Cluster (HPC).
        • Does this mean that code will be parallelized or just run in its standard incarnation on HPC machines? 
        • Verify that each approach is compatible with the USGS HPC. 
          • is this important to the group?
      • Linux
      • Windows (?)
    • Compute times: just build a table of each analysis (Type of calculation, like  sum or area weighted average) under each platform
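Such a table could be assembled directly from timed runs, e.g. with base R's system.time. A minimal sketch (the analysis names, platform labels, and placeholder calls are all hypothetical; the real tool invocations would go where Sys.sleep(0) is):

```r
# Hypothetical sketch: collect elapsed run times into a comparison table.
time_run <- function(f) unname(system.time(f())["elapsed"])

results <- data.frame(
  analysis = c("accumulated area", "area-weighted precip"),
  platform = c("Linux", "Linux"),
  seconds  = c(
    time_run(function() Sys.sleep(0)),  # placeholder for the real tool call
    time_run(function() Sys.sleep(0))   # placeholder for the real tool call
  )
)
print(results)
```

One row per (analysis, platform) pair keeps the comparison table flat and easy to extend as more tools and environments are added.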

1 Comment

  1. some editing comments that we can delete once addressed:

    -perhaps we can give this page a more generic title. We can definitely allude to Daniel's suggestion and leadership w/in the page if that helps folks remember the conversation. 

    -does this topic need to be here in the hidden part of the wiki? might be good to broach w/the group.

    -def of "full upstream"

    -can we link to or attach the "precip.tif" layer and is there any actual citation on this thing? If not on the latter, can we encourage them to do this via USGS & ScienceBase? Probably good to post or link to any referenced data or pubs wherever possible, in general.