21 June, 2017
# geoknife

## What is `geoknife`?

The `geoknife` package was created to support web-based geoprocessing of large gridded datasets according to their overlap with landscape (or aquatic/ocean) features that are often irregularly shaped. `geoknife` creates data access and subsequent geoprocessing requests for the USGS's Geo Data Portal to carry out on a web server.
## `geoknife`: Where does the data come from?
## What does `geoknife` do to the data?

`geoknife` subsets and/or summarizes data on a webserver, and lets you know when it is complete.

## `geoknife` concepts: `geoknife(stencil, fabric, knife)`
- `stencil`: the "feature" (lake, watershed, point, state, etc.)
- `fabric`: the web-available gridded data to be processed
- `knife`: processing configuration
`geoknife` uses the `stencil`, `fabric`, and `knife` to subset and/or summarize data on a webserver and deliver the result back to you.

A stencil can be defined in a number of different ways:
- a single longitude/latitude point (`simplegeom` class): `stencil = c(-89, 46)`
- a `data.frame()` of points (`simplegeom` class): `stencil = data.frame(point1 = c(-89, 46), point2 = c(-88.6, 45.2))`
- web-available geometries (`webgeom` class) that have quickstart shortcuts: `stencil = webgeom("state::New Hampshire,Wisconsin")` or `stencil = webgeom("ecoregion::Colorado Plateaus,Driftless Area")`
- web-available geometries (`webgeom` class) from a shapefile that you have uploaded: `stencil = webgeom(geom = "upload:CIDA_TEST_", attribute = "Letter", value = "C")`
`print` in R will tell you what you are working with for the different object classes:

```r
stencil = webgeom("HUC8::09020306,14060009")
stencil
```

```
## An object of class "webgeom":
## url: https://cida.usgs.gov/gdp/geoserver/wfs
## geom: sample:simplified_huc8
## attribute: HUC_8
## values: 09020306, 14060009
## wfs version: 1.1.0
```
and you can modify things:
```r
values(stencil) <- c("07140103", "07140108", "12100202", "12060104", "12060101")
stencil
```
```
## An object of class "webgeom":
## url: https://cida.usgs.gov/gdp/geoserver/wfs
## geom: sample:simplified_huc8
## attribute: HUC_8
## values: 07140103, 07140108, 12100202, 12060104, 12060101
## wfs version: 1.1.0
```
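If you don't know which values are valid for the current geom and attribute, the `query()` function (covered in more detail below) can list them; a quick sketch (this call hits the web service):

```r
# list the HUC_8 values available for the current geom/attribute
head(query(stencil, "values"))
```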
There is only one `geoknife` class that can be used for fabric, the `webdata` class (maybe more will exist in the future). `webdata` has some important fields:
- `times`: the time range that you want access to (the default is `c(NA, NA)`, which is "all time")
- `url`: the url of the dataset's OPeNDAP endpoint (or Web Coverage Service endpoint)
- `variables`: the variables you want access to

The print method works for these too:

```r
webdata()
```

```
## An object of class "webdata":
## times: NA, NA
## url: NA
## variables: NA
```
You can define values for those fields when creating `webdata`:

```r
fabric = webdata(url = "https://cida.usgs.gov/thredds/dodsC/prism_v2",
                 times = c("1990-01-01", "2010-01-01"),
                 variables = c("ppt", "tmx"))
fabric
```

```
## An object of class "webdata":
## times: 1990-01-01T00:00:00Z, 2010-01-01T00:00:00Z
## url: https://cida.usgs.gov/thredds/dodsC/prism_v2
## variables: ppt, tmx
```
and access and set them:
```r
times(fabric)[1] <- "1899-01-01"
variables(fabric) <- "tmn"
fabric
```

```
## An object of class "webdata":
## times: 1899-01-01T06:00:00Z, 2010-01-01T00:00:00Z
## url: https://cida.usgs.gov/thredds/dodsC/prism_v2
## variables: tmn
```
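Note that setting `times` does not check anything against the dataset itself; to see the time range the dataset actually covers, you can query it (a sketch; requires web access):

```r
# the time range available in the dataset behind `fabric`
query(fabric, "times")
```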
There is only one `geoknife` class that can be used for knife, the `webprocess` class (maybe more will exist in the future). `webprocess` has some important fields:
- `url`: the url of the web processing service to be used
- `algorithm`: a list specifying the algorithm to be used for processing. Defaults to Area Grid Statistics (weighted).
- `version`: the web processing service version to use (there is only one right now, `1.0.0`)
- `processInputs`: a list of processing details for the specified `algorithm`. These details vary depending on the `algorithm`, and this field is automatically reset when the `algorithm` field is set.
- `wait`: do you want R to wait while the process runs? Defaults to `FALSE`
- `email`: an email address to use if you want to be notified when the process is complete

The print method works for these too:

```r
webprocess()
```

```
## An object of class "webprocess":
## url: https://cida-test.er.usgs.gov/gdp/process/WebProcessingService
## algorithm: Area Grid Statistics (weighted)
## web processing service version: 1.0.0
## process inputs:
##   SUMMARIZE_TIMESTEP: false
##   SUMMARIZE_FEATURE_ATTRIBUTE: false
##   REQUIRE_FULL_COVERAGE: true
##   DELIMITER: COMMA
##   STATISTICS:
##   GROUP_BY:
## wait: FALSE
## email: NA
```
and you can modify settings when you create one:

```r
webprocess(DELIMITER = "TAB", REQUIRE_FULL_COVERAGE = FALSE, wait = TRUE)
```

```
## An object of class "webprocess":
## url: https://cida-test.er.usgs.gov/gdp/process/WebProcessingService
## algorithm: Area Grid Statistics (weighted)
## web processing service version: 1.0.0
## process inputs:
##   SUMMARIZE_TIMESTEP: false
##   SUMMARIZE_FEATURE_ATTRIBUTE: false
##   REQUIRE_FULL_COVERAGE: false
##   DELIMITER: TAB
##   STATISTICS:
##   GROUP_BY:
## wait: TRUE
## email: NA
```
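You can also swap the algorithm on an existing `webprocess` via the `algorithm()` setter, using the named list returned by `query()` (shown later in this post); a sketch:

```r
knife <- webprocess()

# switch to the unweighted variant; subsetting with [ keeps the
# name/value pair that the algorithm() setter expects
algorithm(knife) <- query(knife, "algorithms")["Area Grid Statistics (unweighted)"]
knife
```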
## Now we can put it all together
An example of processing PRISM data for a single point:

```r
job <- geoknife(stencil = c(-89, 42), fabric = "prism", wait = TRUE)
```
Note that the structure of the `geoknife` function is `geoknife(stencil, fabric, knife = webprocess(...), ...)`, meaning that only the `stencil` and `fabric` arguments are required, and the `knife` (if omitted) will use defaults.
The `...` means that any additional named arguments passed into `geoknife()` will go into the `webprocess()`. This feature can be handy for specifying `wait = TRUE` and other settings, as shown below.
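For example, the following two calls are equivalent (a sketch, assuming `stencil` and `fabric` have already been defined):

```r
# explicit knife vs. named arguments passed through ... to webprocess()
job <- geoknife(stencil, fabric, knife = webprocess(wait = TRUE))
job <- geoknife(stencil, fabric, wait = TRUE)
```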
When a job is complete, you can `download` it to a local file, or load the `result` into R:

```r
download(job, destination = "prism_results.csv")
```

```
## [1] "prism_results.csv"
```
```r
data <- result(job, with.units = TRUE)
head(data)
```

```
##     DateTime bufferedPoint variable statistic    units
## 1 1895-01-01       41.1600      ppt      MEAN mm/month
## 2 1895-02-01       13.9825      ppt      MEAN mm/month
## 3 1895-03-01       24.1625      ppt      MEAN mm/month
## 4 1895-04-01       33.1650      ppt      MEAN mm/month
## 5 1895-05-01       78.7225      ppt      MEAN mm/month
## 6 1895-06-01       47.4175      ppt      MEAN mm/month
```
The text results of the process are formatted according to the naming of the features you use for `stencil`. In the example above, a simple lon/lat point was used, so it gets a default name ("bufferedPoint"). Using more features will yield more columns in the result:

```r
fabric <- webdata("prism", variables = "tmx", times = c("1895-01-01", "1895-05-01"))
job <- geoknife(stencil = "state::Wisconsin,Virginia", fabric, wait = TRUE)
result(job, with.units = TRUE)
```

```
##     DateTime  Virginia Wisconsin variable statistic units
## 1 1895-01-01  5.276245 -7.090785      tmx      MEAN  degC
## 2 1895-02-01  2.623085 -5.470236      tmx      MEAN  degC
## 3 1895-03-01 12.363201  3.124446      tmx      MEAN  degC
## 4 1895-04-01 18.599579 15.556385      tmx      MEAN  degC
## 5 1895-05-01 22.696014 21.111172      tmx      MEAN  degC
```
## Processing large web-available shapefiles with geoknife

```r
stencil <- webgeom()
query(stencil, "geoms")
```

```
## [1] "sample:Alaska"
## [2] "upload:CIDA_TEST_"
## [3] "sample:CONUS_Climate_Divisions"
## [4] "derivative:CONUS_States"
## [5] "sample:CONUS_states"
## [6] "sample:CSC_Boundaries"
## [7] "sample:Counties"
## [8] "sample:Ecoregions_Level_III"
## [9] "derivative:FWS_LCC"
## [10] "sample:Lanscape_Conservation_Cooperatives"
## [11] "derivative:Level_III_Ecoregions"
## [12] "draw:MJ"
## [13] "upload:MWCD"
## [14] "upload:MWCD_USGS_WatershedBoundary_with_WGS_1984"
## [15] "draw:Moj"
## [16] "draw:Mojave"
## [17] "sample:NCA_Regions"
## [18] "upload:PRMS_2016_HRUs_E_East"
## [19] "upload:PRMS_2016_HRUs_E_sub"
## [20] "upload:PRMS_2016_HRUs_F"
## [21] "upload:PRMS_2016_HRUs_G"
## [22] "upload:PRMS_2016_HRUs_H"
## [23] "upload:SenegalComm"
## [24] "upload:StudyExtentLTB"
## [25] "derivative:US_Counties"
## [26] "upload:Yahara_alb"
## [27] "derivative:climate_divisions"
## [28] "draw:fortcollins"
## [29] "upload:huc6bb"
## [30] "sample:nps_boundary"
## [31] "upload:nrhu_selection_Gallatin"
## [32] "upload:nrhu_selection_Lolo2"
## [33] "sample:simplified_huc8"
## [34] "upload:sixty_basins_6_7_2017"
## [35] "upload:sixty_basins_6_7_2017_2"
## [36] "draw:test"
## [37] "draw:test2"
## [38] "draw:test3"
## [39] "derivative:wbdhu8_alb_simp"
```
```r
geom(stencil) <- "upload:CIDA_TEST_"
query(stencil, "attributes")
```

```
## [1] "OBJECTID" "Letter"
```
```r
attribute(stencil) <- "Letter"
values(stencil) <- c("C", "I", "D", "A")
```
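With the geom, attribute, and values set, the uploaded stencil works like any other; a sketch that reuses the PRISM `fabric` from the example above:

```r
# summarize tmx over the four uploaded features (C, I, D, A)
job <- geoknife(stencil, fabric, wait = TRUE)
head(result(job))
```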
I like to toggle some default settings for knife. Those are available in the `gconfig()` function, which behaves like `graphics::par()`:

```r
gconfig()
```

```
## $wps.url
## [1] "https://cida-test.er.usgs.gov/gdp/process/WebProcessingService"
##
## $sleep.time
## [1] 5
##
## $wait
## [1] FALSE
##
## $email
## [1] NA
##
## $algorithm
## $algorithm$`Area Grid Statistics (weighted)`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureWeightedGridStatisticsAlgorithm"
##
##
## $verbose
## [1] FALSE
##
## $retries
## [1] 1
##
## $version
## [1] "1.0.0"
```
gconfig("wait")
## [1] FALSE
A number of the defaults in `geoknife` operations can be modified here, so if you always want to wait for jobs to finish, and want a different sleep time (the time `geoknife` waits in between checking to see if the job is finished), do it here:

```r
gconfig(wait = TRUE, sleep.time = 30)
```
but you can always override them with arguments to `geoknife()`:

```r
# will wait because gconfig('wait') is TRUE
job <- geoknife("state::Wisconsin,Virginia", fabric)

# will not wait for the job to complete
job <- geoknife("state::Wisconsin,Virginia", fabric, wait = FALSE)
```
OK, so these tools seem powerful, right? But how do you actually find things to process, or figure out what the values of your `webgeom` can be?
The `query()` function is the catch-all for finding things. `query` methods exist for `webgeom`, `webdata`, `webprocess`, etc. The format is `query(.object, field)`, such as:

```r
query(webgeom(), "geoms")
```

```
## [1] "sample:Alaska"
## [2] "upload:CIDA_TEST_"
## [3] "sample:CONUS_Climate_Divisions"
## [4] "derivative:CONUS_States"
## [5] "sample:CONUS_states"
## [6] "sample:CSC_Boundaries"
## [7] "sample:Counties"
## [8] "sample:Ecoregions_Level_III"
## [9] "derivative:FWS_LCC"
## [10] "sample:Lanscape_Conservation_Cooperatives"
## [11] "derivative:Level_III_Ecoregions"
## [12] "draw:MJ"
## [13] "upload:MWCD"
## [14] "upload:MWCD_USGS_WatershedBoundary_with_WGS_1984"
## [15] "draw:Moj"
## [16] "draw:Mojave"
## [17] "sample:NCA_Regions"
## [18] "upload:PRMS_2016_HRUs_E_East"
## [19] "upload:PRMS_2016_HRUs_E_sub"
## [20] "upload:PRMS_2016_HRUs_F"
## [21] "upload:PRMS_2016_HRUs_G"
## [22] "upload:PRMS_2016_HRUs_H"
## [23] "upload:SenegalComm"
## [24] "upload:StudyExtentLTB"
## [25] "derivative:US_Counties"
## [26] "upload:Yahara_alb"
## [27] "derivative:climate_divisions"
## [28] "draw:fortcollins"
## [29] "upload:huc6bb"
## [30] "sample:nps_boundary"
## [31] "upload:nrhu_selection_Gallatin"
## [32] "upload:nrhu_selection_Lolo2"
## [33] "sample:simplified_huc8"
## [34] "upload:sixty_basins_6_7_2017"
## [35] "upload:sixty_basins_6_7_2017_2"
## [36] "draw:test"
## [37] "draw:test2"
## [38] "draw:test3"
## [39] "derivative:wbdhu8_alb_simp"
```
likewise,

```r
query(webprocess(), "algorithms")
```

```
## $`Timeseries Service Subset`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureTimeSeriesAlgorithm"
##
## $`Categorical Coverage Fraction`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureCategoricalGridCoverageAlgorithm"
##
## $`OPeNDAP Subset`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureCoverageOPeNDAPIntersectionAlgorithm"
##
## $`Area Grid Statistics (unweighted)`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureGridStatisticsAlgorithm"
##
## $`Area Grid Statistics (weighted)`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureWeightedGridStatisticsAlgorithm"
##
## $`WCS Subset`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureCoverageIntersectionAlgorithm"
```
and

```r
query(webdata("prism"), "variables")
```

```
## [1] "ppt" "tmx" "tmn"
```
One of the more powerful uses of `query` is to find datasets that are already indexed by the Geo Data Portal:

```r
webdatasets <- query("webdata")
length(webdatasets)
```

```
## [1] 173
```
```r
webdatasets[61:65]
```

```
## An object of class "datagroup":
## [1] Dynamical Downscaled and Projected Climate for the Pacific Islands - Oahu
##   url: https://cida.usgs.gov/thredds/dodsC/oahu
## [2] Dynamical Downscaled and Projected Climate for the Pacific Islands - Samoa
##   url: https://cida.usgs.gov/thredds/dodsC/samoa
## [3] Eighth degree-CONUS Daily Downscaled Climate Projections Minimum and Maximum Temperature
##   url: https://cida.usgs.gov/thredds/dodsC/dcp/conus_t
## [4] Eighth degree-CONUS Daily Downscaled Climate Projections Precipitation
##   url: https://cida.usgs.gov/thredds/dodsC/dcp/conus_pr
## [5] Future California Basin Characterization Model Downscaled Climate and Hydrology
##   url: https://cida.usgs.gov/thredds/dodsC/CA-BCM-2014/future
```
Methods for the dataset groups (of class `datagroup`) include `title` and `abstract`:

```r
title(webdatasets[87])
```

```
## [1] "SnowModel Output Upper Deschutes River Basin, Cascades, UTM Zone 10N NAD83 (2014)"
```
```r
abstract(webdatasets[87])
```

```
## [1] "Daily precipitation, runoff (rain + snowmelt), snowfall, snow-water equivalent (SWE), and snowmelt, at a 100m spatial resolution for the period 1989 - 2011 for the Upper Deschutes River Basin (UDRB) study area and for 1989 - 2009 for the McKenzie River Basin (MRB) study area. Each variable is saved as a .tar file with 23 individual .nc (netcdf) files that contain gridded model output of size 781x385x365 for the UDRB and 759x1121x365 for the MRB. For example, the UDRB modeling domain is 781 rows by 385 columns at 100-m grid spacing, and data are stored in 365-day water-year time periods, one for each of the 23 years in the study period. The variable .tar files are further organized into climate-scenario folders for the 'reference_climate' i.e. historical climate period, 't2' i.e. +2 degrees C added to each historical temperature record used as model forcing, 't2p10' i.e. +2 degrees C added to each historical temperature and +10% precipitation added to each historical precipitation record used as model forcing, 't4' i.e +4 degrees C added to each historical temperature record used as model forcing, and 't4p10' i.e. +4 degrees C added to each historical temperature and +10% precipitation added to each historical precipitation record used as model forcing. For more information see: http://hdl.handle.net/1957/56058"
```
Choose a dataset you found for processing:

```r
fabric <- webdata(webdatasets[87])
variables(fabric) <- query(fabric, "variables")[1]
job <- geoknife(c(-89, 43.1), fabric)
```
There are hundreds (or potentially thousands) of additional OPeNDAP datasets that will work with `geoknife`, but they need to be found through web searches or catalogs (e.g., www.esrl.noaa.gov/psd/thredds/dodsC/Datasets and apdrc.soest.hawaii.edu/data/data.php).
One such example is Sea Surface Temperature from the Advanced Very High Resolution Radiometer (AVHRR) temperature sensing system. Specifying datasets such as this requires finding out the OPeNDAP endpoint (URL) for the dataset, and specifying it as the `url` to `webdata` (we found this example in the extensive apdrc.soest.hawaii.edu/data/data.php catalog):

```r
fabric <- webdata(url = "dods://apdrc.soest.hawaii.edu/dods/public_data/satellite_product/AVHRR/avhrr_mon")
```
`query` for `variables` doesn't work for this dataset, because it actually doesn't have units, and therefore "valid" variables are not returned (instead you get an empty list). From the OPeNDAP endpoint, it is clear that this dataset has one variable of interest, which is called "sst":

```r
variables(fabric) <- "sst"
query(fabric, "times")
```

```
## [1] NA NA
```
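Since the time range can't be discovered for this dataset, you could set it manually if you only wanted a subset (a sketch using the `times()` setter from earlier; the example below uses the full record instead):

```r
# optional: restrict the request window instead of "all time"
decade <- fabric
times(decade) <- c("1990-01-01", "2000-01-01")
```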
Plotting the July surface temperature of a spot on the Caspian Sea is done by:

```r
sst = result(geoknife(data.frame(caspian.sea = c(51, 40)), fabric, wait = TRUE))
head(sst)
july.idx <- months(sst$DateTime) == "July"
plot(sst$DateTime[july.idx], sst$caspian.sea[july.idx], type = "l", lwd = 2,
     col = "dodgerblue", ylab = "Sea Surface Temperature (degC)", xlab = NA)
```

```
##     DateTime caspian.sea variable statistic
## 1 1990-01-01      11.250      sst      MEAN
## 2 1990-02-01      10.575      sst      MEAN
```
For our lake modeling activities, we use `geoknife` to create driver data for each lake we are modeling, and also to extract land-cover data from the NLCD within the buffer of each lake (for wind sheltering). There are also some examples in our geoknife paper and our geoknife blog post.
Looking at PRISM data for ecoregion: