21 June, 2017
# geoknife

## What is `geoknife`?

The `geoknife` package was created to support web-based geoprocessing of large gridded datasets according to their overlap with landscape (or aquatic/ocean) features that are often irregularly shaped. `geoknife` creates data access and subsequent geoprocessing requests for the USGS's Geo Data Portal to carry out on a web server.
## `geoknife`: Where does the data come from?
## What does `geoknife` do to the data?

`geoknife` subsets and/or summarizes data on a webserver, and lets you know when it is complete.

## `geoknife` concepts: `geoknife(stencil, fabric, knife)`
- `stencil`: the "feature" (lake, watershed, point, state, etc.)
- `fabric`: the web-available gridded data to be processed
- `knife`: processing configuration
`geoknife` uses the `stencil`, `fabric`, and `knife` to subset and/or summarize data on a webserver and deliver the result back to you.

A stencil can be defined in a number of different ways:
- a single longitude/latitude point (`simplegeom` class): `stencil = c(-89, 46)`
- a `data.frame()` of points (`simplegeom` class): `stencil = data.frame(point1 = c(-89, 46), point2 = c(-88.6, 45.2))`
- web-available geometries (`webgeom` class) that have quickstart shortcuts: `stencil = webgeom("state::New Hampshire,Wisconsin")` or `stencil = webgeom("ecoregion::Colorado Plateaus,Driftless Area")`
- web-available geometries (`webgeom` class) from a shapefile that you have uploaded: `stencil = webgeom(geom = "upload:CIDA_TEST_", attribute = "Letter", value = "C")`
`print` in R will tell you what you are working with for the different object classes:

```r
stencil = webgeom("HUC8::09020306,14060009")
stencil
```

```
## An object of class "webgeom":
## url: https://cida.usgs.gov/gdp/geoserver/wfs
## geom: sample:simplified_huc8
## attribute: HUC_8
## values: 09020306, 14060009
## wfs version: 1.1.0
```
and you can modify things:
```r
values(stencil) <- c("07140103", "07140108", "12100202", "12060104", "12060101")
stencil
```
```
## An object of class "webgeom":
## url: https://cida.usgs.gov/gdp/geoserver/wfs
## geom: sample:simplified_huc8
## attribute: HUC_8
## values: 07140103, 07140108, 12100202, 12060104, 12060101
## wfs version: 1.1.0
```
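If you don't know which values are valid for the current geom and attribute, the `query()` function (covered in more detail below) can list them; a quick sketch (this call hits the web service):

```r
# list the HUC_8 values available for the current geom/attribute
head(query(stencil, "values"))
```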
There is only one `geoknife` class that can be used for fabric, the `webdata` class (maybe more will exist in the future). `webdata` has some important fields:
- `times`: the time range that you want access to (the default is `c(NA, NA)`, which is "all time")
- `url`: the url of the dataset's OPeNDAP endpoint (or Web Coverage Service endpoint)
- `variables`: the variables you want access to

The print method works for these too:

```r
webdata()
```

```
## An object of class "webdata":
## times: NA, NA
## url: NA
## variables: NA
```
You can define values for those fields when creating `webdata`:

```r
fabric = webdata(url = "https://cida.usgs.gov/thredds/dodsC/prism_v2",
                 times = c("1990-01-01", "2010-01-01"),
                 variables = c("ppt", "tmx"))
fabric
```

```
## An object of class "webdata":
## times: 1990-01-01T00:00:00Z, 2010-01-01T00:00:00Z
## url: https://cida.usgs.gov/thredds/dodsC/prism_v2
## variables: ppt, tmx
```
and access and set them:
```r
times(fabric)[1] <- "1899-01-01"
variables(fabric) <- "tmn"
fabric
```

```
## An object of class "webdata":
## times: 1899-01-01T06:00:00Z, 2010-01-01T00:00:00Z
## url: https://cida.usgs.gov/thredds/dodsC/prism_v2
## variables: tmn
```
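Note that setting `times` does not check anything against the dataset itself; to see the time range the dataset actually covers, you can query it (a sketch; requires web access):

```r
# the time range available in the dataset behind `fabric`
query(fabric, "times")
```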
There is only one `geoknife` class that can be used for knife, the `webprocess` class (maybe more will exist in the future). `webprocess` has some important fields:
- `url`: the url of the web processing service to be used
- `algorithm`: a list specifying the algorithm to be used for processing. Defaults to Area Grid Statistics (weighted).
- `version`: the web processing service version to use (there is only one right now, `1.0.0`)
- `processInputs`: a list of processing details for the specified `algorithm`. These details vary depending on the `algorithm`, and this field is automatically reset when the `algorithm` field is set.
- `wait`: do you want R to wait while the process runs? Defaults to `FALSE`
- `email`: an email address to use if you want to be notified when the process is complete

The print method works for these too:

```r
webprocess()
```

```
## An object of class "webprocess":
## url: https://cida-test.er.usgs.gov/gdp/process/WebProcessingService
## algorithm: Area Grid Statistics (weighted)
## web processing service version: 1.0.0
## process inputs:
##   SUMMARIZE_TIMESTEP: false
##   SUMMARIZE_FEATURE_ATTRIBUTE: false
##   REQUIRE_FULL_COVERAGE: true
##   DELIMITER: COMMA
##   STATISTICS:
##   GROUP_BY:
## wait: FALSE
## email: NA
```
and you can modify settings when you create one:

```r
webprocess(DELIMITER = "TAB", REQUIRE_FULL_COVERAGE = FALSE, wait = TRUE)
```

```
## An object of class "webprocess":
## url: https://cida-test.er.usgs.gov/gdp/process/WebProcessingService
## algorithm: Area Grid Statistics (weighted)
## web processing service version: 1.0.0
## process inputs:
##   SUMMARIZE_TIMESTEP: false
##   SUMMARIZE_FEATURE_ATTRIBUTE: false
##   REQUIRE_FULL_COVERAGE: false
##   DELIMITER: TAB
##   STATISTICS:
##   GROUP_BY:
## wait: TRUE
## email: NA
```
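You can also swap the algorithm on an existing `webprocess` via the `algorithm()` setter, using the named list returned by `query()` (shown later in this post); a sketch:

```r
knife <- webprocess()

# switch to the unweighted variant; subsetting with [ keeps the
# name/value pair that the algorithm() setter expects
algorithm(knife) <- query(knife, "algorithms")["Area Grid Statistics (unweighted)"]
knife
```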
## Now we can put it all together
An example of processing PRISM data for a single point:

```r
job <- geoknife(stencil = c(-89, 42), fabric = "prism", wait = TRUE)
```
Note that the structure of the `geoknife` function is `geoknife(stencil, fabric, knife = webprocess(...), ...)`, meaning that only the `stencil` and `fabric` arguments are required, and the `knife` (if omitted) will use defaults.
The `...` means that any additional named arguments passed into `geoknife()` will go into the `webprocess()`. This feature can be handy for specifying `wait = TRUE` and other settings, as shown below.
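For example, the following two calls are equivalent (a sketch, assuming `stencil` and `fabric` have already been defined):

```r
# explicit knife vs. named arguments passed through ... to webprocess()
job <- geoknife(stencil, fabric, knife = webprocess(wait = TRUE))
job <- geoknife(stencil, fabric, wait = TRUE)
```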
When a job is complete, you can `download` it to a local file, or load the `result` into R:

```r
download(job, destination = "prism_results.csv")
```

```
## [1] "prism_results.csv"
```
```r
data <- result(job, with.units = TRUE)
head(data)
```

```
##     DateTime bufferedPoint variable statistic    units
## 1 1895-01-01       41.1600      ppt      MEAN mm/month
## 2 1895-02-01       13.9825      ppt      MEAN mm/month
## 3 1895-03-01       24.1625      ppt      MEAN mm/month
## 4 1895-04-01       33.1650      ppt      MEAN mm/month
## 5 1895-05-01       78.7225      ppt      MEAN mm/month
## 6 1895-06-01       47.4175      ppt      MEAN mm/month
```
The text results of the process are formatted according to the naming of the features you use for `stencil`. In the example above, a simple lon/lat point was used, so it gets a default name ("bufferedPoint"). Using more features will yield more columns in the result:

```r
fabric <- webdata("prism", variables = "tmx", times = c("1895-01-01", "1895-05-01"))
job <- geoknife(stencil = "state::Wisconsin,Virginia", fabric, wait = TRUE)
result(job, with.units = TRUE)
```

```
##     DateTime  Virginia Wisconsin variable statistic units
## 1 1895-01-01  5.276245 -7.090785      tmx      MEAN  degC
## 2 1895-02-01  2.623085 -5.470236      tmx      MEAN  degC
## 3 1895-03-01 12.363201  3.124446      tmx      MEAN  degC
## 4 1895-04-01 18.599579 15.556385      tmx      MEAN  degC
## 5 1895-05-01 22.696014 21.111172      tmx      MEAN  degC
```
## Processing large web-available shapefiles with geoknife

```r
stencil <- webgeom()
query(stencil, "geoms")
```

```
## [1] "sample:Alaska"
## [2] "upload:CIDA_TEST_"
## [3] "sample:CONUS_Climate_Divisions"
## [4] "derivative:CONUS_States"
## [5] "sample:CONUS_states"
## [6] "sample:CSC_Boundaries"
## [7] "sample:Counties"
## [8] "sample:Ecoregions_Level_III"
## [9] "derivative:FWS_LCC"
## [10] "sample:Lanscape_Conservation_Cooperatives"
## [11] "derivative:Level_III_Ecoregions"
## [12] "draw:MJ"
## [13] "upload:MWCD"
## [14] "upload:MWCD_USGS_WatershedBoundary_with_WGS_1984"
## [15] "draw:Moj"
## [16] "draw:Mojave"
## [17] "sample:NCA_Regions"
## [18] "upload:PRMS_2016_HRUs_E_East"
## [19] "upload:PRMS_2016_HRUs_E_sub"
## [20] "upload:PRMS_2016_HRUs_F"
## [21] "upload:PRMS_2016_HRUs_G"
## [22] "upload:PRMS_2016_HRUs_H"
## [23] "upload:SenegalComm"
## [24] "upload:StudyExtentLTB"
## [25] "derivative:US_Counties"
## [26] "upload:Yahara_alb"
## [27] "derivative:climate_divisions"
## [28] "draw:fortcollins"
## [29] "upload:huc6bb"
## [30] "sample:nps_boundary"
## [31] "upload:nrhu_selection_Gallatin"
## [32] "upload:nrhu_selection_Lolo2"
## [33] "sample:simplified_huc8"
## [34] "upload:sixty_basins_6_7_2017"
## [35] "upload:sixty_basins_6_7_2017_2"
## [36] "draw:test"
## [37] "draw:test2"
## [38] "draw:test3"
## [39] "derivative:wbdhu8_alb_simp"
```
```r
geom(stencil) <- "upload:CIDA_TEST_"
query(stencil, "attributes")
```

```
## [1] "OBJECTID" "Letter"
```
```r
attribute(stencil) <- "Letter"
values(stencil) <- c("C", "I", "D", "A")
```
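With the geom, attribute, and values set, the uploaded stencil works like any other; a sketch that reuses the PRISM `fabric` from the example above:

```r
# summarize tmx over the four uploaded features (C, I, D, A)
job <- geoknife(stencil, fabric, wait = TRUE)
head(result(job))
```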
I like to toggle some default settings for knife. Those are available in the `gconfig()` function, which behaves like `graphics::par()`:

```r
gconfig()
```

```
## $wps.url
## [1] "https://cida-test.er.usgs.gov/gdp/process/WebProcessingService"
##
## $sleep.time
## [1] 5
##
## $wait
## [1] FALSE
##
## $email
## [1] NA
##
## $algorithm
## $algorithm$`Area Grid Statistics (weighted)`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureWeightedGridStatisticsAlgorithm"
##
##
## $verbose
## [1] FALSE
##
## $retries
## [1] 1
##
## $version
## [1] "1.0.0"
```
gconfig("wait")
## [1] FALSE
A number of the defaults in `geoknife` operations can be modified here, so if you always want to wait for jobs to finish, and want a different sleep time (the time `geoknife` waits in between checking to see if the job is finished), do it here:

```r
gconfig(wait = TRUE, sleep.time = 30)
```
but you can always override them with arguments to `geoknife()`:

```r
# will wait because gconfig('wait') is TRUE
job <- geoknife("state::Wisconsin,Virginia", fabric)

# will not wait for the job to complete
job <- geoknife("state::Wisconsin,Virginia", fabric, wait = FALSE)
```
OK, so these tools seem powerful, right? But how do you actually find things to process, or figure out what the values of your `webgeom` can be?
The `query()` function is the catch-all for finding things. `query` methods exist for `webgeom`, `webdata`, `webprocess`, etc. The format is `query(.object, field)`, such as:

```r
query(webgeom(), "geoms")
```

```
## [1] "sample:Alaska"
## [2] "upload:CIDA_TEST_"
## [3] "sample:CONUS_Climate_Divisions"
## [4] "derivative:CONUS_States"
## [5] "sample:CONUS_states"
## [6] "sample:CSC_Boundaries"
## [7] "sample:Counties"
## [8] "sample:Ecoregions_Level_III"
## [9] "derivative:FWS_LCC"
## [10] "sample:Lanscape_Conservation_Cooperatives"
## [11] "derivative:Level_III_Ecoregions"
## [12] "draw:MJ"
## [13] "upload:MWCD"
## [14] "upload:MWCD_USGS_WatershedBoundary_with_WGS_1984"
## [15] "draw:Moj"
## [16] "draw:Mojave"
## [17] "sample:NCA_Regions"
## [18] "upload:PRMS_2016_HRUs_E_East"
## [19] "upload:PRMS_2016_HRUs_E_sub"
## [20] "upload:PRMS_2016_HRUs_F"
## [21] "upload:PRMS_2016_HRUs_G"
## [22] "upload:PRMS_2016_HRUs_H"
## [23] "upload:SenegalComm"
## [24] "upload:StudyExtentLTB"
## [25] "derivative:US_Counties"
## [26] "upload:Yahara_alb"
## [27] "derivative:climate_divisions"
## [28] "draw:fortcollins"
## [29] "upload:huc6bb"
## [30] "sample:nps_boundary"
## [31] "upload:nrhu_selection_Gallatin"
## [32] "upload:nrhu_selection_Lolo2"
## [33] "sample:simplified_huc8"
## [34] "upload:sixty_basins_6_7_2017"
## [35] "upload:sixty_basins_6_7_2017_2"
## [36] "draw:test"
## [37] "draw:test2"
## [38] "draw:test3"
## [39] "derivative:wbdhu8_alb_simp"
```
likewise,

```r
query(webprocess(), "algorithms")
```

```
## $`Timeseries Service Subset`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureTimeSeriesAlgorithm"
##
## $`Categorical Coverage Fraction`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureCategoricalGridCoverageAlgorithm"
##
## $`OPeNDAP Subset`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureCoverageOPeNDAPIntersectionAlgorithm"
##
## $`Area Grid Statistics (unweighted)`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureGridStatisticsAlgorithm"
##
## $`Area Grid Statistics (weighted)`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureWeightedGridStatisticsAlgorithm"
##
## $`WCS Subset`
## [1] "gov.usgs.cida.gdp.wps.algorithm.FeatureCoverageIntersectionAlgorithm"
```
and

```r
query(webdata("prism"), "variables")
```

```
## [1] "ppt" "tmx" "tmn"
```
One of the more powerful uses of `query` is to find datasets that are already indexed by the Geo Data Portal:

```r
webdatasets <- query("webdata")
length(webdatasets)
```

```
## [1] 173
```
```r
webdatasets[61:65]
```

```
## An object of class "datagroup":
## [1] Dynamical Downscaled and Projected Climate for the Pacific Islands - Oahu
##   url: https://cida.usgs.gov/thredds/dodsC/oahu
## [2] Dynamical Downscaled and Projected Climate for the Pacific Islands - Samoa
##   url: https://cida.usgs.gov/thredds/dodsC/samoa
## [3] Eighth degree-CONUS Daily Downscaled Climate Projections Minimum and Maximum Temperature
##   url: https://cida.usgs.gov/thredds/dodsC/dcp/conus_t
## [4] Eighth degree-CONUS Daily Downscaled Climate Projections Precipitation
##   url: https://cida.usgs.gov/thredds/dodsC/dcp/conus_pr
## [5] Future California Basin Characterization Model Downscaled Climate and Hydrology
##   url: https://cida.usgs.gov/thredds/dodsC/CA-BCM-2014/future
```
Methods for the dataset groups (of class `datagroup`) include `title` and `abstract`:

```r
title(webdatasets[87])
```

```
## [1] "SnowModel Output Upper Deschutes River Basin, Cascades, UTM Zone 10N NAD83 (2014)"
```
```r
abstract(webdatasets[87])
```

```
## [1] "Daily precipitation, runoff (rain + snowmelt), snowfall, snow-water equivalent (SWE), and snowmelt, at a 100m spatial resolution for the period 1989 - 2011 for the Upper Deschutes River Basin (UDRB) study area and for 1989 - 2009 for the McKenzie River Basin (MRB) study area. Each variable is saved as a .tar file with 23 individual .nc (netcdf) files that contain gridded model output of size 781x385x365 for the UDRB and 759x1121x365 for the MRB. For example, the UDRB modeling domain is 781 rows by 385 columns at 100-m grid spacing, and data are stored in 365-day water-year time periods, one for each of the 23 years in the study period. The variable .tar files are further organized into climate-scenario folders for the 'reference_climate' i.e. historical climate period, 't2' i.e. +2 degrees C added to each historical temperature record used as model forcing, 't2p10' i.e. +2 degrees C added to each historical temperature and +10% precipitation added to each historical precipitation record used as model forcing, 't4' i.e +4 degrees C added to each historical temperature record used as model forcing, and 't4p10' i.e. +4 degrees C added to each historical temperature and +10% precipitation added to each historical precipitation record used as model forcing. For more information see: http://hdl.handle.net/1957/56058"
```
Choose a dataset you found for processing:

```r
fabric <- webdata(webdatasets[87])
variables(fabric) <- query(fabric, "variables")[1]
job <- geoknife(c(-89, 43.1), fabric)
```
There are hundreds (or potentially thousands) of additional OPeNDAP datasets that will work with `geoknife`, but they need to be found through web searches or catalogs (e.g., www.esrl.noaa.gov/psd/thredds/dodsC/Datasets and apdrc.soest.hawaii.edu/data/data.php).
One such example is Sea Surface Temperature from the Advanced Very High Resolution Radiometer (AVHRR) temperature sensing system. Specifying datasets such as this requires finding out the OPeNDAP endpoint (URL) for the dataset, and specifying it as the `url` to `webdata` (we found this example in the extensive apdrc.soest.hawaii.edu/data/data.php catalog):

```r
fabric <- webdata(url = "dods://apdrc.soest.hawaii.edu/dods/public_data/satellite_product/AVHRR/avhrr_mon")
```
`query` for `variables` doesn't work for this dataset, because it actually doesn't have units, and therefore "valid" variables are not returned (instead you get an empty list). From the OPeNDAP endpoint, it is clear that this dataset has one variable of interest, which is called "sst":

```r
variables(fabric) <- "sst"
query(fabric, "times")
```

```
## [1] NA NA
```
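Since the time range can't be discovered for this dataset, you could set it manually if you only wanted a subset (a sketch using the `times()` setter from earlier; the example below uses the full record instead):

```r
# optional: restrict the request window instead of "all time"
decade <- fabric
times(decade) <- c("1990-01-01", "2000-01-01")
```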
Plotting the July surface temperature of a spot on the Caspian Sea is done by:

```r
sst = result(geoknife(data.frame(caspian.sea = c(51, 40)), fabric, wait = TRUE))
head(sst)
july.idx <- months(sst$DateTime) == "July"
plot(sst$DateTime[july.idx], sst$caspian.sea[july.idx], type = "l", lwd = 2,
     col = "dodgerblue", ylab = "Sea Surface Temperature (degC)", xlab = NA)
```

```
##     DateTime caspian.sea variable statistic
## 1 1990-01-01      11.250      sst      MEAN
## 2 1990-02-01      10.575      sst      MEAN
```
For our lake modeling activities, we use `geoknife` to create driver data for each lake we are modeling, and also to extract land-cover data from the NLCD within the buffer of each lake (for wind sheltering). There are also some examples in our geoknife paper and our geoknife blog post.
Looking at PRISM data for ecoregion: