10 March, 2016

Introduction

What is dataRetrieval?

  • R-package to get USGS/EPA water data into R

Where does the data come from?

  • US Geological Survey water data
    • National Water Information System (NWIS)
  • Water Quality Portal
    • USGS
    • EPA (EPA Storage and Retrieval Data Warehouse = STORET)
    • USDA
    • more being added….

What does dataRetrieval do to the data?

How to discover data?

  • Examples will be provided

Overview

Installation

dataRetrieval is available on the CRAN repository. The CRAN version is the most stable and user-tested:

install.packages("dataRetrieval")

Bug fixes and feature upgrades are vetted through a version of dataRetrieval that is available on a USGS-maintained R repository. To install from that repository:

install.packages("dataRetrieval", 
                 repos=c("http://owi.usgs.gov/R",
                         getOption("repos")))

More information can be found here.

Finally, the absolute cutting-edge version of dataRetrieval can be installed using the devtools package which pulls from GitHub:

library(devtools)
install_github("USGS-R/dataRetrieval")

dataRetrieval Help

Once the dataRetrieval package has been installed, it needs to be loaded in order to use any of the functions:

library(dataRetrieval)

There is a vignette that covers the full scope of the dataRetrieval package. It can be accessed with the following command:

vignette("dataRetrieval",package = "dataRetrieval")

Additionally, each function has a help file. These can be accessed by typing a question mark, followed by the function name in the R console:

?readNWISuv

Each function's help file has working examples that demonstrate its usage. Some examples include the comment "## Not run". These examples CAN be run; they simply are not run by the CRAN maintainers because they make external web-service calls.

Finally, if there are still questions that the vignette and help files don't answer, please post an issue on the dataRetrieval GitHub page:

https://github.com/USGS-R/dataRetrieval/issues

US Geological Survey Water Data Overview

National Water Information System (NWIS)

  • Unit Data
    • "Real-time" data
    • Data available back to 2007 (a big improvement from the previous 120 days!)
  • Daily Data
    • Data aggregated from the unit data to a daily statistic
    • This data can go back many decades
  • Discrete Data
    • Water quality data
    • Groundwater level
    • Rating curves
    • Surface-water measurements
  • Meta Data
    • Site information
    • Parameter information
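
Each of these data types maps to a dataRetrieval function. A quick sketch of those functions (the function names come from the package; the site and parameter values simply reuse sites and codes from examples later in these slides, and not every site offers every data type):

library(dataRetrieval)

readNWISuv("05427718", "00060")    # unit ("real-time") values
readNWISdv("05427718", "00060")    # daily values
readNWISqw("05427718", "00665")    # discrete water quality samples
readNWISgwl("05427718")            # groundwater levels
readNWISrating("05427718")         # rating curves
readNWISmeas("05427718")           # surface-water measurements
readNWISsite("05427718")           # site information
readNWISpCode("00060")             # parameter information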

USGS Basic Web Retrievals

The USGS uses various codes for basic retrievals.

Here are some examples of common codes:
Parameter Codes   Short Name
00060             Discharge
00065             Gage Height
00010             Temperature
00400             pH

Statistic Codes   Short Name
00001             Maximum
00002             Minimum
00003             Mean
00008             Median
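
Parameter code metadata can also be looked up from within R. A small sketch using readNWISpCode (assuming the column names returned by recent versions of the package):

pcode_info <- readNWISpCode(c("00060", "00065", "00010", "00400"))
pcode_info[, c("parameter_cd", "parameter_nm")]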

USGS Basic Web Retrievals: readNWISuv

readNWISuv needs a site number (or site numbers), a parameter code (or codes), and a start and end date. Let's start by asking for discharge (parameter code = 00060) data for the Yahara River at Windsor, WI (an inlet to Lake Mendota).

siteNo <- "05427718"
pCode <- "00060"
start.date <- "2014-10-01"
end.date <- "2015-09-30"

yahara <- readNWISuv(siteNumbers = siteNo,
                     parameterCd = pCode,
                     startDate = start.date,
                     endDate = end.date)

USGS Basic Web Retrievals: renameNWISColumns

From the Yahara example, let's look at the data. The column names are:

names(yahara)
## [1] "agency_cd"        "site_no"          "dateTime"        
## [4] "X_00060_00011"    "X_00060_00011_cd" "tz_cd"

The names of the columns are based on the parameter and statistic codes. In many cases, you can clean up the names with the convenience function renameNWISColumns:

yahara <- renameNWISColumns(yahara)
names(yahara)
## [1] "agency_cd"    "site_no"      "dateTime"     "Flow_Inst"   
## [5] "Flow_Inst_cd" "tz_cd"

Explore Data

head(yahara)
##   agency_cd  site_no            dateTime Flow_Inst Flow_Inst_cd tz_cd
## 1      USGS 05427718 2014-10-01 05:00:00        16            A   UTC
## 2      USGS 05427718 2014-10-01 05:15:00        16            A   UTC
## 3      USGS 05427718 2014-10-01 05:30:00        16            A   UTC
## 4      USGS 05427718 2014-10-01 05:45:00        16            A   UTC
## 5      USGS 05427718 2014-10-01 06:00:00        16            A   UTC
## 6      USGS 05427718 2014-10-01 06:15:00        16            A   UTC

The data is returned as a data frame with 6 columns:

Column Name    Type
agency_cd      chr
site_no        chr
dateTime       POSIXct
Flow_Inst      num
Flow_Inst_cd   chr
tz_cd          chr
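
A quick way to confirm those column types on your own return:

str(yahara)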

Explore Data (cont.)

The returned data also has several attributes attached to the data frame. To see what the attributes are:

names(attributes(yahara))
## [1] "names"         "row.names"     "class"         "url"          
## [5] "siteInfo"      "variableInfo"  "disclaimer"    "statisticInfo"
## [9] "queryTime"

Each dataRetrieval return should have the attributes url, siteInfo, and variableInfo. Additional attributes are available depending on the data.

To access the attributes:

url <- attr(yahara, "url")

Raw Data

Explore Data (cont.)

library(ggplot2)
ts <- ggplot(yahara,
             aes(dateTime, Flow_Inst)) +
      geom_line()
ts

Use attributes for metadata:

parameterInfo <- attr(yahara, "variableInfo")
siteInfo <- attr(yahara, "siteInfo")
  
ts <- ts +
      xlab("") +
      ylab(parameterInfo$parameter_desc) +
      ggtitle(siteInfo$station_nm)
ts

USGS Basic Web Retrievals: readNWISdv

The unit-value discharge data is 15-minute data available back to 2007.

Daily mean data is available for that site back to 1976.

daily <- readNWISdv(siteNo, pCode)
daily <- renameNWISColumns(daily)
dd <- ggplot(daily, aes(Date,Flow)) +
  xlab("") +
  ylab(attr(daily,"variableInfo")$parameter_nm) +
  geom_line()
dd

Multiple sites and parameters

The NWIS functions can be used to get multiple sites and multiple parameters in one call.

sites <- c("05427850","05427718") # 2 stations on the Yahara River
pCodes <- c("00060","00010") # discharge and temperature

multi <- readNWISuv(sites,pCodes, "2016-03-01","2016-03-08")
multi <- renameNWISColumns(multi)

names(multi)
## [1] "agency_cd"     "site_no"       "dateTime"      "Wtemp_Inst"   
## [5] "Flow_Inst"     "Wtemp_Inst_cd" "Flow_Inst_cd"  "tz_cd"
variableInfo <- attr(multi, "variableInfo")
siteInfo <- attr(multi, "siteInfo")

Multiple sites and parameters (cont.)

Base R plots:

par(mfcol=c(2,1),oma=c(2,1,1,1), mar=c(1,4,0.1,0.1))

site1 <- multi[multi$site_no == sites[1],]
site2 <- multi[multi$site_no == sites[2],]
plot(site1$dateTime, site1$Flow_Inst, type="l",col="red",
     xlab = "",xaxt="n",
     ylab=variableInfo$param_units[variableInfo$parameterCd == "00060"])

lines(site2$dateTime, site2$Flow_Inst)
plot(site1$dateTime, site1$Wtemp_Inst, type="l",col="red",
     xlab = "", 
     ylab=variableInfo$param_units[variableInfo$parameterCd == "00010"])
lines(site2$dateTime, site2$Wtemp_Inst)
legend("topleft", legend = siteInfo$site_no, 
       cex=0.75, col = c("red","black"),lwd = 1)
box()

Multiple sites and parameters (cont.)

ggplot2: Reshape the data from "wide" to "long" format first using the tidyr package.

library(tidyr)
library(dplyr)
multi.data <- select(multi, site_no, dateTime, Wtemp_Inst, Flow_Inst)
head(multi.data)
##    site_no            dateTime Wtemp_Inst Flow_Inst
## 1 05427718 2016-03-01 06:00:00        4.4        39
## 2 05427718 2016-03-01 06:05:00        4.4        39
## 3 05427718 2016-03-01 06:10:00        4.4        39
## 4 05427718 2016-03-01 06:15:00        4.4        39
## 5 05427718 2016-03-01 06:20:00        4.4        39
## 6 05427718 2016-03-01 06:25:00        4.4        39
longMulti <- gather(multi.data, variable, value, -site_no, -dateTime)
head(longMulti)
##    site_no            dateTime   variable value
## 1 05427718 2016-03-01 06:00:00 Wtemp_Inst   4.4
## 2 05427718 2016-03-01 06:05:00 Wtemp_Inst   4.4
## 3 05427718 2016-03-01 06:10:00 Wtemp_Inst   4.4
## 4 05427718 2016-03-01 06:15:00 Wtemp_Inst   4.4
## 5 05427718 2016-03-01 06:20:00 Wtemp_Inst   4.4
## 6 05427718 2016-03-01 06:25:00 Wtemp_Inst   4.4

Multiple sites and parameters (cont.)

ggplot2: Reshape the data from "wide" to "long" format first using the tidyr package.

gp <- ggplot(longMulti, 
             aes(dateTime, value, color=site_no)) +
  geom_line() +
  facet_grid(variable ~ .,scales= "free")
gp

Multiple sites and parameters (cont.)

To demonstrate the power of tidyr + dplyr + ggplot2 + dataRetrieval:

sites <- c("05427850","05427718","05428000","05427880") 
pCodes <- c("00060","00010","00055","00065") 

wideMulti <- readNWISuv(sites,pCodes, "2016-03-01","2016-03-08") %>%
         select(-ends_with("_cd"))

siteInfo <- attr(wideMulti, "siteInfo")
paramInfo <- attr(wideMulti, "variableInfo")

longMulti <- gather(wideMulti, variable, value, -site_no, -dateTime) %>%
         mutate(variable = as.factor(variable)) %>%
         mutate(site_no = as.factor(site_no))

levels(longMulti$variable) <- paramInfo$param_units
levels(longMulti$site_no) <- siteInfo$station_nm

gp <- ggplot(longMulti, 
             aes(dateTime, value, color=site_no)) +
  geom_line(size=1.5) + xlab("") + ylab("") +
  facet_grid(variable ~ .,scales= "free") + 
  theme(legend.title=element_blank(),
        legend.position='bottom',legend.direction = "vertical")
gp

USGS Advanced Retrievals: Discover Data

This is all great when you know your site numbers.

What do you do when you don't?

If you are looking in a specific area, you can browse the Environmental Data Discovery and Transformation tool: EnDDaT

If you don't….

Become familiar with the possibilities of the web services HERE
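
Two dataRetrieval helper functions are also handy for discovery before pulling any data. A hedged sketch (the state, parameter, and site values are illustrative):

# Which Wisconsin sites have phosphorus (00665) data?
phosSitesWI <- whatNWISsites(stateCd = "WI", parameterCd = "00665")

# Which services and parameters are available at one Yahara site?
yaharaAvail <- whatNWISdata(siteNumber = "05427718")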

Overview

USGS Advanced Retrievals: readNWISdata

Let's look for long-term USGS phosphorous data in Wisconsin:

?readNWISdata

pCode <- c("00662","00665")
phWI <- readNWISdata(stateCd="WI", service="site",
                     seriesCatalogOutput=TRUE)

phWI.1 <- filter(phWI, parm_cd %in% pCode) %>%
            filter(count_nu > 300) %>%
            mutate(period = as.Date(end_date) - as.Date(begin_date)) %>%
            filter(period > 15*365)

readNWISdata

Let's look on a map

library(leaflet)
leaflet(data=phWI.1) %>% 
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(~dec_long_va,~dec_lat_va,
                   color = "red", radius=3, stroke=FALSE,
                   fillOpacity = 0.8, opacity = 0.8,
                   popup=~station_nm)

More readNWISdata examples

dataTemp <- readNWISdata(stateCd="OH", parameterCd="00010", service="dv")

instFlow <- readNWISdata(sites="05114000", service="iv", 
                   parameterCd="00060", 
                   startDate="2014-05-01T00:00Z",endDate="2014-05-01T12:00Z")
                                                   
instFlowCDT <- readNWISdata(sites="05114000", service="iv", 
                   parameterCd="00060", 
                   startDate="2014-05-01T00:00",endDate="2014-05-01T12:00",
                   tz="America/Chicago")

bBoxEx <- readNWISdata(bBox=c(-83,36.5,-81,38.5), parameterCd="00010")

waterYear <- readNWISdata(bBox=c(-83,36.5,-81,38.5), parameterCd="00010", 
                  service="dv", startDate="2013-10-01", endDate="2014-09-30")

wiGWL <- readNWISdata(stateCd="WI",service="gwlevels")
meas <- readNWISdata(state_cd="WI",service="measurements",format="rdb_expanded")

Water Quality Portal (WQP)

Water Quality Portal

  • Multiple agencies
    • USGS data comes from the NWIS database
    • EPA data comes from the STORET database (this includes many state, tribal, NGO, and academic groups)
  • WQP brings data from all these organizations together and provides it in a single format

  • Much more verbose output than NWIS

  • To get non-NWIS data, you need to use a CharacteristicName instead of a parameter code (see the sketch after this list).
    • Need to be careful of units
    • Need to be careful of censored values
    • Dissolved vs filtered vs ….etc….
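
A minimal sketch of a characteristic-name retrieval (the site, characteristic, and dates are illustrative; readWQPqw accepts a characteristic name in place of a parameter code):

specCond <- readWQPqw(siteNumbers = "WIDNR_WQX-10032762",
                      parameterCd = "Specific conductance",
                      startDate = "2011-05-01",
                      endDate = "2011-09-30")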

Overview

Water Quality Portal Basic Retrieval: readWQPqw

Much like the convenience functions for NWIS, there is a simple function for retrievals when the site number and parameter code (or characteristic name) are known.

sitePH <- phWI.1$site_no[1]

nwisQW <- readNWISqw(sitePH, "00665")
ncol(nwisQW)
## [1] 36
wqpQW <- readWQPqw(paste0("USGS-",sitePH),"00665")
ncol(wqpQW)
## [1] 65

Explore these results in RStudio.

Censored Data: NWIS

Censored data is particularly important for water quality data. Two examples of censored data are:

  • Left-censored - the data is less than the detection level of the measurement technique
  • Right-censored - the data is greater than the upper limit of the measurement technique

NWIS data makes identifying this data easy using the column result_cd.
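
A minimal sketch of flagging those values, assuming the relevant code column is remark_cd (its name in current readNWISqw output) and that "<" marks results below the detection level:

# Flag results reported as below the detection level
nwisQW$left_censored <- nwisQW$remark_cd %in% "<"
table(nwisQW$left_censored)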

Censored Data: WQP

There's a little more work for WQP data. The following table has renamed the columns for space considerations.

Non-NWIS data might have different ways to indicate censoring.
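
A hedged sketch of one common pattern, assuming the standard WQP columns ResultDetectionConditionText and DetectionQuantitationLimitMeasure.MeasureValue are present in the readWQPqw return:

# Treat "Not Detected" results as left-censored at the reported detection limit
wqpQW$left_censored <- grepl("Not Detected", wqpQW$ResultDetectionConditionText)
summary(wqpQW$DetectionQuantitationLimitMeasure.MeasureValue[wqpQW$left_censored])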

Water Quality Portal Retrieval: readWQPdata

The true value of the Water Quality Portal is to explore water quality data from different sources.

Become familiar with the possibilities of the web services HERE

There is not a function in WQP that returns period-of-record information the way we retrieved it above from NWIS. The following function returns a list of sites that have collected phosphorus data in Wisconsin, but there is no way to know whether a given site has collected one sample or thousands.

phosSites <- whatWQPsites(statecode="WI",characteristicName="Phosphorus")

Water Quality Portal Retrieval: readWQPdata

So, to get that information, you need to actually retrieve the data.

phosData <- readWQPdata(statecode="WI",characteristicName="Phosphorus")
saveRDS(phosData, "phosData.rds")
unique(phosData$ResultMeasure.MeasureUnitCode)
##  [1] "mg/l"       "ug/l"       NA           "mg/kg"      "ppb"       
##  [6] "mg/l as P"  "mg/kg as P" "mg/l PO4"   "lb/day"     "mg/kg PO4" 
## [11] "%"

Use a little dplyr to get some information:

siteInfo <- attr(phosData, "siteInfo")

wiSummary <- phosData %>%
  filter(ResultMeasure.MeasureUnitCode %in% c("mg/l","mg/l as P")) %>%
  group_by(MonitoringLocationIdentifier) %>%
  summarise(count=n(),
            start=min(ActivityStartDateTime),
            end=max(ActivityStartDateTime),
            max = max(ResultMeasureValue, na.rm = TRUE)) %>%
  filter(count > 300) %>%
  arrange(-count) %>%
  left_join(siteInfo)
## Joining by: "MonitoringLocationIdentifier"

Water Quality Portal Retrieval: readWQPdata (cont.)

col_types <- c("darkblue","dodgerblue","green4","gold1","orange","brown","red")
leg_vals <- unique(as.numeric(quantile(wiSummary$max, 
                probs=c(0,0.01,0.1,0.25,0.5,0.75,0.9,.99,1), na.rm=TRUE)))
pal = colorBin(col_types, wiSummary$max, bins = leg_vals)
rad <-3*seq(1,4,length.out = 16)
wiSummary$sizes <- rad[as.numeric(cut(wiSummary$count, breaks=16))]
          
leaflet(data=wiSummary) %>% 
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(~dec_lon_va,~dec_lat_va,
                   fillColor = ~pal(max),
                   radius = ~sizes,
                   fillOpacity = 0.8, opacity = 0.8,stroke=FALSE,
                   popup=~station_nm) %>%
  addLegend(position = 'bottomleft',
            pal=pal,
            values=~max,
            opacity = 0.8,
            labFormat = labelFormat(digits = 1), 
            title = 'Max Value')

Water Quality Portal Advice

Queries with more filters are often slower. Bounding box queries are slow.

Check the website for new features!

Time/Time zone discussion

  • The arguments for all dataRetrieval functions concerning dates (startDate, endDate) can be R Date objects, or character strings, as long as the string is in the form "YYYY-MM-DD"

  • In R, one vector (or column in a data frame) can only have ONE timezone attribute
    • Within a single state, some sites observe daylight saving time and some don't
    • dataRetrieval queries could easily span timezones
  • Therefore, dataRetrieval converts all date/times to UTC.

  • The user can specify a single timezone to override UTC. The allowable tz arguments are listed on the next slide; a short usage sketch follows.
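
A short usage sketch covering both points, reusing the Yahara site (the tz value is one of the allowable strings in the table that follows):

yaharaCDT <- readNWISuv(siteNumbers = "05427718",
                        parameterCd = "00060",
                        startDate = as.Date("2014-10-01"), # Date object
                        endDate = "2014-10-07",            # "YYYY-MM-DD" string
                        tz = "America/Chicago")            # return date/times in Central time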

Allowable timezones

tz                   Name                                  UTC_offset  UTC_DST
America/New_York     Eastern Time                          -05:00      -04:00
America/Chicago      Central Time                          -06:00      -05:00
America/Denver       Mountain Time                         -07:00      -06:00
America/Los_Angeles  Pacific Time                          -08:00      -07:00
America/Anchorage    Alaska Time                           -09:00      -08:00
America/Honolulu     Hawaii Time                           -10:00      -09:00
America/Jamaica      Eastern Standard Time (year-round)    -05:00      -05:00
America/Managua      Central Standard Time (year-round)    -06:00      -06:00
America/Phoenix      Mountain Standard Time (year-round)   -07:00      -07:00
America/Metlakatla   Pacific Standard Time (year-round)    -08:00      -08:00

Verbose and Progress

Use tools from the httr package to get data transfer information, and/or a progress bar on long-running retrievals.

library(httr)
set_config(verbose())
set_config(progress())
daily <- readNWISdv(siteNo, pCode)
-> GET /nwis/dv/?site=05427718&format=waterml%2C1.1&
  ParameterCd=00060&StatCd=00003&startDT=1851-01-01 
  HTTP/1.1
-> Host: waterservices.usgs.gov
-> User-Agent: libcurl/7.43.0 httr/1.1.0 dataRetrieval/2.5.2
-> Accept-Encoding: gzip, deflate
-> Accept: application/json, text/xml, application/xml, */*
-> 
<- HTTP/1.1 200 OK
<- Date: Wed, 09 Mar 2016 18:10:29 GMT
<- Server: GlassFish Server Open Source Edition  4.1
<- Vary: Accept-Encoding
<- Content-Encoding: gzip
<- Content-Type: text/xml;charset=UTF-8
<- X-UA-Compatible: IE=edge,chrome=1
<- ResponseTime: D=258500 t=1457547029642140
<- Connection: close
<- Transfer-Encoding: chunked
<- 
Downloading: 53 kB
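
To turn the extra output back off when you are done, httr's reset_config clears what set_config added:

reset_config()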

More information: