The only thing you really need to do to prepare for R training is to install the latest version of R and RStudio. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed. Directions for installing R/RStudio are below. If you run into any problems, check the instructions at R for Data Science Section 1.4.1 and 1.4.2.
NOTE: If you don’t have Administrative privileges on your computer, you will have to submit an IT HelpDesk Ticket (need to be on VPN or NPS network to access this link) to install R and RStudio, which could take a while. PLEASE PLAN AHEAD!
Even if you already have R or RStudio installed, please install the latest versions of both programs. R recently went through a major version change from 3.x to 4.x with some potential code-breaking changes. The latest versions are needed to ensure everyone’s code behaves the same way.
The following packages will be used in Day 1’s session on data retrieval. Please run the code chunk below to install all of the required packages that are not currently installed on your machine.
packages <- c("tidyverse", "DBI", "odbc", "readr", "dbplyr", # Reading Databases
"GADMTools", "sf", "tmap", "rnaturalearth", "rnaturalearthdata", # GIS in R
"readxl", "rvest", "htmltab", "stringr", "jsonlite", "httr", "geojsonsf", # Web services
"dataRetrieval", "lubridate", "jsonlite", "httr", # Aquarius
"rFIA" # for USFS FIA data
)
install.packages(setdiff(packages, rownames(installed.packages())))
If folks are having trouble installing tmap due to an issue with one of its dependencies, terra, try running the following code and then reinstall tmap.
install.packages('terra', repos='https://rspatial.r-universe.dev')
install.packages('tmap')
After you run the code chunk above, please run the following code to ensure everything was successfully installed.
packages <- c("tidyverse", "DBI", "odbc", "readr", "dbplyr", # Reading Databases
"GADMTools", "sf", "tmap", "rnaturalearth", "rnaturalearthdata", # GIS in R
"readxl", "rvest", "htmltab", "stringr", "jsonlite", "httr", "geojsonsf", # Web services
"dataRetrieval", "lubridate", "jsonlite", "httr", # Aquarius
"rFIA")
installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them
lapply(packages, library, character.only = TRUE)
If you have any issues making this work, contact one of the trainers prior to training, and we’ll help you troubleshoot.
The following packages will be used in Day 2’s session on functional programming. Please run the code chunk below to install all of the required packages that are not currently installed on your machine.
packages <- c("tidyverse", "parallel", "microbenchmark",
"profvis", "modelr", "lubridate", "tidyselect")
install.packages(setdiff(packages, rownames(installed.packages())))
After you run the code chunk above, please run the following code to ensure everything was successfully installed.
packages <- c("tidyverse", "parallel", "microbenchmark", "profvis",
"modelr", "lubridate", "tidyselect")
installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them
lapply(packages, library, character.only = TRUE)
If you have any issues making this work, contact one of the trainers prior to training, and we’ll help you troubleshoot.
If you are attending Day 3: R Markdown, you must have a LaTeX engine installed to output to PDF. The easiest, lightest-weight install is tinytex. The tinytex website includes download files and installation instructions.
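If you prefer to install tinytex from within R, a minimal sketch (assuming you have rights to install software) looks like this:
# Install the tinytex R package, then use it to install the TinyTeX LaTeX distribution
install.packages("tinytex")
tinytex::install_tinytex()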
The following packages will also be used in Day 3’s session on R Markdown. Please run the code chunk below to install all of the required packages that are not currently installed on your machine.
packages <- c("rmarkdown", "tidyverse", "knitr", "kableExtra", "DT",
"tidyverse", "flexdashboard", "crosstalk", "leaflet",
"DT", "echarts4r", "reactable", "plotly", "sparkline",
"dygraphs")
install.packages(setdiff(packages, rownames(installed.packages())))
After you run the code chunk above, please run the following code to ensure everything was successfully installed.
packages <- c("rmarkdown", "tidyverse", "knitr", "kableExtra", "DT",
"tidyverse", "flexdashboard", "crosstalk", "leaflet",
"DT", "echarts4r", "reactable", "plotly", "sparkline",
"dygraphs")
installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them
lapply(packages, library, character.only = TRUE)
If you want to be able to collapse rows in a kable(), you’ll need the development version of kableExtra, which can be installed from GitHub. However, to install from GitHub, you need the devtools package. See Day 4 below for instructions on installing devtools and RTools (required for devtools). After installing devtools and RTools, you can install the development version of kableExtra with the following code:
devtools::install_github("haozhu233/kableExtra")
If you have any issues making this work, contact one of the trainers prior to training, and we’ll help you troubleshoot.
If you are attending Day 4: R Packages and Version Control, you need to install Git for Windows, RTools, and a number of packages prior to training. You’ll also need a GitHub account and to connect RStudio and GitHub prior to training for the best experience. More details to prepare for this session are below:
The devtools package allows you to install packages from GitHub, and will be the easiest way for others to install your packages in their R environment. The devtools package has a lot of dependencies, and you often have to install new packages or update existing packages during the process. If you’re asked to update libraries while trying to install devtools, you should update them. The most common reason the devtools install doesn’t work is because you have an outdated version of one of its dependencies installed.
The following packages will also be used in Day 4’s session on R Packages and Version Control. Please run the code chunk below to install all of the required packages that are not currently installed on your machine.
packages <- c("devtools", "usethis", "roxygen2","stringr", "dplyr")
install.packages(setdiff(packages, rownames(installed.packages())))
These packages can take a while to download, and sometimes time out before finishing. After you run the code chunk above, please run the following code to ensure everything was successfully installed.
packages <- c("devtools", "usethis", "roxygen2", "stringr", "dplyr")
installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them
lapply(packages, library, character.only = TRUE)
The entire training will take place over 4 half days. Each day will run from 10-2 MST via MS Teams. The hour before training and the hour after training will also be hosted by at least one trainer as office hours, in case there are questions that couldn’t be handled during training.
Each training session has three trainers assigned, two of which will be the main instructors. The third trainer will manage the chat. For most of the training, one of the trainers will share their screen to walk through the website and then demo with live coding. It will help a lot if you have 2 monitors, so you can have the screen being shared on one monitor and your own session of RStudio open on the other.
Half days are barely enough to scratch the surface of the more advanced topics we’re covering in R. Our goal with this training is to help you get beyond the initial learning curve, so you’re armed with the skills you need to continue learning on your own. The trainers put a lot of thought and time into designing this training. We did it because we enjoy coding in R, and because it has greatly increased the efficiency and productivity of our work. We hope you have a similar experience as you begin applying and learning more about the tools we’re covering this week. As you learn more, please share your skills and code with others to help us build a larger community of R users in IMD who can learn from each other.
Finally, to help us improve this training for future sessions, please leave feedback in the training feedback form.
library(DBI)
library(odbc)
library(readr)
library(magrittr)
library(dplyr)
library(dbplyr)
#----------Connect to Access------------#
db_path <- "data/Trees.accdb"
conn_string <- paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", db_path)
conn <- dbConnect(odbc(), .connection_string = conn_string)
#---------------Read the data---------------#
data <- tbl(conn, "tbl_Trees") %>%
collect()
# Common tidying operations:
data <- mutate_if(data, is.character, trimws) %>%
mutate_if(is.character, dplyr::na_if, "")
# Close the connection as soon as you no longer need it, otherwise you will get weird errors
dbDisconnect(conn)
#----------Option 1: Hard code db connection info (not great)------------#
sql_driver <- "SQL Server Native Client 11.0"
my_server <- "SERVER\\NAME"
my_database <- "Database_Name"
conn <- dbConnect(Driver = sql_driver, Server = my_server, Database = my_database, Trusted_Connection = "Yes", drv = odbc::odbc())
#----------Option 2: Read db connection info from shared drive------------#
# This is just a csv with columns for Driver, Server, and Database
params <- read_csv("path/to/csv/on/shared/drive/database-conn.csv")
params$Trusted_Connection <- "Yes"
params$drv <- odbc()
conn <- do.call(dbConnect, params)
#----------Option 3: Create a user DSN (easy to use in R, but has to be configured for each user on each computer)------------#
# Create DSN: https://docs.microsoft.com/en-us/sql/relational-databases/native-client-odbc-how-to/configuring-the-sql-server-odbc-driver-add-a-data-source?view=sql-server-ver15
dsn_name <- "myDSN"
conn <- dbConnect(odbc(), dsn_name)
#---------------Read the data---------------#
data <- tbl(conn, in_schema("analysis", "Chemistry")) %>% # The first argument to in_schema is the schema name (SQL default is dbo) and the second is the table name.
collect()
# Common tidying operations:
data <- mutate_if(data, is.character, trimws) %>%
mutate_if(is.character, na_if, "")
# Close the connection as soon as you no longer need it, otherwise you will get weird errors
dbDisconnect(conn)
library(GADMTools) # for downloading high res administrative boundaries
library(rnaturalearth)# for natural earth download functions
library(rnaturalearthdata) # for cultural and physical boundaries of diff. res.
library(sf) # for working with simple features
library(tmap) # for mapping spatial data
If folks are having trouble installing tmap due to an issue with one of its dependencies, terra, try running the following code and then reinstall tmap.
install.packages('terra', repos='https://rspatial.r-universe.dev')
If you tried to do GIS and make maps in R even a few years ago, you probably encountered the same frustrations I did. There were a ton of packages out there, each with its own file format and coding convention, and each package rarely had all the features I needed. It was not easy to navigate, and I often found myself just exporting my work out of R and doing the rest in ArcGIS… Enter the sf and tmap packages, which are the latest and greatest R packages devoted to GIS and map making! Most of the frustrations I had with earlier packages have been resolved with these two packages.
The sf package is the workhorse for anything you need to do with spatial vector data. Spatial objects in sf are called simple features, which follow a set of GIS standards that are becoming the universal data model for vector-based spatial data in R, and that most GIS-based packages in R now employ. Simple features are also now the standard for open source GIS in general. That means if you’re trying to troubleshoot something with simple features, you’ll need to specify that it’s for R, rather than PostGIS or some other implementation. The sf package is also superseding the rgdal package, which used to be the more common data model in R and open source GIS. The more I use this package, the more impressed I am with how intuitive it is to use, and how well documented the functions are. For vector data, I have yet to need to perform a task that sf couldn’t do.
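As a quick illustration of a typical sf workflow, here is a minimal sketch; the shapefile path below is hypothetical.
library(sf)
parks <- st_read("./data/park_boundary.shp") # read vector data as a simple feature
st_crs(parks) # check the coordinate reference system
parks_albers <- st_transform(parks, 5070) # reproject to EPSG 5070 (Conus Albers)
parks_buff <- st_buffer(parks_albers, dist = 1000) # common geoprocessing, e.g., a 1 km buffer
plot(st_geometry(parks_buff))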
The main application of the tmap package is making maps, and it was designed using a grammar of graphics philosophy similar to ggplot2. In practice, that means tmap maps are built as a collection of many small functions/layers that get added together with the + operator, and order matters. There are also tmap-enabled functions that you can use in ggplot2 plots, but you can do a lot more in tmap. I also prefer the look of tmap’s built-in compass, legends, etc. over the ones available in ggspatial, which is an add-on package for plotting spatial data in ggplot2.
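To give a feel for the layering syntax before the full example later in this page, here is a minimal sketch using the World dataset that ships with tmap (the pop_est column is assumed to exist in that built-in data).
library(tmap)
data("World") # example polygons included with tmap
tm_shape(World) + # the data layer
  tm_polygons("pop_est") + # how to draw it (fill by population estimate)
  tm_compass() + # map furniture is added with + the same way
  tm_scale_bar()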
The raster package is also an excellent package for analyzing/processing raster data. For large jobs, I find the raster package easier to work with than ESRI tools, and it tends to run a lot faster than ESRI’s built-in tools (I haven’t compared with Python).
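A minimal sketch of working with the raster package; the GeoTIFF path below is hypothetical.
library(raster)
dem <- raster("./data/elevation.tif") # read a single-band raster
cellStats(dem, mean) # summary statistic across all cells
dem_coarse <- aggregate(dem, fact = 4) # resample to a coarser resolution
plot(dem_coarse)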
Finally, the leaflet package in R allows you to create interactive maps as HTML widgets. These are often included in R Shiny apps, but they can also be used in R Markdown documents with HTML output (for more on that, attend next Wednesday’s session on R Markdown). An internet connection is required for the maps to be interactive; without one, the map will only show the default extent defined by the code.
The leaflet package is relatively easy to use for basic mapping. For more advanced features or further customization, you often end up having to code in JavaScript, which is the language Leaflet was originally programmed in. There’s also a lot more online help available for the JavaScript version of the Leaflet library than for the R package. If you’re really stumped about something, you may find your answer in the JavaScript help.
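A minimal leaflet sketch (the coordinates below are just an illustrative point):
library(leaflet)
leaflet() %>%
  addTiles() %>% # default OpenStreetMap basemap
  addMarkers(lng = -92.1005, lat = 46.7867, popup = "Example point") %>%
  setView(lng = -92.1005, lat = 46.7867, zoom = 10)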
There’s an open source group that hosts a database of international geographic boundaries (called GADM), and you can download their data directly in R. Multiple packages can do this, including GADMTools and geodata. The maintainer of geodata also maintains the raster package, so it’s a solid package. I’m going to show GADMTools, because I found it easier to download and import as simple features, but geodata is perfectly acceptable too.
library(GADMTools)
# Level 0: National
gadm_sf_loadCountries("USA", level = 0, basefile = "./data/")
us_bound <- readRDS("./data/USA_adm0.sf.rds")
us_bound_utm <- st_transform(st_as_sf(us_bound, crs = 4326), 5070)
st_write(us_bound_utm, "./data/US_Boundary.shp", append = F)
# Level 1: State/Province
gadm_sf_loadCountries("USA", level = 1, basefile = "./data/")
us_states <- readRDS("./data/USA_adm1.sf.rds")
us_state_utm <- st_transform(st_as_sf(us_states, crs = 4326), 5070)
st_write(us_state_utm, "./data/US_States.shp", append = F)
# Level 2: County/district
gadm_sf_loadCountries("USA", level = 2, basefile = "./data/")
us_cnty <- readRDS("./data/USA_adm2.sf.rds")
us_cnty_utm <- st_transform(st_as_sf(us_cnty, crs = 4326), 5070)
st_write(us_cnty_utm, "./data/US_Counties.shp", append = F)
plot(us_state_utm[1])
There’s another open source group called Natural Earth that hosts administrative boundaries of differing resolutions for easier small-scale to large-scale plotting. You can download data from their website, or via the rnaturalearth R package. Categories of data include cultural, physical, or raster. I’ll show an example of downloading lake data from this site.
library(rnaturalearth)
library(rnaturalearthdata)
lakes <- ne_download(scale = 10, # large scale
category = "physical",
type = 'lakes_north_america', # a named file on their website
destdir = paste0(getwd(), "/data"), # save to working dir.
returnclass = 'sf' # return as sf instead of sp
)
lakes_utm <- st_transform(lakes, crs = 5070)
The URLs below are the different park tiles available, which can be plotted in the background of leaflet or tmap maps. I’ll show how that works with an example using tmap.
# Load park tiles
NPSbasic = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
NPSimagery = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
NPSslate = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
NPSlight = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
Using the us_cnty_utm layer, which we downloaded from GADM and then reprojected to EPSG 5070 (Conus Albers), we’ll make a quick map to show how tmap works. First I’ll filter the full county layer to just include the state of Minnesota. Then I’ll make a map with the MN layer called mn_map, which first adds the mn_cnty layer via tm_shape(), then specifies how the layer should look using tm_borders(). I added the lakes layer I got from Natural Earth too. Finally, we use tmap_options() to set the baselayers of the map, which are the park tiles that were defined above.
# Map
mn_cnty <- st_as_sf(us_cnty_utm[us_cnty_utm$NAME_1 == "Minnesota",])
mn_map <- tm_shape(mn_cnty, projection = 5070) +
tm_borders(col = "black")
mn_map_lakes <- mn_map +
tm_shape(lakes_utm) +
tm_fill(col = 'lightblue')
mn_map_lakes
Now we’ll add the NPS baselayers to the map and change the mode to make the plot interactive.
tmap_options(basemaps = c(Map = NPSbasic,
Imagery = NPSimagery,
Light = NPSlight,
Slate = NPSslate))
tmap_mode('view') # set to interactive mode
mn_map
tmap_mode('plot') # return to static mode
# Packages used in this section
pkgs <- c("tidyverse",
"readxl",
"rvest",
"htmltab",
"stringr")
installed_pkgs <- pkgs %in% installed.packages() # check which packages are not yet installed
if (length(pkgs[!installed_pkgs]) > 0) install.packages(pkgs[!installed_pkgs],dep=TRUE) # if some packages are not yet installed, go ahead and install them...
lapply(pkgs, library, character.only = TRUE) # ...then load all the packages and their dependencies
R, Part 1
This section focuses on using R to read delimited flat files (e.g., .csv) from the Internet, download and unzip compressed files, and scrape HTML tables from webpages.
We often get flat files from the Internet by clicking on a hyperlink or download button. With HTML tables, we may simply copy the table and paste it into Excel. Why, then, should we bother writing R code to do these processes? Perhaps we simply want to document all of our data import and processing steps in a reproducible and portable R script. Or maybe the data update frequently and we don’t want to manually download the data ad nauseam. Whatever the reason, R has solutions. But before we talk about those solutions, let’s go over some good practices when it comes to getting data from the Internet.
Check webpage security! Heed browser warnings about potentially unsafe sites. Don’t download data from HTTP connections (stick with HTTPS).
Respect copyrights and webpage terms and conditions for using data. Understand Creative Commons Licenses.
Don’t burden a webpage server with repeated downloads of the same data. Especially when working with large files, download the data to your computer and work with it from there.
Understand the data you’re using. Make sure it’s from a trusted source, and verify the quality of the data before you use or share it.
When web services require user authentication via API tokens, usernames, passwords, etc., don’t put this sensitive information directly in an R script that may be shared with others. Instead, save website passwords and API tokens in an .Renviron file to be sourced by the R script (see the short sketch after this list).
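A minimal sketch of the .Renviron approach; the token name and value below are hypothetical.
# In a file named .Renviron (in your home directory or project folder), add a line like:
#   MY_API_TOKEN=abc123
# Restart R so the value is loaded, then read it in your script without hard coding it:
api_token <- Sys.getenv("MY_API_TOKEN")
# Use api_token when building your request URL or headers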
With that, we’re ready to get started.
If we have the URL for a .csv or .txt file, we can read it into R as we would read any flat file from a file path on our computer. The first argument for these functions (e.g., readr::read_csv()) is typically file =, which can be an absolute file path, a file path relative to the current working directory, OR a complete URL. For example, we can write dat <- readr::read_csv("./data/butterflies.csv") to reference a file path on our computer, or dat <- readr::read_csv("https://raw.githubusercontent//butterflies.csv") to get the data from a webpage.
The example below uses data on National Park Service visitor counts since 1904. These data were originally accessed from the US Department of the Interior National Park Service’s Integrated Resource Management Applications data portal (https://irma.nps.gov/), but here we read the data from a Tidy Tuesday Project repository.
### NOTES:
# 1. We can feed the URL directly into the `readr::read_csv()` function instead of first assigning it to a variable name.
# 2. For other types of flat files (e.g., fixed width or tab-delimited), simply replace `read_csv` with the relevant function
# Specify URL where file is stored
file_loc <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv"
# Read in the file from the URL as we would read in a file stored on our computer
dat <- readr::read_csv(file_loc)
head(dat)
Additional notes on reading flat files in R:
The fread() function from the data.table package can be much faster than base R or dplyr functions for reading in very large flat files in table format. The resulting object is classified as both a data.table and a data.frame, which means we can wrangle the data with either data.table or dplyr functions. For very large data sets, using data.table functions on data.table objects is faster and more efficient than working with data.frame objects and dplyr functions. But check out the dtplyr package, which converts dplyr syntax to data.table syntax, combining some of the advantages of data tables with the familiar syntax of dplyr.
If a very large flat file includes a lot of non-numeric data, the vroom::vroom() function can be much faster than data.table::fread() for reading the file into R. The resulting object is a tibble.
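A minimal sketch of those alternatives, reusing the national_parks.csv URL from the example above. The column names used in the dtplyr pipeline (unit_type, visitors) are assumptions based on that dataset.
library(data.table)
library(dtplyr)
library(dplyr)
library(vroom)
file_loc <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv"
dat_dt <- data.table::fread(file_loc) # object is both a data.table and a data.frame
dat_dt %>%
  dtplyr::lazy_dt() %>% # dplyr verbs get translated to data.table operations
  group_by(unit_type) %>%
  summarise(mean_visitors = mean(visitors, na.rm = TRUE)) %>%
  as_tibble() # collect the result back into a tibble
dat_tbl <- vroom::vroom(file_loc) # fast (lazy) reading; returns a tibble
head(dat_tbl)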
Instead of reading a file directly into an R object, we can first download the file to a specified file path, then read from that file path into R. This option is useful if we are trying to read a file with a function that does not accept URLs. Even if a function DOES accept URLs, it is generally good practice to separate the step of downloading files from working with them, especially when working with large data files. As a general rule, we want to minimize the number of times we may re-download a file and burden the servers.
The download.file() function in the example below allows us to download various file types from the Internet and save them to specified file paths on the computer.
### NOTES:
# 1. Setting the `download.file()` `method` argument to `libcurl` allows access to ftps:// URLs and also allows simultaneous downloads, i.e., the `url` and `destfile` arguments can have character vectors with multiple items
# 2. Check out `?download.file` for other useful arguments, e.g., setting `timeout` length
# Specify URL where file is stored (alternatively, We can feed the URL and destination paths directly into the `download.file()` function)
file_loc <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv"
# If we want to save the file in a subfolder of the current working directory, make sure the subfolder exists
if(!dir.exists("./downloads")) dir.create("./downloads")
# Specify the destination path where file should be saved
dest <- paste0("./downloads/park_visits_", gsub("-", "", x = Sys.Date()), ".csv") # saves file with attached date in yyyymmdd format (optional)
download.file(url = file_loc, destfile = dest) # save the file to the specified destination
We can use the download.file() function to download a compressed file (e.g., a zipped shapefile) to a specified file path. Then we can use unzip() to look at the contents of the compressed file, extract and read in file(s) of interest, and/or save extracted files to a specified file path. This process works with any compressed file for which we have a URL.
The typical functions for reading flat files in R (e.g., read.table(), read.csv(), readr::read_delim(), readr::read_csv(), etc.) supposedly can read compressed files, but I often get error messages with them unless I download and then unzip() the file.
If we have no need to keep a compressed file after extracting its contents and/or reading its content(s) into an R object, we can download the file into a temporary directory that will be automatically cleared when we exit the R session.
The example below uses data from a survey focusing on people’s anticipation about social distancing rules and firm closures during the 2020 COVID-19 pandemic. SOURCE: Lange, Fabian and Lars Vilhuber. 2020. “Uncertainty in times of COVID-19: Raw survey data [dataset].” Available at https://labordynamicsinstitute.github.io//covid19-expectations-data.
### NOTES:
# 1. On Windows, if the `mode` argument is not supplied and the URL ends in one of .gz, .bz2, .xz, .tgz, .zip, .jar, .rda, .rds or .RData, then the default mode is "wb" (for reading compressed files) -- but it never hurts to set the argument explicitly.
# 2. The `unzip()` function extracts all files unless one or more files is specified with the `files` argument
# 3. To unzip a single file, we can use unz() instead of unzip() -- but check out `?unz` for the correct argument names
# Specify URL where file is stored
file_loc <- "https://www.zenodo.org/record/3966534/files/labordynamicsinstitute/covid19-expectations-data-v20200622-clean.zip?download=1"
# Specify the destination where file should be saved. In this example we save the compressed file to a temporary directory that will be cleared after we exit the current `R` session. Alternatively, we could have provided an actual file path, e.g., `tmp <- ("./my_downloads/butterflies.zip")
tmp <- tempfile()
# We can see the file path of the temporary directory by typing `tempdir()` in the console
# Download the file
download.file(url = file_loc, destfile = tmp, mode = "wb") # set mode to "wb" if the file is compressed (but okay if omitted because this is automatically set as the default mode when the file has a recognized compressed file extension -- see Note # 1 above)
# Now we can look at file names and sizes (in the temporary directory, AFTER downloading) without extracting the files
unzip(zipfile = tmp, list = TRUE) %>% head(30)
# Hmmm...there are a lot of files, so let's try again but specifically searching for a .csv or .xlsx to read in
unzip(zipfile = tmp, list = TRUE) %>%
dplyr::filter(., grepl(".csv|.xlsx", Name)) %>%
dplyr::pull(Name)
# Looks like we have 65 files that are .csv or .xlsx
# Read in the (unzipped) actual file of interest and call it 'dat'. I'm picking an .xlsx file for this example just for practice. But in general, it's faster and "cleaner" to read data in .csv or .tsv format if those alternatives are available
dat <- readxl::read_excel(unzip(zipfile = tmp, files = "labordynamicsinstitute-covid19-expectations-data-fd6cf45/auxiliary/aggregage_agegrp_ca.xlsx"))
head(dat) # examine the file
# We can extract the files and save them to a specified location. Remember that we had saved the zipped file to a temporary directory, so that file will be automatically deleted when we close the `R` session. But these extracted files will remain where we specified in the code line below.
# If the subfolder `downloads` doesn't already exist, it will be created automatically by the `unzip()` function.
unzip(zipfile = tmp, exdir = "./downloads/example_extracted_files_folder")
CHALLENGE: Practice reading in flat files and compressed files from the Internet.
One way to easily find practice files for this challenge is with Google Dataset Search. In the search bar, enter a topic (e.g., ‘invasive plants’) for which you would like to find a data set. Filter the search with these criteria:
Under Download format, select Tabular and Archive.
Under Usage rights, select Non-commercial use allowed.
The search results will tell you the format of the available data and provide a link to the relevant webpage from which you can download the data. You can usually get the URL for the data by right-clicking on the data hyperlink and then selecting ‘Copy link address’.
If you find other file types that interest you (e.g., XML), use the Internet and R help files to figure out how to import and read them into R.
Sometimes the data we want to work with are displayed in an HTML table on a webpage, rather than being accessible via a download link or a REST API (more about that later). One way to get the data would be to copy-paste from the webpage into an Excel worksheet, then read the worksheet into R (or save as .csv, then read into R). Alternatively, we can automate the process by coding it into an R script.
This webpage has a nice explanation of how to understand the HTML code for <table> elements. It’s a useful primer for understanding what we will be doing next.
In the examples below we pull data from HTML tables on this Wikipedia page about COVID vaccine deployment.
Using rvest and htmltab to read in an HTML table
# NOTES:
# 1. With `rvest::html_table()` we can also specify the node by number(s), e.g., `html_table(tabs[c(2, 6)])` or `html_table(tabs[[2]])`
# 2. If we don't specify a node by number, `rvest::html_table()` will convert all the tables into tibbles and combine them into a list of tibbles.
# URL for webpage
page_url <- "https://en.wikipedia.org/wiki/Deployment_of_COVID-19_vaccines"
# First, read in an HTML page and then look in that object for HTML elements that are defined as `<table>`
html_pg <- rvest::read_html(page_url)
tabs <- rvest::html_elements(html_pg, "table")
# `tabs` may show more tables than you were expecting to see. Remember that HTML elements classified as `<table>` may refer to any content laid out in tabular format--not necessarily numeric data tables
# Use `str_detect()` to find which table has HTML text matching the title (caption) of the table we want. This code returns a logical vector for each element in `tabs`, indicating if it has string matching the title we specified
match_vec <- tabs %>%
rvest::html_text() %>%
stringr::str_detect("COVID-19 vaccine distribution by country")
# Now use `match_vec` with `html_table` to read in the table of interest
this_table <- rvest::html_table(tabs[[which(match_vec)]])
head(this_table)
# Notice that`this_table` includes a column with no heading and with only NA values. If we look at the table on the webpage, we see that this first column actually had no heading and had flag icons for values. This is why we see that column in our tibble/data.frame. In addition, the column headings include the superscripts `[a]` and `[b]`, which we don't actually want in our column headings
# A relatively new package, `htmltab`, has a function that works quite well when it comes to giving us a "clean" table. We will provide this function with the URL and with the table rank we had already identified with string matching.
this_table2 <- htmltab::htmltab(doc = page_url, which = which(match_vec))
head(this_table2)
Using htmltab() and HTML developer tools to read in a specific HTML table
If we are having difficulty using rvest functions to find the table we want, we can instead inspect the source code of the webpage to identify the XPath (rank) for the table we want. However, if the rank of the table changes in the future (e.g., the webpage is updated with additional tables), then we would need to update our code with the new table XPath information.
One way to find the XPath for the table, using Google Chrome or Microsoft Edge developer tools:
On the webpage, hover your mouse over the table of interest and (right click) > Inspect
In the developer tools ‘element’ view you should see a line of HTML code highlighted (corresponding to what you hovered over on the webpage). Depending on where in the table you clicked, you may have to go up a few lines of code to find the start of the table (look for “<table class =”; a gray vertical line to the left of the HTML code should also lead you up to where the table starts). Check the HTML caption for the table–if a caption exists–to make sure you have the right table. (NOTE: You can also use the ‘inspect element’ icon in developer tools and highlight a part of the table, to find the corresponding HTML code)
Click on the HTML code for the table class to highlight it. Then (right click) > Copy > Copy full XPath
Paste the XPath into your R script and assign it to a variable name, e.g., xp_full.
Once you have found the XPath for the table of interest, you can use it in the htmltab() function as shown below.
# URL for webpage
page_url <- "https://en.wikipedia.org/wiki/Deployment_of_COVID-19_vaccines"
# The XPath below was determined by following the steps above re: using your web browser's developer tools
xp_full <- "/html/body/div[3]/div[3]/div[5]/div[1]/div[6]/div/div/table"
this_table <- htmltab::htmltab(doc = page_url, which = xp_full)
head(this_table) # Note that we seem to have an extra column. If we look at the HTML code for the table, we see that the `Doses(millions)` tab actually spans two columns. First column contains the integer, second column contains the decimal point and one value past the decimal point. We will want to paste the values in those two columns together.
On the COVID-19 vaccines webpage, the table captioned “COVID-19 vaccine production by country” has nested headers. Let’s see what happens when we read it into R with rvest::html_table() versus htmltab::htmltab().
# URL for webpage
page_url <- "https://en.wikipedia.org/wiki/Deployment_of_COVID-19_vaccines"
html_pg <- rvest::read_html(page_url)
tabs <- rvest::html_elements(html_pg, "table")
# Use `str_detect()` to find which table has HTML text matching the title (caption) of the table we want. This code returns a logical vector for each element in `tabs`, indicating if it has string matching the title we specified
match_vec <- tabs %>%
rvest::html_text() %>%
stringr::str_detect("COVID-19 vaccine production by country")
# Let's first try using `rvest::html_table()`
this_table <- rvest::html_table(tabs[[which(match_vec)]])
head(this_table)
# Youch. The default output has part of the headers as the first row of data
# Now try with `htmltab::htmltab()`
this_table2 <- htmltab::htmltab(doc = page_url, which = which(match_vec))
head(this_table2)
# WA-LAH! As it turns out, the default output for `htmltab()` has a rather nice solution for the issue of nested headers
CHALLENGE: Practice reading in an HTML table from the Internet.
You will quickly learn that HTML tables can have challenging formatting idiosyncrasies that will test your data wrangling skills.
# Packages used in this section
pkgs <- c("tidyverse",
"jsonlite",
"httr",
"geojsonsf")
installed_pkgs <- pkgs %in% installed.packages() # check which packages are not yet installed
if (length(pkgs[!installed_pkgs]) > 0) install.packages(pkgs[!installed_pkgs],dep=TRUE) # if some packages are not yet installed, go ahead and install them...
lapply(pkgs, library, character.only = TRUE) # ...then load all the packages and their dependencies
R, Part 2
This section focuses on parsing JSON and GeoJSON data formats obtained from RESTful web services.
Sometimes, the data we want to pull from the Internet may not be available as a simple .csv or compressed file accessible with a click on a hyperlink or download button. Many agencies and other “data keepers” make their data available via APIs (Application Programming Interfaces) over the Internet, i.e., through a web service. APIs define how requests for information (data, images, etc.) should be made from a “client” (e.g., R running on our computer) to a “server”, and how the server will then respond–hopefully with the requested information!
REST (Representational State Transfer) is a type of web service that is popular for data exchange because it is lightweight (fast), human readable, and relatively easy to build. Pulling data from a RESTful web service requires a URL, which is the API “endpoint” that tells the web service what data we are requesting.
Typically, when we ping a data request to a server, we are asking for just a subset of the data that the web service offers. For example, we may want monthly precipitation totals from a particular weather station for a specific range of years. Therefore, our data request (via the URL) will include query parameters that specify the subset of data we want (similar to using dplyr::filter() to subset data).
Some RESTful web services also require an API password, username, and/or token (to authenticate the request). This additional information will also be incorporated in the URL.
The Internet is chock-full of information on API’s and web services, and REST architecture. Some good starting points are here and here.
JSON (JavaScript Object Notation) is a highly scalable, human-readable, language-independent data format commonly used for storing and exchanging data via web services. JSON data are written as key/value pairs, e.g., “name”:“Sebastian”. Curly braces {} and square brackets [] are used to structure the data in a way that allows hierarchical/relational data structures (as we would see in a database with linked tables) to be clearly represented in a text file. Here is an example of JSON formatted data.
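To make the key/value idea concrete, here is a small sketch that round-trips an R list through JSON with jsonlite (the values are just illustrative):
library(jsonlite)
person <- list(name = "Sebastian",
               parks_visited = c("CONG", "BAND"),
               active = TRUE)
json_txt <- toJSON(person, pretty = TRUE, auto_unbox = TRUE) # R list -> JSON text
cat(json_txt) # key/value pairs, with [] holding the vector of park codes
fromJSON(json_txt) # JSON text -> back to an R list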
GeoJSON is a data format written in JSON, but following a specific standard for encoding geographic data structures such as points, lines, and polygons. We can think of GeoJSON data format as a JSON format with additional rules specific to how geographic data should be represented.
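As a small illustration of the difference, here is a sketch of a single GeoJSON point feature converted straight to an sf object with geojsonsf (the coordinates and property below are hypothetical):
library(geojsonsf)
gj <- '{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [-80.78, 33.79]},
  "properties": {"site": "example_plot"}
}'
pt_sf <- geojson_sf(gj) # a one-row sf object with a geometry column and a site column
pt_sf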
Can JSON-formatted geographic data be used for mapping in R?
Geographic data provided in JSON format can still (potentially) be used for mapping in R, just as .csv files with latitude and longitude data columns can be used to create point maps. If the geographic data structures are more than just points with x/y coordinates, it may take quite a bit of coding to parse the JSON object and reorganize the data so the object can be used for mapping.
The National Park Service provides geographic data via web services as described on this page.
NOAA developed and maintains the Applied Climate Information System web service, which compiles and summarizes climate data for public use. The datasets are described here, and the RESTful web service query builder is at http://builder.rcc-acis.org/.
USGS water quality data can be obtained through the Water Quality Portal (a RESTful web service), as described here. Even though the USGS dataRetrieval package has functions to query and download data from the Water Quality Portal, the functions may not be able to accommodate complex queries. In that case we can always write code to query the web service directly.
These are just three examples. Once you learn how to import and work with data from RESTful web services, a whole new world of data will be at your fingertips.
R Packages for Accessing Data from RESTful Web Services
Before scripting code to pull data from a RESTful web service, it may be useful to see if an R package has already been created as a wrapper for accessing data from that web service. For example, USGS has created an R package called dataRetrieval that has functions for querying and downloading data from its water quality web services. The rOpenSci site is a great source for finding R packages that make it easier to query, import, and work with publicly available scientific data (typically via web services). Many of the best-known citizen science projects have created R packages that are API wrappers. For example:
In these examples we will read in National Park Service inventory data that are publicly available from ArcGIS REST services. While our examples focus on NPS data, the code can be modified to get data from other RESTful web services.
These examples are intended only to GET YOU STARTED with retrieving JSON and GeoJSON formatted data from web services. There is SO much more to learn than what we cover here.
To begin, we need to find the ArcGIS REST endpoint for a park’s vegetation inventory data.
Go to https://nps.maps.arcgis.com/home/index.html (these are publicly available data).
Click on the search icon (magnifying glass) in the top right corner. In the search bar, type inventory_vegetation and a park name or 4-letter code, e.g., inventory_vegetation cong. (NOTE that inventory_geology will give the corresponding maps with geology data.) You should see a page like this:
Click on the map service hyperlink. In the image above, the hyperlink is the blue texted “IMD Vegetation Map Service for Congaree National Park”.
You can select “Open in Map Viewer” to see what the available GIS layers look like. To find the URL for a particular GIS layer, click on the hyperlink under Layers. You should then see a page that looks like this:
From here, the steps differ slightly for each example below
In this example we will enter data query parameters on the web page, then generate (GET) the full URL with embedded query. This URL will be used to download the requested data in JSON format. The data represent geographical points so should be downloaded in GeoJSON format. But we will request the data in JSON format to show you how to then convert the data to a list of data frames.
Proceeding from Step 4…
5. Under Layers, click on Vegetation Points for Congaree National Park. This is a map layer that includes vegetation plot and observation points for the park.
This page shows useful information about the supported query formats (JSON, AMF, GeoJSON), mapped vegetation categories, map projection (EPSG 26917), and map extent (boundary box). At the bottom of the page you should see a list of data fields in the data set. Click on the Query hyperlink.
On the blue query page (shown below), enter 1=1 (no backticks) for Where:. This means we are not applying any filters to subset the data. Enter * (no backticks) for Out Fields:. This means we want all data fields to be included. For Format:, select JSON.
Click the Query (GET) button at the bottom of the query page. You should now see a webpage with the requested data in JSON format. Copy the URL for this page and enter it in your R script. Save it to a variable named file_loc, then proceed with the following code.
# NOTE:
# 1. The `jsonlite::fromJSON` function converts objects from JSON format to another `R` object such as a data frame.
# 2. The `jsonlite::fromJSON` function can take a JSON string, URL or file as its first argument. However, we will download the data to a temporary file, THEN apply the `jsonlite::fromJSON` function.
# 3. Since the data represent geographic points and we want to use the data for mapping, we have one more step to do. The function `sf::st_as_sf()` converts the specified data frame to a simple feature (similar to a shapefile) for mapping. Because we downloaded data in JSON format instead of GeoJSON, we have to explicitly specify the columns that hold the x and y coordinate data, and we have to specify the CRS.
tmp <- tempfile()
download.file(file_loc, tmp) # the data are still in JSON format
veg_pts_df <- jsonlite::fromJSON(tmp)
# View(veg_pts_df)
# `veg_pts_df` is a list of data frames.
veg_pts_sf = sf::st_as_sf(veg_pts_df$features$geometry, coords = c("x", "y"), crs = 26917)
plot(veg_pts_sf)
Using httr functions to build the full URL
In this example we will use httr functions to build the full URL that will be used to download the requested data in JSON format. After that, the remaining steps are the same as in Example 1. Building the full URL in R is more reproducible and transparent than generating the full URL from a web service query tool.
For web services that require client authentication through usernames, passwords, API tokens, etc., the sensitive information should be saved in and then sourced from an .Renviron file rather than hard coded into the script that imports the data.
Proceeding from Step 4…
Under Layers, click on Vegetation Points for Congaree National Park. Copy the URL for the resulting page and enter it in your R script. Save it to a variable named base_url as shown below, then proceed with the rest of the code.
base_url <- httr::parse_url("https://irmaservices.nps.gov/arcgis/rest/services/Inventory_Vegetation/Vegetation_Map_Service_for_Congaree_National_Park/MapServer/0")
base_url$path <- paste(base_url$path, "query", sep = "/")
base_url$query <- list(where = "1=1", # code in the query parameters
outFields = "*",
f = "json")
request <- httr::build_url(base_url) # compare this URL to the one we used in Example 1
# Everything from here down is the same as in Example 1, but full URL is in `request`
tmp <- tempfile()
download.file(request, tmp)
veg_pts_df <- jsonlite::fromJSON(tmp)
veg_pts_sf = sf::st_as_sf(veg_pts_df$features$attributes, coords = c("x" = "X_Coord", "y" = "Y_Coord"), crs = 26917)
# View(veg_pts_sf)
plot(veg_pts_sf["Pnts_Type"], main = "Congaree National Park Vegetation Points")
We will now download inventory vegetation data in GeoJSON format, then use the geojsonsf::geojson_sf() function to convert the object to an sf object. As stated in ?geojson_sf, “Geojson specification RFC7946 says all CRS should be the World Geodetic System 1984 (WGS 84) [WGS84] datum, with longitude and latitude units of decimal degrees. This is equivalent to the coordinate reference system identified by the Open Geospatial Consortium (OGC) URN urn:ogc:def:crs:OGC::CRS84”. In other words, regardless of the coordinate reference system of the GeoJSON object, the final sf object will have a CRS of WGS 84.
We will use httr functions to build the full URL, but as in Example 1 we could also just use the ArcGIS query builder to generate the full URL.
base_url <- httr::parse_url("https://irmaservices.nps.gov/arcgis/rest/services/Inventory_Vegetation/Vegetation_Map_Service_for_Congaree_National_Park/MapServer/2") # note that the vegetation layer is MapServer/2
base_url$path <- paste(base_url$path, "query", sep = "/")
base_url$query <- list(where = "1=1", # code in the query parameters
outFields = "*",
f = "geojson")
request <- httr::build_url(base_url) # compare this URL to the one we used in Example 1
tmp <- tempfile()
download.file(request, tmp)
# Because the data are in GeoJSON format we will use `geojsonsf::geojson_sf()` to convert the object to an `sf` object
geo_sf <- geojsonsf::geojson_sf(tmp)
# View(geo_sf)
plot(geo_sf["MapUnit_Name"], main = "Congaree National Park Vegetation Types") # quick plot using base R - clearly issues with an unreadable legend, so needs tweaking. Also remember this `sf` has a different CRS than does `veg_pts_sf` so one would have to be transformed before both layers could be mapped together
You can download USFS FIA data in R using their DataStore URL. The code below downloads multiple states’ PLOT tables, prints the number of rows in each table (which you’ll want to check against the DataStore records), and saves them to a data folder on my hard drive.
library(purrr)
# Create function to iterate download
downFIA <- function(state, table){
  csv_name <- paste(state, table, sep = "_")
  csv_link <- paste0('http://apps.fs.usda.gov/fia/datamart/CSV/', csv_name, '.csv')
  dat <- read.csv(csv_link) # download the table from the FIA DataMart
  print(paste(csv_name, "rows:", nrow(dat))) # check row count against DataStore records
  write.csv(dat, paste0("./data/", csv_name, '.csv'), row.names = FALSE) # save locally
}
# Create list of states of interest
states = c("CT", "RI", "DE")
# Use purrr::map to download each state's PLOT table.
map(states, ~downFIA(., "PLOT"))
Note that you can take this another step further with purrr::map2() to download from a list of states and a list of tables. But note that these are big files, and the process may crash if you try to do too much in one call.
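A sketch of that approach, using the downFIA() function defined above (the second table name is included only for illustration):
library(purrr)
states <- c("CT", "RI", "DE")
tables <- c("PLOT", "COND") # second table included for illustration
combos <- expand.grid(state = states, table = tables, stringsAsFactors = FALSE)
# map2() iterates over the two columns in parallel, one state/table pair at a time
map2(combos$state, combos$table, ~ downFIA(.x, .y))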
rFIA
The rFIA package was recently developed through a cooperative agreement between NETN and Hunter Stanke, a grad student at Michigan State University. The package was designed to make it easier to access and import FIA data into R and to perform common queries and analyses for specified areas and time series with the FIA data.
FIA datasets are huge and can take a while to download. For the purposes here, we’re going to download DE and RI’s data, since they’re the smallest.
library(rFIA)
# Download DE and RI state data tables to data folder
getFIA(states = c('DE', 'RI'), dir = "./data")
# Download reference tables (ie lookup tables)
getFIA(states = 'ref', tables = c('SPECIES', 'PLANT_DICTIONARY'))
The code below will import database objects into your global environment. To see the names of the tables within the database object, you can just write names(fia_all). To view a specific table, you can write fia_all$PLOT.
# Import all states' data that are in the data folder
fia_all <- readFIA("./data")
names(fia_all) # Print table names
table(fia_all$PLOT$STATECD) # View number of plots by state code. Should be 2.
table(fia_all$PLOT$INVYR) # Range of inventory years in the data
# Import RI data only
fia_ri <- readFIA("./data", states = 'RI')
names(fia_ri)
head(fia_ri$PLOT)
table(fia_ri$PLOT$STATECD) # View number of plots by state code. Now only 1.
table(fia_ri$PLOT$INVYR)
Other really useful features of the rFIA package are that you can clip the data to a time period or county. The code below clips the FIA data for RI to the most recent survey of each plot in their sample design.
# Clip all data to most recent sample
all_recent <- clipFIA(fia_all, mostRecent = TRUE)
# Print list of counties in RI
countiesRI
# Clip RI to Washington County
ri_Wash <- clipFIA(fia_ri, mask = countiesRI[5,], mostRecent = F)
You can also calculate common summary statistics and sampling errors using other functions within rFIA, including carbon, biomass, diversity, down woody material, seedling density, tree population estimates, tree growth and mortality, etc. I’ll show this for the tree population estimates using the tpa() function. These functions also allow you to summarize or group at multiple levels, including the state level, the plot level, the subplot level, and the tree level. You can also group by species and size class using this function. Grouping by ownership is also possible, which allows federal lands to be summarized differently than private lands. The grpBy argument also allows you to specify multiple grouping variables.
fia_ri_rec <- clipFIA(fia_ri, mostRecent = TRUE)
# State level tree population estimates
tpa_ri <- tpa(fia_ri)
# Plot level tree population estimates for most recent survey
tpa_ri_plot <- tpa(fia_ri_rec, byPlot = TRUE)
# Species level and size class level tree population estimates
tpa_ri_sppsz <- tpa(fia_ri_rec, bySpecies = TRUE, bySizeClass = TRUE)
# by Ownership
tpa_ri_own <- tpa(fia_ri_rec, grpBy = OWNGRPCD)
The package also has some built-in plotting functions that help you visualize the data created by the previous step. The plot below is the trees per acre for the state of RI over time.
head(tpa_ri)
## # A tibble: 6 x 8
## YEAR TPA BAA TPA_SE BAA_SE nPlots_TREE nPlots_AREA N
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 2005 516. 112. 10.6 3.89 66 67 95
## 2 2006 496. 111. 8.85 3.23 93 95 140
## 3 2007 501. 112. 6.89 3.29 125 126 198
## 4 2008 499. 115. 6.86 3.36 121 122 193
## 5 2009 516. 116. 6.86 3.21 116 118 191
## 6 2010 504. 118. 6.95 3.19 118 119 196
plotFIA(tpa_ri, y = TPA, plot.title = "Trees per acre")
The goal of this demonstration will be to retrieve both an NPS Aquarius continuous dataset and a USGS NWIS continuous dataset through R, then combine them to do some basic plotting.
The code required will need to be saved to a directory in your project folder or (more easily) accessed from my GitHub. The source code is Aquatic Informatics’, but I have made some modifications and added some basic tools to streamline the process. The required code can be accessed via the “AndrewBirchHydro/albAquariusTools” GitHub repository, but should be imported from GitHub in the first code chunk.
First, make sure you are connected to the VPN. Then use the chunk below to import the required tools and connect to Aquarius:
#the following toolbox can be pulled directly from my github:
#source("Aquarius basics.R") or....
source("https://raw.githubusercontent.com/AndrewBirchHydro/albAquariusTools/main/Aquarius%20basics.R")
timeseries$connect("https://aquarius.nps.gov/aquarius", "aqreadonly", "aqreadonly")
publishapiurl='https://aquarius.nps.gov/aquarius/Publish/v2'
#we also will need to load a few packages. If you do not have these, you will need to install them in the console.
library(dataRetrieval)
library(lubridate)
library(ggplot2)
library(jsonlite)
library(httr)
After running the chunk above, you should be connected and can now access NPS Aquarius stations and data.
By assigning the variable "Network" below, you can get a list of sites within a given network.
Let’s start by pulling all locations in SCPN:
#assign the network name of interest
Network <- "Southern Colorado Plateau Network"
#this function will print the available locations in that network
Print_Locations(Network = Network)
#if you want a dataframe of the locations, you can assign it to 'Locations_in_Network' here:
Locations_in_Network <- as.data.frame(Print_Locations(Network = Network))
Once you have selected the location of interest, you can use the Print_datasets() function to get information on all of the available time series for a selected location.
For this exercise, we will go with the location Rito de los Frijoles at BAND Visitor Center, with the identifier ‘BANDRIT01’. Entering this location identifier into the Print_datasets() function (easiest to copy and paste from the table above) will provide a list of datasets available at this location:
#enter the location of interest to get a list of datasets:
Print_datasets(Location = "BANDRIT01")
#you can also print available metadata for a given location based on its identifier:
Print_metadata(Location = "BANDRIT01")
In the table above, the available datasets for Rito de los Frijoles at BAND Visitor Center include air temperature, water temperature, and voltage from their respective instruments. Let’s take a look at the water temperature record.
Using the Get_timeseries2() function, we will retrieve this dataset and assign it to the variable raw_record. We can then use the TS_simplify() function to do some basic data wrangling and convert it to a cleaned-up dataframe.
Enter the identifier for the dataset below to retrieve the raw record, then use the function TS_simplify() to convert it to a usable dataframe:
#this line will pull the dataset from Aquarius
raw_record <- Get_timeseries2(record = "Water Temp.AquaticLogger@BANDRIT01")
#this function will "clean it up"
temp_data <- TS_simplify(data = raw_record)
We should now have a simplified continuous time series of water temperature at Rito de los Frijoles at BAND Visitor Center, with columns for water temperature and the date and time.
Let’s take a quick look at the dataset:
# some summary information
head(temp_data)
## date_time value
## 1 2007-12-05 09:00:00 12.292
## 2 2007-12-05 09:15:00 12.678
## 3 2007-12-05 09:30:00 13.257
## 4 2007-12-05 09:45:00 13.882
## 5 2007-12-05 10:00:00 14.218
## 6 2007-12-05 10:15:00 14.170
summary(temp_data)
## date_time value
## Min. :2007-12-05 09:00:00 Min. :-11.686
## 1st Qu.:2011-06-04 23:41:15 1st Qu.: 3.367
## Median :2014-03-25 00:52:30 Median : 9.176
## Mean :2014-09-26 14:14:41 Mean : 9.882
## 3rd Qu.:2018-05-16 18:33:45 3rd Qu.: 15.748
## Max. :2021-10-29 14:00:00 Max. : 40.761
## NA's :48
# a quick plot
ggplot(temp_data, aes(x = date_time, y = value)) +
geom_path() +
theme_bw() +
xlab("Time") +
ylab("Water Temperature (C)")
Let’s say we want to pair this temperature dataset with discharge for an analysis.
There were no discharge data available in Aquarius for this site, but after looking at a map, there is a nearby NWIS station at RITO DE LOS FRIJOLES IN BANDELIER NAT MON, NM. The station’s ID number is 08313350.
We can pull discharge data from NWIS to pair with the temperature record using the dataRetrieval
package.
The package includes a function readNWISuv()
which we will use to grab the discharge data. It requires the site number, USGS parameter code, and start/end dates, which we will assign below (the USGS code for discharge is 00060).
#00060- USGS code for discharge
siteNo <- "08313350"
parameter <- "00060"
#for start and end dates, you can enter basically all of known time to get whatever data is available, or narrow it down to a particular window. In this instance, let's go from 1900 to 2025 to ensure we pull everything available
start.date <- "1900-01-01"
end.date <- "2025-01-01"
With the required variables defined, we can now use the readNWISuv()
function to pull our discharge dataset:
(it may take a few seconds if it is a long record)
NWIS_query <- readNWISuv(siteNumbers = siteNo,
parameterCd = parameter,
startDate = start.date,
endDate = end.date)
The dataframe will come in with some goofy USGS formatting and column names, which can be reassigned easily to match our Aquarius temperature record:
colnames(NWIS_query) <- c("agency", "identifier", "date_time", "Q_cfs", "time_zone") # note: readNWISuv() returns six columns, so the last column name is left as NA (visible in the output below)
Let's take a quick look at the discharge data:
head(NWIS_query)
## agency identifier date_time Q_cfs time_zone NA
## 1 USGS 08313350 1993-10-01 06:15:00 0.72 A [91] UTC
## 2 USGS 08313350 1993-10-01 06:30:00 0.72 A [91] UTC
## 3 USGS 08313350 1993-10-01 06:45:00 0.72 A [91] UTC
## 4 USGS 08313350 1993-10-01 07:00:00 0.72 A [91] UTC
## 5 USGS 08313350 1993-10-01 07:15:00 0.72 A [91] UTC
## 6 USGS 08313350 1993-10-01 07:30:00 0.72 A [91] UTC
summary(NWIS_query)
## agency identifier date_time
## Length:383517 Length:383517 Min. :1993-10-01 06:15:00
## Class :character Class :character 1st Qu.:1996-02-26 23:15:00
## Mode :character Mode :character Median :2013-03-26 22:20:00
## Mean :2006-12-22 17:29:37
## 3rd Qu.:2015-06-17 00:20:00
## Max. :2016-10-30 05:55:00
## Q_cfs time_zone NA
## Min. : 0.000 Length:383517 Length:383517
## 1st Qu.: 0.500 Class :character Class :character
## Median : 0.820 Mode :character Mode :character
## Mean : 1.577
## 3rd Qu.: 1.180
## Max. :9500.000
# we'll plot both the water temperature and discharge records for a quick visual comparison
ggplot(data = NWIS_query, aes(x = date_time, y = Q_cfs)) +
geom_path() +
theme_bw() +
scale_y_log10() +
xlab("Time") +
ylab("Discharge (cfs)")
ggplot(temp_data, aes(x = date_time, y = value)) +
geom_path() +
theme_bw() +
xlab("Time") +
ylab("Water Temperature (C)")
Uh oh! Taking a look at the data above, there is a significant data gap in discharge from this station from around 1997-2013. This is fairly common, and emphasizes the need to always visually inspect data prior to conducting any kind of analysis.
There is some overlap between the two records, however, so let's merge them by date_time to get a combined time series where the records overlap.
# we will enter FALSE for all.x and all.y, as we only want the period where the records overlap one another:
combined_record <- merge(x = NWIS_query, y = temp_data,
by = "date_time", all.x = F, all.y = F)
We should now have a record of both water temperature from Aquarius and discharge from NWIS. Let's take a quick look at it:
head(combined_record)
## date_time agency identifier Q_cfs time_zone NA value
## 1 2012-07-05 06:00:00 USGS 08313350 0.43 A UTC 17.4
## 2 2012-07-05 06:15:00 USGS 08313350 0.40 A UTC 17.3
## 3 2012-07-05 06:30:00 USGS 08313350 0.40 A UTC 17.3
## 4 2012-07-05 06:45:00 USGS 08313350 0.40 A UTC 17.2
## 5 2012-07-05 07:00:00 USGS 08313350 0.43 A UTC 17.1
## 6 2012-07-05 07:15:00 USGS 08313350 0.40 A UTC 17.1
summary(combined_record)
## date_time agency identifier
## Min. :2012-07-05 06:00:00 Length:55844 Length:55844
## 1st Qu.:2013-06-06 23:56:15 Class :character Class :character
## Median :2014-09-05 15:07:30 Mode :character Mode :character
## Mean :2014-10-04 03:29:06
## 3rd Qu.:2016-04-10 16:37:30
## Max. :2016-10-30 05:45:00
## Q_cfs time_zone NA value
## Min. : 0.000 Length:55844 Length:55844 Min. : 0.121
## 1st Qu.: 0.370 Class :character Class :character 1st Qu.: 9.373
## Median : 0.670 Mode :character Mode :character Median :14.709
## Mean : 2.081 Mean :13.933
## 3rd Qu.: 1.140 3rd Qu.:18.140
## Max. :9500.000 Max. :35.542
# let's take a basic look at the time series together - note that since they are on different scales, a secondary axis is required
ggplot(combined_record, aes(x = date_time, y = Q_cfs)) +
geom_path(color = "blue") +
geom_path(color = "red", aes(x = date_time, y = value * 500)) +
theme_bw() +
scale_y_continuous(name = "Discharge (cfs)", sec.axis = sec_axis(~ . / 500, name = "Water Temperature (C)"))
# let's plot temperature and discharge against one another to see their relationship:
ggplot(combined_record, aes(x = Q_cfs, y = value)) +
geom_point() +
theme_bw() +
xlab("Discharge (cfs)") +
ylab("Water Temperature (C)") +
scale_x_log10()
Not much information here, is there? What if we want to see the temperature-discharge relationship by month or year?
Let's assign columns for month and year, and recreate the plot based on those parameters.
# This will require the lubridate package
library(lubridate)
# this will create a new column with the numeric month derived from the date_time column as a factor
combined_record$month <- as.factor(month(combined_record$date_time))
# this will create a new column with the year derived from the date_time column as a factor
combined_record$year <- as.factor(year(combined_record$date_time))
Now let's recreate our plot, using this new month column for color and faceting by year.
ggplot(combined_record, aes(x = Q_cfs, y = value, color = month)) +
geom_point() +
theme_bw() +
xlab("Discharge (cfs)") +
ylab("Water Temperature (C)") +
scale_x_log10() +
ggtitle("2013") +
facet_grid(year ~ .)
help() - searches help files for packages loaded into the namespace
? - searches help files for packages loaded into the namespace
?? - searches all help files regardless of whether or not they are loaded
Operators: =, <-, ==, !, !=, >, >=, <, <=, %in%, |, &, %>%, .
Base R: library(), [], [[]], rm(), function(){}, data.frame(), file.path(), format(), object.size(), for(){}, if(){}else{}, return(), vector(), read.csv(), class(), is.numeric(), is.na(), mean(), head(), paste0(), length(), c(), nrow(), names(), setNames(), colnames(), rpois(), quantile(), plot(), boxplot(), glm(), all(), do.call(), rbind(), apply(), tapply(), sapply(), lapply()
tidyverse: filter(), mutate(), case_when(), bind_rows(), select(), rename(), group_by(), across(), summarize(), pivot_longer(), unite(), where(), starts_with(), any_of(), read_csv(), map(), map2(), map_dbl(), map_if(), discard(), set_names()
lubridate: date(), mdy(), mdy_hms()
Other: add_predictions(), microbenchmark(), profvis(), ggplot(), geom_point(), geom_line(), facet_grid(), ggsave()
Writing code is a collection of evolutionary continua along introductory and advanced methods, new and old methods, less efficient and more efficient methods, etc. This is all driven by an evolution in the availability of functions that do different things as well as an evolution in your coding style. Our goal in this course is to provide you with tools that will expand your toolkit and help you evolve your coding approaches. The only thing you will ever master in R is the ability to read R code and the ability to try new ways to do things - it is constantly changing.
This module will provide a look at how to make a function and then at how you apply that function in iteration. The goal is to equip you with the two most powerful tools any R user can have: functions and iteration.
Thomas - A function is a container for a series of steps that are performed on data. It has some fixed expectations about input and output - another way of saying that is that a function expects a specific pattern(s) for input and for output. A function is also an expression of what another person thinks the right way to do something is. The nice thing is that if you don't like all that, you can write your own function, but be selective about that.
JP - Functions are a way of taking a large problem and breaking it down into simpler discrete steps. Each function can just focus on one step and makes it easier to do that step in isolation. You can then reuse the functions to help solve new problems in the future.
Someone writes a for
loop as their first method of iteration. After learning this, most people typically write a for
loop like what we looked at in the intro course, where the for
loop contains the full expression of code that will be implemented.
After a while they begin to recognize that they are reusing or even rewriting that code in the same script or other projects. So they wrap it in a function so that they can reuse it anywhere.
They quickly realize that their for
loop wrapped in a function isn’t as flexible or as useful as they thought it would be so they start breaking it up into smaller functions. Perhaps they keep the for loop.
They remove the for loop from inside the function. But they may be putting some combination of the function inside of a for loop to do what they need to do.
They get tired of having to remember the tricks needed to make the for loop work the way they want each time. They know what they want to do to the data and just want to iterate that over the data and get an expected result without too much ceremony. So they turn to a “functional.”
Functionals continue to evolve as new packages are released or redeveloped and the coder seeks simpler, faster, or clearer ways to accomplish their goals.
For loops aren’t bad; but duplicated code can conceal important differences, and why do more work than you have to? - Hadley Wickham
Early on, many functions start out as for loops. A for loop has a for()
statement which contains a definition of the iteration conditions, usually something like i in vec
. Then it has the body of the loop wrapped in {}
, which contains the expressions that will evaluate i
.
Consider this simple for loop that prints a dataframe to the console:
nIter=10 #number of iterations
for (i in 1:nIter) {
print(data.frame(x=i + 1, y=i))
}
## x y
## 1 2 1
## x y
## 1 3 2
## x y
## 1 4 3
## x y
## 1 5 4
## x y
## 1 6 5
## x y
## 1 7 6
## x y
## 1 8 7
## x y
## 1 9 8
## x y
## 1 10 9
## x y
## 1 11 10
If we want to use those data for something else, we need to capture them into a variable that exists in the global environment. You will likely see a few different approaches to get it out of the loop. The easiest way is to ‘grow’ an object by appending data to the end of it. This is intuitive, but it is the wrong way because of how [R] works. Growing a dataframe forces [R] to make a copy of the object and allocate another block of memory to that copy. This takes time. Bigger and bigger objects take longer and longer as [R] searches for free blocks of memory.
#growing a dataframe with rbind
nIter=10000
OutDat <- NULL # define an empty variable.
system.time(for (i in 1:nIter) {
d <- data.frame(x=i + 1,y=i)
OutDat<-rbind(OutDat,d)# append the previous steps' outputs with this step's output
}) #4.47 seconds on my system
#growing a dataframe with append()
OutDat <- NULL # define an empty variable.
system.time(for (i in 1:nIter) {
d <- data.frame(x=i + 1,y=i)
OutDat<-append(OutDat,d)# append the previous steps' outputs with this step's output
}) #4.93 seconds on my system
system.time()
is a built-in function for measuring how long it takes to execute a set of expressions. Both of these produce the desired output; however, they take about 4.75 seconds to execute on my machine. Compare that to preallocated output:
nIter=10000 # number of iterations
OutDat <- data.frame(x=rep(NA,nIter),y=rep(NA,nIter)) # define a preallocated dataframe
system.time(for (i in 1:nIter) {
OutDat[i,] <- data.frame(x=i + 1, y=i)
}) #2.5 seconds on my system
rm(OutDat)
Even on this toy example, the preallocated for
loop executes in half the time. However, we had to put a fair amount of thought into preallocating the output object and add code to assign the output.
In short, if you are outputting something from a for
loop you should always preallocate. Then you should go find a functional that does what you want.
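As a rough sketch of where this is headed, a functional such as lapply() can do that bookkeeping for you (same toy output as above, nothing grown or preallocated by hand; this is just a preview of what follows):
# build a list of one-row dataframes, then bind them together once at the end
OutDat <- do.call(rbind, lapply(1:nIter, function(i) data.frame(x = i + 1, y = i)))
head(OutDat)
rm(OutDat)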
First we need to create the file names and paths we want to read in.
#create a few variables that will be used for file path manipulation later on
inPath <- "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/" #location of the data
fNames <- c(
"APIS01_20548905_2021_temp.csv",
"APIS02_20549198_2021_temp.csv",
"APIS03_20557246_2021_temp.csv",
"APIS04_20597702_2021_temp.csv",
"APIS05_20597703_2021_temp.csv"
) #names of the files
Now we will use a basic for
loop to read the data into a preallocated list. The result will be a list of dataframes. Lists of data are handy for processing a lot of similar data. Sometimes they are inefficient if the individual files are too large. But, for today, we are OK.
#preallocate the output object
HoboData<-vector(mode = "list", length = length(fNames))%>%
setNames(.,fNames)
#read in the data
for(i in fNames){
HoboData[[i]]<-read.csv(file.path(inPath,i), skip = 1, header = T)[1:3]%>%
setNames(.,c("idx", "DateTime", "T_F"))
}
#str(HoboData) #uncomment to inspect
format(object.size(HoboData), units="Mb") #how many Mb of memory is this object now occupying
## [1] "2.2 Mb"
names(HoboData)
## [1] "APIS01_20548905_2021_temp.csv" "APIS02_20549198_2021_temp.csv"
## [3] "APIS03_20557246_2021_temp.csv" "APIS04_20597702_2021_temp.csv"
## [5] "APIS05_20597703_2021_temp.csv"
head(HoboData[[1]])
## idx DateTime T_F
## 1 1 03/14/21 17:00:00 78.274
## 2 2 03/14/21 18:00:00 36.531
## 3 3 03/14/21 19:00:00 28.762
## 4 4 03/14/21 20:00:00 26.668
## 5 5 03/14/21 21:00:00 26.031
## 6 6 03/14/21 22:00:00 25.390
This is now a list of dataframes and it is taking up ~2.2 Mb of memory - not too much. The next step is to actually do the QA/QC flagging on the object HoboData
.
Next we will write a loop that will flag the data based on some criteria and then write it out to a csv file on our computer.
OutPath <- paste0(getwd(), "/hobo_outputs/") # or put your preferred path here (we will use this later)
# set up some arbitrary QA/QC thresholds
TempUpper <- 40
TempLower <- (-40)
#not outputting anything so we aren't going to preallocate.
for (i in fNames) {
# 1 Read in the data
OutPut <- HoboData[[i]] %>% #read in the data and define a output
# 2 Generate your output in this case, some QA/QC flagging
dplyr::mutate(
# Create a variable that collects temperature change flags. This can be thought of as the 'first derivative' of the data.
TChangeFlag = ifelse(
c(abs(diff(T_F, lag = 1)) > 10, FALSE), #logical vector
yes = "D10", #TRUE case
no = NA), #FALSE case
# Create a variable that captures flags for high and low temp data.
Flag = dplyr::case_when(
is.na(T_F) ~ "MISSING",
T_F > TempUpper ~ "High",
T_F < TempLower ~ "Low"
)
) %>%
# unite is like a 'paste mutate' that combines columns and replaces them with 1 new column
tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "; ", remove = TRUE, na.rm = TRUE)
# 3 output the data
dir.create(OutPath, showWarnings = F)
write.csv(OutPut,
paste0(OutPath, gsub(".csv", "_QC.csv", i)),
row.names = F, quote = F
)
rm(list=c("OutPut","i"))
}
To make sure we know what is going on here, let’s look at some newish functions (maybe) and talk through what each piece does.
mutate()
- Is a dplyr function for creating new columns in a dataframe
diff()
- Is a handy function for calculating the difference between each element of a vector and the value preceding it. The preceding value is defined by the lag
argument; lag = 1
means the immediately preceding value (see the short example after this list).
case_when()
- This is basically our if/else statement
unite()
- This will combine multiple columns of data into a new column, and replace those columns with that new column.
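For a quick sense of what diff() with lag = 1 returns (these are the differences the flag compares to the threshold of 10 used above):
# each value minus the value immediately before it
diff(c(20, 22, 35, 36), lag = 1)
## [1]  2 13  1
Here the middle jump of 13 degrees exceeds the threshold and would receive the D10 flag.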
In R (and computer science) there are two types of functions: “pure” functions and “impure” functions. This is a definition that describes their behavior, not their quality. For our purposes, an “impure” function is one that has “side effects” (e.g. modifies your system (write.csv) or the global environment) or one that can be affected by something in the global environment (e.g. it depends on a variable with a certain name being present in the global environment that is not named in the argument list). A pure function doesn’t do this and only depends on the input arguments and does not modify the state of anything else in the global environment.
In general, and especially if you are preparing packages, you should try and separate pure and impure behaviors so that pure functions are totally pure and impure functions are totally impure. It helps avoid unexpected surprises when executing a function. This is a goal, not a requirement of creating a function, and is perhaps less critical if you are developing functions for your personal use and not developing a package.
Pure - if it is some sort of mathematical function it is probably pure: sqrt(), log(), mean(), etc.
Impure - functions that change something outside of their own environment, such as write.csv() and plot().
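To make that distinction concrete, here is a minimal sketch with two made-up functions (not part of the training code): the first depends only on its input, the second writes a file as a side effect.
cv <- function(x) sd(x) / mean(x) # pure: the result depends only on x
cv_to_csv <- function(x, file) {  # impure: it writes to the hard drive
  write.csv(data.frame(cv = cv(x)), file, row.names = FALSE)
}
cv(1:10)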
We will point out some examples of this in the functions we are about to develop from that for loop.
A function usually has 3 components - the function name, arguments, and the body/source code/expressions. Let’s look at lm()
as an example:
name - In this case it is simply “lm” which stands for linear model. Function names should be simple and somewhat intuitive. Some say they should be verbs. You should also be careful not to give your function the same name as something that exists in base R or in a package that you might commonly use. This can cause conflicts in your environment and unexpected results.
arguments - These are variables that are created in the parentheses and passed into the body of the equation for evaluation. Arguments tell the function what the data are and they tell it how to handle the data. In this case lm(formula, data)
is telling the lm
function that the first thing will be a formula and the second item will be a data object.
Almost all functions have more than 1 argument; however, most of the time you are only specifying 2 or 3 when you call the function. If you want to know what arguments a function can accept, help()
will take you to a hopefully useful explanation. In the help, you can see what arguments have defaults, what those defaults are, and when you should think about changing the defaults. If help is unhelpful, you can always put the function name in the console and it should show you the arguments and their defaults.
source code - This is what the function does (also called ‘source’, body, expressions). 95% of the time you can safely ignore the source. However, it is useful to look at the source when you want to understand why a function is doing what it is doing, modify a function, see what arguments it can accept, what the defaults are, etc. The source can be accessed by typing the function name without parentheses in the console.
Unlike a for loop, R will return the last value of the function body automatically. This can be accomplished with an explicit return()
statement or by writing the object name as the last line of code.
#run these in the console without the () to see what lies underneath
lm
Some functions show R source, some don’t. If they don’t, that usually means they are compiled in C. lm
has quite a lot to show us. This is just a handy trick for knowing what is going on and possibly why something isn’t working. Note, here it returns the output z
(aka the model object) by simply having z
as the last line.
First things first. How do we create a function? There is a function to create functions, it is called function()
. It can be used in two ways, either to create a named function or an anonymous function. The difference between the two is that a named function can be reused by just calling the name, while an anonymous function will need to be copied and pasted each time you want to use it. For now, we are going to focus on named functions, but will show you an anonymous function when we use one later on.
mean2<- #Tell [R] that I want this new function to be named "mean2"
function(x,na.rm=T){ #arguments are x (the data) and na.rm; the { begins the function source code / expressions
mean(x,na.rm=na.rm) #in the mean function change the default for na.rm=T
}#close function
mean2(c(1:9,NA))
## [1] 5
Let’s unpack this a little:
mean2<-
- This bit here is what makes this a named function. We assign the name of the new function to mean2
.
function(x, na.rm=T){
this defines our new function with the arguments x
and na.rm
. x
is our data and na.rm
is the value of na.rm
we want to pass to mean
.
x
- it is important to note that no initial or default value is supplied for x. This is because we want to ensure that the user inputs a value for this each time. The function will not run without it.
na.rm=T
- this sets the default value of na.rm in our function to be TRUE
. Because we have defined it in the list of function arguments, we no longer have to supply it when we are calling the function. But we do have the option of specifying it if we want to change it.
mean(x,na.rm=na.rm)
- this passes our data to x and passes the value entered for na.rm
to na.rm
.
Common questions:
Do I need the {}
? Not if it is on 1 line.
Do I need to name my function? Not if you don’t want to use it again.
Do I have to name (e.g. data=d
) the variable when I pass the argument? No, but it doesn’t hurt. If you aren’t naming things, then the next question is critical.
Does variable order matter? If you name everything when you call the function, no. If you aren’t naming everything (who does) then it matters for arguments with no defaults.
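To see how naming and order play out, here are three equivalent calls to the mean2() function defined above:
mean2(c(1:9, NA), na.rm = TRUE)     # positional x, named na.rm
## [1] 5
mean2(na.rm = TRUE, x = c(1:9, NA)) # everything named, so order doesn't matter
## [1] 5
mean2(c(1:9, NA))                   # na.rm falls back to its default (TRUE)
## [1] 5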
Let’s start evolving that original for
loop into a function (or functions) so that we can rerun it on other Hobo data in the future without needing to copy and paste this. As written, the primary purpose of the for loop was to flag and write out the data. A logical first step is to basically just wrap the for
loop in a function(){}
declaration and give it a few arguments.
Note as I progress through this, I am going to comment out non-essential bits of code and remove old comments so that you can see how the code evolves. It will probably get a little messy, but bear with me.
#OutPath <- paste0(getwd(), "/hobo_outputs/")
#TempUpper <- 40
#TempLower <- (-40)
HoboQAQC <- function(data, #The data of course
fNames = names(data), #extract the file names that are supplied to data as the default filenames
OutPath = paste0(getwd(), "/hobo_outputs/"), #set up an output path character string.
TempUpper = 40, #Make the temperature flag thresholds into one of the function's arguments.
TempLower = -40) {
for (i in fNames) {
# 1 Read in the data
OutPut <- data[[i]] %>% # Note, this is now called 'data'
# 2 Generate your output in this case, some QA/QC flagging
dplyr::mutate(
TChangeFlag = ifelse(
c(abs(diff(T_F, lag = 1)) > 10, FALSE),
yes = "D10",
no = NA
),
Flag = dplyr::case_when(
is.na(T_F) ~ "MISSING",
T_F > TempUpper ~ "High",
T_F < TempLower ~ "Low"
)
) %>%
tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "; ", remove = TRUE, na.rm = TRUE)
# 3 output the data
dir.create(OutPath, showWarnings = F)
write.csv(OutPut,
paste0(OutPath, gsub(".csv", "_QC.csv", i)),
row.names = F, quote = F
)
rm(list=c("OutPut", "i"))
}
}
#HoboQAQC(data=HoboData) #uncomment to call function
By wrapping it in function(){}
and setting some default values, we can call this by simply invoking HoboQAQC(data=HoboData)
. This is basically identical to our original for
loop, but we have made some changes. The primary change is that we can put any list that has the same columns and list structure into this function now.
Let’s break it down:
HoboQAQC <- function(
- declare a function that is going to be named HoboQAQC
. It will take the arguments that are listed in function()
data,
- The first argument of our function. Something has to be supplied here each time in the form of data = VariableName
or just VariableName
. If just a variable name is supplied, the function will assume that the first thing it receives is the data. What this does is in the environment of that function, it makes a variable named data
that contains a copy of the data that was in VariableName
.
fNames = names(data),
- This supplies the list of file names that we would like to use for naming output. The default is to take those names from the names of the data names(data)
. This assumes that the list was named. If it was not named you could supply a separate vector of names that corresponds to the data here.
OutPath = paste0(getwd(), "/hobo_outputs/"),
- Define the output datapath. The default is to create something under your current working directory. You could supply any path here.
TempUpper = 40,
and TempLower = -40)
- These are the same as before, but we have pulled them into the definition of the function.
{
- defines the start of the function environment. All variables defined in here and in the function argument list are available within the function environment only. The function is closed at the bottom with a corresponding }
.
What are the impure aspects of this function? The obvious one is that it is creating directories and writing out files to the hard drive.
Let’s say we want to go back and do this again next year. We could easily reuse this. However, this is not as flexible as it could be. This is pretty rigid in terms of what things must be named (e.g. temperature must be named T_F
), not all of the thresholds can be changed, and it is not putting our data into an environment variable.
Now, let’s turn our previous flagging loop into a function that we can reuse later. First, we are going to take the for loop apart and focus on creating a function that will flag the data; then we will create one that writes the data.
The functions shown here are to demonstrate function development and should not be used for actual data QA/QC.
#OutPath <- paste0(getwd(), "/hobo_outputs/")
#for (i in fNames) {
HoboFlag <- function(data, #argument that accepts the name of the input data, assumed to be a dataframe.
TempName="T_F", #Gives us some flexibility if we want to supply a different column name
TChange=10, #Allows to provide a value for the temperature change flagging threshold
TempUpper=40, #provide a value for the upper temperature limit flag
TempLower=-40) { #provide a value for the lower temperature limit flag
#OutPut <- HoboData[[i]] %>%
OutPut<-dplyr::mutate(data,
TChangeFlag = ifelse(
c(abs(diff(get(TempName), lag = 1)) > TChange, FALSE),
yes = paste0("D", TChange),
no = NA
),
Flag = dplyr::case_when(
is.na(get(TempName)) ~ "MISSING",
get(TempName) > TempUpper ~ "High",
get(TempName) < TempLower ~ "Low"
)
) %>%
tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "; ", remove = TRUE, na.rm = TRUE)
return(OutPut) # because this is a function, we have to tell it what it is returning. In this case it is the output variable we defined above.
rm(list=c("OutPut","i")) #optional cleanup
}
That gives us a function that will flag the data. By adding those arguments to the argument list we have given ourselves some flexibility to change the flagging criteria in the function. We gave these defaults, so we wouldn’t have to enter them each time unless we want to change them.
There are two new things we did here:
get()
- This searches for an object name supplied as a character string or symbol. If we were to supply it as a symbol T_F
in the function arguments, then the function would look for T_F
in the global environment and wouldn’t find it. By providing it as a character string, it is treating it as a piece of data for now. get
then uses that to look within the environment of the dplyr
pipe to find the object with the corresponding name.
return()
- We talked about this a little earlier. If you want something exported from a function into the global environment you can tell it to return()
something.
Is HoboFlag
pure or impure?
I am not sure if it is entirely pure, but it is much closer to that side of the spectrum.
Arguably, the data writing step was a completely separate task. So, we should create a second function to tackle that.
The functions shown here are to demonstrate function development and should not be used for actual data QA/QC.
#Writes the data out and creates directories.
HoboWrite.csv <- function(i, #an index variable, needs to be the filename.
data, #this is the input data, it could be a list or a dataframe, but ultimately we will need to output a dataframe from this function.
OutPath=paste0(getwd(), "/hobo_outputs/"), #same as before
...) { # ellipsis will let us pass in some optional arguments on the columns we want down below.
dir.create(OutPath, showWarnings = F, recursive = T) # recursive=T will create all sub-directories as well
# selects the data that may have been passed from an lapply pipe or a for loop.
if (is.data.frame(data)) {
df <- data[...] # data passed by a for loop
} else {
df <- data[[i]][...] #data passed by an lapply pipe
}
# write out the data
write.csv(df,
paste0(OutPath, gsub(".csv", "_QC.csv", i)),
row.names = F, quote = F
)
rm(list=c("df","i"))
}
This changed a bit from that original version. An if/else
statement was added. This was done to allow some flexibility so this function could work with either a for
loop or a functional named lapply()
(we will get to that in a minute) and still track the file name without a whole lot of issue.
Is this pure or impure? This is now much further over onto the impure side of the spectrum.
Some questions you might have: How many arguments can you have? There is no limit but too many arguments that have to be entered kind of defeats the purpose of a function. It also makes it harder to use. It is fairly common to see 1-3 required arguments and then a slew of optional ones.
Can my functions call other functions? Yes, totally common. lm()
really just formats the output and calls lm.fit()
to do the heavy lifting.
Should I have functions that depend on multiple packages? It depends on the nature of your package. But in general no. You want to keep it to just a few packages. Most of the packages you are using are either compiled in C or are driven by base R. When you start writing functions that depend on a pyramid scheme of other functions, you are setting yourself up for unstable functions. You are not in control of how other people decide to change (or worse not update) their packages.
But moving on, let’s see how we did and use the general framework of our original for
loop.
#OutPath <- paste0(getwd(), "/hobo_outputs/") #change this if you want it going elsewhere
for (i in fNames) {
HoboFlag(HoboData[[i]])%>%
HoboWrite.csv(i, data= . , OutPath = OutPath) #I want the data to pass into that second position so I am using '.'
rm(i) #optional housekeeping
}
Because we have created functions with a variety of arguments, if we want to change all or some of the flagging values we are using, we don’t need to do a whole lot other than supply different values to those variables.
Note: Giving it a different value when you call the function (see below) only changes the value of that argument for that one call of the function. The default will be as before in the next call of the function if it is not changed.
OutPath <- paste0(getwd(), "/hobo_outputs/")
for (i in fNames) {
HoboFlag(HoboData[[i]], TChange=15)%>%
HoboWrite.csv(i, data= . , OutPath = OutPath)
rm(i) #optional housekeeping
}
Wouldn’t it be nice if we weren’t setting up that for
loop every time and it was easier to get the data back out of this?
“Functional programming” isn’t about writing code that works or is functional; it is a paradigm for coding. The end goal of functional programming is more stable, transparent, and reliable code. The basic idea is: we have these nice functions that behave in very specific and predictable ways, so why stuff them into unruly for loops and spend time trying to figure out how to iterate them and get the data out?
For our purposes, a functional is a higher-order function that takes a function (and data) as its input (maybe a few other things as well) and returns a predefined output (e.g. vector, list, dataframe, etc.). Each functional has a fixed type of input and a fixed type of output. This is one of the reasons why functionals are preferred over for
loops - no surprises. Functionals have an expected behavior for iteration and output, so it is just a matter of picking the functional (or combination of functionals) that does what you want. This course focuses on the apply
and map
families.
One family of functionals that often replaces for()
is the apply
family of functions. There are a lot of options for what to use depending on your inputs, what you want to do, and expected outputs. Two of the primary advantages of functionals is a known mode of iteration and a known output.
apply()
- operates on things resembling a dataframe. It will operate on everything in that object so it has to be reduced to only data classes that the vectorized function you want to apply can handle. It can operate by row MARGIN=1
or by column MARGIN=2
and the output will be a named vector. (It can also operate by each element of the dataframe if you set the MARGIN=c(1,2)
, I have difficulty imagining why you would want to do that.)
apply(mtcars, MARGIN = 2, FUN=function(x) sd(x)/mean(x)) #anonymous function that calculates the coefficient of variation
But perhaps we prefer list output to vector output. lapply()
could be called to operate on a dataframe, in which case it will operate on each column. This is similar to apply
with MARGIN=2
but with list output.
lapply(mtcars, FUN=function(x) sd(x)/mean2(x))
In general, I do not find myself frequently using apply
families of functions to operate on single dataframes. There are more convenient options available for those types of tasks in dplyr
and other coding frameworks in R.
lapply()
- Takes a list (or vector) as input and provides a list as output. So, we could pass it a list of vectors or a list of dataframes.
data_list<-list(A=1:10, B=11:20, C=c(21:29,NA))
lapply(data_list, mean2)
sapply()
- Works exactly the same as lapply, but will attempt to simplify the output to a vector, matrix, or array if possible. This is useful when that is the output you need.
data_list<-list(A=1:10, B=11:20, C=c(21:29,NA))
sapply(data_list, mean2)
You will find and learn about apply functions as you need them. In my work, I find lapply()
to be the most versatile for the types of data operations I need to do.
Also keep in mind that many operations are already vectorized. What we can accomplish by applying mean to mtcars could just as easily be done with colMeans(). Google and Stack Overflow are a good way to find these simpler solutions.
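For instance, both of these compute per-column means of mtcars; the vectorized colMeans() is simpler than spelling out the iteration with apply():
colMeans(mtcars)
apply(mtcars, MARGIN = 2, FUN = mean) # same result, more ceremony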
Now we are going to step back to those functions we created, HoboFlag()
and HoboWrite.csv()
, and call them with lapply().
OutPath <- paste0(getwd(), "/hobo_outputs/")
lapply(HoboData, FUN=HoboFlag)%>%
lapply(names(.), FUN=HoboWrite.csv, data=., OutPath=OutPath)
If we want those data available in the global environment for further manipulation, we just run the HoboFlag function through lapply()
separately and assign its output to a variable in the environment.
d<-lapply(HoboData, FUN=HoboFlag)
#str(d) #uncomment to inspect
Once we are done with further manipulation, we have the option to pass it back to the HoboWrite.csv()
function.
OutPath <- paste0(getwd(), "/hobo_outputs/")
lapply(names(d), FUN=HoboWrite.csv, data=d, OutPath=OutPath )
Or, we could get wild and pass it to lm, then extract coefficients, and then…
Sites<-names(d)[-3] #remove a problem site with no data
lapply(Sites, FUN=function(x) lm(T_F~idx, data=d[[x]]))%>% #anonymous function
lapply(coef)%>% #extract slope and intercept
bind_rows()%>% #glue all the list items together into 1
dplyr::mutate(Sites=Sites,.before="(Intercept)") #add the new column before the existing column named "(Intercept)"
We are dealing with small problems. Small problems seldom push into the limits of your computer. Large datasets and complex functions can take a long time to process (even after you fully optimize them). In R, this is primarily a function of your processor speed. R is only running on a single thread (thread not core). In other words, for something like lapply()
or map()
it processes each iteration sequentially on a single thread. It doesn’t need to be that way. My computer has 4 physical cores and 2 threads (logical cores) per physical core. You could be executing different independent iteration steps on separate cores/threads and recombining the results. This is called ‘parallel processing’.
But the good news is that the future family of packages provides a unified framework for conducting parallel processing in R, with solutions for both the apply and map families. For the apply family it is the future.apply package, and for the map family it is furrr.
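We will use future.apply below. For the map family, a minimal sketch of the furrr equivalent would look like this (assuming the furrr package is installed; future_map() is a drop-in parallel version of map()):
library(future)
library(furrr)
plan(multisession, workers = availableCores() - 1) # leave one thread free, as below
flagged <- future_map(HoboData, HoboFlag)          # parallel equivalent of map()/lapply()
plan("sequential")                                 # close the workers when done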
You can experiment with different ways to do this; below we compare running our HoboFlag() function over the whole list sequentially with lapply() versus in parallel with future_lapply().
#You don't need to run this. I have run this for you below.
library(future.apply)
HoboData2 <- c(rep(HoboData, 5)) #optionally make the dataset larger. This will take the below code longer to run.
plan(multisession, workers = availableCores()-1) # initiate a multisession; workers is set to 1 fewer than the max detected, which reduces the chance of overwhelming the system
microbenchmark::microbenchmark(
"sequential"=lapply(HoboData2, FUN=HoboFlag),
"parallel"=future_lapply(HoboData2, FUN=HoboFlag),
times=5, #number of times to do this
unit="s" #units of time needed to completely evaluate each expression
)
plan("sequential") # Important! close the multicore session by setting it back to sequential.
rm(HoboData2)
I will let you run this on your own machines if you are interested. I wouldn’t recommend doing this now; it takes about 5-10 minutes, and results will vary by computer. My run says that parallelization was 23% faster than sequential. Not a huge speed improvement, but something to keep in mind to try if a chunk of code is taking ~30 minutes to execute.
That may seem excessive, but large lists can easily be built during modeling exercises. If you run this on smaller datasets or faster tasks, you may see little gain, or you may even discover that it is slower. This is because there is ‘overhead’ involved in initiating and sending tasks to threads. As such, parallelization only returns a benefit when the individual tasks take longer than the sum total of overhead required to run a parallel session.
One important thing to remember is that initiating a parallel session can slow down your computer significantly if not done properly. For that reason, be sure to set the number of cores or threads you plan on using to nCores-1. In the code above we do this with availableCores()-1
when setting up the plan(). On my machine availableCores()
returns the number of logical cores (threads), which is 8.
Custom functions are the most useful when you have a complicated task that you wish to do many times.
Step 1: Write and test the function.
Step 2: Iterate the function
This section of the course focuses on using the purrr
package to iterate functions.
We will need the tidyverse
, lubridate
, and modelr
libraries:
library(tidyverse)
library(lubridate)
library(modelr)
purrr package - data cleaning
General purrr
references:
Iteration Chapter in R for Data Science
purrr
is a package that is part of the tidyverse
that focuses on iterating over lists
, such as lists
of data.frames
or model results. Combining custom functions and this package can be extremely useful in quickly doing repetitive data manipulation, analysis, and visualization. If your task basically consists of starting with a lot of datasets (or one dataset that you then divide by site, species, etc.) and then doing the same thing to all the datasets, purrr
is a good option to consider.
If you are starting out with one data.frame
and need to make it into a list
of data.frames
based on a column, you can use data %>% split(f=as.factor(data$column))
. The dplyr
pipeline data %>% group_by(column) %>% group_split()
can also do that, but if you want to name the list elements you will have to do that afterwards by hand using the names()
function, like names(MyList)<-c("NameOne","NameTwo")
, which can be annoying.
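Here is a minimal sketch of split() with made-up toy data (the column name Site is purely for illustration):
d <- data.frame(Site = c("A", "A", "B"), Value = 1:3)
split(d, f = as.factor(d$Site)) # a named list: $A has 2 rows, $B has 1 row
rm(d)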
We will walk through an example analysis using purrr
functions to do iteration.
#file names
fNames <- c(
"APIS01_20548905_2021_temp.csv",
"APIS02_20549198_2021_temp.csv",
"APIS03_20557246_2021_temp.csv",
"APIS04_20597702_2021_temp.csv",
"APIS05_20597703_2021_temp.csv"
)
# Make path to files
fPaths <- paste0("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/", fNames)
First, I will read in all the data. My plan is to make a list
of data.frames
and then carry out all the analysis on that list. set_names()
is used to name the fPaths
vector, and those names will be carried forward to the data list. The map
function will take each element of fPaths
and feed them one by one into read_csv()
. The advantage of doing iteration this way is that it is more concise than a for()
loop, but somewhat easier to read than the lapply
function. map()
takes takes in a list
or a vector
. You then specify a function to use with a ~
in front of it. The .x
tells map where to put the data it gets from fPaths
.
fPaths<- set_names(fPaths, c("APIS01", "APIS02", "APIS03", "APIS04", "APIS05"))
fPaths # This thing is now a vector where each element has a name
## APIS01
## "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS01_20548905_2021_temp.csv"
## APIS02
## "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS02_20549198_2021_temp.csv"
## APIS03
## "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS03_20557246_2021_temp.csv"
## APIS04
## "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS04_20597702_2021_temp.csv"
## APIS05
## "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS05_20597703_2021_temp.csv"
## import the site data into a list where each element comes from a different file
intensity_data_raw <- fPaths %>% map( ~ read_csv(file = .x, skip = 1))
In this case we have a list
of datasets.
Now we can examine this data:
class(intensity_data_raw) # its a list
## [1] "list"
length(intensity_data_raw) # 5 elements
## [1] 5
class(intensity_data_raw[[1]]) # each element is a data.frame (well, tibble actually)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
intensity_data_raw %>% map_dbl(~ nrow(.x))
## APIS01 APIS02 APIS03 APIS04 APIS05
## 4554 6287 0 6286 6287
intensity_data_raw %>% map(~ colnames(.x))
## $APIS01
## [1] "#"
## [2] "Date Time, GMT-05:00"
## [3] "Temp, °F (LGR S/N: 20548905, SEN S/N: 20548905, LBL: APIS01)"
## [4] "Bad Battery (LGR S/N: 20548905)"
## [5] "Coupler Attached (LGR S/N: 20548905)"
## [6] "Host Connected (LGR S/N: 20548905)"
## [7] "Stopped (LGR S/N: 20548905)"
## [8] "End Of File (LGR S/N: 20548905)"
##
## $APIS02
## [1] "#"
## [2] "Date Time, GMT-05:00"
## [3] "Temp, °F (LGR S/N: 20549198, SEN S/N: 20549198)"
## [4] "Intensity, lum/ft² (LGR S/N: 20549198, SEN S/N: 20549198)"
## [5] "Bad Battery (LGR S/N: 20549198)"
## [6] "Good Battery (LGR S/N: 20549198)"
## [7] "Coupler Attached (LGR S/N: 20549198)"
## [8] "Host Connected (LGR S/N: 20549198)"
## [9] "Stopped (LGR S/N: 20549198)"
## [10] "End Of File (LGR S/N: 20549198)"
##
## $APIS03
## [1] "#"
## [2] "Date Time, GMT-05:00"
## [3] "Temp, °F (LGR S/N: 20557246, SEN S/N: 20557246)"
## [4] "Intensity, lum/ft² (LGR S/N: 20557246, SEN S/N: 20557246)"
##
## $APIS04
## [1] "#"
## [2] "Date Time, GMT-05:00"
## [3] "Temp, °F (LGR S/N: 20597702, SEN S/N: 20597702)"
## [4] "Intensity, lum/ft² (LGR S/N: 20597702, SEN S/N: 20597702)"
## [5] "Bad Battery (LGR S/N: 20597702)"
## [6] "Good Battery (LGR S/N: 20597702)"
## [7] "Coupler Attached (LGR S/N: 20597702)"
## [8] "Host Connected (LGR S/N: 20597702)"
## [9] "Stopped (LGR S/N: 20597702)"
## [10] "End Of File (LGR S/N: 20597702)"
##
## $APIS05
## [1] "#"
## [2] "Date Time, GMT-05:00"
## [3] "Temp, °F (LGR S/N: 20597703, SEN S/N: 20597703)"
## [4] "Intensity, lum/ft² (LGR S/N: 20597703, SEN S/N: 20597703)"
## [5] "Bad Battery (LGR S/N: 20597703)"
## [6] "Good Battery (LGR S/N: 20597703)"
## [7] "Coupler Attached (LGR S/N: 20597703)"
## [8] "Host Connected (LGR S/N: 20597703)"
## [9] "Stopped (LGR S/N: 20597703)"
## [10] "End Of File (LGR S/N: 20597703)"
I used map_dbl()
instead of map()
to look at the number of rows. That function is just like map()
except it will return a numeric vector
rather than a list
. In this case it is handy as I already know that nrow()
will return a number, and a vector
is a more compact way of showing that. Note that all the lists and the elements of the row number vector are named with the site code.
Other options for map
are:
map_lgl()
for a logical vector
map_chr()
for a character vector
map_int()
for an integer
map_dfr()
for making a data.frame
by binding rows
map_dfc()
for making a data.frame
by binding columns
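For example, map_dfr() can collapse a quick per-site summary into a single data.frame (here just counting the rows in each raw dataset):
intensity_data_raw %>% map_dfr(~ tibble(rows = nrow(.x)), .id = "Site")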
When we look at the raw files from the loggers we see they are a Horror!
One file has no data.
One file is missing the luminosity data.
There are a bunch of columns we don’t want.
The serial number is part of many column names, so the columns change with each file.
The column names have lots of spaces, punctuation, a superscript, and even a degree symbol.
I do NOT want to have to fix this one file at a time - particularly if there is a large number of files. I want to get the Date, Temp and Intensity data and not the rest of the columns.
First I will use discard()
from the purrr
package to just drop the site with no data. (Note keep()
is the opposite of discard()
).
Then, I will make a function that can take the raw data and correct these issues. Note the use of starts_with()
to select the columns whose names start the same but have different serial numbers at the end. This is from a small package called tidyselect
. Many of the tidyverse
packages use this as a back-end and export these functions. Other similar helpers include ends_with()
, any_of()
, all_of()
, contains()
, and where()
. It’s not a very big package, so it’s worth checking to see what it has.
I added in a Temp_C
column and transformed the temperature data. Using some functions from lubridate
, I make the Date_Time
column into a date-time data type and extract a Date
column.
Then I will use map() to get the data into a more consistent format:
library(lubridate)
intensity_data_raw <- intensity_data_raw %>% discard(~ nrow(.x) == 0)
Fix_Data <- function(data) {
data %>%
select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
mutate(Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = mdy_hms(Date_Time), Date = date(Date_Time))
}
intensity_data <- intensity_data_raw %>% map(~ Fix_Data(.x))
Now all the data is in the same format.
The data is measured on an hourly basis - it might be useful to make a summary by day. I will use the group_by()
and across()
to summarize each dataset. The across()
function lets me select many columns and perform a function on them, without having to know what their names are. This is used for functions like mutate()
and filter()
.
The where()
tells across()
which columns to work with. The mean()
is indicating what will happen to the selected columns. In this case it basically indicates that any column that is numeric should be summarized by its daily mean. By not having to name each column, I don’t need to write special code to account for the missing Intensity
column in one dataset.
Summarise_Data <- function(data) {
data %>%
group_by(Date) %>%
summarise(across(where(is.numeric), mean))
}
summary_data <- intensity_data %>% map(~ Summarise_Data(.x))
At the end of the datasets are some days where Intensity
is always 0. I am going to treat that as an error and get rid of those rows. Again, across()
is the way to accomplish this. The first part of the function indicates which columns to use. The any_of()
tells the filter to only work on the Intensity
column, if there is one. The ~.x != 0
is saying that 0 should be filtered out of any column that meets the requirements (i.e. any column called “Intensity”).
### Oops - some of the intensity data is 0, fix this for later use in gamma glm().
Zero_Intensity_Fix <- function(data) {
data %>%
filter(across(any_of("Intensity"), ~ .x != 0))
}
summary_data <- summary_data %>% map(~ Zero_Intensity_Fix(.x))
purrr - Graphing 1
Now we will write a function to make a ggplot graph and use map()
to graph all the data. The where()
from the tidyselect
package is helpful again.
Summary_Graph <- function(data) {
graph_data <- data %>%
pivot_longer(cols = where(is.numeric), names_to = "Measure", values_to = "Value") %>%
filter(Measure != "Temp_F")
Graph <- ggplot(graph_data, aes(x = Date, y = Value, group = Measure, color = Measure)) +
geom_point() +
facet_grid(Measure ~ ., scales = "free_y")
return(Graph)
}
Graphs1 <- summary_data %>% map(~ Summary_Graph(.x))
Note that in the Summary_Graph()
function I first created the dataset I wanted using pivot_longer()
and saved it to graph_data
and then made the graph. Because I was making 2 things in this function I assigned the graph to Graph
and then used the return()
function to indicate what should be the output. Because I made the graph last, by default it would be what the function returns, but as your functions get more complicated it can be good to specify that explicitly. Also, I could have written return(list(graph_data,Graph))
and returned both objects.
purrr - Analyzing Data - map_if()
Now we will use general linear regression to examine the relationship between light and temperature. There are a couple of challenges here:
One of the data sets does not have the Intensity
column, so we need to skip it. We could just cut it from the list, but let’s assume that you will be getting more datasets in the future and want the function to just handle that kind of error on its own. To accomplish this, the map_if()
function is used. This function is similar to a normal if()
function, but can be used in the context of iterating over a list.
Regression analysis can get kind of complicated and there are lots of potential arguments to your regression function. See ?glm
for a list of options with that function. We want to be able to experiment with different ways to fit the data without having to specify every argument in advance or constantly rewrite the function. To do this, the ellipsis (...
) is used.
Intensity_Regression <- function(data, ...) {
glm(Intensity ~ Temp_C, data = data, ...)
}
Gauss_Reg <- summary_data %>%
map_if(~ all(c("Intensity", "Temp_C") %in% colnames(.x)),
~ Intensity_Regression(.x, family = gaussian),
.else = ~ NA
)
Here is what is going on:
map_if()
first checks to make sure that Intensity
and Temp_C
are both present in the column names of each data.frame
If they are, it runs the Intensity_Regression()
function. If they are not present it just returns NA
.
Intensity_Regression()
takes in the data, and the ...
indicates that there may be some other arguments as well. These other arguments are not specified in advance. The ...
then appears again in the glm()
function, which indicates that whatever these extra arguments are should go to glm()
. You can only send the ...
to one place inside your function body, you can’t send it to multiple functions. If for some reason you want to store whatever is in the ...
inside your function you can use match.call(expand.dots = FALSE)$ `...`
.
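As a small aside, here is a minimal sketch of that trick with a throwaway toy function (show_dots() is made up purely for illustration): it returns the unevaluated arguments that were passed through the dots.
show_dots <- function(...) match.call(expand.dots = FALSE)$`...` # capture the dots without evaluating them
show_dots(family = gaussian, weights = NULL)                     # a named list of the supplied expressions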
When I made the Gauss_Reg
output I indicated I wanted a Gaussian (normal) distribution for my regression. So, the family=
argument was passed to glm()
. Now I will do it again with a Gamma
distribution.
Gamma_Reg <- summary_data %>%
map_if(~ all(c("Intensity", "Temp_C") %in% colnames(.x)),
~ Intensity_Regression(.x, family = Gamma(link="log")),
.else = ~NA
)
map2()
Let’s make a graph of the data vs. fit for each result of the Gaussian regression.
First, we will add the predictions of the model to each dataset using the add_predictions()
from the modelr
package.
library(modelr)
summary_data <- map2(.x = summary_data, .y = Gauss_Reg, .f = ~ {
if ("Intensity" %in% colnames(.x)) {
add_predictions(data = .x, model = .y, var = "Pred")
} else {
.x
}
})
The map2()
function was used. What this does is take two lists - .x
and .y
and then feeds them into a function in parallel. So what happens here is that the function takes the first site from summary_data
and the first regression from Gauss_Reg
and sees that there is no Intensity
data available so it just returns the original data.frame
. Then it takes the second element from each list, sees that there is Intensity
data and uses add_predictions()
to add a new column to the data and return it. This then repeats down the list. If you have 3+ lists to iterate over, use the pmap()
function instead.
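As a minimal sketch of pmap() with three toy vectors (nothing to do with the logger data, just to show the pattern of iterating over several lists in parallel):
pmap(list(1:3, 4:6, 7:9), function(a, b, c) a + b + c) # a list: 12, 15, 18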
Now that we have the predictions in the data.frames
, let’s make a graph to compare the model predictions with the data.
Pred_Graph_Data <- bind_rows(summary_data, .id = "Site") %>% filter(!is.na(Pred))
ggplot(Pred_Graph_Data, aes(x = Date, y = Intensity)) +
geom_point() +
geom_line(aes(x = Date, y = Pred, color = "red")) +
facet_grid(Site ~ ., scales = "free_y")
We can feed summary_data
to bind_rows()
and it will turn the list of data.frames
into one combined data.frame
. The .id
argument tells bind_rows()
what to name the column that tells you the site name, and the site name is already present as the name of the input list.
“Performance” here refers to how fast your code does something.
Faster is generally better, but it often requires more time to write and can include code that is harder to read.
Most of the time you can ignore performance. If you run the code and are happy with how long it takes, then there is no problem.
Performance can be a consideration when:
You are doing the same analysis many (dozens to hundreds) of times. Think of an analysis of each species at many parks, or each water quality measure at many sites.
If you have very large data files that you need to manipulate.
If you are doing a simulation study and need to create and manipulate large numbers of data sets.
If you are doing some sort of interactive visualization and need results very quickly.
For this section we will need the tidyverse
, lubridate
, profvis
and microbenchmark
libraries.
library(tidyverse)
library(lubridate)
library(profvis)
library(microbenchmark)
R has “vectorization” - this means there are some functions that take a vector as an input, perform a calculation or operation on each element and return a vector as output. Some examples:
Numeric Vector
# A vector of numbers
x <- 1:10
# Arithmetic is vectorized
x + 1
## [1] 2 3 4 5 6 7 8 9 10 11
Character Vector
# a character vector (letters)
x <- letters[1:10]
# paste is vectorized
paste("Letter:", x)
## [1] "Letter: a" "Letter: b" "Letter: c" "Letter: d" "Letter: e" "Letter: f"
## [7] "Letter: g" "Letter: h" "Letter: i" "Letter: j"
Logical Vector
# a logical vector
x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
# logical operations are vectorized
TRUE & x
## [1] TRUE TRUE FALSE TRUE FALSE FALSE
Date Vector
# a date vector
library(lubridate)
x <- mdy("02/07/2022", "02/08/2022", "02/09/2022", "02/10/2022")
x
## [1] "2022-02-07" "2022-02-08" "2022-02-09" "2022-02-10"
## adding 1 shows you the next day
x + 1
## [1] "2022-02-08" "2022-02-09" "2022-02-10" "2022-02-11"
rm(x)
Most functions in R are vectorized. This includes:
Mathematical functions, including statistical functions.
Logical operators.
Functions such as rowMeans()
that work on matrices.
Text manipulation functions.
Many of these functions are written in C which makes them extremely fast.
You SHOULD NOT write a function or a for()
loop to do iteration when the pre-existing vectorization will do the same thing.
You SHOULD make use of vectorization in your functions to keep them fast and readable.
When you use the dplyr
function mutate()
you are making use of vectorization.
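For example, a single mutate() call applies vectorized arithmetic to entire columns at once (a toy sketch with made-up data):
tibble(x = 1:5) %>% mutate(doubled = x * 2, logged = log(x))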
“Profiling” is looking at each part of a function to find out how long it takes. This can allow you to identify bottlenecks where speeding up your code might be helpful. Then tools such as microbenchmark
can be used to compare options. RStudio has profiling tools built in, but to use them you need to install the profvis
package.
Once you do this you can access the Profile menu in the top menu bar. You can select some code, and use “Profile Selected Line(s)” to run the profiler.
Alternatively, you can run the profvis()
function to do the same thing with a little more control over the outcome. Here we will run the function. In case you no longer have the logger data handy, here is the code to load it again.
library(tidyverse)
library(lubridate)
library(profvis)
# Recreate intensity data
fNames <- c(
"APIS01_20548905_2021_temp.csv",
"APIS02_20549198_2021_temp.csv",
"APIS03_20557246_2021_temp.csv",
"APIS04_20597702_2021_temp.csv",
"APIS05_20597703_2021_temp.csv"
)
fPaths <- paste0("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/", fNames)
intensity_data_raw <- set_names(fPaths, c("APIS01", "APIS02", "APIS03", "APIS04", "APIS05")) %>%
map(~ read_csv(file = .x, skip = 1)) %>%
discard(~ nrow(.x) == 0)
# Fix_Data function
Fix_Data <- function(data) {
data %>%
select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
mutate(Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = mdy_hms(Date_Time), Date = date(Date_Time))
}
With that loaded, we can run profvis():
profvis::profvis(Fix_Data(intensity_data_raw[[2]]), interval = 0.005)
This will give you a "Flame Graph," which tells you how much time each part of your function was running. Here we profile the Fix_Data() function from earlier. profvis (the profiling tool) in the bottom row was running the whole time. One row up are rename() from dplyr and mdy_hms() from lubridate. Only 5 ms was spent on rename(), whereas 15 ms was spent on mdy_hms(). Above those two functions are the functions that they called, which are not relevant to us now. You don't see functions like select() or mutate() because they ran too quickly. So if you wanted to speed up Fix_Data(), the place to look for improvements is the date conversion.
Let's say that you wanted to speed up the Fix_Data() function. Benchmarking is a tool you can use to compare different versions of a function to see which one runs faster. To do this we will use the microbenchmark package. Earlier the as.POSIXct() function was used to convert the date-time column from character to date. Let's see if that is quicker. We will also try the base function strptime().
library(microbenchmark)
# Original function
Fix_Data_Lubridate <- function(data) {
data %>%
select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
mutate(Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = mdy_hms(Date_Time), Date = date(Date_Time))
}
# New version using as.POSIXct()
Fix_Data_POSIXct <- function(data) {
data %>%
select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
mutate(
Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = as.POSIXct(Date_Time, "%m/%d/%y %H:%M:%S", tz = "UCT"),
Date = date(Date_Time)
)
}
# new version using strptime
Fix_Data_strptime <- function(data) {
data %>%
select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
mutate(
Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = strptime(Date_Time, "%m/%d/%y %H:%M:%S", tz = "UCT"),
Date = date(Date_Time)
)
}
# Do Comparison
mb <- microbenchmark(
Lubridate = Fix_Data_Lubridate(intensity_data_raw[[2]]),
POSIXct = Fix_Data_POSIXct(intensity_data_raw[[2]]),
Strptime = Fix_Data_strptime(intensity_data_raw[[2]]),
times = 10L, unit = "ms"
)
mb
## Unit: milliseconds
## expr min lq mean median uq max neval
## Lubridate 23.2089 23.2898 24.90526 24.19820 24.8005 32.6782 10
## POSIXct 82.5985 84.5467 88.28583 86.34290 93.0844 97.7106 10
## Strptime 44.6630 45.0732 49.73176 46.76175 49.9721 69.8149 10
boxplot(mb)
As it happens, the lubridate version came out the fastest. You should also consider whether this data is representative of the work you need to do. It is not unusual for one way of coding to be fastest with large amounts of data but to have no advantage with small amounts of data.
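As a rough sketch of how you might check that, you could rerun the comparison on a larger, made-up set of date-time strings (these strings and sizes are not from the logger data, and exact timings will vary by machine):
library(microbenchmark)
library(lubridate)
# 100,000 made-up date-time strings
dt_strings <- rep("02/07/22 13:45:00", 1e5)
microbenchmark(
  lubridate = mdy_hms(dt_strings),
  base      = as.POSIXct(dt_strings, format = "%m/%d/%y %H:%M:%S", tz = "UTC"),
  times = 5L, unit = "ms"
)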
R Markdown is a special file format that you can open in RStudio (go to File > New File, and "R Markdown…" is the 3rd option). R Markdown takes what you put into the .Rmd file, knits the pieces together, and renders it into the format you specified using Pandoc, which is typically installed as part of the RStudio IDE bundle. The knit and render steps generally occur at the same time, and the terms are often used interchangeably. For example, the Knit button in the toolbar knits and renders the .Rmd to your output file. The knit keyboard shortcut, Ctrl + Shift + K, is also super handy.
The most commonly used outputs are HTML, PDF, and Word. Each output format has pros and cons, and drives which text editing language (e.g., HTML or LaTeX) you need to use for advanced/custom needs. We'll cover these in more detail in a minute. To get started, you'll need to make sure you have the rmarkdown package installed. The knitr package, which does a lot of the heavy lifting, is a dependency of rmarkdown, so both will be installed with the line of code below.
install.packages("rmarkdown")
Once you have rmarkdown
installed, you should be able to go to File > New File > R Markdown…, and start a new .Rmd file. After selecting “R Markdown…”, you will be taken to another window where you can add a title and author and choose the output format. For now, let’s just use the default settings and output to HTML. You should now see an Untitled .Rmd file in your R session with some information already in the YAML and example plain text and code chunks. You can also start with a blank .Rmd, which will be more convenient once you get the hang of it.
For all of the output types, the built-in markdown functions, like ‘#’ for level 1 header, render as you expect. Most output types also have an additional code base that allows more advanced and customized features.
The code base for HTML is obviously HTML and cascading style sheets (CSS). HTML is used to structure the content. CSS is used to define the style (e.g., font type and size for each header level).
Rendering to PDF requires a LaTeX engine that runs under the hood. The easiest engine to install is tinytex, and there are instructions and download files on the "Prep for Training" tab of this site. The tinytex package was developed by the same developer as the knitr package. To install the tinytex LaTeX engine, you can run the following code:
install.packages('tinytex')
tinytex::install_tinytex()
#Check that it worked:
writeLines(c(
'\\documentclass{article}',
'\\begin{document}', 'Hello world!', '\\end{document}'
), 'test.tex')
tinytex::pdflatex('test.tex')
# Console should say [1] "test.pdf"
# If you ever run into issues with markdown not knitting PDF,
# sometimes running the code below will fix the problem:
tinytex::reinstall_tinytex()
Rendering to Word can be a helpful way to generate a Results section for a report, or even the entire report for a first draft. It saves having to copy/paste figures and tables from R to Word, and makes life easier if you need to rerun your analysis or update a figure: you just update the .Rmd code and render to Word again. Admittedly, I rarely use this option because the functionality is much more limited than outputting to PDF or HTML (see the first con below). See the Resources tab for links to websites and blogs that cover outputting to Word in more detail than we'll cover here.
To render to Word, set output: word_document in the YAML. You really can only apply the main styles in the template you imported. Figures and tables don't always render all that well (or at all), and often need to be tweaked again in Word. Headers and footers don't render either. To do anything beyond applying main styles, you'll likely need to employ packages in the officeverse, including the officedown and flextable packages. I've only spent a little time looking over these packages. They seem useful, but require learning another set of packages and conventions. The redoc package may make porting changes to the Word document back to R Markdown relatively straightforward (see the redoc GitHub page for more detail).
The .Rmd file itself consists of 3 main pieces. There's the YAML ("YAML Ain't Markup Language") code at the top, which is contained within ---, like the image below. The top YAML is typically where you define features that apply to the whole document, like the output format, authors, parameters (more on that later), whether to add a table of contents, etc. The YAML below is what we're using for this website. Note that indenting is very important in YAML. The css: custom_styles.css line tells Markdown that I want the styles defined in my css file, rather than the default styling in Markdown. This is optional, and is here just to show a variation on the default YAML you get when starting a new .Rmd. If you don't want to use your own custom style sheet, then your YAML would just have the following in one line: output: html_document.
The second piece, plain text, is what it sounds like: it's just text. You can write anything you want outside of a code chunk, and it will render as if you're writing in a word processor, rather than as code. Note that special characters like % and & may need to be escaped with a backslash (\) before the symbol, particularly if you're using LaTeX (more on that later); for example, write 50\% rather than 50% in text destined for PDF output.
You can format text using Markdown’s built in functions, like those shown below. For a more detailed list of these formatting functions, check out the R Markdown Cheatsheet. You can also code HTML directly in R Markdown, which I actually find easier the more I get comfortable with HTML. The section below shows how to use the same common styles with R Markdown and HTML and what the output looks like.
The actual text in the .Rmd:
# First-level header
## Second-level header
...
###### Sixth-level header
*italic* or _italic_
**bold** or __bold__
superscript^2^
endash: --
Example sentence: *Picea rubens* is the dominant species in **Acadia National Park**.
The HTML version:
<h1>First-level header</h1>
<h2>Second-level header</h2>
...
<h6>Sixth-level header</h6>
<i>italic</i>
<b>bold</b>
superscript<sup>2</sup>
endash: –
Example sentence: <i>Picea rubens</i> is the dominant species in <b>Acadia National Park</b>.
The text renders as:
italic or italic
bold or bold
superscript2
endash: –
Example sentence: Picea rubens is the dominant species in Acadia National Park.
#### This is how to specify header level 4, which renders as:
Code chunks are also what they sound like. They're chunks of R code (they can be other coding languages too), which run like they're in an R script. They're contained within back ticks and curly brackets, like below.
```{r}
```
You can set chunk options inside the curly brackets { }, separated by commas. Common chunk options are below:
echo = TRUE
prints the code chunk to the output. FALSE
omits the code from output.
results = 'hide' omits the results of the code chunk from the output ('markup' is the default).
include = FALSE
executes the code, but omits the code and results from the output.
eval = FALSE
does not execute the code chunk, but can print the code, if echo = TRUE
.
error = TRUE
allows a code chunk to fail and for the document to continue to knit.
cache = TRUE
allows you to cache the output associated with that code chunk, and will only rerun that chunk if the code inside the chunk changes. Note that if the objects or data in the code chunk are changed, but the code within the chunk is still the same, the code chunk won’t realize that it needs to rerun. You need to be careful about using the cache option.
fig.cap = "Caption text"
allows you to add a figure caption.
fig.width = 4; fig.height = 3 allow you to set the figure size in inches.
out.width = "50%"; out.height = "50%" allow you to set the figure or table size as a percentage of the container/page size.
message = FALSE, warning = FALSE
prevent messages or warnings from chatty packages from being included in the output.
See the R Markdown Cheatsheet for a complete list of code chunk options.
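To put several of these together, a chunk header combining a few common options might look like the following (the chunk name, caption, and plotting code are made up for illustration):
```{r species-plot, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Species richness by site.", fig.width = 6, fig.height = 4}
plot(1:10)  # placeholder plotting code
```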
Another trick with code chunk options is that they can be conditional based on the results from another code chunk. For example, I have a QC report that runs 40+ checks on a week's worth of forest data, but the report only includes checks that returned at least 1 value/error. Checks that returned nothing are omitted from the report using conditional eval. I'll show an example of that later.
You can also reference objects in your R session within the plain text using back ticks (i.e., `). In the example below, I’ll first calculate a mean value that I’ll pretend is the average number of species on a plot. Then I’ll use inline code chunks to print that number in the plain text.
First, we calculate numspp in a code chunk with include = F. Here I'm using the runif() function to randomly generate 50 numbers from 0 to 100 using the uniform distribution, just to have some data to summarize. The include = F option runs the code chunk, but won't show the code or the results in the markdown document.
```{r, include = F}
numspp <- round(mean(runif(50, 0, 100)), 1)
```
To use this in a sentence, the plain text looks like this:
The average number of species found in each plot was `r numspp`.
And renders like this:
The average number of species found in each plot was 44.8.
How would you run a code chunk, but only show the results, not the code?
Answer: Include echo = F in the code chunk options.
There are quite a few packages that can help you make publication-quality and customized tables. The two I see used most frequently are kable() in the knitr package and datatable() in the DT package (not to be confused with the data.table package for data wrangling in R). The learning curve for kable is pretty shallow, and it runs HTML under the hood. The learning curve for DT is a bit steeper, and it has JavaScript under the hood. That means you can customize and add more features using those languages, if you know them. I tend to stick with kable, because I find HTML/CSS easier to code. If I need more bells and whistles, then I use datatable.
First, we'll load a fake wetland dataset from our GitHub repo and make some summary tables using each package. The code below downloads the dataset from the training GitHub repo, then summarizes the number of invasive and protected species per site. For both examples, the output format is HTML. If I were outputting to PDF, I'd need to specify the format as 'latex' and use LaTeX code for any custom features not built into kable.
library(tidyverse)
wetdat <- read.csv(
"https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv")
wetsum <- wetdat %>% group_by(Site_Name, Year) %>%
summarize(num_inv = sum(Invasive), num_prot = sum(Protected),
.groups = 'drop')
The code below creates a simple table that renders in HTML, is only as wide as the records in the table, and has alternating row colors. If you’re outputting to PDF, your format will be “latex” instead of “HTML” and you’ll need to use LaTeX for any custom formatting/styling that aren’t built into kable and kableExtra.
Note also that the version of kableExtra
on CRAN currently has a bug that causes collapse_rows()
not to function. I’ll show what this does in a minute, but for now, just know that if you want to collapse rows in your kable
, you’ll need to install the development version of kableExtra
on GitHub. Code for that is below. You’ll need the devtools
package installed to install it. If you’ve already loaded kableExtra
in your session, you’ll also need to restart your session (Note: Ctrl + Shift + F10 is the fastest way to restart your R session).
devtools::install_github("haozhu233/kableExtra")
library(kableExtra) # for extra kable features
library(knitr) # for kable
wet_kable <- kable(wetsum, format = 'html') %>% # if using pdf, need LaTeX
kable_styling(full_width = FALSE, bootstrap_options = 'striped') #kableExtra function
wet_kable
Site_Name | Year | num_inv | num_prot |
---|---|---|---|
RAM-05 | 2012 | 0 | 3 |
RAM-05 | 2017 | 1 | 3 |
RAM-41 | 2012 | 0 | 0 |
RAM-41 | 2017 | 1 | 0 |
RAM-44 | 2012 | 1 | 0 |
RAM-44 | 2017 | 1 | 0 |
RAM-53 | 2012 | 2 | 0 |
RAM-53 | 2017 | 3 | 1 |
RAM-62 | 2012 | 0 | 0 |
RAM-62 | 2017 | 0 | 0 |
SEN-01 | 2011 | 0 | 1 |
SEN-02 | 2011 | 0 | 1 |
SEN-03 | 2011 | 0 | 0 |
Note the use of pipes in the code above. The great thing about kable
and kableExtra
is that you can pipe functions together to build out a large table with all kinds of formatting, including conditional formatting. You can also make a custom kable
function that has all of the formatting options you want, and just specify the dataset to build the table for. You can then pipe more features onto that function. We’ll show a couple of these examples below.
# custom kable function that requires data, column names and caption
make_kable <- function(data, colnames = NA, caption = NA){
kab <- kable(data, format = 'html', col.names = colnames, align = 'c', caption = caption) %>%
kable_styling(fixed_thead = TRUE,
bootstrap_options = c('condensed', 'bordered', 'striped'),
full_width = FALSE,
position = 'left',
font_size = 12) %>%
row_spec(0, extra_css = "border-top: 1px solid #000000; border-bottom: 1px solid #000000;") %>%
row_spec(nrow(data), extra_css = 'border-bottom: 1px solid #000000;')
}
# use function with wetsum data
wetkab2 <- make_kable(wetsum,
colnames = c("Site", "Year", "# Invasive", "# Protected"),
caption = "Table 1. Summary of wetland data") %>%
scroll_box(height = "250px")
Site | Year | # Invasive | # Protected |
---|---|---|---|
RAM-05 | 2012 | 0 | 3 |
RAM-05 | 2017 | 1 | 3 |
RAM-41 | 2012 | 0 | 0 |
RAM-41 | 2017 | 1 | 0 |
RAM-44 | 2012 | 1 | 0 |
RAM-44 | 2017 | 1 | 0 |
RAM-53 | 2012 | 2 | 0 |
RAM-53 | 2017 | 3 | 1 |
RAM-62 | 2012 | 0 | 0 |
RAM-62 | 2017 | 0 | 0 |
SEN-01 | 2011 | 0 | 1 |
SEN-02 | 2011 | 0 | 1 |
SEN-03 | 2011 | 0 | 0 |
A few notes on make_kable() and the formatting options used here and in the next example:
Because colnames and caption default to NA, you don't have to specify them for the function. If you don't, the column names in the table will be the names in the dataframe, and the caption will be omitted. If you don't set a default for an argument, it will be required, like the data.
align = 'c' centers every column. If you wanted the first column to be left aligned and the next 3 to be centered, you would write align = c('l', rep('c', 3)).
fixed_thead = TRUE means that if a scroll bar is added to the table, the table header (top row) will be fixed. Here we piped a scroll_box() at the end of the code to show how that works. You can add a scroll box the width of the page by adding width = "###px" to the arguments.
Adding scroll_box() inside the function can throw Error in UseMethod("nodeset_apply"), which is why I piped it at the end instead of adding it to the function.
position = 'left' will left justify the table on the page (the default is center).
row_spec(0, ...) adds a black border to the top and bottom of the header, which kable considers row 0.
row_spec(nrow(data), ...) adds a black border to the bottom of the table regardless of the number of rows in the table.
In the conditional formatting example below, the ifelse() that ends in FALSE just allows the default color to be printed instead of the conditional color, so the alternating row colors remain.
To collapse repeated values in a column, add the collapse_rows() pipe after any column_spec() or row_spec() calls. You can also collapse on multiple columns, but it is finicky about the order of the pipes; just use trial and error until you find the order that works. Note also that collapse_rows() only works in the development version of the package (see above for installation instructions).
wetkab3 <- make_kable(wetsum,
colnames = c("Site", "Year", "# Invasive", "# Protected"),
caption = "Table 1. Summary of wetland data") %>%
column_spec(3, background = ifelse(wetsum$num_inv > 0, "orange", FALSE)) %>%
collapse_rows(1, valign = 'top')
Site | Year | # Invasive | # Protected |
---|---|---|---|
RAM-05 | 2012 | 0 | 3 |
2017 | 1 | 3 | |
RAM-41 | 2012 | 0 | 0 |
2017 | 1 | 0 | |
RAM-44 | 2012 | 1 | 0 |
2017 | 1 | 0 | |
RAM-53 | 2012 | 2 | 0 |
2017 | 3 | 1 | |
RAM-62 | 2012 | 0 | 0 |
2017 | 0 | 0 | |
SEN-01 | 2011 | 0 | 1 |
SEN-02 | 2011 | 0 | 1 |
SEN-03 | 2011 | 0 | 0 |
Using the wetsum dataframe, how would you make a table where all but the first column is center aligned, and the first column is left aligned?
ans_tab <- kable(wetsum, format = 'html',
align = c('l', 'c', 'c', 'c')) %>% # Answer 1
kable_styling(fixed_thead = TRUE,
full_width = FALSE,
position = 'left') # Answer 2
ans_tab
Site_Name | Year | num_inv | num_prot |
---|---|---|---|
RAM-05 | 2012 | 0 | 3 |
RAM-05 | 2017 | 1 | 3 |
RAM-41 | 2012 | 0 | 0 |
RAM-41 | 2017 | 1 | 0 |
RAM-44 | 2012 | 1 | 0 |
RAM-44 | 2017 | 1 | 0 |
RAM-53 | 2012 | 2 | 0 |
RAM-53 | 2017 | 3 | 1 |
RAM-62 | 2012 | 0 | 0 |
RAM-62 | 2017 | 0 | 0 |
SEN-01 | 2011 | 0 | 1 |
SEN-02 | 2011 | 0 | 1 |
SEN-03 | 2011 | 0 | 0 |
Using the same wetsum
dataset we created earlier, we’ll make a table using datatable()
and will add some of the features that kable()
doesn’t have and that usually lead me to choose datatable over kable. We’ll start with a basic example and build on it.
library(DT)
wetdt <- datatable(wetsum, colnames = c("Site", "Year", "# Invasive", "# Protected"))
wetdt
The resulting table has several nice features that kable doesn’t offer.
If you want to show more or fewer entries in your table at a time, you can specify different values by adding options and then setting either pageLength or lengthMenu. The pageLength option takes 1 value and displays that number of entries in the table. The lengthMenu option is similar, but allows you to supply multiple values, which are then added to the dropdown menu in the Show [##] entries box. That lets the user select the number of entries they want to see at a time.
I also added an option that stops the table from spanning the entire page.
# modify pageLength and lengthMenu
wetdt2 <- datatable(wetsum, colnames = c("Site", "Year", "# Invasive", "# Protected"),
width = "45%",
options = list(pageLength = 10,
lengthMenu = c(5, 10, 20))
)
The next example adds class = 'cell-border stripe', which adds vertical lines between columns and striped rows, the same way bootstrap_options added striped cells in kable. Column filters are another feature that might lead you to choose datatable() instead of kable(); the code below adds a filter to the top of each column in the table, which is particularly useful if your tables have a lot of rows. The example also adds a blank Notes column and uses the editable argument so that only the Notes column can be edited. The formatStyle() call at the end sets conditional formatting like we did for kable, where any value > 0 in the 3rd column will be orange.
wetdt3 <- datatable(data.frame(wetsum, "Notes" = NA),
width = "45%",
colnames = c("Site", "Year", "# Invasive", "# Protected", "Notes"),
options = list(pageLength = 10),
class = 'cell-border stripe',
filter = list(position = c('top'), clear = FALSE),
editable = list(target = 'cell', disable = list(columns = 1:4))) %>%
formatStyle(3, backgroundColor = styleInterval(0, c('white', "orange")))
wetdt3
There are multiple ways to display an image that's stored on disk. The easiest is with markdown code in the plain text part of your document, which looks like:
![Map of Region-1 IMD parks.](./images/map_of_parks.jpg){width=400px}
Note that inserting a hyperlinked url is very similar. Just omit the ! and put the url in parentheses instead of the path to the image, like: [Link to IMD home page](https://www.nps.gov/im)
You can also use the HTML <img> tag. The code below produces the exact same image as the markdown code above. I seem to be able to remember the <img> tag better than the markdown version, so I tend to use it more often.
<img src="./images/map_of_parks.jpg" alt = "Map of Region-1 IMD parks" width="400px">
I also like knitr's include_graphics() function, as it can make iteration easier. For example, you can include a bunch of figures in a report based on a list of file names (see Regeneration Summary in the Gallery). Below I'm including all the photos in a photopoint folder, and making them only 25% of the width of the page. That puts them in a grid, which can be handy. You can also add breaks between each item in the list, and then they'll plot separately. If an analysis has lots of plots that take a while to render, I tend to write the plots to disk and then bring them into the markdown document with include_graphics(). Using include_graphics() also means you're running it in code chunks instead of the plain text, which allows you to dynamically number and reference figure names and specify figure captions at the same time. I'll show that trick in a minute.
```{r photopoints, echo = T, out.width = "25%"}
photos <- list.files("./images/photopoints", full.names = TRUE)
include_graphics(photos)
```
Code chunk options include several handy options to customize figures. These include:
fig.align
: defines how the figure will be justified on the page. Options are ‘center’, ‘left’, ‘right’.
fig.cap
: adds a caption to the figure. Must be quoted.
fig.height
& fig.width
: sets the height and width of the figure in inches. Must be numeric and is not quoted.
out.height
& out.width
: sets the height and width of the plot in the opened file. In this case, you set the dimensions as percents, like out.width = "50%"
to make the figure half the width of the page of the rendered document.
You can also set global options so that all figures default to a certain size or alignment. That way, you’d only need to specify figure options in the code chunk if you want to stray from your default settings. Global options can be set like:
knitr::opts_chunk$set(fig.height=4, fig.width=6, fig.align='left')
Dynamic figure and table numbering and cross-referencing is one great feature of Markdown. The easiest way to add dynamic figure numbering and cross-referencing is to output to bookdown’s html_document2
, instead of rmarkdown’s html_document
, which is what I've shown so far. You'll need to add a few lines to the YAML code as well, shown below. The numbered_sections: false and number_sections: false lines prevent bookdown from numbering each section header; if you like numbered sections, you can delete those lines. You'll also need to install bookdown (i.e., install.packages('bookdown')).
output:
bookdown::html_document2:
numbered_sections: false
number_sections: false
fig_caption: true
To see how all of this works, we need to create a couple of plots. Sourcing the code below will generate a fake dataset and create 2 plots. If this doesn't work for some reason, you can copy and paste the code directly from the script named Generate_fake_invasive_data_and_plots.R in our IMD_R_Training_Advanced repository. I'm just sourcing it to save room on the page for code that's not important.
library(dplyr)
library(ggplot2)
devtools::source_url("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/Generate_fake_invasive_data_and_plots.R")
The code below prints the plots with their respective figure numbers. Note that code chunk names should be alpha-numeric, and can’t include spaces or underscores.
```{r fig-inv-all, fig.cap = "Trends in invasive plant cover in NETN parks.", out.width = "50%"}
invplot_all
```
```{r fig-inv-acad, fig.cap = "Trends in invasive plant cover in ACAD.", out.width = "50%"}
invplot_ACAD
```
Notice that the figures are numbered consecutively in the order that they appear. For cross-referencing, each code chunk needs a unique name, and a figure caption must be defined in the chunk options. To cross-reference a figure, you then write \@ref(fig:chunk-name) in the text. The same works for tables too, but you use "tab" instead of "fig". The following text:
As you can see in Figure \@ref(fig:fig-inv-acad), invasive cover appears to be declining. Whereas, invasive cover appears more stable in other NETN parks (Figure \@ref(fig:fig-inv-all)).
renders as:
As you can see in Figure 2, invasive cover appears to be declining. Whereas, invasive cover appears more stable in other NETN parks (Figure 1).
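Cross-referencing a table works the same way; a minimal sketch (the chunk name and caption are made up, and it is the kable caption that bookdown numbers):
```{r wetsum-table, echo = FALSE}
knitr::kable(wetsum, caption = "Summary of wetland data.")
```
In the plain text, Table \@ref(tab:wetsum-table) then renders as "Table 1" (or whatever number matches its order of appearance).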
To show the subtle differences between rendering to different outputs, I rendered roughly the same Markdown report to HTML, Word and PDF. The RMD files and their output are posted to the repository for this website (KateMMiller/IMD_R_Training_Advanced). The actual folder is called Markdown_examples.
A couple of notes on the .Rmd files in that folder:
For the Word version, officedown has a function to make that work, but I didn't have a chance to look into it. I also couldn't get cross-referencing to work with the figures. Again, officedown is likely the answer; I just ran out of time (and enthusiasm?) to troubleshoot it.
The HTML and PDF versions use the bookdown version of html_document or pdf_document, rather than the rmarkdown version. The default formatting for bookdown adds a bunch of section numbering, which I didn't want; that's what the numbered_sections stuff is all about. What's useful for writing a book (most of the online R help books are written in bookdown) is not useful for all purposes.
The HTML version uses a custom style sheet via css: example_styles.css.
The Word version imports a style template via reference_docx (see the YAML sketch after these notes). The one I'm importing is literally a lightweight version of the NPS NRR template.
The officedown or officer package is supposed to be able to do this, but it didn't work on my first try.
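For reference, a minimal YAML sketch for importing a Word style template via reference_docx (the file name here is made up):
---
output:
  word_document:
    reference_docx: NRR_template.docx
---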
One great feature of R Markdown that really helps with automated reporting is the use of parameters. Parameters are values that you set in the YAML at the top and that are used in the report to do things like filtering data or printing names. For example, in the code below we set park, park_long, and cycle as parameters in the YAML. We then use params$park in the next code chunk to create a plot for just the park we specified, and print params$park_long in the caption for that figure. Then we make a table of the park and cycle we specified using params$park and params$cycle, and again use params$park_long in the table caption.
---
output: html_document
params:
park: "ACAD"
park_long: "Acadia National Park"
cycle: 3
---
Inside the report, here’s some code to pull in data and make some plots. Note where I added #### Use # of params. That’s where the parameters are specified in the code.
````
```{r imports, include = FALSE}
# Import data from github
devtools::source_url("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/Generate_fake_invasive_data_and_plots.R")
table(invdata$park, invdata$cycle) # Should see 3 parks each with 3 cycles
# Create invplot_fun to plot different params
invplot_fun <- function(data, park){
p <-
ggplot(data %>% filter(park == .env$park), # .env$park is the function argument, not the park column
aes(x = cycle, y = inv_cover))+
geom_jitter(color = "#69A466", width = 0.1) +
geom_boxplot(aes(group = cycle), width = 0.18, lwd = 0.4, alpha = 0)+
theme_bw()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(x = 'Cycle', y = "Invasive % Cover") +
scale_x_continuous(limits = c(0.9, 3.1), breaks = c(1, 2, 3))
return(p)
}
# Plot invasives by cycle for specified park
invplot <- invplot_fun(invdata, park = params$park) #### Use 1 of params
# Filter invasive data to only include specified park and cycle
parkinv <- invdata %>%
filter(park == params$park) %>%
filter(cycle == params$cycle) %>% #### Use 2 of params
select(-park)
parktab <- kable(parkinv, format = 'html',
col.names = c("Plot", "Cycle", "% Invasive Cover", "% Canopy Cover"),
caption = paste0("Table of forest plots in ", params$park_long) #### Use 3 of params
) %>%
kable_styling(full_width = FALSE, position = 'left',
bootstrap_options = 'striped') %>%
scroll_box(height = "400px", width = "420px")
```
````
Finally, I added some plain text below to show how to specify the parameters as inline code in text:
```
### Invasive summary example
The following data are summarized for `r params$park_long`
```
Then I print the plot and table using the following code chunks:
```{r plotinv, echo = FALSE, fig.cap = paste0("Invasive trends in plant cover in ", params$park_long), out.width = "50%"}
invplot
```
```{r parktable, echo = FALSE}
parktab
```
````
The following data are summarized for Acadia National Park:
Plot | Cycle | % Invasive Cover | % Canopy Cover |
---|---|---|---|
ACAD-01 | 3 | 1.2262568 | 81.88817 |
ACAD-02 | 3 | 0.2877510 | 56.05282 |
ACAD-03 | 3 | 4.5368474 | 76.32668 |
ACAD-04 | 3 | 5.0732908 | 67.99973 |
ACAD-05 | 3 | 0.2333573 | 76.22409 |
ACAD-06 | 3 | 0.0000000 | 88.09978 |
ACAD-07 | 3 | 0.0000000 | 88.81575 |
ACAD-08 | 3 | 2.9293494 | 53.97968 |
ACAD-09 | 3 | 0.0000000 | 75.43722 |
ACAD-10 | 3 | 0.0000000 | 66.81620 |
ACAD-11 | 3 | 2.5702070 | 64.26872 |
ACAD-12 | 3 | 0.0000000 | 97.12055 |
ACAD-13 | 3 | 0.0000000 | 73.33739 |
ACAD-14 | 3 | 0.0000000 | 58.74383 |
ACAD-15 | 3 | 0.0000000 | 75.38991 |
ACAD-16 | 3 | 3.3882025 | 69.33882 |
ACAD-17 | 3 | 1.1438505 | 60.82027 |
ACAD-18 | 3 | 1.7857986 | 64.73210 |
ACAD-19 | 3 | 1.8585776 | 78.08021 |
ACAD-20 | 3 | 1.0697659 | 75.02951 |
ACAD-21 | 3 | 1.6636475 | 67.65857 |
ACAD-22 | 3 | 3.1050853 | 70.33997 |
ACAD-23 | 3 | 2.3414571 | 82.94252 |
ACAD-24 | 3 | 0.3196272 | 73.83292 |
ACAD-25 | 3 | 0.0000000 | 67.27918 |
ACAD-26 | 3 | 1.4275352 | 72.12411 |
ACAD-27 | 3 | 3.5254140 | 62.02793 |
ACAD-28 | 3 | 0.2582069 | 81.31767 |
ACAD-29 | 3 | 0.0657203 | 80.97248 |
ACAD-30 | 3 | 2.1435462 | 54.95987 |
ACAD-31 | 3 | 1.8006971 | 79.54069 |
ACAD-32 | 3 | 0.0000000 | 66.28073 |
ACAD-33 | 3 | 0.6599651 | 71.10860 |
ACAD-34 | 3 | 2.5895022 | 55.35212 |
ACAD-35 | 3 | 0.0000000 | 70.15709 |
ACAD-36 | 3 | 0.2552997 | 72.18458 |
ACAD-37 | 3 | 0.4498149 | 70.17703 |
ACAD-38 | 3 | 0.0000000 | 75.88522 |
ACAD-39 | 3 | 0.0000000 | 89.46357 |
ACAD-40 | 3 | 0.7745124 | 62.74404 |
ACAD-41 | 3 | 0.7494048 | 98.09753 |
ACAD-42 | 3 | 0.0000000 | 88.88522 |
ACAD-43 | 3 | 0.0000000 | 71.95222 |
ACAD-44 | 3 | 0.0000000 | 81.87620 |
ACAD-45 | 3 | 3.3797115 | 74.91392 |
ACAD-46 | 3 | 2.9029968 | 81.61064 |
ACAD-47 | 3 | 0.3277854 | 71.95225 |
ACAD-48 | 3 | 0.4411102 | 65.64562 |
ACAD-49 | 3 | 0.0000000 | 69.21644 |
ACAD-50 | 3 | 0.0000000 | 81.55512 |
Answer 1. Change the YAML to the following:
---
output: html_document
params:
park: "SARA"
park_long: "Saratoga National Historical Park"
cycle: 3
---
Taking this a step further, you can run a script that iterates through a list (e.g., a list of parks) and generates a report for each item in the list. I saved the code above, where we created a park-specific figure and table using parameters in the YAML, as example_for_params.Rmd in the Markdown_examples folder mentioned earlier. Now I'll show the code to render an HTML file for each park in the dataset (ACAD, MABI, and SARA).
# load packages
library(purrr)
library(rmarkdown)
# Set in and out directories and file to iterate on
indir <- c("./Markdown_examples/")
outdir <- c("./Markdown_examples/output/") # Included, to see how to change out dir.
rmd <- "example_for_params.Rmd"
# Set up parameter lists. Have to repeat 3 times because each park has 3 cycles
park_list <- rep(c("ACAD", "MABI", "SARA"), each = 3)
park_long_list <- rep(c("Acadia National Park",
"Marsh-Billings-Rockefeller NHP",
"Saratoga National Historical Park"), each = 3)
cycle_list <- rep(c(1, 2, 3), 3)
# Create the render function that we will iterate with
render_fun <- function(parkcode, parklong, cyclename){
render(input = paste0(indir, rmd),
params = list(park = parkcode,
park_long = parklong,
cycle = cyclename),
output_file = paste0("Example_for_", parkcode,
"_cycle_", cyclename, ".html"),
output_dir = outdir)
}
# Map render_fun() to the lists.
# Note that you must have the same # of elements in each list
pmap(list(park_list, park_long_list, cycle_list),
~render_fun(..1, ..2, ..3))
Not only can you set eval = TRUE/FALSE
or include = TRUE/FALSE
, but you can set these conditionally using objects in your global environment. For example, I have an .Rmd report that runs weekly checks on our forest data during the field season, and it’s primarily looking for errors, like missing data or values that are out of range. I only want the report to print the results of a check if it finds something. The way I do this is below:
````
```{r QCcheck1, include = FALSE}
fake_check1 <- data.frame(results = c(1:5))
#fake_check1 <- data.frame()
eval_check1 <- ifelse(nrow(fake_check1) > 0, TRUE, FALSE)
```
```{r, check1, eval = eval_check1}
print(nrow(fake_check1))
```
````
Another trick that has come in handy (although it is not the easiest thing to implement) is to dynamically generate tabsets using lists. For example, we have one water summary .Rmd that takes park as a parameter. Using the park, the code figures out which sites occur at that park, and populates the site tabs specific to the park for that report. Here's a simplified version of that approach below.
The first code chunk performs the following steps:
Creates a fake tabset_data dataset with 3 parks. Each park has a different number of sites, and each site has been sampled every year from 2010 to 2021.
Uses tabset_data to generate a unique list of parks (park_list) and a list of sites (site_list).
Creates a ggplot_fun() function to iterate through the site_list.
Creates a list of plots (plot_list) using purrr::map(), and uses set_names() to name each element of the list by its site.
````
```{r ts_lists, echo = F, results = 'hide'}
library(purrr)
library(ggplot2)
# Create dataframe with 10 years of data for each plot and number of species found
years <- 2010:2021
numACAD <- 10
numMABI <- 5
numSARA <- 7
tot_rows <- length(years) * (numACAD + numMABI + numSARA)
tabset_data <- data.frame(
site = c(paste0(rep("ACAD-", numACAD * length(years)),
sprintf("%02d", rep(1:numACAD, each = length(years)))),
paste0(rep("MABI-", numMABI * length(years)),
sprintf("%02d", rep(1:numMABI, each = length(years)))),
paste0(rep("SARA-", numSARA * length(years)),
sprintf("%02d", rep(1:numSARA, each = length(years))))),
year = rep(years, numACAD + numMABI + numSARA),
numspp = as.integer(runif(tot_rows, 1, 20))
)
tabset_data$park <- substr(tabset_data$site, 1, 4)
head(tabset_data)
# Create park list for the for loop below
park_list <- unique(tabset_data$park)
site_list <- unique(tabset_data$site)
# ggplot function to implement in for loop
ggplot_fun <- function(site_name){
dat <- dplyr::filter(tabset_data, site == site_name)
p <- ggplot(dat, aes(x = year, y = numspp))+
geom_point() +
geom_line() +
labs(x = "Year", y = "Species richness") +
theme_classic()
}
# Create list of ggplots
plot_list <- set_names(map(site_list, ~ggplot_fun(.x)), site_list)
```
````
The next code chunk pulls the data together and renders the park and site tabs. The steps are described below:
cat("## Dynamic Tab Example {.tabset}", "\n") sets up tabsets for the next header level (###).
The outer for loop (for(i in seq_along(park_list))) iterates over parks. For each park, a tab with the park code is rendered via cat("### ", park, "{.tabset}", "\n"), and the sites in that park are pulled into park_site_list.
The inner for loop iterates over park_site_list. For each site, site_template is built with paste0() and contains the pieces that build out the rest of the report for that element of park_site_list.
The template opens the site's tab with "#### {{site}} \n\n" and adds a header via "##### Summary for {{site}}", "\n\n". It then contains a code chunk (starting with ```{r and ending with ```\n\n") that prints the site's element of the plot_list we set up in the previous chunk.
Each filled-in template is expanded with knitr::knit_expand(), knitted with knitr::knit(), and printed with cat().
Note the liberal use of line breaks via \n to make this all work. Tabsets need line breaks between themselves and anything that follows to render as a tab. Also, don't ask me why sometimes I used {{ }} and sometimes I used [ ]. It takes trial and error to make this thing work!
````
```{r net_tabs, echo = F, results = 'asis'}
cat("## Dynamic Tab Example {.tabset}", "\n")
for(i in seq_along(park_list)){
park <- park_list[i]
cat("### ", park, "{.tabset}", "\n")
park_site_list = sort(unique(tabset_data$site[tabset_data$park == park]))
for(j in seq_along(park_site_list)){
site = park_site_list[j]
site_template <-
paste0("#### {{site}} \n\n",
"##### Summary for {{site}}", "\n\n",
"```{r ", site, "_plot, echo = F, fig.cap = ('Trends in species richness in {{site}}'), out.width = '40%'}\n\n",
"print(plot_list[site])",
"\n\n",
"```\n\n")
site_tabs <- knitr::knit_expand(text = site_template)
cat(knitr::knit(text = unlist(site_tabs), quiet = TRUE))
}
}
```
````
To keep the intermediate .tex file when rendering to PDF (handy for troubleshooting LaTeX errors), add keep_tex to the YAML:
---
output:
  pdf_document:
    keep_tex: yes
---
Some plots in the examples are ggplot objects that were turned into plotly plots to be interactive.
The officeverse documentation covers the main packages that add more functionality for outputting .Rmd to Microsoft products, including Word and PowerPoint.
# Packages used in this section
pkgs <- c("tidyverse",
"flexdashboard",
"crosstalk",
"scales",
"DT",
"echarts4r",
#"echarts4r.maps", # used for demo, not needed by participants
"reactable",
"plotly",
"sparkline")
installed_pkgs <- pkgs %in% installed.packages()
if (length(pkgs[!installed_pkgs]) > 0) install.packages(pkgs[!installed_pkgs],dep=TRUE)
lapply(pkgs, library, character.only = TRUE)
In this section we will get you started with using R markdown to build an interactive dashboard. We will use R markdown’s flex dashboard template to build this dashboard because:
It is easier than creating a dashboard with R Shiny.
It is not R Shiny.
If you don’t understand why that second point is important, it’s okay. What IS important is that we will be using html widgets to make our dashboard interactive. When we “knit” the dashboard we will generate an interactive HTML file that can be used by anyone with access to a web browser. The lucky recipient of the dashboard will not need R knowledge or R software. The dashboard–a simple html file–will not be able to run statistical analyses or do any kind of calculations on-the-fly, but it will be able to sort and filter data and generate tables and plots based on user-selected subsets of data. This feature of a simple html dashboard can be useful when we have large data sets and don’t want to flip through 100 pages of plots to find the one we are looking for.
Here are screenshots of a dashboard made with only html widgets. The actual .Rmd file (park_visits.Rmd
) that built this dashboard can be found in the RMarkdown example folder associated with this training.
Reading about building a dashboard isn’t particularly useful, so we will just jump in and sink or swim. We’re going to open a new flex dashboard template, practice doing a few different dashboard layouts, then use the fake invasive data and plots to make a dashboard together.
To open a new flex dashboard template:
File > New File > R Markdown… [click]
From Template > Flex Dashboard [click on it, then click the OK button]
You should see something like this show up in your RStudio script pane.
Knit the file. You should see something like this.
If you look at the pre-populated code in your template and compare that to the html dashboard that resulted from knitting the file, you should start to get a feel for what the lines and hashmarks mean on the template.
Now make each of these changes and see what each one does:
Change the vertical layout from “fill” to “scroll”… then knit to see
Change the orientation from “columns” to “rows”… then knit to see
Look on this page to figure out how:
Just like ggplot2 has themes to choose from, so does flex dashboard. Choose one of the options below and change your dashboard theme (the image below the list shows you where to type it in your dashboard; heed the indentation!).
That’s really all you need to know to get started on a simple dashboard.
Now we’re going to load data and use some html widgets (just think of them like R
packages) to put an interactive table and two plots on your dashboard. For the table we will use DT::datatable()
. For the plots we will use the plotly
and echarts4r
packages. We will use the package crosstalk
to add user input boxes that will subset the data you see in the table and in the plotly
plot.
Use the setup chunk provided on the dashboard template to write the code for reading in a data set and loading the libraries we need. You can call the code chunk anything you want; there is nothing special about the name setup. In the example below I have read in the fake invasive plants data from the R Markdown > Parameters tab of the training website.
The DT::datatable() function is pretty easy to use and has some nice features for sorting, filtering, and displaying data tables with various themes. Just as important is the fact that it is one of the few html widgets that is currently compatible with crosstalk. So when a user makes data selections via a crosstalk input box, the datatable() function will update the table data to reflect the user selections. We will get the tables and plots on our dashboard, then add the crosstalk features last.
The DT
website shows us that the datatable()
function only requires one argument – a data object like a matrix or data frame. So:
In one of your dashboard “boxes” (e.g., Chart A, B, C or any additional boxes you created on your own), add the code line to create a datatable
. Make sure you put the code inside a code chunk, so R
recognizes it as code to execute. Don’t filter a subset of the data before you pass it to datatable()
– we will be creating user input boxes to filter the data based on user request.
Then run that one chunk to make sure your code works. To run a single chunk, click the green “sideways” triangle at the top right corner of the chunk. When you hover over it, you should see “Run code chunk”. You should see a table appear below the chunk (assuming you’re on the default flex dashboard setting of “Chunk Output Inline”–you can change this option by clicking on the gear wheel, if you would like).
To get the image below, I read the invdata
data frame into datatable()
.
At some point you can return to this table and do some number rounding, change column names, or do any other kind of formatting you would like. You will learn, in fact, that it's quite simple to generate a basic dashboard; formatting the dashboard is what will take up at least 90% of your dashboarding time (unless you don't care how good it looks).
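For example, a minimal sketch of that kind of cleanup (formatRound() is a standard DT helper; inv_cover is the invasive cover column used earlier):
# Drop row names and round the invasive cover column to 1 decimal place
datatable(invdata, rownames = FALSE) %>%
  formatRound(columns = "inv_cover", digits = 1)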
As a side note, I really like the reactable
package for making nested collapsing tables with informative icons and interactivity with plots (e.g., you can select a record on a table and it will highlight the corresponding point on a scatter plot). It takes a bit of time to learn but it’s worth checking out.
In another dashboard “box”, use plotly
to build a plot with the same data frame you passed to datatable()
. Again, don’t filter out a subset of the data–just pass the entire data frame to the plot function. For starters, the easiest way to do this is:
Use ggplot()
to make a plot from the data frame you passed to datatable()
. Assign that plot to a variable name. Then on the next line (after the plot code), pass your plot to the ggplotly()
function. So for example, you could write something like:
p <- ggplot(some_dataframe, aes(x = cycle, y = inv_cover)) + geom_point()
ggplotly(p)
Run that code chunk. You should see an interactive version of your ggplot. The plot will look a little different when it comes out of ggplotly()
–that’s an example of the potential formatting issues I mentioned.
As of now, the crosstalk package (which is pretty new) has three functions that can be used to filter the data based on user input. These functions are:
filter_select(), which creates a text box with a drop-down menu
filter_slider(), which creates a two-sided slider that can be used to specify a range of data
filter_checkbox(), which creates a set of checkboxes
All of these filter functions subset the data in the same way that dplyr::filter() works.
Before adding filter functions to our dashboard, we need to create a SharedData
object, which just means we need to wrap a data frame in the SharedData$new()
function. This data frame is the one that will be queried by the filter functions, and the resulting (subset) of data will be used by crosstalk
-compatible html widgets to update their plots and tables. On our current dashboard, the DT
and plotly
objects are crosstalk
-compatible. It’s possible to create different sets of filter functions to work on different data frames and compatible widgets, all on the same dashboard. The crosstalk
website has examples.
Follow the steps below to add a crosstalk
filter to your dashboard.
Create a new code chunk within your dashboard’s input sidebar (remember that was one of the earlier practice tasks?)? You can name your code chunk, but it’s not necessary.
In the code chunk, write code to pass a data frame to the SharedData$new() function and assign the output to a variable name. You should use the same data frame you passed to DT::datatable() and ggplotly(). For example:
dat_shared <- SharedData$new(invdata)
filter_select(id = "park", label = "Select a park", sharedData = dat_shared, ~park)
In the example above I’ve created a selection filter that I’ve named “park” (any unique ID is fine). I have written a label to prompt the user to select a park. I’ve told the function the name of the shared data frame to operate on. And lastly, I’ve identified the column in that data frame that the filtering will act on.
All together, the code chunk should look something like this:
Now pass the shared data frame to your DT::datatable()
and ggplotly()
functions (i.e., replace the old data frame name with the name of the shared data frame). We are telling these two functions to use the SHARED data frame to populate the table/plot.
Knit your dashboard to see if everything is working so far. You should see a user input box (or slider or checkbox, depending on what you chose) in the user sidebar on the left side of the dashboard. When you make a selection in that box you should see your table and plot update.
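If you would rather offer a slider or checkboxes, the other filter functions work the same way on the shared data frame (a sketch using the cycle and park columns from invdata; the id values are arbitrary):
# Two-sided slider on cycle, and checkboxes for park
filter_slider(id = "cycle", label = "Cycle range", sharedData = dat_shared, column = ~cycle)
filter_checkbox(id = "park_check", label = "Park", sharedData = dat_shared, group = ~park)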
This one is up to you. You can create a different kind of plotly figure (based on the same shared data frame) to put on your dashboard, or you can try out one of the very cool echarts4r plots with your data. Note that the echarts4r examples use |> to pipe their functions instead of %>%; either one will work. Also note that echarts4r is not yet compatible with crosstalk, but that's no reason not to use the widget on your dashboard. Interactive plots and tables can be useful even if they don't interact with crosstalk shared data frames.
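If you want to give echarts4r a try, here is a minimal sketch (the averaging step is just to keep the plot readable; the column names match the invdata used earlier):
library(dplyr)
library(echarts4r)
# Average invasive cover by park and cycle
inv_means <- invdata %>%
  group_by(park, cycle) %>%
  summarize(mean_cover = mean(inv_cover), .groups = "drop")
# echarts4r draws one series (line) per group
inv_means %>%
  group_by(park) %>%
  e_charts(cycle) %>%
  e_line(mean_cover) %>%
  e_tooltip(trigger = "axis")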
This page has links to many other really nice html widgets that you can use on your R markdown dashboards. Check them out!
As you become more familiar with coding in R, your code will become longer and more complex. You will have projects that you revisit and update each year. You may wish to share your code with others and allow them to make suggestions and contributions.
If your code was a Word document, this is the point where you would turn on Track Changes. Think of version control as Track Changes for your code. Version control is much more sophisticated and flexible, which means that it comes with a steeper learning curve. If you come away from this class feeling like you have no idea what you are doing, don’t despair! Embrace the learning curve and remember that you don’t have to be an expert in version control to take advantage of its most useful features.
…because we’ve all been here before. Version control is worth the learning curve because:
There are a variety of version control systems available. We will be using Git, since it is by far the most commonly used system and it is free and open source.
You’ll need the following installed and ready to use prior to this course:
This is the general workflow that version control follows. For the moment, ignore the stuff on the right - we’ll get to that soon.
Understandably, lots of people confuse Git and GitHub or use the terms interchangeably. They’re two separate (but related) things though. Git is the version control software that you install on your computer. It runs on your local hard drive and does all the heavy lifting when it comes to tracking changes to your code. GitHub is an online service for storing and collaborating on code. It has a nice user interface, and it is great for facilitating collaboration and project management. Even if you’re working on your own, it’s a great place to back up your local code. Git can be configured to sync your code to and from GitHub when you tell it to.
There are a few fundamental concepts that you should keep in mind as you are learning Git. It’s ok if they don’t fully make sense right now - we will keep revisiting them as we practice.
Git was originally developed to be used at the command line. You can still do everything in Git at the command line if you want to, but most people prefer a point-and-click user interface (often referred to as a Git client) for most tasks. For the purpose of this class, we’ll do a few things at the command line, but we’ll focus on using the RStudio Git client when possible. As you get more comfortable with Git, you may find it helpful to practice using Git at the command line in order to understand what’s going on under the hood. If not, that’s completely fine! If you want nothing to do with the command line, you can install a more fully-featured Git client. Search the list of approved software for “Git” to see which Git client(s) are available.
You can use Git completely independently from RStudio if you want to, but for the purposes of this class we’ll use RStudio’s built-in Git client. Without it, you have to type Git commands in the terminal. You can download other Git clients that do more than the one in RStudio, but we’ll stick with RStudio in this class since it’s simple and convenient.
Open RStudio. If you have a project open, close it out by going to File > Close Project. Look in the top right pane - at this point, the Git tab will not be present.
Go to Tools > Global Options. In the list on the left, select Git/SVN. Make sure that the Enable version control interface for RStudio projects box is checked and verify that everything looks like the screenshot below (your git.exe path may differ, that’s ok).
Click Apply, then OK. Now RStudio should be set up to use Git.
Now you need to connect your GitHub account. The easiest way to do this is to use the usethis
package.
Step one: create a personal access token
GitHub uses personal access tokens for authentication. Instead of entering your GitHub username and password into RStudio, you’ll create an access token and store that. A personal access token is more secure because you can revoke it at any time, you can set it to expire after a fixed amount of time, and it doesn’t provide full access to your account. However, you should treat it like a password: don’t share it with others, don’t hard-code it into a script, etc. If you must save it somewhere (not necessary), use a password manager like KeePass.
# This opens up a browser window where you can create a personal access token. It may ask you to log into GitHub or confirm your password.
usethis::create_github_token()
Once you are logged into GitHub, you should see a page that looks like this:
Put something descriptive in the Note field, like “RStudio”. Next, look at the Expiration dropdown. By default, tokens are set to expire after 30 days, at which point you will have to repeat this process. You have the option to set the expiration date to later or never. Use your best judgment. You can leave the scope settings as-is. Scroll to the bottom and click Generate token. The only time that GitHub will show you your personal access token is when you create it. So make sure you copy it to your clipboard so that you can finish connecting to GitHub from RStudio!
Step two: insert your personal access token in the Git credential store
Run the code below and when prompted, paste in your personal access token.
gitcreds::gitcreds_set()
If you already have Git credentials stored, gitcreds_set()
will prompt you to keep or replace them. Unless you have recently gone through the process of creating a personal access token, go ahead and allow it to replace them with your new access token.
The easiest way to create a new RStudio project with a Git repository is to go to File > New Project > New Directory > New Project and make sure to check the Create a git repository box before you click the Create Project button.
Alternately, you can use the usethis::create_project()
and usethis::use_git()
functions to initialize a project with a Git repository. usethis::use_git()
is also the easiest way to add a Git repository to an existing project.
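For reference, a minimal sketch of scripting that setup with usethis (the path is made up; usethis::use_github() can also create and connect the GitHub repository in one step, though below we walk through the manual approach so you can see what's happening):
usethis::create_project("C:/Users/yourname/Documents/git-practice")  # made-up path
usethis::use_git()       # initialize a local Git repository in the project
# usethis::use_github()  # optional: create and connect a GitHub repo in one step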
At this point you should have an empty project open in RStudio, and you should see a Git tab in the top right panel. Note that everything you have created so far exists only on your local hard drive - we haven’t done anything with GitHub yet.
Click on the Git tab. If you used the create package wizard, you’ll see two files with yellow question marks next to them.
The question marks mean that Git recognizes that you’ve created new files, but it is not tracking changes to them yet. It seems like it should start tracking change by default, like a word processor, but we don’t necessarily want Git to track every file we create in our project folder. Remember that eventually we’re going to put our repository on GitHub, in the cloud. GitHub has privacy settings, but even so we don’t want to store things like sensitive or unpublished data there. Information for connecting to a database is another example of something we might store in our project folder but wouldn’t want to publish to GitHub.
Before we do anything else, let’s make our first commit. Remember, think of a commit as a snapshot of our project at a given point in time. In the Git tab, click on the Commit button.
There’s not much to see here yet. Note the Staged column in the file list. In order to stage a file for commit, click the checkbox next to it. Go ahead and stage both files. Note the green A next to each file: this means that you’ve told Git to include the contents of the file in the next commit. A common misconception is that checking the Staged box actually creates a new commit. That’s not the case, though! You’ve composed your snapshot, but you haven’t pressed the shutter button.
To actually commit the staged files, enter something brief but descriptive in the Commit message box (“Initial commit” is typically used for the first commit). Then click on the Commit button. Now you’ve taken your snapshot! Git will give you some feedback, letting you know that your commit includes 17 new lines of code in two files, and two new files were created: .gitignore and git-practice.Rproj.
Click Close.
At this point, it’s time to create a home for your project on GitHub. Even if you never share your GitHub repository with anyone else, it’s a great place to back up your code. Log into GitHub and click the + in the top right corner. Select New repository.
Leave Repository template set to No template. Give your repository a name. Ideally this should match the name of your project, for your own sanity, but it doesn’t have to. Go ahead and include a very brief description of your project. For the purpose of this class, set your repository to Public. In general, use your best judgment when it comes to public vs. private repositories.
Things to consider: how easily do you want to share your code with others? Can your code benefit others? Is there a risk of accidentally committing sensitive information to the repository? Is there any sensitive information in your code? You should never ever hard-code things like login information, but something like code pertaining to sensitive species in a park might be worth keeping private.
In the Initialize this repository with: section, leave all the boxes unchecked. You already have a .gitignore file and you’ll create a README later. Don’t worry about a license for this code. Click Create Repository and copy the URL to your repository as shown below.
Now you have a local repository on your hard drive and a repository in GitHub (often referred to as a remote repository). Now it’s time to make them talk to each other. From the terminal tab in RStudio (still in your git-practice project), follow the “push an existing repository from the command line” instructions shown in your GitHub repository. Make sure to replace the URL with the URL to your own GitHub repository!
If you don’t see a terminal tab next to your console tab, go to Tools > Terminal > New Terminal.
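For reference, the commands GitHub displays for pushing an existing repository generally look like the following; the URL here is a placeholder, so substitute your own repository’s URL:
git remote add origin https://github.com/your-username/git-practice.git
git branch -M main
git push -u origin main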
If this is successful, you’ll see several lines of feedback in the terminal, ending in “Branch ‘main’ set up to track remote branch ‘main’ from ‘origin’.” Verify that everything worked correctly by refreshing the webpage for your GitHub repository.
Finally, look at the Git tab. The green and blue arrow buttons should no longer be grayed out and there should be no files listed as changed or added.
Existing project without local Git repo
If you have an existing project but you didn’t check the Create a git repository box when you created it, don’t despair. Make sure your project is open, then go to the Terminal tab, type git init
, and hit Enter. Now you can create and connect a GitHub repository as described above.
New project from existing GitHub repo
This is a very common workflow. This what you will do if someone shares a GitHub repository with you and asks you to contribute to it. You can also do this if you want to create a new GitHub repository first, then a new project. In GitHub, copy the repository URL:
Next, open RStudio and go to File > New Project. In the popup, select Version Control, then Git. Under Repository URL, paste the URL that you just copied from GitHub. Project directory name should default to the repository name. If not, just type it in. For Create project as subdirectory of:, select whichever folder you want to store your R projects in. Click Create Project. You’re done! You now have an R project containing the code from the GitHub repository. It’s already set up with a local repository and a connection to the GitHub repository. Make sure the owner of the GitHub repository has set the permissions so that you can push to it.
This is known as cloning a repository - you are copying the whole GitHub repository to your local hard drive, and setting up your new local repository to sync with the GitHub repository when you ask it to.
Version control is essential to effective collaboration on code projects, but it’s also an incredibly useful tool when you’re working by yourself. Working solo is also a great way to get more familiar with the version control workflow without the added complexity of collaborators.
If you are adding a new feature to your code, it’s good practice to create a new branch to work from. Examples of things you might create a branch for are a new analysis, a complex map or plot, or a few simple maps or plots. You might also create a new branch to fix or improve existing functionality.
Ideally, you should not work from the main branch. Instead, reserve it for the most recent version of your project that is known to work. Later in this lesson you’ll learn how to incorporate code from another branch back into the main branch.
To create a new branch from RStudio, go to the Git tab and click on the branch icon. Pick a concise but descriptive name for your new branch. For this class, call it birds-eda
since we’ll be doing some exploratory data analysis on a bird checklist.
Tip: it’s common to format branch names as all lowercase with words separated by dashes.
Leave the Remote dropdown set to “origin” and leave the Sync branch with remote box checked. This just means to use the GitHub repository we set up in the last lesson and automatically create a new branch there to match our local one. If you uncheck the box, you will not be able to push your branch to GitHub.
Click Create. Look at the Git tab. Your current branch (top left corner) should be the branch you just created. Click on the dropdown to see all of your branches. LOCAL BRANCHES refers to the branches in the repository on your local hard drive. REMOTE:ORIGIN refers to the branches in the connected GitHub repository. You can more or less ignore the stuff under REMOTE:ORIGIN and just focus on your local branches. If you do click on one of the REMOTE:ORIGIN branches, Git will try to sync any new changes from that GitHub branch to your corresponding local branch.
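If you prefer the terminal over the branch button, creating a new branch and its GitHub counterpart is roughly equivalent to:
git checkout -b birds-eda        # create the branch locally and switch to it
git push -u origin birds-eda     # create the matching branch on GitHub and track it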
As you saw when you created a new repository in GitHub, tracking changes in Git is a manual process. When you create a new file, you must tell Git to begin tracking it. Untracked files are designated by a yellow question mark.
To practice, go ahead and create a new R script. Paste in the following code and save it as birds.R
.
# Read in bird checklist
birds <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")
# Preview dataset
head(birds)
Time to stage and commit the new script. Click on the Commit button in the Git tab. Check the box in the Staged column. Notice that the code you just added is highlighted below in green. That’s Git’s way of indicating lines that were added since the last commit. birds.R
is a new file, so everything is green. Add a commit message and click Commit.
Notice the message in the Git tab: This means that you now have one commit in your local repository that doesn’t exist in your GitHub repository. We’ll make one more commit and then we’ll get in sync with GitHub.
We created this branch for exploratory data analysis, so let’s do some. We’ll change our current code to use the Tidyverse, then explore the dataset a little. Replace the contents of birds.R
with the following:
# Load packages
library(dplyr)
library(readr)
# Read in bird checklist
birds <- read_csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")
# Preview dataset
birds
# Verify that all species names are unique
paste("There are", nrow(birds), "rows in this dataset.")
paste("There are", nrow(distinct(birds)), "unique rows in this dataset.")
# How many species groups are there?
paste("There are", n_distinct(birds$species_group), "species groups.")
# Any missing data?
any(is.na(birds$species_group))
any(is.na(birds$primary_com_name))
# Count species in each species group
group_size <- birds %>%
group_by(species_group) %>%
summarize(species_count = n()) %>%
arrange(-species_count) %>%
ungroup()
# What are the largest species groups?
group_size
# What are the smallest species groups?
tail(group_size)
You know the drill at this point. Save the file and open the commit window (Git tab > Commit), but don’t check the box in the Staged column yet.
Notice the lines in red. Git parses a change to a line of code as the deletion of the original line and the addition of a new line with the change applied. Lines with no red or green highlighting are lines that haven’t changed. In longer files, Git will hide unchanged lines, so don’t panic if you see this - your code hasn’t disappeared.
Now we’re going to see why the two-part process of staging and committing is so useful. We did two things when we updated birds.R
- we updated the data import code, and then we did some exploratory data analysis (EDA). It would be nice for our commit history to reflect those distinct changes. Click on line 1 (# Load packages
), hold down the Shift key, and click on line 9 (birds
) to highlight the chunk of data import code that we changed. It should be selected in blue. At the top of the selection, click on the Stage selection button.
At the top of the code preview, toggle Show from Unstaged to Staged. Instead of staging all of our edits to birds.R
, we only staged the edits we highlighted. Go ahead and commit those staged changes.
Since we didn’t commit everything, our EDA code still shows up as unstaged. Stage it now and commit it, but don’t close out the Review Changes window.
In the top left of the Review Changes window, click on History. Here you can review your project’s commit history. Each row is a commit. Since we wrote helpful commit messages, it’s easy to get a sense of what has been done.
Right now we’re just looking at the history for the birds-eda
branch, but you can use the dropdown next to the History button to view history for other branches or for the whole project.
The colored labels indicate branches. HEAD -> refs/heads/birds-eda
means you are currently on the birds-eda
branch. If we switched to the main
branch, it would then be labeled HEAD -> refs/heads/main
and the tracked files in our project folder would change to reflect the state of the main
branch. Branches whose labels start with origin/
are the branches in your GitHub repository.
You can click on a commit to see what changes are part of that commit and which files were involved. The alphanumeric nonsense labeled SHA is just a unique identifier for that commit.
(If you’re interested, SHA stands for Secure Hash Algorithm. The contents of the commit get fed into an algorithm that generates a checksum, a unique identifier for that commit.)
Now that you’ve made a few commits, you should push your birds-eda
branch to GitHub. That way, it’s backed up in case of hard drive failure. It’ll also be easier to share the code with other people. Verify that you don’t have any uncommitted changes, then use the green Push arrow to push birds-eda
to GitHub. You can find this button in the Git tab or in the Review Changes (commit) window. Both buttons do the same thing. You’ll get some feedback that the branch was pushed; you can close that window once you verify that there are no error messages.
Note: Pushing a branch to GitHub only pushes commits made to that branch. If you have uncommitted work or work in other branches, it will not end up in GitHub.
In a typical project, you will repeat the process of staging, committing, and pushing a few times before your branch is ready to become part of the working, authoritative version of the project that is the main
branch. You’ll also want to make sure to test the code in your branch thoroughly to make sure that it works!
For the purposes of our demo project, we’ll say that we’re done with EDA and birds-eda
is ready to be included in main
. To do this, we merge birds-eda
into main
.
You must be on a branch (main
) to merge another branch (birds-eda
) into it. In the Git tab, click on the branch dropdown (it should currently say birds-eda), and switch to your local main
branch. You will now see a popup that strikes fear into the hearts of the uninitiated:
Don’t panic. When you switch to a different branch, the files in your project directory will update to reflect the series of commits that are part of that branch. There’s only one commit on main
- the initial one. You can verify this by clicking the History button in the Git tab. birds.R
doesn’t exist in the main
branch yet, but if you switch back to birds-eda
it will return.
Now, we get to feel like real hackers for a minute. Navigate to the Terminal tab in the bottom right pane of RStudio. Verify that you are on the main
branch, then type the below command. Pay attention to spacing, punctuation, and capitalization - typos are the most common reason that Git commands fail.
git merge birds-eda
Hit Enter. You should see some feedback in the terminal:
You can verify that the merge was successful by looking in the Files tab - you should now see birds.R
.
The last step is to get the main
branch in GitHub back in sync with the local main
branch. Same as you did with birds-eda
earlier, we’ll push our local branch to GitHub by clicking on the green Push arrow.
If you look at your commit history, you can see that all branches are now up to date with birds-eda
.
Congratulations! Now you have the skills you need to start using version control for your code projects.
At this point, the workflow from earlier should make a little more sense.
As you practice this workflow, here are some guidelines to keep in mind:
Don’t merge into the main branch if you know it’s broken (but if it happens, don’t panic - just fix the bug on a new branch and merge back into main).

Collaborating with others in a shared GitHub repository can feel complicated at first, but the learning investment will quickly pay off. The ability to easily identify your collaborators’ changes, review their code, and automatically combine your changes into a single authoritative version is well worth the initial challenges.
The first step to collaborating in a shared GitHub repository is to give everyone access to it. From your git-practice
repository in GitHub, navigate to Settings > Collaborators. Even if your repository is public, only collaborators that you designate will be able to push to it. Click on the green Add people button and enter the username(s) you would like to add. Your collaborator(s) will then receive a notification that they have been invited to your repository. Once they accept, they will be able to push to your repository.
Your collaborator(s) can now clone your GitHub repository, creating their own local Git repositories. When they push a branch to GitHub, it will update the corresponding branch in your GitHub repository. Follow the instructions for New project from existing GitHub repo at the bottom of the Git and RStudio tab of this website to clone a GitHub repository.
The good news is that most aspects of a collaborative workflow are more or less the same as if you were working on your own. Continuing with the bird checklist project, let’s do some data visualization. We’ll create one plot and one word cloud, splitting the work between two people. From the main
branch, one person will create a new branch called group-size-plot
and the other will create a branch called species-wordcloud
, making sure to leave the Sync branch with remote box checked.
To keep things simple, we’ll all continue to work from birds.R
.
Collaborator 1
On the group-size-plot
branch, modify birds.R
to load ggplot2 and create a histogram of species group size. You can just replace the contents of birds.R with the following. Notice that there’s a line at the top to load ggplot2 and a section at the bottom to create the histogram. In addition, we’ve added section headers to organize our code a bit.
#----Load packages----
library(dplyr)
library(readr)
library(ggplot2)
#----Read data----
# Read in bird checklist
birds <- read_csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")
#----Exploratory data analysis----
# Preview dataset
birds
# Verify that all species names are unique
paste("There are", nrow(birds), "rows in this dataset.")
paste("There are", nrow(distinct(birds)), "unique rows in this dataset.")
# How many species groups are there?
paste("There are", n_distinct(birds$species_group), "species groups.")
# Any missing data?
any(is.na(birds$species_group))
any(is.na(birds$primary_com_name))
# Count species in each species group
group_size <- birds %>%
group_by(species_group) %>%
summarize(species_count = n()) %>%
arrange(-species_count) %>%
ungroup()
# What are the largest species groups?
group_size
# What are the smallest species groups?
tail(group_size)
#----Group size histogram----
group_size %>%
ggplot(aes(x = species_count)) +
geom_histogram() +
ggtitle("Distribution of species group size") +
xlab("Number of species") +
ylab("Frequency")
Stage and commit your changes, and push the group-size-plot
branch to GitHub.
Collaborator 2
On the species-wordcloud
branch, we’ll modify birds.r
to load the ggwordcloud
package and make a wordcloud of the words that appear in bird species names. We’ll save the wordcloud as a PNG. Make sure you’re on the right branch before you make changes!
# Load packages
library(dplyr)
library(readr)
library(ggwordcloud)
library(stringr)
# Read in bird checklist
birds <- read_csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")
# Preview dataset
birds
# Verify that all species names are unique
paste("There are", nrow(birds), "rows in this dataset.")
paste("There are", nrow(distinct(birds)), "unique rows in this dataset.")
# How many species groups are there?
paste("There are", n_distinct(birds$species_group), "species groups.")
# Any missing data?
any(is.na(birds$species_group))
any(is.na(birds$primary_com_name))
# Count species in each species group
group_size <- birds %>%
group_by(species_group) %>%
summarize(species_count = n()) %>%
arrange(-species_count) %>%
ungroup()
# What are the largest species groups?
group_size
# What are the smallest species groups?
tail(group_size)
## Species description wordcloud
# Remove the type of bird (last word in the species name)
bird_words <- birds %>%
mutate(primary_com_name = str_replace(primary_com_name, "\\b(\\w+)$", ""),
primary_com_name = str_replace_all(primary_com_name, "-", " "),
primary_com_name = trimws(primary_com_name),
primary_com_name = tolower(primary_com_name)) %>%
filter(primary_com_name != "")
# Split species descriptions into single words
bird_words <- paste(bird_words$primary_com_name, collapse = " ") %>%
str_split(boundary("word"), simplify = TRUE) %>%
as.vector()
# Get frequency of each word
bird_words_df <- tibble(word = bird_words) %>%
group_by(word) %>%
summarise(freq = n()) %>%
arrange(-freq) %>%
filter(freq > 10) %>%
mutate(angle = 90 * sample(c(-1:1), n(), replace = TRUE, prob = c(1, 3, 1)))
# Make the word cloud
n_words <- 100 # Number of words to include
ggplot(bird_words_df[1:n_words,],
aes(label = word, size = freq,
color = sample.int(10, n_words, replace = TRUE),
angle = angle)) +
geom_text_wordcloud_area() +
scale_size_area(max_size = 20) +
theme_minimal() +
scale_color_continuous(type = "viridis")
ggsave("birds_wordcloud.png", bg = "white")
Stage and commit your changes to birds.R
, but don’t include the PNG. Leave it as-is with the yellow question marks; we’ll talk more about it later. Push the species-wordcloud
branch to GitHub.
At this point, we are ready for the main branch to reflect our new changes. This is where the workflow gets a little different than if you are working by yourself. You can’t just merge your changes into your local main
branch and push it, because the main
branch on GitHub could contain other people’s commits that you don’t have locally. It’s also a good idea to let your collaborators review your code before it gets merged into the main
branch of the shared GitHub repository, so don’t push your local main
branch to a shared GitHub repo!
Collaborator 1
Let’s work on incorporating the group-size-plot
branch into both the local and remote (GitHub) versions of main
.
First, we need to check whether anyone else has made changes to main
, so we’ll pull it. Switch to your local main
branch, then click on the blue Pull arrow in the Git tab. This retrieves the GitHub version of that branch and merges it into your local version. A Git Pull window should pop up and let you know that your branch is already up to date. Switch back to group-size-plot
.
Next, we’ll go to GitHub and create a pull request. This terminology is kind of confusing, because the goal of this pull request is to merge the group-size-plot
branch on GitHub into the main
branch on GitHub. Think of it this way: you are asking your collaborator to pull your branch, review it, and then merge it into main
if all looks good. Why isn’t it called a merge request? Because sometimes you just want feedback on your code and aren’t ready to merge it into main
.
Navigate to your repository on GitHub. You may see a notification that group-size-plot
was recently pushed. If so, you can click the Compare & pull request button in the notification.
Alternately, you can navigate to Pull requests > New pull request, select group-size-plot
in the Compare dropdown, then click Create pull request.
At this point, you can add a note to your pull request, tag a reviewer, and look over the differences between your branch and main
. It can be helpful to include a brief summary of what you did and indicate whether you think it’s ready to merge into main
or would just like a code review.
When you are ready, click on Create pull request.
Collaborator 2
In your GitHub repository, you’ll see a badge next to Pull requests with a “1” on it, indicating that there is one open pull request. Navigate to Pull requests and click on the one your collaborator just created.
You can look at the Commits and Files changed tabs to see what was done. It’s also a good idea to pull the group-size-plot
down to your local repository so that you can actually run the code and verify that it works for you. From the Git tab in RStudio, use the branch dropdown to switch to group-size-plot
. If you don’t see it in your local branches, look under Remote:Origin. Clicking on it will create a local version of the branch.
Run the code and verify that it works. When you’re satisfied, go back to the pull request review on GitHub. If you see a message that your review was requested, click on the green Add your review button at the top of the screen. This will take you to the Files changed page where you can make comments on specific lines of code and decide whether to approve merging it into the main branch, request changes before it is merged, or just provide general feedback.
For the purposes of this class, we’ll go ahead and approve the request and merge group-size-plot
into main
. In real life, you may want to request corrections or improvements before merging. Go back to the Conversation tab for the pull request, add any comments you would like, then click on the green Merge pull request button.
Now our local main
branch needs to be updated, so switch to it in RStudio and click Pull.
We’re almost ready to repeat the pull request process for species-wordcloud
, but first we need to make sure that the code in species-wordcloud
works with the latest updates to main
. To do this, we’ll merge our local main
branch into species-wordcloud
. Switch to species-wordcloud
in RStudio, then type the following at the terminal:
git merge main
Whoops - we’ve encountered a merge conflict. This happened because species-wordcloud
and group-size-plot
made different changes to the same lines in birds.R
. We just need to manually resolve the conflict, because Git doesn’t have any way of knowing which changes it should keep. To do this, we just edit the offending files (designated by an orange “U” in the Git tab). The first few lines of birds.R
now look like this:
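Conflict markers follow a standard pattern. A simplified sketch of what the top of birds.R might contain (your exact hunks may differ):
# Load packages
library(dplyr)
library(readr)
<<<<<<< HEAD
library(ggwordcloud)
library(stringr)
=======
library(ggplot2)
>>>>>>> main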
The code between <<<<<<< HEAD
and =======
is what’s in our current branch, species-wordcloud
. The code between =======
and >>>>>>> main
is what’s in the branch we’re trying to merge, main
. In this case, we want to keep all of it, so we’ll just delete <<<<<<< HEAD
, =======
, and >>>>>>> main
.
Scroll down to the bottom of the file, where the other merge conflict is, and repeat the same process. In this case we’re keeping the additions to birds.R
from both branches, but there will be cases where you need to decide which version to keep or modify your code to combine parts of both.
Sometimes merge conflicts can get confusing. You can start over by typing git merge --abort
at the terminal, which takes you back to before the merge conflict. Then just re-do the merge and try again.
When you are done resolving the merge conflict, commit your changes and push to GitHub. Now you can follow the steps above to create your own pull request. When that pull request has been merged, main
will contain the work of both collaborators.
You’ll often have files in your project that you don’t want to include in its permanent history. This could include things like data, database connection information, or user-specific configuration files.
Depending on the situation, you also may want to avoid tracking files generated by your code, since you can just recreate an up to date version of these files by rerunning your code.
Throughout this whole process, birds_wordcloud.png
has been floating around as an untracked file. We don’t really need to store it in the Git repository since we can just generate it from birds.R
, but it’s annoying that it’s sitting there with the yellow question marks.
This is where the .gitignore
file comes in handy. Open it up, and on a new line add the name of the file or folder you want Git to ignore - in this case, birds_wordcloud.png
. It will take effect when you save .gitignore
but you may need to click the refresh icon in the top right of the Git tab to see the files disappear from the status view.
Commit and push your changes to .gitignore
.
Tip: You can also use wildcard patterns to ignore multiple files that match a certain pattern, as in the example below.
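For example, a .gitignore might contain entries like these (illustrative - adjust to your own project):
# a specific file
birds_wordcloud.png
# every file with a .png extension
*.png
# an entire folder
data/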
Don’t rewrite history that you’ve already pushed, whether with git reset, git commit --amend, or the Amend previous commit box in RStudio. This will cause problems for your teammates.

Be careful with the --force option: force pushing or force pulling a branch will create headaches and possibly lost work if you do it wrong. Don’t do this unless you know what you’re doing, and if you do, make sure to talk to your collaborators if shared branches are affected.

As you learn Git, you will inevitably end up in a tight spot here and there. DO NOT PANIC. It is very hard to permanently lose your work in Git. It is very easy to get confused and think that you did. We’ll talk through some common issues and how to fix them, then talk about what to avoid when you’re starting out.
The great thing about Git is that it is hard to make a mistake that can’t be fixed. However, there are still a few ways to get yourself into trouble. These guidelines should keep you from doing irreparable damage as you are learning.
Be careful with the revert button in RStudio
DO NOT DELETE YOUR .git FOLDER
Deleting your .git folder deletes your entire local Git repository. If you have woven a tangled and confusing web of branches and commits, it may be tempting to just get rid of the whole repository, but if you do this you will likely lose some of your work as well as your only hope of fixing things.
Avoid git reset
The reset
command can actually be very useful, but misusing it is a good way to lose work and/or create headaches for collaborators. Practice with it and make sure you understand it before you use it in a repository that you care about. In the meantime, use revert
if you need to safely undo a commit.
Avoid git push --force
or git pull --force
This overwrites the branch that is being updated. As with reset
, just make sure you understand what you’re doing if you use the --force
option.
Chances are it’s because there are commits in the remote branch that aren’t on your branch. While pulling a branch will do an automatic merge, pushing it will not. You need to pull the remote branch, then push your local branch.
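In RStudio that just means clicking the blue Pull arrow and then the green Push arrow; from the terminal, the equivalent is:
git pull
git push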
Are you on the right branch? If not, switch branches. If you’re on the right branch, it’s because you didn’t commit. A common mistake is to stage changes and forget to hit the commit button. Remember, staging is like framing a snapshot and committing is pressing the shutter button.
Are you looking at the right branch in GitHub? If not, switch branches. If you’re on the right branch: did you remember to push? If you remembered to push: you probably have uncommitted changes (see above). Commit them and push again.
You probably have changes that haven’t been committed. When you pull or merge, Git combines your commit history with the commit history of the branch you are pulling or merging (remember, pull is just fetch remote branch + merge it). Then it actually changes your working directory files to reflect the new state of the branch. If all your changes have been committed, that’s not a problem. If you have uncommitted changes, they would get overwritten, so Git throws an error to protect you from losing work.
When you switch branches, Git does a similar thing - it changes the contents of your working directory to reflect the current state of that branch. If your changes aren’t committed, they would be overwritten.
Just commit anything with a blue M next to it and try again.
If you don’t want to commit your changes, you can use the git stash
command, which “stashes” your changes and then removes them from your working directory. Do the pull/merge/switch, then use git stash pop
to retrieve your uncommitted work.
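Putting that together, a typical stash workflow looks like this:
git stash        # set your uncommitted changes aside
git pull         # or merge, or switch branches
git stash pop    # bring the stashed changes back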
Take a deep breath. This is confusing at first but it’s not so bad. Merge conflicts happen when you and another person make different changes to the same line of code. They can usually be avoided by communicating with your collaborators and making sure you’re not working on the same thing. They still happen though, and the fix is simple: just edit the offending file to look like it should. Find the markers indicating the conflict, then keep the version you want to keep (or keep both, or make some edits). Then save, stage, and commit. See the previous tab for more information.
For simple mistakes, just fix it and commit. If a more complex change broke your code and you can’t fix it, you can revert the commit where the change occurred (this is a good reason to be thoughtful about how you organize and comment your commits). Look at your commit history and find the SHA (unique identifier) for the offending commit(s). You don’t have to copy the full code - just the first 7 characters should be enough. In the Terminal tab, type the following, replacing xxxxxxx
with the unique identifier for your commit.
git revert xxxxxxx
This will create a new commit that reverses the changes made in commit xxxxxxx
. If you want to revert a series of commits, use git revert xxxxxxx..yyyyyyy
where yyyyyyy
is the unique identifier of the most recent commit in the series.
Once you’ve moved on from writing long stand-alone R scripts to writing custom functions and iterating your workflows, the next logical step is to start building your own custom package to more easily apply, document, and share your code.
There are multiple ways to build a package, and it just keeps getting easier thanks to RStudio. The steps below are the ones that most consistently have worked for me as of January 2022.
Once those steps are completed, check that it worked by going to the Git tab to pull from GitHub. If the down and up arrows are grayed out, something went wrong. If they look like the image below, and you can pull down from GitHub, then you’re all set.
The easiest way to create a new package in RStudio is using the usethis package. You’ll first need to have a GitHub account and have RStudio connected to your GitHub account. Once that’s working, you can run the code below in your console.
usethis::create_package("D:/NETN/R_Dev/testpackage") # update to work with your file path
usethis::use_git() # sets up local git for new package
usethis::use_github() # creates new github repo called testpackage.
usethis::use_mit_license() # set license to MIT license (or use a different license.)
usethis::git_default_branch_rename() # renames master to main
The basic building blocks of an R package are defined in the list below. At a bare minimum, there are 2 files and 2 folders that make up an R package. The 2 required files are DESCRIPTION and NAMESPACE. The two folders are “R” and “man”. Several additional files that improve documentation and git workflow are also added by RStudio’s package template. Your Files pane should look something like this:
If there are files or folders you don’t want tracked, such as a data folder, add data/
to the .gitignore. If you want all files with a certain extension to be ignored, like PNGs, you can write *.png
.
The last step before we get to start adding to our R package is to make sure the Build Tools are set up and functioning properly.
#' @title hello
#' @description test package for R training
#' @export
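For context, the hello.R created by RStudio’s package template contains a single toy function; with the roxygen header added, the whole file would look roughly like this:
#' @title hello
#' @description test package for R training
#' @export
hello <- function() {
  print("Hello, world!")
}
After rebuilding the package (Install and Restart in the Build tab), check the help file: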
?testpackage::hello
If the text you added shows up in the help file, your Build tools are all set. If the build exited with status 1, then something is wrong with the roxygen text in your hello.R file. Review the build results to see if you can find the line identified as failing. Also check that each line you added starts with both comment symbols (#'
), and that the terms following the @ are spelled correctly and have no space after the @.
The last thing to do is delete the hello.R file to clean up the package. You don’t need to delete the hello.Rd file, as it will be deleted the next time you rebuild your package.
Now we get to add to our package and make it useful! We’re going to add a simple function to the package that I use all the time for my workflow. It’s a function that takes the first 3 letters of a genus and species to create a species code. It saves me having to type out full species names when I’m filtering through a lot of data.
To follow along, go to File > New R Script (or key Ctrl + Shift + N) and copy the code below to the script.
#' @title make_sppcode
#' @description Make a 6-letter code with first 3 letters of genus and species
#'
#' @importFrom dplyr mutate select
#' @importFrom stringr word
#'
#' @param data Name of data frame that contains at least one column with Latin names
#' @param sppname Quoted name of the column that contains the Latin names
#'
#' @return Returns a data frame with a new column named sppcode.
#' @export
make_sppcode <- function(data, sppname){
data$genus = word(data[,sppname], 1)
data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
substr(species, 1, 3))))
data2 <- select(data, -genus, -species)
return(data2)
}
Note that in the Roxygen code above, we added the title, description, and export like we did for hello.R, plus a few more tags: imports, params, and return.
We also added 2 imports, which are dependencies of your R package. The first line imports mutate and select from the dplyr package; the second imports the word function from the stringr package. By adding these 2 lines to the Roxygen, these functions become part of the package’s namespace (more on that later) and can be used by any function in your package.
If you use all base R functions within the functions of your package, you don’t need to use imports. In general, best coding practices are to minimize the number of dependencies to reduce the number of packages a user needs to install before using your package, and make it less likely that your package code will break because a dependency was updated. I use them here to show you the workflow when you need dependencies (e.g., it’s hard for me not to want dplyr at some point in a package). Another note is that @importFrom
will only add the functions for that package in the Namespace, so you’re less likely to have conflicts with other packages. If you want to make the entire package available to the package Namespace (e.g., I’ve done this with ggplot2), then you’d write: #' @import ggplot2
Parameters are where you define the inputs to your function. If an input only takes certain arguments, like TRUE/FALSE, or a list of park codes, @param
is how you document that to the user. Note that if your package functions share the same parameters, you can inherit parameters from other functions, instead of having to copy/paste them across functions by adding #' @inheritParams make_sppcode
.
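As a sketch, a second, hypothetical function could reuse make_sppcode’s parameter documentation like this:
#' @title count_sppcodes
#' @description Hypothetical helper that tallies the species codes created by make_sppcode
#' @inheritParams make_sppcode
#' @export
count_sppcodes <- function(data, sppname){
  table(make_sppcode(data, sppname)$sppcode)
}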
The @return argument tells the user what to expect as the output of the function. The @export argument tells R to export that function into the NAMESPACE file.
There’s one last piece of documentation we need to complete before dependencies will be installed along with your package. Open the DESCRIPTION file. It should look like:
You’ll want to update the Title, Author, Maintainer, and Description, which are pretty self-explanatory. As you update your package, you’ll also want to update the Version number. Next we need to add the Imports and Suggests to the DESCRIPTION, which are defined below.
Imports: Packages listed under Imports will be installed at the same time your package is installed. You can also set the minimum version number that, if users don’t have, will be installed.
Suggests: These packages are not installed at the time your package is installed. Suggests are helpful for external packages that are only used by one or a few functions in your package. For example, one of our packages has a function that imports data directly from our SQL Server, but only a few network staff can access the server. The external packages that the SQL import function uses are listed under Suggests. The SQL import function then checks to see if the suggested packages are installed on the user’s computer. If not, it will stop and print an error that it needs to be installed. We’ll show that workflow later.
You can either manually add these to the DESCRIPTION file like:
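If you add them by hand, the relevant DESCRIPTION fields end up looking something like this:
Imports:
    dplyr,
    stringr
Suggests:
    ggplot2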
OR, you can use the usethis package to do the heavy lifting!
usethis::use_package("dplyr") # for imports which is the default
usethis::use_package("stringr") # for imports which is the default
usethis::use_package("ggplot2", "Suggests") # for suggests
Note also that the License should be MIT + file LICENSE, if you followed the usethis workflow we showed earlier to create the package. I don’t know a lot about licenses, other than that it’s best practice to set one. The MIT license is the most common permissive license: it means your code is open source and allows anyone to copy it with minimal restrictions. If you want all derivatives of your code to be open source, the GPLv3 license is the most common choice (usethis::use_gpl_license()
).
We’re finally ready to document the package (note you could have done it after each step). Go to the Build tab and click “Install and Restart” (or Ctrl + Shift + B). Assuming the roxygen and DESCRIPTION were written correctly, you should now see a make_sppcode.Rd in the man folder. You can also check that help works for the function:
?testpackage::make_sppcode
Open your NAMESPACE file. It should look like this:
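Given the tags we’ve used so far, the roxygen-generated NAMESPACE should read roughly like this:
# Generated by roxygen2: do not edit by hand
export(make_sppcode)
importFrom(dplyr,mutate)
importFrom(dplyr,select)
importFrom(stringr,word)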
The NAMESPACE file should contain all of the functions you’ve built for your package as exports, along with all of the external dependencies you are using within your functions as imports. As you add more functions and dependencies, they are added here each time you rebuild your package. You can also store data in the namespace, which can then be accessed by your package functions.
The concept of a namespace is a special beast and can be a bit hard to wrap your head around. In a nutshell, each package has its own environment that contains all of the package’s functions, dependencies, and objects (e.g., data) that have been defined for that package. This environment is separate from your global environment. When you load a package in your session, the package’s environment is accessible, but only through its functions. For example, dplyr is a dependency of our testpackage. When we load testpackage (e.g., library(testpackage)
), testpackage’s functions can use dplyr. However, if we need dplyr outside of testpackage functions, we have to load it ourselves first.
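A quick sketch of the distinction, assuming testpackage is already built and installed:
library(testpackage)                # testpackage's own functions can now use dplyr internally
# mutate(mtcars, kpl = mpg * 0.425) # would fail: dplyr isn't attached to your search path yet
library(dplyr)                      # attach dplyr explicitly when you need it directly
mutate(mtcars, kpl = mpg * 0.425)   # now this works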
Now that the documentation is all set, let’s test that the make_sppcode()
function actually works! Try running the code below.
library(testpackage)
example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa",
"Malaxis unifolia", "Calopogon tuberosus"),
cover = c(10, 40, 10, 50),
stems = c(50, 20, 10, 10))
example_dat2 <- make_sppcode(example_dat, "Latin_Name")
head(example_dat2)
## Latin_Name cover stems sppcode
## 1 Carex limosa 10 50 CARLIM
## 2 Arethusa bulbosa 40 20 AREBUL
## 3 Malaxis unifolia 10 10 MALUNI
## 4 Calopogon tuberosus 50 10 CALTUB
While examples are not required, they are by far the best way to help users understand how to use your functions. They’re also breadcrumbs for future you as a reminder of how a function works. Examples work best when you first create a simple fake data set to run through the function; that way, a user can easily reproduce and run the code on their own machine. We just created the example we’re going to add while testing the function above. The code chunk below shows how to add it. Note that if you want to show an example that takes a long time to run, and you don’t want it to run while building or checking the package, you can wrap it in \dontrun{ example code here }
.
#' @title make_sppcode
#' @description Make a 6-letter code with first 3 letters of genus and species
#'
#' @importFrom dplyr mutate select
#' @importFrom stringr word
#'
#' @param data Name of data frame that contains at least one column with Latin names
#' @param sppname Quoted name of the column that contains the Latin names
#'
#' @return Returns a data frame with a new column named sppcode.
#'
#' @examples
#' library(testpackage)
#'
#' example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa",
#' "Malaxis unifolia", "Calopogon tuberosus"),
#' cover = c(10, 40, 10, 50),
#' stems = c(50, 20, 10, 10))
#'
#' example_dat2 <- make_sppcode(example_dat, "Latin_Name")
#' head(example_dat2)
#'
#' @export
make_sppcode <- function(data, sppname){
data$genus = word(data[,sppname], 1)
data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
substr(species, 1, 3))))
data2 <- select(data, -genus, -species)
return(data2)
}
The last thing you need to do before posting your package to GitHub for others to use is to run the R CMD check. You can do this one of three ways: click the Check button in the Build tab, press Ctrl + Shift + E, or run devtools::check().
The R CMD check runs through your package to ensure all required and suggested files exist, identifies potential typos/errors in the Roxygen2 code for each function, lets you know about imports you may have forgotten to add to your NAMESPACE, etc. It’s best practice to run this check before sharing/posting your package for others to use.

Thoughtful and thorough error handling makes your package user friendly. Coding best practice is to put as many checks as possible at the beginning of your function to catch common (or even uncommon) issues that are likely to happen, and to have a clear error or warning message for each check. This is often referred to as “fail early”. That way, the user won’t be waiting for code to run, only for it to fail a few minutes later with a vague or misleading error message. If you don’t have error handling in your function, the error the user sees often comes from the first external function that failed (which does have built-in error handling), rather than from the line of code in your own function where things actually went wrong.
We’re going to add one more function to our testpackage so we can talk about suggests (i.e., suggested packages that aren’t automatically installed when your package is installed) and error handling. The function will be called theme_IMD and will specify a custom theme for ggplot2, which is one of our suggests. Open a new script, name it theme_IMD.R, and copy the code chunk below into it.
#' @title theme_IMD: custom ggplot2 theme for plotting
#'
#'
#' @description This is a custom ggplot2 theme that removes the default panel grids
#' from ggplot2 figures, and makes the axes and tick marks grey instead of black.
#'
#' @return This function must be used in conjunction with a ggplot object, and will return a ggplot object with the custom theme.
#'
#' @examples
#' example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa",
#' "Malaxis unifolia", "Calopogon tuberosus"),
#' cover = c(10, 40, 10, 50),
#' stems = c(50, 20, 10, 10))
#' library(ggplot2)
#' p <- ggplot(data = example_dat, aes(x = cover, y = stems)) +
#' geom_point() +
#' theme_IMD()
#' p
#'
#' @export
theme_IMD <- function(){
# Check that suggested package required for this function is installed
if(!requireNamespace("ggplot2", quietly = TRUE)){
stop("Package 'ggplot2' needed for this function to work. Please install it.", call. = FALSE)
}
ggplot2::theme(panel.grid.major = ggplot2::element_blank(),
panel.grid.minor = ggplot2::element_blank(),
panel.background = ggplot2::element_rect(color = '#696969', fill = 'white', size = 0.4),
plot.background = ggplot2::element_blank(),
strip.background = ggplot2::element_rect(color = '#696969', fill = 'grey90', size = 0.4),
legend.key = ggplot2::element_blank(),
axis.line.x = ggplot2::element_line(color = "#696969", size = 0.4),
axis.line.y = ggplot2::element_line(color = "#696969", size = 0.4),
axis.ticks = ggplot2::element_line(color = "#696969", size = 0.4)
)}
Notice the line that checks whether ggplot2 is installed on the user’s machine. If ggplot2 isn’t installed, the function fails immediately and prints “Package ‘ggplot2’ needed for this function to work. Please install it.” in the console. This check only happens in functions that include this code. Note that for suggests, you also have to use the package:: prefix to call their functions. This is always a messy business for ggplot2…
There are tons of possible checks you can do. I often peek under the hood of well-designed packages to see what types of checks the pros actually use. You can view the code under the hood of a function by putting your cursor on the function name and pressing the F2 key.
Some examples of checks I commonly use are match.arg()
, which makes sure arguments match between what the function allows and what the user specified. The stopifnot(class(argument) == 'classtype')
is helpful for ensuring that numeric or logical arguments are specified properly. Other checks I often include make sure the data sets that the function uses exist in the global environment. The code below is an excerpt from a function in our forestNETN package that compiles tree data. The first part of the code checks the arguments specified by the user. The tryCatch()
looks for the COMN_TreesByEvent object. If it exists, it is assigned to tree_vw. If it doesn’t, the function exits and prints the error quoted in the stop()
into the console.
joinTreeData <- function(park = 'all', from = 2006, to = 2021, QAQC = FALSE, locType = c('VS', 'all'), panels = 1:4,
status = c('all', 'active', 'live', 'dead'),
speciesType = c('all', 'native','exotic', 'invasive'),
canopyPosition = c("all", "canopy"), dist_m = NA,
eventType = c('complete', 'all'), output = 'short', ...){
# Match args and classes
status <- match.arg(status)
park <- match.arg(park, several.ok = TRUE,
c("all", "ACAD", "MABI", "MIMA", "MORR", "ROVA", "SAGA", "SARA", "WEFA"))
stopifnot(class(from) == "numeric", from >= 2006)
stopifnot(class(to) == "numeric", to >= 2006)
locType <- match.arg(locType)
stopifnot(class(QAQC) == 'logical')
stopifnot(panels %in% c(1, 2, 3, 4))
output <- match.arg(output, c("short", "verbose"))
canopyPosition <- match.arg(canopyPosition)
speciesType <- match.arg(speciesType)
# Check for tree data in global environment
tryCatch(tree_vw <- COMN_TreesByEvent,
error = function(e){stop("COMN_TreesByEvent view not found. Please import view.")}
)
}
Error handling could take an entire day or more to cover fully, but that’s about all we have time for today. For more detail, Chapter 8 Conditions in the Advanced R book covers this topic quite thoroughly. Another useful resource is Chapter 2.5 Error Handling and Generation in Mastering Software Development in R.
Debugging is another big topic that we only have time to scratch the surface of. For further reading, the best resource I’ve found on debugging is Chapter 22 Debugging in the Advanced R book.
The simplest form of debugging is to load the dependencies and define objects in your global environment that will feed into the function, and then to run the code in the function under the hood. A simple example with the make_sppcode function is in the code chunk below. Note that I commented out the lines that start and end the function.
# dependencies
library(stringr)
library(dplyr)
#function args
data <- example_dat
sppname <- "Latin_Name"
#make_sppcode <- function(data, sppname){
data$genus = word(data[,sppname], 1)
data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
substr(species, 1, 3))))
data2 <- select(dat, -genus, -species)
## Error in select(dat, -genus, -species): object 'dat' not found
# return(data2)
#}
In the example above, we found that the line with select(dat, -genus, -species)
had a typo: dat should have been data.
traceback()
There are several other built-in R functions that can help with debugging. The two I use the most often are traceback()
and debug()
. To show how traceback()
works, let’s create a function that we know has an error. Copy this code to your R session and run it. You should see make_sppcode_error show up in your global environment after you run it.
make_sppcode_error <- function(data, sppname){
data$genus = word(data[,sppname], 1)
data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
substr(species, 1, 3))))
data2 <- select(dat, -genus, -species)
return(data2)
}
Now try to use the function:
make_sppcode_error(example_dat, sppname = "Latin_Name")
## Error in select(dat, -genus, -species): object 'dat' not found
It should fail, and the error message tells you that object ‘dat’ was not found. You could then go look under the hood in your function to try to find where dat lived. Or, you can use traceback()
, which shows you the code and line number that failed. If you have functions from your package that this function uses, and the other function is what failed, traceback()
will tell you that too. Run the code below to see for yourself.
traceback()
debug()
The debug()
function allows you to look under the hood of a function and steps through the function code one line at a time. You can see the outputs of each line, and even interact with/change them to test how the function behaves. Once you get the hang of it, you’ll never go back to the low-tech debugging approach I described first.
The code chunk below shows how to start using the debug()
function to step through the make_sppcode_error()
function. It’s hard to show with R Markdown, but we’ll demo how to walk through the browser in debug()
in a minute. Once you run the code below, your console will show a message that starts with “debugging in: make_sppcode_error(data, "Latin_Name")”. You’ll also see a Browse[2]>
prompt, where you can enter one of several options: n runs the next line, s steps into a function call, f finishes the current loop or function, c continues normal execution, where prints the call stack, and Q exits the browser and returns to the console.
debug(make_sppcode_error)
make_sppcode_error(data, "Latin_Name")
In our case, we’ll enter n
and step our way through the function, printing the head(data)
to make sure it looks the way we expect. Eventually we’ll find that the function fails on the select(dat, ...)
line. Then we’ll exit out by pressing Q
or c
.
The officeverse covers the main packages that add more functionality for outputting .Rmd to Microsoft products, including Word and PowerPoint.
This tab prints all of the code chunks in this file in one long file.
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
rm(list = ls())
rm(list = ls())
rm(list = ls())
rm(list = ls())
#--------------------
# Prep
#--------------------
packages <- c("tidyverse", "DBI", "odbc", "readr", "dbplyr", # Reading Databases
"GADMTools", "sf", "tmap", "rnaturalearth", "rnaturalearthdata", # GIS in R
"readxl", "rvest", "htmltab", "stringr", "jsonlite", "httr", "geojsonsf", # Web services
"dataRetrieval", "lubridate", "jsonlite", "httr", # Aquarius
"rFIA" # for USFS FIA data
)
install.packages(setdiff(packages, rownames(installed.packages())))
install.packages('terra', repos='https://rspatial.r-universe.dev')
install.packages('tmap')
packages <- c("tidyverse", "DBI", "odbc", "readr", "dbplyr", # Reading Databases
"GADMTools", "sf", "tmap", "rnaturalearth", "rnaturalearthdata", # GIS in R
"readxl", "rvest", "htmltab", "stringr", "jsonlite", "httr", "geojsonsf", # Web services
"dataRetrieval", "lubridate", "jsonlite", "httr", # Aquarius
"rFIA")
installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them
lapply(packages, library, character.only = TRUE)
packages <- c("tidyverse", "parallel", "microbenchmark",
"profvis", "modelr", "lubridate", "tidyselect")
install.packages(setdiff(packages, rownames(installed.packages())))
packages <- c("tidyverse", "parallel", "microbenchmark", "profvis",
"modelr", "lubridate", "tidyselect")
installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them
lapply(packages, library, character.only = TRUE)
packages <- c("rmarkdown", "tidyverse", "knitr", "kableExtra", "DT",
"tidyverse", "flexdashboard", "crosstalk", "leaflet",
"DT", "echarts4r", "reactable", "plotly", "sparkline",
"dygraphs")
install.packages(setdiff(packages, rownames(installed.packages())))
packages <- c("rmarkdown", "tidyverse", "knitr", "kableExtra", "DT",
"tidyverse", "flexdashboard", "crosstalk", "leaflet",
"DT", "echarts4r", "reactable", "plotly", "sparkline",
"dygraphs")
installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them
lapply(packages, library, character.only = TRUE)
devtools::install_github("haozhu233/kableExtra")
packages <- c("devtools", "usethis", "roxygen2","stringr", "dplyr")
install.packages(setdiff(packages, rownames(installed.packages())))
packages <- c("devtools", "usethis", "roxygen2", "stringr", "dplyr")
installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them
lapply(packages, library, character.only = TRUE)
library(DBI)
library(odbc)
library(readr)
library(magrittr)
library(dplyr)
library(dbplyr)
#----------Connect to Access------------#
db_path <- "data/Trees.accdb"
conn_string <- paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", db_path)
conn <- dbConnect(odbc(), .connection_string = conn_string)
#---------------Read the data---------------#
data <- tbl(conn, "tbl_Trees") %>%
collect()
# Common tidying operations:
data <- mutate_if(data, is.character, trimws) %>%
mutate_if(is.character, dplyr::na_if, "")
# Close the connection as soon as you no longer need it, otherwise you will get weird errors
dbDisconnect(conn)
#----------Option 1: Hard code db connection info (not great)------------#
sql_driver <- "SQL Server Native Client 11.0"
my_server <- "SERVER\\NAME"
my_database <- "Database_Name"
conn <- dbConnect(Driver = sql_driver, Server = my_server, Database = my_database, Trusted_Connection = "Yes", drv = odbc::odbc())
#----------Option 2: Read db connection info from shared drive------------#
# This is just a csv with columns for Driver, Server, and Database
params <- read_csv("path/to/csv/on/shared/drive/database-conn.csv")
params$Trusted_Connection <- "Yes"
params$drv <- odbc()
conn <- do.call(dbConnect, params)
#----------Option 3: Create a user DSN (easy to use in R, but has to be configured for each user on each computer)------------#
# Create DSN: https://docs.microsoft.com/en-us/sql/relational-databases/native-client-odbc-how-to/configuring-the-sql-server-odbc-driver-add-a-data-source?view=sql-server-ver15
dsn_name <- "myDSN"
conn <- dbConnect(odbc(), dsn_name)
#---------------Read the data---------------#
data <- tbl(conn, in_schema("analysis", "Chemistry")) %>% # The first argument to in_schema is the schema name (SQL default is dbo) and the second is the table name.
collect()
# Common tidying operations:
data <- mutate_if(data, is.character, trimws) %>%
mutate_if(is.character, na_if, "")
# Close the connection as soon as you no longer need it, otherwise you will get weird errors
dbDisconnect(conn)
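Before disconnecting, you can also let dbplyr push filtering to the database so only the rows you need are transferred. A minimal sketch, assuming the connection is still open and that the Chemistry table has a (hypothetical) SampleYear column:
# dplyr verbs on a tbl() are translated to SQL and run server-side until collect() is called
chem_recent <- tbl(conn, in_schema("analysis", "Chemistry")) %>%
  filter(SampleYear >= 2020) %>% # hypothetical column; adjust to your schema
  collect()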
library(GADMTools) # for downloading high res administrative boundaries
library(rnaturalearth)# for natural earth download functions
library(rnaturalearthdata) # for cultural and physical boundaries of diff. res.
library(sf) # for working with simple features
library(tmap) # for mapping spatial data
install.packages('terra', repos='https://rspatial.r-universe.dev')
library(GADMTools)
# Level 0: National
gadm_sf_loadCountries("USA", level = 0, basefile = "./data/")
us_bound <- readRDS("./data/USA_adm0.sf.rds")
us_bound_utm <- st_transform(st_as_sf(us_bound, crs = 4326), 5070)
st_write(us_bound_utm, "./data/US_Boundary.shp", append = F)
# Level 1: State/Province
gadm_sf_loadCountries("USA", level = 1, basefile = "./data/")
us_states <- readRDS("./data/USA_adm1.sf.rds")
us_state_utm <- st_transform(st_as_sf(us_states, crs = 4326), 5070)
st_write(us_state_utm, "./data/US_States.shp", append = F)
# Level 2: County/district
gadm_sf_loadCountries("USA", level = 2, basefile = "./data/")
us_cnty <- readRDS("./data/USA_adm2.sf.rds")
us_cnty_utm <- st_transform(st_as_sf(us_cnty, crs = 4326), 5070)
st_write(us_cnty_utm, "./data/US_Counties.shp", append = F)
library(GADMTools)
library(sf)
# Level 0: National
us_bound_utm <- st_read("./data/US_Boundary.shp")
# Level 1: State/Province
us_state_utm <- st_read("./data/US_States.shp")
# Level 2: County/district
us_cnty_utm <- st_read("./data/US_Counties.shp")
plot(us_state_utm[1])
library(rnaturalearth)
library(rnaturalearthdata)
lakes <- ne_download(scale = 10, # large scale
category = "physical",
type = 'lakes_north_america', # a named file on their website
destdir = paste0(getwd(), "/data"), # save to working dir.
returnclass = 'sf' # return as sf instead of sp
)
lakes_utm <- st_transform(lakes, crs = 5070)
lakes_utm <- st_read("./data/lakes.shp")
# Load park tiles
NPSbasic = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
NPSimagery = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
NPSslate = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
NPSlight = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
# Map
mn_cnty <- st_as_sf(us_cnty_utm[us_cnty_utm$NAME_1 == "Minnesota",])
mn_map <- tm_shape(mn_cnty, projection = 5070) +
tm_borders(col = "black")
mn_map_lakes <- mn_map +
tm_shape(lakes_utm) +
tm_fill(col = 'lightblue')
mn_map_lakes
tmap_options(basemaps = c(Map = NPSbasic,
Imagery = NPSimagery,
Light = NPSlight,
Slate = NPSslate))
tmap_mode('view') # set to interactive mode
mn_map
tmap_mode('plot') # return to static mode
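If you also want a standalone copy of the map, a minimal sketch using tmap_save() (file names are hypothetical):
tmap_save(mn_map_lakes, filename = "mn_lakes_map.png")  # static image
tmap_save(mn_map_lakes, filename = "mn_lakes_map.html") # interactive HTML widget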
# Packages used in this section
pkgs <- c("tidyverse",
"readxl",
"rvest",
"htmltab",
"stringr")
installed_pkgs <- pkgs %in% installed.packages() # check which packages are not yet installed
if (length(pkgs[!installed_pkgs]) > 0) install.packages(pkgs[!installed_pkgs],dep=TRUE) # if some packages are not yet installed, go ahead and install them...
lapply(pkgs, library, character.only = TRUE) # ...then load all the packages and their dependencies
### NOTES:
# 1. We can feed the URL directly into the `readr::read_csv()` function instead of first assigning it to a variable name.
# 2. For other types of flat files (e.g., fixed width or tab-delimited), simply replace `read_csv` with the relevant function
# Specify URL where file is stored
file_loc <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv"
# Read in the file from the URL as we would read in a file stored on our computer
dat <- readr::read_csv(file_loc)
head(dat)
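For instance, a tab-delimited file would be read the same way with readr::read_tsv() (the URL below is just a placeholder):
# dat_tsv <- readr::read_tsv("https://example.com/some_data.tsv")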
### NOTES:
# 1. Setting the `download.file()` `method` argument to `libcurl` allows access to ftps:// URLs and also allows simultaneous downloads, i.e., the `url` and `destfile` arguments can have character vectors with multiple items
# 2. Check out `?download.file` for other useful arguments, e.g., setting `timeout` length
# Specify URL where file is stored (alternatively, we can feed the URL and destination paths directly into the `download.file()` function)
file_loc <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv"
# If we want to save the file in a subfolder of the current working directory, make sure the subfolder exists
if(!dir.exists("./downloads")) dir.create("./downloads")
# Specify the destination path where file should be saved
dest <- paste0("./downloads/park_visits_", gsub("-", "", x = Sys.Date()), ".csv") # saves file with attached date in yyyymmdd format (optional)
download.file(url = file_loc, destfile = dest) # save the file to the specified destination
### NOTES:
# 1. On Windows, if the `mode` argument is not supplied and the URL ends in one of .gz, .bz2, .xz, .tgz, .zip, .jar, .rda, .rds or .RData, then the default mode is "wb" (binary transfer, which compressed files require) -- but it never hurts to set the argument explicitly.
# 2. The `unzip()` function extracts all files unless one or more files is specified with the `files` argument
# 3. To unzip a single file, we can use unz() instead of unzip() -- but check out `?unz` for the correct argument names
# Specify URL where file is stored
file_loc <- "https://www.zenodo.org/record/3966534/files/labordynamicsinstitute/covid19-expectations-data-v20200622-clean.zip?download=1"
# Specify the destination where the file should be saved. In this example we save the compressed file to a temporary file that will be cleared after we exit the current `R` session. Alternatively, we could have provided an actual file path, e.g., `tmp <- "./my_downloads/butterflies.zip"`
tmp <- tempfile()
# We can see the file path of the temporary directory by typing `tempdir()` in the console
# Download the file
download.file(url = file_loc, destfile = tmp, mode = "wb") # set mode to "wb" if the file is compressed (but okay if omitted because this is automatically set as the default mode when the file has a recognized compressed file extension -- see Note # 1 above)
# Now we can look at file names and sizes (in the temporary directory, AFTER downloading) without extracting the files
unzip(zipfile = tmp, list = TRUE) %>% head(30)
# Hmmm...there are a lot of files, so let's try again but specifically searching for a .csv or .xlsx to read in
unzip(zipfile = tmp, list = TRUE) %>%
dplyr::filter(., grepl(".csv|.xlsx", Name)) %>%
dplyr::pull(Name)
# Looks like we have 65 files that are .csv or .xlsx
# Read in the (unzipped) actual file of interest and call it 'dat'. I'm picking an .xlsx file for this example just for practice. But in general, it's faster and "cleaner" to read data in .csv or .tsv format if those alternatives are available
dat <- readxl::read_excel(unzip(zipfile = tmp, files = "labordynamicsinstitute-covid19-expectations-data-fd6cf45/auxiliary/aggregage_agegrp_ca.xlsx"))
head(dat) # examine the file
# We can extract the files and save them to a specified location. Remember that we had saved the zipped file to a temporary directory, so that file will be automatically deleted when we close the `R` session. But these extracted files will remain where we specified in the code line below.
# If the subfolder `downloads` doesn't already exist, it will be created automatically by the `unzip()` function.
unzip(zipfile = tmp, exdir = "./downloads/example_extracted_files_folder")
# NOTES:
# 1. With `rvest::html_table()` we can also specify the node by number(s), e.g., `html_table(tabs[c(2, 6)])` or `html_table(tabs[[2]])`
# 2. If we don't specify a node by number, `rvest::html_table()` will convert all the tables into tibbles and combine them into a list of tibbles.
# URL for webpage
page_url <- "https://en.wikipedia.org/wiki/Deployment_of_COVID-19_vaccines"
# First, read in an HTML page and then look in that object for HTML elements that are defined as `<table>`
html_pg <- rvest::read_html(page_url)
tabs <- rvest::html_elements(html_pg, "table")
# `tabs` may show more tables than you were expecting to see. Remember that HTML elements classified as `<table>` may refer to any content laid out in tabular format--not necessarily numeric data tables
# Use `str_detect()` to find which table has HTML text matching the title (caption) of the table we want. This code returns a logical vector for each element in `tabs`, indicating if it has string matching the title we specified
match_vec <- tabs %>%
rvest::html_text() %>%
stringr::str_detect("COVID-19 vaccine distribution by country")
# Now use `match_vec` with `html_table` to read in the table of interest
this_table <- rvest::html_table(tabs[[which(match_vec)]])
head(this_table)
# Notice that `this_table` includes a column with no heading and with only NA values. If we look at the table on the webpage, we see that this first column actually had no heading and had flag icons for values. This is why we see that column in our tibble/data.frame. In addition, the column headings include the superscripts `[a]` and `[b]`, which we don't actually want in our column headings.
# A relatively new package, `htmltab`, has a function that works quite well when it comes to giving us a "clean" table. We will provide this function with the URL and with the table rank we had already identified with string matching.
this_table2 <- htmltab::htmltab(doc = page_url, which = which(match_vec))
head(this_table2)
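If you would rather keep working with the rvest output, here is a minimal sketch for stripping the [a]/[b] superscripts from its column names with base R:
names(this_table) <- gsub("\\[[a-z]\\]", "", names(this_table))
head(this_table)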
# URL for webpage
page_url <- "https://en.wikipedia.org/wiki/Deployment_of_COVID-19_vaccines"
# The XPath below was determined by following the steps above re: using your web browser's developer tools
xp_full <- "/html/body/div[3]/div[3]/div[5]/div[1]/div[6]/div/div/table"
this_table <- htmltab::htmltab(doc = page_url, which = xp_full)
head(this_table) # Note that we seem to have an extra column. If we look at the HTML code for the table, we see that the `Doses(millions)` heading actually spans two columns: the first contains the integer part and the second contains the decimal point and one digit after it. We will want to paste the values in those two columns together.
# URL for webpage
page_url <- "https://en.wikipedia.org/wiki/Deployment_of_COVID-19_vaccines"
html_pg <- rvest::read_html(page_url)
tabs <- rvest::html_elements(html_pg, "table")
# Use `str_detect()` to find which table has HTML text matching the title (caption) of the table we want. This code returns a logical vector for each element in `tabs`, indicating if it has string matching the title we specified
match_vec <- tabs %>%
rvest::html_text() %>%
stringr::str_detect("COVID-19 vaccine production by country")
# Let's first try using `rvest::html_table()`
this_table <- rvest::html_table(tabs[[which(match_vec)]])
head(this_table)
# Youch. The default output has part of the headers as the first row of data
# Now try with `htmltab::htmltab()`
this_table2 <- htmltab::htmltab(doc = page_url, which = which(match_vec))
head(this_table2)
# Voilà! As it turns out, the default output for `htmltab()` has a rather nice solution for the issue of nested headers
# Packages used in this section
pkgs <- c("tidyverse",
"jsonlite",
"httr",
"geojsonsf")
installed_pkgs <- pkgs %in% installed.packages() # check which packages are not yet installed
if (length(pkgs[!installed_pkgs]) > 0) install.packages(pkgs[!installed_pkgs],dep=TRUE) # if some packages are not yet installed, go ahead and install them...
lapply(pkgs, library, character.only = TRUE) # ...then load all the packages and their dependencies
file_loc <- "https://irmaservices.nps.gov/arcgis/rest/services/Inventory_Vegetation/Vegetation_Map_Service_for_Congaree_National_Park/MapServer/0/query?where=1%3D1&text=&objectIds=&time=&geometry=&geometryType=esriGeometryEnvelope&inSR=&spatialRel=esriSpatialRelIntersects&relationParam=&outFields=*&returnGeometry=true&returnTrueCurves=false&maxAllowableOffset=&geometryPrecision=&outSR=&having=&returnIdsOnly=false&returnCountOnly=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&returnZ=false&returnM=false&gdbVersion=&historicMoment=&returnDistinctValues=false&resultOffset=&resultRecordCount=&queryByDistance=&returnExtentOnly=false&datumTransformation=¶meterValues=&rangeValues=&quantizationParameters=&f=pjson&as_sfid=AAAAAAWwtYnFdzPLPheHpat-uOHUIjWYRGZh1ZaXYD2RdBDHkpArwB4pw8I0LXyeNO_mXEswL3qFD4pqVVs5R0aNDyWfurVajSmWIy6LmPCCtyKAaIa7Mw-jSTHAfioayEJ-k8k%3D&as_fid=ad543cb412004eb497caa336d36757253f442b06"
# NOTE:
# 1. The `jsonlite::fromJSON` function converts objects from JSON format to another `R` object such as a data frame.
# 2. The `jsonlite::fromJSON` function can take a JSON string, URL or file as its first argument. However, we will download the data to a temporary file, THEN apply the `jsonlite::fromJSON` function.
# 3. Since the data represent geographic points and we want to use the data for mapping, we have one more step to do. The function`sf::st_as_sf()` converts the specified data frame to a simple feature (similar to a shapefile) for mapping. Because we downloaded data in JSON format instead of GeoJSON, we have to explicitly specify the columns that hold the x and y coordinate data, and we have to specify the CRS.
tmp <- tempfile()
download.file(file_loc, tmp) # the data are still in JSON format
veg_pts_df <- jsonlite::fromJSON(tmp)
# View(veg_pts_df)
# `veg_pts_df` is a list of data frames.
veg_pts_sf = sf::st_as_sf(veg_pts_df$features$geometry, coords = c("x", "y"), crs = 26917)
plot(veg_pts_sf)
base_url <- httr::parse_url("https://irmaservices.nps.gov/arcgis/rest/services/Inventory_Vegetation/Vegetation_Map_Service_for_Congaree_National_Park/MapServer/0")
base_url$path <- paste(base_url$path, "query", sep = "/")
base_url$query <- list(where = "1=1", # code in the query parameters
outFields = "*",
f = "json")
request <- httr::build_url(base_url) # compare this URL to the one we used in Example 1
# Everything from here down is the same as in Example 1, but full URL is in `request`
tmp <- tempfile()
download.file(request, tmp)
veg_pts_df <- jsonlite::fromJSON(tmp)
veg_pts_sf = sf::st_as_sf(veg_pts_df$features$attributes, coords = c("x" = "X_Coord", "y" = "Y_Coord"), crs = 26917)
# View(veg_pts_sf)
plot(veg_pts_sf["Pnts_Type"], main = "Congaree National Park Vegetation Points")
base_url <- httr::parse_url("https://irmaservices.nps.gov/arcgis/rest/services/Inventory_Vegetation/Vegetation_Map_Service_for_Congaree_National_Park/MapServer/2") # note that the vegetation layer is MapServer/2
base_url$path <- paste(base_url$path, "query", sep = "/")
base_url$query <- list(where = "1=1", # code in the query parameters
outFields = "*",
f = "geojson")
request <- httr::build_url(base_url) # compare this URL to the one we used in Example 1
tmp <- tempfile()
download.file(request, tmp)
# Because the data are in GeoJSON format we will use `geojsonsf::geojson_sf()` to convert the object to an `sf` object
geo_sf <- geojsonsf::geojson_sf(tmp)
# View(geo_sf)
plot(geo_sf["MapUnit_Name"], main = "Congaree National Park Vegetation Types") # quick plot using base R - clearly issues with an unreadable legend, so needs tweaking. Also remember this `sf` has a different CRS than does `veg_pts_sf` so one would have to be transformed before both layers could be mapped together
library(purrr)
# Create function to iterate download
downFIA <- function(state, table){
  csv_name <- paste(state, table, sep = "_")
  csv_link <- paste0('http://apps.fs.usda.gov/fia/datamart/CSV/', csv_name, '.csv')
  dat <- read.csv(csv_link) # download the table
  write.csv(dat, paste0("./data/", csv_name, '.csv')) # write the data (not just the name string) to disk
}
# Create list of states of interest
states = c("CT", "RI", "DE")
# Use purrr::map to download each state's PLOT table.
map(states, ~downFIA(., "PLOT"))
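Because downFIA() is called only for its side effect (writing CSVs to disk), purrr::walk() does the same iteration without returning and printing a list; a minimal alternative:
walk(states, ~downFIA(., "PLOT"))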
library(rFIA)
# Download DE and RI state data tables to the data folder
getFIA(states = c('DE', 'RI'), dir = "./data")
# Download reference tables (ie lookup tables)
getFIA(states = 'ref', tables = c('SPECIES', 'PLANT_DICTIONARY'))
library(rFIA)
# Import all of the state data that's in the data folder
fia_all <- readFIA("./data")
names(fia_all) # Print table names
table(fia_all$PLOT$STATECD) # View number of plots by state code. Should show two state codes.
table(fia_all$PLOT$INVYR) # Range of inventory years in the data
# Import RI data only
fia_ri <- readFIA("./data", states = 'RI')
names(fia_ri)
head(fia_ri$PLOT)
table(fia_ri$PLOT$STATECD) # View number of plots by state code. Now only 1.
table(fia_ri$PLOT$INVYR)
# Clip all data to most recent sample
all_recent <- clipFIA(fia_all, mostRecent = TRUE)
# Print list of counties in RI
countiesRI
# Clip RI to Washington County
ri_Wash <- clipFIA(fia_ri, mask = countiesRI[5,], mostRecent = F)
fia_ri_rec <- clipFIA(fia_ri, mostRecent = TRUE)
# State level tree population estimates
tpa_ri <- tpa(fia_ri)
# Plot level tree population estimates for most recent survey
tpa_ri_plot <- tpa(fia_ri_rec, byPlot = TRUE)
# Species level and size class level tree population estimates
tpa_ri_sppsz <- tpa(fia_ri_rec, bySpecies = TRUE, bySizeClass = TRUE)
# by Ownership
tpa_ri_own <- tpa(fia_ri_rec, grpBy = OWNGRPCD)
head(tpa_ri)
plotFIA(tpa_ri, y = TPA, plot.title = "Trees per acre")
#the following toolbox can be pulled directly from my github:
#source("Aquarius basics.R") or....
source("https://raw.githubusercontent.com/AndrewBirchHydro/albAquariusTools/main/Aquarius%20basics.R")
timeseries$connect("https://aquarius.nps.gov/aquarius", "aqreadonly", "aqreadonly")
publishapiurl='https://aquarius.nps.gov/aquarius/Publish/v2'
#we will also need to load a few packages. If you do not have these, you will need to install them in the console.
library(dataRetrieval)
library(lubridate)
library(ggplot2)
library(jsonlite)
library(httr)
#assign the network name of interest
Network <- "Southern Colorado Plateau Network"
#this function will print the available locations in that network
Print_Locations(Network = Network)
#if you want a dataframe of the locations, you can assign it to 'Locations_in_Network' here:
Locations_in_Network <- as.data.frame(Print_Locations(Network = Network))
#enter the location of interest to get a list of datasets:
Print_datasets(Location = "BANDRIT01")
#you can also print available metadata for a given location based on its identifier:
Print_metadata(Location = "BANDRIT01")
#this line will pull the dataset from Aquarius
raw_record <- Get_timeseries2(record = "Water Temp.AquaticLogger@BANDRIT01")
#this function will "clean it up"
temp_data <- TS_simplify(data = raw_record)
# some summary information
head(temp_data)
summary(temp_data)
# a quick plot
ggplot(temp_data, aes(x = date_time, y = value)) +
geom_path() +
theme_bw() +
xlab("Time") +
ylab("Water Temperature (C)")
#00060- USGS code for discharge
siteNo <- "08313350"
parameter <- "00060"
#for start and end dates, you can enter basically all of known time to get whatever data is available, or narrow it down to a particular window. In this instance, let's go from 1900 to 2025 to ensure we pull everything available
start.date <- "1900-01-01"
end.date <- "2025-01-01"
NWIS_query <- readNWISuv(siteNumbers = siteNo,
parameterCd = parameter,
startDate = start.date,
endDate = end.date)
colnames(NWIS_query) <- c("agency", "identifier", "date_time", "Q_cfs", "time_zone")
head(NWIS_query)
summary(NWIS_query)
# we'll plot both the water temperature and discharge records for a quick visual comparison
ggplot(data = NWIS_query, aes(x = date_time, y = Q_cfs)) +
geom_path() +
theme_bw() +
scale_y_log10() +
xlab("Time") +
ylab("Discharge (cfs)")
ggplot(temp_data, aes(x = date_time, y = value)) +
geom_path() +
theme_bw() +
xlab("Time") +
ylab("Water Temperature (C)")
# we will enter FALSE for all.x and all.y, as we only want the period where the records overlap one another:
combined_record <- merge(x = NWIS_query, y = temp_data,
by = "date_time", all.x = F, all.y = F)
head(combined_record)
summary(combined_record)
# let's take a basic look at the two time series together -- note that since they are on different scales, a secondary axis is required
ggplot(combined_record, aes(x = date_time, y = Q_cfs)) +
geom_path(color = "blue") +
geom_path(color = "red", aes(x = date_time, y = value * 500)) +
theme_bw() +
scale_y_continuous(name = "Discharge (cfs)", sec.axis = sec_axis(~ . / 500, name = "Water Temperature (C)"))
# let's plot temperature and discharge against one another to see their relationship:
ggplot(combined_record, aes(x = Q_cfs, y = value)) +
geom_point() +
theme_bw() +
xlab("Discharge (cfs)") +
ylab("Water Temperature (C)") +
scale_x_log10()
# This will require the lubridate package
library(lubridate)
# this will create a new column with the numeric month derived from the date_time column as a factor
combined_record$month <- as.factor(month(combined_record$date_time))
# this will create a new column with the year derived from the date_time column as a factor
combined_record$year <- as.factor(year(combined_record$date_time))
ggplot(combined_record, aes(x = Q_cfs, y = value, color = month)) +
geom_point() +
theme_bw() +
xlab("Discharge (cfs)") +
ylab("Water Temperature (C)") +
scale_x_log10() +
ggtitle("2013") +
facet_grid(year ~ .)
rm(list = ls())
library(dplyr);library(ggplot2)
nIter=10 #number of iterations
for (i in 1:nIter) {
print(data.frame(x=i + 1, y=i))
}
#growing a dataframe with rbind
nIter=10000
OutDat <- NULL # define an empty variable.
system.time(for (i in 1:nIter) {
d <- data.frame(x=i + 1,y=i)
OutDat<-rbind(OutDat,d)# append the previous steps' outputs with this step's output
}) #4.47 seconds on my system
#growing a dataframe with append()
OutDat <- NULL # define an empty variable.
system.time(for (i in 1:nIter) {
d <- data.frame(x=i + 1,y=i)
OutDat<-append(OutDat,d)# append the previous steps' outputs with this step's output
}) #4.93 seconds on my system
nIter=10000 # number of iterations
OutDat <- data.frame(x=rep(NA,nIter),y=rep(NA,nIter)) # define a preallocated dataframe
system.time(for (i in 1:nIter) {
OutDat[i,] <- data.frame(x=i + 1, y=i)
}) #2.5 seconds on my system
rm(OutDat)
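For comparison, a minimal fully vectorized sketch that avoids the loop entirely (and typically finishes in a few milliseconds):
system.time(OutDat <- data.frame(x = (1:nIter) + 1, y = 1:nIter)) # no loop, no growing object
rm(OutDat)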
#create a few variables that will be used for file path manipulation later on
inPath <- "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/" #location of the data
fNames <- c(
"APIS01_20548905_2021_temp.csv",
"APIS02_20549198_2021_temp.csv",
"APIS03_20557246_2021_temp.csv",
"APIS04_20597702_2021_temp.csv",
"APIS05_20597703_2021_temp.csv"
) #names of the files
#preallocate the output object
HoboData<-vector(mode = "list", length = length(fNames))%>%
setNames(.,fNames)
#read in the data
for(i in fNames){
HoboData[[i]]<-read.csv(file.path(inPath,i), skip = 1, header = T)[1:3]%>%
setNames(.,c("idx", "DateTime", "T_F"))
}
#str(HoboData) #uncomment to inspect
format(object.size(HoboData), units="Mb") #how many Mb of memory is this object now occupying
names(HoboData)
head(HoboData[[1]])
OutPath <- paste0(getwd(), "/hobo_outputs/") # or put your preferred path here (we will use this later)
# set up some arbitrary QA/QC thresholds
TempUpper <- 40
TempLower <- (-40)
#the loop writes files instead of building up an object in memory, so there is nothing to preallocate.
for (i in fNames) {
# 1 Read in the data
OutPut <- HoboData[[i]] %>% #read in the data and define a output
# 2 Generate your output in this case, some QA/QC flagging
dplyr::mutate(
# Create a variable that collects temperature change flags. This can be thought of as the 'first derivative' of the data.
TChangeFlag = ifelse(
c(abs(diff(T_F, lag = 1)) > 10, FALSE), #logical vector
yes = "D10", #TRUE case
no = NA), #FALSE case
# Create a variable that captures flags for high and low temp data.
Flag = dplyr::case_when(
is.na(T_F) ~ "MISSING",
T_F > TempUpper ~ "High",
T_F < TempLower ~ "Low"
)
) %>%
# unite is like a 'paste mutate' that combines columns and replaces them with 1 new column
tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "; ", remove = TRUE, na.rm = TRUE)
# 3 output the data
dir.create(OutPath, showWarnings = F)
write.csv(OutPut,
paste0(OutPath, gsub(".csv", "_QC.csv", i)),
row.names = F, quote = F
)
rm(list=c("OutPut","i"))
}
#run these in the console without the () to see what lies underneath
lm
mean2<- #Tell [R] that I want this new function to be named "mean2"
function(x,na.rm=T){ #the function has two parameters: x (the data) and na.rm. The { begins the function body (its source code / expressions).
mean(x,na.rm=na.rm) #in the mean function change the default for na.rm=T
}#close function
mean2(c(1:9,NA))
#OutPath <- paste0(getwd(), "/hobo_outputs/")
#TempUpper <- 40
#TempLower <- (-40)
HoboQAQC <- function(data, #The data of course
fNames = names(data), #extract the file names that are supplied to data as the default filenames
OutPath = paste0(getwd(), "/hobo_outputs/"), #set up an output path character string.
TempUpper = 40, #Make the temperature flag thresholds into one of the function's arguments.
TempLower = -40) {
for (i in fNames) {
# 1 Read in the data
OutPut <- data[[i]] %>% # Note, this is now called 'data'
# 2 Generate your output in this case, some QA/QC flagging
dplyr::mutate(
TChangeFlag = ifelse(
c(abs(diff(T_F, lag = 1)) > 10, FALSE),
yes = "D10",
no = NA
),
Flag = dplyr::case_when(
is.na(T_F) ~ "MISSING",
T_F > TempUpper ~ "High",
T_F < TempLower ~ "Low"
)
) %>%
tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "; ", remove = TRUE, na.rm = TRUE)
# 3 output the data
dir.create(OutPath, showWarnings = F)
write.csv(OutPut,
paste0(OutPath, gsub(".csv", "_QC.csv", i)),
row.names = F, quote = F
)
rm(list=c("OutPut", "i"))
}
}
#HoboQAQC(data=HoboData) #uncomment to call function
#OutPath <- paste0(getwd(), "/hobo_outputs/")
#for (i in fNames) {
HoboFlag <- function(data, #argument that accepts the name of the input data, assumed to be a dataframe.
TempName="T_F", #Gives us some flexibility if we want to supply a different column name
TChange=10, #Allows us to provide a value for the temperature change flagging threshold
TempUpper=40, #provide a value for the upper temperature limit flag
TempLower=-40) { #provide a value for the lower temperature limit flag
#OutPut <- HoboData[[i]] %>%
OutPut<-dplyr::mutate(data,
TChangeFlag = ifelse(
c(abs(diff(get(TempName), lag = 1)) > TChange, FALSE),
yes = paste0("D", TChange),
no = NA
),
Flag = dplyr::case_when(
is.na(get(TempName)) ~ "MISSING",
get(TempName) > TempUpper ~ "High",
get(TempName) < TempLower ~ "Low"
)
) %>%
tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "; ", remove = TRUE, na.rm = TRUE)
return(OutPut) # because this is a function, we have to tell it what it is returning. In this case it is the output variable we defined above.
rm(list=c("OutPut","i")) #optional cleanup
}
#Writes the data out and creates directories.
HoboWrite.csv <- function(i, #an index variable, needs to be the filename.
data, #this is the input data, it could be a list or a dataframe, but ultimately we will need to output a dataframe from this function.
OutPath=paste0(getwd(), "/hobo_outputs/"), #same as before
...) { # ellipsis will let us pass in some optional arguments on the columns we want down below.
dir.create(OutPath, showWarnings = F, recursive = T) # recursive=T will create all sub-directories as well
# selects the data that may have been passed from an lapply pipe or a for loop.
if (is.data.frame(data)) {
df <- data[...] # data passed by a for loop
} else {
df <- data[[i]][...] #data passed by an lapply pipe
}
# write out the data
write.csv(df,
paste0(OutPath, gsub(".csv", "_QC.csv", i)),
row.names = F, quote = F
)
rm(list=c("df","i"))
}
#OutPath <- paste0(getwd(), "/hobo_outputs/") #change this if you want it going elsewhere
for (i in fNames) {
HoboFlag(HoboData[[i]])%>%
HoboWrite.csv(i, data= . , OutPath = OutPath) #I want the data to pass into that second position so I am using '.'
rm(i) #optional housekeeping
}
OutPath <- paste0(getwd(), "/hobo_outputs/")
for (i in fNames) {
HoboFlag(HoboData[[i]], TChange=15)%>%
HoboWrite.csv(i, data= . , OutPath = OutPath)
rm(i) #optional housekeeping
}
apply(mtcars, MARGIN = 2, FUN=function(x) sd(x)/mean(x)) #anonymous function that calculates the coefficient of variation
lapply(mtcars, FUN=function(x) sd(x)/mean2(x))
data_list<-list(A=1:10, B=11:20, C=c(21:29,NA))
lapply(data_list, mean2)
data_list<-list(A=1:10, B=11:20, C=c(21:29,NA))
sapply(data_list, mean2)
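If you want the simplification that sapply() provides but with a guaranteed return type, vapply() is a stricter alternative; a minimal sketch:
vapply(data_list, mean2, FUN.VALUE = numeric(1)) # errors if any element doesn't return a single number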
OutPath <- paste0(getwd(), "/hobo_outputs/")
lapply(HoboData, FUN=HoboFlag)%>%
lapply(names(.), FUN=HoboWrite.csv, data=., OutPath=OutPath)
d<-lapply(HoboData, FUN=HoboFlag)
#str(d) #uncomment to inspect
OutPath <- paste0(getwd(), "/hobo_outputs/")
lapply(names(d), FUN=HoboWrite.csv, data=d, OutPath=OutPath )
Sites<-names(d)[-3] #remove a problem site with no data
lapply(Sites, FUN=function(x) lm(T_F~idx, data=d[[x]]))%>% #anonymous function
lapply(coef)%>% #extract slope and intercept
bind_rows()%>% #glue all the list items together into 1
dplyr::mutate(Sites=Sites,.before="(Intercept)") #add the new column before the existing column named "(Intercept)"
#You don't need to run this. I have run this for you below.
library(future.apply)
HoboData2 <- c(rep(HoboData, 5)) #optionally make the dataset larger. This will make the code below take longer to run.
plan(multisession, workers = availableCores()-1) # initiate a multisession plan; workers is set to 1 fewer than the max cores detected, which reduces the chance of overwhelming the system.
microbenchmark::microbenchmark(
"sequential"=lapply(HoboData2, FUN=HoboFlag),
"parallel"=future_lapply(HoboData2, FUN=HoboFlag),
times=5, #number of times to do this
unit="s" #units of time needed to completely evaluate each expression
)
plan("sequential") # Important! close the multicore session by setting it back to sequential.
rm(HoboData2)
library(ggplot2)
#set data up for ggplot
d<-data.frame(ListItems=rep(25 * 2^(0:5),2), # Number of list items
Type=rep(c("Sequential","Parallel"),each=6),
Time_s=c(0.52, 1.08, 2.42, 4.39, 9.08, 18.05,
1.1, 1.26, 2.19, 3.58, 5.9, 11.17)) # mean time to complete time in s
ggplot(data=d, aes(ListItems, Time_s))+
geom_line(aes(colour=Type,linetype=Type))+
theme_classic()
rm(list=ls())
library(tidyverse)
library(lubridate)
library(modelr)
#file names
fNames <- c(
"APIS01_20548905_2021_temp.csv",
"APIS02_20549198_2021_temp.csv",
"APIS03_20557246_2021_temp.csv",
"APIS04_20597702_2021_temp.csv",
"APIS05_20597703_2021_temp.csv"
)
# Make path to files
fPaths <- paste0("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/", fNames)
fPaths<- set_names(fPaths, c("APIS01", "APIS02", "APIS03", "APIS04", "APIS05"))
fPaths # This thing is now a vector where each element has a name
## import the site data into a list where each element comes from a different file
intensity_data_raw <- fPaths %>% map( ~ read_csv(file = .x, skip = 1))
class(intensity_data_raw) # it's a list
length(intensity_data_raw) # 5 elements
class(intensity_data_raw[[1]]) # each element is a data.frame (well, tibble actually)
intensity_data_raw %>% map_dbl(~ nrow(.x))
intensity_data_raw %>% map(~ colnames(.x))
library(lubridate)
intensity_data_raw <- intensity_data_raw %>% discard(~ nrow(.x) == 0)
Fix_Data <- function(data) {
data %>%
select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
mutate(Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = mdy_hms(Date_Time), Date = date(Date_Time))
}
intensity_data <- intensity_data_raw %>% map(~ Fix_Data(.x))
Summarise_Data <- function(data) {
data %>%
group_by(Date) %>%
summarise(across(where(is.numeric), mean))
}
summary_data <- intensity_data %>% map(~ Summarise_Data(.x))
### Oops - some of the intensity values are 0; fix this for later use in a Gamma glm().
Zero_Intensity_Fix <- function(data) {
data %>%
filter(across(any_of("Intensity"), ~ .x != 0))
}
summary_data <- summary_data %>% map(~ Zero_Intensity_Fix(.x))
Summary_Graph <- function(data) {
graph_data <- data %>%
pivot_longer(cols = where(is.numeric), names_to = "Measure", values_to = "Value") %>%
filter(Measure != "Temp_F")
Graph <- ggplot(graph_data, aes(x = Date, y = Value, group = Measure, color = Measure)) +
geom_point() +
facet_grid(Measure ~ ., scales = "free_y")
return(Graph)
}
Graphs1 <- summary_data %>% map(~ Summary_Graph(.x))
Intensity_Regression <- function(data, ...) {
glm(Intensity ~ Temp_C, data = data, ...)
}
Gauss_Reg <- summary_data %>%
map_if(~ all(c("Intensity", "Temp_C") %in% colnames(.x)),
~ Intensity_Regression(.x, family = gaussian),
.else = ~ NA
)
Gamma_Reg <- summary_data %>%
map_if(~ all(c("Intensity", "Temp_C") %in% colnames(.x)),
~ Intensity_Regression(.x, family = Gamma(link="log")),
.else = ~NA
)
library(modelr)
summary_data <- map2(.x = summary_data, .y = Gauss_Reg, .f = ~ {
if ("Intensity" %in% colnames(.x)) {
add_predictions(data = .x, model = .y, var = "Pred")
} else {
.x
}
})
Pred_Graph_Data <- bind_rows(summary_data, .id = "Site") %>% filter(!is.na(Pred))
ggplot(Pred_Graph_Data, aes(x = Date, y = Intensity)) +
geom_point() +
geom_line(aes(x = Date, y = Pred, color = "red")) +
facet_grid(Site ~ ., scales = "free_y")
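As an aside, a minimal sketch for pulling the fitted coefficients out of the list of models while skipping the NA placeholders created by map_if():
Gauss_Reg %>%
  purrr::keep(~ inherits(.x, "glm")) %>% # drop list elements that are NA rather than fitted models
  purrr::map(coef)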
library(tidyverse)
library(lubridate)
library(microbenchmark)
library(profvis)
library(tidyverse)
library(lubridate)
library(profvis)
library(microbenchmark)
# A vector of numbers
x <- 1:10
# Arithmetic is vectorized
x + 1
# a character vector (letters)
x <- letters[1:10]
# paste is vectorized
paste("Letter:", x)
# a logical vector
x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
# logical operations are vectorized
TRUE & x
# a date vector
library(lubridate)
x <- mdy("02/07/2022", "02/08/2022", "02/09/2022", "02/10/2022")
x
## adding 1 shows you the next day
x + 1
rm(x)
library(tidyverse)
library(lubridate)
library(profvis)
# Recreate intensity data
fNames <- c(
"APIS01_20548905_2021_temp.csv",
"APIS02_20549198_2021_temp.csv",
"APIS03_20557246_2021_temp.csv",
"APIS04_20597702_2021_temp.csv",
"APIS05_20597703_2021_temp.csv"
)
fPaths <- paste0("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/", fNames)
intensity_data_raw <- set_names(fPaths, c("APIS01", "APIS02", "APIS03", "APIS04", "APIS05")) %>%
map(~ read_csv(file = .x, skip = 1)) %>%
discard(~ nrow(.x) == 0)
# Fix_Data function
Fix_Data <- function(data) {
data %>%
select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
mutate(Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = mdy_hms(Date_Time), Date = date(Date_Time))
}
profvis::profvis(Fix_Data(intensity_data_raw[[2]]), interval = 0.005)
library(microbenchmark)
# Original function
Fix_Data_Lubridate <- function(data) {
data %>%
select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
mutate(Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = mdy_hms(Date_Time), Date = date(Date_Time))
}
# New version using as.POSIXct()
Fix_Data_POSIXct <- function(data) {
data %>%
select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
mutate(
Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = as.POSIXct(Date_Time, "%m/%d/%y %H:%M:%S", tz = "UTC"),
Date = date(Date_Time)
)
}
# new version using strptime
Fix_Data_strptime <- function(data) {
data %>%
select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
mutate(
Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = strptime(Date_Time, "%m/%d/%y %H:%M:%S", tz = "UTC"),
Date = date(Date_Time)
)
}
# Do Comparison
mb<-microbenchmark(
Lubridate = Fix_Data_Lubridate(intensity_data_raw[[2]]),
POSIXct = Fix_Data_POSIXct(intensity_data_raw[[2]]),
Strptime = Fix_Data_strptime(intensity_data_raw[[2]]),
times = 10L, unit = "ms"
)
mb
boxplot(mb)
install.packages("rmarkdown")
install.packages('tinytex')
tinytex::install_tinytex()
#Check that it worked:
writeLines(c(
'\\documentclass{article}',
'\\begin{document}', 'Hello world!', '\\end{document}'
), 'test.tex')
tinytex::pdflatex('test.tex')
# Console should say [1] "test.pdf"
# If you ever run into issues with markdown not knitting PDF,
# sometimes running the code below will fix the problem:
tinytex::reinstall_tinytex()
#------------------
# R Markdown I
#------------------
knitr::include_graphics("./images/YAML.png")
<h1>First-level header</h1>
<h2>Second-level header</h2>
...
<h6>Sixth-level header</h6>
<i>italic</i>
<b>bold</b>
superscript<sup>2</sup>
endash: –
Example sentence: <i>Picea rubens</i> is the dominant species in <b>Acadia National Park</b>.
#### This is how to specify header level 4, which renders as:
numspp <- round(mean(runif(50, 0, 100)), 1)
library(tidyverse)
wetdat <- read.csv(
"https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv")
wetsum <- wetdat %>% group_by(Site_Name, Year) %>%
summarize(num_inv = sum(Invasive), num_prot = sum(Protected),
.groups = 'drop')
devtools::install_github("haozhu233/kableExtra")
library(kableExtra) # for extra kable features
library(knitr) # for kable
wet_kable <- kable(wetsum, format = 'html') %>% # if using pdf, need LaTeX
kable_styling(full_width = FALSE, bootstrap_options = 'striped') #kableExtra function
wet_kable
wet_kable
# custom kable function that requires data, column names and caption
make_kable <- function(data, colnames = NA, caption = NA){
kab <- kable(data, format = 'html', col.names = colnames, align = 'c', caption = caption) %>%
kable_styling(fixed_thead = TRUE,
bootstrap_options = c('condensed', 'bordered', 'striped'),
full_width = FALSE,
position = 'left',
font_size = 12) %>%
row_spec(0, extra_css = "border-top: 1px solid #000000; border-bottom: 1px solid #000000;") %>%
row_spec(nrow(data), extra_css = 'border-bottom: 1px solid #000000;')
}
# use function with wetsum data
wetkab2 <- make_kable(wetsum,
colnames = c("Site", "Year", "# Invasive", "# Protected"),
caption = "Table 1. Summary of wetland data") %>%
scroll_box(height = "250px")
wetkab2
wetkab3 <- make_kable(wetsum,
colnames = c("Site", "Year", "# Invasive", "# Protected"),
caption = "Table 1. Summary of wetland data") %>%
column_spec(3, background = ifelse(wetsum$num_inv > 0, "orange", FALSE)) %>%
collapse_rows(1, valign = 'top')
wetkab3
ans_tab <- kable(wetsum, format = 'html',
align = c('l', 'c', 'c', 'c')) %>% # Answer 1
kable_styling(fixed_thead = TRUE,
full_width = FALSE,
position = 'left') # Answer 2
ans_tab
library(DT)
wetdt <- datatable(wetsum, colnames = c("Site", "Year", "# Invasive", "# Protected"))
wetdt
wetdt
# modify pageLength and lengthMenu
wetdt2 <- datatable(wetsum, colnames = c("Site", "Year", "# Invasive", "# Protected"),
width = "45%",
options = list(pageLength = 10,
lengthMenu = c(5, 10, 20))
)
wetdt2
wetdt3 <- datatable(data.frame(wetsum, "Notes" = NA),
width = "45%",
colnames = c("Site", "Year", "# Invasive", "# Protected", "Notes"),
options = list(pageLength = 10),
class = 'cell-border stripe',
filter = list(position = c('top'), clear = FALSE),
editable = list(target = 'cell', disable = list(columns = 1:4))) %>%
formatStyle(3, backgroundColor = styleInterval(0, c('white', "orange")))
wetdt3
wetdt3
<img src="./images/map_of_parks.jpg" alt = "Map of Region-1 IMD parks" width="400px">
photos <- list.files("./images/photopoints", full.names = TRUE)
include_graphics(photos)
knitr::opts_chunk$set(fig.height=4, fig.width=6, fig.align='left')
library(dplyr)
library(ggplot2)
devtools::source_url("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/Generate_fake_invasive_data_and_plots.R")
invplot_all
invplot_ACAD
# load packages
library(ggplot2)
library(dplyr)
library(kableExtra)
# Import data from github
devtools::source_url("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/Generate_fake_invasive_data_and_plots.R")
table(invdata$park, invdata$cycle) # Should see 3 parks each with 3 cycles
# Create invplot_fun to plot different params
invplot_fun <- function(data, park){
p <-
ggplot(data %>% filter(park == !!park), # !! makes the right-hand side use the park argument, not the park column
aes(x = cycle, y = inv_cover))+
geom_jitter(color = "#69A466", width = 0.1) +
geom_boxplot(aes(group = cycle), width = 0.18, lwd = 0.4, alpha = 0)+
theme_bw()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(x = 'Cycle', y = "Invasive % Cover") +
scale_x_continuous(limits = c(0.9, 3.1), breaks = c(1, 2, 3))
return(p)
}
# Plot invasives by cycle for specified park
invplot <- invplot_fun(invdata, park = "ACAD")
# Filter invasive data to only include specified park and cycle
parkinv <- invdata %>%
filter(park == "ACAD" & cycle == 3) %>%
select(-park)
parktab <- kable(parkinv, format = 'html',
col.names = c("Plot", "Cycle", "% Invasive Cover", "% Canopy Cover"),
caption = "Table of forest plots in Acadia National Park"
) %>%
kable_styling(full_width = FALSE, bootstrap_options = 'striped') %>%
scroll_box(height = "400px", width = "420px")
invplot
parktab
# load packages
library(purrr)
library(rmarkdown)
# Set in and out directories and file to iterate on
indir <- c("./Markdown_examples/")
outdir <- c("./Markdown_examples/output/") # Included, to see how to change out dir.
rmd <- "example_for_params.Rmd"
# Set up parameter lists. Have to repeat 3 times because each park has 3 cycles
park_list <- rep(c("ACAD", "MABI", "SARA"), each = 3)
park_long_list <- rep(c("Acadia National Park",
"Marsh-Billings-Rockefeller NHP",
"Saratoga National Historical Park"), each = 3)
cycle_list = rep(c(1, 2, 3), 3)
# Create the render function that we will iterate with
render_fun <- function(parkcode, parklong, cyclename){
render(input = paste0(indir, rmd),
params = list(park = parkcode,
park_long = parklong,
cycle = cyclename),
output_file = paste0("Example_for_", parkcode,
"_cycle_", cyclename, ".html"),
output_dir = outdir)
}
# Map render_fun() to the lists.
# Note that you must have the same # of elements in each list
pmap(list(park_list, park_long_list, cycle_list),
~render_fun(..1, ..2, ..3))
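Since render_fun() is called purely for its side effect (writing the HTML files), purrr::pwalk() does the same iteration without collecting and printing the returned file paths; a minimal alternative:
pwalk(list(park_list, park_long_list, cycle_list),
      ~render_fun(..1, ..2, ..3))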
rm(list = ls())
# Packages used in this section
pkgs <- c("tidyverse",
"flexdashboard",
"crosstalk",
"scales",
"DT",
"echarts4r",
#"echarts4r.maps", # used for demo, not needed by participants
"reactable",
"plotly",
"sparkline")
installed_pkgs <- pkgs %in% installed.packages()
if (length(pkgs[!installed_pkgs]) > 0) install.packages(pkgs[!installed_pkgs],dep=TRUE)
lapply(pkgs, library, character.only = TRUE)
# This opens up a browser window where you can create a personal access token. It may ask you to log into GitHub or confirm your password.
usethis::create_github_token()
gitcreds::gitcreds_set()
# Read in bird checklist
birds <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")
# Preview dataset
head(birds)
# Load packages
library(dplyr)
library(readr)
# Read in bird checklist
birds <- read_csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")
# Preview dataset
birds
# Verify that all species names are unique
paste("There are", nrow(birds), "rows in this dataset.")
paste("There are", nrow(distinct(birds)), "unique rows in this dataset.")
# How many species groups are there?
paste("There are", n_distinct(birds$species_group), "species groups.")
# Any missing data?
any(is.na(birds$species_group))
any(is.na(birds$primary_com_name))
# Count species in each species group
group_size <- birds %>%
group_by(species_group) %>%
summarize(species_count = n()) %>%
arrange(-species_count) %>%
ungroup()
# What are the largest species groups?
group_size
# What are the smallest species groups?
tail(group_size)
#----Load packages----
library(dplyr)
library(readr)
library(ggplot2)
#----Read data----
# Read in bird checklist
birds <- read_csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")
#----Exploratory data analysis----
# Preview dataset
birds
# Verify that all species names are unique
paste("There are", nrow(birds), "rows in this dataset.")
paste("There are", nrow(distinct(birds)), "unique rows in this dataset.")
# How many species groups are there?
paste("There are", n_distinct(birds$species_group), "species groups.")
# Any missing data?
any(is.na(birds$species_group))
any(is.na(birds$primary_com_name))
# Count species in each species group
group_size <- birds %>%
group_by(species_group) %>%
summarize(species_count = n()) %>%
arrange(-species_count) %>%
ungroup()
# What are the largest species groups?
group_size
# What are the smallest species groups?
tail(group_size)
#----Group size histogram----
group_size %>%
ggplot(aes(x = species_count)) +
geom_histogram() +
ggtitle("Distribution of species group size") +
xlab("Number of species") +
ylab("Frequency")
# Load packages
library(dplyr)
library(readr)
library(ggwordcloud)
library(stringr)
# Read in bird checklist
birds <- read_csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")
# Preview dataset
birds
# Verify that all species names are unique
paste("There are", nrow(birds), "rows in this dataset.")
paste("There are", nrow(distinct(birds)), "unique rows in this dataset.")
# How many species groups are there?
paste("There are", n_distinct(birds$species_group), "species groups.")
# Any missing data?
any(is.na(birds$species_group))
any(is.na(birds$primary_com_name))
# Count species in each species group
group_size <- birds %>%
group_by(species_group) %>%
summarize(species_count = n()) %>%
arrange(-species_count) %>%
ungroup()
# What are the largest species groups?
group_size
# What are the smallest species groups?
tail(group_size)
## Species description wordcloud
# Remove the type of bird (last word in the species name)
bird_words <- birds %>%
mutate(primary_com_name = str_replace(primary_com_name, "\\b(\\w+)$", ""),
primary_com_name = str_replace_all(primary_com_name, "-", " "),
primary_com_name = trimws(primary_com_name),
primary_com_name = tolower(primary_com_name)) %>%
filter(primary_com_name != "")
# Split species descriptions into single words
bird_words <- paste(bird_words$primary_com_name, collapse = " ") %>%
str_split(boundary("word"), simplify = TRUE) %>%
as.vector()
# Get frequency of each word
bird_words_df <- tibble(word = bird_words) %>%
group_by(word) %>%
summarise(freq = n()) %>%
arrange(-freq) %>%
filter(freq > 10) %>%
mutate(angle = 90 * sample(c(-1:1), n(), replace = TRUE, prob = c(1, 3, 1)))
# Make the word cloud
n_words <- 100 # Number of words to include
ggplot(bird_words_df[1:n_words,],
aes(label = word, size = freq,
color = sample.int(10, n_words, replace = TRUE),
angle = angle)) +
geom_text_wordcloud_area() +
scale_size_area(max_size = 20) +
theme_minimal() +
scale_color_continuous(type = "viridis")
ggsave("birds_wordcloud.png", bg = "white")
#----Load packages----
library(dplyr)
library(readr)
<<<<<<< HEAD
library(ggwordcloud)
library(stringr)
=======
library(ggplot2)
#----Read data----
>>>>>>> main
# Read in bird checklist
birds <- read_csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")
# What are the smallest species groups?
tail(group_size)
<<<<<<< HEAD
## Species description wordcloud
# Remove the type of bird (last word in the species name)
bird_words <- birds %>%
mutate(primary_com_name = str_replace(primary_com_name, "\\b(\\w+)$", ""),
primary_com_name = str_replace_all(primary_com_name, "-", " "),
primary_com_name = trimws(primary_com_name),
primary_com_name = tolower(primary_com_name)) %>%
filter(primary_com_name != "")
# Split species descriptions into single words
bird_words <- paste(bird_words$primary_com_name, collapse = " ") %>%
str_split(boundary("word"), simplify = TRUE) %>%
as.vector()
# Get frequency of each word
bird_words_df <- tibble(word = bird_words) %>%
group_by(word) %>%
summarise(freq = n()) %>%
arrange(-freq) %>%
filter(freq > 10) %>%
mutate(angle = 90 * sample(c(-1:1), n(), replace = TRUE, prob = c(1, 3, 1)))
# Make the word cloud
n_words <- 100 # Number of words to include
ggplot(bird_words_df[1:n_words,],
aes(label = word, size = freq,
color = sample.int(10, n_words, replace = TRUE),
angle = angle)) +
geom_text_wordcloud_area() +
scale_size_area(max_size = 20) +
theme_minimal() +
scale_color_continuous(type = "viridis")
ggsave("birds_wordcloud.png", bg = "white")
=======
#----Group size histogram----
group_size %>%
ggplot(aes(x = species_count)) +
geom_histogram() +
ggtitle("Distribution of species group size") +
xlab("Number of species") +
ylab("Frequency")
>>>>>>> main
#------------------
# R Packages I
#------------------
knitr::include_graphics("./images/gitshell_init_code.png")
usethis::create_package("D:/NETN/R_Dev/testpackage") # update to work with your file path
usethis::use_git() # sets up local git for new package
usethis::use_github() # creates new github repo called testpackage.
usethis::use_mit_license() # set license to MIT license (or use a different license.)
usethis::git_default_branch_rename() # renames master to main
#------------------
# R Packages 2
#------------------
knitr::include_graphics("./images/Project_Options_Build_Tools.png")
#' @title hello
#' @description test package for R training
#' @export
knitr::include_graphics("./images/Build_hello_example.png")
?testpackage::hello
#' @title make_sppcode
#' @description Make a 6-letter code with first 3 letters of genus and species
#'
#' @importFrom dplyr mutate select
#' @importFrom stringr word
#'
#' @param data Name of data frame that contains at least one column with Latin names
#' @param sppname Quoted name of the column that contains the Latin names
#'
#' @return Returns a data frame with a new column named sppcode.
#' @export
make_sppcode <- function(data, sppname){
data$genus = word(data[,sppname], 1)
data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
substr(species, 1, 3))))
data2 <- select(data, -genus, -species)
return(data2)
}
usethis::use_package("dplyr") # for imports which is the default
usethis::use_package("stringr") # for imports which is the default
usethis::use_package("ggplot2", "Suggests") # for suggests
?testpackage::make_sppcode
library(testpackage)
example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa",
"Malaxis unifolia", "Calopogon tuberosus"),
cover = c(10, 40, 10, 50),
stems = c(50, 20, 10, 10))
example_dat2 <- make_sppcode(example_dat, "Latin_Name")
head(example_dat2)
#' @title make_sppcode
#' @description Make a 6-letter code with first 3 letters of genus and species
#'
#' @importFrom dplyr mutate select
#' @importFrom stringr word
#'
#' @param data Name of data frame that contains at least one column with Latin names
#' @param sppname Quoted name of the column that contains the Latin names
#'
#' @return Returns a data frame with a new column named sppcode.
#'
#' @examples
#' library(testpackage)
#'
#' example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa",
#' "Malaxis unifolia", "Calopogon tuberosus"),
#' cover = c(10, 40, 10, 50),
#'                          stems = c(50, 20, 10, 10))
#'
#' example_dat2 <- make_sppcode(example_dat, "Latin_Name")
#' head(example_dat2)
#'
#' @export
make_sppcode <- function(data, sppname){
data$genus = word(data[,sppname], 1)
data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
substr(species, 1, 3))))
data2 <- select(data, -genus, -species)
return(data2)
}
#------------------
# R Packages III
#------------------
library(stringr)
library(dplyr)
example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa",
"Malaxis unifolia", "Calopogon tuberosus"),
cover = c(10, 40, 10, 50),
stems = c(50, 20, 10, 10))
#' @title theme_IMD: custom ggplot2 theme for plotting
#'
#'
#' @description This is a custom ggplot2 theme that removes the default panel grids
#' from ggplot2 figures, and makes the axes and tick marks grey instead of black.
#'
#' @return This function must be used in conjunction with a ggplot object, and will return a ggplot object with the custom theme.
#'
#' @examples
#' example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa",
#' "Malaxis unifolia", "Calopogon tuberosus"),
#' cover = c(10, 40, 10, 50),
#' stems = c(50, 20, 10, 10))
#' library(ggplot2)
#' p <- ggplot(data = example_dat, aes(x = cover, y = stems)) +
#' geom_point() +
#' theme_IMD()
#' p
#'
#' @export
theme_IMD <- function(){
# Check that suggested package required for this function is installed
if(!requireNamespace("ggplot2", quietly = TRUE)){
stop("Package 'ggplot2' needed for this function to work. Please install it.", call. = FALSE)
}
ggplot2::theme(panel.grid.major = ggplot2::element_blank(),
panel.grid.minor = ggplot2::element_blank(),
panel.background = ggplot2::element_rect(color = '#696969', fill = 'white', size = 0.4),
plot.background = ggplot2::element_blank(),
strip.background = ggplot2::element_rect(color = '#696969', fill = 'grey90', size = 0.4),
legend.key = ggplot2::element_blank(),
axis.line.x = ggplot2::element_line(color = "#696969", size = 0.4),
axis.line.y = ggplot2::element_line(color = "#696969", size = 0.4),
axis.ticks = ggplot2::element_line(color = "#696969", size = 0.4)
)}
joinTreeData <- function(park = 'all', from = 2006, to = 2021, QAQC = FALSE, locType = c('VS', 'all'), panels = 1:4,
status = c('all', 'active', 'live', 'dead'),
speciesType = c('all', 'native','exotic', 'invasive'),
canopyPosition = c("all", "canopy"), dist_m = NA,
eventType = c('complete', 'all'), output = 'short', ...){
# Match args and classes
status <- match.arg(status)
park <- match.arg(park, several.ok = TRUE,
c("all", "ACAD", "MABI", "MIMA", "MORR", "ROVA", "SAGA", "SARA", "WEFA"))
stopifnot(class(from) == "numeric", from >= 2006)
stopifnot(class(to) == "numeric", to >= 2006)
locType <- match.arg(locType)
stopifnot(class(QAQC) == 'logical')
stopifnot(panels %in% c(1, 2, 3, 4))
output <- match.arg(output, c("short", "verbose"))
canopyPosition <- match.arg(canopyPosition)
speciesType <- match.arg(speciesType)
# Check for tree data in global environment
tryCatch(tree_vw <- COMN_TreesByEvent,
error = function(e){stop("COMN_TreesByEvent view not found. Please import view.")}
)
}
# dependencies
library(stringr)
library(dplyr)
#function args
data <- example_dat
sppname <- "Latin_Name"
#make_sppcode <- function(data, sppname){
data$genus = word(data[,sppname], 1)
data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
substr(species, 1, 3))))
data2 <- select(dat, -genus, -species)
# return(data2)
#}
make_sppcode_error <- function(data, sppname){
data$genus = word(data[,sppname], 1)
data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
substr(species, 1, 3))))
data2 <- select(dat, -genus, -species)
return(data2)
}
make_sppcode_error(example_dat, sppname = "Latin_Name")
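# The error above comes from select(dat, ...) inside the function body: 'dat' was typed where
# 'data' was intended. traceback() and debug() below are used to track that bug down.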
traceback()
debug(make_sppcode_error)
make_sppcode_error(data, "Latin_Name")
knitr::include_graphics("./images/package_commit.jpg")
devtools::install_github("KateMMiller/testpackage")
We are the people who designed and led this week's training. We hope you found it useful, and that you keep at it!
Andrew Birch
WRD/IMD Water Quality Program Lead
andrew_birch@nps.gov
Ellen Cheng
SER Quantitative Ecologist
ellen_cheng@nps.gov
Kate Miller
NETN/MIDN Quantitative Ecologist
kathryn_miller@nps.gov
Lauren Pandori
CABR Marine Biologist - MEDN
lauren_pandori@nps.gov
Thomas Parr
GLKN Program Manager
thomas_parr@nps.gov
John Paul Schmit
NCRN Quantitative Ecologist
john_paul_schmit@nps.gov
Sarah Wright
MOJN Data Scientist
sarah_wright@nps.gov