The only thing you really need to do to prepare for R training is to install the latest version of R and RStudio. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed. Directions for installing R/RStudio are below. If you run into any problems, check the instructions in R for Data Science Sections 1.4.1 and 1.4.2.
NOTE: If you don’t have Administrative privileges on your computer, you will have to submit an IT HelpDesk Ticket (need to be on VPN or NPS network to access this link) to install R and RStudio, which could take a while. PLEASE PLAN AHEAD!
Even if you already have R or RStudio installed, please install the latest versions of both programs. R recently went through a major version change from 3.x to 4.x with some potential code-breaking changes. The latest versions are needed to ensure everyone’s code behaves the same way.
Install important libraries
One of the great things about R is that users can build and share their own packages (AKA libraries) to add functionality to R. It’s a similar idea to extensions in ArcGIS. To use other libraries, you first need to download/install them, and then you need to turn them on in your R session. Some of the packages take a while to download. To save time during training, please try to install these prior to the first day. You can do that by opening RStudio, pasting the code below into the window on the left (the console), and pressing Enter.
install.packages(c("tidyverse", "devtools", "rgdal", "ggspatial", "leaflet"))
devtools::install_github("katemmiller/forestNETNarch")
devtools::install_github("NCRN/NCRNWater")
Next check that this worked by pasting and running the code below:
library(tidyverse)
library(forestNETNarch)
library(NCRNWater)
If you run into any problems, we can troubleshoot the first day. We’ll also discuss what the code above means the first day.
Join MS Team and download data
I set up a MS Team called ACAD_R_Training_Info to serve up the data and for us to use to post questions/info throughout the training. The link to the team is here. Please make sure you can join this team before training starts, and let me know if you have problems.
Structure of training
The training will take place over 3 half days. Each day will run from 8-noon via MS Teams, with afternoon assignments to help cement ideas and prepare for the next day’s training. Please make an effort to at least skim the assignments and fill out the Feedback form at the end of each day.
For most of the training, I’ll share my screen to walk through the code. It will help a lot if you have 2 monitors, so you can have the screen I’m sharing on one monitor and your own session of RStudio open on the other. If you need an extra monitor let me know (I recently upgraded and have some older (non-gov’t) ones that are looking for a new home).
Three days is barely enough to scratch the surface on what you can do in R. My goals with this training are to help you get past the initial learning curve that can be really challenging to climb on your own, and to expose you to some of the really useful things R can do for you. Hopefully this also helps build a community of R users in ACAD who can help and share code with each other.
Introduction to R and RStudio
R is a programming language and free software environment that was developed by statisticians for statistical computing and graphics. It has since grown to fill many needs beyond statistics and graphing, but is still one of the leading programming languages for non-ML statistics and data visualization. For more information on the history of R, visit the R Project website.
RStudio is an integrated development environment (IDE) that is a shell around the R program. RStudio makes programming in the R language way easier by color coding different types of code, checking code as you type, and providing many useful tools with the push of a button or keystroke (e.g., viewing help info). When you open RStudio, you typically see 4 panes: the source pane (top left), where you edit and run scripts; the console (bottom left), where code executes and output prints; the environment/history pane (top right); and the files/plots/packages/help pane (bottom right).
Global Options
There are several settings in the Global Options that everyone should check to make sure we all have consistent settings. To check these settings, go to Tools > Global Options and follow the steps below.
Note that dplyr, ggplot2, and tidyr are part of the tidyverse, which is a collection of really useful packages that are designed to work with a specific format of data and have similar function structures that allow you to chain processes together (more on that later). Instead of installing dplyr and tidyr individually, I recommend installing and loading the tidyverse, which will load those packages, along with several other useful packages.
There are also a number of packages NETN or collaborators have built to work with the monitoring data we’re collecting, including forestNETNarch and NCRNWater. If you haven’t already installed these packages (directions are also in the Prep for Training tab), run the code chunk below:
install.packages(c("tidyverse", "rgdal", "devtools"))
devtools::install_github("katemmiller/forestNETNarch")
devtools::install_github("NCRN/NCRNWater")
You only need to run the previous code chunk once. After these packages are installed, you can just load them via library(package_name). A couple of notes about packages: you only have to install a package once (until you upgrade R or want a newer version of the package), but you have to load it with library() in every new R session before you can use it.
To load the packages, run the code chunk below:
library(tidyverse)
library(rgdal)
library(forestNETNarch)
library(NCRNWater)
If you get an error that says there is no package called “x”, it means that package didn’t install properly, and you should try installing the package again.
If you run into issues installing from GitHub (e.g., with the install_github function), we can troubleshoot on the first day.
Find the directory you’re going to work from (other than avoiding saving to your desktop, anywhere is probably fine), and name the directory R_training.
Now run the code chunk below to set up the same file structure.
subfolders <- c("data", "figures", "tables", "rscripts", "shapefiles")
# Create each subfolder in the working directory, but only if it doesn't already exist.
# invisible() suppresses printing lapply's list output to the console.
invisible(lapply(subfolders, function(x){
  if(!dir.exists(x)){dir.create(paste(getwd(), x, sep = '/'))}
}))
Now, copy the data.zip file that you downloaded before we started into the data folder you created in the previous step. Then run the code chunk below.
If you haven’t downloaded the example data yet, please do so by going to the ACAD_R_Training_Info Team and downloading data.zip in the Files channel. Run the code below to check that the data.zip file is in the right place, and if it is, unzip it. You should see data.zip and 3 folders after you run list.files("./data") the second time.
list.files("./data") # check that you have data.zip in your data folder
## [1] "data.zip" "Sampling_Grid_Fee.zip"
unzip("./data/data.zip", exdir = "./data")
list.files("./data") # check that the unzip worked. There should be multiple folders with datasets in each.
## [1] "data.zip" "example_data" "NETN_forest_csvs"
## [4] "NETN_water_data" "Sampling_Grid_Fee.zip"
There are a number of options to get help with R. If you’re trying to figure out how to use a function, you can type ?function_name. For example ?plot will show the R documentation for that function in the Help panel. ?dplyr::filter works too, if you need to connect the package to the function and/or haven’t loaded that package yet.
?plot
?dplyr::filter
To view all of the functions and links to their individual help files, type in help(package=packagename). For example, the code below will bring up the help files for the dplyr package.
help(package=dplyr)
Online resources include Stack Exchange and Stack Overflow. For tidyverse-related packages/questions, there’s also really good documentation for each package and function here. Google searches are usually my first step, and I include “in R” and the package name (if applicable) in every search related to R code.
Getting Started
Now we’re going to start coding. The best way to work through these lessons will be to take the code from this website and transfer it to an R script in your R_training.Rproj project. To start that, go to File > New File > R Script (the new script icon does the same), name the script Day_1.R, and save it to your rscripts subfolder. As you get more comfortable coding in R, try pushing yourself to type the commands, rather than copying and pasting. You’ll learn to code faster by typing, even if you’re just mimicking what you see in the training document.
Note also that R is case sensitive. For example, R would see a data frame named “DF” and a data frame named “df” as 2 different objects. It’s best to avoid naming things the same thing with different capitalization to avoid confusing yourself, but R won’t be confused.
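For example, run this quick throwaway demo in your console (dat and DAT are scratch objects we won’t use again):
dat <- c(1, 2, 3)
DAT <- c(4, 5, 6)
dat # prints 1 2 3
DAT # prints 4 5 6; both objects coexist because R sees the names as different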
Before we go too far, let’s make sure you have the tidyverse package loaded by running the code chunk below. This is the only dependency we’ll need for today’s code.
library(tidyverse)
There are multiple types of data in R, including vectors, lists, arrays, matrices, data frames and tibbles. Vectors are the simplest data type. You can think of vectors as one column of data in an Excel spreadsheet, and the elements are each row in the column. An example vector is below:
# Vector of numbers
vec_num <- c(1, 2, 3, 4, 5)
vec_num
## [1] 1 2 3 4 5
# Vector of strings
vec_str <- c("Schoodic", "MDI-East", "MDI-West", "IAH")
vec_str
## [1] "Schoodic" "MDI-East" "MDI-West" "IAH"
vec_str[1]
## [1] "Schoodic"
Note the use of c(). The c() function stands for combine, and it combines the comma-separated elements (e.g., “Schoodic”, “MDI-East”, etc.) into a single vector. The c() function is a fairly universal way to combine multiple elements in R, and you’re going to see it over and over.
Data frames are essentially a collection of vectors of the same length. Frame means it’s rectangular (like a picture frame), like a spreadsheet with multiple columns, each of which has the same number of rows. We will spend most of our time with data frames, and won’t worry too much about the other data types. Tibbles are the tidyverse version of a data frame, but they’re essentially the same thing as a data frame with a few added bells and whistles.
Data frames are typically organized with rows being records or observations (e.g., sampling locations, individual critters, etc.), and columns being variables that characterize those observations (e.g., species, size, date collected, X/Y coordinates). This data format is also called tidy data. The column variables can be numbers, text, or logical (TRUE/FALSE). Before we go any further, let’s create a simple data frame to look at and perform some basic functions on.
Note that I have a tendency to use field and column interchangeably (MS Access calls columns fields). Hopefully that doesn’t cause confusion. Just remember that field and column are equivalent.
Copy the code below into the console and run it.
df <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 'plot7'), #column for plot number
numtrees = c(24, 18, 17, 14, 31, 27, 15),
numLive = c(21, 18, 16, 10, 29, 26, 14),
domSpp = c('red_maple', 'red_spruce',
'red_spruce','white_cedar','white_pine',
'white_spruce', NA),
invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))
print(df)
## plot numtrees numLive domSpp invasive_pres
## 1 1 24 21 red_maple TRUE
## 2 2 18 18 red_spruce FALSE
## 3 3 17 16 red_spruce FALSE
## 4 4 14 10 white_cedar FALSE
## 5 5 31 29 white_pine FALSE
## 6 6 27 26 white_spruce TRUE
## 7 plot7 15 14 <NA> FALSE
Here we’re creating a data frame object named df and using <- to assign the stuff on the right to an object named df. Note that objects in R can be a lot of things, including data frames, functions, lists, etc., and the ones that you created and/or are available to you are listed in the Environment pane and usually have a qualifier, like Data, so you have a better idea of what kind of object they are.
The df data frame we just created has 5 columns and 7 rows. You should notice that after you ran the code, df showed up in your Environment at the top right and the console printed output below.
Similar to Excel, R treats numbers and text differently. Some functions will only work with numbers, and some will only work with text. To get a better feel for how to reference rows and columns within a data frame, run the code below:
df$plot # List the values within the column plot for the df data frame
df[ , 'plot'] # Bracket version of the same thing
df[ , 1] # Bracket version of showing the first column, which is plot
df[2, 1] # Selecting 2nd row of 1st column
Question 1: How do you find the value of the 3rd row of domSpp? Answer is in the Answers tab.
To check the structure of your data, run the code chunk below.
head(df) # to view the first 6 rows of df
## plot numtrees numLive domSpp invasive_pres
## 1 1 24 21 red_maple TRUE
## 2 2 18 18 red_spruce FALSE
## 3 3 17 16 red_spruce FALSE
## 4 4 14 10 white_cedar FALSE
## 5 5 31 29 white_pine FALSE
## 6 6 27 26 white_spruce TRUE
tail(df) # to view last 6 rows of df
## plot numtrees numLive domSpp invasive_pres
## 2 2 18 18 red_spruce FALSE
## 3 3 17 16 red_spruce FALSE
## 4 4 14 10 white_cedar FALSE
## 5 5 31 29 white_pine FALSE
## 6 6 27 26 white_spruce TRUE
## 7 plot7 15 14 <NA> FALSE
Note the blank in domSpp in the last row.
names(df) # to view the column names and their order
## [1] "plot" "numtrees" "numLive" "domSpp"
## [5] "invasive_pres"
str(df) # to view how R is defining each column type
## 'data.frame': 7 obs. of 5 variables:
## $ plot : chr "1" "2" "3" "4" ...
## $ numtrees : num 24 18 17 14 31 27 15
## $ numLive : num 21 18 16 10 29 26 14
## $ domSpp : chr "red_maple" "red_spruce" "red_spruce" "white_cedar" ...
## $ invasive_pres: logi TRUE FALSE FALSE FALSE FALSE TRUE ...
Note the different data types in df. The plot column was assigned as a chr instead of a number because of the last value (i.e., plot7). The next thing we’ll do is modify the dataset so that plot reads in as numeric, and replace the missing value in domSpp with "white_spruce".
Copy (or type) each section of code below and run in your console.
# The long way (note that I renamed the object here)
df2 <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 7), #column for plot number
numtrees = c(24, 18, 17, 14, 31, 27, 15),
numLive = c(21, 18, 16, 10, 29, 26, 14),
domSpp = c('red_maple', 'red_spruce',
'red_spruce','white_cedar','white_pine',
'white_spruce', 'white_spruce'),
invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))
# A shorter way
df$plot[7] <- 7 #takes the 7th row of the plot column and changes it to 7
df$domSpp[7] <- "white_spruce" #takes the 7th row of domSpp and replaces it with "white_spruce"
# Another shorter way
df[7, "plot"] <- 7 #takes the 7th row of the plot column and makes it 7
df[7, 'domSpp'] <- "white_spruce" #takes the 7th row of the domSpp column and makes it "white_spruce"
# Another shorter way
df[7, 1] <- 7 #takes the 7th row of the first column and makes it 7
df[7, 4] <- "white_spruce" #takes the 7th row of the 4th column and makes it "white_spruce"
# Even shorter-way
df[7, c(1, 4)] <- c(7, "white_spruce") #Takes the 7th row and the first and 4th column, which happen to be plot and domSpp.
# Then the first column gets 7, and the 4th gets "white spruce"
This is merely to show that there are usually 5 different ways to do something in R, and the best way is the one you can get to work. At least that’s my philosophy for most of the coding I do. Having code I can pick up a year later and still understand what it does is also important. I only really care about coding for performance if I’m working with big datasets, where slow or memory hungry processes can really add up.
I should mention that when you use numbers instead of column names, your code is pretty fragile to changes in column order (e.g., if you drop a column). Row numbers are similarly fragile if you reorder your data. It’s better to use names where possible, and to use functions that use logic to find and replace records. For example:
#df[is.na("domSpp")] <- "white_oak" #wrong
df$domSpp[is.na(df$domSpp)] <- "white_oak"
Here R is looking for a value in the domSpp column that is NA (i.e. a blank value). Since there’s only 1, this works. If there were more than 1 NA, we’d need better logic to assign the proper value to each blank. We’ll go into more detail about this later.
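As a quick preview of that better logic, here’s a sketch with a throwaway vector (the species and the companion column are made up purely for illustration): a nested ifelse() assigns a different replacement to each NA based on a second condition.
# Sketch: fill each NA based on another condition, not one blanket value
spp <- c("red_maple", NA, NA) # throwaway vector with 2 NAs
plot_wet <- c(FALSE, TRUE, FALSE) # hypothetical companion column
ifelse(is.na(spp) & plot_wet, "white_cedar", ifelse(is.na(spp), "white_pine", spp))
# returns "red_maple" "white_cedar" "white_pine"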
And now, here’s the tidyverse way to replace the blank in df’s domSpp column with “white_oak”:
df$domSpp <- df$domSpp %>% replace_na("white_oak")
We’ll talk more about pipes (%>%) later, but for now, just know that it’s taking the data on the left of the %>% and doing the thing on the right to it.
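For example, these two lines return identical results (replace_na() comes from tidyr, which loads with the tidyverse):
replace_na(df$domSpp, "white_oak") # classic nested function call
df$domSpp %>% replace_na("white_oak") # piped version reads left to right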
Finally, check whether plot is now reading in as numeric or still chr.
str(df) #still a char
## 'data.frame': 7 obs. of 5 variables:
## $ plot : chr "1" "2" "3" "4" ...
## $ numtrees : num 24 18 17 14 31 27 15
## $ numLive : num 21 18 16 10 29 26 14
## $ domSpp : chr "red_maple" "red_spruce" "red_spruce" "white_cedar" ...
## $ invasive_pres: logi TRUE FALSE FALSE FALSE FALSE TRUE ...
R still thinks plot is a char. Basically, once R assigns a data type to a column, you have to manually change it if you need something different. In this case, the as.numeric function fixes the issue. If you hadn’t fixed the last record, as.numeric would still convert the plot column to numeric, but the last value would be set to NA, and you’d get a warning message: “NAs introduced by coercion”.
df$plot <- as.numeric(df$plot)
str(df)
## 'data.frame': 7 obs. of 5 variables:
## $ plot : num 1 2 3 4 5 6 7
## $ numtrees : num 24 18 17 14 31 27 15
## $ numLive : num 21 18 16 10 29 26 14
## $ domSpp : chr "red_maple" "red_spruce" "red_spruce" "white_cedar" ...
## $ invasive_pres: logi TRUE FALSE FALSE FALSE FALSE TRUE ...
Question 2: How would you make numtrees an integer?
Question 3: How would you change the second row in invasive_pres from FALSE to TRUE?
Using the df we created in the previous section, let’s perform some basic summary statistics. Base R has a number of summary statistics built into it that are really useful, including mean, median, min, max, and sd (for standard deviation).
min(df$numtrees)
## [1] 14
max(df$numtrees)
## [1] 31
mean(df$numtrees)
## [1] 20.85714
range(df$numtrees) #gives min and max
## [1] 14 31
The summary() function in R is another really useful function that can take just about any named Data object in your global environment and provide you a meaningful summary. For data frames, the summary() function gives different/meaningful summaries for each column depending on type (e.g. numeric vs. character). The summary() function also works well for summarizing model output, like from ANOVA, linear regression, etc.
summary(df)
## plot numtrees numLive domSpp
## Min. :1.0 Min. :14.00 Min. :10.00 Length:7
## 1st Qu.:2.5 1st Qu.:16.00 1st Qu.:15.00 Class :character
## Median :4.0 Median :18.00 Median :18.00 Mode :character
## Mean :4.0 Mean :20.86 Mean :19.14
## 3rd Qu.:5.5 3rd Qu.:25.50 3rd Qu.:23.50
## Max. :7.0 Max. :31.00 Max. :29.00
## invasive_pres
## Mode :logical
## FALSE:5
## TRUE :2
##
##
##
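As a quick aside, here’s a sketch of summary() on model output, using the df we created above (the model itself is just for illustration):
tree_mod <- lm(numLive ~ numtrees, data = df) # simple linear regression
summary(tree_mod) # prints coefficients, p-values, R-squared, etc.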
You can also calculate new columns from existing columns in your data.
df$numDead <- df$numtrees - df$numLive # base R version
df <- df %>% mutate(numDead = numtrees - numLive) # tidyverse version of the same operation.
# note if the 2nd line doesn't run, load the tidyverse library (library(tidyverse)) first.
The mutate function is in the dplyr package of the tidyverse. The mutate function is used to create a new column in an existing data frame, and always returns the same number of records as the existing data frame. In this example, mutate creates a new column called numDead, and sets it equal to numtrees - numLive. The results are identical to the base R version in the line above. Mutate just spares you having to write df$ over and over.
Question 4: How would you calculate the percent of trees that are dead in the df data frame?
Question 5: What is the smallest number of dead trees observed in the df dataset?
Loading csv files
For the training modules that follow, we’re going to work with a larger dataset that contains some fake wetland plant data. First we need to load the wetland dataset, using the code below.
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_example.csv")
After running the code above, you should see a new object in your global environment named wetdat. You can view the structure and the data by clicking on the arrow next to the name in the global environment panel, or click directly on wetdat to display a separate window of your data in a spreadsheet format.
We’re going to use this dataset throughout the rest of training. To give a little background on it, each row represents a species that was recorded in a wetland plot. Multiple sites have been sampled twice, as indicated by the Year column. PctFreq is the % of subplots a species was found in, ranging from 0 to 100. Ave_Cov is the average % cover of the species, ranging from 0 to 100.
read.csv() is a base R function that reads csv (comma separated values) files, which are lightweight machine-readable files that most text readers or spreadsheet programs can read. Unlike Excel spreadsheets, they only contain values and commas, and cannot save any kind of formatting, like merged cells, cell borders, formulas, etc. You also can only have one worksheet within a csv. If most of your data live in Excel spreadsheets, you can either save individual worksheets as csvs directly in Excel, or you can load individual worksheets into R using the readxl package that was designed to read Excel files.
Loading Excel files
The first step for working with Excel files (i.e., files with .xls or .xlsx extensions), is to install the readxl package, and then load the package as shown below. The readxl package has a couple of options for loading Excel spreadsheets, depending on whether the extension is .xls, .xlsx, or unknown. The code below shows you how to load files with .xlsx extension. The other options for loading Excel files can be viewed on the readxl help page.
install.packages("readxl") #only need to run once.
library(readxl)
bat_capture <- read_xlsx(path = "./data/example_data/ANP_BAT_MasterCaptureDB_1996-2019 20200811.xlsx",
sheet = "All Records")
head(bat_capture)
## # A tibble: 6 x 65
## Date Month State County Town `Site ID` `Site Name: Ori~
## <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1996-06-05 00:00:00 June ME Hanco~ Goul~ COREA MICK02
## 2 1996-06-07 00:00:00 June ME Hanco~ Wint~ ANP FRZR (approx.)
## 3 1996-06-07 00:00:00 June ME Hanco~ Wint~ ANP FRZR (approx.)
## 4 1996-06-07 00:00:00 June ME Hanco~ Wint~ ANP FRZR (approx.)
## 5 1996-06-07 00:00:00 June ME Hanco~ Wint~ ANP FRZR (approx.)
## 6 1996-06-07 00:00:00 June ME Hanco~ Wint~ ANP FRZR (approx.)
## # ... with 58 more variables: `Site Name: Standardized` <chr>,
## # `Recorder/Staff` <chr>, Affiliation <chr>, Lat <dbl>, Long <dbl>,
## # Datum <chr>, Easting <chr>, Northing <chr>, `Start Temp (°F)` <chr>, `End
## # Temp (°F)` <chr>, `Start Temp (°C)` <chr>, `End Temp (°C)` <chr>, `%
## # clouds` <dbl>, Wind <chr>, Precip. <chr>, `Moon Phase` <chr>,
## # Sunset <dttm>, `Elevation (m)` <dbl>, Habitat <chr>, `Start Time` <dttm>,
## # `End Time` <chr>, `Total minutes` <dbl>, `Total hours` <dbl>, `Capture
## # Technique` <chr>, `Capture Technique: Standardized` <chr>, `Effort (sq. m.
## # of net)` <dbl>, `Units of Effort (UOE=Effort*Total hours)` <dbl>, `Set near
## # water?` <chr>, `Capture Time` <chr>, `Time After Sunset (h:mm)` <chr>,
## # `Time after Sunset: Converted to hours` <chr>, Net <chr>, `Net
## # Height` <chr>, Species <chr>, Latin <chr>, Sex <chr>, Age <chr>, `Repro.
## # Status` <chr>, `RS (0-5)` <chr>, `Forearm (mm)` <chr>, `Ear (mm)` <chr>,
## # `Weight (g)` <chr>, `Band #` <chr>, Recap <chr>, `Sample ID` <chr>,
## # Skin <chr>, `Genetic Test` <chr>, `Genetic ID` <chr>, `ID Match` <chr>,
## # Fur <chr>, Blood <chr>, Swab <chr>, Fecal <chr>, Parasite <chr>, `Radio
## # Frequency` <lgl>, Photos <lgl>, `Non-target species` <lgl>, Comments <chr>
The file.choose() function
If you don’t remember the exact path where your file lives, the file.choose() function is a handy trick. Nesting that within a read function will open a separate window where you can select the file by hand. While this can be handy, it can also encourage lazy file organization. I prefer to organize my code and data into projects that store all of the files within the project directory, and so rarely use this function (In fact, I had to look up how to use it for this section). If you want to use it, here’s an example:
dat <- read.csv(file.choose())
Subsetting Data
Subsetting is a generic term for reducing the size of a dataset, either by reducing the number of rows, the number of columns, or both. Filtering typically means reducing rows. Selecting usually means reducing columns. Technically, we already did some subsetting when we were trying to identify and fix problems in the df data frame. Now we’re going to do it more intentionally and with a bigger dataset.
Using the wetdat data frame from earlier, let’s subset the data to remove all species that are considered protected from the public (e.g., sensitive species whose locations we don’t want to publish; also, this is a fake dataset, so don’t worry that we’re publicizing it). There’s a logical column called Protected in the dataset that we’re going to use to filter out rows where Protected is TRUE. Let’s also simplify the data frame to only select the columns we care about.
# Filter to remove protected species
wetdat_pub <- wetdat[wetdat$Protected == FALSE, ]
nrow(wetdat) #508 rows in original dataset
nrow(wetdat_pub) #499 rows in filtered dataset
table(wetdat_pub$Protected) # Only Protected = FALSE
# Select only fields you want
wetdat_pub <- wetdat_pub[ , c("Site_Name", "Site_Type", "Latin_Name", "Year", "Ave_Cov",
"Invasive", "Protected")]
names(wetdat_pub)
# Previous steps can all happen in 1 line of code:
wetdat_pub <- wetdat[wetdat$Protected == FALSE,
c("Site_Name", "Site_Type", "Latin_Name", "Year", "Ave_Cov",
"Invasive", "Protected")]
nrow(wetdat_pub) # 499 rows
names(wetdat_pub) # Only includes fields we want
table(wetdat_pub$Protected) #All FALSE
# Another base R solution that's a bit easier to work with:
wetdat_pub2 <- subset(wetdat, Protected == FALSE,
select = c(Site_Name, Site_Type, Latin_Name, Year,
Ave_Cov, Invasive, Protected))
nrow(wetdat_pub2) # 499 rows
names(wetdat_pub2) # Only includes fields we want
table(wetdat_pub2$Protected) #All FALSE
A couple of comments about the above code: note the nrow(), names(), and table() calls after each approach. They’re quick checks that the filter and select actually did what you intended.
Now let’s do what we did in the previous code chunk the tidyverse way, using the select() and filter() functions chained together via pipes (%>%). The select function is used for columns and the filter function is used for rows.
wetdat_tidy <- wetdat %>% filter(Protected == FALSE) %>%
select(Site_Name, Site_Type, Latin_Name, Year, Ave_Cov, Invasive, Protected)
# Any column name not mentioned in the select is dropped
nrow(wetdat_tidy)
names(wetdat_tidy)
table(wetdat_tidy$Protected)
# Even more efficient way to code this:
wetdat_tidy <- wetdat %>% filter(Protected == FALSE) %>% select(-Common, -PctFreq)
Note the %>%, called a “pipe”. The way to read what the pipe is doing is that it takes the wetdat data frame, pipes it into the filter function, which drops protected species, then that newly filtered dataset without protected species is piped into the select function, and the Common and PctFreq columns are dropped. Also notice that column names aren’t quoted in the tidy approach. One of the benefits of the tidyverse packages is that you don’t have to quote column names, whereas you typically do in base R (subset() being an obvious exception). This saves typing and was a conscious decision of the tidyverse developers. This approach is called non-standard evaluation (NSE). NSE is great most of the time, but can cause headaches when you start writing your own functions. Personally, I use tidyverse for most scripting purposes because it’s intuitive to me and I like being able to chain steps together. I use base R when I’m writing functions/developing packages to avoid NSE issues, and because base R functions tend to be faster. Until you get to the level that you’re building functions/packages, you’re free to work within the tidyverse.
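To see the quoting difference side by side, here’s a minimal sketch (both lines return the same 1-column data frame):
head(wetdat %>% select(Site_Name)) # tidyverse: unquoted column name (NSE)
head(wetdat[ , "Site_Name", drop = FALSE]) # base R: quoted column name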
Filtering with multiple values
This section shows you how to filter rows based on multiple possible values. Using the wetland data again, let’s say we only want to look at the data from 2 sites: SEN-01 and SEN-03. To do that, we create a list of strings we want to match and tell R to take any record that matches any of the strings in that list.
# Base R
wetdat_2sites <- wetdat[wetdat$Site_Name %in% c("SEN-01", "SEN-03"), ]
nrow(wetdat_2sites)
table(wetdat_2sites$Site_Name)
# Tidyverse equivalent of same operation
wetdat_2tidy <- wetdat %>% filter(Site_Name %in% c("SEN-01", "SEN-03"))
nrow(wetdat_2tidy)
table(wetdat_2tidy$Site_Name)
The nice thing about using %in% is that it only pulls records that match exactly one of the values you give it, and you don’t have to worry about it also including NAs.
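Here’s a small sketch of that point with a throwaway vector:
x <- c("SEN-01", NA, "RAM-05")
x == "SEN-01" # TRUE NA FALSE: the NA could sneak into a bracket filter
x %in% "SEN-01" # TRUE FALSE FALSE: NAs simply don't match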
Question 6: How would you select only Invasive species records?
Question 7: Which sites have both Arethusa bulbosa and Carex exilis present?
Sorting Data
Next topic is sorting data in R, which you can do alphabetically or numerically. Typically when you read in a dataset in R, the order is exactly the order of the data in the original file, and the row numbers are assigned based on that order. If you want to change the original order, there are multiple ways to sort data in R, as demonstrated below.
head(wetdat) # check original order before sorting
# Base R- I always forget the exact code and have to look it up.
wetdat_sort <- wetdat[order(wetdat$Latin_Name, wetdat$Site_Name), ]
head(wetdat_sort)
# Sort in reverse
wetdat_rev <- wetdat[order(desc(wetdat$Latin_Name), desc(wetdat$Site_Name)), ] #desc() stands for descending
head(wetdat_rev)
# Another example with numbers
wetdat_rev2 <- wetdat[order(-wetdat$PctFreq), ]
# Note: desc() is for text, - is for numbers
# Tidyverse version
wetdat_sort_tidy <- wetdat %>% arrange(Latin_Name, Site_Name)
head(wetdat_sort_tidy)
# Tidyverse reverse sort
wetdat_rev_tidy <- wetdat %>% arrange(desc(Latin_Name), desc(Site_Name))
head(wetdat_rev_tidy)
Question 8: How would you sort the wetdat by Ave_Cov (high to low) and Latin_Name?
That’s the end of Day 1! Be sure to save your Day 1 R script, try to answer the questions from today, and complete the tasks in the Assignments tab as you have time. Answers to the questions are in the Answers tab. Note that I provide one or a few solutions to get the answer to the questions, and it’s fine if you come up with your own. That’s how coding in R works!
Lastly, don’t worry if you don’t understand everything. The key here was to expose you to the many different ways to view and work with data in R. Learning R requires repetition, determination, and patience. Just stick with it!
Data Wrangling Intro
We already did a bit of data wrangling with the df and wetdat data frames yesterday. Today, we take those concepts to the next level. If you’ve never heard of the term, data wrangling essentially refers to all the steps you have to take to prepare your data for analysis or plotting. Data munging or data cleaning are other terms used to describe this process. This was the most checked topic in the survey you all filled out, so I’ll cover this in (painful?) detail.
Disclaimer: While R provides lots of ways to check and clean up data, these tools are no replacement for sound data management that prevents data issues from happening in the first place.
For this training module, we’re going to check and clean up the wetland plant data from yesterday. Just a reminder that in this dataset each row represents a species that was recorded in a wetland plot. Multiple sites have been sampled twice, as indicated by the Year column. PctFreq is the % of subplots a species was found in, ranging from 0 to 100. Ave_Cov is the average % cover of the species, ranging from 0 to 100.
Before we get started, open the R_training.Rproj project in RStudio, start a Day_2.R script. Then, load the tidyverse package and the wetland dataset we used yesterday.
library(tidyverse)
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_example.csv")
Explore data structure
Once you’ve loaded the wetdat dataset, remember that you can view the structure of the wetdat by clicking on the arrow next to the name in the global environment panel. If you click directly on wetdat, a separate window will open to view the data like a spreadsheet. Both of these are acceptable ways to explore data, but I prefer to explore datasets through code, rather than by mouse.
The code below is usually where I start (though I rarely use tail).
head(wetdat) # shows the first 6 lines of the data frame, including the names of the columns
tail(wetdat[1:3]) # shows last 6 records of first 3 columns in the data frame
names(wetdat) # shows the column names of the data frame
table(wetdat$Site_Name, wetdat$Year) # distribution of sites across years
table(complete.cases(wetdat)) # FALSE counts the number of rows that have an NA in at least 1 column
dim(wetdat) # number of rows (first) and columns (second)
nrow(wetdat) # number of rows (ncol is # columns)
str(wetdat) # shows the type of data in each column
summary(wetdat) # summarize each field based on data type
Check and fix numeric fields that R reads as a character
Both the summary and str functions show an issue with the Ave_Cov column: R classified it as a character, but we want it to be numeric. If R classified this as a character, it means that at least one record within this column contains something that isn’t a number. If your data frame isn’t huge, you can just run View(wetdat) to see the data and scroll/sort until you find the culprit. Strings will typically sort to the top or bottom of an otherwise numeric column.
View(wetdat)
Just by sorting Ave_Cov in the view above, you can see that there are 2 records with <0.1 instead of 0.1. Also, note the row numbers to the far left, which are rows 83 and 106 in this case. The row numbers are assigned at the time you load the data, and are maintained even when you sort in the view. If you sort the data in code (e.g., wetdat %>% arrange(Site_Name)), the row numbers will be updated.
Knowing what you just learned, you could open this file in Excel and fix the data, then reload the file in R. But I’m going to do it in code, so that it works any time you add new data to the dataset that might have the same issues. This isn’t the slickest approach, because I’m pretending that I don’t know which records and which symbols are causing the problems (say, for example, that there are a lot of records with this issue). First, I’m going to create a temporary field called Ave_Cov_Num that’s a numeric version of Ave_Cov.
wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov) #create new field that converts Ave_Cov to numeric
## Warning: NAs introduced by coercion
Note the “NAs introduced by coercion” warning message in your console. This means that when you converted Ave_Cov to numeric, there was at least 1 record that R didn’t read as a number, and it was set to NA (blank) in the Ave_Cov_Num column. We’re going to use that to find and then fix the record(s) that are causing the issue.
which(is.na(wetdat$Ave_Cov_Num))
## [1] 83 106
The line above is asking which rows in the new Ave_Cov_Num field are NA, and returns 83 and 106. Now let’s use brackets to return the data from those 2 rows.
wetdat[which(is.na(wetdat$Ave_Cov_Num)), ] # all columns with NA in Ave_Cov_num
## Site_Name Site_Type Latin_Name Common Year PctFreq
## 83 SEN-03 Sentinel Cornus canadensis bunchberry dogwood 2011 100
## 106 SEN-03 Sentinel Vaccinium macrocarpon large cranberry 2011 100
## Ave_Cov Invasive Protected X_Coord Y_Coord Ave_Cov_Num
## 83 <0.1 FALSE FALSE 553993 4898560 NA
## 106 <0.1 FALSE FALSE 553993 4898560 NA
wetdat[which(is.na(wetdat$Ave_Cov_Num)), "Ave_Cov"] # Ave_Cov only
## [1] "<0.1" "<0.1"
wetdat$Ave_Cov[which(is.na(wetdat$Ave_Cov_Num))] # Another way to see Ave_Cov only
## [1] "<0.1" "<0.1"
In the code above, we’re using brackets to subset the wetdat data frame to only return the rows that have NA in Ave_Cov_Num. The first line returns all columns in those rows. The second and third lines only show us the Ave_Cov field, which is what we want to fix. Notice that I’m using the NAs in Ave_Cov_Num to return the records causing problems in Ave_Cov.
Also, just a reminder that brackets specify the [row, column] of a data frame. If you specify the dataframe$column then you only have the row left, and don’t need a comma in the brackets. Admittedly, subsetting with brackets can get complicated, and I’m not always sure if I need a comma in the bracket or not. If I get the “incorrect number of dimensions” error, I know I did it wrong and need to either add or delete the comma.
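Here’s a quick reminder sketch of those dimensions using wetdat:
wetdat[1, ] # data frames are 2-dimensional, so [row, column] needs the comma
wetdat[1, "Ave_Cov"] # row 1 of the Ave_Cov column
wetdat$Ave_Cov[1] # dataframe$column is 1-dimensional, so no comma in the bracket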
Back to cleaning the data. We found that the records with non-numeric symbols were both “<0.1”. We’re going to use that information to replace “<0.1” with 0.1. Of course, the first thing you should do when you see <0.1 is to check your raw data, datasheets, etc. to make sure 0.1 is the correct replacement for the <0.1 records. If you don’t know what the correct replacement should be, you should probably change the value to NA instead of “0.1”. That example is commented out in the code chunk below.
wetdat$Ave_Cov[wetdat$Ave_Cov == "<0.1"] <- "0.1"
#wetdat$Ave_Cov[wetdat$Ave_Cov == "<0.1"] <- NA
Another approach is to drop the “<” using gsub(). This is handy for lab-derived data that report detection limits as < ## or > ##, and when you just want the number after the < or >.
wetdat$Ave_Cov <- gsub("<", "", wetdat$Ave_Cov)
str(wetdat)
## 'data.frame': 508 obs. of 12 variables:
## $ Site_Name : chr "SEN-01" "SEN-01" "SEN-01" "SEN-01" ...
## $ Site_Type : chr "Sentinel" "Sentinel" "Sentinel" "Sentinel" ...
## $ Latin_Name : chr "Acer rubrum" "Amelanchier" "Andromeda polifolia" "Arethusa bulbosa" ...
## $ Common : chr "red maple" "serviceberry" "bog rosemary" "dragon's mouth" ...
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ PctFreq : int NA 20 80 40 100 60 100 60 100 100 ...
## $ Ave_Cov : chr "0.02" "0.02" "2.22" "0.04" ...
## $ Invasive : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Protected : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
## $ X_Coord : num 574855 574855 574855 574855 574855 ...
## $ Y_Coord : num 4911909 4911909 4911909 4911909 4911909 ...
## $ Ave_Cov_Num: num 0.02 0.02 2.22 0.04 2.64 6.6 10.2 0.06 0.86 4.82 ...
Here gsub is looking for the “<” symbol, and replaces it with nothing, via empty quotes. If you wanted to replace the “<” with something else, like a minus sign, the 2nd argument would be “-” instead of "". You can now View(wetdat) to check that <0.1 is no longer in the Ave_Cov column. Rows 83 and 106 were the culprits.
However, notice that str(wetdat$Ave_Cov) shows that Ave_Cov is still considered a character. Once R assigns a data type to a column, you have to manually change it to something different. Now that we fixed <0.1, we shouldn’t get any NAs introduced by coercion when we convert to numeric. I’m going to overwrite the Ave_Cov_Num field because it’s just a temporary field for the purpose of data checking.
wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov)
str(wetdat)
## 'data.frame': 508 obs. of 12 variables:
## $ Site_Name : chr "SEN-01" "SEN-01" "SEN-01" "SEN-01" ...
## $ Site_Type : chr "Sentinel" "Sentinel" "Sentinel" "Sentinel" ...
## $ Latin_Name : chr "Acer rubrum" "Amelanchier" "Andromeda polifolia" "Arethusa bulbosa" ...
## $ Common : chr "red maple" "serviceberry" "bog rosemary" "dragon's mouth" ...
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ PctFreq : int NA 20 80 40 100 60 100 60 100 100 ...
## $ Ave_Cov : chr "0.02" "0.02" "2.22" "0.04" ...
## $ Invasive : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Protected : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
## $ X_Coord : num 574855 574855 574855 574855 574855 ...
## $ Y_Coord : num 4911909 4911909 4911909 4911909 4911909 ...
## $ Ave_Cov_Num: num 0.02 0.02 2.22 0.04 2.64 6.6 10.2 0.06 0.86 4.82 ...
Ave_Cov_Num is now numeric and no NAs were introduced by coercion. It’s now safe to overwrite the original Ave_Cov field. The entire workflow from start to finish is below:
wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov)
## Warning: NAs introduced by coercion
wetdat$Ave_Cov[which(is.na(wetdat$Ave_Cov_Num))] # Identify which values in Ave_Cov can't be converted to a number and what you need to do to fix them.
## [1] "<0.1" "<0.1"
wetdat$Ave_Cov <- as.numeric(gsub("<", "", wetdat$Ave_Cov)) # replace < with nothing
wetdat <- subset(wetdat, select = -c(Ave_Cov_Num)) #now drop the temp column Ave_Cov_Num
str(wetdat) #Ave_Cov is now numeric
## 'data.frame': 508 obs. of 11 variables:
## $ Site_Name : chr "SEN-01" "SEN-01" "SEN-01" "SEN-01" ...
## $ Site_Type : chr "Sentinel" "Sentinel" "Sentinel" "Sentinel" ...
## $ Latin_Name: chr "Acer rubrum" "Amelanchier" "Andromeda polifolia" "Arethusa bulbosa" ...
## $ Common : chr "red maple" "serviceberry" "bog rosemary" "dragon's mouth" ...
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ PctFreq : int NA 20 80 40 100 60 100 60 100 100 ...
## $ Ave_Cov : num 0.02 0.02 2.22 0.04 2.64 6.6 10.2 0.06 0.86 4.82 ...
## $ Invasive : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Protected : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
## $ X_Coord : num 574855 574855 574855 574855 574855 ...
## $ Y_Coord : num 4911909 4911909 4911909 4911909 4911909 ...
table(complete.cases(wetdat$Ave_Cov)) #checks that there aren't any more NAs.
##
## TRUE
## 508
# If there were NAs, you'd have a column named FALSE with the number of NAs below it.
Check and fix range of numeric data
Another check I often do is to make sure there aren’t any impossible values. For example, both the Ave_Cov and PctFreq columns are percents that should range from 0 to 100. Negative values or values over 100 are errors. Let’s see if we find any in the data. Note that we couldn’t use the range() function on Ave_Cov with the original data frame until we fixed the <0.1 issue and converted to numeric.
range(wetdat$PctFreq, na.rm = TRUE) # 20 to 100- that looks fine.
## [1] 20 100
range(wetdat$Ave_Cov, na.rm = TRUE) # 0.02 to 110.0- no bueno
## [1] 0.02 110.00
The Ave_Cov column has at least 1 record >100. Now we’re going to find out how many records are >100, then replace them with the correct value.
Note also the na.rm = TRUE. That tells R that if there are any blanks in the data, remove them for the range calculation, and only return numbers. If you didn’t include na.rm = TRUE, range would return NA NA if there’s a blank in the column (more on this in the next section).
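A minimal illustration with a throwaway vector:
x <- c(1, 5, NA)
range(x) # returns NA NA
range(x, na.rm = TRUE) # returns 1 5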
wetdat[wetdat$Ave_Cov > 100, ] # If Ave_Cov is still a chr, this filter won't work
## Site_Name Site_Type Latin_Name Common Year PctFreq Ave_Cov
## 36 SEN-02 Sentinel Arethusa bulbosa dragon's mouth 2011 20 110
## Invasive Protected X_Coord Y_Coord
## 36 FALSE TRUE 556242 4914369
There’s only 1 row, row 36, which has an Ave_Cov > 100 for Arethusa bulbosa in SEN-02. For our purposes, pretend that we went back to the datasheet and found that there was a data transcription error, and the correct value should be 11, not 110. The next line of code fixes the data for that specific record.
wetdat$Ave_Cov[wetdat$Ave_Cov > 100] <- 11
range(wetdat$Ave_Cov)
## [1] 0.02 74.00
Now Ave_Cov values are within the acceptable range of 0 to 100.
Finding and fixing blanks (NA)
As I’ve mentioned before, R calls blanks in your data NA. R by default is very conservative about how it handles NAs. If you have an NA in an otherwise numeric column and are trying to do a mathematical operation (e.g., calculate a sum or mean), R will most likely return NA as the result. The philosophy behind that is the user must decide how to deal with NAs: either replace them with a logical value, like 0, or remove them from the calculation. Every situation is different. You, the user, need to decide the correct way to deal with NAs in your data.
First, let’s look to see if there are NAs in the PctFreq and Ave_Cov columns (the main data of interest in the wetdat dataset).
table(complete.cases(wetdat$Ave_Cov)) #all TRUE
##
## TRUE
## 508
table(complete.cases(wetdat$PctFreq)) #some FALSE
##
## FALSE TRUE
## 5 503
While the Ave_Cov column is all TRUE, the PctFreq has 5 NAs (FALSE in output). Now we need to decide whether the NAs should be changed to 0 or dropped from the dataset. Again, that’s up to the user and what you’re trying to do. For the purposes here, I’ll show you how the results change based on the decision you make.
First, let’s do nothing.
mean(wetdat$PctFreq)
## [1] NA
NA is returned. R’s basically awaiting further instruction before it’s willing to return a value. Let’s drop NAs from the calculation.
mean(wetdat$PctFreq, na.rm = TRUE)
## [1] 70.36779
Returns 70.36779. Note that na.rm = TRUE is a pretty universal argument in R functions. Functions are typically set to na.rm = FALSE by default, which returns NA if there are blanks in your data. Changing the argument to na.rm = TRUE will drop NAs from the calculation.
Now let’s assume that blanks should actually be 0.
wetdat$PctFreq[is.na(wetdat$PctFreq)] <- 0
table(complete.cases(wetdat$PctFreq))
##
## TRUE
## 508
mean(wetdat$PctFreq)
## [1] 69.6752
Now all records in PctFreq are TRUE/complete, and the mean is slightly smaller because the 0s are included in the calculation. This is why it’s important to think about how to deal with blanks, and every situation is different.
Check and fix character columns
A common issue with text/character fields is inconsistent naming conventions, especially if data are being stored in a spreadsheet instead of a database with standardized look ups. In the wetdat data frame, we have a column called Site_Name, which should be a unique name for each site that starts with a 3-letter code (either SEN or RAM), is followed by a dash, then a 2 digit site number. Let’s check to see if all of the Site_Name records follow that. Since we don’t have a huge dataset here, we can use the table() function, which will count the number of records in each unique value of a column.
table(wetdat$Site_Name)
##
## RAM-05 RAM-41 RAM-44 RAM-53 RAM-62 RAM44 SEN-01 SEN-02 SEN-03 SEN-3
## 98 73 34 98 52 45 34 41 32 1
A couple of issues show up here. First, we see that there’s a site named RAM44 that’s missing a dash. We also see that there’s one SEN-3, which should be SEN-03.
Another way to do this, if you had a large dataset, would be to check if there are any Site_Name records that are not 6 characters long (e.g., RAM-01 is 6 characters) using the nchar() function, which counts the number of characters in a record.
unique(wetdat$Site_Name[which(nchar(wetdat$Site_Name) != 6)]) #
## [1] "SEN-3" "RAM44"
Note unique() returns only unique values, so you don’t get 45 RAM44 returns (run it without unique() if you don’t quite follow).
You can also check if any Site_Name records are missing a dash using the grepl function. The grepl function is a pattern matching function, which we’ll use here to look for “-” in the Site_Name column. The ! in front of grepl tells R to return any row that doesn’t have a match for “-” in the Site_Name column.
wetdat$Site_Name[which(!grepl("-", wetdat$Site_Name))]
## [1] "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44"
## [10] "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44"
## [19] "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44"
## [28] "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44"
## [37] "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44"
Now to fix and then check both of the issues:
wetdat$Site_Name[wetdat$Site_Name == "RAM44"] <- "RAM-44"
wetdat$Site_Name[wetdat$Site_Name == "SEN-3"] <- "SEN-03"
wetdat$Site_Name[which(!grepl("-", wetdat$Site_Name))] #character(0) means nothing returned
wetdat$Site_Name[which(nchar(wetdat$Site_Name)!=6)]
table(wetdat$Site_Name)
sort(unique(wetdat$Site_Name)) #another way to see all the unique values in a column
Problems solved (character(0) means no records returned). Note that there are other more powerful solutions using regular expressions from the POSIX or Perl language and functions like gsub() or grep(), but they’re more involved and tricky to work with. I only use them when the solution above doesn’t work. If you ever need to know more about that, a decent intro is here. I also provide a more involved example on how to use regular expressions for pattern matching in Answer 1 for today’s questions, but I’m not going to go too far down that rabbit hole during training (I can barely make these work myself, and always have to look them up).
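Just to give a flavor, here’s a sketch of a regex approach to the two site name issues above, using a made-up sites vector (you won’t need this for training):
sites <- c("RAM44", "SEN-3", "RAM-05")
sites <- gsub("^([A-Z]{3})(\\d)", "\\1-\\2", sites) # insert a dash after the 3-letter code
sites <- gsub("-(\\d)$", "-0\\1", sites) # pad 1-digit site numbers with a leading 0
sites # "RAM-44" "SEN-03" "RAM-05"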
Checking your work
If you plan to calculate summary statistics, like average or standard error, or are doing a lot of filtering, reshaping, etc. to your data, you always want to check your work and make sure you’ve still got the correct number of sites, individuals, visits, etc. before moving on to the next step. Just because you didn’t get an error or warning, doesn’t mean the code you ran worked the way you intended it to!
In the case of the wetdat, there should be 508 total rows, 8 sites, and 3 different years. The code below shows how to verify that. I also often skim the unique values in each field for anything weird, hence the sort(unique()) line.
dim(wetdat) # reports # rows then # cols
## [1] 508 11
nrow(wetdat) # reports # rows only
## [1] 508
length(unique(wetdat$Site_Name)) # number of unique values in Site_Name
## [1] 8
length(unique(wetdat$Year)) # number of unique years
## [1] 3
table(wetdat$Site_Name, wetdat$Year) # distribution of sites across years
##
## 2011 2012 2017
## RAM-05 0 44 54
## RAM-41 0 33 40
## RAM-44 0 34 45
## RAM-53 0 48 50
## RAM-62 0 26 26
## SEN-01 34 0 0
## SEN-02 41 0 0
## SEN-03 33 0 0
sort(unique(wetdat$Common))[1:10] # see all the common names in the dataset (truncated to first 10 common names for printing here).
## [1] "American mountain ash" "apple" "Asian bittersweet"
## [4] "bigleaf aster" "bigtooth aspen" "bird vetch"
## [7] "black chokeberry" "black crowberry" "black huckleberry"
## [10] "black spruce"
Question 1: There’s a Latin name that needs to be fixed (hint look for non-letter symbols). How would you find that record?
Question 2: Now that you’ve found it, fix it, and check that you fixed it.
Basics of joining tables
Most of the data we collect in NETN are stored in relational databases, where plot-level data that never change (e.g., X, Y coords) are stored in 1 table (tbl_Location) and data related to each visit (also called events) are stored in other tables. Databases are a way to efficiently organize data (e.g., not a lot of duplication across tables), standardize values through lookup tables, and enforce rules like required fields. To work with the data stored in databases, we typically pull the individual tables into R and join them together. To do that, we use a column found in each table, called a primary key, that allows you to match the correct records in one table to the other. In our database, we typically use GUIDs (globally unique identifiers) as our keys, which I’ll point out in the next section.
In practice, I use left and full joins the most often, so I’ll focus mostly on those in the rest of this module. We’ll try this out on simple datasets first, then we’ll build up to more complex examples with our forest data. Another great example with visuals can be found here.
tree <- data.frame(plot = c("p1","p1","p1","p2","p2","p3","p4","p4"),
treenum = c(1, 2, 3, 11, 12, 21, 32, 33),
species = c("ACERUB" ,"PINSTR", "PINSTR", "QUERUB", "PICRUB",
"PICRUB", "PICRUB", "ACERUB"))
tree
tree_data <- data.frame(plot = c("p1", "p1", "p1", "p1", "p1","p1", "p2", "p2", "p5", "p7"),
year = c(2012, 2012, 2012, 2016, 2016, 2016, 2012, 2012, 2016, 2016),
treenum = c(1, 2, 3, 1, 2, 3, 11, 12, 51, 71),
DBH = c(12.5, 14.5, 16.1, 12.8, 14.9, 17.2, 28.1, 35.4, 36.1, 45.2))
tree_data
tree_full_join <- full_join(tree, tree_data, by = c("plot", "treenum"))
tree_left_join <- left_join(tree, tree_data, by = c("plot", "treenum"))
nrow(tree_full_join)
## [1] 13
nrow(tree_left_join)
## [1] 11
print(tree_full_join)
## plot treenum species year DBH
## 1 p1 1 ACERUB 2012 12.5
## 2 p1 1 ACERUB 2016 12.8
## 3 p1 2 PINSTR 2012 14.5
## 4 p1 2 PINSTR 2016 14.9
## 5 p1 3 PINSTR 2012 16.1
## 6 p1 3 PINSTR 2016 17.2
## 7 p2 11 QUERUB 2012 28.1
## 8 p2 12 PICRUB 2012 35.4
## 9 p3 21 PICRUB NA NA
## 10 p4 32 PICRUB NA NA
## 11 p4 33 ACERUB NA NA
## 12 p5 51 <NA> 2016 36.1
## 13 p7 71 <NA> 2016 45.2
print(tree_left_join)
## plot treenum species year DBH
## 1 p1 1 ACERUB 2012 12.5
## 2 p1 1 ACERUB 2016 12.8
## 3 p1 2 PINSTR 2012 14.5
## 4 p1 2 PINSTR 2016 14.9
## 5 p1 3 PINSTR 2012 16.1
## 6 p1 3 PINSTR 2016 17.2
## 7 p2 11 QUERUB 2012 28.1
## 8 p2 12 PICRUB 2012 35.4
## 9 p3 21 PICRUB NA NA
## 10 p4 32 PICRUB NA NA
## 11 p4 33 ACERUB NA NA
Note the left join has fewer records, and doesn’t include p5 and p7 in the plot column from tree_data. There’s also no tree data for plots p3 and p4 because there wasn’t any data in the tree_data that matched. The full join includes plots and tree numbers from both tables, but there are NAs where there wasn’t a matching record in one of the tables.
Advanced table joins
Let’s load some example forest data, and use the primary keys to join them together. The tables we’ll join are Location, Events and Stand. The Location table includes plot level data that never changes, like GPS coordinates, elevation, etc. The Events table records every time a plot is sampled. The Stand table has a row for every time a plot is sampled and includes data like crown closure, index of deer impacts, % bare rock, etc. Their relationships are shown in the image below.
Note the GUIDs that link the tables together, including Location_ID between tbl_Locations and tbl_Events, and Event_ID between tbl_Events and tbl_Stand_Data. These primary keys are what we will use to join tables together in R as well. If you looked at these tables in Access, they’d look like the image below.
Now we’re going to look at these same tables in R. First, load the tables by running the code below, and use head() or View() to look at the tables you just loaded.
library(tidyverse)
loc <- read.csv("./data/example_data/tbl_Locations_example.csv") # plot level data for forest plots in NETN
head(loc)
event <- read.csv("./data/example_data/tbl_Events_example.csv") # data on each visit in NETN forest plots
head(event)
stand <- read.csv("./data/example_data/tbl_Stand_Data_example.csv") # stand data collected in plots
head(stand)
View(loc) #Location_ID
View(event) #Location_ID links to tbl_Location; Event_ID links other visit-level tables to this table
View(stand) #Event_ID links stand table to event table.
intersect(names(loc), names(event)) # way to check columns two tables have in common.
## [1] "Location_ID"
intersect(names(event), names(stand))
## [1] "Event_ID"
The intersect function tells you the elements that the 2 data frames share, in this case the column names they have in common. Not coincidentally, the results include the primary keys (eg Location_ID) that link the two tables. Now we’re going to merge the tables.
# base R version
names(loc) # to help me pick the columns to include in the join
names(event)
loc_event_full <- merge(loc, event, by = "Location_ID", all.x = TRUE, all.y = TRUE) # base R full join
View(loc_event_full) # Now you can see all the times a plot has been sampled. You'll also see that some plots
# don't have sample dates. That's because the loc table includes plots that were rejected and never sampled.
loc_event_left <- merge(loc, event, by = "Location_ID", all.x = TRUE, all.y = FALSE)
nrow(loc_event_full) # same number of rows because all events have a location record
## [1] 164
nrow(loc_event_left) # same number of rows because all events have a location record
## [1] 164
# tidyverse version
loc_event_full_tidy <- full_join(loc, event, by = "Location_ID")
nrow(loc_event_full_tidy) # same number of rows because all events have a location record
## [1] 164
loc_event_left_tidy <- left_join(loc, event, by = "Location_ID")
nrow(loc_event_left_tidy) # same number of rows because all events have a location record
## [1] 164
Note that the merge() function is the base R way of joining tables. The all.x = TRUE and all.y = TRUE arguments make it a full join. A left join is all.x = TRUE, all.y = FALSE, and a right join is all.x = FALSE, all.y = TRUE. I tend to use merge() instead of the tidyverse join functions, because it’s a bit faster when you have huge datasets, and it’s what I’ve gotten used to. The exception, which I’ll show below, is if I’m chaining multiple steps that include a join via pipes.
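As a quick reference, here’s a sketch mapping merge() arguments to join types, using the tree and tree_data tables from earlier:
tree_inner <- merge(tree, tree_data, by = c("plot", "treenum")) # inner join (the default)
tree_left <- merge(tree, tree_data, by = c("plot", "treenum"), all.x = TRUE) # left join
tree_right <- merge(tree, tree_data, by = c("plot", "treenum"), all.y = TRUE) # right join
tree_full <- merge(tree, tree_data, by = c("plot", "treenum"), all.x = TRUE, all.y = TRUE) # full join
nrow(tree_inner); nrow(tree_left); nrow(tree_right); nrow(tree_full) # 8, 11, 10, 13 rows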
Let's filter the location data to only include ACAD plots, and the event table to only include non-QAQC visits (we resample 5% of plots each year for quality assurance; Event_QAQC = TRUE marks those 2nd visits).
# Base R approach
locev <- merge(loc[loc$Park_Code == "ACAD", ],
event[event$Event_QAQC == FALSE, ],
all.x = TRUE, all.y = FALSE)
# Tidyverse approach
loc_ACAD <- loc %>% filter(Park_Code == "ACAD")
locev_tidy <- event %>% filter(Event_QAQC == FALSE) %>%
left_join(loc_ACAD, ., by = "Location_ID")
nrow(locev)
nrow(locev_tidy)
View(locev)
View(locev_tidy)
Note the use of "." above. Tidyverse functions all assume the dataset to the left of the pipe goes into the first argument of the function that follows. When that's not what you want, as in the left_join() above, you can use "." to specify exactly where the piped dataset should go.
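Here's a toy sketch of that behavior. The first two lines are equivalent; the third shows the "." moving the piped data into the second argument:
loc_acad <- filter(loc, Park_Code == "ACAD") # no pipe
loc_acad2 <- loc %>% filter(., Park_Code == "ACAD") # "." marks where the piped data goes; optional when it's the 1st argument
locev_alt <- event %>% left_join(loc_acad, ., by = "Location_ID") # "." required here: the piped data goes to the 2nd argument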
Question 3: How would you take only the plots on MDI (eg Unit_ID = “ACAD_MDI”) in the loc table and join them to their corresponding visits in the events table?
Question 4: How would you join the stand data to the locev table?
Summarizing Data
On the first day, we talked about summarizing data for the entire dataset by using mean, min, max, etc. Now we're going to talk about summarizing data for groups within your dataset using the group_by() and summarize() functions from dplyr (a core tidyverse package). The group_by() function lets you specify the columns you want to summarize over, for example grouping by site and counting up the number of trees or critters observed in each site. Other examples include grouping by year, park unit, watershed, stream, etc. The process is similar to using Totals in MS Access queries or subtotals in Excel, although the tools in R are more powerful, flexible, and efficient.
The summarize() function that we're going to use is similar to the mutate() function we saw earlier. The main difference is that mutate() creates a new field with exactly the same number of rows as the original dataset. The summarize() function returns only as many rows as there are unique combinations of groups in your group_by(). For example, if your grouping variables are stream and year, and you have 3 streams and 2 years, summarize() will return 6 rows (3 streams X 2 years).
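Here's a minimal sketch of that row-count logic with a made-up stream dataset (the names and counts are invented purely for illustration):
streams <- data.frame(stream = rep(c("A", "B", "C"), each = 4), # 3 streams
                      year = rep(c(2019, 2019, 2020, 2020), times = 3), # 2 years
                      count = 1:12) # 2 observations per stream/year combination
stream_sum <- streams %>% group_by(stream, year) %>%
  summarize(total = sum(count),
            .groups = 'drop')
nrow(streams) # 12 rows in
nrow(stream_sum) # 6 rows out: 3 streams X 2 years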
To demonstrate how this works, let’s load a cleaned version of the wetland dataset from before. Let’s start by counting the number of species found in each site by year.
wetdat <- read.csv('./data/example_data/ACAD_wetland_data_clean.csv')
head(wetdat)
wetsum <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>%
group_by(Site_Name, Year) %>%
summarize(numspp = sum(present),
.groups = 'drop')
head(wetdat)
## Site_Name Site_Type Latin_Name Common Year PctFreq Ave_Cov
## 1 SEN-01 Sentinel Acer rubrum red maple 2011 0 0.02
## 2 SEN-01 Sentinel Amelanchier serviceberry 2011 20 0.02
## 3 SEN-01 Sentinel Andromeda polifolia bog rosemary 2011 80 2.22
## 4 SEN-01 Sentinel Arethusa bulbosa dragon's mouth 2011 40 0.04
## 5 SEN-01 Sentinel Aronia melanocarpa black chokeberry 2011 100 2.64
## 6 SEN-01 Sentinel Carex exilis coastal sedge 2011 60 6.60
## Invasive Protected X_Coord Y_Coord
## 1 FALSE FALSE 574855.5 4911909
## 2 FALSE FALSE 574855.5 4911909
## 3 FALSE FALSE 574855.5 4911909
## 4 FALSE TRUE 574855.5 4911909
## 5 FALSE FALSE 574855.5 4911909
## 6 FALSE FALSE 574855.5 4911909
In the code above, we added a column named "present" using mutate, and we set present to 1 if a species has an Ave_Cov > 0 and isn't NA. In case you've never seen if/then or if/else statements, these are conditional statements that run a different set of commands depending on whether an expression is true or false. They're foundational to most programming, and are available in Excel, Access, R, and most other languages. In the code above, the ifelse() statement evaluates the expression (Ave_Cov > 0 & !is.na(Ave_Cov)); if the result is TRUE, it returns a 1, and if FALSE, it returns a 0.
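If ifelse() is new to you, here are a couple of standalone examples you can run in the console. They also show why the !is.na() check matters:
ifelse(5 > 3, "yes", "no") # returns "yes"
x <- c(2, 0, NA)
ifelse(x > 0, 1, 0) # returns 1 0 NA: an NA input returns NA
ifelse(x > 0 & !is.na(x), 1, 0) # returns 1 0 0: the !is.na() converts the NA to 0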
If you’re unclear on what the mutate did, you can break the steps up via the code below.
wetdat <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0))
wetdat_sum <- wetdat %>% group_by(Site_Name, Year) %>%
summarize(numspp = sum(present),
.groups = 'drop')
After the mutate, we then set the grouping to be by Site_Name and Year and then summarized the number of species per site by summing the present column.
A couple of things to note: summarize() only keeps the columns that are either named in group_by() or created inside summarize(). If you want to carry another column like Site_Type along, you can either add it to the group_by() (option 1 below) or summarize it with first() (option 2 below).
wetsum_opt1 <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>%
group_by(Site_Name, Site_Type, Year) %>%
summarize(numspp = sum(present),
.groups = 'drop')
wetsum_opt2 <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>%
group_by(Site_Name, Year) %>%
summarize(Site_Type = first(Site_Type),
numspp = sum(present),
.groups = 'drop')
Taking this a step further, let’s pretend that we want to know what the average number of species per site was, and whether this has changed over time. First we need to remove the Sentinel site data, because we only have 1 visit for those sites. Then we’ll calculate the mean and standard error for years 2012 and 2017 across the RAM sites.
wetsum_pre <- wetdat %>% filter(Site_Type == "RAM") %>%
mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>%
group_by(Site_Name, Site_Type, Year) %>%
summarize(numspp = sum(present),
.groups = 'drop')
wetsum_yr <- wetsum_pre %>% group_by(Year) %>%
summarize(mean_numspp = mean(numspp),
numplots = n(),
se_numspp = sd(numspp)/sqrt(numplots),
.groups = 'drop')
wetsum_yr
## # A tibble: 2 x 4
## Year mean_numspp numplots se_numspp
## * <int> <dbl> <int> <dbl>
## 1 2012 37 5 3.97
## 2 2017 43 5 4.86
A couple of things to note here: n() returns the number of rows in each group (here, the number of plots sampled per year), and the standard error is calculated as the standard deviation divided by the square root of the sample size.
Question 5: How would you calculate the min and max number of species by year?
Question 6: What percent of RAM sites had an invasive species present in 2012 and 2017?
The following code saves the wetland data using a different file name:
write.csv(wetdat, "./data/ACAD_wetland_data2.csv")
You should now see a new file in your data folder.
We're not actually going to use that file for anything, so you can delete it now (either manually or by running the following code).
file.remove("./data/ACAD_wetland_data2.csv")
Reshaping Data I
Another important concept in data wrangling is reshaping data. Datasets are usually described as long or wide. The long form, which is the structure database tables often take, has a row for each observation (ie if there are 3 visits to a site, there are 3 rows in the data) and a column for each variable (eg plot, cycle, number of critters, % plant cover, etc). In summary tables, we often want to reshape the data to be wide, so that there's only one row per plot, a column for each visit or species, and the abundance of that species as the value in the cell.
We’re going to work with a fake bat dataset to see how this works. To get started, load the dataset and packages, as shown below.
library(tidyverse)
bat_long <- read.csv("./data/example_data/fake_bats.csv")
head(bat_long)
## Site Julian Year Latin Common
## 1 BrownMtnCR 195 2015 Myotis leibii small-footed
## 2 BrownMtnCR 214 2015 Myotis leibii small-footed
## 3 BrownMtnCR 237 2016 Myotis leibii small-footed
## 4 BrownMtnCR 230 2017 Myotis lucifugus little brown
## 5 BrownMtnCR 201 2016 Lasiurus cinereus hoary
## 6 BrownMtnCR 230 2018 Myotis septentrionalis northern long-eared
The data in this example dataset are simplified capture data. Every row represents an individual bat that was captured at a given site and date. We want to turn this into a wide data frame that has a column for every species, with the number of individuals of that species caught for each year and site combination as the value. Before we start, we'll make a species code that will be easier to work with (R doesn't like spaces in column names). We'll also summarize the data, so we have a count of the number of individuals of each species found per site and year.
bat_long <- bat_long %>% mutate(sppcode = toupper(paste0(substr(word(Latin, 1), 1, 3),
substr(word(Latin, 2), 1, 3))))
bat_long_sum <- bat_long %>% group_by(Site, Year, sppcode) %>%
summarize(numindiv = sum(!is.na(sppcode)),
.groups = 'drop')
head(bat_long_sum)
## # A tibble: 6 x 4
## Site Year sppcode numindiv
## <chr> <int> <chr> <int>
## 1 BrownMtnCR 2013 MYOLEI 1
## 2 BrownMtnCR 2014 MYOLUC 2
## 3 BrownMtnCR 2015 MYOLEI 2
## 4 BrownMtnCR 2016 LASCIN 1
## 5 BrownMtnCR 2016 MYOLEI 2
## 6 BrownMtnCR 2016 MYOLUC 1
If you didn't follow what I just did, here's what happened above, step by step. First, the word() function (another tidyverse function) pulls out the genus (word(Latin, 1)) and species (word(Latin, 2)). The substr() function takes the first through third characters of each genus and species. The paste0() function pastes the first three characters of the genus and species together, and toupper() capitalizes the result. Note that the 0 in paste0 means the elements are pasted together with no separator. If you wanted to separate the elements with an underscore, you'd use paste() instead, like paste(element1, element2, sep = "_").
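If it helps, here's that chain broken into single steps on one Latin name:
latin <- "Myotis leibii"
word(latin, 1) # "Myotis"
substr(word(latin, 1), 1, 3) # "Myo"
substr(word(latin, 2), 1, 3) # "lei"
paste0(substr(word(latin, 1), 1, 3), substr(word(latin, 2), 1, 3)) # "Myolei"
toupper("Myolei") # "MYOLEI"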
To make this wide, we’re essentially going to make a pivot table using functions in tidyr (another tidyverse package). There are two functions you can use. The spread() function was original to tidyr, is what I know the best, and still performs faster on larger datasets than pivot_wider(), the replacement for spread. I show both options, so you can use whichever function makes the most sense to you.
bat_wide <- bat_long_sum %>% spread(key = "sppcode", value = "numindiv") #retired spread function
bat_wide2 <- bat_long_sum %>% pivot_wider(names_from = "sppcode", values_from = "numindiv") #replacement in tidyr
bat_wide
## # A tibble: 21 x 6
## Site Year LASCIN MYOLEI MYOLUC MYOSEP
## <chr> <int> <int> <int> <int> <int>
## 1 BrownMtnCR 2013 NA 1 NA NA
## 2 BrownMtnCR 2014 NA NA 2 NA
## 3 BrownMtnCR 2015 NA 2 NA NA
## 4 BrownMtnCR 2016 1 2 1 NA
## 5 BrownMtnCR 2017 NA 1 1 NA
## 6 BrownMtnCR 2018 NA 1 NA 1
## 7 BrownMtnCR 2019 1 NA NA NA
## 8 BrownMtnGH 2013 NA NA NA 1
## 9 BrownMtnGH 2014 NA 2 1 NA
## 10 BrownMtnGH 2015 1 NA NA 1
## # ... with 11 more rows
Note the NAs for the various species. That's because the original dataset didn't have a data point for that combination of site/species/year. The tidyr package assigns NA when reshaping for combinations not represented in the original dataset. If there are sites that weren't sampled in a certain year, NA makes sense. However, if all sites were sampled and there just weren't any individuals found, then filling those with 0 makes sense, and you can tell tidyr to fill blanks with 0 when you reshape the data. This is a handy feature, but it's not to be abused. You should always think carefully before automatically filling blanks with 0s.
Here’s the same reshape code from before, but now we fill NAs with 0.
bat_wide_fill <- bat_long_sum %>% spread(key = "sppcode", value = "numindiv", fill = 0) #retired spread function
head(bat_wide_fill)
## # A tibble: 6 x 6
## Site Year LASCIN MYOLEI MYOLUC MYOSEP
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 BrownMtnCR 2013 0 1 0 0
## 2 BrownMtnCR 2014 0 0 2 0
## 3 BrownMtnCR 2015 0 2 0 0
## 4 BrownMtnCR 2016 1 2 1 0
## 5 BrownMtnCR 2017 0 1 1 0
## 6 BrownMtnCR 2018 0 1 0 1
bat_wide2_fill <- bat_long_sum %>% pivot_wider(names_from = "sppcode", values_from = "numindiv",
values_fill = 0) #replacement in tidyr
Now, say you want to take the wide version of the dataframe and make it long, so that you have a row for every site, year, and species combination. You can use gather(), the original tidyr function, or pivot_longer(), the new one.
bat_long2 <- bat_wide_fill %>% gather(key = "sppcode", value = "numindiv", -Site, -Year) # old way
head(bat_long2)
## # A tibble: 6 x 4
## Site Year sppcode numindiv
## <chr> <int> <chr> <dbl>
## 1 BrownMtnCR 2013 LASCIN 0
## 2 BrownMtnCR 2014 LASCIN 0
## 3 BrownMtnCR 2015 LASCIN 0
## 4 BrownMtnCR 2016 LASCIN 1
## 5 BrownMtnCR 2017 LASCIN 0
## 6 BrownMtnCR 2018 LASCIN 0
names(bat_wide_fill)
## [1] "Site" "Year" "LASCIN" "MYOLEI" "MYOLUC" "MYOSEP"
bat_long2b <- bat_wide_fill %>% pivot_longer(cols=c(LASCIN, MYOLEI, MYOLUC, MYOSEP),
names_to = "sppcode",
values_to = "numindiv") #new way
head(bat_long2b)
## # A tibble: 6 x 4
## Site Year sppcode numindiv
## <chr> <int> <chr> <dbl>
## 1 BrownMtnCR 2013 LASCIN 0
## 2 BrownMtnCR 2013 MYOLEI 1
## 3 BrownMtnCR 2013 MYOLUC 0
## 4 BrownMtnCR 2013 MYOSEP 0
## 5 BrownMtnCR 2014 LASCIN 0
## 6 BrownMtnCR 2014 MYOLEI 0
nrow(bat_long_sum) # dataset before made wide, then long
## [1] 34
nrow(bat_long2) # dataset that has every species represented for every year and every site
## [1] 84
Both functions above are keeping a Site and Year column. In the gather function, we’re telling R which ones not to make long. In the pivot_longer, we’re telling R which columns to gather, rather than which columns to ignore. I show both the old and new approach, because the pivot_### functions are still relatively new, slower with big datasets, and I find more examples online of the old approaches. But, personally I find the pivot_### functions to be more intuitive than gather/spread, so I’ve started using them more.
Another way to do this, since there’s almost always more than one way to do something in R, is to use expand() and joins to get all combinations of factors in your data. There’s an added step of having to replace NAs from the join with 0.
#First figure out how many levels you expect
with(bat_long, length(unique(Site)) * length(unique(Year)) * length(unique(sppcode))) #84
bat_all_comb <- bat_long_sum %>% expand(Site, Year, sppcode)
nrow(bat_all_comb) #84
bat_long_comb <- left_join(bat_all_comb, bat_long_sum, by = c("Site", "Year", "sppcode"))
bat_long_comb$numindiv[is.na(bat_long_comb$numindiv)] <- 0 # replace NA with 0
nrow(bat_long2) == nrow(bat_long_comb) #TRUE
A couple of things to note in the above code: expand() returns every combination of the values in the columns you give it, whether or not that combination actually occurs in the data, and with() lets you refer to columns of a data frame by name without repeating the data frame name and $ each time.
Question 1: How would you create a dataset that has a column for every sample year (ie 2013-2019), a row for every site, and the number of northern long-eared bats found as the value?
You can also pivot on 2 columns using pivot_wider. To demonstrate, let’s pivot on sppcode and Year.
head(bat_long_sum)
## # A tibble: 6 x 4
## Site Year sppcode numindiv
## <chr> <int> <chr> <int>
## 1 BrownMtnCR 2013 MYOLEI 1
## 2 BrownMtnCR 2014 MYOLUC 2
## 3 BrownMtnCR 2015 MYOLEI 2
## 4 BrownMtnCR 2016 LASCIN 1
## 5 BrownMtnCR 2016 MYOLEI 2
## 6 BrownMtnCR 2016 MYOLUC 1
bat_wide2 <- bat_long_sum %>% pivot_wider(names_from = c(sppcode, Year),
values_from = numindiv,
values_fill = 0)
head(bat_wide2)
## # A tibble: 3 x 21
## Site MYOLEI_2013 MYOLUC_2014 MYOLEI_2015 LASCIN_2016 MYOLEI_2016 MYOLUC_2016
## <chr> <int> <int> <int> <int> <int> <int>
## 1 Brow~ 1 2 2 1 2 1
## 2 Brow~ 0 1 0 0 0 1
## 3 BubP~ 1 1 1 0 2 0
## # ... with 14 more variables: MYOLEI_2017 <int>, MYOLUC_2017 <int>,
## # MYOLEI_2018 <int>, MYOSEP_2018 <int>, LASCIN_2019 <int>, MYOSEP_2013 <int>,
## # MYOLEI_2014 <int>, LASCIN_2015 <int>, MYOSEP_2015 <int>, LASCIN_2017 <int>,
## # MYOLEI_2019 <int>, MYOLUC_2015 <int>, MYOLUC_2018 <int>, MYOLUC_2019 <int>
The above steps may seem like a lot of work for something you could do by hand in Excel. But if you have a huge dataset that only stores records of species that were present, this approach really comes in handy. It's also repeatable and transparent: if you realize you did something wrong, you can fix the code and rerun everything, rather than having to redo all of the work by hand.
Reshaping Data II
Let’s work with a larger dataset to demonstrate these concepts again. Say we want a list of all invasive species found in ACAD wetland sites in 2017 and we want to calculate the proportion of plots invasive species occurred in. If we don’t find an invasive in a plot, there’s no record of the data. Let’s load the wetland dataset from before and work through that example.
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
head(wetdat)
## Site_Name Site_Type Latin_Name Common Year PctFreq Ave_Cov
## 1 SEN-01 Sentinel Acer rubrum red maple 2011 0 0.02
## 2 SEN-01 Sentinel Amelanchier serviceberry 2011 20 0.02
## 3 SEN-01 Sentinel Andromeda polifolia bog rosemary 2011 80 2.22
## 4 SEN-01 Sentinel Arethusa bulbosa dragon's mouth 2011 40 0.04
## 5 SEN-01 Sentinel Aronia melanocarpa black chokeberry 2011 100 2.64
## 6 SEN-01 Sentinel Carex exilis coastal sedge 2011 60 6.60
## Invasive Protected X_Coord Y_Coord
## 1 FALSE FALSE 574855.5 4911909
## 2 FALSE FALSE 574855.5 4911909
## 3 FALSE FALSE 574855.5 4911909
## 4 FALSE TRUE 574855.5 4911909
## 5 FALSE FALSE 574855.5 4911909
## 6 FALSE FALSE 574855.5 4911909
# Prepare data for pivot
wetdat_prep <- wetdat %>% mutate(sppcode = toupper(paste0(substr(word(Latin_Name, 1), 1, 3),
substr(word(Latin_Name, 2), 1, 3))),
present = ifelse(Ave_Cov>0, 1, 0)) %>%
filter(Year == 2017)
# Pivot wide, so every species has a column
wetdat_wide <- wetdat_prep %>% select(Site_Name, Year, sppcode, present) %>%
pivot_wider(names_from = "sppcode", values_from = "present",
values_fill = 0)
head(wetdat_wide)
## # A tibble: 5 x 106
## Site_Name Year ACERUB ALNINC AMENA BERTHU CARFOL CARTRI CARNA CHACAL CORCAN
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RAM-41 2017 1 1 1 1 1 1 1 1 1
## 2 RAM-53 2017 1 1 1 1 1 0 0 1 1
## 3 RAM-62 2017 1 0 0 0 0 1 0 1 1
## 4 RAM-44 2017 1 1 0 0 0 0 0 1 0
## 5 RAM-05 2017 1 1 1 0 1 1 0 1 1
## # ... with 95 more variables: DOEUMB <dbl>, DRYCRI <dbl>, ERIANG <dbl>,
## # GAYBAC <dbl>, ILEMUC <dbl>, ILEVER <dbl>, JUNEFF <dbl>, KALANG <dbl>,
## # LARLAR <dbl>, MAICAN <dbl>, MAITRI <dbl>, MYRGAL <dbl>, ONOSEN <dbl>,
## # OSMCIN <dbl>, PICGLA <dbl>, PICMAR <dbl>, PRENA <dbl>, RHOCAN <dbl>,
## # RHOGRO <dbl>, ROSNIT <dbl>, RUBHIS <dbl>, SOLULI <dbl>, SPIALB <dbl>,
## # SYMFOE <dbl>, THEPAL <dbl>, TRIBOR <dbl>, VACANG <dbl>, VACCOR <dbl>,
## # VACMYR <dbl>, VACVIT <dbl>, VIBNUD <dbl>, AROMEL <dbl>, CALCAN <dbl>,
## # CARATL <dbl>, CARLAS <dbl>, CARSTR <dbl>, CELORB <dbl>, DANSPI <dbl>,
## # DICACU <dbl>, DROINT <dbl>, DROROT <dbl>, DULARU <dbl>, EPILEP <dbl>,
## # ERITEN <dbl>, ERIVIR <dbl>, JUNCAN <dbl>, LYSTER <dbl>, MORPEN <dbl>,
## # OCLNEM <dbl>, OSMREG <dbl>, PICRUB <dbl>, PINSTR <dbl>, POGOPH <dbl>,
## # RHAFRA <dbl>, SCICYP <dbl>, SPITOM <dbl>, SYMNOV <dbl>, TRINA <dbl>,
## # TYPLAT <dbl>, VACMAC <dbl>, GAUHIS <dbl>, MONUNI <dbl>, VACOXY <dbl>,
## # APOAND <dbl>, BETPOP <dbl>, CARLAC <dbl>, EQUARV <dbl>, EURMAC <dbl>,
## # FESFIL <dbl>, IRIVER <dbl>, POPGRA <dbl>, POPTRE <dbl>, QUERUB <dbl>,
## # RANACR <dbl>, ROSVIR <dbl>, RUBNA <dbl>, SALPET <dbl>, SOLRUG <dbl>,
## # UTRCOR <dbl>, VEROFF <dbl>, VICCRA <dbl>, AREBUL <dbl>, CALTUB <dbl>,
## # CAREXI <dbl>, CARMAG <dbl>, CARPAU <dbl>, EMPNIG <dbl>, EURRAD <dbl>,
## # GLYSTR <dbl>, KALPOL <dbl>, `LON-` <dbl>, OCLX <dbl>, RHYALB <dbl>,
## # SARPUR <dbl>, XYRMON <dbl>
# Pivot long, so every site has a species with 1 or 0 to denote present/absent
wetdat_long <- wetdat_wide %>% pivot_longer(cols = c(-Site_Name, -Year), names_to = "sppcode",
values_to = "present") %>%
arrange(Site_Name, sppcode)
# Now we need to filter the data to only include invasive species
# First make a list of the invasive species
inv_list <- unique(wetdat_prep$sppcode[wetdat_prep$Invasive == TRUE])
inv_list
## [1] "BERTHU" "CELORB" "RHAFRA" "LON-"
#Now filter on that invasive list
wetdat_long_inv <- wetdat_long %>% filter(sppcode %in% inv_list)
# Now summarize to count the percent of plots each species was found in
wetdat_sum <- wetdat_long_inv %>% group_by(sppcode) %>%
summarize(pct_sites = sum(present)/n()*100,
.groups = 'keep')
wetdat_sum
## # A tibble: 4 x 2
## # Groups: sppcode [4]
## sppcode pct_sites
## <chr> <dbl>
## 1 BERTHU 40
## 2 CELORB 20
## 3 LON- 20
## 4 RHAFRA 40
Note the use of n() in the summarize() for wetdat_sum. That function counts the number of rows in each group. It's handy, but be careful: it counts every row, including rows with NAs. If that's what you want, n() is great. If not, you need another approach, like the one sketched below.
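Here's a toy sketch of the difference (made-up data): n() counts every row in the group, while sum(!is.na()) only counts the non-missing values.
df_na <- data.frame(group = c("a", "a", "b", "b"),
                    value = c(1, NA, 2, 3))
df_na %>% group_by(group) %>%
  summarize(n_rows = n(), # counts rows, NAs included: 2 and 2
            n_values = sum(!is.na(value)), # counts only non-NA values: 1 and 2
            .groups = 'drop')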
Question 2: How would you create a wide matrix with all protected species as columns, sites and years sampled as rows, and percent cover as the value?
Intro to ggplot
To get started with ggplot2, we need to make a dataset to plot. Instead of starting with a dataset that's ready to be plotted, let's use what we've learned to make a real dataset from NETN forest data. We're going to create the plot below, which shows the mean density of seedlings and saplings by size class for the most recent 4 years of sampling at ACAD.
Note this is also a demonstration of how building packages for your own programs can really save you time performing tasks that you’re likely to do over and over.
First you need to load the forestNETN package and import the data.
library(forestNETNarch)
library(tidyverse)
importCSV("./data/NETN_forest_csvs")
Now we’ll use one of the functions in the forestNETN package to pull together the regeneration data. The resulting dataset will have a row for every species of tree seedling or sapling found in a given plot from 2016-2019. We’ll use that dataset to sum up the total number of seedlings and saplings by size class for each plot.
acad_regen <- joinRegenData(park = "ACAD", from = 2016, to = 2019, canopyForm = "canopy")
head(acad_regen) # this dataframe includes species, so there are multiple rows per plot.
# We need to group_by Plot_Name, and sum seedling densities to get plot-level seedling densities.
acad_regen2 <- acad_regen %>% group_by(Plot_Name) %>%
summarize(seed15_30 = sum(seed15.30),
seed30_100 = sum(seed30.100),
seed100_150 = sum(seed100.150),
seed150p = sum(seed150p),
sapling = sum(sap.den),
.groups = 'drop') %>%
arrange(Plot_Name)
# This is the shortest way to get a dataset that has a mean seedling density for each size class and a se for each size class.
# Note we want to make this long, so every plot has each of the size classes
acad_regen_long <- acad_regen2 %>% pivot_longer(-Plot_Name, names_to = "size_class", values_to = "density")
# Now we calculate the mean and se for each size class at the park level
acad_stats <- acad_regen_long %>% group_by(size_class) %>%
summarize(mean_dens = mean(density),
se_dens = sd(density)/sqrt(n()),
.groups = 'drop')
# We have our dataset to plot.
The pros of using ggplot are that you can get a plot with very little work. The cons are that as soon as you want to change something, like removing the grid lines, customizing a legend, or changing the font size, it gets quite tedious. Using the default formatting that comes with ggplot, let's plot the acad_stats dataset.
The first step in ggplot is to specify the data and the x and y (and grouping, if you have any) values with the ggplot() function. After you set that up, you have to decide what kind of plot you want to use. The geoms I use most often are geom_point(), geom_bar(), geom_errorbar(), and geom_smooth(), all of which are demonstrated in this training. As long as the mapping is the same (eg the x/y values fit on the plot), you can add multiple geoms, such as points and lines that connect the points, or a geom_errorbar on top of a geom_bar. You can even include different datasets. Just know that the ranges of the x and y axes default to the x and y you set in ggplot(). You can manually change the range of an axis, but it requires more tinkering. You can also keep adding to an existing plot object with +.
acad_base <- ggplot(data = acad_stats, aes(x = size_class, y = mean_dens))
# Make a point plot
acad_base + geom_point()
# Make a bar chart
acad_bar <- acad_base + geom_bar(stat = 'identity')
acad_bar
# Add error bars to the previous bar chart
acad_bar + geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
labs(x = "Regeneration Size Class", y = "Stems per sq.m")
Note that the x-axis plots the size classes alphabetically instead of from smallest to largest. To fix this (this is where ggplot starts to get fiddly), we want to make the size_class column an ordered factor and set the order of its levels.
acad_stats$size_class_fact <- ordered(acad_stats$size_class,
levels = c('seed15_30', 'seed30_100', 'seed100_150', 'seed150p', 'sapling'))
acad_bar2 <- ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
geom_bar(stat = 'identity')+
geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
labs(x = "Regeneration Size Class", y = "Stems per sq.m")
acad_bar2
That's more like it, but I still don't like how the figure looks. For example, I don't like the gridlines in the figure. Those settings are buried in theme(), and to change them you need to add + theme(change_settings_here) to your plot.
acad_bar2 + theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
That’s a little better, but I still don’t like it. And, I never remember exactly what I need to type to remove the gridlines, take the grey fill off the background, etc. So I made my own theme that does all of that in the forestNETN package, called theme_FHM.
acad_bar2+theme_FHM()
Much better! But I still prefer the bars be a different color than grey. Let's fill the bars with a nice blue color. I'll start over on the code too, so you can see it all together. Note the use of "#5E79A8" as the bar fill color; it's the hex code for a blue-grey color I like. Hex codes are a universal way to specify the intensity of red, green, and blue in a color, and they make it easier to use the same color across multiple programs. If you want to pick your own color, you can use the HTML Color Codes website to pick a color and see its code. I also like the color picker, mixer and shades in w3schools.com.
acad_final <- ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
geom_bar(stat = 'identity', fill = "#5E79A8")+
geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
labs(x = "Regeneration Size Class", y = "Stems per sq.m")+
theme_FHM()
acad_final
Question 4: Take the acad_final graph and make 2 more changes to it. For example, make the line width on the error bars bigger, change the font size of the axis text bigger, tilt the x-axis labels (eg seed15_30) 45 degrees, rename x-axis labels, etc.
More advanced plotting in ggplot
Another great feature of ggplot is that you can use grouping variables to code by color or shape, and quickly create multiple panels of plots using facet_wrap() or facet_grid(). For this section, let’s work with some water data from Acadia. We’ll need to load the NCRNWater package and data first. Data files are pretty big, so it might take a minute or two to load.
library(NCRNWater)
netnwd<-importNCRNWater(Dir = "./data/NETN_water_data",
Data = "Water Data.csv",
MetaData = "VizMetaData.csv")
You should see an object in your global environment called netnwd. All of the water data are stored within this object, and we'll use getter functions in the water package to retrieve it. This is a different workflow than we've seen up until now. Hopefully it's straightforward enough that you can get to the final dataset for plotting, which will have one row for every pH measurement taken in each annually sampled ACAD lake. The columns will contain the measurement and information about the site/sample.
To find the lakes that have been sampled every year, we first have to find all of the site codes in ACAD and filter to only include lakes. Then we'll use the frequency of measurements in the dataset to determine which sites are sampled every year.
# Create data frame that lists the sites and their type using the getSiteInfo getter.
acad_all_sites <- data.frame(sitecode = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteCode"),
type = getSiteInfo(netnwd, parkcode = "ACAD", info = "type"),
fullname = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteName"))
head(acad_all_sites)
## sitecode type fullname
## 1 NETN_ACAD_ABIN Stream Aunt Betty Pond Inlet
## 2 NETN_ACAD_ANTB Lake Aunt Bettys Pond
## 3 NETN_ACAD_BOWL Lake The Bowl
## 4 NETN_ACAD_BRBK Lake Bear Brook Pond
## 5 NETN_ACAD_BRKB Stream Breakneck Brook
## 6 NETN_ACAD_BRWN Stream Browns Brook
# Filter data frame to only include lakes
acad_lakes <- acad_all_sites %>% filter(type == "Lake") #list of site codes and full names for ACAD lakes
acad_lakes
## sitecode type fullname
## 1 NETN_ACAD_ANTB Lake Aunt Bettys Pond
## 2 NETN_ACAD_BOWL Lake The Bowl
## 3 NETN_ACAD_BRBK Lake Bear Brook Pond
## 4 NETN_ACAD_BUBL Lake Bubble Pond
## 5 NETN_ACAD_EAGL Lake Eagle Lake
## 6 NETN_ACAD_ECHO Lake Echo Lake
## 7 NETN_ACAD_HODG Lake Hodgdon Pond
## 8 NETN_ACAD_JORD Lake Jordan Pond
## 9 NETN_ACAD_LBRK Lake Lower Breakneck
## 10 NETN_ACAD_LHAD Lake Lower Hadlock
## 11 NETN_ACAD_LONG Lake Long Pond (MDI)
## 12 NETN_ACAD_LPIH Lake Long Pond (IAH)
## 13 NETN_ACAD_ROUN Lake Round Pond
## 14 NETN_ACAD_SAMP Lake Sargent Mtn Pond
## 15 NETN_ACAD_SEAL Lake Seal Cove Pond
## 16 NETN_ACAD_SEAW Lake Seawall Pond
## 17 NETN_ACAD_UBRK Lake Upper Breakneck
## 18 NETN_ACAD_UHAD Lake Upper Hadlock
## 19 NETN_ACAD_WHOL Lake Witch Hole Pond
## 20 NETN_ACAD_WOOD Lake Lake Wood
# Compile the water pH data using getWData function
acad_ph <- getWData(netnwd, parkcode = "ACAD", years = 2006:2019, charname = 'pH')
# Filter water data to only include lakes
acad_lakes_ph <- acad_ph %>% filter(Site %in% acad_lakes$sitecode)
table(acad_lakes_ph$Site) # 8 sites with >90 measurements
##
## NETN_ACAD_ANTB NETN_ACAD_BOWL NETN_ACAD_BRBK NETN_ACAD_BUBL NETN_ACAD_EAGL
## 29 20 29 94 95
## NETN_ACAD_ECHO NETN_ACAD_HODG NETN_ACAD_JORD NETN_ACAD_LBRK NETN_ACAD_LHAD
## 94 28 93 29 30
## NETN_ACAD_LONG NETN_ACAD_LPIH NETN_ACAD_ROUN NETN_ACAD_SAMP NETN_ACAD_SEAL
## 94 5 24 19 93
## NETN_ACAD_SEAW NETN_ACAD_UBRK NETN_ACAD_UHAD NETN_ACAD_WHOL NETN_ACAD_WOOD
## 30 24 95 91 24
# Turn the table above into a dataframe, and use it to filter the water data on.
annual_lakes <- data.frame(table(acad_lakes_ph$Site)) %>% filter(Freq > 90)
# Filter acad_lakes_pH to only include the annual lakes and add column for month and year
acad_ph2 <- acad_lakes_ph %>% filter(Site %in% annual_lakes$Var1) %>%
mutate(month = format(Date, "%m"),
year = format(Date, "%Y"))
head(acad_ph2)
## Date SampleDepth Depth.Units Value MQL UQL Censored ValueCen TextValue
## 1 2006-05-23 epilimnion <NA> 6.570 NA NA FALSE 6.570 6.57
## 2 2006-06-21 epilimnion <NA> 6.515 NA NA FALSE 6.515 6.515
## 3 2006-07-20 epilimnion <NA> 6.610 NA NA FALSE 6.610 6.61
## 4 2006-08-10 epilimnion <NA> 6.620 NA NA FALSE 6.620 6.62
## 5 2006-09-26 epilimnion <NA> 6.120 NA NA FALSE 6.120 6.12
## 6 2006-10-17 epilimnion <NA> 6.560 NA NA FALSE 6.560 6.56
## Characteristic Category Site Park month year
## 1 pH physical NETN_ACAD_BUBL ACAD 05 2006
## 2 pH physical NETN_ACAD_BUBL ACAD 06 2006
## 3 pH physical NETN_ACAD_BUBL ACAD 07 2006
## 4 pH physical NETN_ACAD_BUBL ACAD 08 2006
## 5 pH physical NETN_ACAD_BUBL ACAD 09 2006
## 6 pH physical NETN_ACAD_BUBL ACAD 10 2006
acad_ph_final <- left_join(acad_ph2, acad_lakes[, c("sitecode", "fullname")],
by= c("Site" = "sitecode")) # how to join if columns have diff. names
head(acad_ph_final) #final dataset to plot
## Date SampleDepth Depth.Units Value MQL UQL Censored ValueCen TextValue
## 1 2006-05-23 epilimnion <NA> 6.570 NA NA FALSE 6.570 6.57
## 2 2006-06-21 epilimnion <NA> 6.515 NA NA FALSE 6.515 6.515
## 3 2006-07-20 epilimnion <NA> 6.610 NA NA FALSE 6.610 6.61
## 4 2006-08-10 epilimnion <NA> 6.620 NA NA FALSE 6.620 6.62
## 5 2006-09-26 epilimnion <NA> 6.120 NA NA FALSE 6.120 6.12
## 6 2006-10-17 epilimnion <NA> 6.560 NA NA FALSE 6.560 6.56
## Characteristic Category Site Park month year fullname
## 1 pH physical NETN_ACAD_BUBL ACAD 05 2006 Bubble Pond
## 2 pH physical NETN_ACAD_BUBL ACAD 06 2006 Bubble Pond
## 3 pH physical NETN_ACAD_BUBL ACAD 07 2006 Bubble Pond
## 4 pH physical NETN_ACAD_BUBL ACAD 08 2006 Bubble Pond
## 5 pH physical NETN_ACAD_BUBL ACAD 09 2006 Bubble Pond
## 6 pH physical NETN_ACAD_BUBL ACAD 10 2006 Bubble Pond
Now that we have our final dataset, let’s first create a plot that color codes points and lines by site.
ph_plot <- ggplot(acad_ph_final, aes(x = Date, y = Value, color = Site, shape = Site))+
geom_point()+
labs(y = "pH")+
theme_bw()
ph_plot
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 8. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 186 rows containing missing values (geom_point).
There’s too much going on here to make sense of the data. Let’s split the data so that each site has its own plot.
ph_plot_site <- ggplot(acad_ph_final, aes(x = Date, y = Value))+
geom_point(shape = 16, color = "LightBlue")+
labs(y = "pH")+
theme_bw()+
facet_wrap(~fullname)
ph_plot_site
Now let’s add a loess smoother to explore trends. The great thing about ggplot is that you can keep adding to existing plots without having to repeat code.
ph_plot_site_smooth <- ph_plot_site + geom_smooth(method = 'loess', span = 1, color = "RoyalBlue", se = FALSE)
# vary the span to increase/decrease wavyness of smooth
ph_plot_site_smooth + geom_smooth(method = 'loess', span = 0.1, color = "red", se = FALSE)+
geom_smooth(method = 'loess', span = 0.5, color = 'green', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Now let's change the year labels so they're at an angle and only appear every 2 years.
ph_plot_site <- ggplot(acad_ph_final, aes(x = Date, y = Value))+
geom_point(shape = 16, color = "LightBlue")+
labs(y = "pH")+
theme_bw()+
facet_wrap(~fullname)+
geom_smooth(method = 'loess', span = 0.5, color = 'RoyalBlue', se = FALSE)
ph_plot_site2 <- ph_plot_site + scale_x_date(date_breaks = "2 years",
date_labels = "%Y")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ph_plot_site2
## `geom_smooth()` using formula 'y ~ x'
I’m just barely scratching the surface of what you can do in ggplot2. For more examples of the types of graphs you can make, check out the R ggplot gallery.
Question 5: Filter the acad_ph_final data frame to only include 2019 data and create a plot with months as the X axis, and includes points, a loess smooth with span = 1, and a facet on site.
Creating a Shapefile in R
Often I want to map the final output of summaries I do in R. While you can save tables in R and then join them to existing shapefiles in ArcGIS, I prefer to save my output as a shapefile directly from R. In case this is helpful to you, I'll show you the easiest way I've found to do it.
The keys are to have the X and Y coordinates included in your final summary, and to know which coordinate reference system (abbreviated CRS) to specify. For example, parks in NETN span 2 UTM zones: ACAD and MIMA are in UTM Zone 19N, and the other parks are in Zone 18N, so you have to specify the correct zone.
Most of the major coordinate systems have a registered EPSG code assigned to them, including ours. EPSG stands for European Petroleum Survey Group, which seems pretty random, but it's a useful set of universally unique codes for datums and projections that most GIS software can recognize. Using the EPSG code is the fastest way I've found to assign the coordinate system for our shapefile. For ACAD and MIMA, the EPSG code is 26919. For the other parks it's 26918 (notice the last digit is the only difference, and it matches the UTM zone).
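As a quick reference, here's a minimal sketch of how those two EPSG codes translate into CRS objects, using the sp/rgdal syntax we'll use below (CRS() comes from the sp package, which loads along with rgdal):
library(rgdal) # loads sp, which provides CRS()
utm19N <- CRS("+init=epsg:26919") # NAD83 / UTM Zone 19N: ACAD and MIMA
utm18N <- CRS("+init=epsg:26918") # NAD83 / UTM Zone 18N: the other NETN parks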
To start out, let’s load our wetland dataset again. Let’s say we want to create a shapefile for the most recent sample of each site and a column with the number of species found at each site.
library(tidyverse)
library(rgdal) # for GIS functions
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
table(wetdat$Site_Name, wetdat$Year) # SEN most recent is 2011, RAM is 2017
wetdat_rich <- wetdat %>% filter(Year %in% c(2011, 2017)) %>%
mutate(present = ifelse(!is.na(Ave_Cov) & Ave_Cov > 0, 1, 0)) %>%
group_by(Site_Name, Year, Site_Type, X_Coord, Y_Coord) %>%
summarize(numspp = sum(present),
.groups = 'drop')
head(wetdat_rich)
## # A tibble: 6 x 6
## Site_Name Year Site_Type X_Coord Y_Coord numspp
## <chr> <int> <chr> <dbl> <dbl> <dbl>
## 1 RAM-05 2017 RAM 553186. 4899764. 54
## 2 RAM-41 2017 RAM 549667. 4903243. 40
## 3 RAM-44 2017 RAM 530882. 4877460. 45
## 4 RAM-53 2017 RAM 549032. 4909911. 50
## 5 RAM-62 2017 RAM 556731. 4908125. 26
## 6 SEN-01 2011 Sentinel 574855. 4911909. 34
Now we’re ready to save to shapefile.
# Step 1: Create spatial object by specifying the coordinates
coordinates(wetdat_rich) <- ~X_Coord + Y_Coord
head(wetdat_rich)
plot(wetdat_rich) # These points now plot in space (bottom left point is on IAH)
# Step 2: Define the coordinate system
proj4string(wetdat_rich) <- CRS("+init=epsg:26919")
# Step 3: Write the spatial object to a shapefile
writeOGR(wetdat_rich, dsn = "./shapefiles", layer = "wetdat_rich", driver = "ESRI Shapefile",
overwrite_layer = TRUE) # overwrite is in case you need to update your shapefile.
A lot of this may be unfamiliar to you, especially since ArcGIS does a lot of this work behind the scenes.
GIS Mapping in R
I only have time to scratch the surface on this topic, but I think it's useful to know that anything you can do in ArcGIS with vector and raster data, you can do in R. GIS in R requires more understanding of what's happening under the hood, and sometimes it isn't worth the effort. It depends on what you need to do, how involved the process is, and whether you need to document the process and/or repeat it over and over. For example, I've found the raster package in R to be relatively easy to use and faster at processing large files (eg NLCD rasters), and I almost always use R when I'm working with rasters. In contrast, for making nice maps that need a lot of tweaking (eg moving plot labels to prevent overlap, multiple legends, etc), I tend to use ArcGIS. But, to give you an idea of the things you can do in R, I'll show you two ways to plot data spatially: first with ggplot, and second with leaflet.
Let’s load the wetland shapefile we just created, along with a layer that plots wetlands in the vegetation map and the park fee boundary.
# Load packages
library(rgdal) # for readOGR
library(ggplot2) #for ggplot functions
library(ggspatial) #to make ggplot functions spatial
# Load shapefiles
bounds <- readOGR("./data/example_data/ParkBndCARTO_Fee_2020.shp", layer = "ParkBndCARTO_Fee_2020",
GDAL1_integer64_policy = TRUE) # Don't ask me why this is here. All I know is it works.
head(bounds@data) # How you view the attribute table of a shapefile in R
wetmap <- readOGR("./data/example_data/ACAD_vegmap_wetlands.shp", layer = "ACAD_vegmap_wetlands",
GDAL1_integer64_policy = TRUE)
wet_rich_shp <- readOGR("./shapefiles/wetdat_rich.shp", layer = "wetdat_rich",
GDAL1_integer64_policy = TRUE)
head(wet_rich_shp@data)
head(wet_rich_shp@coords) # Where the coordinates are stored
plot(bounds)
To prepare the shapefile polygons for plotting in ggplot, we first have to fortify them, which converts each polygon into a data frame of coordinates that ggplot can plot.
# Prepare shapefiles for plotting
bounds2 <- fortify(bounds)
## Regions defined for each Polygons
str(bounds2) # note the data frame structure that describes the boundary data
wetmap2 <- fortify(wetmap)
## Regions defined for each Polygons
str(wetmap2)
# Turn the wetland richness shapefile back into a dataset.
# One reason to do this in R: you never had to make this a shapefile to plot it.
wet_rich_df <- cbind(as.data.frame(wet_rich_shp@data),
long = wet_rich_shp@coords[,1],
lat = wet_rich_shp@coords[,2])
Now to map the different shapefiles using spatial geoms from ggspatial.
ggmap <- ggplot()+
coord_sf(crs = 26919)+ # set coordinate system to NAD83 UTM Zone 19N using EPSG code
geom_polypath(data = bounds2, aes(x = long, y = lat, group = group),
fill = "grey", alpha = 0.4)+
geom_polypath(data = wetmap2, aes(x = long, y = lat, group = group),
fill = "#80A8D1")+
geom_point(data = wet_rich_df,
aes(x = long,
y = lat,
color = Site_Type),
shape = 17)+
theme_minimal() # because I don't like ggplot's default grey background and grid lines
ggmap
You can add north arrows, scale bars and legends using other packages, including sf. For more information about this, check out r-spatial’s site.
Finally, I'll show how to use leaflet to map data with park tiles as the base layer. Leaflet is a JavaScript mapping library, but its main functions have been wrapped in the R leaflet package. The great thing about leaflet is that you can make interactive maps that let you zoom and pan to different views. We use these a lot for our data visualizers.
To use leaflet, you have to have all of your spatial data in WGS84 instead of NAD83 UTM Zone 19N. The first step is to reproject the wetland points to WGS84.
wet_rich_dd <- spTransform(wet_rich_shp, CRS="+proj=longlat +datum=WGS84")
plot(wet_rich_dd) #check that they still plot spatially
library(leaflet)
park_tile_basic = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"
map_leaf <- leaflet() %>%
setView(lng = -68.312,
lat = 44.25,
zoom = 10) %>%
addTiles(urlTemplate = park_tile_basic) %>%
addCircleMarkers(data = wet_rich_dd,
radius = 4,
lng = wet_rich_dd@coords[,1],
lat = wet_rich_dd@coords[,2],
label = wet_rich_dd@data$Site_Name,
labelOptions = labelOptions(noHide=T, textOnly = TRUE,
direction = 'bottom', textsize = "12px"),
fillColor = "#4C9B4F",
fillOpacity = 1,
stroke = FALSE) %>%
addScaleBar(position = 'bottomright',
            options = scaleBarOptions(maxWidth = 10, metric = TRUE)) # scale bar options go inside addScaleBar()
map_leaf
When I first started with NETN, the workflow below was how we operated. There’s no shame in this workflow, and it’s still what a lot of networks do. However, it’s time-consuming, requires a lot of human intervention where errors can be introduced, and you nearly have to start from scratch every year you need to update your reports or do a new summary.
This is the dream. Again there’s no judgment if you’re not here. This is something to aspire to, as you have time and gain more skills in R. I’m proud to report that we’re really close to achieving this with most of our protocols in NETN, but it’s been a long road (4+ years). RStudio has also come a long way in the last 5 years and has made this transition easier with improvements to R Markdown (automated reporting), R Shiny (interactive websites), and the tidyverse.
There’s a lot of great online material for learning new applications of R. The ones I’ve used the most are listed below.
Online Books
Question 1: How do you find the value of the 3rd row of domSpp?
# If starting from scratch
library(tidyverse)
df <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 'plot7'), #column for plot number
numtrees = c(24, 18, 17, 14, 31, 27, 15),
numLive = c(21, 18, 16, 10, 29, 26, 14),
domSpp = c('red_maple', 'red_spruce',
'red_spruce','white_cedar','white_pine',
'white_spruce', NA),
invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))
df[3, c("domSpp")] # option 1
## [1] "red_spruce"
df[3, 4] # option 2
## [1] "red_spruce"
Question 2: How would you make numtrees an integer?
df$numtrees <- as.integer(df$numtrees)
str(df)
## 'data.frame': 7 obs. of 5 variables:
## $ plot : chr "1" "2" "3" "4" ...
## $ numtrees : int 24 18 17 14 31 27 15
## $ numLive : num 21 18 16 10 29 26 14
## $ domSpp : chr "red_maple" "red_spruce" "red_spruce" "white_cedar" ...
## $ invasive_pres: logi TRUE FALSE FALSE FALSE FALSE TRUE ...
Question 3: How would you change the second row in invasive_pres from FALSE to TRUE?
df$invasive_pres[2] <- TRUE
head(df)
## plot numtrees numLive domSpp invasive_pres
## 1 1 24 21 red_maple TRUE
## 2 2 18 18 red_spruce TRUE
## 3 3 17 16 red_spruce FALSE
## 4 4 14 10 white_cedar FALSE
## 5 5 31 29 white_pine FALSE
## 6 6 27 26 white_spruce TRUE
Question 4: How would you calculate the percent of trees that are dead in the df data frame?
# base R version (2 options)
df$numDead <- df$numtrees - df$numLive
df$pctDead <- df$numDead/df$numtrees*100
df$pctDead <- with(df, (numtrees - numLive)/numtrees*100)
# tidyverse version (2 options)
df <- df %>% mutate(numDead = numtrees - numLive,
pctDead = numDead/numtrees*100)
df <- df %>% mutate(pctDead = (numtrees - numLive)/numtrees*100)
Question 5: What is the smallest number of dead trees observed in the df dataset?
min(df$numDead) # uses numdead column calculated for previous question.
## [1] 0
summary(df$numDead)[1] #Min. is the first record that summary reports for numeric columns
## Min.
## 0
Question 6: How would you select only Invasive species records?
# If starting from scratch
wetdat<-read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
wetinv <- wetdat[wetdat$Invasive == TRUE, ] # base R bracket-based solution
wetinv2 <- subset(wetdat, Invasive == TRUE) # another base R solution
wetinv_tidy <- wetdat %>% filter(Invasive == TRUE) # tidyverse solution
Question 7: Which sites have both Arethusa bulbosa and Carex exilis present?
# Base R solution
wetspp <- wetdat[wetdat$Latin_Name %in% c("Arethusa bulbosa", "Carex exilis"),]
sort(unique(wetspp$Site_Name))
## [1] "RAM-05" "SEN-01" "SEN-02"
# Tidyverse solution
wetspp_tidy <- wetdat %>% filter(Latin_Name %in% c("Arethusa bulbosa", "Carex exilis"))
sort(unique(wetspp_tidy$Site_Name)) # RAM-05 and SEN-01 have both species present
## [1] "RAM-05" "SEN-01" "SEN-02"
Question 8: How would you sort the wetdat by Ave_Cov (high to low) and Latin_Name?
wetdat_sort <- wetdat[order(-wetdat$Ave_Cov, wetdat$Latin_Name), ] # base R
head(wetdat_sort)
## Site_Name Site_Type Latin_Name
## 286 RAM-62 RAM Carex trisperma
## 328 RAM-62 RAM Vaccinium vitis-idaea
## 425 RAM-05 RAM Carex folliculata
## 220 RAM-53 RAM Eurybia radula
## 506 RAM-05 RAM Viburnum nudum var. cassinoides
## 305 RAM-62 RAM Maianthemum trifolium
## Common Year PctFreq Ave_Cov Invasive Protected
## 286 threeseeded sedge 2017 100 74 FALSE FALSE
## 328 lingonberry 2012 100 66 FALSE FALSE
## 425 long sedge 2017 50 63 FALSE FALSE
## 220 low rough aster 2012 50 63 FALSE FALSE
## 506 northern wild raisin 2012 100 63 FALSE FALSE
## 305 threeleaf false lily of the valley 2012 100 51 FALSE FALSE
## X_Coord Y_Coord
## 286 556731.5 4908125
## 328 556731.5 4908125
## 425 553186.0 4899764
## 220 549032.1 4909911
## 506 553186.0 4899764
## 305 556731.5 4908125
wetdat_sort_tidy <- wetdat %>% arrange(-Ave_Cov, Latin_Name) #tidyverse
head(wetdat_sort_tidy)
## Site_Name Site_Type Latin_Name
## 1 RAM-62 RAM Carex trisperma
## 2 RAM-62 RAM Vaccinium vitis-idaea
## 3 RAM-05 RAM Carex folliculata
## 4 RAM-53 RAM Eurybia radula
## 5 RAM-05 RAM Viburnum nudum var. cassinoides
## 6 RAM-62 RAM Maianthemum trifolium
## Common Year PctFreq Ave_Cov Invasive Protected
## 1 threeseeded sedge 2017 100 74 FALSE FALSE
## 2 lingonberry 2012 100 66 FALSE FALSE
## 3 long sedge 2017 50 63 FALSE FALSE
## 4 low rough aster 2012 50 63 FALSE FALSE
## 5 northern wild raisin 2012 100 63 FALSE FALSE
## 6 threeleaf false lily of the valley 2012 100 51 FALSE FALSE
## X_Coord Y_Coord
## 1 556731.5 4908125
## 2 556731.5 4908125
## 3 553186.0 4899764
## 4 549032.1 4909911
## 5 553186.0 4899764
## 6 556731.5 4908125
Question 1: There’s a Latin name that needs to be fixed (hint look for non-letter symbols). How would you find that record?
sort(unique(wetdat$Latin_Name))[1:10] # Low-tech way: use your eyes to spot the "Alnus incana++". I truncated to first 10.
## [1] "Acer rubrum" "Alnus incana"
## [3] "Alnus incana++" "Amelanchier"
## [5] "Andromeda polifolia" "Apocynum androsaemifolium"
## [7] "Arethusa bulbosa" "Aronia melanocarpa"
## [9] "Berberis thunbergii" "Betula populifolia"
Hopefully you noticed Alnus incana++ when you sorted the unique Latin names. The ++ is an error that we want to remove (see Question 2). If your dataset is too big to spot every problem by eye, you can use regular expressions like the code chunk below. The pattern in the grep returns any record that ends in a non-alphabetical symbol: inside the brackets, [:alpha:] matches any letter, and the "^" means NOT in regular expressions, so [^[:alpha:]] matches any non-letter character. The + means one or more of those characters in a row, and the $ anchors the match to the end of the record, which is how trailing symbols like ++ get caught.
# The regular expression way to find non-alphabetical symbols (now you know why I didn't cover this).
wetdat$Latin_Name[grep("[^[:alpha:]]+$", wetdat$Latin_Name)] # returns records ending in one or more non-letter characters
## [1] "Alnus incana++"
# [^[:alpha:]] matches any non-letter character, + means one or more of them, and $ anchors the match to the end of the record.
# This took longer than I'd like to admit to make work, by the way.
Question 2: Now that you’ve found it, fix it, and check that you fixed it.
wetdat$Latin_Name[wetdat$Latin_Name == "Alnus incana++"] <- "Alnus incana" # easy approach
wetdat$Latin_Name <- gsub("[+]", "", wetdat$Latin_Name) # gsub approach
wetdat$Latin_Name[grep("[^[:alpha:]]+$", wetdat$Latin_Name)] # no values returned
## character(0)
sort(unique(wetdat$Latin_Name))[1:10] # using your eyes approach
## [1] "Acer rubrum" "Alnus incana"
## [3] "Amelanchier" "Andromeda polifolia"
## [5] "Apocynum androsaemifolium" "Arethusa bulbosa"
## [7] "Aronia melanocarpa" "Berberis thunbergii"
## [9] "Betula populifolia" "Calamagrostis canadensis"
Question 3: How would you take only the plots on MDI (eg Unit_ID = “ACAD_MDI”) in the loc table and join them to their corresponding visits in the events table?
# In case you need to load the tables
loc <- read.csv("./data/example_data/tbl_Locations_example.csv")
event <- read.csv("./data/example_data/tbl_Events_example.csv")
stand <- read.csv("./data/example_data/tbl_Stand_Data_example.csv")
# Filter then join
loc_MDI <- loc %>% filter(Unit_ID == "ACAD_MDI")
locev_MDI <- left_join(loc_MDI, event, by = "Location_ID")
# All in one via pipes:
locev_MDI2 <- loc %>% filter(Unit_ID == "ACAD_MDI") %>%
left_join(event, by = "Location_ID")
Question 4: How would you join the stand data to the locev table (the one we made that dropped QAQC events)?
locev <- merge(loc, event[event$Event_QAQC == FALSE, ], all.x = TRUE, all.y = TRUE)
intersect(names(locev), names(stand))
## [1] "Event_ID"
locev_stand <- merge(locev, stand, by = "Event_ID", all.x = TRUE, all.y = TRUE) #base R full join
locev_stand_tidy <- full_join(locev, stand, by = "Event_ID") #tidyverse full join
Question 5: How would you calculate the min and max number of species by year?
head(wetdat)
## Site_Name Site_Type Latin_Name Common Year PctFreq Ave_Cov
## 1 SEN-01 Sentinel Acer rubrum red maple 2011 0 0.02
## 2 SEN-01 Sentinel Amelanchier serviceberry 2011 20 0.02
## 3 SEN-01 Sentinel Andromeda polifolia bog rosemary 2011 80 2.22
## 4 SEN-01 Sentinel Arethusa bulbosa dragon's mouth 2011 40 0.04
## 5 SEN-01 Sentinel Aronia melanocarpa black chokeberry 2011 100 2.64
## 6 SEN-01 Sentinel Carex exilis coastal sedge 2011 60 6.60
## Invasive Protected X_Coord Y_Coord
## 1 FALSE FALSE 574855.5 4911909
## 2 FALSE FALSE 574855.5 4911909
## 3 FALSE FALSE 574855.5 4911909
## 4 FALSE TRUE 574855.5 4911909
## 5 FALSE FALSE 574855.5 4911909
## 6 FALSE FALSE 574855.5 4911909
wetmin_max <- wetdat %>% mutate(present = ifelse(Ave_Cov >0 & !is.na(Ave_Cov), 1, 0)) %>%
group_by(Site_Name, Year) %>%
summarize(numspp = sum(present),
.groups = "drop") %>%
group_by(Year) %>%
summarize(minspp = min(numspp),
maxspp = max(numspp),
.groups = "drop")
wetmin_max
## # A tibble: 3 x 3
## Year minspp maxspp
## * <int> <dbl> <dbl>
## 1 2011 33 41
## 2 2012 26 48
## 3 2017 26 54
Note that first I had to sum up the number of species within a site (hence group_by(Site_Name, Year)). After I calculate that, I grouped by Year only, then found the min and max number of species recorded within a year.
Question 6: What percent of RAM sites had an invasive species present in 2012 and 2017?
#head(wetdat)
wetinv1 <- wetdat %>% filter(Site_Type == "RAM") %>%
group_by(Site_Name, Year) %>%
summarize(num_invspp = sum(Invasive),
plot_inv = ifelse(num_invspp > 0, 1, 0),
.groups = "drop")
head(wetinv1)
## # A tibble: 6 x 4
## Site_Name Year num_invspp plot_inv
## <chr> <int> <int> <dbl>
## 1 RAM-05 2012 0 0
## 2 RAM-05 2017 1 1
## 3 RAM-41 2012 0 0
## 4 RAM-41 2017 1 1
## 5 RAM-44 2012 1 1
## 6 RAM-44 2017 1 1
# Note plot_inv is to set plot to 1 or 0 even if multiple invasives are found on the plot.
wetinv <- wetinv1 %>% group_by(Year) %>%
summarize(tot_inv_plots = (sum(plot_inv)/n())*100,
.groups = "drop")
wetinv
## # A tibble: 2 x 2
## Year tot_inv_plots
## * <int> <dbl>
## 1 2012 40
## 2 2017 80
Note that logical fields (ie TRUE/FALSE) are treated as numbers in R when you use sum(): TRUE counts as 1 and FALSE counts as 0. So sum(Invasive) is summing all of the TRUEs in the Invasive column.
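A couple of quick console examples of this behavior:
sum(c(TRUE, FALSE, TRUE, TRUE)) # returns 3
mean(c(TRUE, FALSE, TRUE, TRUE)) # returns 0.75, which is handy for calculating proportions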
Question 1: How would you create a dataset that has a column for every sample year (ie 2013-2019), a row for every site, and the number of northern long-eared bats found as the value?
# In case you're starting from scratch
library(tidyverse)
bat_long <- read.csv("./data/example_data/fake_bats.csv")
bat_long <- bat_long %>% mutate(sppcode = toupper(paste0(substr(word(Latin, 1), 1, 3),
substr(word(Latin, 2), 1, 3))))
MYOSEP_sum <- bat_long %>% filter(sppcode == "MYOSEP") %>%
group_by(Site, Year) %>%
summarize(num_bats = sum(ifelse(!is.na(Latin), 1, 0)),
.groups = 'drop')
head(MYOSEP_sum) # Note only the site and years where MYOSEP was found are listed. Next lines add those back before pivot
bat_site_year <- bat_long %>% expand(Site, Year)
MYOSEP_comp <- bat_site_year %>% left_join(., MYOSEP_sum, by = c("Site", "Year"))
MYOSEP_comp$num_bats[is.na(MYOSEP_comp$num_bats)] <- 0 # Replace all the NAs created by left join with 0
MYOSEP_wide <- MYOSEP_comp %>% arrange(Site, Year) %>%
pivot_wider(names_from = Year, values_from = num_bats, values_fill = 0)
MYOSEP_wide
## # A tibble: 3 x 8
## Site `2013` `2014` `2015` `2016` `2017` `2018` `2019`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BrownMtnCR 0 0 0 0 0 1 0
## 2 BrownMtnGH 1 0 1 0 0 0 0
## 3 BubPdS 0 0 0 0 0 0 0
Question 2: How would you create a wide matrix with all protected species as columns, sites and years sampled as rows, and percent cover as the value?
# If you're starting from scratch
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
wetdat_prep <- wetdat %>% mutate(sppcode = toupper(paste0(substr(word(Latin_Name, 1), 1, 3),
substr(word(Latin_Name, 2), 1, 3))))
wetdat_dist <- wetdat_prep %>% distinct(Site_Name, Year) # makes tibble of all site, year combinations, but doesn't add any like expand
wetdat_wide <- wetdat_prep %>% filter(Protected == TRUE) %>%
select(Site_Name, Year, sppcode, Ave_Cov) %>%
pivot_wider(names_from = sppcode, values_from = Ave_Cov, values_fill = 0) %>%
left_join(wetdat_dist, ., by = c("Site_Name", "Year"))
wetdat_wide[, c("AREBUL", "POGOPH", "CALTUB")][is.na(wetdat_wide[, c("AREBUL", "POGOPH", "CALTUB")])] <- 0
Note that I used distinct() instead of expand() because not every site is sampled every year. I used distinct to add site/year combinations that were sampled but where protected species weren’t found. Expand would have added a record for every year in the data to every site.
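Here's a toy sketch of the difference between the two, using made-up site/year data:
toy <- data.frame(Site_Name = c("SEN-01", "SEN-01", "RAM-05"),
                  Year = c(2011, 2011, 2017))
toy %>% distinct(Site_Name, Year) # 2 rows: only the combinations present in the data
toy %>% expand(Site_Name, Year) # 4 rows: every possible Site_Name x Year combination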
Question 3: How would you create a wide matrix for ACAD plots in 2019 that has plots for rows, and exotic species as columns?
# Starting from scratch
library(forestNETNarch)
importCSV("./data/NETN_forest_csvs")
acad_exo <- makeSppList(park = "ACAD", from = 2019, to = 2019, speciesType = "exotic")
acad_exo <- acad_exo %>% mutate(present = ifelse(Exotic == TRUE & !is.na(Exotic), 1, 0))
head(acad_exo)
# Now to answer the question
# Old school spread version
acad_wide_spread <- acad_exo %>% select(Plot_Name, Latin_Name, present) %>%
arrange(Latin_Name, Plot_Name) %>%
spread("Latin_Name", "present", fill = 0)
# Newer pivot version
acad_wide_pivot <- acad_exo %>% select(Plot_Name, Latin_Name, present) %>%
arrange(Latin_Name, Plot_Name) %>%
pivot_wider(names_from = "Latin_Name",
values_from = "present",
values_fill = 0)
head(acad_wide_pivot)
## # A tibble: 6 x 7
## Plot_Name `Festuca filifo~ `Hieracium caes~ `Lonicera - Exo~ `Poa compressa`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ACAD-044 1 0 1 0
## 2 ACAD-045 1 0 0 0
## 3 ACAD-078 1 0 0 0
## 4 ACAD-176 1 0 0 1
## 5 ACAD-058 0 1 0 0
## 6 ACAD-039 0 0 0 0
## # ... with 2 more variables: `Rhamnus frangula` <dbl>, `NA` <dbl>
Question 4: Take the acad_final graph and make 2 more changes to it.
You may have chosen other things to do than I did. Here are a couple of changes to the ggplot bar chart
# Starting from scratch
library(forestNETN)
importCSV("./data/NETN_forest_csvs")
acad_regen <- joinRegenData(park = "ACAD", from = 2016, to = 2019, canopyForm = "canopy")
acad_regen2 <- acad_regen %>% group_by(Plot_Name) %>%
summarize(seed15_30 = sum(seed15.30),
seed30_100 = sum(seed30.100),
seed100_150 = sum(seed100.150),
seed150p = sum(seed150p),
sapling = sum(sap.den),
.groups = 'drop') %>%
arrange(Plot_Name)
acad_regen_long <- acad_regen2 %>% pivot_longer(-Plot_Name,
names_to = "size_class",
values_to = "density")
# Now we calculate the mean and se for each size class at the park level
acad_stats <- acad_regen_long %>% group_by(size_class) %>%
summarize(mean_dens = mean(density),
se_dens = sd(density)/sqrt(n()),
.groups = 'drop')
acad_stats$size_class_fact <- ordered(acad_stats$size_class,
levels = c('seed15_30', 'seed30_100',
'seed100_150', 'seed150p', 'sapling'))
# Now to answer the question
ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
geom_bar(stat = 'identity', fill = "steelblue4")+
geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens),
width = 0.2, lwd = 0.7)+
labs(x = "Regeneration Size Class", y = "Stems per sq.m")+
theme_FHM()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+ # hjust = 1 right-justifies the angled labels so they line up with the tick marks
scale_x_discrete(labels = c('15-30cm', '30-100cm', '100-150cm', '>150cm', '1-10cm DBH'))
Note that width = 0.2 in geom_errorbar() controls the width of the caps on the error bars, and lwd controls the thickness of the lines.
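If you want to see how those two arguments change the look, here's a minimal sketch with made-up numbers (demo is a hypothetical data frame, not part of the training data):
demo <- data.frame(size = c("small", "large"), mean = c(5, 9), se = c(1, 2))
ggplot(demo, aes(x = size, y = mean))+
  geom_bar(stat = 'identity')+
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                width = 0.5, lwd = 0.4) # wider caps and a thinner line than above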
Question 5: Filter the acad_ph_final data frame to only include 2019 data, then create a plot that has months on the X axis and includes points, a loess smooth with span = 1, and a facet on site.
# Starting from scratch
library(NCRNWater)
netnwd<-importNCRNWater(Dir = "./data/NETN_water_data",
Data = "Water Data.csv",
MetaData = "VizMetaData.csv")
acad_all_sites <- data.frame(sitecode = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteCode"),
type = getSiteInfo(netnwd, parkcode = "ACAD", info = "type"),
fullname = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteName"))
acad_lakes <- acad_all_sites %>% filter(type == "Lake") #list of site codes and full names for ACAD lakes
acad_ph <- getWData(netnwd, parkcode = "ACAD", years = 2006:2019, charname = 'pH')
acad_lakes_ph <- acad_ph %>% filter(Site %in% acad_lakes$sitecode)
annual_lakes <- data.frame(table(acad_lakes_ph$Site)) %>% filter(Freq > 90)
acad_ph2 <- acad_lakes_ph %>% filter(Site %in% annual_lakes$Var1) %>%
mutate(month = format(Date, "%m"),
year = format(Date, "%Y"))
acad_ph_final <- left_join(acad_ph2, acad_lakes[, c("sitecode", "fullname")],
by= c("Site" = "sitecode")) # how to join if columns have diff. names
#head(acad_ph_final)
# Now to answer the question
acad_ph_2019 <- acad_ph_final %>% filter(year == 2019) %>% mutate(mon_abr = format(Date, "%b"))
plot19 <- ggplot(acad_ph_2019, aes(x = Date, y = Value))+
geom_point()+
labs(y = "pH")+
geom_smooth(method = 'loess', span = 1, se = FALSE)+
scale_x_date(date_breaks = "1 month",
date_labels = "%b")+ #%b is months names abbreviated
facet_wrap(~fullname, ncol = 4)+
theme_bw()
plot19
## `geom_smooth()` using formula 'y ~ x'
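As a quick reference, you can test the date format codes right in the console (output assumes an English locale):
format(as.Date("2019-07-04"), "%b") # "Jul" - abbreviated month name
format(as.Date("2019-07-04"), "%m") # "07" - month number
format(as.Date("2019-07-04"), "%Y") # "2019" - 4-digit year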
This tab prints all of the code chunks in this document as one long script.
#--------------------
# Prep
#--------------------
install.packages(c("tidyverse", "devtools", "rgdal", "ggspatial", "leaflet"))
devtools::install_github("katemmiller/forestNETNarch")
devtools::install_github("NCRN/NCRNWater")
library(tidyverse)
library(forestNETN)
library(NCRNWater)
#--------------------
# Day 1 Mod 1
#--------------------
install.packages(c("tidyverse", "rgdal", "devtools"))
devtools::install_github("katemmiller/forestNETNarch")
devtools::install_github("NCRN/NCRNWater")
library(tidyverse)
library(rgdal)
library(forestNETNarch)
library(NCRNWater)
#--------------------
# Day 1 Mod 2
#--------------------
subfolders<-c("data", "figures", "tables", "rscripts", "shapefiles")
invisible(lapply(subfolders, function(x){
if(!dir.exists(x)){dir.create(paste(getwd(), x, sep = '/'))}
}))
zips <- c("./data/example_data", "./data/NETN_forest_csvs", "./data/NETN_water_data", "./data/NETN_bird_data")
unlink(zips, recursive=T)
list.files("./data")
list.files("./data") # check that you have data.zip in your data folder
unzip("./data/data.zip", exdir = "./data")
list.files("./data") # check that the unzip worked. There should be multiple folders with datasets in each.
#--------------------
# Day 1 Mod 3
#--------------------
?plot
?dplyr::filter
help(package=dplyr)
#--------------------
# Day 1 Mod 4
#--------------------
library(tidyverse)
# Vector of numbers
vec_num <- c(1, 2, 3, 4, 5)
vec_num
# Vector of strings
vec_str <- c("Schoodic", "MDI-East", "MDI-West", "IAH")
vec_str
vec_str[1]
df <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 'plot7'), #column for plot number
numtrees = c(24, 18, 17, 14, 31, 27, 15),
numLive = c(21, 18, 16, 10, 29, 26, 14),
domSpp = c('red_maple', 'red_spruce',
'red_spruce','white_cedar','white_pine',
'white_spruce', NA),
invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))
print(df)
df$plot # List the values within the column plot for the df data frame
df[ , 'plot'] # Bracket version of the same thing
df[ , 1] # Bracket version of showing the first column, which is plot
df[2, 1] # Selecting 2nd row of 1st column
head(df) # to view the first 6 rows of df
tail(df) # to view last 6 rows of df
names(df) # to view the column names and their order
str(df) # to view how R is defining each column type
# The long way (note that I renamed the object here)
df2 <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 7), #column for plot number
numtrees = c(24, 18, 17, 14, 31, 27, 15),
numLive = c(21, 18, 16, 10, 29, 26, 14),
domSpp = c('red_maple', 'red_spruce',
'red_spruce','white_cedar','white_pine',
'white_spruce', 'white_spruce'),
invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))
# A shorter way
df$plot[7] <- 7 #takes the 7th row of the plot column and changes it to 7
df$domSpp[7] <- "white_spruce" #takes the 7th row of domSpp and replaces it with "white_spruce"
# Another shorter way
df[7, "plot"] <- 7 #takes the 7th row of the plot column and makes it 7
df[7, 'domSpp'] <- "white_spruce" #takes the 7th row of the domSpp column and makes it "white_spruce"
# Another shorter way
df[7, 1] <- 7 #takes the 7th row of the first column and makes it 7
df[7, 4] <- "white_spruce" #takes the 7th row of the 4th column and makes it "white_spruce"
# Even shorter-way
df[7, c(1, 4)] <- c(7, "white_spruce") #Takes the 7th row and the first and 4th column, which happen to be plot and domSpp.
# Then the first column gets 7, and the 4th gets "white spruce"
#df[is.na("domSpp")] <- "white_oak" # wrong: is.na() needs the column itself, not the column name in quotes
df$domSpp[is.na(df$domSpp)]<- "white_oak"
df$domSpp <- df$domSpp %>% replace_na("white_oak")
str(df) #still a char
df$plot <- as.numeric(df$plot)
str(df)
#--------------------
# Day 1 Mod 5
#--------------------
min(df$numtrees)
max(df$numtrees)
mean(df$numtrees)
range(df$numtrees) #gives min and max
summary(df)
df$numDead <- df$numtrees - df$numLive # base R version
df <- df %>% mutate(numDead = numtrees - numLive) # tidyverse version of the same operation.
# note if the 2nd line doesn't run, load the tidyverse library (library(tidyverse)) first.
#--------------------
# Day 1 Mod 6
#--------------------
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_example.csv")
install.packages("readxl") #only need to run once.
library(readxl)
bat_capture <- suppressWarnings(read_xlsx(path = "./data/example_data/ANP_BAT_MasterCaptureDB_1996-2019 20200811.xlsx",
sheet = "All Records"))
library(readxl)
bat_capture <- read_xlsx(path = "./data/example_data/ANP_BAT_MasterCaptureDB_1996-2019 20200811.xlsx",
sheet = "All Records")
head(bat_capture)
dat <- read.csv(file.choose())
# Filter to remove protected species
wetdat_pub <- wetdat[wetdat$Protected == FALSE, ]
nrow(wetdat) #508 rows in original dataset
nrow(wetdat_pub) #499 rows in filtered dataset
table(wetdat_pub$Protected) # Only Protected = FALSE
# Select only fields you want
wetdat_pub <- wetdat_pub[ , c("Site_Name", "Site_Type", "Latin_Name", "Year", "Ave_Cov",
"Invasive", "Protected")]
names(wetdat_pub)
# Previous steps can all happen in 1 line of code:
wetdat_pub <- wetdat[wetdat$Protected == FALSE,
c("Site_Name", "Site_Type", "Latin_Name", "Year", "Ave_Cov",
"Invasive", "Protected")]
nrow(wetdat_pub) # 499 rows
names(wetdat_pub) # Only includes fields we want
table(wetdat_pub$Protected) #All FALSE
# Another base R solution that's a bit easier to work with:
wetdat_pub2 <- subset(wetdat, Protected == FALSE,
select = c(Site_Name, Site_Type, Latin_Name, Year,
Ave_Cov, Invasive, Protected))
nrow(wetdat_pub2) # 499 rows
names(wetdat_pub2) # Only includes fields we want
table(wetdat_pub2$Protected) #All FALSE
wetdat_tidy <- wetdat %>% filter(Protected == FALSE) %>%
select(Site_Name, Site_Type, Latin_Name, Year, Ave_Cov, Invasive, Protected)
# Any column name not mentioned in the select is dropped
nrow(wetdat_tidy)
names(wetdat_tidy)
table(wetdat_tidy$Protected)
# Even more efficient way to code this:
wetdat_tidy <- wetdat %>% filter(Protected == FALSE) %>% select(-Common, -PctFreq)
# Base R
wetdat_2sites <- wetdat[wetdat$Site_Name %in% c("SEN-01", "SEN-03"), ]
nrow(wetdat_2sites)
table(wetdat_2sites$Site_Name)
# Tidyverse equivalent of same operation
wetdat_2tidy <- wetdat %>% filter(Site_Name %in% c("SEN-01", "SEN-03"))
nrow(wetdat_2tidy)
table(wetdat_2tidy$Site_Name)
head(wetdat) # check original order before sorting
# Base R- I always forget the exact code and have to look it up.
wetdat_sort <- wetdat[order(wetdat$Latin_Name, wetdat$Site_Name), ]
head(wetdat_sort)
# Sort in reverse
wetdat_rev <- wetdat[order(desc(wetdat$Latin_Name), desc(wetdat$Site_Name)), ] #desc() stands for descending
head(wetdat_rev)
# Another example with numbers
wetdat_rev2 <- wetdat[order(-wetdat$PctFreq), ]
# Note: desc() works on text and numbers; the minus sign only works on numbers
# Tidyverse version
wetdat_sort_tidy <- wetdat %>% arrange(Latin_Name, Site_Name)
head(wetdat_sort_tidy)
# Tidyverse reverse sort
wetdat_rev_tidy <- wetdat %>% arrange(desc(Latin_Name), desc(Site_Name))
head(wetdat_rev_tidy)
#--------------------
# Day 2 Mod 1
#--------------------
library(tidyverse)
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_example.csv")
head(wetdat) # shows the first 6 lines of the data frame, including the names of the columns
tail(wetdat[1:3]) # shows last 6 records of first 3 columns in the data frame
names(wetdat) # shows the column names of the data frame
table(wetdat$Site_Name, wetdat$Year) # distribution of sites across years
table(complete.cases(wetdat)) # FALSE counts the number of rows that have an NA in at least 1 column
dim(wetdat) # number of rows (first) and columns (second)
nrow(wetdat) # number of rows (ncol is # columns)
str(wetdat) # shows the type of data in each column
summary(wetdat) # summarize each field based on data type
View(wetdat)
wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov) #create new field that converts Ave_Cov to numeric
which(is.na(wetdat$Ave_Cov_Num))
wetdat[which(is.na(wetdat$Ave_Cov_Num)), ] # all columns with NA in Ave_Cov_num
wetdat[which(is.na(wetdat$Ave_Cov_Num)), "Ave_Cov"] # Ave_Cov only
wetdat$Ave_Cov[which(is.na(wetdat$Ave_Cov_Num))] # Another way to see Ave_Cov only
wetdat$Ave_Cov[wetdat$Ave_Cov == "<0.1"] <- "0.1"
#wetdat$Ave_Cov[wetdat$Ave_Cov == "<0.1"] <- NA
wetdat$Ave_Cov <- gsub("<", "", wetdat$Ave_Cov)
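# Quick toy check with a made-up vector: gsub() replaces every match in each element
gsub("<", "", c("<0.1", "5", "<2")) # returns "0.1" "5" "2"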
str(wetdat)
wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov)
str(wetdat)
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_example.csv")
wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov)
wetdat$Ave_Cov[which(is.na(wetdat$Ave_Cov_Num))] # Identify which values in Ave_Cov can't be converted to a number and what you need to do to fix them.
wetdat$Ave_Cov <- as.numeric(gsub("<", "", wetdat$Ave_Cov)) # replace < with nothing
wetdat <- subset(wetdat, select = -c(Ave_Cov_Num)) #now drop the temp column Ave_Cov_Num
str(wetdat) #Ave_Cov is now numeric
table(complete.cases(wetdat$Ave_Cov)) #checks that there aren't any more NAs.
# If there were NAs, you'd have a column named FALSE with the number of NAs below it.
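# Quick toy check with a made-up vector: complete.cases() flags entries without NAs
table(complete.cases(c(1, NA, 3))) # FALSE 1, TRUE 2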
range(wetdat$PctFreq, na.rm = TRUE) # 20 to 100- that looks fine.
range(wetdat$Ave_Cov, na.rm = TRUE) # 0.02 to 110.0- no bueno
wetdat[wetdat$Ave_Cov > 100, ] # If Ave_Cov is still a chr, this filter won't work
wetdat$Ave_Cov[wetdat$Ave_Cov > 100] <- 11
range(wetdat$Ave_Cov)
table(complete.cases(wetdat$Ave_Cov)) #all TRUE
table(complete.cases(wetdat$PctFreq)) #some FALSE
mean(wetdat$PctFreq)
mean(wetdat$PctFreq, na.rm = TRUE)
wetdat$PctFreq[is.na(wetdat$PctFreq)] <- 0
table(complete.cases(wetdat$PctFreq))
mean(wetdat$PctFreq)
table(wetdat$Site_Name)
unique(wetdat$Site_Name[which(nchar(wetdat$Site_Name) != 6)]) # site names that aren't 6 characters long
wetdat$Site_Name[which(!grepl("-", wetdat$Site_Name))]
wetdat$Site_Name[wetdat$Site_Name == "RAM44"] <- "RAM-44"
wetdat$Site_Name[wetdat$Site_Name == "SEN-3"] <- "SEN-03"
wetdat$Site_Name[which(!grepl("-", wetdat$Site_Name))] #character(0) means nothing returned
wetdat$Site_Name[which(nchar(wetdat$Site_Name)!=6)]
table(wetdat$Site_Name)
sort(unique(wetdat$Site_Name)) #another way to see all the unique values in a column
dim(wetdat) # reports # rows then # cols
nrow(wetdat) # reports # rows only
length(unique(wetdat$Site_Name)) # number of unique values in Site_Name
length(unique(wetdat$Year)) # number of unique years
table(wetdat$Site_Name, wetdat$Year) # distribution of sites across years
sort(unique(wetdat$Common))[1:10] # see all the common names in the dataset (truncated to first 10 common names for printing here).
#--------------------
# Day 2 Mod 3
#--------------------
tree <- data.frame(plot = c("p1","p1","p1","p2","p2","p3","p4","p4"),
treenum = c(1, 2, 3, 11, 12, 21, 32, 33),
species = c("ACERUB" ,"PINSTR", "PINSTR", "QUERUB", "PICRUB",
"PICRUB", "PICRUB", "ACERUB"))
tree
tree_data <- data.frame(plot = c("p1", "p1", "p1", "p1", "p1","p1", "p2", "p2", "p5", "p7"),
year = c(2012, 2012, 2012, 2016, 2016, 2016, 2012, 2012, 2016, 2016),
treenum = c(1, 2, 3, 1, 2, 3, 11, 12, 51, 71),
DBH = c(12.5, 14.5, 16.1, 12.8, 14.9, 17.2, 28.1, 35.4, 36.1, 45.2))
tree_data
tree_full_join <- full_join(tree, tree_data, by = c("plot", "treenum"))
tree_left_join <- left_join(tree, tree_data, by = c("plot", "treenum"))
nrow(tree_full_join)
nrow(tree_left_join)
print(tree_full_join)
print(tree_left_join)
library(tidyverse)
loc <- read.csv("./data/example_data/tbl_Locations_example.csv") # plot level data for forest plots in NETN
head(loc)
event <- read.csv("./data/example_data/tbl_Events_example.csv") # data on each visit in NETN forest plots
head(event)
stand <- read.csv("./data/example_data/tbl_Stand_Data_example.csv") # stand data collected in plots
head(stand)
View(loc) #Location_ID
View(event) #Location_ID links to tbl_Location; Event_ID links other visit-level tables to this table
View(stand) #Event_ID links stand table to event table.
intersect(names(loc), names(event)) # way to check columns two tables have in common.
intersect(names(event), names(stand))
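# Toy example with made-up vectors: intersect() returns the elements the two share
intersect(c("a", "b", "c"), c("b", "c", "d")) # returns "b" "c"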
# base R version
names(loc) # to help me pick the columns to include in the join
names(event)
loc_event_full <- merge(loc, event, by = "Location_ID", all.x = TRUE, all.y = TRUE) # base R full join
loc_event_full_tidy <- full_join(loc, event, by = "Location_ID") # tidyverse full join
View(loc_event_full) # Now you can see all the times a plot has been sampled. You'll also see that some plots
# don't have sample dates. That's because the loc table includes plots that were rejected and never sampled.
loc_event_left <- merge(loc, event, by = "Location_ID", all.x = TRUE, all.y = FALSE)
loc_event_left_tidy <- left_join(loc, event, by = "Location_ID")
nrow(loc_event_full) # same number of rows because all events have a location record
nrow(loc_event_left) # same number of rows because all events have a location record
# tidyverse version
loc_event_full_tidy <- full_join(loc, event, by = "Location_ID")
nrow(loc_event_full_tidy) # same number of rows because all events have a location record
loc_event_left_tidy <- left_join(loc, event, by = "Location_ID")
nrow(loc_event_left_tidy) # same number of rows because all events have a location record
# Base R approach
locev <- merge(loc[loc$Park_Code == "ACAD", ],
event[event$Event_QAQC == FALSE, ],
all.x = TRUE, all.y = FALSE)
# Tidyverse approach
loc_ACAD <- loc %>% filter(Park_Code == "ACAD")
locev_tidy <- event %>% filter(Event_QAQC == FALSE) %>%
left_join(loc_ACAD, ., by = "Location_ID")
nrow(locev)
nrow(locev_tidy)
View(locev)
View(locev_tidy)
#--------------------
# Day 2 Mod 4
#--------------------
wetdat <- read.csv('./data/example_data/ACAD_wetland_data_clean.csv')
head(wetdat)
wetsum <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>%
group_by(Site_Name, Year) %>%
summarize(numspp = sum(present),
.groups = 'drop')
head(wetdat)
wetdat <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0))
wetdat_sum <- wetdat %>% group_by(Site_Name, Year) %>%
summarize(numspp = sum(present),
.groups = 'drop')
wetsum_opt1 <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>%
group_by(Site_Name, Site_Type, Year) %>%
summarize(numspp = sum(present),
.groups = 'drop')
wetsum_opt2 <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>%
group_by(Site_Name, Year) %>%
summarize(Site_Type = first(Site_Type),
numspp = sum(present),
.groups = 'drop')
wetsum_pre <- wetdat %>% filter(Site_Type == "RAM") %>%
mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>%
group_by(Site_Name, Site_Type, Year) %>%
summarize(numspp = sum(present),
.groups = 'drop')
wetsum_yr <- wetsum_pre %>% group_by(Year) %>%
summarize(mean_numspp = mean(numspp),
numplots = n(),
se_numspp = sd(numspp)/sqrt(numplots),
.groups = 'drop')
wetsum_yr
#--------------------
# Day 2 Mod 5
#--------------------
write.csv(wetdat, "./data/ACAD_wetland_data2.csv")
file.remove("./data/ACAD_wetland_data2.csv")
#--------------------
# Day 3 Mod 1
#--------------------
library(tidyverse)
bat_long <- read.csv("./data/example_data/fake_bats.csv")
head(bat_long)
bat_long <- bat_long %>% mutate(sppcode = toupper(paste0(substr(word(Latin, 1), 1, 3),
substr(word(Latin, 2), 1, 3))))
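# Walking through one made-up name to show how sppcode gets built:
word("Myotis septentrionalis", 1) # "Myotis"
substr(word("Myotis septentrionalis", 2), 1, 3) # "sep"
toupper(paste0("Myo", "sep")) # "MYOSEP"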
bat_long_sum <- bat_long %>% group_by(Site, Year, sppcode) %>%
summarize(numindiv = sum(!is.na(sppcode)),
.groups = 'drop')
head(bat_long_sum)
bat_wide <- bat_long_sum %>% spread(key = "sppcode", value = "numindiv") #retired spread function
bat_wide2 <- bat_long_sum %>% pivot_wider(names_from = "sppcode", values_from = "numindiv") #replacement in tidyr
bat_wide
bat_wide_fill <- bat_long_sum %>% spread(key = "sppcode", value = "numindiv", fill = 0) #retired spread function
head(bat_wide_fill)
bat_wide2_fill <- bat_long_sum %>% pivot_wider(names_from = "sppcode", values_from = "numindiv",
values_fill = 0) #replacement in tidyr
bat_long2 <- bat_wide_fill %>% gather(key = "sppcode", value = "numindiv", -Site, -Year) # old way
head(bat_long2)
names(bat_wide_fill)
bat_long2b <- bat_wide_fill %>% pivot_longer(cols=c(LASCIN, MYOLEI, MYOLUC, MYOSEP),
names_to = "sppcode",
values_to = "numindiv") #new way
head(bat_long2b)
nrow(bat_long_sum) # dataset before made wide, then long
nrow(bat_long2) # dataset that has every species represented for every year and every site
#First figure out how many levels you expect
with(bat_long, length(unique(Site)) * length(unique(Year)) * length(unique(sppcode))) #84
bat_all_comb <- bat_long_sum %>% expand(Site, Year, sppcode)
nrow(bat_all_comb) #84
bat_long_comb <- left_join(bat_all_comb, bat_long_sum, by = c("Site", "Year", "sppcode"))
bat_long_comb$numindiv[is.na(bat_long_comb$numindiv)] <- 0 # replace NA with 0
nrow(bat_long2) == nrow(bat_long_comb) #TRUE
head(bat_long_sum)
bat_wide2 <- bat_long_sum %>% pivot_wider(names_from = c(sppcode, Year),
values_from = numindiv,
values_fill = 0)
head(bat_wide2)
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
head(wetdat)
# Prepare data for pivot
wetdat_prep <- wetdat %>% mutate(sppcode = toupper(paste0(substr(word(Latin_Name, 1), 1, 3),
substr(word(Latin_Name, 2), 1, 3))),
present = ifelse(Ave_Cov>0, 1, 0)) %>%
filter(Year == 2017)
# Pivot wide, so every species has a column
wetdat_wide <- wetdat_prep %>% select(Site_Name, Year, sppcode, present) %>%
pivot_wider(names_from = "sppcode", values_from = "present",
values_fill = 0)
head(wetdat_wide)
# Pivot long, so every site has a species with 1 or 0 to denote present/absent
wetdat_long <- wetdat_wide %>% pivot_longer(cols = c(-Site_Name, -Year), names_to = "sppcode",
values_to = "present") %>%
arrange(Site_Name, sppcode)
# Now we need to filter the data to only include invasive species
# First make a list of the invasive species
inv_list <- unique(wetdat_prep$sppcode[wetdat_prep$Invasive == TRUE])
inv_list
#Now filter on that invasive list
wetdat_long_inv <- wetdat_long %>% filter(sppcode %in% inv_list)
# Now summarize to count the percent of plots each species was found in
wetdat_sum <- wetdat_long_inv %>% group_by(sppcode) %>%
summarize(pct_sites = sum(present)/n()*100,
.groups = 'keep')
wetdat_sum
#--------------------
# Day 3 Mod 2
#--------------------
library(forestNETNarch)
library(tidyverse)
importCSV("./data/NETN_forest_csvs")
acad_regen <- joinRegenData(park = "ACAD", from = 2016, to = 2019, canopyForm = "canopy")
head(acad_regen) # this dataframe includes species, so there are multiple rows per plot.
# We need to group_by Plot_Name, and sum seedling densities to get plot-level seedling densities.
acad_regen2 <- acad_regen %>% group_by(Plot_Name) %>%
summarize(seed15_30 = sum(seed15.30),
seed30_100 = sum(seed30.100),
seed100_150 = sum(seed100.150),
seed150p = sum(seed150p),
sapling = sum(sap.den),
.groups = 'drop') %>%
arrange(Plot_Name)
# This is the shortest way to get a dataset that has a mean seedling density for each size class and a se for each size class.
# Note we want to make this long, so every plot has each of the size classes
acad_regen_long <- acad_regen2 %>% pivot_longer(-Plot_Name, names_to = "size_class", values_to = "density")
# Now we calculate the mean and se for each size class at the park level
acad_stats <- acad_regen_long %>% group_by(size_class) %>%
summarize(mean_dens = mean(density),
se_dens = sd(density)/sqrt(n()),
.groups = 'drop')
# We have our dataset to plot.
acad_base <- ggplot(data = acad_stats, aes(x = size_class, y = mean_dens))
# Make a point plot
acad_base + geom_point()
# Make a bar chart
acad_bar <- acad_base + geom_bar(stat = 'identity')
acad_bar
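# Note: geom_col() is shorthand for geom_bar(stat = 'identity'),
# so acad_base + geom_col() would draw the same chart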
# Add error bars to the previous bar chart
acad_bar + geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
labs(x = "Regeneration Size Class", y = "Stems per sq.m")
acad_stats$size_class_fact <- ordered(acad_stats$size_class,
levels = c('seed15_30', 'seed30_100', 'seed100_150', 'seed150p', 'sapling'))
acad_bar2 <- ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
geom_bar(stat = 'identity')+
geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
labs(x = "Regeneration Size Class", y = "Stems per sq.m")
acad_bar2
acad_bar2 + theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
acad_bar2+theme_FHM()
acad_final <- ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
geom_bar(stat = 'identity', fill = "#5E79A8")+
geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
labs(x = "Regeneration Size Class", y = "Stems per sq.m")+
theme_FHM()
acad_final
library(NCRNWater)
netnwd<-importNCRNWater(Dir = "./data/NETN_water_data",
Data = "Water Data.csv",
MetaData = "VizMetaData.csv")
# Create data frame that lists the sites and their type using the getSiteInfo getter.
acad_all_sites <- data.frame(sitecode = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteCode"),
type = getSiteInfo(netnwd, parkcode = "ACAD", info = "type"),
fullname = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteName"))
head(acad_all_sites)
# Filter data frame to only include lakes
acad_lakes <- acad_all_sites %>% filter(type == "Lake") #list of site codes and full names for ACAD lakes
acad_lakes
# Compile the water pH data using getWData function
acad_ph <- getWData(netnwd, parkcode = "ACAD", years = 2006:2019, charname = 'pH')
# Filter water data to only include lakes
acad_lakes_ph <- acad_ph %>% filter(Site %in% acad_lakes$sitecode)
table(acad_lakes_ph$Site) # 8 sites with >90 measurements
# Turn the table above into a dataframe, and use it to filter the water data on.
annual_lakes <- data.frame(table(acad_lakes_ph$Site)) %>% filter(Freq > 90)
# Filter acad_lakes_pH to only include the annual lakes and add column for month and year
acad_ph2 <- acad_lakes_ph %>% filter(Site %in% annual_lakes$Var1) %>%
mutate(month = format(Date, "%m"),
year = format(Date, "%Y"))
head(acad_ph2)
acad_ph_final <- left_join(acad_ph2, acad_lakes[, c("sitecode", "fullname")],
by= c("Site" = "sitecode")) # how to join if columns have diff. names
head(acad_ph_final) #final dataset to plot
ph_plot <- ggplot(acad_ph_final, aes(x = Date, y = Value, color = Site, shape = Site))+
geom_point()+
labs(y = "pH")+
theme_bw()
ph_plot
ph_plot_site <- ggplot(acad_ph_final, aes(x = Date, y = Value))+
geom_point(shape = 16, color = "LightBlue")+
labs(y = "pH")+
theme_bw()+
facet_wrap(~fullname)
ph_plot_site
ph_plot_site_smooth <- ph_plot_site + geom_smooth(method = 'loess', span = 1, color = "RoyalBlue", se = FALSE)
# vary the span to increase/decrease the waviness of the smooth
ph_plot_site_smooth + geom_smooth(method = 'loess', span = 0.1, color = "red", se = FALSE)+
geom_smooth(method = 'loess', span = 0.5, color = 'green', se = FALSE)
ph_plot_site <- ggplot(acad_ph_final, aes(x = Date, y = Value))+
geom_point(shape = 16, color = "LightBlue")+
labs(y = "pH")+
theme_bw()+
facet_wrap(~fullname)+
geom_smooth(method = 'loess', span = 0.5, color = 'RoyalBlue', se = FALSE)
ph_plot_site2 <- ph_plot_site + scale_x_date(date_breaks = "2 years",
date_labels = "%Y")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ph_plot_site2
#--------------------
# Day 3 Mod 3
#--------------------
library(tidyverse)
library(rgdal) # for GIS functions
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
table(wetdat$Site_Name, wetdat$Year) # SEN most recent is 2011, RAM is 2017
wetdat_rich <- wetdat %>% filter(Year %in% c(2011, 2017)) %>%
mutate(present = ifelse(!is.na(Ave_Cov) & Ave_Cov > 0, 1, 0)) %>%
group_by(Site_Name, Year, Site_Type, X_Coord, Y_Coord) %>%
summarize(numspp = sum(present),
.groups = 'drop')
head(wetdat_rich)
head(wetdat_rich)
# Step 1: Create spatial object by specifying the coordinates
coordinates(wetdat_rich) <- ~X_Coord + Y_Coord
head(wetdat_rich)
plot(wetdat_rich) # These points now plot in space (bottom left point is on IAH)
# Step 2: Define the coordinate system
proj4string(wetdat_rich) <- CRS("+init=epsg:26919")
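# EPSG 26919 is the NAD83 / UTM Zone 19N coordinate system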
# Step 3: Write the spatial object to a shapefile
writeOGR(wetdat_rich, dsn = "./shapefiles", layer = "wetdat_rich", driver = "ESRI Shapefile",
overwrite_layer = TRUE) # overwrite is in case you need to update your shapefile.
# Load packages
library(rgdal) # for readOGR
library(ggplot2) #for ggplot functions
library(ggspatial) #to make ggplot functions spatial
# Load shapefiles
bounds <- readOGR("./data/example_data/ParkBndCARTO_Fee_2020.shp", layer = "ParkBndCARTO_Fee_2020",
GDAL1_integer64_policy = TRUE) # Don't ask me why this is here. All I know is it works.
head(bounds@data) # How you view the attribute table of a shapefile in R
wetmap <- readOGR("./data/example_data/ACAD_vegmap_wetlands.shp", layer = "ACAD_vegmap_wetlands",
GDAL1_integer64_policy = TRUE)
wet_rich_shp <- readOGR("./shapefiles/wetdat_rich.shp", layer = "wetdat_rich",
GDAL1_integer64_policy = TRUE)
head(wet_rich_shp@data)
head(wet_rich_shp@coords) # Where the coordinates are stored
plot(bounds)
# Prepare shapefiles for plotting
bounds2 <- fortify(bounds) # converts the shapefile into a data frame that ggplot can plot
str(bounds2) # note the data frame structure that describes the boundary data
wetmap2 <- fortify(wetmap)
str(wetmap2)
# Turn the wetland richness shapefile back into a dataset
# One reason to do this in R is that you never had to make this a shapefile to plot it
wet_rich_df <- cbind(as.data.frame(wet_rich_shp@data),
long = wet_rich_shp@coords[,1],
lat = wet_rich_shp@coords[,2])
ggmap <- ggplot()+
coord_sf(crs = 26919)+ # set coordinate system to NAD83 UTM Zone 19N using EPSG code
geom_polypath(data = bounds2, aes(x = long, y = lat, group = group),
fill = "grey", alpha = 0.4)+
geom_polypath(data = wetmap2, aes(x = long, y = lat, group = group),
fill = "#80A8D1")+
geom_point(data = wet_rich_df,
aes(x = long,
y = lat,
color = Site_Type),
shape = 17)+
theme_minimal() # because I don't like the grey grid in ggplot's default theme
ggmap
wet_rich_dd <- spTransform(wet_rich_shp, CRS="+proj=longlat +datum=WGS84")
plot(wet_rich_dd) #check that they still plot spatially
library(leaflet)
park_tile_basic = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"
map_leaf <- leaflet() %>%
setView(lng = -68.312,
lat = 44.25,
zoom = 10) %>%
addTiles(urlTemplate = park_tile_basic) %>%
addCircleMarkers(data = wet_rich_dd,
radius = 4,
lng = wet_rich_dd@coords[,1],
lat = wet_rich_dd@coords[,2],
label = wet_rich_dd@data$Site_Name,
labelOptions = labelOptions(noHide=T, textOnly = TRUE,
direction = 'bottom', textsize = "12px"),
fillColor = "#4C9B4F",
fillOpacity = 1,
stroke = FALSE) %>%
addScaleBar(position = 'bottomright',
            options = scaleBarOptions(maxWidth = 10, metric = TRUE)) # scaleBarOptions() is passed to addScaleBar(), not piped
map_leaf
#--------------------
# Day 1 Answers
#--------------------
# If starting from scratch
library(tidyverse)
df <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 'plot7'), #column for plot number
numtrees = c(24, 18, 17, 14, 31, 27, 15),
numLive = c(21, 18, 16, 10, 29, 26, 14),
domSpp = c('red_maple', 'red_spruce',
'red_spruce','white_cedar','white_pine',
'white_spruce', NA),
invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))
df[3, c("domSpp")] # option 1
df[3, 4] # option 2
df$numtrees <- as.integer(df$numtrees)
str(df)
df$invasive_pres[2] <- TRUE
head(df)
# base R version (2 options)
df$numDead <- df$numtrees - df$numLive
df$pctDead <- df$numDead/df$numtrees*100
df$pctDead <- with(df, (numtrees - numLive)/numtrees*100)
# tidyverse version (2 options)
df <- df %>% mutate(numDead = numtrees - numLive,
pctDead = numDead/numtrees*100)
df <- df %>% mutate(pctDead = (numtrees - numLive)/numtrees*100)
min(df$numDead) # uses the numDead column calculated for the previous question.
summary(df$numDead)[1] #Min. is the first record that summary reports for numeric columns
# If starting from scratch
wetdat<-read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
wetinv <- wetdat[wetdat$Invasive == TRUE, ] # base R bracket-based solution
wetinv2 <- subset(wetdat, Invasive == TRUE) # another base R solution
wetinv_tidy <- wetdat %>% filter(Invasive == TRUE) # tidyverse solution
# Base R solution
wetspp <- wetdat[wetdat$Latin_Name %in% c("Arethusa bulbosa", "Carex exilis"),]
sort(unique(wetspp$Site_Name))
# Tidyverse solution
wetspp_tidy <- wetdat %>% filter(Latin_Name %in% c("Arethusa bulbosa", "Carex exilis"))
sort(unique(wetspp_tidy$Site_Name)) # RAM-05 and SEN-01 have both species present
wetdat_sort <- wetdat[order(-wetdat$Ave_Cov, wetdat$Latin_Name), ] # base R
head(wetdat_sort)
wetdat_sort_tidy <- wetdat %>% arrange(-Ave_Cov, Latin_Name) #tidyverse
head(wetdat_sort_tidy)
#--------------------
# Day 2 Answers
#--------------------
sort(unique(wetdat$Latin_Name))[1:10] # Low-tech way: use your eyes to spot the "Alnus incana++". I truncated to first 10.
# The regular expression way to find non-alphabetical symbols (now you know why I didn't cover this).
wetdat$Latin_Name[grep("[^[:alpha:]]+$", wetdat$Latin_Name)] # finds records that end with non-letter characters
# In the pattern, [^[:alpha:]] matches any character that is not a letter, + means one or more of them,
# and $ anchors the match to the end of the record.
# This took longer than I'd like to admit to make work, by the way.
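# Toy check on a made-up vector: only the entry ending in non-letters matches
grep("[^[:alpha:]]+$", c("Alnus incana", "Alnus incana++"), value = TRUE) # "Alnus incana++"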
wetdat$Latin_Name[wetdat$Latin_Name == "Alnus incana++"] <- "Alnus incana" # easy approach
wetdat$Latin_Name <- gsub("[+]", "", wetdat$Latin_Name) # gsub approach
wetdat$Latin_Name[grep("[^[:alpha:]]+$", wetdat$Latin_Name)] # no values returned
sort(unique(wetdat$Latin_Name))[1:10] # using your eyes approach
# In case you need to load the tables
loc <- read.csv("./data/example_data/tbl_Locations_example.csv")
event <- read.csv("./data/example_data/tbl_Events_example.csv")
stand <- read.csv("./data/example_data/tbl_Stand_Data_example.csv")
# Filter then join
loc_MDI <- loc %>% filter(Unit_ID == "ACAD_MDI")
locev_MDI <- left_join(loc_MDI, event, by = "Location_ID")
# All in one via pipes:
locev_MDI2 <- loc %>% filter(Unit_ID == "ACAD_MDI") %>%
left_join(event, by = "Location_ID")
locev <- merge(loc, event[event$Event_QAQC == FALSE, ], all.x = TRUE, all.y = TRUE)
intersect(names(locev), names(stand))
locev_stand <- merge(locev, stand, by = "Event_ID", all.x = TRUE, all.y = TRUE) #base R full join
locev_stand_tidy <- full_join(locev, stand, by = "Event_ID") #tidyverse full join
head(wetdat)
wetmin_max <- wetdat %>% mutate(present = ifelse(Ave_Cov >0 & !is.na(Ave_Cov), 1, 0)) %>%
group_by(Site_Name, Year) %>%
summarize(numspp = sum(present),
.groups = "drop") %>%
group_by(Year) %>%
summarize(minspp = min(numspp),
maxspp = max(numspp),
.groups = "drop")
wetmin_max
#head(wetdat)
wetinv1 <- wetdat %>% filter(Site_Type == "RAM") %>%
group_by(Site_Name, Year) %>%
summarize(num_invspp = sum(Invasive),
plot_inv = ifelse(num_invspp > 0, 1, 0),
.groups = "drop")
head(wetinv1)
# Note: plot_inv scores each plot as 1 or 0, even when multiple invasive species were found on the plot.
wetinv <- wetinv1 %>% group_by(Year) %>%
summarize(tot_inv_plots = (sum(plot_inv)/n())*100,
.groups = "drop")
wetinv
#--------------------
# Day 3 Answers
#--------------------
# In case you're starting from scratch
library(tidyverse)
bat_long <- read.csv("./data/example_data/fake_bats.csv")
bat_long <- bat_long %>% mutate(sppcode = toupper(paste0(substr(word(Latin, 1), 1, 3),
substr(word(Latin, 2), 1, 3))))
MYOSEP_sum <- bat_long %>% filter(sppcode == "MYOSEP") %>%
group_by(Site, Year) %>%
summarize(num_bats = sum(ifelse(!is.na(Latin), 1, 0)),
.groups = 'drop')
head(MYOSEP_sum) # Note only the sites and years where MYOSEP was found are listed. The next lines add the missing site/year combinations back before pivoting
bat_site_year <- bat_long %>% expand(Site, Year)
MYOSEP_comp <- bat_site_year %>% left_join(., MYOSEP_sum, by = c("Site", "Year"))
MYOSEP_comp$num_bats[is.na(MYOSEP_comp$num_bats)] <- 0 # Replace all the NAs created by left join with 0
MYOSEP_wide <- MYOSEP_comp %>% arrange(Site, Year) %>%
pivot_wider(names_from = Year, values_from = num_bats, values_fill = 0)
MYOSEP_wide
# If you're starting from scratch
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
wetdat_prep <- wetdat %>% mutate(sppcode = toupper(paste0(substr(word(Latin_Name, 1), 1, 3),
substr(word(Latin_Name, 2), 1, 3))))
wetdat_dist <- wetdat_prep %>% distinct(Site_Name, Year) # data frame of the site/year combinations that occur in the data; unlike expand(), it doesn't add unsampled ones
wetdat_wide <- wetdat_prep %>% filter(Protected == TRUE) %>%
select(Site_Name, Year, sppcode, Ave_Cov) %>%
pivot_wider(names_from = sppcode, values_from = Ave_Cov, values_fill = 0) %>%
left_join(wetdat_dist, ., by = c("Site_Name", "Year"))
wetdat_wide[, c("AREBUL", "POGOPH", "CALTUB")][is.na(wetdat_wide[, c("AREBUL", "POGOPH", "CALTUB")])] <- 0
# Starting from scratch
library(forestNETNarch)
importCSV("./data/NETN_forest_csvs")
acad_exo <- makeSppList(park = "ACAD", from = 2019, to = 2019, speciesType = "exotic")
acad_exo <- acad_exo %>% mutate(present = ifelse(Exotic == TRUE & !is.na(Exotic), 1, 0))
head(acad_exo)
# Now to answer the question
# Old school spread version
acad_wide_spread <- acad_exo %>% select(Plot_Name, Latin_Name, present) %>%
arrange(Latin_Name, Plot_Name) %>%
spread("Latin_Name", "present", fill = 0)
# Newer pivot version
acad_wide_pivot <- acad_exo %>% select(Plot_Name, Latin_Name, present) %>%
arrange(Latin_Name, Plot_Name) %>%
pivot_wider(names_from = "Latin_Name",
values_from = "present",
values_fill = 0)
head(acad_wide_pivot)
# Starting from scratch
library(forestNETN)
importCSV("./data/NETN_forest_csvs")
acad_regen <- joinRegenData(park = "ACAD", from = 2016, to = 2019, canopyForm = "canopy")
acad_regen2 <- acad_regen %>% group_by(Plot_Name) %>%
summarize(seed15_30 = sum(seed15.30),
seed30_100 = sum(seed30.100),
seed100_150 = sum(seed100.150),
seed150p = sum(seed150p),
sapling = sum(sap.den),
.groups = 'drop') %>%
arrange(Plot_Name)
acad_regen_long <- acad_regen2 %>% pivot_longer(-Plot_Name,
names_to = "size_class",
values_to = "density")
# Now we calculate the mean and se for each size class at the park level
acad_stats <- acad_regen_long %>% group_by(size_class) %>%
summarize(mean_dens = mean(density),
se_dens = sd(density)/sqrt(n()),
.groups = 'drop')
acad_stats$size_class_fact <- ordered(acad_stats$size_class,
levels = c('seed15_30', 'seed30_100',
'seed100_150', 'seed150p', 'sapling'))
# Now to answer the question
ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
geom_bar(stat = 'identity', fill = "steelblue4")+
geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens),
width = 0.2, lwd = 0.7)+
labs(x = "Regeneration Size Class", y = "Stems per sq.m")+
theme_FHM()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+ # hjust = 1 right-justifies the angled labels so they line up with the tick marks
scale_x_discrete(labels = c('15-30cm', '30-100cm', '100-150cm', '>150cm', '1-10cm DBH'))
# Starting from scratch
library(NCRNWater)
netnwd<-importNCRNWater(Dir = "./data/NETN_water_data",
Data = "Water Data.csv",
MetaData = "VizMetaData.csv")
acad_all_sites <- data.frame(sitecode = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteCode"),
type = getSiteInfo(netnwd, parkcode = "ACAD", info = "type"),
fullname = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteName"))
acad_lakes <- acad_all_sites %>% filter(type == "Lake") #list of site codes and full names for ACAD lakes
acad_ph <- getWData(netnwd, parkcode = "ACAD", years = 2006:2019, charname = 'pH')
acad_lakes_ph <- acad_ph %>% filter(Site %in% acad_lakes$sitecode)
annual_lakes <- data.frame(table(acad_lakes_ph$Site)) %>% filter(Freq > 90)
acad_ph2 <- acad_lakes_ph %>% filter(Site %in% annual_lakes$Var1) %>%
mutate(month = format(Date, "%m"),
year = format(Date, "%Y"))
acad_ph_final <- left_join(acad_ph2, acad_lakes[, c("sitecode", "fullname")],
by= c("Site" = "sitecode")) # how to join if columns have diff. names
#head(acad_ph_final)
# Now to answer the question
acad_ph_2019 <- acad_ph_final %>% filter(year == 2019) %>% mutate(mon_abr = format(Date, "%b"))
plot19 <- ggplot(acad_ph_2019, aes(x = Date, y = Value))+
geom_point()+
labs(y = "pH")+
geom_smooth(method = 'loess', span = 1, se = FALSE)+
scale_x_date(date_breaks = "1 month",
date_labels = "%b")+ #%b is months names abbreviated
facet_wrap(~fullname, ncol = 4)+
theme_bw()
plot19