ACAD R Training: November 3 – 5, 2020

Prep for Training

Installing required software

The only thing you really need to do to prepare for R training is to install the latest version of R and RStudio. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed. Directions for installing R/RStudio are below. If you run into any problems, check the instructions in R for Data Science Sections 1.4.1 and 1.4.2.

NOTE: If you don’t have Administrative privileges on your computer, you will have to submit an IT HelpDesk Ticket (need to be on VPN or NPS network to access this link) to install R and RStudio, which could take a while. PLEASE PLAN AHEAD!

Even if you already have R or RStudio installed, please install the latest versions of both programs. R recently went through a major version change from 3.x to 4.x with some potential code-breaking changes. The latest versions are needed to ensure everyone’s code behaves the same way.

Install the latest version of R (4.0.4 as of March 2021) Assuming you’re on a PC, the installer for the latest version of R can be downloaded from The Comprehensive R Archive Network (CRAN). You’ll want to download the R installer by clicking on “Download R 4.0.4 for Windows”, or whatever the latest version number happens to be. After it downloads, open and run the installer with default settings. Note that there’s no need to pin R to your taskbar or add a shortcut to your desktop. For 99% of your needs, you’ll run R within RStudio, and won’t ever open R directly.


Install RStudio The installer for RStudio for PCs can be downloaded here. You’ll need to click on the large blue “DOWNLOAD RSTUDIO FOR WINDOWS” button to download the installer. After it downloads, open and run the installer with default settings. I like having RStudio pinned to my taskbar, so it’s easier to find/open, but it’s up to you and your preferences.


Install important libraries One of the great things about R is that users can build and share their own packages (AKA libraries) to add functionality to R. It’s a similar idea to extensions in ArcGIS. To use other libraries, you first need to download/install them, and then you need to turn them on in your R session. Some of the packages take a while to download. To save time during training, please try to install these prior to the first day. You can do that by opening RStudio, pasting the code below into the window on the left (the console), and pressing Enter.

install.packages(c("tidyverse", "devtools", "rgdal", "ggspatial", "leaflet"))
devtools::install_github("katemmiller/forestNETNarch")
devtools::install_github("NCRN/NCRNWater")

Next check that this worked by pasting and running the code below:

library(tidyverse)
library(forestNETNarch)
library(NCRNWater)

If you run into any problems, we can troubleshoot on the first day. We’ll also discuss what the code above means then.


Join MS Team and download data I set up an MS Team called ACAD_R_Training_Info to serve up the data and to give us a place to post questions/info throughout the training. The link to the team is here. Please make sure you can join this team before training starts, and let me know if you have problems.

We’ll be using several example datasets throughout the training. It will help if you can download the data prior to the first day, especially if your connection is slow. To download the data, click on this data.zip link (also posted in the files channel). We’ll set up file structure and unzip the data files on the first day.


Optional Reading I know you’re all busy and have to prioritize how you spend your time. If you have any time before training starts, I highly recommend reading Chapter 2: R basics and workflows in STAT545. STAT545 is an online book based on a graduate-level statistics class that Jenny Bryan teaches and is turning into this book. She’s a stats professor turned senior programmer at RStudio, and I really like how she approaches programming in R in this book.


Structure of training The training will take place over 3 half days. Each day will run from 8-noon via MS Teams, with afternoon assignments to help cement ideas and prepare for the next day’s training. Please make an effort to at least skim the assignments and fill out the Feedback form at the end of each day.

For most of the training, I’ll share my screen to walk through the code. It will help a lot if you have 2 monitors, so you can have the screen I’m sharing on one monitor and your own session of RStudio open on the other. If you need an extra monitor, let me know (I recently upgraded and have some older (non-gov’t) monitors that are looking for a new home).

Three days is barely enough to scratch the surface on what you can do in R. My goals with this training are to help you get past the initial learning curve that can be really challenging to climb on your own, and to expose you to some of the really useful things R can do for you. Hopefully this also helps build a community of R users in ACAD who can help and share code with each other.


Day 1

Intro to R & RStudio

Introduction to R and RStudio R is a programming language and free software environment that was developed by statisticians for statistical computing and graphics. It has since grown to fill many needs beyond statistics and graphing, but is still one of the leading programming languages for non-ML statistics and data visualization. For more information on the history of R, visit the R Project website.

RStudio is an integrated development environment (IDE) that is a shell around the R program. RStudio makes programming in the R language much easier by color coding different types of code, checking code as you type, and providing many useful tools with the push of a button or keystroke (e.g., viewing help info). When you open RStudio, you typically see 4 panes:

[Screenshot: the four default panes of the RStudio IDE]

  • Source: This is typically the top left pane, and is where you can load and write scripts. In the photo above, we have a bat_data_wrangling.R script loaded. Note that if you haven’t yet loaded or started a new script, you won’t see this pane until you do. When you’re ready to run the scripts you can either highlight a line or lines and click “Run” or press Ctrl+Enter. That essentially sends the code to R (which lives in the console directly below), and you’ll typically see some form of output in the console.
    • RStudio color codes functions, datasets, quoted strings, numbers, comments, etc. differently to make code easier to read. You can customize the layout and appearance of your code by clicking on Tools > Global Options > Appearance. For example, I prefer a dark background, and use the Cobalt Editor Theme. Using that theme, comments are blue, numbers are magenta, quoted strings are green, etc.
    • RStudio also checks your code and updates the colors as you go. If, for example, you have an unclosed pair of parentheses in your code, you will see Xs to the left of the line number and sometimes a little squiggle under a word. If you miss a closing quote, RStudio will color all of the following code as a quoted string (green in my case) until you close the quote. This makes it easier to spot where the closing quote is missing.
  • Console: The pane in the bottom left is essentially where R lives. When you first open RStudio, the console will tell you the version of R that you’re running under the hood (should be R 4.0.2 – “Taking Off Again” or R 4.0.3 – “Bunny-Wunnies Freak Out” for our training). You can type code directly in the console, or you can write scripts and send the code to the console. Note that the console color codes errors where code failed differently than warnings or successful runs of code.
  • Environment Window: This pane in the top right shows you the objects loaded in your environment, like datasets, that are accessible to you by name. You can also click on objects and view them.
    • The history tab in this pane shows the code you’ve run in the current session, and can be a way to recover lines you ran but deleted/overwrote and later realized you needed in your script. I rarely go here, but it’s really saved me before!
    • The Tutorial tab is relatively new. I haven’t used it much, but is worth checking out.
  • Workspace: The pane in the bottom right has multiple useful tabs.
    • Files: This tab shows the files within your working directory. You don’t need to type a full file path to access these files.
    • Plots: This tab will show plots that you create.
    • Packages: This tab allows you to turn packages on/off, and also view the help files within each package. You can also do both with code.
    • Help: This tab allows you to search for help within packages that have been installed on your computer.

Useful keystrokes Once you get in the swing of coding, you’ll find that minimizing the number of times you have to use your mouse will help you code faster. RStudio has done a great job creating lots of really useful keyboard shortcuts designed to keep your hands on the keyboard instead of having to click through menus. One way to see all of the shortcuts RStudio has built in is to press Alt+Shift+K. A window should appear with a bunch of shortcuts listed. These are also listed on RStudio’s IDE Cheat Sheet. The shortcuts I use the most often are listed below:
  • Undo: Ctrl+Z
  • Redo: Ctrl+Y
  • Run highlighted code: Ctrl+Enter
  • Restart R session: Ctrl+Shift+F10
  • Move cursor to Source pane: Ctrl+1
  • Move cursor to Console pane: Ctrl+2
  • Insert pipe (%>%): Ctrl+Shift+M
  • Insert assignment arrow (<-): Alt+-
  • Zoom in to make text bigger: Ctrl+Shift+'+'
  • Zoom out: Ctrl+'-'
  • Clear console: Ctrl+L
  • Duplicate line of code: Ctrl+Shift+D
  • Move line of code up or down: Alt+Up/Down arrow

Global Options There are several settings in the Global Options that everyone should check to make sure we all have consistent settings. To check these settings, go to Tools > Global Options and follow the steps below.

  1. Under the General tab, you should see that your R version is [64-bit] R-4.0.2 or R-4.0.3. If it’s not, change it to one of those.
  2. Make sure you are not saving your history. When R saves your history and workspace, you start each session with the same workspace (including data loaded in the global environment), which seems like a good thing. However, the whole point of using R is that your code should return the same results every time you run it. Clearing your history every time you close RStudio forces you to test that your code still returns the same results. If you happen to be working with huge datasets that take awhile to load and process, saving your history and .RData may be useful. Otherwise, clear your history by default by making sure Always save history (even when not saving .RData) is not checked.
  3. Also make sure Restore .RData into workspace at startup is unchecked. If you ever need to load a saved workspace, you can run one line of code instead: load("filename.RData").
  4. Most other settings are whatever you prefer, including everything in the Appearance window.

Loading dependencies While base R (everything included in the installation from CRAN) comes with a lot of useful functions, there are also a lot of great packages (aka libraries) out there that make data wrangling, analysis, and plotting easier and more efficient. The packages we will rely on the most during this training are:
  • dplyr: data wrangling functions including filtering (subsetting), arranging, and summarizing
  • tidyr: reshaping data from long to wide and vice versa
  • ggplot2: the most widely used package for plotting in R
  • rgdal: to set coordinate systems and to load or create shapefiles
  • devtools: required to install packages from GitHub, including the ones listed below.

Note that dplyr, ggplot2, and tidyr are part of the tidyverse, which is a collection of really useful packages that are designed to work with a specific format of data and have similar function structures that allow you to chain processes together (more on that later). Instead of installing dplyr and tidyr individually, I recommend installing and loading the tidyverse, which will load those packages along with several other useful ones.

There are also a number of packages NETN or collaborators have built to work with the monitoring data we’re collecting, including:
  • forestNETN: for importing, filtering, and summarizing forest data specific to NETN
  • NCRNWater: for importing, filtering, and summarizing water data collected by NETN (initially developed by National Capital Region Network)
  • wetlandACAD: for working with water level data collected in ACAD sentinel wetlands

If you haven’t already installed these packages (directions are also in the Prep for Training tab), run the code chunk below:

install.packages(c("tidyverse", "rgdal", "devtools"))
devtools::install_github("katemmiller/forestNETNarch")
devtools::install_github("NCRN/NCRNWater")

You only need to run the previous code chunk once. After these packages are installed, you can just load them via library(package_name). A couple of notes about packages:
  • Best practices for writing code are to load all of the packages you will use at the top of the script, so that it’s obvious to another user if they need to install a package to make the code work.
  • You only need devtools to install the NETN-specific packages from GitHub. Once the NETN-specific packages are installed (e.g., forestNETNarch), you don’t need to load devtools again.
  • The tidyverse and rgdal can take awhile to install, but once they’re installed they’re relatively quick to load.
  • Note the use of :: in the code chunk above. That’s another way to use functions in a package without loading the package. I only use this for packages like devtools, which I only use once in a script and use early. The :: symbol also comes in handy when multiple packages have a function with the same name. For example, filter is a function in both the base R stats package and dplyr. If you want the filter function from dplyr, type dplyr::filter. A quick example follows this list.
  • To install a package, the name needs to be in quotes. To load the package via library(), you don’t quote the package name. Don’t ask me why…
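
To make :: concrete, here’s a quick illustration (it assumes the installs above succeeded; neither line needs a library() call first):

tidyr::replace_na(c(1, NA, 3), 0) # use tidyr's replace_na() without loading tidyr
dplyr::filter(data.frame(x = 1:5), x > 3) # dplyr's filter, not stats::filter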

To load the packages, run the code chunk below:

library(tidyverse)
library(rgdal)
library(forestNETNarch)
library(NCRNWater)

If you get an error that says there is no package called “x”, it means that package didn’t install properly, and you should try installing the package again.

If you run into issues installing from GitHub (e.g., using the install_github function), here are a couple of tips for troubleshooting:
  1. Install Rtools for Windows 64-bit. The download page is here.
    a. Download the Windows binaries for backports.
    b. Install backports from file in RStudio by going to Tools > Install Packages > Install from: and changing the Install from option to "Package Archive File", then selecting backports_1.1.10.zip.
  2. Now try installing devtools and forestNETNarch: install.packages("devtools"); devtools::install_github("katemmiller/forestNETNarch")


Project and File Setup


The first thing we’re going to do is create an R project for this training and set up a file structure that we’ll continue to use for the rest of the training. To set up a project, open RStudio and go to:

File > New Project > New Directory > New Project


Find the directory you’re going to work from (anywhere other than your desktop is probably fine), and name the directory R_training.

Now run the code chunk below to set up the same file structure.

subfolders <- c("data", "figures", "tables", "rscripts", "shapefiles")

invisible(lapply(subfolders, function(x){
  if(!dir.exists(x)){dir.create(paste(getwd(), x, sep = '/'))}
  }))

Now, copy the data.zip file that you downloaded before we started into the data folder you created in the previous step. Then run the code chunk below.

If you haven’t downloaded the example data yet, please do so by going to the ACAD_R_Training_Info Team and downloading data.zip from the Files channel. Run the code below to check that the data.zip file is in the right place, and if it is, unzip it. You should see data.zip and 3 folders after you run list.files("./data") the second time.

list.files("./data") # check that you have data.zip in your data folder
## [1] "data.zip"              "Sampling_Grid_Fee.zip"
unzip("./data/data.zip", exdir = "./data")
list.files("./data") # check that the unzip worked. There should be multiple folders with datasets in each.
## [1] "data.zip"              "example_data"          "NETN_forest_csvs"     
## [4] "NETN_water_data"       "Sampling_Grid_Fee.zip"
  • Using Projects makes sharing code easier, because the parent directory can vary across machines, as long as the project has the same sub-directories (i.e., data, figures, shapefiles, etc. folders).
  • If you successfully started a project named R_training, you should see it listed at the very top right of your screen. As you start new projects, you’ll want to check that you’re working in the right one before you start coding.
  • The last thing to mention here is the # in the code chunk above. The # comments out anything after it on the same line, so R knows not to run it. Commenting your code allows you to document what you’re doing, so it’s easier for someone else (or you 3 months later) to follow your code.

Getting Help


There are a number of options to get help with R. If you’re trying to figure out how to use a function, you can type ?function_name. For example ?plot will show the R documentation for that function in the Help panel. ?dplyr::filter works too, if you need to connect the package to the function and/or haven’t loaded that package yet.

?plot
?dplyr::filter

To view all of the functions and links to their individual help files, type in help(package=packagename). For example, the code below will bring up the help files for the dplyr package.

help(package=dplyr)

Online resources include Stack Exchange and Stack Overflow. For tidyverse-related packages/questions, there’s also really good documentation for each package and function here. Google searches are usually my first step, and I include “in R” and the package name (if applicable) in every search related to R code.


Types of Data

Getting Started Now we’re going to start coding. The best way to work through these lessons will be to take the code from this website and transfer it to an R script in your R_training.Rproj project. To start that, go to File > New File > R Script (the new script icon will do the same), name the script Day_1.R, and save it to your rscripts subfolder. As you get more comfortable coding in R, try pushing yourself to type the commands, rather than copying and pasting. You’ll learn to code faster by typing, even if you’re just mimicking what you see in the training document.

Note also that R is case sensitive. For example, R would see a data frame named “DF” and a data frame named “df” as 2 different objects. It’s best not to give things the same name with different capitalization: R won’t be confused, but you might be.
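
A quick illustration of case sensitivity, using throwaway object names:

spp <- "red maple"
Spp <- "RED MAPLE"
spp == Spp # FALSE: R sees two different objects holding two different strings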

Before we go too far, let’s make sure you have the tidyverse package loaded by running the code chunk below. This is the only dependency we’ll need for today’s code.

library(tidyverse)

Types of Data

There are multiple types of data in R, including vectors, lists, arrays, matrices, data frames and tibbles. Vectors are the simplest data type. You can think of a vector as one column of data in an Excel spreadsheet, where each element is a row in that column. An example vector is below:

# Vector of numbers
vec_num <- c(1, 2, 3, 4, 5)
vec_num
## [1] 1 2 3 4 5
# Vector of strings
vec_str <- c("Schoodic", "MDI-East", "MDI-West", "IAH")
vec_str
## [1] "Schoodic" "MDI-East" "MDI-West" "IAH"
vec_str[1]
## [1] "Schoodic"

Note the use of c(). The c() function stands for combine, and it combines the comma-separated elements (e.g., “Schoodic”, “MDI-East”, etc.) into a vector. The c() function is a fairly universal way to combine multiple elements in R, and you’re going to see it over and over.
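
As a small illustration with the vectors from above, c() also combines existing vectors into a new, longer vector:

vec_num2 <- c(vec_num, 6, 7) # extends vec_num by 2 elements
vec_num2
## [1] 1 2 3 4 5 6 7
c(vec_str, "SEN-01") # works with strings too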

Data frames are essentially a collection of vectors of the same length. Frame means it’s rectangular (like a picture frame), like a spreadsheet with multiple columns, each of which has the same number of rows. We will spend most of our time with data frames, and won’t worry too much about the other data types. Tibbles are the tidyverse version of a data frame, but they’re essentially the same thing as a data frame with a few added bells and whistles.

Data frames are typically organized with rows being records or observations (e.g., sampling locations, individual critters, etc.), and columns being variables that characterize those observations (e.g., species, size, date collected, X/Y coordinates). This data format is also called tidy data. The column variables can be numbers, text, or logical (TRUE/FALSE). Before we go any further, let’s create a simple data frame to look at and perform some basic functions.

Note that I have a tendency to use field and column interchangeably (MS Access calls columns fields). Hopefully that doesn’t cause confusion. Just remember that field and column are equivalent.

Copy the code below into the console and run it.

df <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 'plot7'), #column for plot number
                 numtrees = c(24, 18, 17, 14, 31, 27, 15), 
                 numLive = c(21, 18, 16, 10, 29, 26, 14),
                 domSpp = c('red_maple', 'red_spruce', 
                            'red_spruce','white_cedar','white_pine',
                            'white_spruce', NA),
                 invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))

print(df)
##    plot numtrees numLive       domSpp invasive_pres
## 1     1       24      21    red_maple          TRUE
## 2     2       18      18   red_spruce         FALSE
## 3     3       17      16   red_spruce         FALSE
## 4     4       14      10  white_cedar         FALSE
## 5     5       31      29   white_pine         FALSE
## 6     6       27      26 white_spruce          TRUE
## 7 plot7       15      14         <NA>         FALSE

Here we’re creating a data frame and using <- to assign it to an object named df. Note that objects in R can be a lot of things, including data frames, functions, lists, etc. The objects you created and/or are available to you are listed in the Environment pane, usually under a qualifier like Data, so you have a better idea of what kind of object they are.

The df data frame we just created has 5 columns and 7 rows. You should notice that after you ran the code, df showed up in your Environment at the top right and the console printed output below.

Similar to Excel, R treats numbers and text differently. Some functions will only work with numbers, and some will only work with text. A couple of points about working with different column types in R data frames:
  • R calls text a string. If you have any text in a column, R automatically makes it a string.
  • There are 2 types of strings in R: character (abbreviated chr) and factor.
  • Older versions of R (i.e., R < 4.x) defaulted strings to factors. The newest R version defaults to characters, which saves us a lot of headaches. For most of the coding you’ll do in R, you probably won’t need to know about factors. If you’re using old code (e.g. pre-R 4.x) that relied on factors, you may need to change your default options to make the code work by running options(stringsAsFactors = TRUE). Otherwise, don’t worry about this for now.
  • When you’re dealing with strings, you typically have to put quotes around them. Note that 99% of the time, it doesn’t matter whether you use single or double quotes, as long as you open and close with the same kind.
  • The pound sign (#) is used to tell R not to run the text following the #. This is also called commenting out text, and is a way to leave comments that won’t break your code.
  • Columns that are entirely numbers (blanks are okay too) are set to numeric by default.
  • Integers are numeric whole numbers that will only return integers (no decimal points). That’s usually not a problem, but if it ever becomes a problem, the as.numeric() function fixes it. If you want a column to be an integer, use as.integer().
  • Logical is the last data type, and only consists of TRUE and FALSE.
  • Finally, NA is a blank or missing value in your data (NA stands for not available). Note that the last row in domSpp is NA. A short example showing these data types on df follows this list.
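
Here’s a short sketch showing these data types on the df data frame we just created. class() reports how R classified a column, and none of these lines change df:

class(df$numtrees) # "numeric"
class(df$domSpp) # "character"
class(df$plot) # also "character" -- the text value 'plot7' makes the whole column a string
class(df$invasive_pres) # "logical"
class(as.integer(df$numtrees)) # "integer"
is.na(df$domSpp) # TRUE only for the blank in the last row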

Selecting rows and columns in R There are 2 main approaches in base R for selecting/working with rows and columns: $ and [ , ].
  • The $ is used to specify a data frame name and one of its columns. For example df$plot is referring to the plot column in the df data frame.
  • The brackets [ , ] allow you to select rows and columns. The left side of the comma in brackets handles rows, and the right side of the bracket deals with columns, like: df[row numbers, column numbers].

To better understand this, run the code below:

df$plot # List the values within the column plot for the df data frame
df[ , 'plot'] # Bracket version of the same thing
df[ , 1] # Bracket version of showing the first column, which is plot
df[2, 1] # Selecting 2nd row of 1st column

Question 1: How do you find the value of the 3rd row of domSpp? Answer is in the Answers tab.


Check and fix data structure

To check the structure of your data, run the code chunk below.

head(df) # to view the first 6 rows of df
##   plot numtrees numLive       domSpp invasive_pres
## 1    1       24      21    red_maple          TRUE
## 2    2       18      18   red_spruce         FALSE
## 3    3       17      16   red_spruce         FALSE
## 4    4       14      10  white_cedar         FALSE
## 5    5       31      29   white_pine         FALSE
## 6    6       27      26 white_spruce          TRUE
tail(df) # to view last 6 rows of df
##    plot numtrees numLive       domSpp invasive_pres
## 2     2       18      18   red_spruce         FALSE
## 3     3       17      16   red_spruce         FALSE
## 4     4       14      10  white_cedar         FALSE
## 5     5       31      29   white_pine         FALSE
## 6     6       27      26 white_spruce          TRUE
## 7 plot7       15      14         <NA>         FALSE

Note the blank in domSpp in the last row.

names(df) # to view the column names and their order
## [1] "plot"          "numtrees"      "numLive"       "domSpp"       
## [5] "invasive_pres"
str(df) # to view how R is defining each column type
## 'data.frame':    7 obs. of  5 variables:
##  $ plot         : chr  "1" "2" "3" "4" ...
##  $ numtrees     : num  24 18 17 14 31 27 15
##  $ numLive      : num  21 18 16 10 29 26 14
##  $ domSpp       : chr  "red_maple" "red_spruce" "red_spruce" "white_cedar" ...
##  $ invasive_pres: logi  TRUE FALSE FALSE FALSE FALSE TRUE ...

Note the different data types in df. The plot column was assigned as a chr instead of a number because of the last value (i.e., plot7). The next thing we’ll do is modify the dataset so that plot reads in as numeric, and replace the missing value in domSpp with “white_spruce”.

Copy (or type) each section of code below and run in your console.

# The long way (note that I renamed the object here)
df2 <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 7), #column for plot number
                 numtrees = c(24, 18, 17, 14, 31, 27, 15), 
                 numLive = c(21, 18, 16, 10, 29, 26, 14),
                 domSpp = c('red_maple', 'red_spruce', 
                            'red_spruce','white_cedar','white_pine',
                            'white_spruce', 'white_spruce'),
                 invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))

# A shorter way
df$plot[7] <- 7 #takes the 7th row of the plot column and changes it to 7
df$domSpp[7] <- "white_spruce" #takes the 7th row of domSpp and replaces it with "white_spruce"

# Another shorter way
df[7, "plot"] <- 7 #takes the 7th row of the plot column and makes it 7
df[7, 'domSpp'] <- "white_spruce" #takes the 7th row of the domSpp column and makes it "white_spruce"

# Another shorter way
df[7, 1] <- 7 #takes the 7th row of the first column and makes it 7
df[7, 4] <- "white_spruce" #takes the 7th row of the 4th column and makes it "white_spruce"

# Even shorter-way
df[7, c(1, 4)] <- c(7, "white_spruce") #Takes the 7th row and the first and 4th column, which happen to be plot and domSpp.
# Then the first column gets 7, and the 4th gets "white_spruce"

This is merely to show that there are usually 5 different ways to do something in R, and the best way is the one you can get to work. At least that’s my philosophy for most of the coding I do. Having code I can pick up a year later and still understand what it does is also important. I only really care about coding for performance if I’m working with big datasets, where slow or memory hungry processes can really add up.

I should mention that when you use numbers instead of column names, your code is pretty fragile to changes in column order (e.g., you drop a column). Row numbers are similarly fragile if you reorder your data. It’s better to use names where possible, and to use functions that use logic to find and replace records. For example:

#df[is.na("domSpp")] <- "white_oak" #wrong
df$domSpp[is.na(df$domSpp)] <- "white_oak"

Here R is looking for a value in the domSpp column that is NA (i.e., a blank value). Since there’s only 1 NA, this works. If there were more than 1, we’d need better logic to assign the proper value to each blank, as sketched below. We’ll go into more detail about this later.
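
For example, here’s a sketch with a hypothetical data frame containing two NAs, where logic based on the plot column decides each replacement:

df3 <- data.frame(plot = c(1, 2, 3),
                  domSpp = c(NA, "red_spruce", NA))
df3$domSpp[is.na(df3$domSpp) & df3$plot == 1] <- "white_oak"
df3$domSpp[is.na(df3$domSpp) & df3$plot == 3] <- "red_maple"
df3 # each blank was filled based on its plot number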

And now, here’s the tidyverse way to replace the blank in df’s domSpp column with “white_oak”:

df$domSpp <- df$domSpp %>% replace_na("white_oak")

We’ll talk more about pipes (%>%) later, but for now, just know that it’s taking the data on the left of the %>% and doing the thing on the right to it.
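
As a tiny illustration, the two lines below return the same result; the pipe just lets you read the steps from left to right:

sort(unique(df$domSpp)) # nested base R version
df$domSpp %>% unique() %>% sort() # piped version of the same thing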

Finally, check whether plot is now reading in as numeric or still chr.

str(df) #still a char
## 'data.frame':    7 obs. of  5 variables:
##  $ plot         : chr  "1" "2" "3" "4" ...
##  $ numtrees     : num  24 18 17 14 31 27 15
##  $ numLive      : num  21 18 16 10 29 26 14
##  $ domSpp       : chr  "red_maple" "red_spruce" "red_spruce" "white_cedar" ...
##  $ invasive_pres: logi  TRUE FALSE FALSE FALSE FALSE TRUE ...

R still thinks plot is a char. Basically once R assigns a data type to a column, you have to manually change it if you need to. In the case here, the as.numeric function fixes the issue. If you didn’t fix the last record, as.numeric would still convert the plot column to numeric, but the last value will be set to NA, and you’ll get a warning message: “NAs introduced by coercion”.

df$plot <- as.numeric(df$plot)
str(df)
## 'data.frame':    7 obs. of  5 variables:
##  $ plot         : num  1 2 3 4 5 6 7
##  $ numtrees     : num  24 18 17 14 31 27 15
##  $ numLive      : num  21 18 16 10 29 26 14
##  $ domSpp       : chr  "red_maple" "red_spruce" "red_spruce" "white_cedar" ...
##  $ invasive_pres: logi  TRUE FALSE FALSE FALSE FALSE TRUE ...

Question 2: How would you make numtrees an integer?

Question 3: How would you change the second row in invasive_pres from FALSE to TRUE?



Data Summary I


Using the df we created in the previous section, let’s perform some basic summary statistics. Base R has a number of summary statistics built into it that are really useful, including mean, median, min, max, and sd (for standard deviation).

min(df$numtrees)
## [1] 14
max(df$numtrees)
## [1] 31
mean(df$numtrees)
## [1] 20.85714
range(df$numtrees) #gives min and max
## [1] 14 31

The summary() function is another really useful base R function that can take just about any named object in your global environment and provide a meaningful summary. For data frames, summary() tailors the output to each column’s data type (e.g., numeric vs. character). The summary() function also works well for summarizing model output, like from ANOVA, linear regression, etc.; a quick sketch of that follows the output below.

summary(df)
##       plot        numtrees        numLive         domSpp         
##  Min.   :1.0   Min.   :14.00   Min.   :10.00   Length:7          
##  1st Qu.:2.5   1st Qu.:16.00   1st Qu.:15.00   Class :character  
##  Median :4.0   Median :18.00   Median :18.00   Mode  :character  
##  Mean   :4.0   Mean   :20.86   Mean   :19.14                     
##  3rd Qu.:5.5   3rd Qu.:25.50   3rd Qu.:23.50                     
##  Max.   :7.0   Max.   :31.00   Max.   :29.00                     
##  invasive_pres  
##  Mode :logical  
##  FALSE:5        
##  TRUE :2        
##                 
##                 
## 
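
To show what that looks like for model output, here’s a sketch that fits a simple linear regression of live trees against total trees in df (tree_mod is just an illustrative name, and we won’t use the model again):

tree_mod <- lm(numLive ~ numtrees, data = df) # simple linear regression
summary(tree_mod) # prints coefficients, residuals, R-squared, etc.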

You can also calculate new columns from existing columns in your data.

df$numDead <- df$numtrees - df$numLive # base R version
df <- df %>% mutate(numDead = numtrees - numLive) # tidyverse version of the same operation.
# note if the 2nd line doesn't run, load the tidyverse library (library(tidyverse)) first.

The mutate function is in the dplyr package of the tidyverse. The mutate function is used to create a new column in an existing data frame, and always returns the same number of records as the existing data frame. In this example, mutate creates a new column called numDead, and sets it equal to numtrees - numLive. The results are identical to the base R version in the line above. Mutate just spares you having to write df$ over and over.
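
Note that mutate() can also create several columns in one call, and later columns can use ones defined earlier in the same call. A quick sketch (df_demo is a throwaway name, so df itself isn’t changed):

df_demo <- df %>% mutate(numDead = numtrees - numLive,
                         anyDead = numDead > 0) # uses numDead from the line above
head(df_demo)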

Question 4: How would you calculate the percent of trees that are dead in the df data frame?

Question 5: What is the smallest number of dead trees observed in the df dataset?


Importing Data into R

Loading csv files For the training modules that follow, we’re going to work with a larger dataset that contains some fake wetland plant data. First we need to load the wetland dataset, using the code below.

wetdat <- read.csv("./data/example_data/ACAD_wetland_data_example.csv")

After running the code above, you should see a new object in your global environment named wetdat. You can view the structure and the data by clicking on the arrow next to the name in the global environment panel, or click directly on wetdat to display a separate window of your data in a spreadsheet format.

We’re going to use this dataset throughout the rest of training. To give a little background on it, each row represents a species that was recorded in a wetland plot. Multiple sites have been sampled twice, as indicated by the Year column. PctFreq is the % of subplots a species was found in, ranging from 0 to 100. Ave_Cov is the average % cover of the species, ranging from 0 to 100.

read.csv() is a base R function that reads csv (comma separated values) files, which are lightweight machine-readable files that most text readers or spreadsheet programs can read. Unlike Excel spreadsheets, they only contain values and commas, and cannot save any kind of formatting, like merged cells, cell borders, formulas, etc. You can also only have one worksheet within a csv. If most of your data live in Excel spreadsheets, you can either save individual worksheets as csvs directly in Excel, or you can load individual worksheets into R using the readxl package, which was designed to read Excel files.
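
The reverse operation, writing a data frame out to a csv, is just as simple. A minimal sketch (the file name is only an example):

write.csv(df, "./data/df_example.csv", row.names = FALSE) # row.names = FALSE drops R's row numbers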

Loading Excel files The first step for working with Excel files (i.e., files with .xls or .xlsx extensions), is to install the readxl package, and then load the package as shown below. The readxl package has a couple of options for loading Excel spreadsheets, depending on whether the extension is .xls, .xlsx, or unknown. The code below shows you how to load files with .xlsx extension. The other options for loading Excel files can be viewed on the readxl help page.

install.packages("readxl") #only need to run once. 
library(readxl)
bat_capture <- read_xlsx(path = "./data/example_data/ANP_BAT_MasterCaptureDB_1996-2019 20200811.xlsx", 
                         sheet = "All Records")
head(bat_capture)
## # A tibble: 6 x 65
##   Date                Month State County Town  `Site ID` `Site Name: Ori~
##   <dttm>              <chr> <chr> <chr>  <chr> <chr>     <chr>           
## 1 1996-06-05 00:00:00 June  ME    Hanco~ Goul~ COREA     MICK02          
## 2 1996-06-07 00:00:00 June  ME    Hanco~ Wint~ ANP       FRZR (approx.)  
## 3 1996-06-07 00:00:00 June  ME    Hanco~ Wint~ ANP       FRZR (approx.)  
## 4 1996-06-07 00:00:00 June  ME    Hanco~ Wint~ ANP       FRZR (approx.)  
## 5 1996-06-07 00:00:00 June  ME    Hanco~ Wint~ ANP       FRZR (approx.)  
## 6 1996-06-07 00:00:00 June  ME    Hanco~ Wint~ ANP       FRZR (approx.)  
## # ... with 58 more variables: `Site Name: Standardized` <chr>,
## #   `Recorder/Staff` <chr>, Affiliation <chr>, Lat <dbl>, Long <dbl>,
## #   Datum <chr>, Easting <chr>, Northing <chr>, `Start Temp (°F)` <chr>, `End
## #   Temp (°F)` <chr>, `Start Temp (°C)` <chr>, `End Temp (°C)` <chr>, `%
## #   clouds` <dbl>, Wind <chr>, Precip. <chr>, `Moon Phase` <chr>,
## #   Sunset <dttm>, `Elevation (m)` <dbl>, Habitat <chr>, `Start Time` <dttm>,
## #   `End Time` <chr>, `Total minutes` <dbl>, `Total hours` <dbl>, `Capture
## #   Technique` <chr>, `Capture Technique: Standardized` <chr>, `Effort (sq. m.
## #   of net)` <dbl>, `Units of Effort (UOE=Effort*Total hours)` <dbl>, `Set near
## #   water?` <chr>, `Capture Time` <chr>, `Time After Sunset (h:mm)` <chr>,
## #   `Time after Sunset: Converted to hours` <chr>, Net <chr>, `Net
## #   Height` <chr>, Species <chr>, Latin <chr>, Sex <chr>, Age <chr>, `Repro.
## #   Status` <chr>, `RS (0-5)` <chr>, `Forearm (mm)` <chr>, `Ear (mm)` <chr>,
## #   `Weight (g)` <chr>, `Band #` <chr>, Recap <chr>, `Sample ID` <chr>,
## #   Skin <chr>, `Genetic Test` <chr>, `Genetic ID` <chr>, `ID Match` <chr>,
## #   Fur <chr>, Blood <chr>, Swab <chr>, Fecal <chr>, Parasite <chr>, `Radio
## #   Frequency` <lgl>, Photos <lgl>, `Non-target species` <lgl>, Comments <chr>

The file.choose() function If you don’t remember the exact path where your file lives, the file.choose() function is a handy trick. Nesting that within a read function will open a separate window where you can select the file by hand. While this can be handy, it can also encourage lazy file organization. I prefer to organize my code and data into projects that store all of the files within the project directory, and so rarely use this function (In fact, I had to look up how to use it for this section). If you want to use it, here’s an example:

dat <- read.csv(file.choose())


Subsetting and Sorting

Subsetting Data Subsetting is a generic term for reducing the size of a dataset, either by reducing the number of rows, the number of columns, or both. Filtering typically means reducing rows. Selecting usually means reducing columns. Technically, we already did some subsetting when we were trying to identify and fix problems in the df data frame. Now we’re going to do it more intentionally and with a bigger dataset.

Using the wetdat data frame from earlier, let’s subset the data to remove all species that are considered protected from the public (e.g., sensitive species whose locations we don’t want to publish; also, this is a fake dataset, so don’t worry that we’re publicizing anything). There’s a logical column called Protected in the dataset that we’re going to use to filter out rows where Protected is TRUE. Let’s also simplify the data frame to select only the columns we care about.

# Filter to remove protected species
wetdat_pub <- wetdat[wetdat$Protected == FALSE, ]
nrow(wetdat) #508 rows in original dataset
nrow(wetdat_pub) #499 rows in filtered dataset
table(wetdat_pub$Protected) # Only Protected = FALSE

# Select only fields you want
wetdat_pub <- wetdat_pub[ , c("Site_Name", "Site_Type", "Latin_Name", "Year", "Ave_Cov", 
                              "Invasive", "Protected")]
names(wetdat_pub)

# Previous steps can all happen in 1 line of code:
wetdat_pub <- wetdat[wetdat$Protected == FALSE, 
                     c("Site_Name", "Site_Type", "Latin_Name", "Year", "Ave_Cov", 
                       "Invasive", "Protected")]

nrow(wetdat_pub) # 499 rows
names(wetdat_pub) # Only includes fields we want
table(wetdat_pub$Protected) #All FALSE

# Another base R solution that's a bit easier to work with:
wetdat_pub2 <- subset(wetdat, Protected == FALSE, 
                              select = c(Site_Name, Site_Type, Latin_Name, Year,
                                         Ave_Cov, Invasive, Protected))

nrow(wetdat_pub2) # 499 rows
names(wetdat_pub2) # Only includes fields we want
table(wetdat_pub2$Protected) #All FALSE

A couple of comments about the above code:
  • Note the use of == to filter out protected species. In R, when you’re trying to match against something (like Protected == FALSE, or Site_Name == “RAM-44”), you use double equals signs (==). When assigning something to an object (like mutate(newcolumn = x + y)), you use a single =.
  • If you want to pull out all records that are NOT equal to a value, you use !=. That is, in R, ! means NOT. For example, Protected != TRUE means to only include records where Protected is NOT TRUE (i.e., FALSE). A short example follows this list.
  • The subset() function is a base R function that is a bit easier to work with than brackets. You can filter rows, select columns, or both (as seen above). The first argument with subset is always the dataset you’re subsetting.
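
To demonstrate !=, the code below should return the same rows as the Protected == FALSE filter above (assuming Protected has no blanks):

wetdat_pub3 <- wetdat[wetdat$Protected != TRUE, ] # NOT TRUE is the same as FALSE here
table(wetdat_pub3$Protected) # all FALSE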

Now here’s the tidyverse way to do what we did in the previous code chunk, using the select() and filter() functions chained together via pipes (%>%). The select function is used for columns, and the filter function is used for rows.

wetdat_tidy <- wetdat %>% filter(Protected == FALSE) %>% 
                          select(Site_Name, Site_Type, Latin_Name, Year, Ave_Cov, Invasive, Protected)
                          # Any column name not mentioned in the select is dropped

nrow(wetdat_tidy)
names(wetdat_tidy)
table(wetdat_tidy$Protected)

# Even more efficient way to code this:
wetdat_tidy <- wetdat %>% filter(Protected == FALSE) %>% select(-Common, -PctFreq)

Note the %>%, called a “pipe”. The way to read what the pipe is doing: it takes the wetdat data frame and pipes it into the filter function, which drops protected species; that newly filtered dataset is then piped into the select function, which drops the Common and PctFreq columns. Also notice that column names aren’t quoted in the tidyverse approach. One of the benefits of the tidyverse packages is that you don’t have to quote column names, whereas you typically do in base R (subset() being an obvious exception). This saves time typing and was a conscious decision of the tidyverse developers. This approach is called non-standard evaluation (NSE). NSE is great most of the time, but can cause headaches when you start writing your own functions. Personally, I use the tidyverse for most scripting purposes because it’s intuitive to me and I like being able to chain steps together. I use base R when I’m writing functions/developing packages, to avoid NSE issues and because base R functions tend to be faster. Until you get to the level that you’re building functions/packages, you’re free to work within the tidyverse.


Filtering with multiple values This section shows you how to filter rows based on multiple possible values. Using the wetland data again, let’s say we only want to look at the data from 2 sites: SEN-01 and SEN-03. To do that, we create a vector of the strings we want to match and tell R to keep any record that matches any of the strings in that vector.

# Base R
wetdat_2sites <- wetdat[wetdat$Site_Name %in% c("SEN-01", "SEN-03"), ] 
nrow(wetdat_2sites)
table(wetdat_2sites$Site_Name)

# Tidyverse equivalent of same operation
wetdat_2tidy <- wetdat %>% filter(Site_Name %in% c("SEN-01", "SEN-03"))
nrow(wetdat_2tidy)
table(wetdat_2tidy$Site_Name)

The nice thing about using %in% is that it only pulls records that match exactly one of the values you give it, and you don’t have to worry about it also including NAs.
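
A quick illustration of that point with a made-up vector containing a blank:

x <- c("SEN-01", NA, "SEN-03")
x == "SEN-01" # the blank returns NA, which can sneak into bracket filters
## [1]  TRUE    NA FALSE
x %in% "SEN-01" # the blank simply doesn't match
## [1]  TRUE FALSE FALSE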

Question 6: How would you select only Invasive species records?

Question 7: Which sites have both Arethusa bulbosa and Carex exilis present?


Sorting Data Next topic is sorting data in R, which you can do alphabetically or numerically. Typically, when you read a dataset into R, the row order is exactly the order in the original file, and the row numbers are assigned based on that order. If you want to change the original order, there are multiple ways to sort data in R, as demonstrated below.

head(wetdat) # check original order before sorting

# Base R- I always forget the exact code and have to look it up.
wetdat_sort <- wetdat[order(wetdat$Latin_Name, wetdat$Site_Name), ]
head(wetdat_sort)

# Sort in reverse
wetdat_rev <- wetdat[order(desc(wetdat$Latin_Name), desc(wetdat$Site_Name)), ] #desc() stands for descending
head(wetdat_rev)

# Another example with numbers
wetdat_rev2 <- wetdat[order(-wetdat$PctFreq), ]
# Note: desc() is for text, - is for numbers

# Tidyverse version
wetdat_sort_tidy <- wetdat %>% arrange(Latin_Name, Site_Name)
head(wetdat_sort_tidy)

# Tidyverse reverse sort
wetdat_rev_tidy <- wetdat %>% arrange(desc(Latin_Name), desc(Site_Name))
head(wetdat_rev_tidy)

Question 8: How would you sort the wetdat by Ave_Cov (high to low) and Latin_Name?


That’s the end of Day 1! Be sure to save your Day 1 R script, try to answer the questions from today, and complete the tasks in the Assignments tab as you have time. Answers to the questions are in the Answers tab. Note that I provide one or a few solutions to get the answer to the questions, and it’s fine if you come up with your own. That’s how coding in R works!

Lastly, don’t worry if you don’t understand everything. The key here was to expose you to the many different ways to view and work with data in R. Learning R requires repetition, determination, and patience. Just stick with it!


Assignments

Please complete the assignments below (they’re ordered from highest to lowest priority). Note that the readings are meant to expose you to new topics; you’re not expected to understand or digest 100% of the material. If you didn’t fully understand the material we covered in the first 4 tabs of today’s training (Intro to R & RStudio through Types of Data), the links below might help. If you don’t have experience with relational databases and joining tables, the resources below introduce those topics.

Day 2

Data Wrangling I

Data Wrangling Intro We already did a bit of data wrangling with the df and wetdat data frames yesterday. Today, we take those concepts to the next level. If you’ve never heard of the term, data wrangling essentially refers to all the steps you have to take to prepare your data for analysis or plotting. Data munging or data cleaning are other terms used to describe this process. This was the most checked topic in the survey you all filled out, so I’ll cover this in (painful?) detail.

Disclaimer: While R provides lots of ways to check and clean up data, these tools are no replacement for sound data management that prevents data issues from happening in the first place.

For this training module, we’re going to check and clean up the wetland plant data from yesterday. Just a reminder that in this dataset each row represents a species that was recorded in a wetland plot. Multiple sites have been sampled twice, as indicated by the Year column. PctFreq is the % of subplots a species was found in, ranging from 0 to 100. Ave_Cov is the average % cover of the species, ranging from 0 to 100.

Before we get started, open the R_training.Rproj project in RStudio, start a Day_2.R script. Then, load the tidyverse package and the wetland dataset we used yesterday.

library(tidyverse)
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_example.csv")

Explore data structure Once you’ve loaded the wetdat dataset, remember that you can view the structure of the wetdat by clicking on the arrow next to the name in the global environment panel. If you click directly on wetdat, a separate window will open to view the data like a spreadsheet. Both of these are acceptable ways to explore data, but I prefer to explore datasets through code, rather than by mouse.

The code below is usually where I start (though I rarely use tail).

head(wetdat) # shows the first 6 lines of the data frame, including the names of the columns
tail(wetdat[1:3]) # shows last 6 records of first 3 columns in the data frame
names(wetdat) # shows the column names of the data frame
table(wetdat$Site_Name, wetdat$Year) # distribution of sites across years
table(complete.cases(wetdat)) # FALSE counts the number of rows that have an NA in at least 1 column
dim(wetdat) # number of rows (first) and columns (second)
nrow(wetdat) # number of rows (ncol is # columns)
str(wetdat) # shows the type of data in each column
summary(wetdat) # summarize each field based on data type


Check and fix numeric fields that R reads as a character Both the summary and str functions show an issue with the Ave_Cov column: R classified it as a character, but we want it to be numeric. If R classified this column as a character, it means that at least one record within it contains something that isn’t a number. If your data frame isn’t huge, you can just run View(wetdat) to see the data and scroll/sort until you find the culprit. Strings will typically sort to the top or bottom of an otherwise numeric column.

View(wetdat)

Just by sorting Ave_Cov in the view above, you can see that there are 2 records with <0.1 instead of 0.1. Also note the row numbers to the far left, which are rows 83 and 106 in this case. The row numbers are assigned at the time you load the data, and are maintained even when you sort in the view. If you sort the data in code (e.g., wetdat %>% arrange(Site_Name)), the row numbers will be updated.

Knowing what you just learned, you could open this file in Excel, fix the data, and reload the file in R. But I’m going to do it in code, so that it works any time you add new data to the dataset that might have the same issues. This isn’t the slickest approach, because I’m pretending that I don’t know which records and which symbols are causing the problems (say, for example, that there are a lot of records with this issue). First, I’m going to create a temporary field called Ave_Cov_Num that’s a numeric version of Ave_Cov.

wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov) #create new field that converts Ave_Cov to numeric
## Warning: NAs introduced by coercion

Note the “NAs introduced by coercion” warning message in your console. This means that when you converted Ave_Cov to numeric, there was at least 1 record that R didn’t read as a number, and it was set to NA (blank) in the Ave_Cov_Num column. We’re going to use that to find and then fix the record(s) causing the issue.

which(is.na(wetdat$Ave_Cov_Num)) 
## [1]  83 106

The line above is asking which rows in the new Ave_Cov_Num field are NA, and returns 83 and 106. Now let’s use brackets to return the data from those 2 rows.

wetdat[which(is.na(wetdat$Ave_Cov_Num)), ] # all columns with NA in Ave_Cov_num
##     Site_Name Site_Type            Latin_Name             Common Year PctFreq
## 83     SEN-03  Sentinel     Cornus canadensis bunchberry dogwood 2011     100
## 106    SEN-03  Sentinel Vaccinium macrocarpon    large cranberry 2011     100
##     Ave_Cov Invasive Protected X_Coord Y_Coord Ave_Cov_Num
## 83     <0.1    FALSE     FALSE  553993 4898560          NA
## 106    <0.1    FALSE     FALSE  553993 4898560          NA
wetdat[which(is.na(wetdat$Ave_Cov_Num)), "Ave_Cov"] # Ave_Cov only
## [1] "<0.1" "<0.1"
wetdat$Ave_Cov[which(is.na(wetdat$Ave_Cov_Num))] # Another way to see Ave_Cov only 
## [1] "<0.1" "<0.1"

In the code above, we’re using brackets to subset the wetdat data frame to only return the rows that have NA in Ave_Cov_Num. The first line returns all columns in those rows. The second and third lines only show us the Ave_Cov field, which is what we want to fix. Notice that I’m using the NAs in Ave_Cov_Num to return the records causing problems in Ave_Cov.

Also, just a reminder that brackets specify the [row, column] of a data frame. If you’ve already specified a column with dataframe$column, only the rows are left to index, so you don’t need a comma in the brackets. Admittedly, subsetting with brackets can get complicated, and I’m not always sure whether I need a comma in the bracket or not. If I get the “incorrect number of dimensions” error, I know I did it wrong and need to either add or delete the comma.
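
Here’s the difference in practice, using the Ave_Cov_Num field from above (the commented line shows the mistake that triggers the dimensions error):

wetdat[83, "Ave_Cov_Num"] # a data frame has 2 dimensions, so [row, column]
wetdat$Ave_Cov_Num[83] # after $, only rows are left to index: 1 dimension, no comma
# wetdat$Ave_Cov_Num[83, ] # Error: incorrect number of dimensions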

Back to cleaning the data. We found that the records with non-numeric symbols were both “<0.1”. We’re going to use that information to replace “<0.1” with 0.1. Of course, the first thing you should do when you see <0.1 is to check your raw data, datasheets, etc. to make sure 0.1 is the correct replacement for the <0.1 records. If you don’t know what the correct replacement should be, you should probably change the value to NA instead of “0.1”. That example is commented out in the code chunk below.

wetdat$Ave_Cov[wetdat$Ave_Cov == "<0.1"] <- "0.1"
#wetdat$Ave_Cov[wetdat$Ave_Cov == "<0.1"] <- NA

Another approach is to drop the “<” using gsub(). This is handy for lab-derived data that report detection limits as < ## or > ##, and when you just want the number after the < or >.

wetdat$Ave_Cov <- gsub("<", "", wetdat$Ave_Cov)

str(wetdat)
## 'data.frame':    508 obs. of  12 variables:
##  $ Site_Name  : chr  "SEN-01" "SEN-01" "SEN-01" "SEN-01" ...
##  $ Site_Type  : chr  "Sentinel" "Sentinel" "Sentinel" "Sentinel" ...
##  $ Latin_Name : chr  "Acer rubrum" "Amelanchier" "Andromeda polifolia" "Arethusa bulbosa" ...
##  $ Common     : chr  "red maple" "serviceberry" "bog rosemary" "dragon's mouth" ...
##  $ Year       : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ PctFreq    : int  NA 20 80 40 100 60 100 60 100 100 ...
##  $ Ave_Cov    : chr  "0.02" "0.02" "2.22" "0.04" ...
##  $ Invasive   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Protected  : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ X_Coord    : num  574855 574855 574855 574855 574855 ...
##  $ Y_Coord    : num  4911909 4911909 4911909 4911909 4911909 ...
##  $ Ave_Cov_Num: num  0.02 0.02 2.22 0.04 2.64 6.6 10.2 0.06 0.86 4.82 ...

Here gsub is looking for the “<” symbol, and replaces it with nothing, via empty quotes. If you wanted to replace the “<” with something else, like a minus sign, the 2nd argument would be “-” instead of "". You can now View(wetdat) to check that <0.1 is no longer in the Ave_Cov column. Rows 83 and 106 were the culprits.

However, notice that str(wetdat$Ave_Cov) shows that Ave_Cov is still considered a character. Once R assigns a data type to a column, you have to manually change it to something different. Now that we fixed <0.1, we shouldn’t get any NAs introduced by coercion when we convert to numeric. I’m going to overwrite the Ave_Cov_Num field because it’s just a temporary field for the purpose of data checking.

wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov)
str(wetdat) 
## 'data.frame':    508 obs. of  12 variables:
##  $ Site_Name  : chr  "SEN-01" "SEN-01" "SEN-01" "SEN-01" ...
##  $ Site_Type  : chr  "Sentinel" "Sentinel" "Sentinel" "Sentinel" ...
##  $ Latin_Name : chr  "Acer rubrum" "Amelanchier" "Andromeda polifolia" "Arethusa bulbosa" ...
##  $ Common     : chr  "red maple" "serviceberry" "bog rosemary" "dragon's mouth" ...
##  $ Year       : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ PctFreq    : int  NA 20 80 40 100 60 100 60 100 100 ...
##  $ Ave_Cov    : chr  "0.02" "0.02" "2.22" "0.04" ...
##  $ Invasive   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Protected  : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ X_Coord    : num  574855 574855 574855 574855 574855 ...
##  $ Y_Coord    : num  4911909 4911909 4911909 4911909 4911909 ...
##  $ Ave_Cov_Num: num  0.02 0.02 2.22 0.04 2.64 6.6 10.2 0.06 0.86 4.82 ...

Ave_Cov_Num is now numeric and no NAs were introduced by coercion. It’s now safe to overwrite the original Ave_Cov field. The entire workflow from start to finish is below:

wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov)
## Warning: NAs introduced by coercion
wetdat$Ave_Cov[which(is.na(wetdat$Ave_Cov_Num))] # Identify which values in Ave_Cov can't be converted to a number and what you need to do to fix them.
## [1] "<0.1" "<0.1"
wetdat$Ave_Cov <- as.numeric(gsub("<", "", wetdat$Ave_Cov)) # replace < with nothing
wetdat <- subset(wetdat, select = -c(Ave_Cov_Num)) #now drop the temp column Ave_Cov_Num
str(wetdat) #Ave_Cov is now numeric
## 'data.frame':    508 obs. of  11 variables:
##  $ Site_Name : chr  "SEN-01" "SEN-01" "SEN-01" "SEN-01" ...
##  $ Site_Type : chr  "Sentinel" "Sentinel" "Sentinel" "Sentinel" ...
##  $ Latin_Name: chr  "Acer rubrum" "Amelanchier" "Andromeda polifolia" "Arethusa bulbosa" ...
##  $ Common    : chr  "red maple" "serviceberry" "bog rosemary" "dragon's mouth" ...
##  $ Year      : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ PctFreq   : int  NA 20 80 40 100 60 100 60 100 100 ...
##  $ Ave_Cov   : num  0.02 0.02 2.22 0.04 2.64 6.6 10.2 0.06 0.86 4.82 ...
##  $ Invasive  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Protected : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ X_Coord   : num  574855 574855 574855 574855 574855 ...
##  $ Y_Coord   : num  4911909 4911909 4911909 4911909 4911909 ...
table(complete.cases(wetdat$Ave_Cov)) #checks that there aren't any more NAs. 
## 
## TRUE 
##  508
# If there were NAs, you'd have a column named FALSE with the number of NAs below it.


Check and fix range of numeric data Another check I often do is to make sure there aren’t any impossible values. For example, both the Ave_Cov and PctFreq columns are percents that should range from 0 to 100. Negative values or values over 100 are errors. Let’s see if we find any in the data. Note that we couldn’t use the range() function on Ave_Cov with the original data frame until we fixed the <0.1 issue and converted to numeric.

range(wetdat$PctFreq, na.rm = TRUE) # 20 to 100- that looks fine.
## [1]  20 100
range(wetdat$Ave_Cov, na.rm = TRUE) # 0.02 to 110.0- no bueno
## [1]   0.02 110.00

The Ave_Cov column has at least 1 record >100. Now we’re going to find out how many records are >100, then replace them with the correct value.

Note also the na.rm = TRUE. That tells R to remove any blanks from the range calculation and only return numbers. If you didn’t include na.rm = TRUE, range() would return NA NA if there’s a blank in the column (more on this in the next section).
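Here’s a quick sketch of the difference on a throwaway vector:

x <- c(2, 5, NA)
range(x) # returns NA NA because of the blank
range(x, na.rm = TRUE) # returns 2 5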

wetdat[wetdat$Ave_Cov > 100, ] # If Ave_Cov is still a chr, this filter won't work
##    Site_Name Site_Type       Latin_Name         Common Year PctFreq Ave_Cov
## 36    SEN-02  Sentinel Arethusa bulbosa dragon's mouth 2011      20     110
##    Invasive Protected X_Coord Y_Coord
## 36    FALSE      TRUE  556242 4914369

There’s only 1 row, row 36, which has an Ave_Cov > 100, for Arethusa bulbosa in SEN-02. For our purposes, pretend that we went back to the datasheet and found a data transcription error: the correct value should be 11, not 110. The next line of code fixes that record by replacing any Ave_Cov over 100 with 11.

wetdat$Ave_Cov[wetdat$Ave_Cov > 100] <- 11
range(wetdat$Ave_Cov) 
## [1]  0.02 74.00
Now Ave_Cov values are within the acceptable range of 0 to 100.
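One caution: the condition wetdat$Ave_Cov > 100 replaces every value over 100, not just row 36. That’s fine here because there was only one, but if you want to target the specific record, a sketch like the one below (keyed on the site, species, and year we identified above) is safer:

wetdat$Ave_Cov[wetdat$Site_Name == "SEN-02" & wetdat$Latin_Name == "Arethusa bulbosa" &
               wetdat$Year == 2011] <- 11 # only touches the record we verified on the datasheet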


Finding and fixing blanks (NA) As I’ve mentioned before, R calls blanks in your data NA. By default, R is very conservative about how it handles NAs. If you have an NA in an otherwise numeric column and try to do a mathematical operation (eg calculate a sum or mean), R will most likely return NA as the result. The philosophy behind that is that the user must decide how to deal with NAs: either replace them with a sensible value, like 0, or remove them from the calculation. Every situation is different, and you, the user, need to decide the correct way to deal with NAs in your data.

First, let’s look to see if there are NAs in the PctFreq and Ave_Cov columns (the main data of interest in the wetdat dataset).

table(complete.cases(wetdat$Ave_Cov)) #all TRUE
## 
## TRUE 
##  508
table(complete.cases(wetdat$PctFreq)) #some FALSE
## 
## FALSE  TRUE 
##     5   503

While the Ave_Cov column is all TRUE, the PctFreq column has 5 NAs (the FALSEs in the output). Now we need to decide whether the NAs should be changed to 0 or dropped from the dataset. Again, that’s up to the user and what you’re trying to do. For the purposes here, I’ll show you how the results change based on the decision you make.

First, let’s do nothing.

mean(wetdat$PctFreq)
## [1] NA

NA is returned. R’s basically awaiting further instruction before it’s willing to return a value. Let’s drop NAs from the calculation.

mean(wetdat$PctFreq, na.rm = TRUE)
## [1] 70.36779

Returns 70.36779. Note that na.rm = TRUE is a pretty universal argument in R functions. Functions typically default to na.rm = FALSE, which returns NA if there are blanks in your data. Setting na.rm = TRUE drops NAs from the calculation.

Now let’s assume that blanks should actually be 0.

wetdat$PctFreq[is.na(wetdat$PctFreq)] <- 0
table(complete.cases(wetdat$PctFreq))
## 
## TRUE 
##  508
mean(wetdat$PctFreq)
## [1] 69.6752
Now all records in PctFreq are TRUE/complete, and the mean is slightly smaller because the 0s are included in the calculation. This is why it’s important to think about how you deal with blanks; every situation is different.


Check and fix character columns A common issue with text/character fields is inconsistent naming conventions, especially if data are stored in a spreadsheet instead of a database with standardized lookups. In the wetdat data frame, we have a column called Site_Name, which should be a unique name for each site that starts with a 3-letter code (either SEN or RAM), followed by a dash, then a 2-digit site number. Let’s check to see if all of the Site_Name records follow that convention. Since we don’t have a huge dataset here, we can use the table() function, which counts the number of records for each unique value in a column.

table(wetdat$Site_Name)
## 
## RAM-05 RAM-41 RAM-44 RAM-53 RAM-62  RAM44 SEN-01 SEN-02 SEN-03  SEN-3 
##     98     73     34     98     52     45     34     41     32      1

A couple of issues show up here. First, we see that there’s a site named RAM44 that’s missing a dash. We also see that there’s one SEN-3, which should be SEN-03.

Another way to do this, if you had a large dataset, would be to check whether any Site_Name records are not 6 characters long (eg RAM-01 is 6 characters) using the nchar() function, which counts the number of characters in a record.

unique(wetdat$Site_Name[which(nchar(wetdat$Site_Name) != 6)])
## [1] "SEN-3" "RAM44"

Note unique() returns only unique values, so you don’t get 45 RAM44 returns (run it without unique() if you don’t quite follow).

You can also check whether any Site_Name records are missing a dash using the grepl() function. grepl() is a pattern matching function that we’ll use to look for “-” in the Site_Name column. The ! in front of grepl() tells R to return any row that doesn’t have a match for “-” in the Site_Name column.

wetdat$Site_Name[which(!grepl("-", wetdat$Site_Name))]
##  [1] "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44"
## [10] "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44"
## [19] "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44"
## [28] "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44"
## [37] "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44" "RAM44"

Now to fix and then check both of the issues:

wetdat$Site_Name[wetdat$Site_Name == "RAM44"] <- "RAM-44"
wetdat$Site_Name[wetdat$Site_Name == "SEN-3"] <- "SEN-03"
wetdat$Site_Name[which(!grepl("-", wetdat$Site_Name))] #character(0) means nothing returned
wetdat$Site_Name[which(nchar(wetdat$Site_Name)!=6)]
table(wetdat$Site_Name)
sort(unique(wetdat$Site_Name)) #another way to see all the unique values in a column

Problems solved (character(0) means no records returned). Note that there are other, more powerful solutions using regular expressions (eg POSIX- or Perl-style patterns) and functions like gsub() or grep(), but they’re more involved and tricky to work with. I only use them when the solutions above don’t work. If you ever need to know more, a decent intro is here. I also provide a more involved example of using regular expressions for pattern matching in Answer 1 of today’s questions, but I’m not going to go too far down that rabbit hole during training (I can barely make these work myself, and always have to look them up).
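Just to give a flavor, here’s a sketch of what a regular expression check could look like for this dataset. The pattern (SEN or RAM, a dash, then exactly 2 digits) is my own construction for illustration, not from the training materials:

wetdat$Site_Name[!grepl("^(SEN|RAM)-\\d{2}$", wetdat$Site_Name)] # anything not matching SEN-## or RAM-##; returns character(0) after the fixes above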


Checking your work If you plan to calculate summary statistics, like average or standard error, or are doing a lot of filtering, reshaping, etc. to your data, you always want to check your work and make sure you’ve still got the correct number of sites, individuals, visits, etc. before moving on to the next step. Just because you didn’t get an error or warning doesn’t mean the code you ran worked the way you intended it to!

In the case of the wetdat, there should be 508 total rows, 8 sites, and three different years. The code below shows how to verify that. I also often skim the unique values in each field for anything weird, hence the sort(unique()) line.

dim(wetdat) # reports # rows then # cols
## [1] 508  11
nrow(wetdat) # reports # rows only
## [1] 508
length(unique(wetdat$Site_Name)) # number of unique values in Site_Name
## [1] 8
length(unique(wetdat$Year)) # number of unique years
## [1] 3
table(wetdat$Site_Name, wetdat$Year) # distribution of sites across years
##         
##          2011 2012 2017
##   RAM-05    0   44   54
##   RAM-41    0   33   40
##   RAM-44    0   34   45
##   RAM-53    0   48   50
##   RAM-62    0   26   26
##   SEN-01   34    0    0
##   SEN-02   41    0    0
##   SEN-03   33    0    0
sort(unique(wetdat$Common))[1:10] # see all the common names in the dataset (truncated to first 10 common names for printing here).
##  [1] "American mountain ash" "apple"                 "Asian bittersweet"    
##  [4] "bigleaf aster"         "bigtooth aspen"        "bird vetch"           
##  [7] "black chokeberry"      "black crowberry"       "black huckleberry"    
## [10] "black spruce"

Question 1: There’s a Latin name that needs to be fixed (hint look for non-letter symbols). How would you find that record?

Question 2: Now that you’ve found it, fix it, and check that you fixed it.



Joining data frames

Basics of joining tables Most of the data we collect in NETN are stored in relational databases, where plot-level data that never change (eg X, Y coords) are stored in 1 table (tbl_Location) and data related to each visit (also called events) are stored in other tables. Databases are a way to efficiently organize data (eg not a lot of duplication across tables), standardize values through lookup tables, and enforce rules like required fields. To work with the data stored in databases, we typically pull the individual tables into R and join them together. To do that, we use a column found in each table, called a primary key, which allows you to match the correct records in one table to the other. In our database, we typically use GUIDs (globally unique identifiers) as our keys, which I’ll point out in the next section.

There are different flavors of joining tables. The main joins we use are:
  • Left joins: take all of the records in the primary key on the left side data frame (ie the first data frame specified in the function) and find one or more matching records in the right side data frame (ie the second data frame specified). Any records in the left data frame that don’t have a match in the right data frame will still be included in the final result, but all of the columns for those records coming from the right data frame will be NA (blank). Records in the right data frame without a match in the left data frame are dropped from the result.
  • Right joins: basically the same as left joins, except the second data frame includes all records and the first only includes records with matching primary keys.
  • Inner joins: only take records from the left and right datasets that have matching primary keys; anything without a match on either side is dropped.
  • Full joins: take all of the records from both sides of the join. Records on either side without a match are included in the results with NAs in the cells that came from the table missing a match.

In practice, I use left and full joins the most often, so I’ll focus mostly on those in the rest of this module. We’ll try this out on simple datasets first, then we’ll build up to more complex examples with our forest data. Another great example with visuals can be found here.

tree <- data.frame(plot = c("p1","p1","p1","p2","p2","p3","p4","p4"), 
                   treenum = c(1, 2, 3, 11, 12, 21, 32, 33),
                   species = c("ACERUB" ,"PINSTR", "PINSTR", "QUERUB", "PICRUB", 
                               "PICRUB", "PICRUB", "ACERUB"))
tree

tree_data <- data.frame(plot = c("p1", "p1", "p1", "p1", "p1","p1", "p2", "p2", "p5", "p7"),
                        year = c(2012, 2012, 2012, 2016, 2016, 2016, 2012, 2012, 2016, 2016),
                        treenum = c(1, 2, 3, 1, 2, 3, 11, 12, 51, 71),
                        DBH = c(12.5, 14.5, 16.1, 12.8, 14.9, 17.2, 28.1, 35.4, 36.1, 45.2))

tree_data

tree_full_join <- full_join(tree, tree_data, by = c("plot", "treenum"))
tree_left_join <- left_join(tree, tree_data, by = c("plot", "treenum"))
nrow(tree_full_join)
## [1] 13
nrow(tree_left_join)
## [1] 11
print(tree_full_join)
##    plot treenum species year  DBH
## 1    p1       1  ACERUB 2012 12.5
## 2    p1       1  ACERUB 2016 12.8
## 3    p1       2  PINSTR 2012 14.5
## 4    p1       2  PINSTR 2016 14.9
## 5    p1       3  PINSTR 2012 16.1
## 6    p1       3  PINSTR 2016 17.2
## 7    p2      11  QUERUB 2012 28.1
## 8    p2      12  PICRUB 2012 35.4
## 9    p3      21  PICRUB   NA   NA
## 10   p4      32  PICRUB   NA   NA
## 11   p4      33  ACERUB   NA   NA
## 12   p5      51    <NA> 2016 36.1
## 13   p7      71    <NA> 2016 45.2
print(tree_left_join)
##    plot treenum species year  DBH
## 1    p1       1  ACERUB 2012 12.5
## 2    p1       1  ACERUB 2016 12.8
## 3    p1       2  PINSTR 2012 14.5
## 4    p1       2  PINSTR 2016 14.9
## 5    p1       3  PINSTR 2012 16.1
## 6    p1       3  PINSTR 2016 17.2
## 7    p2      11  QUERUB 2012 28.1
## 8    p2      12  PICRUB 2012 35.4
## 9    p3      21  PICRUB   NA   NA
## 10   p4      32  PICRUB   NA   NA
## 11   p4      33  ACERUB   NA   NA

Note the left join has fewer records and doesn’t include p5 and p7 from tree_data in the plot column. There’s also no tree data for plots p3 and p4 because nothing in tree_data matched them. The full join includes plots and tree numbers from both tables, but there are NAs where there wasn’t a matching record in one of the tables.
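For completeness, here’s a quick sketch of two other flavors on the same tables, using dplyr’s inner_join() and semi_join() (a filtering join that keeps matching rows from the left table without adding any columns):

tree_inner_join <- inner_join(tree, tree_data, by = c("plot", "treenum"))
nrow(tree_inner_join) # 8 rows: only plot/treenum combos found in both tables
tree_semi_join <- semi_join(tree, tree_data, by = c("plot", "treenum"))
nrow(tree_semi_join) # 5 rows: tree rows with a match, but no columns added from tree_data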


Advanced table joins Let’s load some example forest data, and use the primary keys to join them together. The tables we’ll join are Location, Events and Stand. The Location table includes plot level data that never changes, like GPS coordinates, elevation, etc. The Events table records every time a plot is sampled. The Stand table has a row for every time a plot is sampled and includes data like crown closure, index of deer impacts, % bare rock, etc. Their relationships are shown in the image below.

Database tables

Note the GUIDs that link the tables together, including Location_ID between tbl_Locations and tbl_Events, and Event_ID between tbl_Events and tbl_Stand_Data. These primary keys are what we will use to join the tables together in R as well. If you looked at these tables in Access, they’d look like the image below.

Data tables

Now we’re going to look at these same tables in R. First, load the tables by running the code below, and use head() or View() to look at the tables you just loaded.

library(tidyverse)
loc <- read.csv("./data/example_data/tbl_Locations_example.csv") # plot level data for forest plots in NETN
head(loc)
event <- read.csv("./data/example_data/tbl_Events_example.csv") # data on each visit in NETN forest plots
head(event)
stand <- read.csv("./data/example_data/tbl_Stand_Data_example.csv") # stand data collected in plots
head(stand)
View(loc) #Location_ID
View(event) #Location_ID links to tbl_Location; Event_ID links other visit-level tables to this table
View(stand) #Event_ID links stand table to event table. 
intersect(names(loc), names(event)) # way to check columns two tables have in common.
## [1] "Location_ID"
intersect(names(event), names(stand))
## [1] "Event_ID"

The intersect function tells you the elements that the 2 data frames share, in this case the column names they have in common. Not coincidentally, the results include the primary keys (eg Location_ID) that link the two tables. Now we’re going to merge the tables.
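A related base R function, setdiff(), returns the elements in the first argument that aren’t in the second, which is a quick way to see the columns unique to one table:

setdiff(names(event), names(loc)) # columns in event that loc doesn't have
setdiff(names(loc), names(event)) # columns in loc that event doesn't have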

# base R version
names(loc) # to help me pick the columns to include in the join
names(event)
loc_event_full <- merge(loc, event, by = "Location_ID", all.x = TRUE, all.y = TRUE) # base R full join
View(loc_event_full) # Now you can see all the times a plot has been sampled. You'll also see that some plots 
# don't have sample dates. That's because the loc table includes plots that were rejected and never sampled.
loc_event_left <- merge(loc, event, by = "Location_ID", all.x = TRUE, all.y = FALSE) # base R left join
nrow(loc_event_full) # same number of rows because all events have a location record
## [1] 164
nrow(loc_event_left) # same number of rows because all events have a location record
## [1] 164
# tidyverse version
loc_event_full_tidy <- full_join(loc, event, by = "Location_ID")
nrow(loc_event_full_tidy) # same number of rows because all events have a location record
## [1] 164
loc_event_left_tidy <- left_join(loc, event, by = "Location_ID")
nrow(loc_event_left_tidy) # same number of rows because all events have a location record
## [1] 164

Note that the merge() function is the base R way of joining data frames. The all.x = TRUE and all.y = TRUE makes it a full join. A left join is all.x = TRUE, all.y = FALSE, and a right join is all.x = FALSE, all.y = TRUE. I tend to use merge() instead of the tidyverse join functions, because it’s a bit faster when you have huge datasets, and it’s what I’ve gotten used to. The exception, which I’ll show below, is if I’m chaining multiple steps that include a join via pipes.
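As a quick reference, here’s a sketch of how the all.x/all.y combinations map to the join types, using the tree tables from earlier:

merge(tree, tree_data, by = c("plot", "treenum"), all.x = TRUE, all.y = TRUE) # full join
merge(tree, tree_data, by = c("plot", "treenum"), all.x = TRUE, all.y = FALSE) # left join
merge(tree, tree_data, by = c("plot", "treenum"), all.x = FALSE, all.y = TRUE) # right join
merge(tree, tree_data, by = c("plot", "treenum"), all.x = FALSE, all.y = FALSE) # inner join (the default)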

Let’s filter the location data to only include ACAD plots, and the event table to only include non-QAQC visits (we resample 5% of plots per year; Event_QAQC = TRUE marks the 2nd visit).

# Base R approach
locev <- merge(loc[loc$Park_Code == "ACAD", ], 
               event[event$Event_QAQC == FALSE, ], 
               all.x = TRUE, all.y = FALSE)

# Tidyverse approach
loc_ACAD <- loc %>% filter(Park_Code == "ACAD")
locev_tidy <- event %>% filter(Event_QAQC == FALSE) %>% 
                        left_join(loc_ACAD, ., by = "Location_ID")

nrow(locev) 
nrow(locev_tidy)
View(locev)
View(locev_tidy)

Note the use of “.” above. Functions in the tidyverse all assume the dataset to the left of the pipe will go into the first argument in the function following the pipe. When that’s not the case, such as in the left_join above, you can use the “.” to specify where to use the dataset.
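Here’s a minimal sketch with throwaway data frames (df1 and df2 are hypothetical) showing the “.” dropping the piped data into the second argument:

df1 <- data.frame(ID = 1:3, x = c("a", "b", "c"))
df2 <- data.frame(ID = 2:4, y = c(10, 20, 30))
df2 %>% filter(y > 10) %>% left_join(df1, ., by = "ID") # the filtered df2 becomes the right-side table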

Question 3: How would you take only the plots on MDI (eg Unit_ID = “ACAD_MDI”) in the loc table and join them to their corresponding visits in the events table?

Question 4: How would you join the stand data to the locev table?



Data Summary II

Summarizing Data On the first day, we talked about summarizing data for the entire dataset using mean, min, max, etc. Now we’re going to talk about summarizing data for groups within your dataset using the group_by() and summarize() functions from dplyr (a core tidyverse package). The group_by() function allows you to specify the columns you want to summarize over. For example, you might group by site and count up the number of trees or critters observed in each site. Other examples include grouping by year, park unit, watershed, stream, etc. The process is very similar to using Totals in MS Access queries or subtotals in Excel, although the tools in R are more powerful, flexible, and efficient.

The summarize() function that we’re going to use is similar to the mutate() function we saw earlier. The main difference is that mutate() creates a new field that always has exactly as many rows as the original dataset, while summarize() returns only as many rows as there are unique combinations of groups in your group_by(). For example, if your grouping variables are stream and year, and you have 3 streams and 2 years, summarize will return 6 rows (3 streams X 2 years), as sketched below.
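A minimal sketch of that row math, using a made-up critter dataset:

critters <- data.frame(stream = rep(c("S1", "S2", "S3"), each = 2),
                       year = rep(c(2020, 2021), times = 3),
                       count = c(5, 7, 3, 4, 8, 6))
critters %>% group_by(stream, year) %>% 
             summarize(total = sum(count), .groups = 'drop') # 6 rows: 3 streams X 2 years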

To demonstrate how this works, let’s load a cleaned version of the wetland dataset from before. Let’s start by counting the number of species found in each site by year.

wetdat <- read.csv('./data/example_data/ACAD_wetland_data_clean.csv')
head(wetdat)
wetsum <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>% 
                     group_by(Site_Name, Year) %>% 
                     summarize(numspp = sum(present),
                               .groups = 'drop')
head(wetdat)
##   Site_Name Site_Type          Latin_Name           Common Year PctFreq Ave_Cov
## 1    SEN-01  Sentinel         Acer rubrum        red maple 2011       0    0.02
## 2    SEN-01  Sentinel         Amelanchier     serviceberry 2011      20    0.02
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary 2011      80    2.22
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry 2011     100    2.64
## 6    SEN-01  Sentinel        Carex exilis    coastal sedge 2011      60    6.60
##   Invasive Protected  X_Coord Y_Coord
## 1    FALSE     FALSE 574855.5 4911909
## 2    FALSE     FALSE 574855.5 4911909
## 3    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 5    FALSE     FALSE 574855.5 4911909
## 6    FALSE     FALSE 574855.5 4911909

In the code above, we added a column named “present” using mutate, and we set present to 1 if a species has an Ave_Cov > 0 and isn’t NA. In case you’ve never seen if/then or if/else statements, these are conditional statements that run a different set of commands depending on whether an expression is true or false. They’re foundational to most programming and are available in Excel, Access, R, and most other languages. In the code above, ifelse() evaluates the condition (Ave_Cov > 0 & !is.na(Ave_Cov)); if the result is TRUE, it returns a 1, and if FALSE, it returns a 0.
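To see the ifelse() logic in isolation, here’s a tiny sketch on a throwaway vector:

x <- c(2, NA, 0)
ifelse(x > 0 & !is.na(x), 1, 0) # returns 1 0 0; the NA becomes 0 instead of staying NA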

If you’re unclear on what the mutate did, you can break the steps up via the code below.

wetdat <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) 
wetdat_sum <- wetdat %>% group_by(Site_Name, Year) %>% 
                         summarize(numspp = sum(present),
                                  .groups = 'drop')

After the mutate, we then set the grouping to be by Site_Name and Year and then summarized the number of species per site by summing the present column.

A couple of things to note:
  1. The dplyr package recently added a .groups argument to summarize(). If you don’t specify it, you get a message about the default grouping that was used. For 99% of the cases I use summarize for, .groups = 'drop' is what I want to happen. You don’t have to specify it if you don’t mind the chatty console (tidyverse functions are notoriously chatty).
  2. When you use group_by() and summarize(), any columns not mentioned in either of those two functions get dropped from the final output. For example, notice that PctFreq, Invasive, and Protected are not in the final output. You can keep fields like these as long as they’re consistent within your grouping variables. For example, if we wanted to include Site_Type (Sentinel or RAM), which is consistent within each site, we can either add it to the group_by() statement or add it to the summarize() statement, as demonstrated below.
wetsum_opt1 <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>% 
                          group_by(Site_Name, Site_Type, Year) %>% 
                          summarize(numspp = sum(present),
                                    .groups = 'drop')

wetsum_opt2 <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>% 
                          group_by(Site_Name, Year) %>% 
                          summarize(Site_Type = first(Site_Type), 
                                    numspp = sum(present),
                                   .groups = 'drop')

Taking this a step further, let’s pretend that we want to know what the average number of species per site was, and whether this has changed over time. First we need to remove the Sentinel site data, because we only have 1 visit for those sites. Then we’ll calculate the mean and standard error for years 2012 and 2017 across the RAM sites.

wetsum_pre <- wetdat %>% filter(Site_Type == "RAM") %>% 
                         mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>% 
                         group_by(Site_Name, Site_Type, Year) %>% 
                         summarize(numspp = sum(present), 
                                   .groups = 'drop')

wetsum_yr <- wetsum_pre %>% group_by(Year) %>% 
                            summarize(mean_numspp = mean(numspp),
                                      numplots = n(), 
                                      se_numspp = sd(numspp)/sqrt(numplots), 
                                      .groups = 'drop')
wetsum_yr
## # A tibble: 2 x 4
##    Year mean_numspp numplots se_numspp
## * <int>       <dbl>    <int>     <dbl>
## 1  2012          37        5      3.97
## 2  2017          43        5      4.86
A couple of things to note here:
  1. Note the use of n() in the second step. That counts the number of rows within each group. You do have to be careful when you use n(), because it also counts NAs. If you don’t want NAs to be counted, use an approach like how I calculated present, then numspp (see the sketch after this list).
  2. Note that I created numplots and then used it in the following line within the same summarize() call. That’s a handy trick with both summarize and mutate- as soon as you create them, you can call them to calculate other metrics.
  3. While I did this in 2 steps for clarity, this could all have been 1 step. In general, I try to keep my pipes relatively short, so they’re easier to understand, test incrementally, and debug if something breaks. There’s no hard rule on this, but I generally keep pipes to 4 or fewer steps. If I need more, I break the code up into separate steps.
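Here’s a minimal sketch of the difference between counting rows and counting non-missing values (x is a throwaway vector; within summarize(), n() behaves like length()):

x <- c(4, NA, 7)
length(x) # 3: counts every element, like n() does within a group
sum(!is.na(x)) # 2: counts only the non-NA values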

Question 5: How would you calculate the min and max number of species by year?

Question 6: What percent of RAM sites had an invasive species present in 2012 and 2017?



Saving tabular data to file

The last thing today is how to write the data you’re working on to a file in your computer.

The following code saves the wetland data using a different file name:

write.csv(wetdat, "./data/ACAD_wetland_data2.csv")

You should now see a new file in your data folder.
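One thing to be aware of: by default, write.csv() adds the data frame’s row names (usually just row numbers) as an unnamed first column in the file. If you don’t want that, set row.names = FALSE:

write.csv(wetdat, "./data/ACAD_wetland_data2.csv", row.names = FALSE)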

We’re not actually going to use that file for anything, so you can delete it now (either manually or by running the following code).

file.remove("./data/ACAD_wetland_data2.csv")

Assignments

Please complete the assignments below (they’re ordered from highest to lowest priority). Note that the readings are to expose you to new topics, but you’re not expected to understand or digest 100% of the material.

Day 3

Data Wrangling II

Reshaping Data I Another important concept in data wrangling is reshaping data. Datasets are usually described as long or wide. The long form, which is the structure database tables often take, consists of each row being an observation (ie if there are 3 visits to a site, there are 3 rows in the data) and each column being a variable (eg plot, cycle, number of critters, % plant cover, etc). In summary tables, we often want to reshape the data to be wide, so that there’s only one row per plot, a column for each visit or species, and the abundance of that species as the value in the cell.

We’re going to work with a fake bat dataset to see how this works. To get started, load the dataset and packages, as shown below.

library(tidyverse)
bat_long <- read.csv("./data/example_data/fake_bats.csv")
head(bat_long)
##         Site Julian Year                  Latin              Common
## 1 BrownMtnCR    195 2015          Myotis leibii        small-footed
## 2 BrownMtnCR    214 2015          Myotis leibii        small-footed
## 3 BrownMtnCR    237 2016          Myotis leibii        small-footed
## 4 BrownMtnCR    230 2017       Myotis lucifugus        little brown
## 5 BrownMtnCR    201 2016      Lasiurus cinereus               hoary
## 6 BrownMtnCR    230 2018 Myotis septentrionalis northern long-eared

The data in this example dataset are simplified capture data. Every row represents an individual bat that was captured at a given site and date. We want to turn this into a wide data frame that has a column for every species, and the number of individuals of that species that were caught for each year and site combination. Before we start, we’ll make a species code that will be easier to work with (R doesn’t like spaces). We’ll also summarize the data, so we have a count of the number of individuals of each species found per year.

bat_long <- bat_long %>% mutate(sppcode = toupper(paste0(substr(word(Latin, 1), 1, 3),
                                                         substr(word(Latin, 2), 1, 3))))

bat_long_sum <- bat_long %>% group_by(Site, Year, sppcode) %>% 
                             summarize(numindiv = sum(!is.na(sppcode)),
                                       .groups = 'drop')

head(bat_long_sum)
## # A tibble: 6 x 4
##   Site        Year sppcode numindiv
##   <chr>      <int> <chr>      <int>
## 1 BrownMtnCR  2013 MYOLEI         1
## 2 BrownMtnCR  2014 MYOLUC         2
## 3 BrownMtnCR  2015 MYOLEI         2
## 4 BrownMtnCR  2016 LASCIN         1
## 5 BrownMtnCR  2016 MYOLEI         2
## 6 BrownMtnCR  2016 MYOLUC         1

If you didn’t follow what I just did, here’s what happened above. First, we used the word() function (another tidyverse function) to pull out the genus (word(Latin, 1)) and species (word(Latin, 2)). The substr() function takes the first through third characters of each genus and species. The paste0() function pastes the first three characters of genus and species together, and toupper() capitalizes the result. Note that the 0 after paste means not to separate the elements of the paste. If you wanted to separate the elements with an underscore, you’d use paste instead of paste0, like paste(element1, element2, sep = "_").
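To see those pieces one at a time, here’s a sketch with a single name:

latin <- "Myotis leibii"
word(latin, 1) # "Myotis"
substr(word(latin, 1), 1, 3) # "Myo"
toupper(paste0(substr(word(latin, 1), 1, 3), substr(word(latin, 2), 1, 3))) # "MYOLEI"
paste("MYO", "LEI", sep = "_") # "MYO_LEI" - paste with a separator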

To make this wide, we’re essentially going to make a pivot table using functions in tidyr (another tidyverse package). There are two functions you can use. The spread() function was original to tidyr, is what I know the best, and still performs faster on larger datasets than pivot_wider(), the replacement for spread. I show both options, so you can use whichever function makes the most sense to you.

bat_wide <- bat_long_sum %>% spread(key = "sppcode", value = "numindiv") #retired spread function
bat_wide2 <- bat_long_sum %>% pivot_wider(names_from = "sppcode", values_from = "numindiv") #replacement in tidyr
bat_wide
## # A tibble: 21 x 6
##    Site        Year LASCIN MYOLEI MYOLUC MYOSEP
##    <chr>      <int>  <int>  <int>  <int>  <int>
##  1 BrownMtnCR  2013     NA      1     NA     NA
##  2 BrownMtnCR  2014     NA     NA      2     NA
##  3 BrownMtnCR  2015     NA      2     NA     NA
##  4 BrownMtnCR  2016      1      2      1     NA
##  5 BrownMtnCR  2017     NA      1      1     NA
##  6 BrownMtnCR  2018     NA      1     NA      1
##  7 BrownMtnCR  2019      1     NA     NA     NA
##  8 BrownMtnGH  2013     NA     NA     NA      1
##  9 BrownMtnGH  2014     NA      2      1     NA
## 10 BrownMtnGH  2015      1     NA     NA      1
## # ... with 11 more rows

Note the NAs for the various species. That’s because the original dataset didn’t have a data point for that combination of site/species/year. The tidyr package assigns NA when reshaping for combinations not represented in the original dataset. If there are sites that weren’t sampled in a certain year, NA makes sense. However, if all sites were sampled and there just weren’t any individuals found, then filling those with 0 makes sense, and you can tell tidyr to fill blanks with 0 when you reshape the data. This is a handy feature, but it’s not to be abused. You should always think carefully before automatically filling blanks with 0s.

Here’s the same reshape code from before, but now we fill NAs with 0.

bat_wide_fill <- bat_long_sum %>% spread(key = "sppcode", value = "numindiv", fill = 0) #retired spread function
head(bat_wide_fill)
## # A tibble: 6 x 6
##   Site        Year LASCIN MYOLEI MYOLUC MYOSEP
##   <chr>      <int>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 BrownMtnCR  2013      0      1      0      0
## 2 BrownMtnCR  2014      0      0      2      0
## 3 BrownMtnCR  2015      0      2      0      0
## 4 BrownMtnCR  2016      1      2      1      0
## 5 BrownMtnCR  2017      0      1      1      0
## 6 BrownMtnCR  2018      0      1      0      1
bat_wide2_fill <- bat_long_sum %>% pivot_wider(names_from = "sppcode", values_from = "numindiv", 
                                               values_fill = 0) #replacement in tidyr

Now, say you want to take the wide version of the dataframe and make it long, so that you have a row for every site, year, and species combination. You can use gather(), the original tidyr function, or pivot_longer(), the new one.

bat_long2 <- bat_wide_fill %>% gather(key = "sppcode", value = "numindiv", -Site, -Year) # old way
head(bat_long2)
## # A tibble: 6 x 4
##   Site        Year sppcode numindiv
##   <chr>      <int> <chr>      <dbl>
## 1 BrownMtnCR  2013 LASCIN         0
## 2 BrownMtnCR  2014 LASCIN         0
## 3 BrownMtnCR  2015 LASCIN         0
## 4 BrownMtnCR  2016 LASCIN         1
## 5 BrownMtnCR  2017 LASCIN         0
## 6 BrownMtnCR  2018 LASCIN         0
names(bat_wide_fill)
## [1] "Site"   "Year"   "LASCIN" "MYOLEI" "MYOLUC" "MYOSEP"
bat_long2b <- bat_wide_fill %>% pivot_longer(cols=c(LASCIN, MYOLEI, MYOLUC, MYOSEP), 
                                             names_to = "sppcode", 
                                             values_to = "numindiv") #new way
head(bat_long2b)
## # A tibble: 6 x 4
##   Site        Year sppcode numindiv
##   <chr>      <int> <chr>      <dbl>
## 1 BrownMtnCR  2013 LASCIN         0
## 2 BrownMtnCR  2013 MYOLEI         1
## 3 BrownMtnCR  2013 MYOLUC         0
## 4 BrownMtnCR  2013 MYOSEP         0
## 5 BrownMtnCR  2014 LASCIN         0
## 6 BrownMtnCR  2014 MYOLEI         0
nrow(bat_long_sum) # dataset before made wide, then long
## [1] 34
nrow(bat_long2) # dataset that has every species represented for every year and every site
## [1] 84

Both functions above are keeping a Site and Year column. In the gather function, we’re telling R which ones not to make long. In the pivot_longer, we’re telling R which columns to gather, rather than which columns to ignore. I show both the old and new approach, because the pivot_### functions are still relatively new, slower with big datasets, and I find more examples online of the old approaches. But, personally I find the pivot_### functions to be more intuitive than gather/spread, so I’ve started using them more.

Another way to do this, since there’s almost always more than one way to do something in R, is to use expand() and joins to get all combinations of factors in your data. There’s an added step of having to replace NAs from the join with 0.

#First figure out how many levels you expect
with(bat_long, length(unique(Site)) * length(unique(Year)) * length(unique(sppcode))) #84
bat_all_comb <- bat_long_sum %>% expand(Site, Year, sppcode)
nrow(bat_all_comb) #84

bat_long_comb <- left_join(bat_all_comb, bat_long_sum, by = c("Site", "Year", "sppcode"))
bat_long_comb$numindiv[is.na(bat_long_comb$numindiv)] <- 0 # replace NA with 0
nrow(bat_long2) == nrow(bat_long_comb) #TRUE
A couple of things to note in the above code:
  • Note the use of with(). This is a base R way of only having to specify a data frame once, and is handy if you have to reference multiple columns from the same data frame in a line of code.
  • The expand() function is also introduced here. Expand is a way to create a data frame of all combinations of the factors listed, including any combinations not found in the existing data frame. For example, if no bats were found in site BubPdS in 2018, and so 2018 wasn’t in the data for BubPdS, expand will add it, if 2018 was listed for any other site.
  • If you only want site/year combinations that were in the original dataset (so no 2018 in BubPdS), use distinct() instead of expand().
  • When you use a left join and a row doesn’t have data in the right-hand dataset, R assigns NA to the numindiv column for those cases. The 2nd to last line above identifies the NAs and replaces them with 0. There are tidyverse ways of doing this with replace_na() or replace(), but I find the code above more intuitive (this may be because I’d used it for years pre-tidyverse).

Question 1: How would you create a dataset that has a column for every sample year (ie 2013-2019), a row for every site, and the number of northern long-eared bats found as the value?

You can also pivot on 2 columns using pivot_wider. To demonstrate, let’s pivot on sppcode and Year.

head(bat_long_sum)
## # A tibble: 6 x 4
##   Site        Year sppcode numindiv
##   <chr>      <int> <chr>      <int>
## 1 BrownMtnCR  2013 MYOLEI         1
## 2 BrownMtnCR  2014 MYOLUC         2
## 3 BrownMtnCR  2015 MYOLEI         2
## 4 BrownMtnCR  2016 LASCIN         1
## 5 BrownMtnCR  2016 MYOLEI         2
## 6 BrownMtnCR  2016 MYOLUC         1
bat_wide2 <- bat_long_sum %>% pivot_wider(names_from = c(sppcode, Year), 
                                          values_from = numindiv,
                                          values_fill = 0)
head(bat_wide2)
## # A tibble: 3 x 21
##   Site  MYOLEI_2013 MYOLUC_2014 MYOLEI_2015 LASCIN_2016 MYOLEI_2016 MYOLUC_2016
##   <chr>       <int>       <int>       <int>       <int>       <int>       <int>
## 1 Brow~           1           2           2           1           2           1
## 2 Brow~           0           1           0           0           0           1
## 3 BubP~           1           1           1           0           2           0
## # ... with 14 more variables: MYOLEI_2017 <int>, MYOLUC_2017 <int>,
## #   MYOLEI_2018 <int>, MYOSEP_2018 <int>, LASCIN_2019 <int>, MYOSEP_2013 <int>,
## #   MYOLEI_2014 <int>, LASCIN_2015 <int>, MYOSEP_2015 <int>, LASCIN_2017 <int>,
## #   MYOLEI_2019 <int>, MYOLUC_2015 <int>, MYOLUC_2018 <int>, MYOLUC_2019 <int>
The above steps may seem like a lot of work for something you could do by hand in Excel. But if you have a huge dataset that only stores records of species that were present, this can really come in handy. It’s also repeatable and transparent: if you realize you did something wrong, you can fix the code and rerun everything, rather than having to redo all of the work by hand.


Reshaping Data II Let’s work with a larger dataset to demonstrate these concepts again. Say we want a list of all invasive species found in ACAD wetland sites in 2017, and we want to calculate the proportion of plots each invasive species occurred in. If we don’t find an invasive in a plot, there’s no record in the data. Let’s load the wetland dataset from before and work through that example.

wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
head(wetdat)
##   Site_Name Site_Type          Latin_Name           Common Year PctFreq Ave_Cov
## 1    SEN-01  Sentinel         Acer rubrum        red maple 2011       0    0.02
## 2    SEN-01  Sentinel         Amelanchier     serviceberry 2011      20    0.02
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary 2011      80    2.22
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry 2011     100    2.64
## 6    SEN-01  Sentinel        Carex exilis    coastal sedge 2011      60    6.60
##   Invasive Protected  X_Coord Y_Coord
## 1    FALSE     FALSE 574855.5 4911909
## 2    FALSE     FALSE 574855.5 4911909
## 3    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 5    FALSE     FALSE 574855.5 4911909
## 6    FALSE     FALSE 574855.5 4911909
# Prepare data for pivot
wetdat_prep <- wetdat %>% mutate(sppcode = toupper(paste0(substr(word(Latin_Name, 1), 1, 3),
                                                          substr(word(Latin_Name, 2), 1, 3))),
                                 present = ifelse(Ave_Cov>0, 1, 0)) %>% 
                                 filter(Year == 2017) 

# Pivot wide, so every species has a column
wetdat_wide <- wetdat_prep %>% select(Site_Name, Year, sppcode, present) %>% 
                               pivot_wider(names_from = "sppcode", values_from = "present", 
                                           values_fill = 0)
head(wetdat_wide)
## # A tibble: 5 x 106
##   Site_Name  Year ACERUB ALNINC AMENA BERTHU CARFOL CARTRI CARNA CHACAL CORCAN
##   <chr>     <int>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
## 1 RAM-41     2017      1      1     1      1      1      1     1      1      1
## 2 RAM-53     2017      1      1     1      1      1      0     0      1      1
## 3 RAM-62     2017      1      0     0      0      0      1     0      1      1
## 4 RAM-44     2017      1      1     0      0      0      0     0      1      0
## 5 RAM-05     2017      1      1     1      0      1      1     0      1      1
## # ... with 95 more variables: DOEUMB <dbl>, DRYCRI <dbl>, ERIANG <dbl>,
## #   GAYBAC <dbl>, ILEMUC <dbl>, ILEVER <dbl>, JUNEFF <dbl>, KALANG <dbl>,
## #   LARLAR <dbl>, MAICAN <dbl>, MAITRI <dbl>, MYRGAL <dbl>, ONOSEN <dbl>,
## #   OSMCIN <dbl>, PICGLA <dbl>, PICMAR <dbl>, PRENA <dbl>, RHOCAN <dbl>,
## #   RHOGRO <dbl>, ROSNIT <dbl>, RUBHIS <dbl>, SOLULI <dbl>, SPIALB <dbl>,
## #   SYMFOE <dbl>, THEPAL <dbl>, TRIBOR <dbl>, VACANG <dbl>, VACCOR <dbl>,
## #   VACMYR <dbl>, VACVIT <dbl>, VIBNUD <dbl>, AROMEL <dbl>, CALCAN <dbl>,
## #   CARATL <dbl>, CARLAS <dbl>, CARSTR <dbl>, CELORB <dbl>, DANSPI <dbl>,
## #   DICACU <dbl>, DROINT <dbl>, DROROT <dbl>, DULARU <dbl>, EPILEP <dbl>,
## #   ERITEN <dbl>, ERIVIR <dbl>, JUNCAN <dbl>, LYSTER <dbl>, MORPEN <dbl>,
## #   OCLNEM <dbl>, OSMREG <dbl>, PICRUB <dbl>, PINSTR <dbl>, POGOPH <dbl>,
## #   RHAFRA <dbl>, SCICYP <dbl>, SPITOM <dbl>, SYMNOV <dbl>, TRINA <dbl>,
## #   TYPLAT <dbl>, VACMAC <dbl>, GAUHIS <dbl>, MONUNI <dbl>, VACOXY <dbl>,
## #   APOAND <dbl>, BETPOP <dbl>, CARLAC <dbl>, EQUARV <dbl>, EURMAC <dbl>,
## #   FESFIL <dbl>, IRIVER <dbl>, POPGRA <dbl>, POPTRE <dbl>, QUERUB <dbl>,
## #   RANACR <dbl>, ROSVIR <dbl>, RUBNA <dbl>, SALPET <dbl>, SOLRUG <dbl>,
## #   UTRCOR <dbl>, VEROFF <dbl>, VICCRA <dbl>, AREBUL <dbl>, CALTUB <dbl>,
## #   CAREXI <dbl>, CARMAG <dbl>, CARPAU <dbl>, EMPNIG <dbl>, EURRAD <dbl>,
## #   GLYSTR <dbl>, KALPOL <dbl>, `LON-` <dbl>, OCLX <dbl>, RHYALB <dbl>,
## #   SARPUR <dbl>, XYRMON <dbl>
# Pivot long, so every site has a species with 1 or 0 to denote present/absent
wetdat_long <- wetdat_wide %>% pivot_longer(cols = c(-Site_Name, -Year), names_to = "sppcode", 
                                            values_to = "present") %>% 
                               arrange(Site_Name, sppcode)

# Now we need to filter the data to only include invasive species
# First make a list of the invasive species
inv_list <- unique(wetdat_prep$sppcode[wetdat_prep$Invasive == TRUE])
inv_list
## [1] "BERTHU" "CELORB" "RHAFRA" "LON-"
#Now filter on that invasive list
wetdat_long_inv <- wetdat_long %>% filter(sppcode %in% inv_list)


# Now summarize to count the percent of plots each species was found in
wetdat_sum <- wetdat_long_inv %>% group_by(sppcode) %>% 
                                  summarize(pct_sites = sum(present)/n()*100, 
                                           .groups = 'keep')
wetdat_sum
## # A tibble: 4 x 2
## # Groups:   sppcode [4]
##   sppcode pct_sites
##   <chr>       <dbl>
## 1 BERTHU         40
## 2 CELORB         20
## 3 LON-           20
## 4 RHAFRA         40

Note the use of n() in the summarize() for wetdat_sum. That function counts the number of rows in each group. It’s handy, but be careful: it counts every row, including rows with NAs. If that’s what you want, n() is great. If it isn’t, you need another approach, like sum(!is.na(column)).

Question 2: How would you create a wide matrix with all protected species as columns, sites and years sampled as rows, and percent cover as the value?


Plotting with ggplot2

Intro to ggplot To get started with ggplot2, we need to make a dataset to plot. Instead of starting with a dataset that’s ready to be plotted, let’s use what we’ve learned to make a real dataset from NETN forest data. We’re going to create the plot below, which shows the mean density of seedlings and saplings by size class for the most recent 4 years of sampling at ACAD.

Example plot

Note this is also a demonstration of how building your own packages can really save you time on tasks that you’re likely to perform over and over.

First you need to load the forestNETN package and import the data.

library(forestNETNarch)
library(tidyverse)
importCSV("./data/NETN_forest_csvs")

Now we’ll use one of the functions in the forestNETN package to pull together the regeneration data. The resulting dataset will have a row for every species of tree seedling or sapling found in a given plot from 2016-2019. We’ll use that dataset to sum up the total number of seedlings and saplings by size class for each plot.

acad_regen <- joinRegenData(park = "ACAD", from = 2016, to = 2019, canopyForm = "canopy")

head(acad_regen) # this dataframe includes species, so there are multiple rows per plot. 

# We need to group_by Plot_Name, and sum seedling densities to get plot-level seedling densities.
acad_regen2 <- acad_regen %>% group_by(Plot_Name) %>% 
                              summarize(seed15_30 = sum(seed15.30),
                                        seed30_100 = sum(seed30.100),
                                        seed100_150 = sum(seed100.150),
                                        seed150p = sum(seed150p),
                                        sapling = sum(sap.den),
                                        .groups = 'drop') %>% 
                              arrange(Plot_Name)

# This is the shortest way to get a dataset that has a mean seedling density for each size class and a se for each size class. 
# Note we want to make this long, so every plot has each of the size classes
acad_regen_long <- acad_regen2 %>% pivot_longer(-Plot_Name, names_to = "size_class", values_to = "density")

# Now we calculate the mean and se for each size class at the park level
acad_stats <- acad_regen_long %>% group_by(size_class) %>% 
                                  summarize(mean_dens = mean(density),
                                            se_dens = sd(density)/sqrt(n()),
                                            .groups = 'drop')
# We have our dataset to plot. 

The pro of using ggplot is that you can get a plot with very little work. The con is that as soon as you want to change something, like removing the grid lines, customizing a legend, or changing the font size, it’s quite tedious to do. Using the default formatting that comes with ggplot, let’s plot the acad_stats dataset.

The first step in ggplot is to specify the data and the x and y (and grouping, if you have any) values with the ggplot() function. After you set that up, you have to decide what kind of plot you want to use. The most common ones I use are:
  • geom_bar makes a bar chart
  • geom_point plots points
  • geom_line plots lines
  • geom_boxplot makes boxplots
  • geom_errorbar adds error bars

As long as the mapping is the same (eg x/y values fit on the plot), you can add multiple geoms, such as points and lines that connect the points, or geom_errorbar on top of geom_bar. You can even include different datasets. Just know that the ranges of the x and y axes default to the x and y you set in ggplot(). You can manually change the range of an axis to fix that, but it requires more tinkering. You can also keep adding things to the object with +.

acad_base <- ggplot(data = acad_stats, aes(x = size_class, y = mean_dens))

# Make a point plot
acad_base + geom_point()

# Make a bar chart
acad_bar <- acad_base + geom_bar(stat = 'identity')
acad_bar

# Add error bars to the previous bar chart
acad_bar + geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
           labs(x = "Regeneration Size Class", y = "Stems per sq.m")

Note that the x-axis plots the size classes alphabetically instead of from small to big size classes. To fix this (this is where ggplot starts to get fiddly), we need to make the size_class column an ordered factor and set the order of its levels.

acad_stats$size_class_fact <- ordered(acad_stats$size_class, 
                                      levels = c('seed15_30', 'seed30_100', 'seed100_150', 'seed150p', 'sapling'))

acad_bar2 <- ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
             geom_bar(stat = 'identity')+ 
             geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
             labs(x = "Regeneration Size Class", y = "Stems per sq.m")

acad_bar2

That’s more like it, but I still don’t like how the figure looks. For example, I don’t like the gridlines in the figure. Those settings are buried in theme(), and to change them you need to add + theme(change_settings_here).

acad_bar2 + theme(panel.grid.minor = element_blank(),
                  panel.grid.major = element_blank())

That’s a little better, but I still don’t like it. And, I never remember exactly what I need to type to remove the gridlines, take the grey fill off the background, etc. So I made my own theme that does all of that in the forestNETN package, called theme_FHM.

acad_bar2+theme_FHM()

Much better! But I still prefer the bars be a different color than grey. Let’s fill the bars with a nice blue color. I’ll start over on the code too, so you can see it all together. Note the use of “#5E79A8” as the bar fill color. I used the hex code for a blue-grey color I like. Hex codes are a universal color code that specifies the intensity of red, green, and blue, and makes it easier to use the same color across multiple programs. If you want to pick your own color, you can use the HTML Color Codes website to pick a color and see its code. I also like the color picker, mixer and shades in w3schools.com.
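Incidentally, the same color can be specified a few ways in R. A quick sketch (rgb() below reproduces #5E79A8 exactly; the named color is just a similar blue, not an exact match):

barplot(1:3, col = c("#5E79A8", rgb(94, 121, 168, maxColorValue = 255), "steelblue"))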

acad_final <- ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
              geom_bar(stat = 'identity', fill = "#5E79A8")+
              geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
              labs(x = "Regeneration Size Class", y = "Stems per sq.m")+
              theme_FHM()

acad_final

Question 4: Take the acad_final graph and make 2 more changes to it. For example, make the line width on the error bars bigger, change the font size of the axis text bigger, tilt the x-axis labels (eg seed15_30) 45 degrees, rename x-axis labels, etc.


More advanced plotting in ggplot Another great feature of ggplot is that you can use grouping variables to code by color or shape, and quickly create multiple panels of plots using facet_wrap() or facet_grid(). For this section, let’s work with some water data from Acadia. We’ll need to load the NCRNWater package and data first. Data files are pretty big, so it might take a minute or two to load.

library(NCRNWater)

netnwd<-importNCRNWater(Dir = "./data/NETN_water_data", 
                        Data = "Water Data.csv", 
                        MetaData = "VizMetaData.csv")

You should see an object in your global environment called netnwd. All of the water data are stored within this object, and we’ll use getter functions in the water package to retrieve them. This is a different workflow than we’ve seen up until now. Hopefully it’s straightforward enough that you can get to the final dataset for plotting, which will have one row for every pH measurement taken in each annually sampled ACAD lake. The columns will contain the measurement and information about the site/sample.

To find the lakes that have been sampled every year, we first have to find all of the site codes in ACAD and filter to only include lakes. Then we use the frequency of measurements in the dataset to determine which sites are sampled every year.

# Create data frame that lists the sites and their type using the getSiteInfo getter.
acad_all_sites <- data.frame(sitecode = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteCode"),
                             type = getSiteInfo(netnwd, parkcode = "ACAD", info = "type"),
                             fullname = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteName"))
head(acad_all_sites)
##         sitecode   type              fullname
## 1 NETN_ACAD_ABIN Stream Aunt Betty Pond Inlet
## 2 NETN_ACAD_ANTB   Lake      Aunt Bettys Pond
## 3 NETN_ACAD_BOWL   Lake              The Bowl
## 4 NETN_ACAD_BRBK   Lake       Bear Brook Pond
## 5 NETN_ACAD_BRKB Stream       Breakneck Brook
## 6 NETN_ACAD_BRWN Stream          Browns Brook
# Filter data frame to only include lakes
acad_lakes <- acad_all_sites %>% filter(type == "Lake")  #list of site codes and full names for ACAD lakes
acad_lakes
##          sitecode type         fullname
## 1  NETN_ACAD_ANTB Lake Aunt Bettys Pond
## 2  NETN_ACAD_BOWL Lake         The Bowl
## 3  NETN_ACAD_BRBK Lake  Bear Brook Pond
## 4  NETN_ACAD_BUBL Lake      Bubble Pond
## 5  NETN_ACAD_EAGL Lake       Eagle Lake
## 6  NETN_ACAD_ECHO Lake        Echo Lake
## 7  NETN_ACAD_HODG Lake     Hodgdon Pond
## 8  NETN_ACAD_JORD Lake      Jordan Pond
## 9  NETN_ACAD_LBRK Lake  Lower Breakneck
## 10 NETN_ACAD_LHAD Lake    Lower Hadlock
## 11 NETN_ACAD_LONG Lake  Long Pond (MDI)
## 12 NETN_ACAD_LPIH Lake  Long Pond (IAH)
## 13 NETN_ACAD_ROUN Lake       Round Pond
## 14 NETN_ACAD_SAMP Lake Sargent Mtn Pond
## 15 NETN_ACAD_SEAL Lake   Seal Cove Pond
## 16 NETN_ACAD_SEAW Lake     Seawall Pond
## 17 NETN_ACAD_UBRK Lake  Upper Breakneck
## 18 NETN_ACAD_UHAD Lake    Upper Hadlock
## 19 NETN_ACAD_WHOL Lake  Witch Hole Pond
## 20 NETN_ACAD_WOOD Lake        Lake Wood
# Compile the water pH data using getWData function
acad_ph <- getWData(netnwd, parkcode = "ACAD", years = 2006:2019, charname = 'pH')

# Filter water data to only include lakes
acad_lakes_ph <- acad_ph %>% filter(Site %in% acad_lakes$sitecode)
table(acad_lakes_ph$Site) # 8 sites with >90 measurements
## 
## NETN_ACAD_ANTB NETN_ACAD_BOWL NETN_ACAD_BRBK NETN_ACAD_BUBL NETN_ACAD_EAGL 
##             29             20             29             94             95 
## NETN_ACAD_ECHO NETN_ACAD_HODG NETN_ACAD_JORD NETN_ACAD_LBRK NETN_ACAD_LHAD 
##             94             28             93             29             30 
## NETN_ACAD_LONG NETN_ACAD_LPIH NETN_ACAD_ROUN NETN_ACAD_SAMP NETN_ACAD_SEAL 
##             94              5             24             19             93 
## NETN_ACAD_SEAW NETN_ACAD_UBRK NETN_ACAD_UHAD NETN_ACAD_WHOL NETN_ACAD_WOOD 
##             30             24             95             91             24
# Turn the table above into a dataframe, and use it to filter the water data on.
annual_lakes <- data.frame(table(acad_lakes_ph$Site)) %>% filter(Freq > 90) 

# Filter acad_lakes_pH to only include the annual lakes and add column for month and year
acad_ph2 <- acad_lakes_ph %>% filter(Site %in% annual_lakes$Var1) %>% 
                              mutate(month = format(Date, "%m"),
                                     year = format(Date, "%Y"))
head(acad_ph2)
##         Date SampleDepth Depth.Units Value MQL UQL Censored ValueCen TextValue
## 1 2006-05-23  epilimnion        <NA> 6.570  NA  NA    FALSE    6.570      6.57
## 2 2006-06-21  epilimnion        <NA> 6.515  NA  NA    FALSE    6.515     6.515
## 3 2006-07-20  epilimnion        <NA> 6.610  NA  NA    FALSE    6.610      6.61
## 4 2006-08-10  epilimnion        <NA> 6.620  NA  NA    FALSE    6.620      6.62
## 5 2006-09-26  epilimnion        <NA> 6.120  NA  NA    FALSE    6.120      6.12
## 6 2006-10-17  epilimnion        <NA> 6.560  NA  NA    FALSE    6.560      6.56
##   Characteristic Category           Site Park month year
## 1             pH physical NETN_ACAD_BUBL ACAD    05 2006
## 2             pH physical NETN_ACAD_BUBL ACAD    06 2006
## 3             pH physical NETN_ACAD_BUBL ACAD    07 2006
## 4             pH physical NETN_ACAD_BUBL ACAD    08 2006
## 5             pH physical NETN_ACAD_BUBL ACAD    09 2006
## 6             pH physical NETN_ACAD_BUBL ACAD    10 2006
acad_ph_final <- left_join(acad_ph2, acad_lakes[, c("sitecode", "fullname")], 
                           by= c("Site" = "sitecode")) # how to join if columns have diff. names

head(acad_ph_final) #final dataset to plot
##         Date SampleDepth Depth.Units Value MQL UQL Censored ValueCen TextValue
## 1 2006-05-23  epilimnion        <NA> 6.570  NA  NA    FALSE    6.570      6.57
## 2 2006-06-21  epilimnion        <NA> 6.515  NA  NA    FALSE    6.515     6.515
## 3 2006-07-20  epilimnion        <NA> 6.610  NA  NA    FALSE    6.610      6.61
## 4 2006-08-10  epilimnion        <NA> 6.620  NA  NA    FALSE    6.620      6.62
## 5 2006-09-26  epilimnion        <NA> 6.120  NA  NA    FALSE    6.120      6.12
## 6 2006-10-17  epilimnion        <NA> 6.560  NA  NA    FALSE    6.560      6.56
##   Characteristic Category           Site Park month year    fullname
## 1             pH physical NETN_ACAD_BUBL ACAD    05 2006 Bubble Pond
## 2             pH physical NETN_ACAD_BUBL ACAD    06 2006 Bubble Pond
## 3             pH physical NETN_ACAD_BUBL ACAD    07 2006 Bubble Pond
## 4             pH physical NETN_ACAD_BUBL ACAD    08 2006 Bubble Pond
## 5             pH physical NETN_ACAD_BUBL ACAD    09 2006 Bubble Pond
## 6             pH physical NETN_ACAD_BUBL ACAD    10 2006 Bubble Pond

Now that we have our final dataset, let’s first create a plot that color-codes points by site.

ph_plot <- ggplot(acad_ph_final, aes(x = Date, y = Value, color = Site, shape = Site))+
           geom_point()+ 
           labs(y = "pH")+
           theme_bw()
ph_plot
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 8. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 186 rows containing missing values (geom_point).
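
If you really want a unique shape for all 8 sites, you can supply the shapes yourself via scale_shape_manual(). A minimal sketch (the shape codes below are arbitrary picks, not from the training materials):

ph_plot + scale_shape_manual(values = c(15, 16, 17, 18, 0, 1, 2, 5)) # one shape code per site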

There’s too much going on here to make sense of the data. Let’s split the data so that each site has its own plot.

ph_plot_site <- ggplot(acad_ph_final, aes(x = Date, y = Value))+
                geom_point(shape = 16, color = "LightBlue")+ 
                labs(y = "pH")+
                theme_bw()+
                facet_wrap(~fullname)
ph_plot_site

Now let’s add a loess smoother to explore trends. The great thing about ggplot is that you can keep adding to existing plots without having to repeat code.

ph_plot_site_smooth <- ph_plot_site + geom_smooth(method = 'loess', span = 1, color = "RoyalBlue", se = FALSE)

# vary the span to increase/decrease the waviness of the smooth
ph_plot_site_smooth + geom_smooth(method = 'loess', span = 0.1, color = "red", se = FALSE)+
                      geom_smooth(method = 'loess', span = 0.5, color = 'green', se = FALSE)      
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Now let’s angle the year labels and only include a label every 2 years.

ph_plot_site <- ggplot(acad_ph_final, aes(x = Date, y = Value))+
                geom_point(shape = 16, color = "LightBlue")+ 
                labs(y = "pH")+
                theme_bw()+
                facet_wrap(~fullname)+
                geom_smooth(method = 'loess', span = 0.5, color = 'RoyalBlue', se = FALSE)

ph_plot_site2 <- ph_plot_site + scale_x_date(date_breaks = "2 years",
                                             date_labels = "%Y")+
                                theme(axis.text.x = element_text(angle = 45, hjust = 1))

ph_plot_site2
## `geom_smooth()` using formula 'y ~ x'

I’m just barely scratching the surface of what you can do in ggplot2. For more examples of the types of graphs you can make, check out the R ggplot gallery.

Question 5: Filter the acad_ph_final data frame to only include 2019 data and create a plot with months as the X axis, and includes points, a loess smooth with span = 1, and a facet on site.



Working with Shapefiles

Creating a Shapefile in R Oftentimes I want to map the final output of summaries I do in R. While you can save tables in R and then join them to existing shapefiles in ArcGIS, I prefer to save my output as a shapefile directly from R. In case this is helpful to you, I’ll show you the easiest way I’ve found to do this.

The keys are to have the X, Y Coordinates included in your final summary, and to know which Coordinate Reference System to specify (abbreviated CRS). For example, parks in NETN cover 2 UTM Zones, and you have to specify the correct zone. The datum and projection for each park are listed below.

  • NAD83 UTM Zone 19N: ACAD, MIMA
  • NAD83 UTM Zone 18N: MABI, MORR, ROVA, SAGA, SARA, WEFA

Most of the major coordinate systems have a registered EPSG code assigned to them, including ours. EPSG stands for European Petroleum Survey Group, which seems pretty random, but it’s a useful set of universally unique codes for datums and projections that most GIS software can recognize. Using the EPSG code is the fastest way I’ve found to assign the coordinate system for our shapefile. For ACAD and MIMA, the EPSG code is 26919. For the other parks it’s 26918 (notice the last digit is the only difference, and it matches the UTM zone above).
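
If you want to see the datum and projection definition behind an EPSG code before you use it, you can print it in R. A quick check, assuming the sp package is installed (it loads with rgdal):

library(sp)
CRS("+init=epsg:26919") # prints the proj4 string for NAD83 UTM Zone 19N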

To start out, let’s load our wetland dataset again. Let’s say we want to create a shapefile for the most recent sample of each site and a column with the number of species found at each site.

library(tidyverse)
library(rgdal) # for GIS functions
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
table(wetdat$Site_Name, wetdat$Year) # SEN most recent is 2011, RAM is 2017
wetdat_rich <- wetdat %>% filter(Year %in% c(2011, 2017)) %>% 
                          mutate(present = ifelse(!is.na(Ave_Cov) & Ave_Cov > 0, 1, 0)) %>% 
                          group_by(Site_Name, Year, Site_Type, X_Coord, Y_Coord) %>% 
                          summarize(numspp = sum(present), 
                                    .groups = 'drop')
head(wetdat_rich)
## # A tibble: 6 x 6
##   Site_Name  Year Site_Type X_Coord  Y_Coord numspp
##   <chr>     <int> <chr>       <dbl>    <dbl>  <dbl>
## 1 RAM-05     2017 RAM       553186. 4899764.     54
## 2 RAM-41     2017 RAM       549667. 4903243.     40
## 3 RAM-44     2017 RAM       530882. 4877460.     45
## 4 RAM-53     2017 RAM       549032. 4909911.     50
## 5 RAM-62     2017 RAM       556731. 4908125.     26
## 6 SEN-01     2011 Sentinel  574855. 4911909.     34

Now we’re ready to save to shapefile.

# Step 1: Create spatial object by specifying the coordinates
coordinates(wetdat_rich) <- ~X_Coord + Y_Coord
head(wetdat_rich)
plot(wetdat_rich) # These points now plot in space (bottom left point is on IAH)

# Step 2: Define the coordinate system 
proj4string(wetdat_rich) <- CRS("+init=epsg:26919")

# Step 3: Write the spatial object to a shapefile
writeOGR(wetdat_rich, dsn = "./shapefiles", layer = "wetdat_rich", driver = "ESRI Shapefile", 
         overwrite_layer = TRUE) # overwrite is in case you need to update your shapefile.
A lot of this may be unfamiliar to you, especially since ArcGIS does a lot of this work behind the scenes.
  • First notice that after you assign the coordinates in Step 1, the X_Coord and Y_Coord columns disappear from your dataframe. That’s because you’ve just created a spatial object from a data frame, and the coordinates are now stored in a different slot than the data.
  • In the second step, we’re telling R that the datum and projection (in proj4 format) for our spatial dataset come from EPSG code 26919 (the code for NAD83, UTM Zone 19N).
  • Finally, we’re using the writeOGR command to output to a shapefile. The dsn is the location you’re saving the shapefile to. The layer name is the name you want the shapefile to be.
  • Notice that we’re using ./shapefiles, which is a folder within the project we created on day 1. During this step column names usually end up getting abbreviated (don’t worry if you get a warning message).
  • Finally, writeOGR by default will not allow you to overwrite an existing file. While that’s a safe default, sometimes you do want to overwrite a shapefile, which is what overwrite_layer = TRUE does. A newer alternative to this whole workflow using the sf package is sketched below.
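
For reference, the newer sf package can do the same three steps more compactly. A minimal sketch, assuming sf is installed and starting from wetdat_rich as it was before Step 1 (i.e., still a data frame with X_Coord/Y_Coord columns):

library(sf)
wet_sf <- st_as_sf(wetdat_rich, coords = c("X_Coord", "Y_Coord"), crs = 26919) # create spatial object and define CRS in one step
st_write(wet_sf, "./shapefiles/wetdat_rich_sf.shp", delete_layer = TRUE) # delete_layer = TRUE overwrites an existing shapefile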


GIS Mapping in R I only have time to scratch the surface on this topic, but I think it’s useful to know that anything you can do in ArcGIS with vector and raster data, you can do in R. GIS in R requires more understanding of what’s happening under the hood, and sometimes it isn’t worth the effort. It depends on what you need to do, how involved the process is, and whether you need to document the process and/or repeat it over and over. For example, I’ve found the raster package in R to be relatively easy to use and faster at processing large files (e.g., NLCD rasters), and I almost always use R when I’m working with rasters. In contrast, for maps that need a lot of manual tweaking (e.g., moving plot labels to prevent overlap, multiple legends, etc.), I tend to use ArcGIS. To give you an idea of what you can do in R, I’ll show you two ways to plot data spatially: first with ggplot, and second with leaflet.

Let’s load the wetland shapefile we just created, along with a layer that plots wetlands in the vegetation map and the park fee boundary.

# Load packages
library(rgdal) # for readOGR
library(ggplot2) #for ggplot functions
library(ggspatial) #to make ggplot functions spatial

# Load shapefiles
bounds <- readOGR("./data/example_data/ParkBndCARTO_Fee_2020.shp", layer = "ParkBndCARTO_Fee_2020",
                  GDAL1_integer64_policy = TRUE) # Don't ask me why this is here. All I know is it works.
head(bounds@data) # How you view the attribute table of a shapefile in R

wetmap <- readOGR("./data/example_data/ACAD_vegmap_wetlands.shp", layer = "ACAD_vegmap_wetlands",
                  GDAL1_integer64_policy = TRUE)

wet_rich_shp <- readOGR("./shapefiles/wetdat_rich.shp", layer = "wetdat_rich",
                        GDAL1_integer64_policy = TRUE)
head(wet_rich_shp@data)
head(wet_rich_shp@coords) # Where the coordinates are stored
plot(bounds)

To prepare the shapefile polygons for plotting in ggplot, we first have to fortify them, which converts each polygon layer into a plain data frame that ggplot can plot.

# Prepare shapefiles for plotting
bounds2 <- fortify(bounds)
## Regions defined for each Polygons
str(bounds2) # note the data frame structure that describes the boundary data

wetmap2 <- fortify(wetmap)
## Regions defined for each Polygons
str(wetmap2)

# Turn the wetland richness shapefile back into a dataset
  # One reason to do this is that you never had to make this a shapefile in the first place to plot it in R
wet_rich_df <- cbind(as.data.frame(wet_rich_shp@data), 
                     long = wet_rich_shp@coords[,1],
                     lat = wet_rich_shp@coords[,2])

Now to map the different shapefiles using spatial geoms from ggspatial.

ggmap <- ggplot()+
         coord_sf(crs = 26919)+  # set coordinate system to NAD83 UTM Zone 19N using EPSG code      
         geom_polypath(data = bounds2, aes(x = long, y = lat, group = group),
                       fill = "grey", alpha = 0.4)+
         geom_polypath(data = wetmap2, aes(x = long, y = lat, group = group),
                       fill = "#80A8D1")+
         geom_point(data = wet_rich_df, 
                           aes(x = long, 
                               y = lat, 
                               color = Site_Type),
                            shape = 17)+
         theme_minimal() # because I don't like ggplot's default grey panel grid

ggmap

You can add north arrows, scale bars and legends using other packages, including sf. For more information about this, check out r-spatial’s site.
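
For example, ggspatial (which we already loaded) includes annotation functions for both. A quick sketch; see the package help for styling options:

ggmap + annotation_scale(location = "br")+ # scale bar in bottom-right
        annotation_north_arrow(location = "tl") # north arrow in top-left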

Finally, I’ll show you how to use leaflet to map data with park tiles as the base layer. Leaflet is a JavaScript library, but its main functions have been wrapped in the R leaflet package. The great thing about leaflet is that you can make interactive maps that allow you to zoom/pan to different views. We use these a lot for our data visualizers.

To use leaflet, you have to have all of your spatial data in WGS84 instead of NAD83 UTM Zone 19N. The first step is to reproject the wetland points to WGS84.

wet_rich_dd <- spTransform(wet_rich_shp, CRS="+proj=longlat +datum=WGS84")
plot(wet_rich_dd) #check that they still plot spatially
library(leaflet)

park_tile_basic = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"

map_leaf <- leaflet() %>% 
 setView(lng = -68.312,
         lat = 44.25,
         zoom = 10) %>% 
 addTiles(urlTemplate = park_tile_basic) %>% 
 addCircleMarkers(data = wet_rich_dd,
                  radius = 4, 
                  lng = wet_rich_dd@coords[,1],
                  lat = wet_rich_dd@coords[,2],
                  label = wet_rich_dd@data$Site_Name,
                  labelOptions = labelOptions(noHide=T, textOnly = TRUE, 
                                              direction = 'bottom', textsize = "12px"),
                  fillColor = "#4C9B4F",
                  fillOpacity = 1,
                  stroke = FALSE) %>% 
 addScaleBar(position = 'bottomright',
             options = scaleBarOptions(maxWidth = 10, metric = TRUE)) # scaleBarOptions() is an argument to addScaleBar, not a pipe step

map_leaf


That’s the end of Day 3! Please see the assignments tab for further reading, complete the feedback form, and keep on coding!

The Dream

When I first started with NETN, the workflow below was how we operated. There’s no shame in this workflow, and it’s still what a lot of networks do. However, it’s time-consuming, requires a lot of human intervention where errors can be introduced, and you nearly have to start from scratch every year you need to update your reports or do a new summary.

Database WF1.0



This is the dream. Again there’s no judgment if you’re not here. This is something to aspire to, as you have time and gain more skills in R. I’m proud to report that we’re really close to achieving this with most of our protocols in NETN, but it’s been a long road (4+ years). RStudio has also come a long way in the last 5 years and has made this transition easier with improvements to R Markdown (automated reporting), R Shiny (interactive websites), and the tidyverse.

Database WF2.0


Assignments

Now that training has ended, the only tasks I ask you to complete are to work through the code and answer the questions for today, and to fill out the feedback form. All other bullets are suggested resources for further reading, primarily related to data visualization and ggplot2. While I only suggested a few chapters from the R for Data Science book and the STAT545.com site, both are great resources that cover lots of other topics in R. I encourage you to read through these resources as you have time and keep working on your R skills!

Useful websites

There’s a lot of great online material for learning new applications of R. The ones I’ve used the most are listed below.

Online Books
  • R for Data Science First author is Hadley Wickham, the programmer behind the tidyverse. There’s a lot of good stuff in here, including a chapter on using R Markdown, which is what I used to generate the website we used for training.
  • ggplot2: Elegant Graphics for Data Analysis A great reference on ggplot2 also by Hadley Wickham.
  • Mastering Software Development in R First author is Roger Peng, a Biostatistics professor at Johns Hopkins, who has taught a lot of undergrad/grad students how to use R. He’s also one of the hosts of the Not So Standard Deviations podcast. I like his intro to ggplot a lot. He also covers more advanced topics in this book, like writing functions and packages.
  • R Packages Another book by Hadley Wickham that teaches you how to build, debug, and test R packages.
  • Advanced R Yet another book by Hadley Wickham that helps you understand more about how R works under the hood, how it relates to other programming languages, and how to build packages.
  • Mastering Shiny And another Hadley Wickham book on building shiny apps.
Other useful sites
  • STAT545 Jenny Bryan’s site that accompanies the graduate level stats class of the same name. I like that she includes topics on best practices for coding, and not just how to make scripts work.
  • RStudio home page There’s a ton of info in the Resources tab on this site, including cheatsheets for each package developed by RStudio (ie tidyverse packages), webinars, presentations from past RStudio Conferences, etc.
  • RStudio list of useful R packages by topic
  • Happy Git with R If you find yourself wanting to go down the path of hosting your code on github, this site will walk you through the process of linking github to RStudio.
  • R-spatial mapping in ggplot The r-spatial.org site is a blog with lots of interesting topics covered. The blog I linked to helped me figure out how to map shapefiles in ggplot.

Answers

Day 1

Question 1: How do you find the value of the 3rd row of domSpp?

# If starting from scratch
library(tidyverse)
df <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 'plot7'), #column for plot number
                 numtrees = c(24, 18, 17, 14, 31, 27, 15), 
                 numLive = c(21, 18, 16, 10, 29, 26, 14),
                 domSpp = c('red_maple', 'red_spruce', 
                            'red_spruce','white_cedar','white_pine',
                            'white_spruce', NA),
                 invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))


df[3, c("domSpp")] # option 1
## [1] "red_spruce"
df[3, 4] # option 2
## [1] "red_spruce"
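
A third option, if you prefer pipes (equivalent to the above):

df %>% slice(3) %>% pull(domSpp) # tidyverse option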

Question 2: How would you make numtrees an integer?

df$numtrees <- as.integer(df$numtrees)
str(df)
## 'data.frame':    7 obs. of  5 variables:
##  $ plot         : chr  "1" "2" "3" "4" ...
##  $ numtrees     : int  24 18 17 14 31 27 15
##  $ numLive      : num  21 18 16 10 29 26 14
##  $ domSpp       : chr  "red_maple" "red_spruce" "red_spruce" "white_cedar" ...
##  $ invasive_pres: logi  TRUE FALSE FALSE FALSE FALSE TRUE ...


Question 3: How would you change the second row in invasive_pres from FALSE to TRUE?

df$invasive_pres[2] <- TRUE
head(df)
##   plot numtrees numLive       domSpp invasive_pres
## 1    1       24      21    red_maple          TRUE
## 2    2       18      18   red_spruce          TRUE
## 3    3       17      16   red_spruce         FALSE
## 4    4       14      10  white_cedar         FALSE
## 5    5       31      29   white_pine         FALSE
## 6    6       27      26 white_spruce          TRUE


Question 4: How would you calculate the percent of trees that are dead in the df data frame?

# base R version (2 options)
df$numDead <- df$numtrees - df$numLive
df$pctDead <- df$numDead/df$numtrees*100

df$pctDead <- with(df, (numtrees - numLive)/numtrees*100)

# tidyverse version (2 options)
df <- df %>% mutate(numDead = numtrees - numLive,
                    pctDead = numDead/numtrees*100)

df <- df %>% mutate(pctDead = (numtrees - numLive)/numtrees*100)


Question 5: What is the smallest number of dead trees observed in the df dataset?

min(df$numDead) # uses the numDead column calculated in the previous question
## [1] 0
summary(df$numDead)[1] #Min. is the first record that summary reports for numeric columns
## Min. 
##    0
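
If you also want to know which plot had that minimum, which.min() returns the row index of the (first) minimum, which you can use to pull the whole row:

df[which.min(df$numDead), ] # row for the plot with the fewest dead trees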


Question 6: How would you select only Invasive species records?

# If starting from scratch
wetdat<-read.csv("./data/example_data/ACAD_wetland_data_clean.csv")

wetinv <- wetdat[wetdat$Invasive == TRUE, ] # base R bracket-based solution
wetinv2 <- subset(wetdat, Invasive == TRUE) # another base R solution
wetinv_tidy <- wetdat %>% filter(Invasive == TRUE) # tidyverse solution


Question 7: Which sites have both Arethusa bulbosa and Carex exilis present?

# Base R solution
wetspp <- wetdat[wetdat$Latin_Name %in% c("Arethusa bulbosa", "Carex exilis"),]
sort(unique(wetspp$Site_Name))
## [1] "RAM-05" "SEN-01" "SEN-02"
# Tidyverse solution
wetspp_tidy <- wetdat %>% filter(Latin_Name %in% c("Arethusa bulbosa", "Carex exilis")) 
sort(unique(wetspp_tidy$Site_Name)) # sites with either species; only RAM-05 and SEN-01 have both (see check below)
## [1] "RAM-05" "SEN-01" "SEN-02"
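
Note that the filter returns every site where either species was recorded. To confirm which sites have both, you can count the distinct species per site; a quick follow-up check:

wetspp_tidy %>% group_by(Site_Name) %>% 
                summarize(n_spp = n_distinct(Latin_Name), .groups = 'drop') %>% 
                filter(n_spp == 2) # sites where both species are present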


Question 8: How would you sort the wetdat by Ave_Cov (high to low) and Latin_Name?

wetdat_sort <- wetdat[order(-wetdat$Ave_Cov, wetdat$Latin_Name), ] # base R
head(wetdat_sort)
##     Site_Name Site_Type                      Latin_Name
## 286    RAM-62       RAM                 Carex trisperma
## 328    RAM-62       RAM           Vaccinium vitis-idaea
## 425    RAM-05       RAM               Carex folliculata
## 220    RAM-53       RAM                  Eurybia radula
## 506    RAM-05       RAM Viburnum nudum var. cassinoides
## 305    RAM-62       RAM           Maianthemum trifolium
##                                 Common Year PctFreq Ave_Cov Invasive Protected
## 286                  threeseeded sedge 2017     100      74    FALSE     FALSE
## 328                        lingonberry 2012     100      66    FALSE     FALSE
## 425                         long sedge 2017      50      63    FALSE     FALSE
## 220                    low rough aster 2012      50      63    FALSE     FALSE
## 506               northern wild raisin 2012     100      63    FALSE     FALSE
## 305 threeleaf false lily of the valley 2012     100      51    FALSE     FALSE
##      X_Coord Y_Coord
## 286 556731.5 4908125
## 328 556731.5 4908125
## 425 553186.0 4899764
## 220 549032.1 4909911
## 506 553186.0 4899764
## 305 556731.5 4908125
wetdat_sort_tidy <- wetdat %>% arrange(-Ave_Cov, Latin_Name) #tidyverse
head(wetdat_sort_tidy)
##   Site_Name Site_Type                      Latin_Name
## 1    RAM-62       RAM                 Carex trisperma
## 2    RAM-62       RAM           Vaccinium vitis-idaea
## 3    RAM-05       RAM               Carex folliculata
## 4    RAM-53       RAM                  Eurybia radula
## 5    RAM-05       RAM Viburnum nudum var. cassinoides
## 6    RAM-62       RAM           Maianthemum trifolium
##                               Common Year PctFreq Ave_Cov Invasive Protected
## 1                  threeseeded sedge 2017     100      74    FALSE     FALSE
## 2                        lingonberry 2012     100      66    FALSE     FALSE
## 3                         long sedge 2017      50      63    FALSE     FALSE
## 4                    low rough aster 2012      50      63    FALSE     FALSE
## 5               northern wild raisin 2012     100      63    FALSE     FALSE
## 6 threeleaf false lily of the valley 2012     100      51    FALSE     FALSE
##    X_Coord Y_Coord
## 1 556731.5 4908125
## 2 556731.5 4908125
## 3 553186.0 4899764
## 4 549032.1 4909911
## 5 553186.0 4899764
## 6 556731.5 4908125


Day 2

Question 1: There’s a Latin name that needs to be fixed (hint look for non-letter symbols). How would you find that record?

sort(unique(wetdat$Latin_Name))[1:10] # Low-tech way: use your eyes to spot the "Alnus incana++". I truncated to first 10.
##  [1] "Acer rubrum"               "Alnus incana"             
##  [3] "Alnus incana++"            "Amelanchier"              
##  [5] "Andromeda polifolia"       "Apocynum androsaemifolium"
##  [7] "Arethusa bulbosa"          "Aronia melanocarpa"       
##  [9] "Berberis thunbergii"       "Betula populifolia"

Hopefully you noticed Alnus incana++ when you sorted the unique Latin names. The ++ is an error that we want to remove (see Question 2). If your dataset is too big to spot every odd value by eye, you can use regular expressions like the code chunk below. The pattern in the grep matches one or more (+) characters that are not letters ([^[:alpha:]]; inside square brackets, “^” means NOT in regular expressions) occurring at the end of the record ($ anchors the match to the end of the string). That’s why a trailing run of symbols like ++ is flagged, but the internal space in a normal two-word Latin name is not.

# The regular expression way to find non-alphabetical symbols (now you know why I didn't cover this).
wetdat$Latin_Name[grep("[^[:alpha:]]+$", wetdat$Latin_Name)] # returns records that end in 1+ non-letter symbols
## [1] "Alnus incana++"
  # [^[:alpha:]] matches any non-letter, + means one or more, and $ anchors the match to the end of the record.
  # This took longer than I'd like to admit to make work, by the way.
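
To see what the pattern does and doesn’t catch, here’s a toy example with made-up values:

x <- c("Acer rubrum", "Alnus incana++", "Carex exilis?")
grep("[^[:alpha:]]+$", x, value = TRUE) # internal spaces don't match, trailing symbols do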


Question 2: Now that you’ve found it, fix it, and check that you fixed it.

wetdat$Latin_Name[wetdat$Latin_Name == "Alnus incana++"] <- "Alnus incana" # easy approach
wetdat$Latin_Name <- gsub("[+]", "", wetdat$Latin_Name) # gsub approach
wetdat$Latin_Name[grep("[^[:alpha:]]+$", wetdat$Latin_Name)] # no values returned 
## character(0)
sort(unique(wetdat$Latin_Name))[1:10] # using your eyes approach
##  [1] "Acer rubrum"               "Alnus incana"             
##  [3] "Amelanchier"               "Andromeda polifolia"      
##  [5] "Apocynum androsaemifolium" "Arethusa bulbosa"         
##  [7] "Aronia melanocarpa"        "Berberis thunbergii"      
##  [9] "Betula populifolia"        "Calamagrostis canadensis"


Question 3: How would you take only the plots on MDI (eg Unit_ID = “ACAD_MDI”) in the loc table and join them to their corresponding visits in the events table?

# In case you need to load the tables
loc <- read.csv("./data/example_data/tbl_Locations_example.csv")
event <- read.csv("./data/example_data/tbl_Events_example.csv")  
stand <- read.csv("./data/example_data/tbl_Stand_Data_example.csv") 

# Filter then join
loc_MDI <- loc %>% filter(Unit_ID == "ACAD_MDI")
locev_MDI <- left_join(loc_MDI, event, by = "Location_ID")

# All in one via pipes:
locev_MDI2 <- loc %>% filter(Unit_ID == "ACAD_MDI") %>% 
                      left_join(event, by = "Location_ID")


Question 4: How would you join the stand data to the locev table (the one we made that dropped QAQC events)?

locev <- merge(loc, event[event$Event_QAQC == FALSE, ], all.x = TRUE, all.y = TRUE)
intersect(names(locev), names(stand))
## [1] "Event_ID"
locev_stand <- merge(locev, stand, by = "Event_ID", all.x = TRUE, all.y = TRUE) #base R full join
locev_stand_tidy <- full_join(locev, stand, by = "Event_ID") #tidyverse full join


Question 5: How would you calculate the min and max number of species by year?

head(wetdat)
##   Site_Name Site_Type          Latin_Name           Common Year PctFreq Ave_Cov
## 1    SEN-01  Sentinel         Acer rubrum        red maple 2011       0    0.02
## 2    SEN-01  Sentinel         Amelanchier     serviceberry 2011      20    0.02
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary 2011      80    2.22
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry 2011     100    2.64
## 6    SEN-01  Sentinel        Carex exilis    coastal sedge 2011      60    6.60
##   Invasive Protected  X_Coord Y_Coord
## 1    FALSE     FALSE 574855.5 4911909
## 2    FALSE     FALSE 574855.5 4911909
## 3    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 5    FALSE     FALSE 574855.5 4911909
## 6    FALSE     FALSE 574855.5 4911909
wetmin_max <- wetdat %>% mutate(present = ifelse(Ave_Cov >0 & !is.na(Ave_Cov), 1, 0)) %>% 
                         group_by(Site_Name, Year) %>% 
                         summarize(numspp = sum(present), 
                                   .groups = "drop") %>% 
                         group_by(Year) %>% 
                         summarize(minspp = min(numspp),
                                   maxspp = max(numspp), 
                                  .groups = "drop")
wetmin_max
## # A tibble: 3 x 3
##    Year minspp maxspp
## * <int>  <dbl>  <dbl>
## 1  2011     33     41
## 2  2012     26     48
## 3  2017     26     54
Note that I first had to sum up the number of species within a site (hence group_by(Site_Name, Year)). After calculating that, I grouped by Year only, then found the min and max number of species recorded within a year.


Question 6: What percent of RAM sites had an invasive species present in 2012 and 2017?

#head(wetdat)
wetinv1 <- wetdat %>% filter(Site_Type == "RAM") %>% 
                      group_by(Site_Name, Year) %>% 
                      summarize(num_invspp = sum(Invasive), 
                                plot_inv = ifelse(num_invspp > 0, 1, 0), 
                                .groups = "drop")
head(wetinv1)
## # A tibble: 6 x 4
##   Site_Name  Year num_invspp plot_inv
##   <chr>     <int>      <int>    <dbl>
## 1 RAM-05     2012          0        0
## 2 RAM-05     2017          1        1
## 3 RAM-41     2012          0        0
## 4 RAM-41     2017          1        1
## 5 RAM-44     2012          1        1
## 6 RAM-44     2017          1        1
# Note plot_inv sets each plot to 1 or 0 even if multiple invasive species are found on the plot.
wetinv <- wetinv1 %>% group_by(Year) %>% 
                      summarize(tot_inv_plots = (sum(plot_inv)/n())*100,
                                .groups = "drop")

wetinv 
## # A tibble: 2 x 2
##    Year tot_inv_plots
## * <int>         <dbl>
## 1  2012            40
## 2  2017            80

Note that logical fields (i.e., TRUE/FALSE) are treated as numbers in R when you use sum(): TRUE counts as 1 and FALSE counts as 0. So sum(Invasive) is counting all of the TRUEs in the Invasive column.
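
A quick illustration of how sum() treats logicals:

sum(c(TRUE, FALSE, TRUE, TRUE)) # returns 3, because each TRUE counts as 1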



Day 3

Question 1: How would you create a dataset that has a column for every sample year (ie 2013-2019), a row for every site, and the number of northern long-eared bats found as the value?

# In case you're starting from scratch
library(tidyverse)
bat_long <- read.csv("./data/example_data/fake_bats.csv")
bat_long <- bat_long %>% mutate(sppcode = toupper(paste0(substr(word(Latin, 1), 1, 3),
                                                         substr(word(Latin, 2), 1, 3))))

MYOSEP_sum <- bat_long %>% filter(sppcode == "MYOSEP") %>% 
                           group_by(Site, Year) %>% 
                           summarize(num_bats = sum(ifelse(!is.na(Latin), 1, 0)), 
                                     .groups = 'drop') 

head(MYOSEP_sum) # Note only the sites and years where MYOSEP was found are listed. The next lines add the missing site/year combinations back before the pivot.
bat_site_year <- bat_long %>% expand(Site, Year)

MYOSEP_comp <- bat_site_year %>% left_join(., MYOSEP_sum, by = c("Site", "Year")) 
MYOSEP_comp$num_bats[is.na(MYOSEP_comp$num_bats)] <- 0 # Replace all the NAs created by left join with 0

MYOSEP_wide <- MYOSEP_comp %>% arrange(Site, Year) %>% 
                               pivot_wider(names_from = Year, values_from = num_bats, values_fill = 0)
MYOSEP_wide
## # A tibble: 3 x 8
##   Site       `2013` `2014` `2015` `2016` `2017` `2018` `2019`
##   <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 BrownMtnCR      0      0      0      0      0      1      0
## 2 BrownMtnGH      1      0      1      0      0      0      0
## 3 BubPdS          0      0      0      0      0      0      0


Question 2: How would you create a wide matrix with all protected species as columns, sites and years sampled as rows, and percent cover as the value?

# If you're starting from scratch
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
wetdat_prep <- wetdat %>% mutate(sppcode = toupper(paste0(substr(word(Latin_Name, 1), 1, 3),
                                                          substr(word(Latin_Name, 2), 1, 3)))) 

wetdat_dist <- wetdat_prep %>% distinct(Site_Name, Year) # tibble of the site/year combinations that exist in the data; unlike expand(), it doesn't add new ones

wetdat_wide <- wetdat_prep %>% filter(Protected == TRUE) %>% 
                               select(Site_Name, Year, sppcode, Ave_Cov) %>% 
                               pivot_wider(names_from = sppcode, values_from = Ave_Cov, values_fill = 0) %>% 
                               left_join(wetdat_dist, ., by = c("Site_Name", "Year")) 
wetdat_wide[, c("AREBUL", "POGOPH", "CALTUB")][is.na(wetdat_wide[, c("AREBUL", "POGOPH", "CALTUB")])] <- 0

Note that I used distinct() instead of expand() because not every site is sampled every year. distinct() adds back the site/year combinations that were sampled but where no protected species were found, whereas expand() would have added a record for every year in the data to every site, sampled or not.
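
Here’s a toy contrast between the two functions, with made-up data:

d <- data.frame(Site = c("A", "A", "B"), Year = c(2012, 2017, 2012))
d %>% distinct(Site, Year) # only the combinations present in the data: A-2012, A-2017, B-2012
d %>% expand(Site, Year)   # all combinations: adds B-2017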


Question 3: How would you create a wide matrix for ACAD plots in 2019 that has plots for rows, and exotic species as columns?

# Starting from scratch
library(forestNETNarch)
importCSV("./data/NETN_forest_csvs")
acad_exo <- makeSppList(park = "ACAD", from = 2019, to = 2019, speciesType = "exotic") 
acad_exo <- acad_exo %>% mutate(present = ifelse(Exotic == TRUE & !is.na(Exotic), 1, 0))
head(acad_exo)
# Now to answer the question
# Old school spread version
acad_wide_spread <- acad_exo %>% select(Plot_Name, Latin_Name, present) %>% 
                                 arrange(Latin_Name, Plot_Name) %>% 
                                 spread("Latin_Name", "present", fill = 0)

# Newer pivot version
acad_wide_pivot <- acad_exo %>% select(Plot_Name, Latin_Name, present) %>% 
                                arrange(Latin_Name, Plot_Name) %>% 
                                pivot_wider(names_from = "Latin_Name", 
                                            values_from = "present",
                                            values_fill = 0) 

head(acad_wide_pivot)
## # A tibble: 6 x 7
##   Plot_Name `Festuca filifo~ `Hieracium caes~ `Lonicera - Exo~ `Poa compressa`
##   <chr>                <dbl>            <dbl>            <dbl>           <dbl>
## 1 ACAD-044                 1                0                1               0
## 2 ACAD-045                 1                0                0               0
## 3 ACAD-078                 1                0                0               0
## 4 ACAD-176                 1                0                0               1
## 5 ACAD-058                 0                1                0               0
## 6 ACAD-039                 0                0                0               0
## # ... with 2 more variables: `Rhamnus frangula` <dbl>, `NA` <dbl>


Question 4: Take the acad_final graph and make 2 more changes to it.

You may have chosen other changes than I did. Here are a couple of tweaks to the ggplot bar chart:

# Starting from scratch
library(forestNETN)
importCSV("./data/NETN_forest_csvs")
acad_regen <- joinRegenData(park = "ACAD", from = 2016, to = 2019, canopyForm = "canopy")

acad_regen2 <- acad_regen %>% group_by(Plot_Name) %>% 
                              summarize(seed15_30 = sum(seed15.30),
                                        seed30_100 = sum(seed30.100),
                                        seed100_150 = sum(seed100.150),
                                        seed150p = sum(seed150p),
                                        sapling = sum(sap.den),
                                        .groups = 'drop') %>% 
                              arrange(Plot_Name)

acad_regen_long <- acad_regen2 %>% pivot_longer(-Plot_Name, 
                                                names_to = "size_class", 
                                                values_to = "density")

# Now we calculate the mean and se for each size class at the park level
acad_stats <- acad_regen_long %>% group_by(size_class) %>% 
                                  summarize(mean_dens = mean(density),
                                            se_dens = sd(density)/sqrt(n()),
                                            .groups = 'drop')

acad_stats$size_class_fact <- ordered(acad_stats$size_class, 
                                      levels = c('seed15_30', 'seed30_100', 
                                                 'seed100_150', 'seed150p', 'sapling'))
# Now to answer the question
ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
              geom_bar(stat = 'identity', fill = "steelblue4")+
              geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens),
                            width = 0.2, lwd = 0.7)+
              labs(x = "Regeneration Size Class", y = "Stems per sq.m")+
              theme_FHM()+
              theme(axis.text.x = element_text(angle = 45, hjust = 1))+ # hjust = 1 right-aligns the angled labels with the ticks
              scale_x_discrete(labels = c('15-30cm', '30-100cm', '100-150cm', '>150cm', '1-10cm DBH'))

Note that width = 0.2 in geom_errorbar sets the width of the caps on the error bars, and lwd sets the line weight.


Question 5: Filter the acad_ph_final data frame to only include 2019 data and create a plot with months as the X axis, and includes points, a loess smooth with span = 1, and a facet on site.

# Starting from scratch
library(NCRNWater)

netnwd<-importNCRNWater(Dir = "./data/NETN_water_data", 
                        Data = "Water Data.csv", 
                        MetaData = "VizMetaData.csv")

acad_all_sites <- data.frame(sitecode = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteCode"),
                             type = getSiteInfo(netnwd, parkcode = "ACAD", info = "type"),
                             fullname = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteName"))

acad_lakes <- acad_all_sites %>% filter(type == "Lake")  #list of site codes and full names for ACAD lakes

acad_ph <- getWData(netnwd, parkcode = "ACAD", years = 2006:2019, charname = 'pH')
acad_lakes_ph <- acad_ph %>% filter(Site %in% acad_lakes$sitecode)
annual_lakes <- data.frame(table(acad_lakes_ph$Site)) %>% filter(Freq > 90) 

acad_ph2 <- acad_lakes_ph %>% filter(Site %in% annual_lakes$Var1) %>% 
                              mutate(month = format(Date, "%m"),
                                     year = format(Date, "%Y"))

acad_ph_final <- left_join(acad_ph2, acad_lakes[, c("sitecode", "fullname")], 
                           by= c("Site" = "sitecode")) # how to join if columns have diff. names

#head(acad_ph_final)

# Now to answer the question
acad_ph_2019 <- acad_ph_final %>% filter(year == 2019) %>% mutate(mon_abr = format(Date, "%b"))

plot19 <- ggplot(acad_ph_2019, aes(x = Date, y = Value))+
             geom_point()+
             labs(y = "pH")+
             geom_smooth(method = 'loess', span = 1, se = FALSE)+
             scale_x_date(date_breaks = "1 month",
                          date_labels = "%b")+ #%b is months names abbreviated
             facet_wrap(~fullname, ncol = 4)+
             theme_bw()

plot19
## `geom_smooth()` using formula 'y ~ x'


Code printout

This tab prints all of the code chunks in this document as one long script.

#--------------------
#       Prep
#--------------------
install.packages(c("tidyverse", "devtools", "rgdal", "ggspatial", "leaflet"))
devtools::install_github("katemmiller/forestNETNarch")
devtools::install_github("NCRN/NCRNWater")
library(tidyverse)
library(forestNETN)
library(NCRNWater)
#--------------------
#    Day 1 Mod 1
#--------------------
install.packages(c("tidyverse", "rgdal", "devtools"))
devtools::install_github("katemmiller/forestNETNarch")
devtools::install_github("NCRN/NCRNWater")
library(tidyverse)
library(rgdal)
library(forestNETNarch)
library(NCRNWater)
library(tidyverse)
library(rgdal)
#--------------------
#    Day 1 Mod 2
#--------------------
subfolders<-c("data", "figures", "tables", "rscripts", "shapefiles")

invisible(lapply(subfolders, function(x){
  if(!dir.exists(x)){dir.create(paste(getwd(), x, sep = '/'))}
  }))

zips <- c("./data/example_data", "./data/NETN_forest_csvs", "./data/NETN_water_data", "./data/NETN_bird_data")
unlink(zips, recursive=T)
list.files("./data")
list.files("./data") # check that you have data.zip in your data folder
unzip("./data/data.zip", exdir = "./data")
list.files("./data") # check that the unzip worked. There should be multiple folders with datasets in each.
#--------------------
#    Day 1 Mod 3
#--------------------
?plot
?dplyr::filter
help(package=dplyr)
#--------------------
#    Day 1 Mod 4
#--------------------
library(tidyverse)
# Vector of numbers
vec_num <- c(1, 2, 3, 4, 5)
vec_num

# Vector of strings
vec_str <- c("Schoodic", "MDI-East", "MDI-West", "IAH")
vec_str
vec_str[1]

df <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 'plot7'), #column for plot number
                 numtrees = c(24, 18, 17, 14, 31, 27, 15), 
                 numLive = c(21, 18, 16, 10, 29, 26, 14),
                 domSpp = c('red_maple', 'red_spruce', 
                            'red_spruce','white_cedar','white_pine',
                            'white_spruce', NA),
                 invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))

print(df)
df$plot # List the values within the column plot for the df data frame
df[ , 'plot'] # Bracket version of the same thing
df[ , 1] # Bracket version of showing the first column, which is plot
df[2, 1] # Selecting 2nd row of 1st column

head(df) # to view the first 6 rows of df
tail(df) # to view last 6 rows of df
names(df) # to view the column names and their order
str(df) # to view how R is defining each column type
# The long way (note that I renamed the object here)
df2 <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 7), #column for plot number
                 numtrees = c(24, 18, 17, 14, 31, 27, 15), 
                 numLive = c(21, 18, 16, 10, 29, 26, 14),
                 domSpp = c('red_maple', 'red_spruce', 
                            'red_spruce','white_cedar','white_pine',
                            'white_spruce', 'white_spruce'),
                 invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))

# A shorter way
df$plot[7] <- 7 #takes the 7th row of the plot column and changes it to 7
df$domSpp[7] <- "white_spruce" #takes the 7th row of domSpp and replaces it with "white_spruce"

# Another shorter way
df[7, "plot"] <- 7 #takes the 7th row of the plot column and makes it 7
df[7, 'domSpp'] <- "white_spruce" #takes the 7th row of the domSpp column and makes it "white_spruce"

# Another shorter way
df[7, 1] <- 7 #takes the 7th row of the first column and makes it 7
df[7, 4] <- "white_spruce" #takes the 7th row of the 4th column and makes it "white_spruce"

# Even shorter-way
df[7, c(1, 4)] <- c(7, "white_spruce") #Takes the 7th row and the first and 4th column, which happen to be plot and domSpp.
# Then the first column gets 7, and the 4th gets "white spruce"

#df[is.na("domSpp")] <- "white_oak" #wrong
df$domSpp[is.na(df$domSpp)]<- "white_oak"
df$domSpp <- df$domSpp %>% replace_na("white_oak")

str(df) #still a char
df$plot <- as.numeric(df$plot)
str(df)
#--------------------
#    Day 1 Mod 5
#--------------------
min(df$numtrees)
max(df$numtrees)
mean(df$numtrees)
range(df$numtrees) #gives min and max
summary(df)
df$numDead <- df$numtrees - df$numLive # base R version
df <- df %>% mutate(numDead = numtrees - numLive) # tidyverse version of the same operation.
# note if the 2nd line doesn't run, load the tidyverse library (library(tidyverse)) first.
#--------------------
#    Day 1 Mod 6
#--------------------
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_example.csv")
install.packages("readxl") #only need to run once. 
library(readxl)
bat_capture <- suppressWarnings(read_xlsx(path = "./data/example_data/ANP_BAT_MasterCaptureDB_1996-2019 20200811.xlsx", 
                         sheet = "All Records"))
library(readxl)
bat_capture <- read_xlsx(path = "./data/example_data/ANP_BAT_MasterCaptureDB_1996-2019 20200811.xlsx", 
                         sheet = "All Records")
head(bat_capture)
dat <- read.csv(file.choose())
# Filter to remove protected species
wetdat_pub <- wetdat[wetdat$Protected == FALSE, ]
nrow(wetdat) #508 rows in original dataset
nrow(wetdat_pub) #499 rows in filtered dataset
table(wetdat_pub$Protected) # Only Protected = FALSE

# Select only fields you want
wetdat_pub <- wetdat_pub[ , c("Site_Name", "Site_Type", "Latin_Name", "Year", "Ave_Cov", 
                              "Invasive", "Protected")]
names(wetdat_pub)

# Previous steps can all happen in 1 line of code:
wetdat_pub <- wetdat[wetdat$Protected == FALSE, 
                     c("Site_Name", "Site_Type", "Latin_Name", "Year", "Ave_Cov", 
                       "Invasive", "Protected")]

nrow(wetdat_pub) # 499 rows
names(wetdat_pub) # Only includes fields we want
table(wetdat_pub$Protected) #All FALSE

# Another base R solution that's a bit easier to work with:
wetdat_pub2 <- subset(wetdat, Protected == FALSE, 
                              select = c(Site_Name, Site_Type, Latin_Name, Year,
                                         Ave_Cov, Invasive, Protected))

nrow(wetdat_pub2) # 499 rows
names(wetdat_pub2) # Only includes fields we want
table(wetdat_pub2$Protected) #All FALSE
wetdat_tidy <- wetdat %>% filter(Protected == FALSE) %>% 
                          select(Site_Name, Site_Type, Latin_Name, Year, Ave_Cov, Invasive, Protected)
                          # Any column name not mentioned in the select is dropped

nrow(wetdat_tidy)
names(wetdat_tidy)
table(wetdat_tidy$Protected)

# Even more efficient way to code this:
wetdat_tidy <- wetdat %>% filter(Protected == FALSE) %>% select(-Common, -PctFreq)
# Base R
wetdat_2sites <- wetdat[wetdat$Site_Name %in% c("SEN-01", "SEN-03"), ] 
nrow(wetdat_2sites)
table(wetdat_2sites$Site_Name)

# Tidyverse equivalent of same operation
wetdat_2tidy <- wetdat %>% filter(Site_Name %in% c("SEN-01", "SEN-03"))
nrow(wetdat_2tidy)
table(wetdat_2tidy$Site_Name)


head(wetdat) # check original order before sorting

# Base R- I always forget the exact code and have to look it up.
wetdat_sort <- wetdat[order(wetdat$Latin_Name, wetdat$Site_Name), ]
head(wetdat_sort)

# Sort in reverse
wetdat_rev <- wetdat[order(desc(wetdat$Latin_Name), desc(wetdat$Site_Name)), ] #desc() stands for descending
head(wetdat_rev)

# Another example with numbers
wetdat_rev2 <- wetdat[order(-wetdat$PctFreq), ]
# Note: desc() is for text, - is for numbers

# Tidyverse version
wetdat_sort_tidy <- wetdat %>% arrange(Latin_Name, Site_Name)
head(wetdat_sort_tidy)

# Tidyverse reverse sort
wetdat_rev_tidy <- wetdat %>% arrange(desc(Latin_Name), desc(Site_Name))
head(wetdat_rev_tidy)
#--------------------
#    Day 2 Mod 1
#--------------------
library(tidyverse)
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_example.csv")
head(wetdat) # shows the first 6 lines of the data frame, including the names of the columns
tail(wetdat[1:3]) # shows last 6 records of first 3 columns in the data frame
names(wetdat) # shows the column names of the data frame
table(wetdat$Site_Name, wetdat$Year) # distribution of sites across years
table(complete.cases(wetdat)) # FALSE counts the number of rows that have an NA in at least 1 column
dim(wetdat) # number of rows (first) and columns (second)
nrow(wetdat) # number of rows (ncol is # columns)
str(wetdat) # shows the type of data in each column
summary(wetdat) # summarize each field based on data type

View(wetdat)
wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov) #create new field that converts Ave_Cov to numeric
which(is.na(wetdat$Ave_Cov_Num)) 
wetdat[which(is.na(wetdat$Ave_Cov_Num)), ] # all columns with NA in Ave_Cov_num

wetdat[which(is.na(wetdat$Ave_Cov_Num)), "Ave_Cov"] # Ave_Cov only

wetdat$Ave_Cov[which(is.na(wetdat$Ave_Cov_Num))] # Another way to see Ave_Cov only 
wetdat$Ave_Cov[wetdat$Ave_Cov == "<0.1"] <- "0.1"
#wetdat$Ave_Cov[wetdat$Ave_Cov == "<0.1"] <- NA
wetdat$Ave_Cov <- gsub("<", "", wetdat$Ave_Cov)

str(wetdat)
wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov)
str(wetdat) 
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_example.csv")
wetdat$Ave_Cov_Num <- as.numeric(wetdat$Ave_Cov)
wetdat$Ave_Cov[which(is.na(wetdat$Ave_Cov_Num))] # Identify which values in Ave_Cov can't be converted to a number and what you need to do to fix them.
wetdat$Ave_Cov <- as.numeric(gsub("<", "", wetdat$Ave_Cov)) # replace < with nothing
wetdat <- subset(wetdat, select = -c(Ave_Cov_Num)) #now drop the temp column Ave_Cov_Num
str(wetdat) #Ave_Cov is now numeric
table(complete.cases(wetdat$Ave_Cov)) #checks that there aren't any more NAs. 
# If there were NAs, you'd have a column named FALSE with the number of NAs below it.
range(wetdat$PctFreq, na.rm = TRUE) # 20 to 100- that looks fine.
range(wetdat$Ave_Cov, na.rm = TRUE) # 0.02 to 110.0- no bueno
wetdat[wetdat$Ave_Cov > 100, ] # If Ave_Cov is still a chr, this filter won't work
wetdat$Ave_Cov[wetdat$Ave_Cov > 100] <- 11
range(wetdat$Ave_Cov) 
table(complete.cases(wetdat$Ave_Cov)) #all TRUE
table(complete.cases(wetdat$PctFreq)) #some FALSE
mean(wetdat$PctFreq)
mean(wetdat$PctFreq, na.rm = TRUE)
wetdat$PctFreq[is.na(wetdat$PctFreq)] <- 0
table(complete.cases(wetdat$PctFreq))
mean(wetdat$PctFreq)
table(wetdat$Site_Name)
unique(wetdat$Site_Name[which(nchar(wetdat$Site_Name) != 6)])

wetdat$Site_Name[which(!grepl("-", wetdat$Site_Name))]

wetdat$Site_Name[wetdat$Site_Name == "RAM44"] <- "RAM-44"
wetdat$Site_Name[wetdat$Site_Name == "SEN-3"] <- "SEN-03"
wetdat$Site_Name[which(!grepl("-", wetdat$Site_Name))] #character(0) means nothing returned
wetdat$Site_Name[which(nchar(wetdat$Site_Name)!=6)]
table(wetdat$Site_Name)
sort(unique(wetdat$Site_Name)) #another way to see all the unique values in a column
dim(wetdat) # reports # rows then # cols
nrow(wetdat) # reports # rows only

length(unique(wetdat$Site_Name)) # number of unique values in Site_Name
length(unique(wetdat$Year)) # number of unique years

table(wetdat$Site_Name, wetdat$Year) # distribution of sites across years
sort(unique(wetdat$Common))[1:10] # see all the common names in the dataset (truncated to first 10 common names for printing here).

#--------------------
#    Day 2 Mod 3
#--------------------
tree <- data.frame(plot = c("p1","p1","p1","p2","p2","p3","p4","p4"), 
                   treenum = c(1, 2, 3, 11, 12, 21, 32, 33),
                   species = c("ACERUB" ,"PINSTR", "PINSTR", "QUERUB", "PICRUB", 
                               "PICRUB", "PICRUB", "ACERUB"))
tree

tree_data <- data.frame(plot = c("p1", "p1", "p1", "p1", "p1","p1", "p2", "p2", "p5", "p7"),
                        year = c(2012, 2012, 2012, 2016, 2016, 2016, 2012, 2012, 2016, 2016),
                        treenum = c(1, 2, 3, 1, 2, 3, 11, 12, 51, 71),
                        DBH = c(12.5, 14.5, 16.1, 12.8, 14.9, 17.2, 28.1, 35.4, 36.1, 45.2))

tree_data

tree_full_join <- full_join(tree, tree_data, by = c("plot", "treenum"))
tree_left_join <- left_join(tree, tree_data, by = c("plot", "treenum"))
nrow(tree_full_join)
nrow(tree_left_join)
print(tree_full_join)
print(tree_left_join)
library(tidyverse)
loc <- read.csv("./data/example_data/tbl_Locations_example.csv") # plot level data for forest plots in NETN
head(loc)
event <- read.csv("./data/example_data/tbl_Events_example.csv") # data on each visit in NETN forest plots
head(event)
stand <- read.csv("./data/example_data/tbl_Stand_Data_example.csv") # stand data collected in plots
head(stand)
View(loc) #Location_ID
View(event) #Location_ID links to tbl_Location; Event_ID links other visit-level tables to this table
View(stand) #Event_ID links stand table to event table. 
intersect(names(loc), names(event)) # way to check columns two tables have in common.
intersect(names(event), names(stand))
# base R version
names(loc) # to help me pick the columns to include in the join
names(event)
loc_event_full <- merge(loc, event, by = "Location_ID", all.x = TRUE, all.y = TRUE) # base R full join
loc_event_full_tidy <- full_join(loc, event, by = "Location_ID") # tidyverse full join

View(loc_event_full) # Now you can see all the times a plot has been sampled. You'll also see that some plots 
# don't have sample dates. That's because the loc table includes plots that were rejected and never sampled.
loc_event_left <- merge(loc, event, by = "Location_ID", all.x = TRUE, all.y = FALSE)
loc_event_left_tidy <- left_join(loc, event, by = "Location_ID")
nrow(loc_event_full) # same number of rows because all events have a location record
nrow(loc_event_left) # same number of rows because all events have a location record

# tidyverse version
loc_event_full_tidy <- full_join(loc, event, by = "Location_ID")
nrow(loc_event_full_tidy) # same number of rows because all events have a location record

loc_event_left_tidy <- left_join(loc, event, by = "Location_ID")
nrow(loc_event_left_tidy) # same number of rows because all events have a location record

# Base R approach
locev <- merge(loc[loc$Park_Code == "ACAD", ], 
               event[event$Event_QAQC == FALSE, ], 
               all.x = TRUE, all.y = FALSE)

# Tidyverse approach
loc_ACAD <- loc %>% filter(Park_Code == "ACAD")
locev_tidy <- event %>% filter(Event_QAQC == FALSE) %>% 
                        left_join(loc_ACAD, ., by = "Location_ID")

nrow(locev) 
nrow(locev_tidy)
View(locev)
View(locev_tidy)
#--------------------
#    Day 2 Mod 4
#--------------------
wetdat <- read.csv('./data/example_data/ACAD_wetland_data_clean.csv')
head(wetdat)
wetsum <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>% 
                     group_by(Site_Name, Year) %>% 
                     summarize(numspp = sum(present),
                               .groups = 'drop')
head(wetdat)
wetdat <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) 
wetdat_sum <- wetdat %>% group_by(Site_Name, Year) %>% 
                         summarize(numspp = sum(present),
                                  .groups = 'drop')
wetsum_opt1 <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>% 
                          group_by(Site_Name, Site_Type, Year) %>% 
                          summarize(numspp = sum(present),
                                    .groups = 'drop')

wetsum_opt2 <- wetdat %>% mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>% 
                          group_by(Site_Name, Year) %>% 
                          summarize(Site_Type = first(Site_Type), 
                                    numspp = sum(present),
                                   .groups = 'drop')
wetsum_pre <- wetdat %>% filter(Site_Type == "RAM") %>% 
                         mutate(present = ifelse(Ave_Cov > 0 & !is.na(Ave_Cov), 1, 0)) %>% 
                         group_by(Site_Name, Site_Type, Year) %>% 
                         summarize(numspp = sum(present), 
                                   .groups = 'drop')

wetsum_yr <- wetsum_pre %>% group_by(Year) %>% 
                            summarize(mean_numspp = mean(numspp),
                                      numplots = n(), 
                                      se_numspp = sd(numspp)/sqrt(numplots), 
                                      .groups = 'drop')
wetsum_yr
#--------------------
#    Day 2 Mod 5
#--------------------
write.csv(wetdat, "./data/ACAD_wetland_data2.csv")
file.remove("./data/ACAD_wetland_data2.csv")
#--------------------
#    Day 3 Mod 1
#--------------------
library(tidyverse)
bat_long <- read.csv("./data/example_data/fake_bats.csv")
head(bat_long)

bat_long <- bat_long %>% mutate(sppcode = toupper(paste0(substr(word(Latin, 1), 1, 3),
                                                         substr(word(Latin, 2), 1, 3))))

bat_long_sum <- bat_long %>% group_by(Site, Year, sppcode) %>% 
                             summarize(numindiv = sum(!is.na(sppcode)),
                                       .groups = 'drop')

head(bat_long_sum)
bat_wide <- bat_long_sum %>% spread(key = "sppcode", value = "numindiv") #retired spread function
bat_wide2 <- bat_long_sum %>% pivot_wider(names_from = "sppcode", values_from = "numindiv") #replacement in tidyr
bat_wide

bat_wide_fill <- bat_long_sum %>% spread(key = "sppcode", value = "numindiv", fill = 0) #retired spread function
head(bat_wide_fill)
bat_wide2_fill <- bat_long_sum %>% pivot_wider(names_from = "sppcode", values_from = "numindiv", 
                                               values_fill = 0) #replacement in tidyr

bat_long2 <- bat_wide_fill %>% gather(key = "sppcode", value = "numindiv", -Site, -Year) # old way
head(bat_long2)

names(bat_wide_fill)
bat_long2b <- bat_wide_fill %>% pivot_longer(cols=c(LASCIN, MYOLEI, MYOLUC, MYOSEP), 
                                             names_to = "sppcode", 
                                             values_to = "numindiv") #new way
head(bat_long2b)
nrow(bat_long_sum) # dataset before made wide, then long
nrow(bat_long2) # dataset that has every species represented for every year and every site
#First figure out how many levels you expect
with(bat_long, length(unique(Site)) * length(unique(Year)) * length(unique(sppcode))) #84
bat_all_comb <- bat_long_sum %>% expand(Site, Year, sppcode)
nrow(bat_all_comb) #84

bat_long_comb <- left_join(bat_all_comb, bat_long_sum, by = c("Site", "Year", "sppcode"))
bat_long_comb$numindiv[is.na(bat_long_comb$numindiv)] <- 0 # replace NA with 0
nrow(bat_long2) == nrow(bat_long_comb) #TRUE

head(bat_long_sum)
bat_wide2 <- bat_long_sum %>% pivot_wider(names_from = c(sppcode, Year), # overwrites the earlier bat_wide2
                                          values_from = numindiv,
                                          values_fill = 0)
head(bat_wide2)
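# By default the combined columns are named sppcode_Year, separated by "_".
# If you prefer a different separator, pivot_wider() has a names_sep argument
# (a sketch; bat_wide3 is just a throwaway name):
bat_wide3 <- bat_long_sum %>% pivot_wider(names_from = c(sppcode, Year),
                                          values_from = numindiv,
                                          values_fill = 0,
                                          names_sep = ".")
head(bat_wide3)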
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
head(wetdat)

# Prepare data for pivot
wetdat_prep <- wetdat %>% mutate(sppcode = toupper(paste0(substr(word(Latin_Name, 1), 1, 3),
                                                          substr(word(Latin_Name, 2), 1, 3))),
                                 present = ifelse(Ave_Cov>0, 1, 0)) %>% 
                                 filter(Year == 2017) 

# Pivot wide, so every species has a column
wetdat_wide <- wetdat_prep %>% select(Site_Name, Year, sppcode, present) %>% 
                               pivot_wider(names_from = "sppcode", values_from = "present", 
                                           values_fill = 0)
head(wetdat_wide)

# Pivot long, so every site has a species with 1 or 0 to denote present/absent
wetdat_long <- wetdat_wide %>% pivot_longer(cols = c(-Site_Name, -Year), names_to = "sppcode", 
                                            values_to = "present") %>% 
                               arrange(Site_Name, sppcode)

# Now we need to filter the data to only include invasive species
# First make a list of the invasive species
inv_list <- unique(wetdat_prep$sppcode[wetdat_prep$Invasive == TRUE])
inv_list
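# The same list built with a dplyr pipeline (a sketch; pull() extracts a single
# column as a plain vector):
inv_list2 <- wetdat_prep %>% filter(Invasive == TRUE) %>% 
                             distinct(sppcode) %>% 
                             pull(sppcode)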

#Now filter on that invasive list
wetdat_long_inv <- wetdat_long %>% filter(sppcode %in% inv_list)


# Now summarize to calculate the percent of plots each invasive species was found in
wetdat_sum <- wetdat_long_inv %>% group_by(sppcode) %>% 
                                  summarize(pct_sites = sum(present)/n()*100, 
                                            .groups = 'keep')
wetdat_sum
#--------------------
#    Day 3 Mod 2
#--------------------
library(forestNETNarch)
library(tidyverse)
importCSV("./data/NETN_forest_csvs")
acad_regen <- joinRegenData(park = "ACAD", from = 2016, to = 2019, canopyForm = "canopy")

head(acad_regen) # this dataframe includes species, so there are multiple rows per plot. 

# We need to group_by Plot_Name, and sum seedling densities to get plot-level seedling densities.
acad_regen2 <- acad_regen %>% group_by(Plot_Name) %>% 
                              summarize(seed15_30 = sum(seed15.30),
                                        seed30_100 = sum(seed30.100),
                                        seed100_150 = sum(seed100.150),
                                        seed150p = sum(seed150p),
                                        sapling = sum(sap.den),
                                        .groups = 'drop') %>% 
                              arrange(Plot_Name)

# This is the shortest way to get a dataset with the mean seedling density and a SE for each size class.
# Note we want to make the data long first, so every plot has a row for each size class
acad_regen_long <- acad_regen2 %>% pivot_longer(-Plot_Name, names_to = "size_class", values_to = "density")

# Now we calculate the mean and se for each size class at the park level
acad_stats <- acad_regen_long %>% group_by(size_class) %>% 
                                  summarize(mean_dens = mean(density),
                                            se_dens = sd(density)/sqrt(n()),
                                            .groups = 'drop')
# We have our dataset to plot. 
acad_base <- ggplot(data = acad_stats, aes(x = size_class, y = mean_dens))

# Make a point plot
acad_base + geom_point()

# Make a bar chart
acad_bar <- acad_base + geom_bar(stat = 'identity')
acad_bar

# Add error bars to the previous bar chart
acad_bar + geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
           labs(x = "Regeneration Size Class", y = "Stems per sq.m")

acad_stats$size_class_fact <- ordered(acad_stats$size_class, 
                                      levels = c('seed15_30', 'seed30_100', 'seed100_150', 'seed150p', 'sapling'))
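levels(acad_stats$size_class_fact) # quick check that the levels are in the order we set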

acad_bar2 <- ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
             geom_bar(stat = 'identity')+ 
             geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
             labs(x = "Regeneration Size Class", y = "Stems per sq.m")

acad_bar2

acad_bar2 + theme(panel.grid.minor = element_blank(),
                  panel.grid.major = element_blank())
acad_bar2 + theme_FHM()

acad_final <- ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
              geom_bar(stat = 'identity', fill = "#5E79A8")+
              geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens))+
              labs(x = "Regeneration Size Class", y = "Stems per sq.m")+
              theme_FHM()

acad_final
library(NCRNWater)

netnwd<-importNCRNWater(Dir = "./data/NETN_water_data", 
                        Data = "Water Data.csv", 
                        MetaData = "VizMetaData.csv")
# Create data frame that lists the sites and their type using the getSiteInfo getter.
acad_all_sites <- data.frame(sitecode = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteCode"),
                             type = getSiteInfo(netnwd, parkcode = "ACAD", info = "type"),
                             fullname = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteName"))
head(acad_all_sites)

# Filter data frame to only include lakes
acad_lakes <- acad_all_sites %>% filter(type == "Lake")  #list of site codes and full names for ACAD lakes
acad_lakes

# Compile the water pH data using getWData function
acad_ph <- getWData(netnwd, parkcode = "ACAD", years = 2006:2019, charname = 'pH')

# Filter water data to only include lakes
acad_lakes_ph <- acad_ph %>% filter(Site %in% acad_lakes$sitecode)
table(acad_lakes_ph$Site) # 8 sites with >90 measurements

# Turn the table above into a data frame, and use it to filter the water data.
annual_lakes <- data.frame(table(acad_lakes_ph$Site)) %>% filter(Freq > 90) 

# Filter acad_lakes_pH to only include the annual lakes and add column for month and year
acad_ph2 <- acad_lakes_ph %>% filter(Site %in% annual_lakes$Var1) %>% 
                              mutate(month = format(Date, "%m"),
                                     year = format(Date, "%Y"))
head(acad_ph2)
acad_ph_final <- left_join(acad_ph2, acad_lakes[, c("sitecode", "fullname")], 
                           by= c("Site" = "sitecode")) # how to join if columns have diff. names

head(acad_ph_final) #final dataset to plot

ph_plot <- ggplot(acad_ph_final, aes(x = Date, y = Value, color = Site, shape = Site))+
           geom_point()+ 
           labs(y = "pH")+
           theme_bw()
ph_plot
ph_plot_site <- ggplot(acad_ph_final, aes(x = Date, y = Value))+
                geom_point(shape = 16, color = "LightBlue")+ 
                labs(y = "pH")+
                theme_bw()+
                facet_wrap(~fullname)
ph_plot_site
ph_plot_site_smooth <- ph_plot_site + geom_smooth(method = 'loess', span = 1, color = "RoyalBlue", se = FALSE)

# Vary the span to increase/decrease the waviness of the smooth
ph_plot_site_smooth + geom_smooth(method = 'loess', span = 0.1, color = "red", se = FALSE)+
                      geom_smooth(method = 'loess', span = 0.5, color = 'green', se = FALSE)      

ph_plot_site <- ggplot(acad_ph_final, aes(x = Date, y = Value))+
                geom_point(shape = 16, color = "LightBlue")+ 
                labs(y = "pH")+
                theme_bw()+
                facet_wrap(~fullname)+
                geom_smooth(method = 'loess', span = 0.5, color = 'RoyalBlue', se = FALSE)

ph_plot_site2 <- ph_plot_site + scale_x_date(date_breaks = "2 years",
                                             date_labels = "%Y")+
                                theme(axis.text.x = element_text(angle = 45, hjust = 1))

ph_plot_site2
#--------------------
#    Day 3 Mod 3
#--------------------
library(tidyverse)
library(rgdal) # for GIS functions
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
table(wetdat$Site_Name, wetdat$Year) # SEN most recent is 2011, RAM is 2017
wetdat_rich <- wetdat %>% filter(Year %in% c(2011, 2017)) %>% 
                          mutate(present = ifelse(!is.na(Ave_Cov) & Ave_Cov > 0, 1, 0)) %>% 
                          group_by(Site_Name, Year, Site_Type, X_Coord, Y_Coord) %>% 
                          summarize(numspp = sum(present), 
                                    .groups = 'drop')
head(wetdat_rich)
# Step 1: Create spatial object by specifying the coordinates
coordinates(wetdat_rich) <- ~X_Coord + Y_Coord # sp function (loaded with rgdal) that converts the data frame to a spatial points object
head(wetdat_rich)
plot(wetdat_rich) # These points now plot in space (the bottom-left point is on Isle au Haut)

# Step 2: Define the coordinate system (EPSG 26919 = NAD83 UTM Zone 19N)
proj4string(wetdat_rich) <- CRS("+init=epsg:26919")

# Step 3: Write the spatial object to a shapefile
writeOGR(wetdat_rich, dsn = "./shapefiles", layer = "wetdat_rich", driver = "ESRI Shapefile", 
         overwrite_layer = TRUE) # overwrite is in case you need to update your shapefile.
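# Heads up: rgdal is headed for retirement. If you ever need a replacement, the
# sf package can write the same shapefile (untested sketch, assuming sf is
# installed; st_as_sf() converts sp objects to sf):
# library(sf)
# st_write(st_as_sf(wetdat_rich), "./shapefiles/wetdat_rich.shp",
#          delete_layer = TRUE) # delete_layer plays the role of overwrite_layer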

# Load packages
library(rgdal) # for readOGR
library(ggplot2) #for ggplot functions
library(ggspatial) #to make ggplot functions spatial

# Load shapefiles
bounds <- readOGR("./data/example_data/ParkBndCARTO_Fee_2020.shp", layer = "ParkBndCARTO_Fee_2020",
                  GDAL1_integer64_policy = TRUE) # Don't ask me why this is here. All I know is it works.
head(bounds@data) # How you view the attribute table of a shapefile in R

wetmap <- readOGR("./data/example_data/ACAD_vegmap_wetlands.shp", layer = "ACAD_vegmap_wetlands",
                  GDAL1_integer64_policy = TRUE)

wet_rich_shp <- readOGR("./shapefiles/wetdat_rich.shp", layer = "wetdat_rich",
                        GDAL1_integer64_policy = TRUE)
head(wet_rich_shp@data)
head(wet_rich_shp@coords) # Where the coordinates are stored
plot(bounds)
# Prepare shapefiles for plotting
bounds2 <- fortify(bounds) # convert the polygons to a data frame that ggplot can plot
str(bounds2) # note the data frame structure that describes the boundary data

wetmap2 <- fortify(wetmap)
str(wetmap2)

# Turn the wetland richness shapefile back into a dataset
  # One reason to do this in R: you never actually needed the shapefile to plot these points
wet_rich_df <- cbind(as.data.frame(wet_rich_shp@data), 
                     long = wet_rich_shp@coords[,1],
                     lat = wet_rich_shp@coords[,2])
ggmap <- ggplot()+
         coord_sf(crs = 26919)+  # set coordinate system to NAD83 UTM Zone 19N using EPSG code      
         geom_polypath(data = bounds2, aes(x = long, y = lat, group = group),
                       fill = "grey", alpha = 0.4)+
         geom_polypath(data = wetmap2, aes(x = long, y = lat, group = group),
                       fill = "#80A8D1")+
         geom_point(data = wet_rich_df, 
                           aes(x = long, 
                               y = lat, 
                               color = Site_Type),
                            shape = 17)+
         theme_minimal() # because I don't like ggplot's default grey background and grid lines

ggmap

wet_rich_dd <- spTransform(wet_rich_shp, CRS="+proj=longlat +datum=WGS84")
plot(wet_rich_dd) #check that they still plot spatially
library(leaflet)

park_tile_basic = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"

map_leaf <- leaflet() %>% 
 setView(lng = -68.312,
         lat = 44.25,
         zoom = 10) %>% 
 addTiles(urlTemplate = park_tile_basic) %>% 
 addCircleMarkers(data = wet_rich_dd,
                  radius = 4, 
                  lng = wet_rich_dd@coords[,1],
                  lat = wet_rich_dd@coords[,2],
                  label = wet_rich_dd@data$Site_Name,
                  labelOptions = labelOptions(noHide=T, textOnly = TRUE, 
                                              direction = 'bottom', textsize = "12px"),
                  fillColor = "#4C9B4F",
                  fillOpacity = 1,
                  stroke = FALSE) %>% 
 addScaleBar(position = 'bottomright',
             options = scaleBarOptions(maxWidth = 10, metric = TRUE)) # scaleBarOptions() is an argument to addScaleBar(), not a pipe step

map_leaf
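# To share the map outside RStudio, one option (a sketch, assuming the
# htmlwidgets package is installed) is to save it as a standalone html file:
# htmlwidgets::saveWidget(map_leaf, "ACAD_wetland_map.html")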
#--------------------
#    Day 1 Answers
#--------------------
# If starting from scratch
library(tidyverse)
df <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 'plot7'), #column for plot number
                 numtrees = c(24, 18, 17, 14, 31, 27, 15), 
                 numLive = c(21, 18, 16, 10, 29, 26, 14),
                 domSpp = c('red_maple', 'red_spruce', 
                            'red_spruce','white_cedar','white_pine',
                            'white_spruce', NA),
                 invasive_pres = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE))


df[3, c("domSpp")] # option 1
df[3, 4] # option 2
df$numtrees <- as.integer(df$numtrees)
str(df)
df$invasive_pres[2] <- TRUE
head(df)
# base R version (2 options)
df$numDead <- df$numtrees - df$numLive
df$pctDead <- df$numDead/df$numtrees*100

df$pctDead <- with(df, (numtrees - numLive)/numtrees*100)

# tidyverse version (2 options)
df <- df %>% mutate(numDead = numtrees - numLive,
                    pctDead = numDead/numtrees*100)

df <- df %>% mutate(pctDead = (numtrees - numLive)/numtrees*100)
min(df$numDead) # uses the numDead column calculated for the previous question
summary(df$numDead)[1] # Min. is the first element summary() reports for numeric columns
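# If you also want to know which plot has the fewest dead trees (a sketch;
# which.min() returns the row position of the first minimum):
df[which.min(df$numDead), "plot"]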
# If starting from scratch
wetdat<-read.csv("./data/example_data/ACAD_wetland_data_clean.csv")

wetinv <- wetdat[wetdat$Invasive == TRUE, ] # base R bracket-based solution
wetinv2 <- subset(wetdat, Invasive == TRUE) # another base R solution
wetinv_tidy <- wetdat %>% filter(Invasive == TRUE) # tidyverse solution
# Base R solution
wetspp <- wetdat[wetdat$Latin_Name %in% c("Arethusa bulbosa", "Carex exilis"),]
sort(unique(wetspp$Site_Name))

# Tidyverse solution
wetspp_tidy <- wetdat %>% filter(Latin_Name %in% c("Arethusa bulbosa", "Carex exilis")) 
sort(unique(wetspp_tidy$Site_Name)) # RAM-05 and SEN-01 have both species present
wetdat_sort <- wetdat[order(-wetdat$Ave_Cov, wetdat$Latin_Name), ] # base R
head(wetdat_sort)
wetdat_sort_tidy <- wetdat %>% arrange(-Ave_Cov, Latin_Name) #tidyverse
head(wetdat_sort_tidy)
#--------------------
#    Day 2 Answers
#--------------------
sort(unique(wetdat$Latin_Name))[1:10] # Low-tech way: use your eyes to spot the "Alnus incana++". I truncated to the first 10 values.

# The regular expression way to find non-alphabetical symbols (now you know why I didn't cover this).
wetdat$Latin_Name[grep("[^[:alpha:]]+$", wetdat$Latin_Name)] # flags records that end in a non-letter character
  # [^[:alpha:]] matches any character that is NOT A-Z or a-z (^ inside brackets means "not"),
  # and +$ anchors the match to one or more of those characters at the end of the record.
  # This took longer than I'd like to admit to make work, by the way.

wetdat$Latin_Name[wetdat$Latin_Name == "Alnus incana++"] <- "Alnus incana" # easy approach
wetdat$Latin_Name <- gsub("[+]", "", wetdat$Latin_Name) # gsub approach
wetdat$Latin_Name[grep("[^[:alpha:]]+$", wetdat$Latin_Name)] # no values returned 
sort(unique(wetdat$Latin_Name))[1:10] # using your eyes approach
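# Note on the gsub() pattern: + is a regex metacharacter, so it's wrapped in
# brackets to match a literal plus sign. An equivalent that skips regex entirely:
# gsub("+", "", wetdat$Latin_Name, fixed = TRUE)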

# In case you need to load the tables
loc <- read.csv("./data/example_data/tbl_Locations_example.csv")
event <- read.csv("./data/example_data/tbl_Events_example.csv")  
stand <- read.csv("./data/example_data/tbl_Stand_Data_example.csv") 

# Filter then join
loc_MDI <- loc %>% filter(Unit_ID == "ACAD_MDI")
locev_MDI <- left_join(loc_MDI, event, by = "Location_ID")

# All in one via pipes:
locev_MDI2 <- loc %>% filter(Unit_ID == "ACAD_MDI") %>% 
                      left_join(event, by = "Location_ID")

locev <- merge(loc, event[event$Event_QAQC == FALSE, ], all.x = TRUE, all.y = TRUE) # full join of locations and non-QAQC events
intersect(names(locev), names(stand)) # find the column name(s) the two tables share
locev_stand <- merge(locev, stand, by = "Event_ID", all.x = TRUE, all.y = TRUE) #base R full join
locev_stand_tidy <- full_join(locev, stand, by = "Event_ID") #tidyverse full join
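# Quick sanity check that the base R and tidyverse full joins return the same
# number of rows and columns:
dim(locev_stand)
dim(locev_stand_tidy)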
head(wetdat)
wetmin_max <- wetdat %>% mutate(present = ifelse(Ave_Cov >0 & !is.na(Ave_Cov), 1, 0)) %>% 
                         group_by(Site_Name, Year) %>% 
                         summarize(numspp = sum(present), 
                                   .groups = "drop") %>% 
                         group_by(Year) %>% 
                         summarize(minspp = min(numspp),
                                   maxspp = max(numspp), 
                                             .groups = "drop")
wetmin_max
#head(wetdat)
wetinv1 <- wetdat %>% filter(Site_Type == "RAM") %>% 
                      group_by(Site_Name, Year) %>% 
                      summarize(num_invspp = sum(Invasive), 
                                plot_inv = ifelse(num_invspp > 0, 1, 0), 
                                .groups = "drop")
head(wetinv1)
# Note: plot_inv flags each plot as 1 or 0 for invasive presence, even if multiple invasives were found on the plot.
wetinv <- wetinv1 %>% group_by(Year) %>% 
                      summarize(tot_inv_plots = (sum(plot_inv)/n())*100,
                                .groups = "drop")

wetinv 
#--------------------
#    Day 3 Answers
#--------------------
# In case you're starting from scratch
library(tidyverse)
bat_long <- read.csv("./data/example_data/fake_bats.csv")
bat_long <- bat_long %>% mutate(sppcode = toupper(paste0(substr(word(Latin, 1), 1, 3),
                                                         substr(word(Latin, 2), 1, 3))))

MYOSEP_sum <- bat_long %>% filter(sppcode == "MYOSEP") %>% 
                           group_by(Site, Year) %>% 
                           summarize(num_bats = sum(ifelse(!is.na(Latin), 1, 0)), 
                                     .groups = 'drop') 

head(MYOSEP_sum) # Note only the sites and years where MYOSEP was found are listed. The next lines add the missing combinations back before pivoting
bat_site_year <- bat_long %>% expand(Site, Year)

MYOSEP_comp <- bat_site_year %>% left_join(., MYOSEP_sum, by = c("Site", "Year")) 
MYOSEP_comp$num_bats[is.na(MYOSEP_comp$num_bats)] <- 0 # Replace all the NAs created by left join with 0

MYOSEP_wide <- MYOSEP_comp %>% arrange(Site, Year) %>% 
                               pivot_wider(names_from = Year, values_from = num_bats, values_fill = 0)
MYOSEP_wide
# If you're starting from scratch
wetdat <- read.csv("./data/example_data/ACAD_wetland_data_clean.csv")
wetdat_prep <- wetdat %>% mutate(sppcode = toupper(paste0(substr(word(Latin_Name, 1), 1, 3),
                                                          substr(word(Latin_Name, 2), 1, 3)))) 

wetdat_dist <- wetdat_prep %>% distinct(Site_Name, Year) # tibble of existing Site_Name/Year combinations; unlike expand(), doesn't add new ones

wetdat_wide <- wetdat_prep %>% filter(Protected == TRUE) %>% 
                               select(Site_Name, Year, sppcode, Ave_Cov) %>% 
                               pivot_wider(names_from = sppcode, values_from = Ave_Cov, values_fill = 0) %>% 
                               left_join(wetdat_dist, ., by = c("Site_Name", "Year")) 
wetdat_wide[, c("AREBUL", "POGOPH", "CALTUB")][is.na(wetdat_wide[, c("AREBUL", "POGOPH", "CALTUB")])] <- 0
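# A tidyverse version of the NA replacement above (a sketch; across() applies
# replace_na() to each of the three species columns):
wetdat_wide <- wetdat_wide %>% mutate(across(c(AREBUL, POGOPH, CALTUB),
                                             ~replace_na(.x, 0)))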

# Starting from scratch
library(forestNETNarch)
importCSV("./data/NETN_forest_csvs")
acad_exo <- makeSppList(park = "ACAD", from = 2019, to = 2019, speciesType = "exotic") 
acad_exo <- acad_exo %>% mutate(present = ifelse(Exotic == TRUE & !is.na(Exotic), 1, 0))
head(acad_exo)
# Now to answer the question
# Old school spread version
acad_wide_spread <- acad_exo %>% select(Plot_Name, Latin_Name, present) %>% 
                                 arrange(Latin_Name, Plot_Name) %>% 
                                 spread("Latin_Name", "present", fill = 0)

# Newer pivot version
acad_wide_pivot <- acad_exo %>% select(Plot_Name, Latin_Name, present) %>% 
                                arrange(Latin_Name, Plot_Name) %>% 
                                pivot_wider(names_from = "Latin_Name", 
                                            values_from = "present",
                                            values_fill = 0) 

head(acad_wide_pivot)
# Starting from scratch
library(forestNETNarch) # same package as loaded in Day 3 Mod 2
importCSV("./data/NETN_forest_csvs")
acad_regen <- joinRegenData(park = "ACAD", from = 2016, to = 2019, canopyForm = "canopy")

acad_regen2 <- acad_regen %>% group_by(Plot_Name) %>% 
                              summarize(seed15_30 = sum(seed15.30),
                                        seed30_100 = sum(seed30.100),
                                        seed100_150 = sum(seed100.150),
                                        seed150p = sum(seed150p),
                                        sapling = sum(sap.den),
                                        .groups = 'drop') %>% 
                              arrange(Plot_Name)

acad_regen_long <- acad_regen2 %>% pivot_longer(-Plot_Name, 
                                                names_to = "size_class", 
                                                values_to = "density")

# Now we calculate the mean and se for each size class at the park level
acad_stats <- acad_regen_long %>% group_by(size_class) %>% 
                                  summarize(mean_dens = mean(density),
                                            se_dens = sd(density)/sqrt(n()),
                                            .groups = 'drop')

acad_stats$size_class_fact <- ordered(acad_stats$size_class, 
                                      levels = c('seed15_30', 'seed30_100', 
                                                 'seed100_150', 'seed150p', 'sapling'))
# Now to answer the question
ggplot(data = acad_stats, aes(x = size_class_fact, y = mean_dens))+
              geom_bar(stat = 'identity', fill = "steelblue4")+
              geom_errorbar(aes(ymin = mean_dens - se_dens, ymax = mean_dens + se_dens),
                            width = 0.2, lwd = 0.7)+
              labs(x = "Regeneration Size Class", y = "Stems per sq.m")+
              theme_FHM()+
              theme(axis.text.x = element_text(angle = 45, hjust = 1))+ # hjust = 1 right-justifies the angled labels under the tick marks
              scale_x_discrete(labels = c('15-30cm', '30-100cm', '100-150cm', '>150cm', '1-10cm DBH'))
             
# Starting from scratch
library(NCRNWater)

netnwd<-importNCRNWater(Dir = "./data/NETN_water_data", 
                        Data = "Water Data.csv", 
                        MetaData = "VizMetaData.csv")

acad_all_sites <- data.frame(sitecode = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteCode"),
                             type = getSiteInfo(netnwd, parkcode = "ACAD", info = "type"),
                             fullname = getSiteInfo(netnwd, parkcode = "ACAD", info = "SiteName"))

acad_lakes <- acad_all_sites %>% filter(type == "Lake")  #list of site codes and full names for ACAD lakes

acad_ph <- getWData(netnwd, parkcode = "ACAD", years = 2006:2019, charname = 'pH')
acad_lakes_ph <- acad_ph %>% filter(Site %in% acad_lakes$sitecode)
annual_lakes <- data.frame(table(acad_lakes_ph$Site)) %>% filter(Freq > 90) 

acad_ph2 <- acad_lakes_ph %>% filter(Site %in% annual_lakes$Var1) %>% 
                              mutate(month = format(Date, "%m"),
                                     year = format(Date, "%Y"))

acad_ph_final <- left_join(acad_ph2, acad_lakes[, c("sitecode", "fullname")], 
                           by= c("Site" = "sitecode")) # how to join if columns have diff. names

#head(acad_ph_final)

# Now to answer the question
acad_ph_2019 <- acad_ph_final %>% filter(year == 2019) %>% mutate(mon_abr = format(Date, "%b"))

plot19 <- ggplot(acad_ph_2019, aes(x = Date, y = Value))+
             geom_point()+
             labs(y = "pH")+
             geom_smooth(method = 'loess', span = 1, se = FALSE)+
             scale_x_date(date_breaks = "1 month",
                          date_labels = "%b")+ # %b gives abbreviated month names
             facet_wrap(~fullname, ncol = 4)+
             theme_bw()

plot19