NPS IMD Intro to R Training: March 24 - 26, 2026

Prep for Training

Installing required software

The only prerequisite for the R training is to install the latest version of R and RStudio on your computer. These should be available in the Company Portal in Entra, and shouldn’t require special permissions to install. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed.

Your R version should be at least 4.4.3 to make sure everyone’s code behaves the same way. Most likely, the R version you have is 4.5.2, and the RStudio version is 2025.06.2-418 or higher. This is how it may appear in Company Portal. If your machine hasn’t migrated to Entra yet, it may appear differently.

R in company portal

Required Packages

A number of packages are required to follow along with the data wrangling and visualization sessions. Please try to install these in RStudio ahead of time by running the code below. If you don’t know how to run the code, view the Running Code Screencast below to see how.

packages <- c("tidyverse", # for Day 2 and 3 data wrangling
              "RColorBrewer", "viridis", "patchwork", # for Day 3 ggplot
              "readxl", "writexl") # for day 1 importing from excel

install.packages(setdiff(packages, rownames(installed.packages())))  

# Check that installation worked
library(tidyverse) # turns on core tidyverse packages
library(RColorBrewer) # palette generator
library(viridis) # more palettes
library(patchwork) # multipanel plots
library(readxl) # reading xlsx
library(writexl) # writing xlsx
Running Code Screencast


Optional Reading This is completely optional, but if you have any time before training starts, I highly recommend reading Chapter 2: R basics and workflows in STAT545. STAT545 is an online book that Dr. Jenny Bryan is developing from the graduate-level statistics class she teaches. She’s a stats professor turned senior programmer at RStudio, and I really like how she approaches programming in R in this book.


About the Training
  • Timing: The training will take place over 3 half days. Each day will run from 9am - 1pm EST via MS Teams. We’ll try to take short breaks every 30 minutes or so, along with a 15-minute break at noon EST for lunch. I know this cuts eastern folks’ lunch short, and I appreciate everyone being flexible on this.
  • Office Hours: On the day prior to training and each afternoon following training, I will be available from 1 - 3pm EST for office hours, in case there are questions that couldn’t be handled before or during training. I will send participants a Teams link for these office hours. Teams chats are also fine for posing questions.
  • Structure: For most of the training, I will share my screen as I go through the website and then demo with live coding. Having 2 screens, one for my screen share and one for your R session, will make following along a lot easier.
  • Getting Help: I intentionally included people in this training I know to be kind, capable, and that will benefit from having better R skills. My hope is that this group is small and supportive enough that everyone feels comfortable openly asking questions and providing feedback to the group. However, if you aren’t comfortable asking questions openly, you can ask questions in the anonymous training feedback form. Additionally, if someone runs into an issue we can’t immediately troubleshoot (it happens), we may have to table it until office hours. We will then discuss it with the group the next day (if relevant).
  • Objectives: Three days is barely enough to scratch the surface on what you can do in R. My goals with this training are to:
    1. Help you get beyond the initial learning curve that can be really challenging to climb on your own.
    2. Expose you to what I consider the most useful things R can do for us.
    3. Provide you the tools needed to continue advancing your R skills on your own.
  • Credit: This training borrows heavily from the IMD Intro to R training in 2022. A ton of credit goes to the developers of those lessons:
    • Day 1 Intro to R: Sarah Wright and Andrew Birch
    • Day 2 Data Wrangling: John Paul Schmit and Lauren Pandori
    • Day 3 Data Visualization: Ellen Cheng and Kate Miller (Spatial Data)
    • Day 4 Programming Best Practices: Sarah Wright and Thomas Parr
  • Feedback: Finally, to help me improve this training for future sessions, please leave feedback in the training feedback form. You can submit feedback multiple times and don’t need to answer every question. Responses are anonymous.


Day 1: Intro to R

Day 1 Goals

Goals for Day 1:

  1. Get comfortable navigating RStudio, such as opening a new project or script, running code and viewing the output, etc.
  2. Ability to import and save .csv and .xlsx files.
  3. Basic understanding of variables, functions, and data frames.
  4. Ability to explore data frames, such as dimensions (rows and columns), min/max of different columns, data type of column, basic plotting, etc.
  5. Basic understanding of square brackets to view and filter data.frame[rows, columns].
  6. Exposure to NAs (blanks).
  7. Ability to access help within and outside of R.


Feedback: Please leave feedback in the training feedback form. You can submit feedback multiple times and don’t need to answer every question. Responses are anonymous.

R journey
Artwork by @allison_horst


Intro to R

Why I love R:

R welcoming illustration Artwork by @allison_horst

There are many reasons to use R. Some of my top reasons are below:
  • It’s free!
  • Thorough, helpful, and welcoming user community, including a ton of freely available online help and learning resources (see Resources tab).
  • Large user community in NPS to collaborate and share code.
  • Language was designed by statisticians to facilitate data analysis and visualization.
    • Relatively shallow learning curve compared to other coding languages (e.g. python).
    • The developers’ philosophy was that you shouldn’t have to know how to program to learn R. Then, as you become more advanced and tackle more complicated tasks, learning how to program will benefit your work.
  • Code documents your workflow.
  • Code builds on code.
  • Automating tasks, like QA/QC, compiling/querying data, calculating summary statistics, and building dashboards has improved our data quality and made our data accessible and easy to work with for other users.

Other benefits of R:
  • Base R maintains backwards compatibility, so code written in base R should run regardless of R version.
    • The caveat is that packages are not guaranteed to be backwards compatible.
    • Python, by contrast, is not backwards compatible.
  • The tidyverse, which is a collection of really useful R packages, makes code more readable and consistently formatted. Tidyverse packages aren’t always backwards compatible, but they tend to be pretty stable and are super helpful for data wrangling and plotting.
  • RStudio can integrate with other coding languages, such as SQL, HTML/CSS, python and javascript.

Recipe for learning R Everyone learns differently, but the ingredients I find best ensure success are:
  • Community: Finding a group of other R users you can reach out to when you’re stuck or need feedback is invaluable. I was lucky to be part of a group that was learning R together. I still collaborate with many of these folks. I’m hoping you all see this group as your R community.
  • Persistence: Keep trying new things in R, even if you ultimately have to abandon attempts and go back to what you know, like Access or Excel. Persistence pays off.
  • Fearlessness: You have to be okay with failing the first, second, or tenth time you try to solve a problem. As you get more comfortable, your success rate will improve. Throughout that entire process, you’re learning R.
  • Googling: Half of being a good coder is learning how to google what you’re trying to do, or the issue you’re having. At first, you may not find the answers you’re looking for, but by reading help pages, like StackOverflow, you’re learning to read code and seeing solutions that may help you in the future.
Debugging rollercoaster
Artwork by @allison_horst

AI soapbox Why I don’t use AI to write code:
  • Behind the scenes, AI is taking answers from websites and other sources without crediting them.
  • There’s a huge environmental footprint to run the generative AI servers.
  • Research has shown that people who use AI to write for them lose their ability to write and think critically over time. Writing code isn’t that different. If you’re not actively writing the code you use, your ability to debug it and to verify it’s doing what you expect may weaken. A recent article in Frontiers in Ecology and the Environment cautioning early career scientists against using AI for scientific writing captures this concern well.
  • To suppress AI-generated answers in Google search results, add -ai to your search terms.

R and RStudio

About R

R is a programming language that was originally developed by statisticians for statistical computing and graphics. R is free and open source. That means you will never need a paid license to use it, and you can view the underlying source code of any function and suggest fixes and improvements. Since its first official release in 1995, R has remained one of the leading programming languages for statistics and data visualization, and its capabilities continue to grow.

When you install R, it comes with a simple user interface that lets you write and execute code. However, writing code in this interface is similar to writing a report in Notepad: it’s simple and straightforward, but you likely need more features than Notepad has to format your document. This is where RStudio comes in.

For more information on the history of R, visit the R Project website.


About RStudio RStudio is what’s called an integrated development environment (IDE), which is essentially a shell around the R program. RStudio makes programming in R easier by color coding different types of code, auto-completing code, flagging mistakes, and providing many useful tools with the push of a button or key stroke (e.g. viewing help info).


RStudio Anatomy When you open RStudio, you typically see 4 panes:
RStudio panes

Source

This is primarily where you write code. When you create a new script or open an existing one, it displays here. In the screenshot above, there’s a script called bat_data_wrangling.R open in the source pane. Note that if you haven’t yet opened or created a new script, you won’t see this pane until you do.

The source pane color-codes your code to make it easier to read, and detects syntax errors (the coding equivalent of a spell checker) by flagging the line number with a red “x” and showing a squiggly line under the offending code.

When you’re ready to run all or part of your script:
  • Highlight the line(s) of code you want to run
  • Either click the “Run” button (top right of the source pane) or press Ctrl+Enter.
At this point, the code is sent to the console (the bottom left pane). You’ll first see your code appear in the console, and then you’ll see the output if there is any.

Console

This is where the code actually runs. When you first open RStudio, the console will tell you the version of R that you’re running (should be R 4.4.3 or greater).

While most often you’ll run code from a script in the source pane, you can also run code directly in the console. Code in the console won’t get saved to a file, but it’s a great way to experiment and test out lines of code before adding them to your script in the source pane. The console is also where errors appear if your code breaks. Deciphering errors can be a challenge that gets easier over time. Googling errors is a good place to start.
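As a quick illustration (a made-up example, not part of the training datasets), here’s what a typical error looks like when a function gets the wrong kind of input:

```r
sqrt(9)           # runs fine
## [1] 3
# sqrt("nine")    # uncommenting this line produces the error below in the console:
## Error in sqrt("nine") : non-numeric argument to mathematical function
```

Over time you’ll start to recognize common messages like this one on sight.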

Environment/History/Connections
  • Environment: This is where you can see what is currently in your environment. Think of the environment as temporary storage for objects - things like datasets and stored values - that you are using in your script(s). You can also click on objects and view them. Anything you see in your environment is temporary and it will disappear when you restart R. If there is something in your environment that you want to access in the future, make sure your script is able to reproduce it (or save it to a file).
  • History: This shows the code you’ve run in the current session. It’s not good to rely on it, but it can be a way to recover code you ran in the console and later realized you needed in your script.
  • Connections: This is one way to connect R to a database.
  • Git: If you have installed Git on your computer, you may see a Git tab. We won’t talk much about it this week, but this is where you’ll keep track of changes to your code.
  • Tutorial: This has some interactive tutorials that you can check out if you are interested.

Workspace
  • Files: This tab shows the files within your working directory (typically the folder where your current code project lives). More on this later.
  • Plots: This tab will show plots that you create.
  • Packages: This tab allows you to install, load, and update packages, and also view the help files within each package. You can also access these files in code.
  • Help: Allows you to search for and view documentation for packages that are installed on your computer.
  • Viewer: Shows HTML outputs produced by R Markdown, R Shiny, and some plotting and mapping packages.


RStudio Global Options There are several settings in the Global Options that everyone should check to make sure we all have consistent settings. Go to Tools -> Global Options and follow the steps below.
  1. Under the General tab, you should see that your R Version is [64-bit] and the version is R-4.4.3 or greater. If it’s not, you probably need to update R. Let me know if you need help with this.
  2. Also in the General tab, make sure you are not saving your environment. To do this, uncheck the Restore .RData into your workspace at startup option. When this option is checked, RStudio saves your current working environment (the stuff in the Environment tab) when you exit. The next time you open RStudio, it restores that same environment.
    • This may seem useful, but part of the point of using R is that your code should return the same results every time you run it. Clearing the environment when you close RStudio forces you to run your code with a clean slate.
    • Set Save workspace to .RData on exit: to Never. The only reason not to choose “Never” is if you are working with a huge dataset that takes a long time to load and process. In that case, you may want to set Save workspace to .RData on exit to “Ask”. When you close RStudio, it will ask whether you want to save your workspace image.
  3. Change default pipe to base R pipe by going to the Code tab, and check the box Use native pipe operator, |> (requires R 4.1+). We will discuss what this pipe means tomorrow.
  4. Most other settings are whatever you prefer. For example, to change the color of your background and text, go to the Appearance tab. I prefer Cobalt.
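As a tiny preview of the native pipe (we’ll cover pipes properly later in the training): |> takes the value on its left and passes it as the first argument to the function on its right.

```r
# these two lines are equivalent; the pipe passes 9 into sqrt()
sqrt(9)
## [1] 3
9 |> sqrt()
## [1] 3
```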

Project and File Setup

File organization

File organization is an important part of being a good coder. Keeping code, input data, and results together in one place will protect your sanity and the sanity of the person who inherits the project. RStudio projects help with this. Creating a new RStudio project for each new code project makes it easier to manage settings and file paths.

Before we create a project, take a look at the Console tab. Notice that at the top of the console there is a folder path. That path is your current working directory.
Default working directory

If you refer to a file in R using a relative path, for example ./data/my_data_file.csv, R will look in your current working directory for a folder called data containing a file called my_data_file.csv.

Note the use of forward slashes instead of back slashes for file paths. In R, you can use either a forward slash (/) or a double back slash (\\) in file paths. The two paths below are equivalent, and both spell out the full version of the relative path above on my computer.

# forward slash file path approach
"C:/Users/KMMiller/OneDrive - DOI/data/"

# backward slash file path approach
"C:\\Users\\KMMiller\\OneDrive - DOI\\data\\"
Using relative paths is helpful because a full path is specific to your computer and likely won’t work on a different computer. But there’s no guarantee that everyone has the same default R working directory. This is where projects come in. Projects keep all of your code, data, output, etc. together in one folder that is easily transferable to other machines regardless of file location.
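If you ever want to check your working directory or a path from code, base R has a few helpers. A quick sketch (my_data_file.csv is a hypothetical file, so expect FALSE unless you actually have it):

```r
getwd()                                  # prints the current working directory
file.path("data", "my_data_file.csv")    # builds a path with the correct slashes
## [1] "data/my_data_file.csv"
file.exists("./data/my_data_file.csv")   # TRUE only if the file actually exists there
```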

Start a new Project To demonstrate the value of a project, we’ll create and use one for this class. Click File > New Project. In the window that appears, select New Directory, then select New Project. You will be prompted for a directory name. This is the name of your project folder. For this class, call it imd_r_intro. Next, you’ll select what folder to keep your project folder in. Documents/R is a good place to store all of your R projects but it’s up to you. When you are done, click on Create Project.

Step 1. Select New Directory New project step 1

Step 2. Select New Project New project step 2

Step 3. Name project imd_r_intro Save project to a place you can find it. Don’t worry about whether the git repository box is checked or not.
New project step 3


If you successfully started a project named imd_r_intro, you should see it listed at the very top right of your screen. As you start new projects, you’ll want to check that you’re working in the right one before you start coding. Take a look at the Console tab again. Notice that your current working directory is now your project folder. When you look in the Files tab of the bottom right pane, you’ll see that it also defaults to the project folder.

We also want to create a folder called “data”, where we will store datasets we’re using for this class. To do that, you can either go to Windows Explorer and add a new folder, or run the code below. As long as you’re working within your project (the project name should be at the top right of the window), a folder named data will appear within your project. You can check that it worked by using the list.files() function, which lists everything in the working directory of your project.


Add Data Folder

Create data folder

dir.create("data")
list.files() # you should see a data folder listed 
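One small optional refinement: running dir.create("data") a second time triggers a warning because the folder already exists. Checking first avoids that:

```r
# only create the data folder if it doesn't already exist
if (!dir.exists("data")) {
  dir.create("data")
}
```

This pattern is handy in scripts you expect to re-run.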

Start Coding

Start a new script First let’s create a new R script file called day_1_script.R. Make sure you are working in the imd_r_intro project that you just created. Click on the New File icon new script in the top left corner. In the dropdown, select R Script. The source pane will appear with an untitled empty script. Go ahead and save it by clicking the Save icon (and make sure the Source on Save checkbox is deselected). Call your new script day_1_script.R.

Coding basics

We’ll start with something simple. Basic math in R is pretty straightforward and the syntax is similar to simply using a graphing calculator. You can use the examples below or come up with your own. Even if you’re using the examples, try to actually type the code instead of copy-pasting - you’ll learn to code faster that way.

To run a single line of code in your script, place your cursor anywhere in that line and press CTRL+ENTER (or click the Run button in the top right of the script pane). To run multiple lines of code, highlight the lines you want to run and hit CTRL+ENTER or click Run.

To leave notes in your script, use the hashtag/pound sign (#). This will change the color of text that R reads as a comment and doesn’t run. Commenting your code is one of the best habits you can form. Comments are a gift to your future self and anyone else who tries to use your code.

Type code below in your script and run each line.

# Commented text: try this line to generate some basic text and become familiar with where results will appear:
print("Welcome to R!")
## [1] "Welcome to R!"
# simple math
1+1
## [1] 2
(2*3)/4
## [1] 1.5
sqrt(9)
## [1] 3
# calculate basal area of a tree with 14.6cm diameter; note pi is a built-in constant in R
# basal area of a circle is pi * (diameter/2)^2
((14.6/2)^2)*pi
## [1] 167.4155
# get the cosine of 180 degrees - note that trig functions in R expect angles in radians
cos(pi)
## [1] -1

Coding Tip: Notice that when you run a line of code, the code and the result appear in the console. You can also type code directly into the console, but it won’t be saved anywhere. As you get more comfortable with R, it can be helpful to use the console as a “scratchpad” for experimenting and troubleshooting. For now, it’s best to err on the side of saving your code as a script so that you don’t accidentally lose useful work.


Variables

Occasionally, it’s enough to just run a line of code and display the result in the console. But typically our code is more complex than adding one plus one, and we want to store the result and use it later in the script. This is where variables come in. Variables allow you to assign a value (whether that’s a number, a data table, a chunk of text, or any other type of data that R can handle) to a short, human-readable name. Anywhere you put a variable in your code, R will replace it with its value when your code runs. Variables are also called objects in R.

R uses the <- symbol for variable assignment. If you’ve used other programming languages, you may be tempted to use = instead. It will work, but there are subtle differences between <- and =, so you should get in the habit of using <-.

R is case-sensitive. So if you name one object treedata and another Treedata or TREEDATA, R will interpret these all as unique objects. While you can do things like this, it’s best practice not to use the same name for different objects, as it makes code difficult to follow.
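To see case sensitivity in action, try the following (treedata and Treedata are throwaway example names):

```r
treedata <- 5
Treedata <- 10
treedata == Treedata  # two separate objects with different values
## [1] FALSE
```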

Type code below to assign values to variables named a and b

# the value of 12.098 is assigned to variable 'a'
a <- 12.098

# and the value 65.3475 is assigned to variable 'b'
b <- 65.3475

# we can now perform whatever mathematical operations we want using these two 
# variables without having to repeatedly type out the actual numbers:

a*b
## [1] 790.5741
(a^b)/((b+a))
## [1] 7.305156e+68
sqrt((a^7)/(b*2))
## [1] 538.7261

In the code above, we assign the variables a and b once. We can then reuse them as often as we want. This is helpful because we save ourselves some typing, reduce the chances of making a typo somewhere, and if we need to change the value of a or b, we only have to do it in one place.

Also notice that when you assign variables, you can see them listed in your Environment tab (top right pane). Remember, everything you see in the environment is just in R’s temporary memory and won’t be saved when you close out of RStudio.

All of the examples you’ve seen so far are fairly contrived for the sake of simplicity. Let’s take a look at some code that everyone here will make use of at some point: reading data from a CSV.


Functions

It’s hard to get very far in R without making use of functions. Think of a function as a programmed task that takes some kind of input (the argument(s)) from the user and outputs a result (the return value).

anatomy of a function


Coding Tip: Note how RStudio color codes what it thinks are functions. There are a lot of pre-programmed functions in base R, which is what comes with R when you install it. Installing R packages adds additional functions, and you can also build your own. Names that R recognizes as functions are color coded differently than what R recognizes as text, numbers, etc. It’s good practice not to use existing function names as names for new objects.

Commonly used base R functions include:
  • mean(): calculate the mean of a set of numbers
  • min(): calculate the minimum of a set of numbers
  • max(): calculate the maximum of a set of numbers
  • range(): calculate the min and max of a set of numbers
  • sd(): calculate the standard deviation of set of numbers
  • sqrt(): calculate the square root of a value

Calculate mean and range to see how functions work

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# equivalent to x <- 1:10

# bad coding
#mean <- mean(x)

# good coding 
mean_x <- mean(x)
mean_x
## [1] 5.5
range_x <- range(x)
range_x
## [1]  1 10
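The other functions in the list above work the same way. A quick sketch using the same x:

```r
x <- 1:10
min(x)
## [1] 1
max(x)
## [1] 10
sd(x)
## [1] 3.02765
sqrt(25)
## [1] 5
```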


Importing and Saving Data

Most of the work we do in R relies on one or more existing datasets that we want to query or summarize, rather than creating our own in R. Importing data in R is therefore an important skill. R can import just about any data type, including CSV and MS Excel files. You can also import tables from MS Access and SQL databases using ODBC drivers. That’s beyond the scope of this class, but I can share examples for anyone needing to import from a database. For now, I’ll show how to work with CSVs and Excel spreadsheets.

Import CSV

We use the read.csv() function to import CSVs in R. The read.csv() function takes the file path or url to the CSV as input and outputs a data frame containing the data from the CSV. Here we’re going to read a CSV from a website, then save that in the data folder of our project. We’ll talk more about what data frames are next.

Run the following line to import a teaching ACAD wetland data set from the github repository for this training

# read in the data from ACAD_wetland_data_clean.csv and assign it as a dataframe to the variable "ACAD_wetland"
ACAD_wetland <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_wetland_data_clean.csv"
  )

View the data in a separate window by running the View() function.

# View the ACAD_wetland data frame we just created
View(ACAD_wetland)

Or, check out the first few or last few records in your console. Click on View R output to view output.

# Look at the top 6 rows of the data frame
head(ACAD_wetland)
View R output
##   Site_Name Site_Type          Latin_Name           Common Year PctFreq Ave_Cov
## 1    SEN-01  Sentinel         Acer rubrum        red maple 2011       0    0.02
## 2    SEN-01  Sentinel         Amelanchier     serviceberry 2011      20    0.02
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary 2011      80    2.22
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry 2011     100    2.64
## 6    SEN-01  Sentinel        Carex exilis    coastal sedge 2011      60    6.60
##   Invasive Protected  X_Coord Y_Coord
## 1    FALSE     FALSE 574855.5 4911909
## 2    FALSE     FALSE 574855.5 4911909
## 3    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 5    FALSE     FALSE 574855.5 4911909
## 6    FALSE     FALSE 574855.5 4911909

# Look at the bottom 6 rows of the data frame
tail(ACAD_wetland)
View R output
##     Site_Name Site_Type                      Latin_Name
## 503    RAM-05       RAM             Vaccinium oxycoccos
## 504    RAM-05       RAM           Vaccinium vitis-idaea
## 505    RAM-05       RAM Viburnum nudum var. cassinoides
## 506    RAM-05       RAM Viburnum nudum var. cassinoides
## 507    RAM-05       RAM                   Xyris montana
## 508    RAM-05       RAM                   Xyris montana
##                         Common Year PctFreq Ave_Cov Invasive Protected X_Coord
## 503            small cranberry 2012     100    0.04    FALSE     FALSE  553186
## 504                lingonberry 2017      25    0.02    FALSE     FALSE  553186
## 505       northern wild raisin 2017     100    0.84    FALSE     FALSE  553186
## 506       northern wild raisin 2012     100   63.00    FALSE     FALSE  553186
## 507 northern yellow-eyed-grass 2017      50    0.44    FALSE     FALSE  553186
## 508 northern yellow-eyed-grass 2012      50    1.24    FALSE     FALSE  553186
##     Y_Coord
## 503 4899764
## 504 4899764
## 505 4899764
## 506 4899764
## 507 4899764
## 508 4899764


Save CSV

Now we’ll write the CSV to disk and then show how to import it from your computer.

# Write the data frame to your data folder using a relative path. 
# By default, write.csv adds a column with row names that are numbers. I don't
# like that, so I turn that off.
write.csv(ACAD_wetland, "./data/ACAD_wetland_data_clean.csv", row.names = FALSE)

Make sure the writing to disk worked by importing the CSV from your computer

# Read the data frame in using a relative path
ACAD_wetland <- read.csv("./data/ACAD_wetland_data_clean.csv")

# Equivalent code to read in the data frame using the full path on my computer; this won't work for other users.
ACAD_wetland <- read.csv("C:/Users/KMMiller/OneDrive - DOI/NETN/R_Dev/IMD_R_Training_2026/data/ACAD_wetland_data_clean.csv")

Import from XLSX

Base R does not have a way to import MS Excel files. The first step for working with Excel files (i.e., files with .xls or .xlsx extensions), therefore, is to install the readxl package to import .xlsx files and writexl to write files to .xlsx. The readxl package has a couple of options for loading Excel spreadsheets, depending on whether the extension is .xls, .xlsx, or unknown, along with options to import different worksheets within a spreadsheet.

The code below installs the required packages (in case you didn’t ahead of time), loads them, then writes the ACAD_wetland data we just imported to an .xlsx file. The last step imports the .xlsx version of the ACAD wetland data.

  1. Install packages (only if you haven’t already)

install.packages("readxl") # only need to run once
install.packages("writexl")

  2. Load packages

library(writexl) # saving xlsx
library(readxl) # importing xlsx

  3. Write the CSV to .xlsx in the data folder. I’m going in this order to keep this training stand-alone. The read_xlsx() function can’t read from a url like read.csv() can.

write_xlsx(ACAD_wetland, "./data/ACAD_wetland_data_clean.xlsx")

  4. Import the spreadsheet. Note that the default settings import the first sheet, so I didn’t really need to specify the sheet below. I included the sheet argument to show how it’s done.

ACAD_wetxls <- read_xlsx(path = "./data/ACAD_wetland_data_clean.xlsx", sheet = "Sheet1")

  5. View the top 6 rows to check the data

head(ACAD_wetxls)
    View R output
    ## # A tibble: 6 × 11
    ##   Site_Name Site_Type Latin_Name Common  Year PctFreq Ave_Cov Invasive Protected
    ##   <chr>     <chr>     <chr>      <chr>  <dbl>   <dbl>   <dbl> <lgl>    <lgl>    
    ## 1 SEN-01    Sentinel  Acer rubr… red m…  2011       0    0.02 FALSE    FALSE    
    ## 2 SEN-01    Sentinel  Amelanchi… servi…  2011      20    0.02 FALSE    FALSE    
    ## 3 SEN-01    Sentinel  Andromeda… bog r…  2011      80    2.22 FALSE    FALSE    
    ## 4 SEN-01    Sentinel  Arethusa … drago…  2011      40    0.04 FALSE    TRUE     
    ## 5 SEN-01    Sentinel  Aronia me… black…  2011     100    2.64 FALSE    FALSE    
    ## 6 SEN-01    Sentinel  Carex exi… coast…  2011      60    6.6  FALSE    FALSE    
    ## # ℹ 2 more variables: X_Coord <dbl>, Y_Coord <dbl>
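The readxl package also has an excel_sheets() function that lists the worksheets in a file, which is handy when you don’t know the sheet names. This assumes the .xlsx created in the step above exists in your data folder:

```r
library(readxl)  # for excel_sheets()
# list the worksheet names in the spreadsheet we just wrote
excel_sheets("./data/ACAD_wetland_data_clean.xlsx")
## [1] "Sheet1"
```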


Data Structures

Vectors

The data frame we just examined is a type of data structure. A data structure is what it sounds like: a structure that holds data in an organized way. There are multiple data structures in R, including vectors, lists, arrays, matrices, data frames, and tibbles (more on this data structure later). Today we’ll focus on vectors and data frames.

Vectors are the simplest data structure in R. A vector is like a single column of data in an Excel spreadsheet. Vectors have only one dimension, and their elements can be accessed by position (row number). Here are some examples of vectors:

digits <- c(1:10)  # Use x:y to create a sequence of integers starting at x and ending at y
digits
##  [1]  1  2  3  4  5  6  7  8  9 10
digits + 1 # note how 1 was added to every element of digits. 
##  [1]  2  3  4  5  6  7  8  9 10 11
is_odd <- rep(c(FALSE, TRUE), 5)  # Use rep(x, n) to create a vector by repeating x n times 
is_odd
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
tree_dbh <- c(12.5, 20.4, 18.1, 38.5, 19.3)
tree_dbh
## [1] 12.5 20.4 18.1 38.5 19.3
bird_ids <- c("black-capped chickadee", "dark-eyed junco", "golden-crowned kinglet", "dark-eyed junco")
bird_ids
## [1] "black-capped chickadee" "dark-eyed junco"        "golden-crowned kinglet"
## [4] "dark-eyed junco"

Note the use of c(). The c() function stands for combine: it combines the comma-separated elements into a single vector. The c() function is a fairly universal way to combine multiple elements in R, and you’re going to see it over and over. Also note how in digits + 1, every value in digits increased by 1. This highlights the concept of vectorization in R: a single operation applied to a vector (or a column in a data frame) is applied to every element of that vector.
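Vectorization also works element-wise between two vectors of the same length. Here is a minimal sketch using the tree_dbh values from above; the growth vector is made up for illustration:

```r
tree_dbh <- c(12.5, 20.4, 18.1, 38.5, 19.3)  # dbh in cm, from the example above
growth   <- c(0.5, 0.3, 0.4, 0.2, 0.6)       # hypothetical annual growth in cm

tree_dbh + growth  # element-wise sum: the 1st elements pair up, then the 2nd, etc.
tree_dbh * 2       # a single number is recycled across every element
```

Running the first line pairs up elements by position, so the result is 13.0, 20.7, 18.5, 38.7, 19.9.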

If you need to access a single element of a vector, you can use the syntax my_vector[x] where x is the element’s index (the number corresponding to its position in the vector). You can also use a vector of indices to extract multiple elements from the vector. Note that in R, indexing starts at 1 (i.e. my_vector[1] is the first element of my_vector). If you’ve coded in other languages, you may be used to indexing starting at 0.

second_bird <- bird_ids[2]
second_bird
## [1] "dark-eyed junco"
top_two_birds <- bird_ids[c(1,2)]
top_two_birds
## [1] "black-capped chickadee" "dark-eyed junco"
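A related trick worth recognizing when you see it in other people’s code: a negative index drops the element at that position instead of selecting it. A quick sketch with the same vector:

```r
bird_ids <- c("black-capped chickadee", "dark-eyed junco",
              "golden-crowned kinglet", "dark-eyed junco")

bird_ids[-1]        # everything except the first element
bird_ids[-c(1, 2)]  # drop the first two elements
```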

You can also return only unique values from a vector. The bird_ids vector has dark-eyed juncos listed twice. To get only unique species, run the following code. I also added sort() to sort the list alphabetically.

sort(unique(bird_ids))
## [1] "black-capped chickadee" "dark-eyed junco"        "golden-crowned kinglet"


Data Types

In the examples above, each vector contains a different type of data: digits contains integers, is_odd contains logical (TRUE/FALSE) values, bird_ids contains text, and tree_dbh contains decimal numbers. That’s because a given vector can only contain a single type of data.

In R, there are six main data types:

  • character: Regular text, denoted with double or single quotation marks (e.g. "hello", "3", "R is my favorite programming language")
  • numeric: Decimal numbers (e.g. 23, 3.1415)
  • integer: Integers. If you want to explicitly denote a number as an integer in R, append L to it or use as.integer() (e.g. 5L, as.integer(30)).
  • logical: True or false values (TRUE, FALSE). Note that TRUE and FALSE must be all-uppercase.
  • date-time: Specially formatted fields for dates or times, stored using R’s date-time classes (e.g. Date, POSIXct).
  • factor: Strings with defined levels (e.g., parks in your network) that are kept with the column even if no records exist for a given factor level. Factors used to be a lot more common in R. I typically only use them in plotting, to force a level order that’s not alphabetical.
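As a quick sketch of that last point, here is how a factor forces a non-alphabetical level order (the site types are illustrative):

```r
# Without a factor, R sorts these alphabetically ("Ramp" before "Sentinel").
# Defining levels explicitly fixes the order, and plotting functions respect it.
site_type <- factor(c("Sentinel", "Ramp", "Sentinel"),
                    levels = c("Sentinel", "Ramp"))
levels(site_type)
```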

You can use the class() function to get the data type of a vector:

class(bird_ids)
## [1] "character"
class(tree_dbh)
## [1] "numeric"
class(digits)
## [1] "integer"
class(is_odd)
## [1] "logical"


Data Frames

Data Frame Properties

Data frames are the main way we will be interacting with data in R. They’re essentially like spreadsheets in Excel, but with specific properties.

Properties of data frames:
  1. As the name implies, data frames are rectangular: each column has the same number of rows, and each row has the same number of columns. But a data frame can have different numbers of rows and columns (i.e. rectangular, not square).
  2. Data frames have 2 dimensions: First is always Rows, and second is always Columns.
  3. You can access the data within data frames by specifying the row, the column, or both at the same time.
  4. Each column in a data frame is assigned one of the data types we discussed above, such as numeric, integer, character, logical, or date-time.

Coding Tip: R is strict about assigning data types to columns: any text in an otherwise numeric field will turn the entire column into a character column. Similarly, if a field meant to be TRUE/FALSE contains anything besides TRUE, FALSE, or a blank, R will treat that field as character instead of logical. So if R treats a column you expect to be numeric as character, that’s a good clue there may be a typo or other issue in your data that needs attention. You can check the assigned data types using the str() function.

str(ACAD_wetland)
View R output
## 'data.frame':    508 obs. of  11 variables:
##  $ Site_Name : chr  "SEN-01" "SEN-01" "SEN-01" "SEN-01" ...
##  $ Site_Type : chr  "Sentinel" "Sentinel" "Sentinel" "Sentinel" ...
##  $ Latin_Name: chr  "Acer rubrum" "Amelanchier" "Andromeda polifolia" "Arethusa bulbosa" ...
##  $ Common    : chr  "red maple" "serviceberry" "bog rosemary" "dragon's mouth" ...
##  $ Year      : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ PctFreq   : int  0 20 80 40 100 60 100 60 100 100 ...
##  $ Ave_Cov   : num  0.02 0.02 2.22 0.04 2.64 6.6 10.2 0.06 0.86 4.82 ...
##  $ Invasive  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Protected : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ X_Coord   : num  574855 574855 574855 574855 574855 ...
##  $ Y_Coord   : num  4911909 4911909 4911909 4911909 4911909 ...
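The same coercion rule applies to plain vectors, which makes it easy to demonstrate. A minimal sketch with made-up cover values:

```r
# One stray character value coerces the whole vector to character
good_cov <- c(0.02, 2.22, 6.6)
bad_cov  <- c(0.02, "2.22*", 6.6)  # a typo like a stray "*" is enough

class(good_cov)  # "numeric"
class(bad_cov)   # "character" -- a clue the data needs cleaning
```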


Accessing Rows and Columns
Show me the $

One way to access the column dimension in data frames is to use the $ syntax. The $ is used to separate the data frame name from the column name. It’s similar to the [table_name].[column_name] syntax in Access.

To view the names of the columns in a data frame, you can use the names() function, or use head() to see the first 6 rows with column names. Whatever you prefer. I’ll use the former for now.

See column names in wetland data.

names(ACAD_wetland)
##  [1] "Site_Name"  "Site_Type"  "Latin_Name" "Common"     "Year"      
##  [6] "PctFreq"    "Ave_Cov"    "Invasive"   "Protected"  "X_Coord"   
## [11] "Y_Coord"

See list of all sites and species in the wetland data. You can view the output by clicking on the R output drop down.

ACAD_wetland$Site_Name
View R output
##   [1] "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01"
##   [9] "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01"
##  [17] "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01"
##  [25] "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01" "SEN-01"
##  [33] "SEN-01" "SEN-01" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02"
##  [41] "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02"
##  [49] "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02"
##  [57] "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02"
##  [65] "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02" "SEN-02"
##  [73] "SEN-02" "SEN-02" "SEN-02" "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03"
##  [81] "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03"
##  [89] "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03"
##  [97] "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03" "SEN-03"
## [105] "SEN-03" "SEN-03" "SEN-03" "SEN-03" "RAM-41" "RAM-41" "RAM-41" "RAM-41"
## [113] "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41"
## [121] "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41"
## [129] "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41"
## [137] "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41"
## [145] "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41"
## [153] "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41"
## [161] "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41"
## [169] "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41"
## [177] "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-41" "RAM-53" "RAM-53" "RAM-53"
## [185] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [193] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [201] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [209] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [217] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [225] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [233] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [241] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [249] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [257] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [265] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53"
## [273] "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-53" "RAM-62"
## [281] "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62"
## [289] "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62"
## [297] "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62"
## [305] "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62"
## [313] "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62"
## [321] "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62" "RAM-62"
## [329] "RAM-62" "RAM-62" "RAM-62" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44"
## [337] "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44"
## [345] "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44"
## [353] "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44"
## [361] "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44"
## [369] "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44"
## [377] "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44"
## [385] "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44"
## [393] "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44"
## [401] "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44" "RAM-44"
## [409] "RAM-44" "RAM-44" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [417] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [425] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [433] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [441] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [449] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [457] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [465] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [473] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [481] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [489] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [497] "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05" "RAM-05"
## [505] "RAM-05" "RAM-05" "RAM-05" "RAM-05"

ACAD_wetland$Latin_Name
View R output
##   [1] "Acer rubrum"                     "Amelanchier"                    
##   [3] "Andromeda polifolia"             "Arethusa bulbosa"               
##   [5] "Aronia melanocarpa"              "Carex exilis"                   
##   [7] "Chamaedaphne calyculata"         "Drosera intermedia"             
##   [9] "Drosera rotundifolia"            "Empetrum nigrum"                
##  [11] "Eriophorum angustifolium"        "Eriophorum vaginatum"           
##  [13] "Gaylussacia baccata"             "Gaylussacia dumosa"             
##  [15] "Ilex mucronata"                  "Juniperus communis"             
##  [17] "Kalmia angustifolia"             "Kalmia polifolia"               
##  [19] "Larix laricina"                  "Myrica gale"                    
##  [21] "Nuphar variegata"                "Picea mariana"                  
##  [23] "Rhododendron canadense"          "Rhododendron groenlandicum"     
##  [25] "Rhynchospora alba"               "Sarracenia purpurea"            
##  [27] "Solidago uliginosa"              "Symplocarpus foetidus"          
##  [29] "Trichophorum cespitosum"         "Trientalis borealis"            
##  [31] "Utricularia cornuta"             "Vaccinium oxycoccos"            
##  [33] "Viburnum nudum var. cassinoides" "Xyris montana"                  
##  [35] "Acer rubrum"                     "Arethusa bulbosa"               
##  [37] "Aronia melanocarpa"              "Betula populifolia"             
##  [39] "Carex exilis"                    "Carex lasiocarpa"               
##  [41] "Carex stricta"                   "Carex utriculata"               
##  [43] "Chamaedaphne calyculata"         "Drosera intermedia"             
##  [45] "Drosera rotundifolia"            "Dulichium arundinaceum"         
##  [47] "Eriophorum virginicum"           "Gaylussacia baccata"            
##  [49] "Ilex mucronata"                  "Ilex verticillata"              
##  [51] "Juncus acuminatus"               "Kalmia angustifolia"            
##  [53] "Kalmia polifolia"                "Larix laricina"                 
##  [55] "Lysimachia terrestris"           "Maianthemum trifolium"          
##  [57] "Muhlenbergia uniflora"           "Myrica gale"                    
##  [59] "Oclemena nemoralis"              "Picea mariana"                  
##  [61] "Pinus strobus"                   "Rhododendron canadense"         
##  [63] "Rhododendron groenlandicum"      "Rhynchospora alba"              
##  [65] "Rubus hispidus"                  "Sarracenia purpurea"            
##  [67] "Scutellaria lateriflora"         "Spiraea alba"                   
##  [69] "Spiraea tomentosa"               "Thuja occidentalis"             
##  [71] "Triadenum virginicum"            "Vaccinium angustifolium"        
##  [73] "Vaccinium corymbosum"            "Vaccinium macrocarpon"          
##  [75] "Vaccinium oxycoccos"             "Acer rubrum"                    
##  [77] "Alnus incana++"                  "Amelanchier"                    
##  [79] "Aronia melanocarpa"              "Carex stricta"                  
##  [81] "Carex trisperma"                 "Chamaedaphne calyculata"        
##  [83] "Cornus canadensis"               "Drosera rotundifolia"           
##  [85] "Eriophorum angustifolium"        "Eurybia radula"                 
##  [87] "Gaultheria hispidula"            "Gaylussacia baccata"            
##  [89] "Ilex mucronata"                  "Ilex verticillata"              
##  [91] "Kalmia angustifolia"             "Kalmia polifolia"               
##  [93] "Larix laricina"                  "Morella pensylvanica"           
##  [95] "Osmundastrum cinnamomea"         "Picea mariana"                  
##  [97] "Pinus banksiana"                 "Rhododendron canadense"         
##  [99] "Rhododendron groenlandicum"      "Sarracenia purpurea"            
## [101] "Sorbus americana"                "Spiraea alba"                   
## [103] "Thuja occidentalis"              "Trientalis borealis"            
## [105] "Vaccinium angustifolium"         "Vaccinium macrocarpon"          
## [107] "Vaccinium oxycoccos"             "Viburnum nudum"                 
## [109] "Acer rubrum"                     "Acer rubrum"                    
## [111] "Alnus incana"                    "Amelanchier"                    
## [113] "Amelanchier"                     "Aronia melanocarpa"             
## [115] "Berberis thunbergii"             "Calamagrostis canadensis"       
## [117] "Carex folliculata"               "Carex trisperma"                
## [119] "Carex trisperma"                 "Carex"                          
## [121] "Chamaedaphne calyculata"         "Chamaedaphne calyculata"        
## [123] "Cornus canadensis"               "Cornus canadensis"              
## [125] "Doellingeria umbellata"          "Doellingeria umbellata"         
## [127] "Dryopteris cristata"             "Empetrum nigrum"                
## [129] "Eriophorum angustifolium"        "Eriophorum angustifolium"       
## [131] "Gaylussacia baccata"             "Gaylussacia baccata"            
## [133] "Ilex mucronata"                  "Ilex mucronata"                 
## [135] "Ilex verticillata"               "Iris versicolor"                
## [137] "Juncus effusus"                  "Kalmia angustifolia"            
## [139] "Kalmia angustifolia"             "Larix laricina"                 
## [141] "Larix laricina"                  "Maianthemum canadense"          
## [143] "Maianthemum canadense"           "Maianthemum trifolium"          
## [145] "Myrica gale"                     "Myrica gale"                    
## [147] "Onoclea sensibilis"              "Osmundastrum cinnamomea"        
## [149] "Osmundastrum cinnamomea"         "Picea glauca"                   
## [151] "Picea mariana"                   "Picea mariana"                  
## [153] "Prenanthes"                      "Rhododendron canadense"         
## [155] "Rhododendron canadense"          "Rhododendron groenlandicum"     
## [157] "Rhododendron groenlandicum"      "Rosa nitida"                    
## [159] "Rosa nitida"                     "Rosa palustris"                 
## [161] "Rubus flagellaris"               "Rubus hispidus"                 
## [163] "Rubus hispidus"                  "Sarracenia purpurea"            
## [165] "Solidago uliginosa"              "Solidago uliginosa"             
## [167] "Spiraea alba"                    "Spiraea alba"                   
## [169] "Symplocarpus foetidus"           "Thelypteris palustris"          
## [171] "Trientalis borealis"             "Trientalis borealis"            
## [173] "Vaccinium angustifolium"         "Vaccinium angustifolium"        
## [175] "Vaccinium corymbosum"            "Vaccinium corymbosum"           
## [177] "Vaccinium myrtilloides"          "Vaccinium myrtilloides"         
## [179] "Vaccinium vitis-idaea"           "Viburnum nudum var. cassinoides"
## [181] "Viburnum nudum var. cassinoides" "Acer rubrum"                    
## [183] "Acer rubrum"                     "Alnus incana"                   
## [185] "Alnus incana"                    "Amelanchier"                    
## [187] "Amelanchier"                     "Aronia melanocarpa"             
## [189] "Berberis thunbergii"             "Berberis thunbergii"            
## [191] "Calamagrostis canadensis"        "Calamagrostis canadensis"       
## [193] "Carex atlantica"                 "Carex atlantica"                
## [195] "Carex folliculata"               "Carex folliculata"              
## [197] "Carex lasiocarpa"                "Carex lasiocarpa"               
## [199] "Carex stricta"                   "Carex stricta"                  
## [201] "Celastrus orbiculatus"           "Chamaedaphne calyculata"        
## [203] "Chamaedaphne calyculata"         "Cornus canadensis"              
## [205] "Cornus canadensis"               "Danthonia spicata"              
## [207] "Dichanthelium acuminatum"        "Doellingeria umbellata"         
## [209] "Doellingeria umbellata"          "Drosera intermedia"             
## [211] "Drosera rotundifolia"            "Drosera rotundifolia"           
## [213] "Dulichium arundinaceum"          "Dulichium arundinaceum"         
## [215] "Epilobium leptophyllum"          "Eriophorum angustifolium"       
## [217] "Eriophorum tenellum"             "Eriophorum virginicum"          
## [219] "Eriophorum virginicum"           "Eurybia radula"                 
## [221] "Gaylussacia baccata"             "Gaylussacia baccata"            
## [223] "Glyceria"                        "Ilex mucronata"                 
## [225] "Ilex mucronata"                  "Ilex verticillata"              
## [227] "Ilex verticillata"               "Juncus canadensis"              
## [229] "Kalmia angustifolia"             "Kalmia angustifolia"            
## [231] "Larix laricina"                  "Lysimachia terrestris"          
## [233] "Lysimachia terrestris"           "Morella pensylvanica"           
## [235] "Myrica gale"                     "Myrica gale"                    
## [237] "Oclemena nemoralis"              "Oclemena nemoralis"             
## [239] "Osmunda regalis"                 "Osmunda regalis"                
## [241] "Osmundastrum cinnamomea"         "Osmundastrum cinnamomea"        
## [243] "Picea rubens"                    "Pinus strobus"                  
## [245] "Pinus strobus"                   "Pogonia ophioglossoides"        
## [247] "Quercus rubra"                   "Rhamnus frangula"               
## [249] "Rhamnus frangula"                "Rhododendron canadense"         
## [251] "Rhododendron canadense"          "Rhynchospora alba"              
## [253] "Rosa palustris"                  "Rubus flagellaris"              
## [255] "Rubus hispidus"                  "Scirpus cyperinus"              
## [257] "Scirpus cyperinus"               "Solidago rugosa"                
## [259] "Solidago uliginosa"              "Spiraea alba"                   
## [261] "Spiraea alba"                    "Spiraea tomentosa"              
## [263] "Spiraea tomentosa"               "Symphyotrichum novi-belgii"     
## [265] "Thelypteris palustris"           "Thelypteris palustris"          
## [267] "Triadenum virginicum"            "Triadenum"                      
## [269] "Trientalis borealis"             "Typha latifolia"                
## [271] "Typha latifolia"                 "Vaccinium angustifolium"        
## [273] "Vaccinium corymbosum"            "Vaccinium corymbosum"           
## [275] "Vaccinium macrocarpon"           "Vaccinium macrocarpon"          
## [277] "Viburnum nudum var. cassinoides" "Viburnum nudum var. cassinoides"
## [279] "Viola"                           "Acer rubrum"                    
## [281] "Acer rubrum"                     "Amelanchier"                    
## [283] "Aronia melanocarpa"              "Carex limosa"                   
## [285] "Carex trisperma"                 "Carex trisperma"                
## [287] "Chamaedaphne calyculata"         "Chamaedaphne calyculata"        
## [289] "Cornus canadensis"               "Cornus canadensis"              
## [291] "Eriophorum virginicum"           "Eriophorum virginicum"          
## [293] "Gaultheria hispidula"            "Gaultheria hispidula"           
## [295] "Gaylussacia baccata"             "Gaylussacia baccata"            
## [297] "Ilex mucronata"                  "Ilex mucronata"                 
## [299] "Ilex verticillata"               "Kalmia angustifolia"            
## [301] "Kalmia angustifolia"             "Larix laricina"                 
## [303] "Larix laricina"                  "Maianthemum trifolium"          
## [305] "Maianthemum trifolium"           "Monotropa uniflora"             
## [307] "Monotropa uniflora"              "Myrica gale"                    
## [309] "Myrica gale"                     "Picea mariana"                  
## [311] "Picea mariana"                   "Rhododendron canadense"         
## [313] "Rhododendron canadense"          "Rhododendron groenlandicum"     
## [315] "Rhododendron groenlandicum"      "Symplocarpus foetidus"          
## [317] "Symplocarpus foetidus"           "Trientalis borealis"            
## [319] "Trientalis borealis"             "Vaccinium angustifolium"        
## [321] "Vaccinium angustifolium"         "Vaccinium corymbosum"           
## [323] "Vaccinium corymbosum"            "Vaccinium myrtilloides"         
## [325] "Vaccinium myrtilloides"          "Vaccinium oxycoccos"            
## [327] "Vaccinium oxycoccos"             "Vaccinium vitis-idaea"          
## [329] "Vaccinium vitis-idaea"           "Viburnum nudum var. cassinoides"
## [331] "Viburnum nudum var. cassinoides" "Acer rubrum"                    
## [333] "Acer rubrum"                     "Alnus incana"                   
## [335] "Alnus incana"                    "Apocynum androsaemifolium"      
## [337] "Apocynum androsaemifolium"       "Betula populifolia"             
## [339] "Betula populifolia"              "Calamagrostis canadensis"       
## [341] "Calamagrostis canadensis"        "Carex atlantica"                
## [343] "Carex lacustris"                 "Carex lacustris"                
## [345] "Carex lasiocarpa"                "Carex lasiocarpa"               
## [347] "Carex Ovalis group"              "Carex stricta"                  
## [349] "Carex stricta"                   "Chamaedaphne calyculata"        
## [351] "Chamaedaphne calyculata"         "Comptonia peregrina"            
## [353] "Doellingeria umbellata"          "Doellingeria umbellata"         
## [355] "Dryopteris cristata"             "Dulichium arundinaceum"         
## [357] "Equisetum arvense"               "Eurybia macrophylla"            
## [359] "Festuca filiformis"              "Ilex mucronata"                 
## [361] "Ilex verticillata"               "Ilex verticillata"              
## [363] "Iris versicolor"                 "Iris versicolor"                
## [365] "Juncus canadensis"               "Juncus canadensis"              
## [367] "Juncus effusus"                  "Lupinus polyphyllus"            
## [369] "Lysimachia terrestris"           "Lysimachia terrestris"          
## [371] "Malus"                           "Myrica gale"                    
## [373] "Myrica gale"                     "Onoclea sensibilis"             
## [375] "Onoclea sensibilis"              "Osmunda regalis"                
## [377] "Osmundastrum cinnamomea"         "Phleum pratense"                
## [379] "Pinus strobus"                   "Pinus strobus"                  
## [381] "Populus grandidentata"           "Populus grandidentata"          
## [383] "Populus tremuloides"             "Populus tremuloides"            
## [385] "Quercus rubra"                   "Ranunculus acris"               
## [387] "Rhamnus frangula"                "Rhamnus frangula"               
## [389] "Rhododendron canadense"          "Rosa nitida"                    
## [391] "Rosa palustris"                  "Rosa virginiana"                
## [393] "Rubus hispidus"                  "Rubus"                          
## [395] "Salix petiolaris"                "Salix"                          
## [397] "Scirpus cyperinus"               "Scirpus cyperinus"              
## [399] "Scutellaria"                     "Solidago rugosa"                
## [401] "Spiraea alba"                    "Spiraea alba"                   
## [403] "Spiraea tomentosa"               "Triadenum virginicum"           
## [405] "Triadenum"                       "Utricularia cornuta"            
## [407] "Vaccinium corymbosum"            "Veronica officinalis"           
## [409] "Vicia cracca"                    "Vicia cracca"                   
## [411] "Acer rubrum"                     "Acer rubrum"                    
## [413] "Alnus incana"                    "Amelanchier"                    
## [415] "Arethusa bulbosa"                "Arethusa bulbosa"               
## [417] "Aronia melanocarpa"              "Aronia melanocarpa"             
## [419] "Calamagrostis canadensis"        "Calopogon tuberosus"            
## [421] "Calopogon tuberosus"             "Carex atlantica"                
## [423] "Carex exilis"                    "Carex folliculata"              
## [425] "Carex folliculata"               "Carex magellanica"              
## [427] "Carex pauciflora"                "Carex trisperma"                
## [429] "Carex trisperma"                 "Chamaedaphne calyculata"        
## [431] "Chamaedaphne calyculata"         "Cornus canadensis"              
## [433] "Cornus canadensis"               "Drosera intermedia"             
## [435] "Drosera intermedia"              "Drosera rotundifolia"           
## [437] "Drosera rotundifolia"            "Empetrum nigrum"                
## [439] "Empetrum nigrum"                 "Eriophorum angustifolium"       
## [441] "Eriophorum angustifolium"        "Eriophorum virginicum"          
## [443] "Eriophorum virginicum"           "Eurybia radula"                 
## [445] "Gaultheria hispidula"            "Gaultheria hispidula"           
## [447] "Gaylussacia baccata"             "Gaylussacia baccata"            
## [449] "Gaylussacia dumosa"              "Glyceria striata"               
## [451] "Glyceria"                        "Ilex mucronata"                 
## [453] "Ilex mucronata"                  "Iris versicolor"                
## [455] "Iris versicolor"                 "Kalmia angustifolia"            
## [457] "Kalmia angustifolia"             "Kalmia polifolia"               
## [459] "Larix laricina"                  "Larix laricina"                 
## [461] "Lonicera - Exotic"               "Maianthemum trifolium"          
## [463] "Maianthemum trifolium"           "Melampyrum lineare"             
## [465] "Myrica gale"                     "Myrica gale"                    
## [467] "Oclemena nemoralis"              "Oclemena nemoralis"             
## [469] "Oclemena X blakei"               "Osmundastrum cinnamomea"        
## [471] "Picea mariana"                   "Picea mariana"                  
## [473] "Pogonia ophioglossoides"         "Pogonia ophioglossoides"        
## [475] "Rhododendron canadense"          "Rhododendron canadense"         
## [477] "Rhododendron groenlandicum"      "Rhododendron groenlandicum"     
## [479] "Rhynchospora alba"               "Rhynchospora alba"              
## [481] "Rosa nitida"                     "Rosa palustris"                 
## [483] "Rubus flagellaris"               "Rubus hispidus"                 
## [485] "Sarracenia purpurea"             "Sarracenia purpurea"            
## [487] "Solidago uliginosa"              "Solidago uliginosa"             
## [489] "Spiraea alba"                    "Symphyotrichum novi-belgii"     
## [491] "Symplocarpus foetidus"           "Symplocarpus foetidus"          
## [493] "Trichophorum cespitosum"         "Trientalis borealis"            
## [495] "Trientalis borealis"             "Utricularia cornuta"            
## [497] "Utricularia cornuta"             "Vaccinium angustifolium"        
## [499] "Vaccinium angustifolium"         "Vaccinium corymbosum"           
## [501] "Vaccinium corymbosum"            "Vaccinium oxycoccos"            
## [503] "Vaccinium oxycoccos"             "Vaccinium vitis-idaea"          
## [505] "Viburnum nudum var. cassinoides" "Viburnum nudum var. cassinoides"
## [507] "Xyris montana"                   "Xyris montana"

Square brackets [ , ]

Remember that every data frame has 2 dimensions. The first dimension is rows and the second is columns. Thinking of the data in two dimensions in the order of rows then columns helps understand how brackets work.

Square brackets [rows, columns] are how you access specific rows and columns in a data frame using base R. Examples include:
  • Specifying row numbers, or matching a pattern (e.g. returning all rows where a logical column is TRUE).
  • Specifying column numbers or names to return specific columns.
  • Returning specific columns and all rows (leave the left side of the “,” blank).
  • Returning specific rows and all columns (leave the right side of the “,” blank).

Square brackets were one of the hardest concepts for me when I was starting out, so don’t worry if they aren’t immediately intuitive. There are easier ways to work with data frame rows and columns, which you’ll learn on Day 2. Still, it’s useful to have a basic understanding of how to interpret square brackets, because you will likely encounter them on Stack Overflow and other R help sites. We’ll work through some examples of using square brackets to access rows, columns, or both.

The code below asks for the dimensions of the ACAD_wetland data frame and returns 508 11, meaning there are 508 rows and 11 columns.

Return the number of rows and columns of a data frame by checking its dimensions.

dim(ACAD_wetland)
## [1] 508  11
nrow(ACAD_wetland) # first dim
## [1] 508
ncol(ACAD_wetland) # second dim
## [1] 11
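Because nrow() and ncol() return plain numbers, you can use them inside the brackets themselves — for example, to grab the last row of a data frame without knowing its size in advance. A quick sketch with a toy data frame (the `df` object here is invented for illustration):

```r
# Toy data frame for illustration
df <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))

df[nrow(df), ]                 # last row: nrow(df) evaluates to 4
df[(nrow(df) - 1):nrow(df), ]  # last two rows
```

This pattern works on any data frame, including ACAD_wetland.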

Return first 5 rows of the wetland data frame.

Note the comma with nothing to the right. That means return all columns.

ACAD_wetland[1:5,]
ACAD_wetland[c(1, 2, 3, 4, 5),] # equivalent, but more typing
##   Site_Name Site_Type          Latin_Name           Common Year PctFreq Ave_Cov
## 1    SEN-01  Sentinel         Acer rubrum        red maple 2011       0    0.02
## 2    SEN-01  Sentinel         Amelanchier     serviceberry 2011      20    0.02
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary 2011      80    2.22
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry 2011     100    2.64
##   Invasive Protected  X_Coord Y_Coord
## 1    FALSE     FALSE 574855.5 4911909
## 2    FALSE     FALSE 574855.5 4911909
## 3    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 5    FALSE     FALSE 574855.5 4911909
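A related trick you may run into on help sites: negative numbers inside the brackets drop rows or columns rather than selecting them. A brief sketch with a toy data frame (invented for illustration):

```r
# Toy data frame for illustration
df <- data.frame(x = 1:5, y = 6:10)

df[-1, ]      # all rows except the first
df[, -2]      # all columns except the second (a single column simplifies to a vector)
df[-(1:3), ]  # drop the first three rows
```

Note that you can't mix positive and negative indices in the same dimension — `df[c(-1, 2), ]` is an error.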

Return all rows and a subset of columns of the data frame.

Note how the left side of the comma is empty. That means return all rows.

ACAD_wetland[, c("Site_Name", "Latin_Name", "Common", "Year", "PctFreq")]
##     Site_Name                      Latin_Name
## 1      SEN-01                     Acer rubrum
## 2      SEN-01                     Amelanchier
## 3      SEN-01             Andromeda polifolia
## 4      SEN-01                Arethusa bulbosa
## 5      SEN-01              Aronia melanocarpa
## 6      SEN-01                    Carex exilis
## 7      SEN-01         Chamaedaphne calyculata
## 8      SEN-01              Drosera intermedia
## 9      SEN-01            Drosera rotundifolia
## 10     SEN-01                 Empetrum nigrum
## 11     SEN-01        Eriophorum angustifolium
## 12     SEN-01            Eriophorum vaginatum
## 13     SEN-01             Gaylussacia baccata
## 14     SEN-01              Gaylussacia dumosa
## 15     SEN-01                  Ilex mucronata
## 16     SEN-01              Juniperus communis
## 17     SEN-01             Kalmia angustifolia
## 18     SEN-01                Kalmia polifolia
## 19     SEN-01                  Larix laricina
## 20     SEN-01                     Myrica gale
## 21     SEN-01                Nuphar variegata
## 22     SEN-01                   Picea mariana
## 23     SEN-01          Rhododendron canadense
## 24     SEN-01      Rhododendron groenlandicum
## 25     SEN-01               Rhynchospora alba
## 26     SEN-01             Sarracenia purpurea
## 27     SEN-01              Solidago uliginosa
## 28     SEN-01           Symplocarpus foetidus
## 29     SEN-01         Trichophorum cespitosum
## 30     SEN-01             Trientalis borealis
## 31     SEN-01             Utricularia cornuta
## 32     SEN-01             Vaccinium oxycoccos
## 33     SEN-01 Viburnum nudum var. cassinoides
## 34     SEN-01                   Xyris montana
## 35     SEN-02                     Acer rubrum
## 36     SEN-02                Arethusa bulbosa
## 37     SEN-02              Aronia melanocarpa
## 38     SEN-02              Betula populifolia
## 39     SEN-02                    Carex exilis
## 40     SEN-02                Carex lasiocarpa
## 41     SEN-02                   Carex stricta
## 42     SEN-02                Carex utriculata
## 43     SEN-02         Chamaedaphne calyculata
## 44     SEN-02              Drosera intermedia
## 45     SEN-02            Drosera rotundifolia
## 46     SEN-02          Dulichium arundinaceum
## 47     SEN-02           Eriophorum virginicum
## 48     SEN-02             Gaylussacia baccata
## 49     SEN-02                  Ilex mucronata
## 50     SEN-02               Ilex verticillata
## 51     SEN-02               Juncus acuminatus
## 52     SEN-02             Kalmia angustifolia
## 53     SEN-02                Kalmia polifolia
## 54     SEN-02                  Larix laricina
## 55     SEN-02           Lysimachia terrestris
## 56     SEN-02           Maianthemum trifolium
## 57     SEN-02           Muhlenbergia uniflora
## 58     SEN-02                     Myrica gale
## 59     SEN-02              Oclemena nemoralis
## 60     SEN-02                   Picea mariana
## 61     SEN-02                   Pinus strobus
## 62     SEN-02          Rhododendron canadense
## 63     SEN-02      Rhododendron groenlandicum
## 64     SEN-02               Rhynchospora alba
## 65     SEN-02                  Rubus hispidus
## 66     SEN-02             Sarracenia purpurea
## 67     SEN-02         Scutellaria lateriflora
## 68     SEN-02                    Spiraea alba
## 69     SEN-02               Spiraea tomentosa
## 70     SEN-02              Thuja occidentalis
## 71     SEN-02            Triadenum virginicum
## 72     SEN-02         Vaccinium angustifolium
## 73     SEN-02            Vaccinium corymbosum
## 74     SEN-02           Vaccinium macrocarpon
## 75     SEN-02             Vaccinium oxycoccos
## 76     SEN-03                     Acer rubrum
## 77     SEN-03                  Alnus incana++
## 78     SEN-03                     Amelanchier
## 79     SEN-03              Aronia melanocarpa
## 80     SEN-03                   Carex stricta
## 81     SEN-03                 Carex trisperma
## 82     SEN-03         Chamaedaphne calyculata
## 83     SEN-03               Cornus canadensis
## 84     SEN-03            Drosera rotundifolia
## 85     SEN-03        Eriophorum angustifolium
## 86     SEN-03                  Eurybia radula
## 87     SEN-03            Gaultheria hispidula
## 88     SEN-03             Gaylussacia baccata
## 89     SEN-03                  Ilex mucronata
## 90     SEN-03               Ilex verticillata
## 91     SEN-03             Kalmia angustifolia
## 92     SEN-03                Kalmia polifolia
## 93     SEN-03                  Larix laricina
## 94     SEN-03            Morella pensylvanica
## 95     SEN-03         Osmundastrum cinnamomea
## 96     SEN-03                   Picea mariana
## 97     SEN-03                 Pinus banksiana
## 98     SEN-03          Rhododendron canadense
## 99     SEN-03      Rhododendron groenlandicum
## 100    SEN-03             Sarracenia purpurea
## 101    SEN-03                Sorbus americana
## 102    SEN-03                    Spiraea alba
## 103    SEN-03              Thuja occidentalis
## 104    SEN-03             Trientalis borealis
## 105    SEN-03         Vaccinium angustifolium
## 106    SEN-03           Vaccinium macrocarpon
## 107    SEN-03             Vaccinium oxycoccos
## 108    SEN-03                  Viburnum nudum
## 109    RAM-41                     Acer rubrum
## 110    RAM-41                     Acer rubrum
## 111    RAM-41                    Alnus incana
## 112    RAM-41                     Amelanchier
## 113    RAM-41                     Amelanchier
## 114    RAM-41              Aronia melanocarpa
## 115    RAM-41             Berberis thunbergii
## 116    RAM-41        Calamagrostis canadensis
## 117    RAM-41               Carex folliculata
## 118    RAM-41                 Carex trisperma
## 119    RAM-41                 Carex trisperma
## 120    RAM-41                           Carex
## 121    RAM-41         Chamaedaphne calyculata
## 122    RAM-41         Chamaedaphne calyculata
## 123    RAM-41               Cornus canadensis
## 124    RAM-41               Cornus canadensis
## 125    RAM-41          Doellingeria umbellata
## 126    RAM-41          Doellingeria umbellata
## 127    RAM-41             Dryopteris cristata
## 128    RAM-41                 Empetrum nigrum
## 129    RAM-41        Eriophorum angustifolium
## 130    RAM-41        Eriophorum angustifolium
## 131    RAM-41             Gaylussacia baccata
## 132    RAM-41             Gaylussacia baccata
## 133    RAM-41                  Ilex mucronata
## 134    RAM-41                  Ilex mucronata
## 135    RAM-41               Ilex verticillata
## 136    RAM-41                 Iris versicolor
## 137    RAM-41                  Juncus effusus
## 138    RAM-41             Kalmia angustifolia
## 139    RAM-41             Kalmia angustifolia
## 140    RAM-41                  Larix laricina
## 141    RAM-41                  Larix laricina
## 142    RAM-41           Maianthemum canadense
## 143    RAM-41           Maianthemum canadense
## 144    RAM-41           Maianthemum trifolium
## 145    RAM-41                     Myrica gale
## 146    RAM-41                     Myrica gale
## 147    RAM-41              Onoclea sensibilis
## 148    RAM-41         Osmundastrum cinnamomea
## 149    RAM-41         Osmundastrum cinnamomea
## 150    RAM-41                    Picea glauca
## 151    RAM-41                   Picea mariana
## 152    RAM-41                   Picea mariana
## 153    RAM-41                      Prenanthes
## 154    RAM-41          Rhododendron canadense
## 155    RAM-41          Rhododendron canadense
## 156    RAM-41      Rhododendron groenlandicum
## 157    RAM-41      Rhododendron groenlandicum
## 158    RAM-41                     Rosa nitida
## 159    RAM-41                     Rosa nitida
## 160    RAM-41                  Rosa palustris
## 161    RAM-41               Rubus flagellaris
## 162    RAM-41                  Rubus hispidus
## 163    RAM-41                  Rubus hispidus
## 164    RAM-41             Sarracenia purpurea
## 165    RAM-41              Solidago uliginosa
## 166    RAM-41              Solidago uliginosa
## 167    RAM-41                    Spiraea alba
## 168    RAM-41                    Spiraea alba
## 169    RAM-41           Symplocarpus foetidus
## 170    RAM-41           Thelypteris palustris
## 171    RAM-41             Trientalis borealis
## 172    RAM-41             Trientalis borealis
## 173    RAM-41         Vaccinium angustifolium
## 174    RAM-41         Vaccinium angustifolium
## 175    RAM-41            Vaccinium corymbosum
## 176    RAM-41            Vaccinium corymbosum
## 177    RAM-41          Vaccinium myrtilloides
## 178    RAM-41          Vaccinium myrtilloides
## 179    RAM-41           Vaccinium vitis-idaea
## 180    RAM-41 Viburnum nudum var. cassinoides
## 181    RAM-41 Viburnum nudum var. cassinoides
## 182    RAM-53                     Acer rubrum
## 183    RAM-53                     Acer rubrum
## 184    RAM-53                    Alnus incana
## 185    RAM-53                    Alnus incana
## 186    RAM-53                     Amelanchier
## 187    RAM-53                     Amelanchier
## 188    RAM-53              Aronia melanocarpa
## 189    RAM-53             Berberis thunbergii
## 190    RAM-53             Berberis thunbergii
## 191    RAM-53        Calamagrostis canadensis
## 192    RAM-53        Calamagrostis canadensis
## 193    RAM-53                 Carex atlantica
## 194    RAM-53                 Carex atlantica
## 195    RAM-53               Carex folliculata
## 196    RAM-53               Carex folliculata
## 197    RAM-53                Carex lasiocarpa
## 198    RAM-53                Carex lasiocarpa
## 199    RAM-53                   Carex stricta
## 200    RAM-53                   Carex stricta
## 201    RAM-53           Celastrus orbiculatus
## 202    RAM-53         Chamaedaphne calyculata
## 203    RAM-53         Chamaedaphne calyculata
## 204    RAM-53               Cornus canadensis
## 205    RAM-53               Cornus canadensis
## 206    RAM-53               Danthonia spicata
## 207    RAM-53        Dichanthelium acuminatum
## 208    RAM-53          Doellingeria umbellata
## 209    RAM-53          Doellingeria umbellata
## 210    RAM-53              Drosera intermedia
## 211    RAM-53            Drosera rotundifolia
## 212    RAM-53            Drosera rotundifolia
## 213    RAM-53          Dulichium arundinaceum
## 214    RAM-53          Dulichium arundinaceum
## 215    RAM-53          Epilobium leptophyllum
## 216    RAM-53        Eriophorum angustifolium
## 217    RAM-53             Eriophorum tenellum
## 218    RAM-53           Eriophorum virginicum
## 219    RAM-53           Eriophorum virginicum
## 220    RAM-53                  Eurybia radula
## 221    RAM-53             Gaylussacia baccata
## 222    RAM-53             Gaylussacia baccata
## 223    RAM-53                        Glyceria
## 224    RAM-53                  Ilex mucronata
## 225    RAM-53                  Ilex mucronata
## 226    RAM-53               Ilex verticillata
## 227    RAM-53               Ilex verticillata
## 228    RAM-53               Juncus canadensis
## 229    RAM-53             Kalmia angustifolia
## 230    RAM-53             Kalmia angustifolia
## 231    RAM-53                  Larix laricina
## 232    RAM-53           Lysimachia terrestris
## 233    RAM-53           Lysimachia terrestris
## 234    RAM-53            Morella pensylvanica
## 235    RAM-53                     Myrica gale
## 236    RAM-53                     Myrica gale
## 237    RAM-53              Oclemena nemoralis
## 238    RAM-53              Oclemena nemoralis
## 239    RAM-53                 Osmunda regalis
## 240    RAM-53                 Osmunda regalis
## 241    RAM-53         Osmundastrum cinnamomea
## 242    RAM-53         Osmundastrum cinnamomea
## 243    RAM-53                    Picea rubens
## 244    RAM-53                   Pinus strobus
## 245    RAM-53                   Pinus strobus
## 246    RAM-53         Pogonia ophioglossoides
## 247    RAM-53                   Quercus rubra
## 248    RAM-53                Rhamnus frangula
## 249    RAM-53                Rhamnus frangula
## 250    RAM-53          Rhododendron canadense
## 251    RAM-53          Rhododendron canadense
## 252    RAM-53               Rhynchospora alba
## 253    RAM-53                  Rosa palustris
## 254    RAM-53               Rubus flagellaris
## 255    RAM-53                  Rubus hispidus
## 256    RAM-53               Scirpus cyperinus
## 257    RAM-53               Scirpus cyperinus
## 258    RAM-53                 Solidago rugosa
## 259    RAM-53              Solidago uliginosa
## 260    RAM-53                    Spiraea alba
## 261    RAM-53                    Spiraea alba
## 262    RAM-53               Spiraea tomentosa
## 263    RAM-53               Spiraea tomentosa
## 264    RAM-53      Symphyotrichum novi-belgii
## 265    RAM-53           Thelypteris palustris
## 266    RAM-53           Thelypteris palustris
## 267    RAM-53            Triadenum virginicum
## 268    RAM-53                       Triadenum
## 269    RAM-53             Trientalis borealis
## 270    RAM-53                 Typha latifolia
## 271    RAM-53                 Typha latifolia
## 272    RAM-53         Vaccinium angustifolium
## 273    RAM-53            Vaccinium corymbosum
## 274    RAM-53            Vaccinium corymbosum
## 275    RAM-53           Vaccinium macrocarpon
## 276    RAM-53           Vaccinium macrocarpon
## 277    RAM-53 Viburnum nudum var. cassinoides
## 278    RAM-53 Viburnum nudum var. cassinoides
## 279    RAM-53                           Viola
## 280    RAM-62                     Acer rubrum
## 281    RAM-62                     Acer rubrum
## 282    RAM-62                     Amelanchier
## 283    RAM-62              Aronia melanocarpa
## 284    RAM-62                    Carex limosa
## 285    RAM-62                 Carex trisperma
## 286    RAM-62                 Carex trisperma
## 287    RAM-62         Chamaedaphne calyculata
## 288    RAM-62         Chamaedaphne calyculata
## 289    RAM-62               Cornus canadensis
## 290    RAM-62               Cornus canadensis
## 291    RAM-62           Eriophorum virginicum
## 292    RAM-62           Eriophorum virginicum
## 293    RAM-62            Gaultheria hispidula
## 294    RAM-62            Gaultheria hispidula
## 295    RAM-62             Gaylussacia baccata
## 296    RAM-62             Gaylussacia baccata
## 297    RAM-62                  Ilex mucronata
## 298    RAM-62                  Ilex mucronata
## 299    RAM-62               Ilex verticillata
## 300    RAM-62             Kalmia angustifolia
## 301    RAM-62             Kalmia angustifolia
## 302    RAM-62                  Larix laricina
## 303    RAM-62                  Larix laricina
## 304    RAM-62           Maianthemum trifolium
## 305    RAM-62           Maianthemum trifolium
## 306    RAM-62              Monotropa uniflora
## 307    RAM-62              Monotropa uniflora
## 308    RAM-62                     Myrica gale
## 309    RAM-62                     Myrica gale
## 310    RAM-62                   Picea mariana
## 311    RAM-62                   Picea mariana
## 312    RAM-62          Rhododendron canadense
## 313    RAM-62          Rhododendron canadense
## 314    RAM-62      Rhododendron groenlandicum
## 315    RAM-62      Rhododendron groenlandicum
## 316    RAM-62           Symplocarpus foetidus
## 317    RAM-62           Symplocarpus foetidus
## 318    RAM-62             Trientalis borealis
## 319    RAM-62             Trientalis borealis
## 320    RAM-62         Vaccinium angustifolium
## 321    RAM-62         Vaccinium angustifolium
## 322    RAM-62            Vaccinium corymbosum
## 323    RAM-62            Vaccinium corymbosum
## 324    RAM-62          Vaccinium myrtilloides
## 325    RAM-62          Vaccinium myrtilloides
## 326    RAM-62             Vaccinium oxycoccos
## 327    RAM-62             Vaccinium oxycoccos
## 328    RAM-62           Vaccinium vitis-idaea
## 329    RAM-62           Vaccinium vitis-idaea
## 330    RAM-62 Viburnum nudum var. cassinoides
## 331    RAM-62 Viburnum nudum var. cassinoides
## 332    RAM-44                     Acer rubrum
## 333    RAM-44                     Acer rubrum
## 334    RAM-44                    Alnus incana
## 335    RAM-44                    Alnus incana
## 336    RAM-44       Apocynum androsaemifolium
## 337    RAM-44       Apocynum androsaemifolium
## 338    RAM-44              Betula populifolia
## 339    RAM-44              Betula populifolia
## 340    RAM-44        Calamagrostis canadensis
## 341    RAM-44        Calamagrostis canadensis
## 342    RAM-44                 Carex atlantica
## 343    RAM-44                 Carex lacustris
## 344    RAM-44                 Carex lacustris
## 345    RAM-44                Carex lasiocarpa
## 346    RAM-44                Carex lasiocarpa
## 347    RAM-44              Carex Ovalis group
## 348    RAM-44                   Carex stricta
## 349    RAM-44                   Carex stricta
## 350    RAM-44         Chamaedaphne calyculata
## 351    RAM-44         Chamaedaphne calyculata
## 352    RAM-44             Comptonia peregrina
## 353    RAM-44          Doellingeria umbellata
## 354    RAM-44          Doellingeria umbellata
## 355    RAM-44             Dryopteris cristata
## 356    RAM-44          Dulichium arundinaceum
## 357    RAM-44               Equisetum arvense
## 358    RAM-44             Eurybia macrophylla
## 359    RAM-44              Festuca filiformis
## 360    RAM-44                  Ilex mucronata
## 361    RAM-44               Ilex verticillata
## 362    RAM-44               Ilex verticillata
## 363    RAM-44                 Iris versicolor
## 364    RAM-44                 Iris versicolor
## 365    RAM-44               Juncus canadensis
## 366    RAM-44               Juncus canadensis
## 367    RAM-44                  Juncus effusus
## 368    RAM-44             Lupinus polyphyllus
## 369    RAM-44           Lysimachia terrestris
## 370    RAM-44           Lysimachia terrestris
## 371    RAM-44                           Malus
## 372    RAM-44                     Myrica gale
## 373    RAM-44                     Myrica gale
## 374    RAM-44              Onoclea sensibilis
## 375    RAM-44              Onoclea sensibilis
## 376    RAM-44                 Osmunda regalis
## 377    RAM-44         Osmundastrum cinnamomea
## 378    RAM-44                 Phleum pratense
## 379    RAM-44                   Pinus strobus
## 380    RAM-44                   Pinus strobus
## 381    RAM-44           Populus grandidentata
## 382    RAM-44           Populus grandidentata
## 383    RAM-44             Populus tremuloides
## 384    RAM-44             Populus tremuloides
## 385    RAM-44                   Quercus rubra
## 386    RAM-44                Ranunculus acris
## 387    RAM-44                Rhamnus frangula
## 388    RAM-44                Rhamnus frangula
## 389    RAM-44          Rhododendron canadense
## 390    RAM-44                     Rosa nitida
## 391    RAM-44                  Rosa palustris
## 392    RAM-44                 Rosa virginiana
## 393    RAM-44                  Rubus hispidus
## 394    RAM-44                           Rubus
## 395    RAM-44                Salix petiolaris
## 396    RAM-44                           Salix
## 397    RAM-44               Scirpus cyperinus
## 398    RAM-44               Scirpus cyperinus
## 399    RAM-44                     Scutellaria
## 400    RAM-44                 Solidago rugosa
## 401    RAM-44                    Spiraea alba
## 402    RAM-44                    Spiraea alba
## 403    RAM-44               Spiraea tomentosa
## 404    RAM-44            Triadenum virginicum
## 405    RAM-44                       Triadenum
## 406    RAM-44             Utricularia cornuta
## 407    RAM-44            Vaccinium corymbosum
## 408    RAM-44            Veronica officinalis
## 409    RAM-44                    Vicia cracca
## 410    RAM-44                    Vicia cracca
## 411    RAM-05                     Acer rubrum
## 412    RAM-05                     Acer rubrum
## 413    RAM-05                    Alnus incana
## 414    RAM-05                     Amelanchier
## 415    RAM-05                Arethusa bulbosa
## 416    RAM-05                Arethusa bulbosa
## 417    RAM-05              Aronia melanocarpa
## 418    RAM-05              Aronia melanocarpa
## 419    RAM-05        Calamagrostis canadensis
## 420    RAM-05             Calopogon tuberosus
## 421    RAM-05             Calopogon tuberosus
## 422    RAM-05                 Carex atlantica
## 423    RAM-05                    Carex exilis
## 424    RAM-05               Carex folliculata
## 425    RAM-05               Carex folliculata
## 426    RAM-05               Carex magellanica
## 427    RAM-05                Carex pauciflora
## 428    RAM-05                 Carex trisperma
## 429    RAM-05                 Carex trisperma
## 430    RAM-05         Chamaedaphne calyculata
## 431    RAM-05         Chamaedaphne calyculata
## 432    RAM-05               Cornus canadensis
## 433    RAM-05               Cornus canadensis
## 434    RAM-05              Drosera intermedia
## 435    RAM-05              Drosera intermedia
## 436    RAM-05            Drosera rotundifolia
## 437    RAM-05            Drosera rotundifolia
## 438    RAM-05                 Empetrum nigrum
## 439    RAM-05                 Empetrum nigrum
## 440    RAM-05        Eriophorum angustifolium
## 441    RAM-05        Eriophorum angustifolium
## 442    RAM-05           Eriophorum virginicum
## 443    RAM-05           Eriophorum virginicum
## 444    RAM-05                  Eurybia radula
## 445    RAM-05            Gaultheria hispidula
## 446    RAM-05            Gaultheria hispidula
## 447    RAM-05             Gaylussacia baccata
## 448    RAM-05             Gaylussacia baccata
## 449    RAM-05              Gaylussacia dumosa
## 450    RAM-05                Glyceria striata
## 451    RAM-05                        Glyceria
## 452    RAM-05                  Ilex mucronata
## 453    RAM-05                  Ilex mucronata
## 454    RAM-05                 Iris versicolor
## 455    RAM-05                 Iris versicolor
## 456    RAM-05             Kalmia angustifolia
## 457    RAM-05             Kalmia angustifolia
## 458    RAM-05                Kalmia polifolia
## 459    RAM-05                  Larix laricina
## 460    RAM-05                  Larix laricina
## 461    RAM-05               Lonicera - Exotic
## 462    RAM-05           Maianthemum trifolium
## 463    RAM-05           Maianthemum trifolium
## 464    RAM-05              Melampyrum lineare
## 465    RAM-05                     Myrica gale
## 466    RAM-05                     Myrica gale
## 467    RAM-05              Oclemena nemoralis
## 468    RAM-05              Oclemena nemoralis
## 469    RAM-05               Oclemena X blakei
## 470    RAM-05         Osmundastrum cinnamomea
## 471    RAM-05                   Picea mariana
## 472    RAM-05                   Picea mariana
## 473    RAM-05         Pogonia ophioglossoides
## 474    RAM-05         Pogonia ophioglossoides
## 475    RAM-05          Rhododendron canadense
## 476    RAM-05          Rhododendron canadense
## 477    RAM-05      Rhododendron groenlandicum
## 478    RAM-05      Rhododendron groenlandicum
## 479    RAM-05               Rhynchospora alba
## 480    RAM-05               Rhynchospora alba
## 481    RAM-05                     Rosa nitida
## 482    RAM-05                  Rosa palustris
## 483    RAM-05               Rubus flagellaris
## 484    RAM-05                  Rubus hispidus
## 485    RAM-05             Sarracenia purpurea
## 486    RAM-05             Sarracenia purpurea
## 487    RAM-05              Solidago uliginosa
## 488    RAM-05              Solidago uliginosa
## 489    RAM-05                    Spiraea alba
## 490    RAM-05      Symphyotrichum novi-belgii
## 491    RAM-05           Symplocarpus foetidus
## 492    RAM-05           Symplocarpus foetidus
## 493    RAM-05         Trichophorum cespitosum
## 494    RAM-05             Trientalis borealis
## 495    RAM-05             Trientalis borealis
## 496    RAM-05             Utricularia cornuta
## 497    RAM-05             Utricularia cornuta
## 498    RAM-05         Vaccinium angustifolium
## 499    RAM-05         Vaccinium angustifolium
## 500    RAM-05            Vaccinium corymbosum
## 501    RAM-05            Vaccinium corymbosum
## 502    RAM-05             Vaccinium oxycoccos
## 503    RAM-05             Vaccinium oxycoccos
## 504    RAM-05           Vaccinium vitis-idaea
## 505    RAM-05 Viburnum nudum var. cassinoides
## 506    RAM-05 Viburnum nudum var. cassinoides
## 507    RAM-05                   Xyris montana
## 508    RAM-05                   Xyris montana
##                                 Common Year PctFreq
## 1                            red maple 2011       0
## 2                         serviceberry 2011      20
## 3                         bog rosemary 2011      80
## 4                       dragon's mouth 2011      40
## 5                     black chokeberry 2011     100
## 6                        coastal sedge 2011      60
## 7                          leatherleaf 2011     100
## 8                     spoonleaf sundew 2011      60
## 9                    round-leaf sundew 2011     100
## 10                     black crowberry 2011     100
## 11                    tall cottongrass 2011     100
## 12                 tussock cottongrass 2011      60
## 13                   black huckleberry 2011      20
## 14                   dwarf huckleberry 2011     100
## 15             mountain holly/catberry 2011      40
## 16                      common juniper 2011      40
## 17                        sheep laurel 2011     100
## 18                          bog laurel 2011     100
## 19                            tamarack 2011      40
## 20                           sweetgale 2011      60
## 21                         spatterdock 2011      20
## 22                        black spruce 2011      80
## 23                             rhodora 2011      20
## 24                    bog Labrador tea 2011      60
## 25                     white beaksedge 2011     100
## 26                 purple pitcherplant 2011     100
## 27                       bog goldenrod 2011     100
## 28                       skunk cabbage 2011      20
## 29                      tufted bulrush 2011     100
## 30                          starflower 2011      40
## 31                  horned bladderwort 2011     100
## 32                     small cranberry 2011     100
## 33                northern wild raisin 2011      40
## 34          northern yellow-eyed-grass 2011      40
## 35                           red maple 2011     100
## 36                      dragon's mouth 2011      20
## 37                    black chokeberry 2011     100
## 38                          gray birch 2011      20
## 39                       coastal sedge 2011     100
## 40                  woolly-fruit sedge 2011      40
## 41                       upright sedge 2011     100
## 42           Northwest Territory sedge 2011      60
## 43                         leatherleaf 2011     100
## 44                    spoonleaf sundew 2011      40
## 45                   round-leaf sundew 2011     100
## 46                      threeway sedge 2011      60
## 47                   tawny cottongrass 2011      80
## 48                   black huckleberry 2011      80
## 49             mountain holly/catberry 2011      60
## 50                  common winterberry 2011      20
## 51                       tapertip rush 2011      20
## 52                        sheep laurel 2011     100
## 53                          bog laurel 2011      40
## 54                            tamarack 2011     100
## 55                   earth loosestrife 2011      40
## 56  threeleaf false lily of the valley 2011      20
## 57                           bog muhly 2011      20
## 58                           sweetgale 2011     100
## 59                           bog aster 2011      80
## 60                        black spruce 2011     100
## 61                  eastern white pine 2011      80
## 62                             rhodora 2011      40
## 63                    bog Labrador tea 2011      20
## 64                     white beaksedge 2011       0
## 65                    bristly dewberry 2011      60
## ... [rows 66-506 omitted for brevity] ...
## 507         northern yellow-eyed-grass 2017      50
## 508         northern yellow-eyed-grass 2012      50

Return first 5 rows and a subset of columns of the data frame

ACAD_wetland[1:5, c("Site_Name", "Latin_Name", "Common", "Year", "PctFreq")]
##   Site_Name          Latin_Name           Common Year PctFreq
## 1    SEN-01         Acer rubrum        red maple 2011       0
## 2    SEN-01         Amelanchier     serviceberry 2011      20
## 3    SEN-01 Andromeda polifolia     bog rosemary 2011      80
## 4    SEN-01    Arethusa bulbosa   dragon's mouth 2011      40
## 5    SEN-01  Aronia melanocarpa black chokeberry 2011     100

Return all rows and first 4 columns of the data frame

ACAD_sub <- ACAD_wetland[ , 1:4] # works, but risky
ACAD_sub2 <- 
  ACAD_wetland[, c("Site_Name", "Site_Type", "Latin_Name", "Common")] # same result, but better
# compare the two data frames to the original
head(ACAD_wetland)
head(ACAD_sub)
head(ACAD_sub2)

Coding Tip: As shown above, you can specify columns by name or by column number. However, it’s almost always best to refer to columns by name. It makes your code easier to read and prevents it from breaking if columns get reordered.
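
To see why position-based subsetting is risky, here's a minimal sketch using a toy data frame (not the ACAD data) where the columns get reordered:

```r
# Toy data frame to illustrate why names beat positions
df <- data.frame(site = c("A", "B"), year = c(2011, 2012))

df[, 2]                          # position-based: returns year today
df2 <- df[, c("year", "site")]   # someone reorders the columns
df2[, 2]                         # position 2 is now site -- silently wrong
df2[, "year"]                    # name-based subsetting still returns year
```

The position-based code runs without error after the reorder, which makes this kind of bug hard to catch.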


Test your skills!

CHALLENGE: How would you look at the first 4 even rows (2, 4, 6, 8), and first 2 columns of the ACAD_wetland data frame?

Answer that works
ACAD_wetland[c(2, 4, 6, 8), c(1, 2)]
##   Site_Name Site_Type
## 2    SEN-01  Sentinel
## 4    SEN-01  Sentinel
## 6    SEN-01  Sentinel
## 8    SEN-01  Sentinel
Better answer that’s more stable
names(ACAD_wetland) # get the names of the first 2 columns
##  [1] "Site_Name"  "Site_Type"  "Latin_Name" "Common"     "Year"      
##  [6] "PctFreq"    "Ave_Cov"    "Invasive"   "Protected"  "X_Coord"   
## [11] "Y_Coord"
ACAD_wetland[c(2, 4, 6, 8), c("Site_Name", "Site_Type")]
##   Site_Name Site_Type
## 2    SEN-01  Sentinel
## 4    SEN-01  Sentinel
## 6    SEN-01  Sentinel
## 8    SEN-01  Sentinel

Advanced Bracketry You can do more than just subset by row numbers and column names. Pattern matching to return certain rows or columns is a common and more advanced use of brackets.
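
As a small sketch of pattern matching, grepl() returns TRUE/FALSE for each element that matches a pattern, which you can drop straight into brackets. The toy vector below stands in for a column like ACAD_wetland$Latin_Name:

```r
# Toy species vector; grepl() flags elements matching a pattern
spp <- c("Carex exilis", "Acer rubrum", "Carex lasiocarpa")

grepl("^Carex", spp)       # TRUE FALSE TRUE: names starting with "Carex"
spp[grepl("^Carex", spp)]  # "Carex exilis" "Carex lasiocarpa"
```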

Using = vs == vs %in%
A key point in R is knowing when to use a single = or a double ==.
  • When you use equals to assign a value to a new object or column (stay tuned for tomorrow), use the single =.
  • When you use equals to test whether values match, use the double ==.
  • Conversely, != means not equal to and is used the same way.
  • Another operator is %in%. It works like ==, but matches against multiple values. The == operator is not designed to compare against more than one value: it won’t give you an error, but it recycles the values element by element and can silently return the wrong answer. Use %in% whenever you’re matching against a set of values.

As you get more comfortable with R, this will become natural. If you forget, R will usually throw an error and may even hint that you used = where == was intended.
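
The recycling pitfall above is easiest to see with a small toy vector (made up for illustration, not from the dataset):

```r
site <- c("SEN-01", "SEN-02", "RAM-05", "SEN-01")

site == c("SEN-01", "RAM-05")
# TRUE FALSE FALSE FALSE -- element-by-element recycling, NOT set membership

site %in% c("SEN-01", "RAM-05")
# TRUE FALSE TRUE TRUE -- every value is checked against the whole set
```

Note that == missed the "RAM-05" in position 3, because recycling compared it against "SEN-01" instead.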

Pattern match (filter) to return a data frame of species that are not invasive and return all columns

head(ACAD_wetland)
##   Site_Name Site_Type          Latin_Name           Common Year PctFreq Ave_Cov
## 1    SEN-01  Sentinel         Acer rubrum        red maple 2011       0    0.02
## 2    SEN-01  Sentinel         Amelanchier     serviceberry 2011      20    0.02
## 3    SEN-01  Sentinel Andromeda polifolia     bog rosemary 2011      80    2.22
## 4    SEN-01  Sentinel    Arethusa bulbosa   dragon's mouth 2011      40    0.04
## 5    SEN-01  Sentinel  Aronia melanocarpa black chokeberry 2011     100    2.64
## 6    SEN-01  Sentinel        Carex exilis    coastal sedge 2011      60    6.60
##   Invasive Protected  X_Coord Y_Coord
## 1    FALSE     FALSE 574855.5 4911909
## 2    FALSE     FALSE 574855.5 4911909
## 3    FALSE     FALSE 574855.5 4911909
## 4    FALSE      TRUE 574855.5 4911909
## 5    FALSE     FALSE 574855.5 4911909
## 6    FALSE     FALSE 574855.5 4911909

ACAD_nat <- ACAD_wetland[ACAD_wetland$Invasive == FALSE, ]
table(ACAD_wetland$Invasive) # 9 T
## 
## FALSE  TRUE 
##   499     9

table(ACAD_nat$Invasive) # No T
## 
## FALSE 
##   499

Filter data to only return the Latin_Name column of rows where Invasive is TRUE.

ACAD_wetland$Latin_Name[ACAD_wetland$Invasive == TRUE]
## [1] "Berberis thunbergii"   "Berberis thunbergii"   "Berberis thunbergii"  
## [4] "Celastrus orbiculatus" "Rhamnus frangula"      "Rhamnus frangula"     
## [7] "Rhamnus frangula"      "Rhamnus frangula"      "Lonicera - Exotic"
ACAD_wetland[ACAD_wetland$Invasive == TRUE, "Latin_Name"] # equivalent
## [1] "Berberis thunbergii"   "Berberis thunbergii"   "Berberis thunbergii"  
## [4] "Celastrus orbiculatus" "Rhamnus frangula"      "Rhamnus frangula"     
## [7] "Rhamnus frangula"      "Rhamnus frangula"      "Lonicera - Exotic"

Filter data to return any plot where Arethusa bulbosa, Calopogon tuberosus, or Pogonia ophioglossoides were detected.

orchid_spp <- c("Arethusa bulbosa", "Calopogon tuberosus", "Pogonia ophioglossoides")
ACAD_orchid_plots <- ACAD_wetland[ACAD_wetland$Latin_Name %in% orchid_spp, 
                                  c("Site_Name", "Year", "Latin_Name")]
ACAD_orchid_plots
##     Site_Name Year              Latin_Name
## 4      SEN-01 2011        Arethusa bulbosa
## 36     SEN-02 2011        Arethusa bulbosa
## 246    RAM-53 2017 Pogonia ophioglossoides
## 415    RAM-05 2012        Arethusa bulbosa
## 416    RAM-05 2017        Arethusa bulbosa
## 420    RAM-05 2017     Calopogon tuberosus
## 421    RAM-05 2012     Calopogon tuberosus
## 473    RAM-05 2012 Pogonia ophioglossoides
## 474    RAM-05 2017 Pogonia ophioglossoides

Coding Tip: There are often multiple ways to perform a task. The best code is code that 1) works, 2) is easy to follow, and 3) is unlikely to break (e.g. use column names instead of numbers). That still means there are typically multiple equally valid approaches. There are other ways to judge good code as you advance, but for now, aspire to write code that meets these three qualities.


Functions unique(), sort(), length()

Determining the number of records that match a certain condition can be useful too. Say we want to know how many unique sites were sampled in the ACAD_wetland data frame. We can use a combination of brackets and other functions to summarize that, like below.

Sort a list of unique site names alphabetically.

# Return a vector of unique site names, sorted alphabetically
sites_unique <- sort(unique(ACAD_wetland[,"Site_Name"]))
sites_unique
## [1] "RAM-05" "RAM-41" "RAM-44" "RAM-53" "RAM-62" "SEN-01" "SEN-02" "SEN-03"

Determine number of unique sites

# Returns the number of elements in sites_unique vector
length(sites_unique) # 8
## [1] 8

CHALLENGE: How many unique species are there in the ACAD_wetland data frame?

Answer
# Option 1
length(unique(ACAD_wetland[, "Latin_Name"]))
# Option 2
length(unique(ACAD_wetland$Latin_Name)) # equivalent
## [1] 133

CHALLENGE: Which sites have species that are considered protected on them (Protected = TRUE)?

Answer
# Option 1 - use unique() to return each protected site name once
unique(ACAD_wetland$Site_Name[ACAD_wetland$Protected == TRUE])
# Option 2
unique(ACAD_wetland[ACAD_wetland$Protected == TRUE, "Site_Name"])
## [1] "SEN-01" "SEN-02" "RAM-53" "RAM-05"


Data Exploration

Exploring the data

We’ve already explored the wetland data a bit using head(), str(), names(), and View(). These are functions that you will use over and over as you work with data in R. Below, I’m going to show how I get to know a data set in R.

Read in example NETN tree data

trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")

Look at first few records

head(trees)
##   ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode    TSN ScientificName
## 1     MIMA       12  6/16/2025  FALSE       2025      13 183385  Pinus strobus
## 2     MIMA       12  6/16/2025  FALSE       2025      12  28728    Acer rubrum
## 3     MIMA       12  6/16/2025  FALSE       2025      11  28728    Acer rubrum
## 4     MIMA       12  6/16/2025  FALSE       2025       2  28728    Acer rubrum
## 5     MIMA       12  6/16/2025  FALSE       2025      10  28728    Acer rubrum
## 6     MIMA       12  6/16/2025  FALSE       2025       7  28728    Acer rubrum
##   DBHcm TreeStatusCode CrownClassCode DecayClassCode
## 1  24.9             AS              5           <NA>
## 2  10.9             AB              5           <NA>
## 3  18.8             AS              3           <NA>
## 4  51.2             AS              3           <NA>
## 5  38.2             AS              3           <NA>
## 6  22.5             AS              4           <NA>

Look at structure of each column

str(trees)
## 'data.frame':    164 obs. of  12 variables:
##  $ ParkUnit      : chr  "MIMA" "MIMA" "MIMA" "MIMA" ...
##  $ PlotCode      : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ SampleDate    : chr  "6/16/2025" "6/16/2025" "6/16/2025" "6/16/2025" ...
##  $ IsQAQC        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ SampleYear    : int  2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 ...
##  $ TagCode       : int  13 12 11 2 10 7 5 9 1 3 ...
##  $ TSN           : int  183385 28728 28728 28728 28728 28728 28728 28728 28728 28728 ...
##  $ ScientificName: chr  "Pinus strobus" "Acer rubrum" "Acer rubrum" "Acer rubrum" ...
##  $ DBHcm         : num  24.9 10.9 18.8 51.2 38.2 22.5 26.4 42.9 12.3 49 ...
##  $ TreeStatusCode: chr  "AS" "AB" "AS" "AS" ...
##  $ CrownClassCode: int  5 5 3 3 3 4 NA NA NA NA ...
##  $ DecayClassCode: chr  NA NA NA NA ...
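
One thing str() reveals is that SampleDate is stored as character. A common follow-up step is converting it to a Date; this sketch assumes the month/day/year format shown in the output (e.g. "6/16/2025"):

```r
# Convert character dates to Date class; format string matches "6/16/2025"
dates <- c("6/16/2025", "6/17/2025")
as.Date(dates, format = "%m/%d/%Y")
# "2025-06-16" "2025-06-17"
```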

Look at summary of the columns

summary(trees)
##    ParkUnit            PlotCode      SampleDate          IsQAQC       
##  Length:164         Min.   :11.00   Length:164         Mode :logical  
##  Class :character   1st Qu.:14.00   Class :character   FALSE:164      
##  Mode  :character   Median :16.50   Mode  :character                  
##                     Mean   :16.05                                     
##                     3rd Qu.:19.00                                     
##                     Max.   :20.00                                     
##                                                                       
##    SampleYear      TagCode          TSN         ScientificName    
##  Min.   :2025   Min.   : 1.0   Min.   : 19049   Length:164        
##  1st Qu.:2025   1st Qu.: 7.0   1st Qu.: 24764   Class :character  
##  Median :2025   Median :12.5   Median : 28728   Mode  :character  
##  Mean   :2025   Mean   :13.6   Mean   : 62361                     
##  3rd Qu.:2025   3rd Qu.:19.0   3rd Qu.: 32929                     
##  Max.   :2025   Max.   :36.0   Max.   :565478                     
##                                                                   
##      DBHcm        TreeStatusCode     CrownClassCode  DecayClassCode    
##  Min.   : 10.00   Length:164         Min.   :1.000   Length:164        
##  1st Qu.: 13.12   Class :character   1st Qu.:3.000   Class :character  
##  Median : 19.00   Mode  :character   Median :5.000   Mode  :character  
##  Mean   : 25.47                      Mean   :4.165                     
##  3rd Qu.: 28.45                      3rd Qu.:5.000                     
##  Max.   :443.00                      Max.   :6.000                     
##                                      NA's   :25

Check for complete cases for the first 10 columns that should always have data.

table(complete.cases(trees[, 1:10])) # all TRUE
## 
## TRUE 
##  164

There’s a lot to digest from the summary results.
  • We can see that ParkUnit and ScientificName are treated as characters, which makes sense.
  • PlotCode is numeric, the range of plot numbers is 11 to 20, and there are no blanks (NAs).
  • SampleDate is being interpreted as a character, not a date. We’ll fix that later.
  • IsQAQC is being treated as TRUE/FALSE. We’ll use that to filter out QAQC visits.
  • SampleYear is all 2025.
  • DBHcm ranges from 10 to 443.0 with no blanks, though that maximum looks suspicious. CrownClassCode has 25 blanks (NAs).
  • DecayClassCode is reading in as a character, not a number. That’s weird. We’ll look deeper into that next.


Sidebar on NAs

To keep data frames rectangular, R treats missing data (i.e. blanks) as NA (short for not available). A foundational philosophy of R is that the user must tell R functions what to do if NAs are in the data. Ideally that forces the user to investigate the NAs to determine why they’re there, whether there’s a way to fix them, whether those records should be dropped, etc. If you try to calculate the mean of a column that has a blank in it and you don’t tell R what to do with NAs, the returned value will be NA. Most summary functions in R have an na.rm argument, which is logical (TRUE/FALSE). To drop NAs, include na.rm = TRUE.

It’s important every time you have NAs in your data to think about what they mean and how best to treat them. Sometimes, it’s best to drop them. Other times, converting the blanks to 0 is the best approach. It depends entirely on your data and what you intend to do with it.
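As a minimal sketch of those two treatments on a toy vector (the values here are made up for illustration):

```r
# Toy vector with one blank (hypothetical values)
x <- c(2, 4, NA, 6)

# Treatment 1: drop the NA when summarizing
mean(x, na.rm = TRUE)   # mean of 2, 4, 6 = 4

# Treatment 2: decide the blank really means zero, then summarize
x_zero <- x
x_zero[is.na(x_zero)] <- 0
mean(x_zero)            # mean of 2, 4, 0, 6 = 3
```

Note that the two answers differ, which is exactly why the choice of treatment deserves thought.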

Test NA use with mean() function

x <- c(1, 3, 8, 3, 5, NA)
mean(x) # returns NA
## [1] NA
mean(x, na.rm = TRUE) 
## [1] 4


Fix the data

The steps we are going to take with the NETN tree data are:
  1. Replace decay class code “PM” with NA (blank).
  2. Convert SampleDate (character) to Date.
  3. Rename the ScientificName column to Species.
  4. Create a Plot_Name column that’s a combination of ParkUnit and PlotCode.
  5. Change the tree DBH that’s 443.0 to 44.3.
  6. Drop records that are QAQC visits.

Fix DecayClassCode

Look at unique values for DecayClassCode.

sort(unique(trees$DecayClassCode)) # sorts the unique values in the column
## [1] "1"  "2"  "3"  "PM"
table(trees$DecayClassCode) # shows the number of records per value - very handy
## 
##  1  2  3 PM 
##  9  6  8  2

There are 2 records called “PM”, which stands for Permanently Missing in our forest data. We will convert PM to a blank, which R calls NA, and create a new decay class column that is converted to numeric.

Convert “PM” to blank. I will first make a copy of the data frame.

trees2 <- trees
trees2$DecayClassCode[trees2$DecayClassCode == "PM"] <- NA
trees2$DecayClassCode_num <- as.numeric(trees2$DecayClassCode)

# check that it worked
str(trees2) # DecayClassCode_num is numeric
## 'data.frame':    164 obs. of  13 variables:
##  $ ParkUnit          : chr  "MIMA" "MIMA" "MIMA" "MIMA" ...
##  $ PlotCode          : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ SampleDate        : chr  "6/16/2025" "6/16/2025" "6/16/2025" "6/16/2025" ...
##  $ IsQAQC            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ SampleYear        : int  2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 ...
##  $ TagCode           : int  13 12 11 2 10 7 5 9 1 3 ...
##  $ TSN               : int  183385 28728 28728 28728 28728 28728 28728 28728 28728 28728 ...
##  $ ScientificName    : chr  "Pinus strobus" "Acer rubrum" "Acer rubrum" "Acer rubrum" ...
##  $ DBHcm             : num  24.9 10.9 18.8 51.2 38.2 22.5 26.4 42.9 12.3 49 ...
##  $ TreeStatusCode    : chr  "AS" "AB" "AS" "AS" ...
##  $ CrownClassCode    : int  5 5 3 3 3 4 NA NA NA NA ...
##  $ DecayClassCode    : chr  NA NA NA NA ...
##  $ DecayClassCode_num: num  NA NA NA NA NA NA 1 3 2 3 ...

sort(unique(trees2$DecayClassCode_num)) # Only numbers show in table
## [1] 1 2 3

Remove QAQC visits

Using the trees2 data frame, which fixed the decay class column by adding the numeric DecayClassCode_num field, we’re now going to drop visits that were for QAQC using the base R function subset(). The subset() function allows you to reduce the dimensions of a data frame: you can reduce rows, columns, or both in the same function call. I will also show the bracket approach.

Remove QAQC visits (IsQAQC == TRUE) and drop the DecayClassCode column

trees3 <- subset(trees2, IsQAQC == FALSE, select = -DecayClassCode) # Note the importance of FALSE in all caps
trees3 <- subset(trees2, IsQAQC != TRUE, select = -DecayClassCode) # equivalent
trees3 <- trees2[trees2$IsQAQC == FALSE, -12] # equivalent but not as easy to follow

Convert SampleDate to Date

Convert SampleDate from character to the Date class.

# Look at the sample date format
head(trees3$SampleDate) # month/day/year
## [1] "6/16/2025" "6/16/2025" "6/16/2025" "6/16/2025" "6/16/2025" "6/16/2025"

# Create new column called Date
trees3$Date <- as.Date(trees3$SampleDate, format = "%m/%d/%Y")
str(trees3)
## 'data.frame':    164 obs. of  13 variables:
##  $ ParkUnit          : chr  "MIMA" "MIMA" "MIMA" "MIMA" ...
##  $ PlotCode          : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ SampleDate        : chr  "6/16/2025" "6/16/2025" "6/16/2025" "6/16/2025" ...
##  $ IsQAQC            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ SampleYear        : int  2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 ...
##  $ TagCode           : int  13 12 11 2 10 7 5 9 1 3 ...
##  $ TSN               : int  183385 28728 28728 28728 28728 28728 28728 28728 28728 28728 ...
##  $ ScientificName    : chr  "Pinus strobus" "Acer rubrum" "Acer rubrum" "Acer rubrum" ...
##  $ DBHcm             : num  24.9 10.9 18.8 51.2 38.2 22.5 26.4 42.9 12.3 49 ...
##  $ TreeStatusCode    : chr  "AS" "AB" "AS" "AS" ...
##  $ CrownClassCode    : int  5 5 3 3 3 4 NA NA NA NA ...
##  $ DecayClassCode_num: num  NA NA NA NA NA NA 1 3 2 3 ...
##  $ Date              : Date, format: "2025-06-16" "2025-06-16" ...

Rename ScientificName to Species for shorter typing

Renaming columns in base R is kind of a pain, to the point that I have to look it up every time I need to do it. I’ll show you an easier way to do that tomorrow.

Rename ScientificName column

names(trees3) # original names
##  [1] "ParkUnit"           "PlotCode"           "SampleDate"        
##  [4] "IsQAQC"             "SampleYear"         "TagCode"           
##  [7] "TSN"                "ScientificName"     "DBHcm"             
## [10] "TreeStatusCode"     "CrownClassCode"     "DecayClassCode_num"
## [13] "Date"

names(trees3)[names(trees3) == "ScientificName"] <- "Species"
names(trees3) # check that it worked
##  [1] "ParkUnit"           "PlotCode"           "SampleDate"        
##  [4] "IsQAQC"             "SampleYear"         "TagCode"           
##  [7] "TSN"                "Species"            "DBHcm"             
## [10] "TreeStatusCode"     "CrownClassCode"     "DecayClassCode_num"
## [13] "Date"

Create a Plot_Name column via paste()

The paste() and paste0() functions are very handy for creating new columns that are combinations of existing columns. The code below will create a new column named Plot_Name that’s a combination of ParkUnit and PlotCode.

Create new Plot_Name column

trees3$Plot_Name <- paste(trees3$ParkUnit, trees3$PlotCode, sep = "-")
trees3$Plot_Name <- paste0(trees3$ParkUnit, "-", trees3$PlotCode) # equivalent; by default paste0() puts no separator between elements

Coding Tip: In most cases, it does not matter whether you use single (') or double (") quotes, as long as you open and close with the same one. The case where it matters is quotes within quotes. There you have to alternate your usage, like print("Text in outer quotes 'text printed as being within quotes' ends with the closing double quote").


Test your skills!
CHALLENGE: How many trees are on Plot MIMA-12?
Answer
mima12 <- subset(trees3, Plot_Name == "MIMA-12")
nrow(mima12) # 12
## [1] 12

CHALLENGE: How many trees with a TreeStatusCode of “AS” (alive standing) are on Plot MIMA-12?
Answer

Option 1. Subset data then calculate number of rows

mima12_as <- subset(trees3, Plot_Name == "MIMA-12" & TreeStatusCode == "AS")
nrow(mima12_as) # 6
## [1] 6

Option 2. Subset the data with brackets and use the table() function to tally status codes.

# OPTION 2
mima12 <- trees3[trees3$Plot_Name == "MIMA-12",]
table(mima12$TreeStatusCode) # 6
## 
## AB AS DB DM DS 
##  1  6  3  1  1

CHALLENGE: Find the record with DBH > 400 cm.

Answer

There are multiple ways to do this. Two examples are below.

Option 1. View the data and sort by DBH.

View(trees3)

Option 2. Find the max DBH value and subset the data frame

max_dbh <- max(trees3$DBHcm, na.rm = TRUE)
trees3[trees3$DBHcm == max_dbh,]
##    ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode   TSN       Species
## 26     MIMA       16  6/17/2025  FALSE       2025       1 19447 Quercus robur
##    DBHcm TreeStatusCode CrownClassCode DecayClassCode_num       Date Plot_Name
## 26   443             AS              3                 NA 2025-06-17   MIMA-16

CHALLENGE: What is the exact value of the largest DBH, and which record does it belong to?

Answer

There are multiple ways to do this. Two examples are below.

Option 1. View the data and sort by DBH.

View(trees)

Option 2. Find the max DBH value and subset the data frame

max_dbh <- max(trees3$DBHcm, na.rm = TRUE)
max_dbh #443
## [1] 443

trees[trees$DBHcm == max_dbh,]
##    ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode   TSN ScientificName
## 26     MIMA       16  6/17/2025  FALSE       2025       1 19447  Quercus robur
##    DBHcm TreeStatusCode CrownClassCode DecayClassCode
## 26   443             AS              3           <NA>

# Plot MIMA-16, TagCode = 1.

CHALLENGE: Fix the DBH typo by replacing 443.0 with 44.3.

Answer

Let’s say that you looked at the datasheet, and the actual DBH for that tree was 44.3 instead of 443.0. You can change that value in the original CSV by hand. But even better is to document that change in code. There are multiple ways to do this. Two examples are below.

But first, it’s good to create a new data frame when modifying the original data frame, so you can refer back to the original if needed. I also use a really specific filter to make sure I’m not accidentally changing other data.

Replace 443 with 44.3

# create copy of trees data
trees_fix <- trees3

# find the problematic DBH value, and change it to 44.3
trees_fix$DBHcm[trees_fix$Plot_Name == "MIMA-16" & trees_fix$TagCode == 1 & trees_fix$DBHcm == 443] <- 44.3

Check that it worked by showing the range of the original and fixed data frames.

range(trees$DBHcm)
## [1]  10 443
range(trees_fix$DBHcm)
## [1] 10.0 81.5


Basic Plotting


Visualizing the data is also important to get a sense for the data and look for potential errors and outliers. Base R has plotting functions that allow you to create quick plots without having to know a lot of code. I often use Base R plot functions when I’m exploring data but not making plots I plan to use for publication. When I need to create more complex plots, I use ggplot2, which we’ll cover on Day 2 and 3.

Histograms are a great start. The code below generates a basic histogram plot of a specific column in the dataframe using the hist() function.

Plot histogram of DBH measurements

hist(x = trees$DBHcm)


Looking at the histogram, all of the measurements are below 100 cm except for one that’s way out in the 400 range. You can also make a scatterplot of the data. If you only specify one column, the x axis will be the row number of each record, and the y axis will be the specified column.

Make point plot of DBH measurements

plot(trees$DBHcm)


Again, you can see there’s one value that’s greater than all of the others.

We can also plot two variables in a scatterplot.

Make scatterplot of crown class vs. DBH measurements (Option 1)

plot(trees$DBHcm ~ trees$CrownClassCode)

Make scatterplot of crown class vs. DBH measurements (Option 2- better axis labels)

plot(DBHcm ~ CrownClassCode, data = trees) # equivalent but cleaner axis titles


Again, you can see there’s one value that’s greater than all of the others, and it’s crown class code 3 (codominant).

CHALLENGE: Plot a histogram of percent cover (Ave_Cov) in the ACAD_wetland data.

Answer
hist(ACAD_wetland$Ave_Cov)



Day 2: Wrangling and Viz I

Day 2 Goals

Goals for Day 2:
  1. Understanding of tidy data format (rows are observations; columns are variables).
  2. Exposure to the main tidyverse packages and philosophy behind it.
  3. Comfortable filtering and selecting data, and renaming and creating new columns in dplyr.
  4. How to use ifelse() and case_when() conditional statements.
  5. Comfortable grouping and summarizing data in dplyr.
  6. Difference between summarize() and mutate().
  7. Data visualization best practices.
  8. Understanding the building blocks of the ggplot2 plotting package.

Feedback: Please leave feedback in the training feedback form. You can submit feedback multiple times and don’t need to answer every question. Responses are anonymous.

friends with tidy
Artwork by @allison_horst


Tidyverse

Tidyverse background

Tidyverse packages
From tidyverse.org

We are now going to learn how to subset rows and columns and other common data wrangling tasks using packages in the tidyverse. Taken directly from tidyverse.org: “The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

Why I like the tidyverse:
  • Function names and arguments are well-named, clear, and consistent, making code easier to read/understand than base R.
  • Functions assume the tidy data format (rows = observations; columns = variables).
  • Availability of tidyverse help and free online learning materials is excellent.
  • Functions take data as the first argument, making it easy to pipe together multiple tasks (more on pipes later).
  • Column names don’t need to be quoted to refer to them in function arguments (called non-standard evaluation), making typing faster.
  • Taken together, these features make the learning curve much shallower than base R.
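Two of those features (data-first arguments and unquoted column names) are easy to see side by side. This is a quick sketch using a made-up toy data frame, assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data frame
dat <- data.frame(park = c("MIMA", "ACAD", "MIMA"),
                  dbh  = c(12.5, 30.1, 44.3))

# Base R: columns must be fully qualified with dat$
big_base <- dat[dat$park == "MIMA" & dat$dbh > 20, ]

# dplyr: data is the first argument, and bare column names work
# (multiple conditions separated by commas are combined with AND)
big_dplyr <- filter(dat, park == "MIMA", dbh > 20)

big_dplyr
```

Both versions return the same single row; the dplyr call just involves less repetitive typing.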

You should have installed all of the tidyverse packages in preparation for this training. If you missed that step, install tidyverse packages using code below. It can take a few minutes for all the packages to install.

Only run if you haven’t installed these packages yet

install.packages('tidyverse')

Load the tidyverse

library(tidyverse)

Coding Tip: When you type library(tidyverse), you’re loading all nine of the core packages in the tidyverse. If you’re only using one or two packages, it’s better to load just those packages. That makes it clearer to the user which packages are needed to run your code and reduces dependencies. For this session, we’re only going to use dplyr, so I will just load that.

library(dplyr)

Tidy Data Format For most of the tasks you will do in R, you will want your data organized in a particular format, often referred to as Tidy Data (see below). Tidy data is organized such that columns are variables, like plot_number, date, measurement, and rows are observations. Each cell is a value. Base R and most R packages are designed to work with data in this format. Consistently following this approach saves mental and coding time otherwise spent figuring out how to organize your data. The tidyverse suite of packages, which we’re covering today, is especially optimized to work with data in this format.

Tidy Data
Figure From R for Data Science
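As a concrete sketch of the idea, here is the same set of (made-up) DBH measurements in untidy form (one column per year) versus tidy form (one row per observation):

```r
# Untidy: the same variable (DBH) spread across year columns
untidy <- data.frame(plot  = c("MIMA-11", "MIMA-12"),
                     y2024 = c(21.5, 24.9),
                     y2025 = c(22.0, 25.3))

# Tidy: one row per observation; plot, year, and dbh are each a variable
tidy <- data.frame(plot = rep(c("MIMA-11", "MIMA-12"), each = 2),
                   year = rep(c(2024, 2025), times = 2),
                   dbh  = c(21.5, 22.0, 24.9, 25.3))

# Tidy format makes grouped operations trivial, e.g. mean DBH per year
tapply(tidy$dbh, tidy$year, mean)
```

With the untidy layout, computing a per-year mean would require knowing which columns are years; with the tidy layout, year is just another variable to group by.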


Core tidyverse packages (ordered from my most to least used):
  • ggplot2: plotting package based on The Grammar of Graphics (more on that later).
  • dplyr: filtering, selecting, renaming, and summarizing based on SQL.
  • purrr: contains functions like map() that allow you to iterate functions or processes like a for loop.
  • tidyr: reshaping data from wide to long and long to wide.
  • stringr: functions that help you work with strings, such as extracting specific patterns, splitting strings on a specific character (e.g. the '_' in ACAD_101), left-padding a number with 0s and turning it into a string, etc. There’s almost always a base R version of a stringr function, but stringr functions tend to be easier to use and read.
  • readr: includes read functions for csv and other formats. The read_csv() function, for example, has more bells and whistles than the base R read.csv() function. I’ve never needed those extra features, so I just use read.csv().
  • lubridate: package that makes working with dates easier. However, the functions are stricter than base R date functions. I tend to prefer base R for that reason.
  • forcats: package for working with factors, which are special character columns that are categorical with defined levels.
  • tibble: tibbles are basically data frames with more checks on the data. When you create a new object with a tidyverse function, the default result will be a tibble instead of a data frame. 99% of the time, it won’t matter whether your data object is a data frame or a tibble. The 1% of the time it does matter has made me strongly dislike tibbles. From tidyverse.org: “Tibbles are data.frames that are lazy and surly: they do less and complain more, forcing you to confront problems earlier, typically leading to cleaner, more expressive code”. I also prefer the base R format of head(data.frame) over the format of head(tibble).
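To make the stringr bullet concrete, here is a small sketch of the two tasks it mentions, assuming stringr is installed (the base R equivalents are noted in comments):

```r
library(stringr)

code <- "ACAD_101"

# Split on the underscore (base R equivalent: strsplit(code, "_"))
str_split(code, "_")[[1]]                    # "ACAD" "101"

# Left-pad plot numbers with zeros (base R equivalent: sprintf("%03d", ...))
str_pad(c("1", "12"), width = 3, pad = "0")  # "001" "012"
```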


Wrangling with dplyr

Introduction to dplyr

The dplyr package is perhaps the single most useful package in R for working with your data.

dplyr filter
Artwork by @allison_horst

Commonly used dplyr functions and their use:
  • filter(): filters data for observations that meet specific criteria.
  • select(): subsets columns by either selecting or removing them.
  • arrange(): sorts data by specified column.
  • mutate(): adds new column(s) to a data frame.
  • slice(): selects rows by their position (e.g. slice(1) returns the first row).
  • group_by() |> summarize(): summarizes by groups/factor levels and returns data for each group (e.g. mean cover grouped by plot).
  • group_by() |> mutate(): summarizes by groups/factor levels and returns data for each observation.
  • rename(): renames columns
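The group_by() |> summarize() versus group_by() |> mutate() distinction above is easiest to see on a toy example. This sketch uses made-up data and assumes dplyr is loaded:

```r
library(dplyr)

# Hypothetical toy data: two plots with two trees each
df <- data.frame(plot = c("A", "A", "B", "B"),
                 dbh  = c(10, 20, 30, 50))

# summarize(): collapses to one row per group
plot_means <- df |>
  group_by(plot) |>
  summarize(mean_dbh = mean(dbh))
plot_means   # plot A = 15, plot B = 40

# mutate(): keeps all 4 rows, attaching each plot's mean to its trees
df_means <- df |>
  group_by(plot) |>
  mutate(mean_dbh = mean(dbh)) |>
  ungroup()
df_means
```

Use summarize() when you want one result per group, and mutate() when you want the group-level value alongside every original observation (e.g. to compare each tree to its plot mean).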

Now, using the dplyr package in the tidyverse, we’re going to do the same operations we did yesterday with brackets.

The steps we took with the NETN tree data were:
  1. Replace decay class code “PM” with NA (blank).
  2. Convert SampleDate (character) to Date.
  3. Rename the ScientificName column to Species.
  4. Create a Plot_Name column that’s a combination of ParkUnit and PlotCode.
  5. Change tree DBH that’s 443.0 to 44.3.
  6. Drop records that are QAQC visits.


Wrangle in dplyr

Read in example NETN tree data

trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")
  1. Replace decay class code “PM” with NA (blank).

    # Base R
    trees2 <- trees
    trees2$DecayClassCode[trees2$DecayClassCode == "PM"] <- NA
    trees2$DecayClassCode_num <- as.numeric(trees2$DecayClassCode)
    # dplyr approach with mutate
    trees2 <- mutate(trees, DecayClassCode_num = as.numeric(replace(DecayClassCode, DecayClassCode == "PM", NA)))
    str(trees2)
    ## 'data.frame':    164 obs. of  13 variables:
    ##  $ ParkUnit          : chr  "MIMA" "MIMA" "MIMA" "MIMA" ...
    ##  $ PlotCode          : int  12 12 12 12 12 12 12 12 12 12 ...
    ##  $ SampleDate        : chr  "6/16/2025" "6/16/2025" "6/16/2025" "6/16/2025" ...
    ##  $ IsQAQC            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
    ##  $ SampleYear        : int  2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 ...
    ##  $ TagCode           : int  13 12 11 2 10 7 5 9 1 3 ...
    ##  $ TSN               : int  183385 28728 28728 28728 28728 28728 28728 28728 28728 28728 ...
    ##  $ ScientificName    : chr  "Pinus strobus" "Acer rubrum" "Acer rubrum" "Acer rubrum" ...
    ##  $ DBHcm             : num  24.9 10.9 18.8 51.2 38.2 22.5 26.4 42.9 12.3 49 ...
    ##  $ TreeStatusCode    : chr  "AS" "AB" "AS" "AS" ...
    ##  $ CrownClassCode    : int  5 5 3 3 3 4 NA NA NA NA ...
    ##  $ DecayClassCode    : chr  NA NA NA NA ...
    ##  $ DecayClassCode_num: num  NA NA NA NA NA NA 1 3 2 3 ...

  2. Convert SampleDate (character) to Date.

    # Base R
    trees3$Date <- as.Date(trees3$SampleDate, format = "%m/%d/%Y")
    # dplyr approach with mutate
    trees3 <- mutate(trees2, Date = as.Date(SampleDate, format = "%m/%d/%Y"))

  3. Rename the ScientificName column to Species.

    # Base R
    names(trees2)[names(trees2) == "ScientificName"] <- "Species"
    # dplyr approach with rename
    trees2 <- rename(trees2, "Species" = "ScientificName")
    names(trees2)
    ##  [1] "ParkUnit"           "PlotCode"           "SampleDate"        
    ##  [4] "IsQAQC"             "SampleYear"         "TagCode"           
    ##  [7] "TSN"                "Species"            "DBHcm"             
    ## [10] "TreeStatusCode"     "CrownClassCode"     "DecayClassCode"    
    ## [13] "DecayClassCode_num"

  4. Create a Plot_Name column that’s a combination of ParkUnit and PlotCode.

    # Base R
    trees2$Plot_Name <- paste(trees2$ParkUnit, trees2$PlotCode, sep = "-")
    # dplyr approach with mutate
    trees2 <- mutate(trees2, Plot_Name = paste(ParkUnit, PlotCode, sep = "-"))

  5. Drop records that are QAQC visits.

    # Base R
    trees3 <- subset(trees2, IsQAQC == FALSE, select = -DecayClassCode) # Note the importance of FALSE in all caps
    # dplyr
    trees3a <- filter(trees2, IsQAQC == FALSE)
    trees3 <- select(trees3a, -DecayClassCode)
    
    head(trees3)
    ##   ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode    TSN       Species
    ## 1     MIMA       12  6/16/2025  FALSE       2025      13 183385 Pinus strobus
    ## 2     MIMA       12  6/16/2025  FALSE       2025      12  28728   Acer rubrum
    ## 3     MIMA       12  6/16/2025  FALSE       2025      11  28728   Acer rubrum
    ## 4     MIMA       12  6/16/2025  FALSE       2025       2  28728   Acer rubrum
    ## 5     MIMA       12  6/16/2025  FALSE       2025      10  28728   Acer rubrum
    ## 6     MIMA       12  6/16/2025  FALSE       2025       7  28728   Acer rubrum
    ##   DBHcm TreeStatusCode CrownClassCode DecayClassCode_num Plot_Name
    ## 1  24.9             AS              5                 NA   MIMA-12
    ## 2  10.9             AB              5                 NA   MIMA-12
    ## 3  18.8             AS              3                 NA   MIMA-12
    ## 4  51.2             AS              3                 NA   MIMA-12
    ## 5  38.2             AS              3                 NA   MIMA-12
    ## 6  22.5             AS              4                 NA   MIMA-12

Note that subsetting data frames in R, which refers to reducing rows, columns, or both, is split between two functions in dplyr. The filter() function reduces rows. The select() function reduces columns.


The magic pipe |>

The pipe (|> or %>%) makes dplyr and other tidyverse packages even more powerful. The pipe allows you to string together commands, passing the result of one step into the next. So, taking all of the code above, we can do it in a single chain of function calls.

Wrangle tree data with pipes

trees_final <- trees |> 
  mutate(DecayClassCode_num = as.numeric(replace(DecayClassCode, DecayClassCode == "PM", NA)),
         Plot_Name = paste(ParkUnit, PlotCode, sep = "-"),
         Date = as.Date(SampleDate, format = "%m/%d/%Y")) |> 
  rename("Species" = "ScientificName") |> 
  filter(IsQAQC == FALSE) |> 
  select(-DecayClassCode) |> 
  arrange(Plot_Name, TagCode)

head(trees_final)  
##   ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode   TSN
## 1     MIMA       11  6/16/2025  FALSE       2025       1 28728
## 2     MIMA       11  6/16/2025  FALSE       2025       2 28728
## 3     MIMA       11  6/16/2025  FALSE       2025       3 28728
## 4     MIMA       11  6/16/2025  FALSE       2025       4 28728
## 5     MIMA       11  6/16/2025  FALSE       2025       5 19281
## 6     MIMA       11  6/16/2025  FALSE       2025       6 19300
##             Species DBHcm TreeStatusCode CrownClassCode DecayClassCode_num
## 1       Acer rubrum  21.5             AS              5                 NA
## 2       Acer rubrum  10.4             DS             NA                  2
## 3       Acer rubrum  16.8             AS              5                 NA
## 4       Acer rubrum  13.6             AS              5                 NA
## 5 Quercus palustris  61.8             AS              1                 NA
## 6   Quercus bicolor  15.5             AS              5                 NA
##   Plot_Name       Date
## 1   MIMA-11 2025-06-16
## 2   MIMA-11 2025-06-16
## 3   MIMA-11 2025-06-16
## 4   MIMA-11 2025-06-16
## 5   MIMA-11 2025-06-16
## 6   MIMA-11 2025-06-16

The warning in the console tells us that in converting DecayClassCode to numeric, some NAs were introduced. This means that any row in DecayClassCode with text in it was converted to NA. In this case it’s the ‘PM’ records, so we’re expecting this warning.
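You can reproduce that warning on its own with a tiny toy vector:

```r
# Any value that can't be parsed as a number becomes NA, with a
# "NAs introduced by coercion" warning
vals <- as.numeric(c("1", "3", "PM"))
vals   # 1 3 NA
```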

The arrange() line just shows how to order the data by plot name and tree tag number.

Hopefully you agree that pipes are amazing! They allow for more efficient coding with relatively easy-to-follow steps, and they make dplyr functions like mutate() much more useful. Outside of pipes, for example, mutate() doesn’t feel any more useful than base R for creating a new column. From now on, I will use pipes regularly in the code.

If you’ve ever seen %>%, that also functions as a pipe. The %>% pipe was the original pipe, introduced to the tidyverse by the magrittr package. The magrittr pipe was so popular that, starting in R 4.1, a native pipe (|>) was added to base R. It’s better optimized for order of operations and removes a package dependency. So, in general, use the base R pipe |>. It’s also why I had you set the default pipe in Global Options to |>. A useful keyboard shortcut for the pipe is Ctrl + Shift + M. You should see the |> pipe in your script when you type that shortcut. If you get the %>% pipe instead, you need to change that default setting in Global Options (see Day 1 > R and RStudio > RStudio Global Options > Step 3. Change default pipe).
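Both pipes do the same basic job: pass the left-hand result in as the first argument of the function on the right. A tiny sketch:

```r
nums <- c(1, 4, 11)

# Nested base R call, read inside-out
sqrt(sum(nums))

# Base R pipe (R >= 4.1), read left to right: sum first, then sqrt
piped <- nums |> sum() |> sqrt()
piped   # 4
```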

Coding Tip: While the number of steps you can pipe together is virtually endless, piping many tasks, especially complex ones, can make code hard to read and troubleshoot. It’s best to limit the number of piped steps to 3-4, and/or to run complex tasks that might fail or require checking on their own.


Test your skills with dplyr!
CHALLENGE: How many trees are on Plot MIMA-12 (using trees_final)?
Answer
trees_final |> filter(Plot_Name == "MIMA-12") |> nrow()
## [1] 12

CHALLENGE: How many trees with a TreeStatusCode of “AS” (alive standing) are on Plot MIMA-12 (using trees_final)?
Answer
trees_final |> filter(Plot_Name == "MIMA-12" & TreeStatusCode == "AS") |> nrow()
## [1] 6

CHALLENGE: What is the exact value of the largest DBH, and which record does it belong to?

Answer
# Base R and dplyr combo
max_dbh <- max(trees_final$DBHcm, na.rm = TRUE)
trees_final |> 
  filter(DBHcm == max_dbh) |> 
  select(Plot_Name, SampleYear, TagCode, Species, DBHcm)
##   Plot_Name SampleYear TagCode       Species DBHcm
## 1   MIMA-16       2025       1 Quercus robur   443

# dplyr with slice
trees_final |> 
  arrange(desc(DBHcm)) |> # arrange DBHcm high to low via desc()
  slice(1) |> # slice the top record
  select(Plot_Name, SampleYear, TagCode, Species, DBHcm)
##   Plot_Name SampleYear TagCode       Species DBHcm
## 1   MIMA-16       2025       1 Quercus robur   443

CHALLENGE: Fix the DBH typo by replacing 443.0 with 44.3.

Answer
# Base R
# create copy of trees data
trees_fix <- trees3
# find the problematic DBH value, and change it to 44.3
trees_fix$DBHcm[trees_fix$Plot_Name == "MIMA-16" & trees_fix$TagCode == 1 & trees_fix$DBHcm == 443] <- 44.3

# dplyr via replace
trees_fix <- trees |> mutate(DBHcm = replace(DBHcm, DBHcm == 443.0, 44.3))

Check that it worked by showing the range of the original and fixed data frames.

range(trees$DBHcm)
## [1]  10 443
range(trees_fix$DBHcm)
## [1] 10.0 81.5


Conditional Functions

Conditionals 101

Conditional functions ifelse(), if(){ }else{ }, and case_when() allow you to return results that depend on specified conditions.

The main differences between the most common conditional functions:
  • ifelse(): Primarily for use with data frames. Takes 3 arguments: 1) the condition to test; 2) the value to return if the condition is true; 3) the value to return if the condition is false. The function can only handle 2 possible outcomes, although nested ifelse() statements are possible (see example below). This function is vectorized, which means it’s optimized for working on columns in data frames. Of the 3 conditionals, it tends to perform the fastest on large data sets.
  • case_when(): Primarily for use with data frames. Can take any number of condition statements, each with its own value to return. Requires the dplyr package to be loaded. The syntax is a bit tricky to figure out at first, but once you have it, it’s about as easy as using ifelse(). This function is akin to SQL’s CASE WHEN. On large data sets, it consistently performs slower than ifelse().
  • if(){ }else{ }: Can be used with data frames, but is more commonly used for operations outside of data frames. An example would be only running a chunk of code if a certain condition is met (e.g., if the data frame has > 0 rows, run the next line of code).
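A quick sketch of that third pattern, guarding a chunk of code outside of a data frame (the data frame here is hypothetical):

```r
# Imagine dat came from a filter that might have returned 0 rows
dat <- data.frame(dbh = c(12.5, 44.3))

if (nrow(dat) > 0) {
  msg <- paste("Processing", nrow(dat), "rows")
} else {
  msg <- "No rows returned; skipping"
}
msg   # "Processing 2 rows"
```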
ifelse()

The ifelse() function takes 3 arguments organized like: ifelse(condition == TRUE, return this, return this instead). The first is the condition you’re testing. The second argument is what to return if the condition is met. The third is what to return if the condition is not met. You can also nest ifelse() to include more than 2 conditions, but it can quickly get out of hand and hard to follow (see below).

Let’s start by adding a column to the NETN tree data that uses the TreeStatusCode to create a new column called status that is either live or dead conditional on the abbreviated code in TreeStatusCode in trees_final.

Create status column conditioning on TreeStatusCode

# Check the levels of TreeStatusCode
sort(unique(trees_final$TreeStatusCode))
alive <- c("AB", "AL", "AS", "RS")
dead <- c("DB", "DM", "DS")

trees_final <- trees_final |> 
  mutate(status = ifelse(TreeStatusCode %in% alive, "live", "dead"))

# nested ifelse to make alive, dead, and recruit 
trees_final <- trees_final |> 
  mutate(status2 = ifelse(TreeStatusCode %in% dead, "dead",
                          ifelse(TreeStatusCode %in% "RS", "recruit", 
                                 "live")))
Remember that we used %in% instead of == because alive has multiple status codes. For TreeStatusCode %in% "RS" we could have used == instead, because there’s only one status code considered a recruit, and either gives the same result. That would not be true for matching the alive status codes: == compares element by element (with recycling) rather than testing membership in a vector.
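A quick sketch of the difference between %in% and ==, using a hypothetical codes vector:

```r
codes <- c("AL", "DB", "RS", "AS") # hypothetical status codes
alive <- c("AB", "AL", "AS", "RS")

codes %in% alive  # TRUE FALSE TRUE TRUE: tests membership for each element
codes == "RS"     # FALSE FALSE TRUE FALSE: fine for matching a single value
# codes == alive  # WRONG: compares element by element ("AL" vs "AB", "DB" vs "AL", ...)
```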


case_when()

The case_when() function allows you to have multiple conditions, each with its own return value. The syntax is a bit different than ifelse() to allow for the multiple conditions and returns. Using the same approach as above, we’ll create a third status column with case_when(). We’ll specify the output for when a tree status code is in the dead category, the recruit category, and the live category. We’ll then add a fourth output for status codes that don’t match any of the previous conditions and set that as ‘unknown’. Basically the TRUE just means: any records left are assigned ‘unknown’.

Note also the order of operations in case_when(). The function evaluates statements from the top down (i.e., dead trees first). Any record that matches a statement is then dropped from additional statements. The second statement only considers trees not already matched as dead. The third statement only considers trees not matched as dead or recruit. Then the fourth statement catches any trees not matched as dead, recruit, or live. That’s why “RS”, although included in the alive group, is assigned “recruit” rather than “live”: it was already matched by an earlier statement. Rather than relying on this function behavior, it’s better to not have overlapping categories (e.g., not include “RS” in alive). I include it here to demonstrate the point.

Create status column conditioning on TreeStatusCode

# Check the levels of TreeStatusCode
alive <- c("AB", "AL", "AS", "RS")
dead <- c("DB", "DM", "DS")

trees_final <- trees_final |> 
  mutate(status3 = case_when(TreeStatusCode %in% dead ~ 'dead',
                             TreeStatusCode %in% 'RS' ~ 'recruit',
                             TreeStatusCode %in% alive ~ 'live', 
                             TRUE ~ 'unknown'))

table(trees_final$status2, trees_final$status3) # check that the output is the same
View R output
##          
##           dead live recruit
##   dead      25    0       0
##   live       0  130       0
##   recruit    0    0       9


if(){ }else{ }

This style of if(){ }else{ } conditional, hereafter called if/else, is best used for operations outside of data frames, like turning code on or off based on specific conditions. I use if/else with ggplot (a graphing R package we’ll cover later) to turn certain features on or off based on a condition in the data or a condition I set. If/else statements are also helpful for error handling in your code. For example, if you want the code to print a warning when your data frame is empty (no rows), you can have an if/else statement that prints to the console. You can string together multiple conditions to test by adding else if(){ } statements.

Print warning in console that indicates if invasive species are found in wetland data

inv <- ACAD_wetland |> filter(Invasive == TRUE)

if(nrow(inv) > 0){print("Invasive species were detected in the data.")
  } else {print("No invasive species were detected in the data.")}
## [1] "Invasive species were detected in the data."

Force the else statement to print by filtering out invasive species before testing. I added an else if() statement just to show that syntax.

native_only <- ACAD_wetland |> filter(Invasive == FALSE) 
inv2 <- native_only |> filter(Invasive == TRUE)

if(nrow(inv2) > 0){print("Invasive species were detected in the data.")
  } else if(nrow(inv2) == 0){print("No invasive species were detected in the data.")
    } else {"Invasive species detections unclear"}
## [1] "No invasive species were detected in the data."


Test your skills with conditionals
CHALLENGE: Using the ACAD_wetland data, create a new column called Status that has “protected” for Protected = TRUE and “public” values for Protected = FALSE.
Answer
# read in wetland data if you don't already have it loaded.
ACAD_wetland <- read.csv("./data/ACAD_wetland_data_clean.csv")

# Base R using the with() function
ACAD_wetland$Status <- with(ACAD_wetland, ifelse(Protected == TRUE, "protected", "public"))
table(ACAD_wetland$Status, ACAD_wetland$Protected)
# Tidyverse
ACAD_wetland <- ACAD_wetland |> mutate(Status = ifelse(Protected == TRUE, "protected", "public"))
table(ACAD_wetland$Status, ACAD_wetland$Protected) # check your work
View R output
##            
##             FALSE TRUE
##   protected     0    9
##   public      499    0

CHALLENGE: Using the ACAD_wetland data, create a new column called abundance_cat that has levels High, Medium, Low, based on Ave_Cov, where “High” is >50%, “Medium” is 10-50%, and “Low” is < 10%.
Answer
# Base R using the with() function and nested ifelse()
ACAD_wetland$abundance_cat <- with(ACAD_wetland, ifelse(Ave_Cov < 10, "Low",
                                                        ifelse(Ave_Cov >= 10 & Ave_Cov <= 50, "Medium", "High")))
# Tidyverse using case_when() and between
ACAD_wetland <- ACAD_wetland |> mutate(abundance_cat = case_when(Ave_Cov < 10 ~ "Low",
                                                                 between(Ave_Cov, 10, 50) ~ "Medium", 
                                                                 TRUE ~ "High"))
table(ACAD_wetland$abundance_cat)
View R output
## 
##   High    Low Medium 
##      6    464     38

Note the use of the between() function that saves typing. This function matches as >= and <=.
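A quick check (a sketch with a made-up vector) that between() matches the inclusive comparison:

```r
library(dplyr)

x <- c(5, 10, 30, 50, 51) # hypothetical cover values
identical(between(x, 10, 50), x >= 10 & x <= 50) # TRUE: both ends are inclusive
```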



Summarizing with dplyr

Using group_by() and summarize()

Yesterday, we used functions like mean(), min(), and max() to summarize entire datasets. Now we’re going to use those same functions to summarize data by grouping variables, such as park, year, plot, etc. The process is similar to using Totals in Access or subtotals in Excel, although it is more flexible and efficient in R.

Difference between summarize() and mutate():
  • mutate() returns the same number of rows as the original data frame. This function also returns all of the columns that were in the original data frame.
  • summarize() returns the same number of rows as there are grouping levels in the original data frame. This function only returns the columns that were part of the group_by() and those that were created in the summarize() function.
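The difference in a tiny made-up data frame (df, group, and val are just for illustration):

```r
library(dplyr)

df <- data.frame(group = c("a", "a", "b"), val = c(1, 2, 10)) # hypothetical data

mut <- df |> group_by(group) |> mutate(group_mean = mean(val)) |> ungroup()
nrow(mut)  # 3: one row per original row, original columns kept

summ <- df |> group_by(group) |> summarize(group_mean = mean(val), .groups = 'drop')
nrow(summ) # 2: one row per group level, only grouping + summary columns
```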
Common functions used to summarize:
  • mean(): calculate the group means
  • min(): calculate the group minimums
  • max(): calculate the group maximums
  • sum(): calculate the group sums
  • sd(): calculate the group standard deviations
  • n(): tally the number of rows within each group

Count the number of trees per plot, year, and species using mutate().

Note the use of n(), which counts the number of rows within a group. Be careful here. If there are NAs in a group, they are counted by n(). Whether that’s okay or not depends on the data.
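If NAs shouldn't be counted, sum(!is.na()) is a common workaround. A sketch with a made-up vector:

```r
x <- c(10.2, NA, 44.3) # hypothetical DBH values with a missing measurement

length(x)      # 3: length() and n() count NAs
sum(!is.na(x)) # 2: counts only non-missing values
# inside summarize(), e.g.: num_trees = sum(!is.na(DBHcm))
```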

num_trees_mut <- trees_final |> 
  group_by(Plot_Name, SampleYear, Species) |> 
  mutate(num_trees = n()) |> 
  select(Plot_Name, SampleYear, Species, num_trees)

nrow(trees_final) #164
View R output
## [1] 164

nrow(num_trees_mut) #164
View R output
## [1] 164

head(num_trees_mut)
View R output
## # A tibble: 6 × 4
## # Groups:   Plot_Name, SampleYear, Species [3]
##   Plot_Name SampleYear Species           num_trees
##   <chr>          <int> <chr>                 <int>
## 1 MIMA-11         2025 Acer rubrum               6
## 2 MIMA-11         2025 Acer rubrum               6
## 3 MIMA-11         2025 Acer rubrum               6
## 4 MIMA-11         2025 Acer rubrum               6
## 5 MIMA-11         2025 Quercus palustris         1
## 6 MIMA-11         2025 Quercus bicolor           1

Note how the format of head(num_trees_mut) differs from the output for a data frame. This is how you know your dataset has been turned into a tibble.

Count the number of trees per plot, year, and species using summarize().

num_trees_sum <- trees_final |> 
  group_by(Plot_Name, SampleYear, Species) |> 
  summarize(num_trees = n()) 

nrow(trees_final) #164
View R output
## [1] 164

nrow(num_trees_sum) #41
View R output
## [1] 41

head(num_trees_sum)
View R output
## # A tibble: 6 × 4
## # Groups:   Plot_Name, SampleYear [2]
##   Plot_Name SampleYear Species                num_trees
##   <chr>          <int> <chr>                      <int>
## 1 MIMA-11         2025 Acer rubrum                    6
## 2 MIMA-11         2025 Fraxinus pennsylvanica         1
## 3 MIMA-11         2025 Quercus bicolor                1
## 4 MIMA-11         2025 Quercus palustris              1
## 5 MIMA-12         2025 Acer rubrum                    9
## 6 MIMA-12         2025 Fraxinus                       1

The group_by() + mutate() approach is helpful if you’re trying to standardize values within your group. But in most cases, the group_by() + summarize() approach, which collapses to the group level, is what you’re looking for.
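For example, here's a sketch (made-up data; plot and count are hypothetical names) of standardizing within groups, where group_by() + mutate() is the right tool because every row needs its group-level total:

```r
library(dplyr)

df <- data.frame(plot = c("p1", "p1", "p2"), count = c(2, 8, 5)) # hypothetical counts

shares <- df |> group_by(plot) |> 
  mutate(pct_of_plot = 100 * count / sum(count)) |> # sum(count) is computed per plot
  ungroup()
shares
```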

Note the warning that summarize() gave you in the console. The tidyverse is chatty. They have a lot of checks built into their functions based on how the developers think you should be using their functions. The warning with summarize() is particularly annoying. To turn the warning off, you can specify .groups = 'drop'. I’ll show that next.

There’s also a newer .by argument in summarize() that allows you to skip the group_by() step. It came along after I learned dplyr, so I often forget to use it, and most examples I see online use the original way; I include both approaches below. Interestingly, using .by returns an ungrouped data frame, whereas group_by() returns a tibble. The differences are small and not something to be too concerned about.

Summarize the average and standard error of tree DBH by plot and year

tree_dbh <- trees_final |> 
  group_by(Plot_Name, SampleYear) |> 
  summarize(mean_dbh = mean(DBHcm),
            num_trees = n(),
            se_dbh = sd(DBHcm)/sqrt(num_trees),
            .groups = 'drop') # prevents warning in console

tree_dbh2 <- trees_final |> 
  summarize(mean_dbh = mean(DBHcm),
            num_trees = n(),
            se_dbh = sd(DBHcm)/sqrt(num_trees),
           .by = c(Plot_Name, SampleYear))

tree_dbh == tree_dbh2 # tests that all the values in 1 data frame match the 2nd. 
View R output
##       Plot_Name SampleYear mean_dbh num_trees se_dbh
##  [1,]      TRUE       TRUE     TRUE      TRUE   TRUE
##  [2,]      TRUE       TRUE     TRUE      TRUE   TRUE
##  [3,]      TRUE       TRUE     TRUE      TRUE   TRUE
##  [4,]      TRUE       TRUE     TRUE      TRUE   TRUE
##  [5,]      TRUE       TRUE     TRUE      TRUE   TRUE
##  [6,]      TRUE       TRUE     TRUE      TRUE   TRUE
##  [7,]      TRUE       TRUE     TRUE      TRUE   TRUE
##  [8,]      TRUE       TRUE     TRUE      TRUE   TRUE
##  [9,]      TRUE       TRUE     TRUE      TRUE   TRUE
## [10,]      TRUE       TRUE     TRUE      TRUE   TRUE


Test your summarizing skills
CHALLENGE: Using the ACAD_wetland data, sum the percent cover of native vs. invasive species per plot (use the Ave_Cov column). Note that Invasive = TRUE is invasive and FALSE is native.
Answer
# Using group_by()
ACAD_inv <- ACAD_wetland |> group_by(Site_Name, Year, Invasive) |> 
  summarize(Pct_Cov = sum(Ave_Cov), 
            .groups = 'drop') |>  # optional line to keep console from being chatty
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_inv)
View R output
## # A tibble: 6 × 4
##   Site_Name  Year Invasive Pct_Cov
##   <chr>     <int> <lgl>      <dbl>
## 1 RAM-05     2012 FALSE     155.  
## 2 RAM-05     2017 FALSE     152.  
## 3 RAM-05     2017 TRUE        0.06
## 4 RAM-41     2012 FALSE      48.6 
## 5 RAM-41     2017 FALSE     107.  
## 6 RAM-41     2017 TRUE       10.2

# Using summarize(.by)
ACAD_inv2 <- ACAD_wetland |> 
  summarize(Pct_Cov = sum(Ave_Cov), .by = c(Site_Name, Year, Invasive)) |> 
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_inv2) # should be the same as ACAD_inv
View R output
##   Site_Name Year Invasive Pct_Cov
## 1    RAM-05 2012    FALSE  155.42
## 2    RAM-05 2017    FALSE  152.04
## 3    RAM-05 2017     TRUE    0.06
## 4    RAM-41 2017    FALSE  107.04
## 5    RAM-41 2012    FALSE   48.56
## 6    RAM-41 2017     TRUE   10.20

CHALLENGE: Using the ACAD_wetland data, count the number of native vs. invasive species per plot. Note that Invasive = TRUE is invasive and FALSE is native.
Answer
# Using group_by()
ACAD_spp <- ACAD_wetland |> group_by(Site_Name, Year, Invasive) |> 
  summarize(num_spp = n(), 
            .groups = 'drop') |>  # optional line to keep console from being chatty
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_spp)
View R output
## # A tibble: 6 × 4
##   Site_Name  Year Invasive num_spp
##   <chr>     <int> <lgl>      <int>
## 1 RAM-05     2012 FALSE         44
## 2 RAM-05     2017 FALSE         53
## 3 RAM-05     2017 TRUE           1
## 4 RAM-41     2012 FALSE         33
## 5 RAM-41     2017 FALSE         39
## 6 RAM-41     2017 TRUE           1

# Using summarize(.by)
ACAD_spp2 <- ACAD_wetland |> 
  summarize(num_spp = n(), .by = c(Site_Name, Year, Invasive)) |> 
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_spp2) # should be the same as ACAD_spp
View R output
##   Site_Name Year Invasive num_spp
## 1    RAM-05 2012    FALSE      44
## 2    RAM-05 2017    FALSE      53
## 3    RAM-05 2017     TRUE       1
## 4    RAM-41 2017    FALSE      39
## 5    RAM-41 2012    FALSE      33
## 6    RAM-41 2017     TRUE       1

CHALLENGE: Using ACAD_wetland data, calculate relative % cover of species within each site. This one is challenging!
Answer

Most efficient solution figured out during training

# using the .by within mutate (newer solution)
ACAD_wetland <- ACAD_wetland |> 
  mutate(Site_Cover = sum(Ave_Cov), 
         .by = c(Site_Name, Year)) |> 
  mutate(rel_cov = (Ave_Cov/Site_Cover)*100,
         .by = c(Site_Name, Year, Latin_Name, Common))

Original Solution: First sum site-level cover using mutate to return a value for every original row.

ACAD_wetland <- ACAD_wetland |> group_by(Site_Name, Year) |> 
  mutate(Site_Cover = sum(Ave_Cov)) |> 
  ungroup() # good practice to ungroup after grouping.

table(ACAD_wetland$Site_Name, ACAD_wetland$Site_Cover) # check that each site has a unique value.
View R output
##         
##          48.56 70.6 104.78 106.72 111.4 117.24 152.1 153.8 155.42 165.64 178.34
##   RAM-05     0    0      0      0     0      0    54     0     44      0      0
##   RAM-41    33    0      0      0     0     40     0     0      0      0      0
##   RAM-44     0    0     45      0     0      0     0     0      0      0     34
##   RAM-53     0    0      0      0     0      0     0     0      0      0      0
##   RAM-62     0    0      0      0     0      0     0    26      0     26      0
##   SEN-01     0   34      0      0     0      0     0     0      0      0      0
##   SEN-02     0    0      0     41     0      0     0     0      0      0      0
##   SEN-03     0    0      0      0    33      0     0     0      0      0      0
##         
##          188.84 196.52
##   RAM-05      0      0
##   RAM-41      0      0
##   RAM-44      0      0
##   RAM-53     48     50
##   RAM-62      0      0
##   SEN-01      0      0
##   SEN-02      0      0
##   SEN-03      0      0

head(ACAD_wetland)
View R output
## # A tibble: 6 × 15
##   Site_Name Site_Type Latin_Name Common  Year PctFreq Ave_Cov Invasive Protected
##   <chr>     <chr>     <chr>      <chr>  <int>   <int>   <dbl> <lgl>    <lgl>    
## 1 SEN-01    Sentinel  Acer rubr… red m…  2011       0    0.02 FALSE    FALSE    
## 2 SEN-01    Sentinel  Amelanchi… servi…  2011      20    0.02 FALSE    FALSE    
## 3 SEN-01    Sentinel  Andromeda… bog r…  2011      80    2.22 FALSE    FALSE    
## 4 SEN-01    Sentinel  Arethusa … drago…  2011      40    0.04 FALSE    TRUE     
## 5 SEN-01    Sentinel  Aronia me… black…  2011     100    2.64 FALSE    FALSE    
## 6 SEN-01    Sentinel  Carex exi… coast…  2011      60    6.6  FALSE    FALSE    
## # ℹ 6 more variables: X_Coord <dbl>, Y_Coord <dbl>, Status <chr>,
## #   abundance_cat <chr>, Site_Cover <dbl>, rel_cov <dbl>

Next calculate relative cover grouped on Site_Name, Year, Latin_Name, and Common

# Create new dataset because collapsing rows on grouping variables
# Using group_by() and summarize()
ACAD_wetland_relcov <- ACAD_wetland |> group_by(Site_Name, Year, Latin_Name, Common) |> 
  summarize(rel_cov = (Ave_Cov/Site_Cover)*100,
            .groups = 'drop') # 'drop' already ungroups, so a separate ungroup() isn't needed

Check that relative cover sums to 100% within each site

# Using summarize(.by = ) 
ACAD_wetland_relcov2 <- ACAD_wetland |> 
  summarize(rel_cov = Ave_Cov/Site_Cover, 
            .by = c("Site_Name", "Year", "Latin_Name", "Common"))

# Check that your relative cover sums to 100 for each site
relcov_check <- ACAD_wetland_relcov2 |> group_by(Site_Name, Year) |> 
  summarize(tot_relcov = sum(rel_cov)*100, .groups = 'drop')
table(relcov_check$tot_relcov) # they should all be 100
View R output
## 
## 100 
##  13


Data Viz Best Practices

The Power of (Good) Data Visualizations

Data are useful only when used. Data are used only when understood.

Consider three example data visualizations that demonstrate how some approaches are more effective than others in conveying patterns.

Example 1. Plots convey messages faster than tables

Most people can understand this figure of daily Covid cases faster than they can understand the table of daily Covid cases.

Figure 1. Average daily Covid cases per 100k people, by region (Sources: State and local health agencies [cases]; Census Bureau [population data])

Table 1. Daily Covid cases and population numbers by state (only showing first 7 records)
state timestamp cases total_population
AK 2022-01-25T04:00:00Z 203110 731545
AL 2022-01-25T04:00:00Z 1153149 4903185
AR 2022-01-25T04:00:00Z 738652 3017804
AZ 2022-01-25T04:00:00Z 1767303 7278717
CA 2022-01-25T04:00:00Z 7862003 39512223
CO 2022-01-25T04:00:00Z 1207991 5758736
CT 2022-01-25T04:00:00Z 683731 3565287


Example 2. Plots reveal patterns and highlight extremes

This table shows average monthly revenue for Acme products.

Table 2. Average monthly revenue (in $1000’s) from Acme product sales, 1950 - 2020
category product Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
party supplies balloons 892 1557 1320 972 1309 1174 1153 1138 1275 1178 1325 1422
party supplies confetti 1271 1311 829 1020 1233 1061 1088 1395 1376 1152 1568 1412
party supplies party hats 1338 1497 1445 956 1372 1482 1048 877 1404 1030 1458 1547
party supplies wrapping paper 1396 1026 932 891 1364 896 900 1221 1146 967 1394 1507
school supplies backpacks 1802 1773 1611 1723 1799 1730 1813 1676 1748 1652 1819 1759
school supplies notebooks 1153 1471 1541 1371 1592 1514 1725 1702 1457 1604 1729 1279
school supplies pencils 1679 1304 1054 1259 1425 1608 1972 1811 1610 1004 1417 1283
school supplies staplers 1074 1708 1439 1154 1551 1099 1793 1601 1647 1666 1389 1511

Use the table above to answer these questions:

  1. What product and month had the highest average monthly revenue?
  2. What product and month had the lowest average monthly revenue?

Now let’s display the same table as a heat map, with larger numbers represented by darker color cells. How quickly can we answer those same two questions? What patterns can we see in the heat map that were not obvious in the table above?

Figure 2. Heat map of average monthly revenue (in $1000’s) from Acme product sales, 1950 - 2020


Example 3. Plots provide insights that statistics obscure

In 1973, Francis Anscombe published “Graphs in statistical analysis”, a paper describing four bivariate datasets with identical means, variances, and correlations.

Table 3. Anscombe’s Quartet - Four bivariate datasets with identical summary statistics
x1 y1 x2 y2 x3 y3 x4 y4
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.39 19 12.50
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89
Table 4. Means and variances are identical in the four datasets. The correlation between x and y (r = 0.82) is also identical across the datasets.
x1 y1 x2 y2 x3 y3 x4 y4
mean 9 7.50 9 7.50 9 7.50 9 7.50
var 11 4.13 11 4.13 11 4.12 11 4.12


Anscombe data as plots: Despite their identical statistics, when we plot the data we see that the four datasets are actually very different. Anscombe’s point was that to understand the data, we must plot the data.
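R actually ships with this data as the built-in anscombe data frame (in the datasets package), so you can verify the statistics and draw the four panels yourself:

```r
# Columns are x1-x4 and y1-y4
round(sapply(anscombe, mean), 2)        # all x means = 9, all y means = 7.5
round(cor(anscombe$x1, anscombe$y1), 2) # 0.82, and the same for pairs 2-4

# Base R version of the four panels
op <- par(mfrow = c(2, 2))
for(i in 1:4){
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```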



Guidelines for Effective Data Visualizations Anscombe used clever thinking and simple plots to demonstrate the importance of data visualizations. But it’s not enough to just plot the data. To have impact, a plot must convey a clear message. How can we do this?
  • Have a purpose. Every data visualization should tell a story.
  • Consider your audience. Avoid scientific names, acronyms, or jargon unless your audience is well-versed in that language. Use color-blind friendly colors.
  • Use an appropriate visualization. For example:
    • Line graphs work well for showing changes in continuous data over time
    • Bar charts compare counts or proportions in categorical data (pie charts get a bad rap in the data viz world, but can be useful in certain situations)
    • For statistics (e.g., means) with confidence intervals, point plots with error bars are preferred over bar charts (nice explanation here)
    • Scatterplots are useful for showing the relationship (correlation) between two continuous variables.
    • Matrix heat maps can efficiently compare the magnitude of numbers when we have lots of data structured in table format, especially when colors have a clear connection to the numbers (e.g., scorecard data)
    • Box plots, violin plots, and histograms show distributions and outliers for continuous data. Dot plots are a useful alternative when sample sizes are small.
  • Keep it simple:
    • Every plot element and aesthetic should have a purpose.
    • Avoid 3D charts unless you have good reason otherwise.
    • Don’t try to cram everything into one plot (e.g., juxtapose two plots instead of adding a secondary y-axis. Nice explanation here).
  • Use appropriate font size for axis labels. The audience should not need a magnifying glass to read the axis labels on your figure. This is particularly true for powerpoint slides. Either make the font bigger in R, or manually fix that in powerpoint.
  • Figures should not require 5 minutes of explaining before people can understand them. If that’s the case for your figure, think about how you can simplify or more clearly convey information.
  • Use informative text and arrows wisely. Clear, meaningful titles, subtitles, axes titles/labels, and annotations help convey the message of a plot. Use lines and arrows (sparingly but effectively) to emphasize important thresholds, data points or other plot features. Fonts should be large enough with good contrast (against the background) and sufficient white space to be easily readable.


Intro to ggplot2

Intro to ggplot2

The ggplot2 package is the most popular R package for plotting. It takes a little effort to learn how the pieces of a ggplot object fit together. However, once you get the hang of it, you can create and customize a large variety of attractive plots with just a few lines of R code. The package is called ggplot2 because there was originally a ggplot package. The developer, Hadley Wickham, didn’t want to break code that relied on the original package while redesigning it, so he released the rewrite as ggplot2.

The ggplot2 online book and cheat sheets can be very helpful while you are learning to use the ggplot2 package.

The ggplot2 package was developed using the grammar of graphics as the underlying philosophy, which basically breaks a plot up into individual building blocks related to aesthetics (e.g., color, size, shape), geometries (e.g. points, lines, boxes), and themes (e.g. axis label font size, legend placement, etc.).

Important concepts with ggplot2:
  • ggplot(data, aes()): Every ggplot object starts with this line, which tells R the data you’re plotting, and which variables you’re plotting where in aes() argument.
  • data: every plot requires data. The first argument of every ggplot() call is the data. This also means you can pipe data into a ggplot object.
  • aes: Short for aesthetics. This is where you tell ggplot what your x and y variables are. If you want aesthetics, like color, fill, or size to vary by the data (e.g., color code a figure by park, use a dashed vs. solid line to distinguish between significant/non-significant trend, etc.), those variables are specified within the aes() argument either at the ggplot() level, or within the specific geom.
  • geom: Short for geometries, geoms represent what you see in the plot, such as the points in a scatter plot, the lines in a trend plot, or the boxes in a boxplot.
  • scale: Scales go hand in hand with aes(). If you specify aes(color = park), then scale is where you can specify a custom color for each park instead of ggplot’s default color scheme. The scale is where you can set the labels of groups in a legend (if different than how the data are labeled). You can also customize axis ranges, breaks, and labels with different scales.
  • theme: This is where you can change the format of the figure, such as removing the gridlines in the default ggplot format, making axis labels larger, changing the position of the legend, etc.
  • facet: Faceting the data allows you to graph multiple plots based on a grouping variable (e.g., site, species, year, etc.) in the same ggplot object.
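Here's how those building blocks map onto code, using a small made-up data frame (dat, year, and visits are hypothetical names for illustration):

```r
library(ggplot2)

dat <- data.frame(year = 2000:2004,
                  visits = c(2.1, 2.3, 2.2, 2.6, 3.0),
                  park = "ACAD")

p <- ggplot(dat, aes(x = year, y = visits, color = park)) + # data + aesthetics
  geom_line() + geom_point() +                              # geometries
  scale_color_manual(values = c(ACAD = "forestgreen")) +    # scale tied to the color aes
  facet_wrap(~park) +                                       # facet by a grouping variable
  theme_minimal()                                           # theme
p
```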


Plot Park Visitation in Acadia NP

We will build our first ggplot object step-by-step to demonstrate how each component contributes to the final plot. Our first ggplot object will create a line graph of trends in visits to Acadia National Park from 1994 to 2024.

Step 1. Import the visitation data and load ggplot2

library(ggplot2)
library(dplyr) # for filter 
visits <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_annual_visits.csv")

Step 2. Look over the data for any potential problems.

# Examine the data to understand data structure, data types, and potential problems
head(visits) 
summary(visits)
table(visits$Year) 
table(complete.cases(visits))
str(visits)
## 'data.frame':    31 obs. of  3 variables:
##  $ Park         : chr  "ACAD" "ACAD" "ACAD" "ACAD" ...
##  $ Year         : int  1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 ...
##  $ Annual_Visits: chr  "2,710,749" "2,845,378" "2,704,831" "2,760,306" ...

Step 3. Fix the annual visitation variable

Note that Annual_Visits is treated as a character because of the “,” thousands separators. We have to fix that before we can plot the data. There are multiple ways to do this. I’m going to use the gsub() function, which behaves like gsub([pattern to find], [replacement pattern], [column to search]). Replacing “,” with the empty string “” removes the separators. Then as.numeric() converts the character column to numeric. As long as you don’t get the “NAs introduced by coercion” warning in the console, all rows in the column of interest were successfully converted to numbers.

# Base R
visits$Annual_Visits <- as.numeric(gsub(",", "", visits$Annual_Visits))

# Tidyverse
library(dplyr) # load package first
visits <- visits |> mutate(Annual_Visits = as.numeric(gsub(",", "", Annual_Visits)))
str(visits) #check that it worked
View R output
## 'data.frame':    31 obs. of  3 variables:
##  $ Park         : chr  "ACAD" "ACAD" "ACAD" "ACAD" ...
##  $ Year         : int  1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 ...
##  $ Annual_Visits: num  2710749 2845378 2704831 2760306 2594497 ...

Step 4. Create the ggplot template of annual visitation per 1000 visitors over time

p <- ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000))
p

Step 5. Add line and point geometry, starting with default, and ending with customized symbols

The plot below customizes the shape, color (outline), fill, and size of the points, but all points use the same symbol. Note the hex code used for the fill color. This is a 6-digit code that gives maximum flexibility for selecting colors. I often use HTML color charts to find colors and their associated hex codes.

Note that I usually specify the geom_line() before the geom_point(), because the geoms are drawn in the order they’re specified. Specifying the opposite order will make the lines cross over the points and doesn’t look as nice. I also made the linewidth a bit thicker than default.

p1a <- p + geom_line() + geom_point() # default color and shape to points
p1a

p1 <- p + 
  geom_line(linewidth = 0.6) + 
  geom_point(color = "black", fill = "#82C2a3", size = 2.5, shape = 24)
p1

Step 6. Fine tune the Y and X axis breaks

p2 <- p1 + scale_y_continuous(name = "Annual visitors in 1000's",
                              limits = c(2000, 4500),
                              breaks = seq(2000, 4500, by = 500)) + # label at 2000, 2500, ... up to 4500
           scale_x_continuous(limits = c(1994, 2024),
                              breaks = c(seq(1994, 2024, by = 5))) # label at 1994, 1999, ... up to 2024
p2

Step 7. Change the labels

p3 <- p2 + labs(x = "Year", 
                title = "Annual visitation/1000 people in Acadia NP 1994 - 2024")
p3


Note how the axis labels can be specified either in the scale (step 6) or in the labs() function.

Step 8. Modify the theme using built-in themes

There are built-in themes that change the default formatting of the plot. Here I show theme_bw(), but there are many options. Play around with the other themes to get a feel for the options: theme_linedraw(), theme_light(), theme_dark(), theme_minimal(), theme_classic(), theme_void(). The two I use most often are theme_bw() and theme_classic(). There’s also a package called ggthemes that you can install for even more themes. You can also create your own themes.

p4 <- p3 + theme_bw() 
p4

Or change theme elements manually

Note that ?theme takes you to the help that shows all the options.

p4b <- p3 + theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1), # enlarge and angle x axis text
                  panel.grid.major = element_blank(), # turns off major gridlines
                  panel.grid.minor = element_blank(), # turns off minor gridlines
                  panel.background = element_rect(fill = 'white', color = 'dimgrey'), # panel white w/ grey border
                  plot.margin = margin(2, 3, 2, 3), # increase white margin around plot 
                  title = element_text(size = 10) # reduce title size 
                  )

p4b


Note the order of margins in ggplot is Top, Right, Bottom, Left (T-R-B-L), which sounds like trouble.

Putting it all together

ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000)) +
  geom_line() + 
  geom_point(color = "black", fill = "#82C2a3", size = 2.5, shape = 24) +
  labs(x = "Year", title = "Annual visitation/1000 people in Acadia NP 1994 - 2024") +
  scale_y_continuous(name = "Annual visitors in 1000's",
                     limits = c(2000, 4500),
                     breaks = seq(2000, 4500, by = 500)) + 
  scale_x_continuous(limits = c(1994, 2024),
                     breaks = c(seq(1994, 2024, by = 5))) + 
  theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1), # make x axis text bigger and angle
        panel.grid.major = element_blank(), # turns off major grids
        panel.grid.minor = element_blank(), # turns off minor grids
        panel.background = element_rect(fill = 'white', color = 'dimgrey'), # make panel white w/ grey border
        plot.margin = margin(2, 3, 2, 3), # increase white margin around plot 
        title = element_text(size = 10) # reduce title size 
        )

CHALLENGE: Recreate the plot below (or customize your own plot). Note that the fill color is "#0080FF", the shape is 21, and the theme is classic. Use linewidth = 0.75 and linetype = 'dashed'.

Answer
ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000)) +
  geom_line(linewidth = 0.75, linetype = 'dashed') + 
  geom_point(color = "black", fill = "#0080FF", size = 2.5, shape = 21) +
  labs(x = "Year", y = "Annual visits in 1,000s") +
  scale_y_continuous(limits = c(2000, 4500),
                     breaks = seq(2000, 4500, by = 500)) + 
  scale_x_continuous(limits = c(1994, 2024),
                     breaks = c(seq(1994, 2024, by = 5))) + 
  theme_classic()


Other geometries

So far, the data we’ve plotted doesn’t include any kind of variance. In many cases, we do want to show the distribution of data among years, parks or other categories. That’s where plots like bar plots with error bars and boxplots can be handy.

First we’ll look at how to make barplots with error bars. The visitation data only has one value per year, so we’ll use a water chemistry dataset instead, filtering it to only include Jordan Pond and Dissolved Oxygen.

Prep the data

Load the data and packages

library(dplyr)
library(ggplot2)
chem <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_water_chemistry_data.csv")

Filter on Jordan Pond and Dissolved Oxygen and extract the month from the date in a VERY hacky way.

We’ll learn about working with dates in a more formal way on Day 3.

jordDO <- chem |> 
  filter(SiteCode == "ACJORD") |> 
  filter(Parameter == "DO_mgL") |> 
  mutate(month = as.numeric(gsub("/", "", substr(EventDate, 1, 2))))
head(jordDO)
unique(jordDO$SiteCode) # check filter worked
unique(jordDO$Parameter) # check filter worked

Let’s assume the data are normally distributed, so that summarizing with a mean and standard error is appropriate. The code below calculates the mean and standard error of DO by month, so we can plot error bars later.

Calculate standard error of DO by month

jordDO_sum <- jordDO |> group_by(month) |> 
  summarize(mean_DO = mean(Value),
            num_meas = n(),
            se_DO = sd(Value)/sqrt(num_meas))
jordDO_sum
View R output
## # A tibble: 6 × 4
##   month mean_DO num_meas  se_DO
##   <dbl>   <dbl>    <int>  <dbl>
## 1     5   11.1        17 0.168 
## 2     6    9.54       19 0.0908
## 3     7    8.72       19 0.0445
## 4     8    8.73       19 0.0570
## 5     9    9.46       19 0.0902
## 6    10   10.3        19 0.0837

Bar Plots

Bar charts with error bars are a common way to show mean and variance. Bar chart y-axes should always start at 0, which doesn’t always let you see patterns in the data very well. Boxplots, which we’ll show next, are often a better approach in that case.

A couple of notes on the code below:
  • For bar charts, if you want the value (mean_DO) to be the bar height, use geom_col(). If instead, you want the bar chart to be proportional to the number of cases in each group, use geom_bar(stat = 'count').
  • The width = 0.75 makes the bars narrower, allowing some white space between them.
  • Setting x to NULL in the labs() means the x axis won’t include a label.
  • This isn’t the best use of bar plots, but you get the idea on how to use them.
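
To see that distinction in action, here’s a sketch with a small made-up data frame (not part of the training data), assuming ggplot2 is loaded:

```r
library(ggplot2)

# made-up data: three rows for group a, one row for group b
df <- data.frame(group = c("a", "a", "a", "b"), value = c(2, 3, 1, 5))

# geom_bar(stat = 'count') ignores value: bar heights are the row counts (3 and 1)
p_count <- ggplot(df, aes(x = group)) + geom_bar(stat = "count")

# geom_col() uses the y aesthetic: the rows stack, so bar heights are 6 (2+3+1) and 5
p_col <- ggplot(df, aes(x = group, y = value)) + geom_col()
```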

Bar plot of Jordan Pond average DO with 95% CI error bars

ggplot(data = jordDO_sum, aes(x = month, y = mean_DO)) +
  geom_col(fill = "#74AAE3", color = "dimgrey", width = 0.75) +
  geom_errorbar(aes(ymin = mean_DO - 1.96*se_DO, ymax = mean_DO + 1.96*se_DO), 
                width = 0.75) +
  theme_bw() +
  labs(x = NULL, y = "Dissolved Oxygen mg/L") +
  scale_x_continuous(limits = c(4, 11),
                     breaks = c(seq(5, 10, by = 1)),
                     labels = c("May", "Jun", "Jul", 
                                "Aug", "Sep", "Oct"))
Boxplots

Boxplots are another way to show variance in responses. In most cases, the middle line of a boxplot is the median, and the lower and upper limits of the box represent the 25th and 75th percentiles. The whiskers extend to the most extreme data points that are within 1.5 times the interquartile range (the distance from the 25th to the 75th percentile) of the box, or to the min/max of the data, whichever comes first. Points beyond the whiskers are considered outlying points. In this case, I turned off the outlying points, because I plotted the actual data behind the boxplots as slightly transparent points. This is good practice for seeing how points are distributed.
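
The whisker rule can be checked by hand in base R with simulated data; boxplot.stats() implements it (ggplot’s version is nearly identical):

```r
set.seed(42)
x <- c(rnorm(50), 8) # 50 random normal values plus one extreme point

h <- fivenum(x)      # min, lower hinge (~25th pctile), median, upper hinge (~75th pctile), max
iqr <- h[4] - h[2]   # box height (interquartile range)

# upper whisker: the largest data point still within 1.5 * IQR above the box
upper_whisker <- max(x[x <= h[4] + 1.5 * iqr])

bs <- boxplot.stats(x)
bs$stats[5] == upper_whisker # TRUE: matches the whisker computed by hand
bs$out                       # 8 is flagged as an outlying point
```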

Boxplots of Jordan Pond DO

ggplot(data = jordDO, aes(x = month, y = Value, group = month)) +
  geom_boxplot(outliers = FALSE) + 
  geom_point(alpha = 0.2) +
  theme_bw() +
  labs(x = NULL, y = "Dissolved Oxygen mg/L") +
  scale_x_continuous(limits = c(4, 11),
                     breaks = c(seq(5, 10, by = 1)),
                     labels = c("May", "Jun", "Jul", 
                                "Aug", "Sep", "Oct"))


Day 3: Wrangling and Viz II

Day 3 Goals

Goals for Day 3:
Poor file management illustration (artwork by @allison_horst)

  1. More advanced data wrangling
    • Learn how to pivot data from long to wide and wide to long
    • Learn how to join tables and apply the different join types
    • Working with dates and times
  2. More advanced ggplot2:
    • Custom colors and shapes by grouping variables
    • Customizing axes
    • Combining plots with facets and patchwork
    • Working with legends
    • Color palettes
  3. Coding best practices
    • Commenting code
    • Putting packages, datasets, and parameters on top of script
    • Using consistent coding style
    • Logical object naming
    • Using projects instead of stand-alone scripts where possible
    • How to choose R packages


Feedback: Please leave feedback in the training feedback form. You can submit feedback multiple times and don’t need to answer every question. Responses are anonymous.


Pivoting Tables

Reshaping 101

Reshaping data from long to wide and wide to long is a common task with our data. Datasets are usually described as long or wide. The long form, which is the structure database tables often take, consists of each row being an observation and each column being a variable (i.e., tidy format). In summary tables, however, we often want to reshape the data to be wide so it’s easier to read.
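
As a minimal sketch of the two shapes with a made-up water sample (the pivoting functions themselves are covered below; this assumes tidyr is installed):

```r
library(tidyr)

# long form: one row per observation, one column per variable
long <- data.frame(site  = c("A", "A", "B", "B"),
                   param = c("DO", "pH", "DO", "pH"),
                   value = c(8.7, 6.5, 9.1, 6.8))

# wide form: one row per site, one column per parameter
wide <- pivot_wider(long, names_from = param, values_from = value)
wide # 2 rows (A, B) with columns site, DO, pH
```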

We’ll work with a fake bat capture dataset to see how this works. To get started, load the dataset and packages, as shown below.

library(dplyr)
library(tidyr) # for pivot_wider and pivot_longer
library(stringr) # for word()
bat_cap <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/example_bat_capture_data.csv")
head(bat_cap)
str(bat_cap)


Prep the data

The example dataset contains simplified capture data in long form. Every row represents an individual bat that was captured at a given site and date. We want to turn this into a wide data frame that has a column for every species, and the number of individuals of that species that were caught for each year and site combination.

Before we start, we’ll make a species code that will be easier to work with (R doesn’t like spaces). We’ll also summarize the data, so we have a count of the number of individuals of each species found per year.

Create sppcode

bat_cap <- bat_cap |> 
  mutate(genus = toupper(word(Latin, 1)), # capitalize and extract first word in Latin
         species = toupper(word(Latin, 2)), # capitalize and extract second word in Latin
         sppcode = paste0(substr(genus, 1, 3), # combine first 3 characters of genus and species
                            substr(species, 1, 3))) |> 
  select(-genus, -species) # drop temporary columns         

head(bat_cap)
View R output
##       Site Julian Year                  Latin              Common sppcode
## 1 site_001    195 2025          Myotis leibii        small-footed  MYOLEI
## 2 site_001    214 2025          Myotis leibii        small-footed  MYOLEI
## 3 site_001    237 2022          Myotis leibii        small-footed  MYOLEI
## 4 site_001    230 2021       Myotis lucifugus        little brown  MYOLUC
## 5 site_001    201 2022      Lasiurus cinereus               hoary  LASCIN
## 6 site_001    230 2020 Myotis septentrionalis northern long-eared  MYOSEP

Summarize # individuals per species, site and year

bat_sum <- bat_cap |> 
  summarize(num_indiv = sum(!is.na(sppcode)), # I prefer this over n()
            .by = c("Site", "Year", "sppcode")) |> 
  arrange(Site, Year, sppcode) # helpful for ordering the future wide columns

Note that I’m using a trick in R with logicals to calculate num_indiv. Logical values (TRUE/FALSE) are treated as 1/0 under the hood in R. Remember that ! is interpreted in R as “not”, so !is.na() reads as “not blank”. Every row in sppcode is checked to see whether it’s blank. If it’s not blank, !is.na() returns TRUE, which is treated as 1. The sum() function then adds up all of the 1s. This is a way to count rows that meet a certain condition, and in my opinion it’s safer than using n().
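
Here’s the trick on its own with a small made-up vector (base R only):

```r
x <- c("MYOLEI", NA, "LASCIN", NA, "MYOSEP")

!is.na(x)      # TRUE FALSE TRUE FALSE TRUE
sum(!is.na(x)) # 3: each TRUE counts as 1, each FALSE as 0

# the same idea counts rows meeting any condition
vals <- c(2, 5, 8, 1)
sum(vals > 4)  # 2 values are greater than 4
```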


Pivot from long to wide

Now that we have bat_sum, we’re going to pivot the data wide so that each species is a separate column and the values in each cell are the number of individuals captured. The code below is pretty straightforward: names_from is the column whose values become the new column names, and values_from is the column whose values fill the cells.

Pivot bat summary data to wide

bat_wide <- bat_sum |> pivot_wider(names_from = sppcode, values_from = num_indiv)
head(bat_wide)
View R output
## # A tibble: 6 × 6
##   Site      Year LASCIN MYOLEI MYOSEP MYOLUC
##   <chr>    <int>  <int>  <int>  <int>  <int>
## 1 site_001  2019      1     NA     NA     NA
## 2 site_001  2020     NA      1      1     NA
## 3 site_001  2021     NA      1     NA      1
## 4 site_001  2022      1      2     NA      1
## 5 site_001  2023     NA      1     NA     NA
## 6 site_001  2024     NA     NA     NA      2

That was pretty simple. But there are a lot of blanks where a species wasn’t caught in a given year and site. We can use the values_fill argument to fill those blanks with 0s.

Pivot bat summary data to wide filling blanks as 0

bat_wide <- bat_sum |> pivot_wider(names_from = sppcode, 
                                   values_from = num_indiv, 
                                   values_fill = 0)
head(bat_wide)
View R output
## # A tibble: 6 × 6
##   Site      Year LASCIN MYOLEI MYOSEP MYOLUC
##   <chr>    <int>  <int>  <int>  <int>  <int>
## 1 site_001  2019      1      0      0      0
## 2 site_001  2020      0      1      1      0
## 3 site_001  2021      0      1      0      1
## 4 site_001  2022      1      2      0      1
## 5 site_001  2023      0      1      0      0
## 6 site_001  2024      0      0      0      2

table(complete.cases(bat_wide)) # all true; no blanks
View R output
## 
## TRUE 
##   29

Now we see that every cell has a value. Another useful argument in pivot_wider() is names_prefix. That allows you to add a string before the column names that are generated in the pivot. This is helpful if you’re pivoting on a number column, like year or plot number. R doesn’t like column names that start with a number. The names_prefix is a quick way to fix that. I’ll just show it with the bat capture data as an example, even though it wasn’t needed.

bat_wide2 <- bat_sum |> pivot_wider(names_from = sppcode, 
                                    values_from = num_indiv, 
                                    values_fill = 0, 
                                    names_prefix = "spp_")
head(bat_wide2)
View R output
## # A tibble: 6 × 6
##   Site      Year spp_LASCIN spp_MYOLEI spp_MYOSEP spp_MYOLUC
##   <chr>    <int>      <int>      <int>      <int>      <int>
## 1 site_001  2019          1          0          0          0
## 2 site_001  2020          0          1          1          0
## 3 site_001  2021          0          1          0          1
## 4 site_001  2022          1          2          0          1
## 5 site_001  2023          0          1          0          0
## 6 site_001  2024          0          0          0          2

CHALLENGE: Pivot the bat_sum data frame on year instead of species, so that you have a column for every year of captures. Remember to avoid column names starting with a number.

Answer
bat_wide_yr <- pivot_wider(bat_sum, 
                           names_from = Year, 
                           values_from = num_indiv, 
                           values_fill = 0, 
                           names_prefix = "yr")
head(bat_wide_yr)
## # A tibble: 6 × 9
##   Site     sppcode yr2019 yr2020 yr2021 yr2022 yr2023 yr2024 yr2025
##   <chr>    <chr>    <int>  <int>  <int>  <int>  <int>  <int>  <int>
## 1 site_001 LASCIN       1      0      0      1      0      0      0
## 2 site_001 MYOLEI       0      1      1      2      1      0      2
## 3 site_001 MYOSEP       0      1      0      0      0      0      0
## 4 site_001 MYOLUC       0      0      1      1      0      2      0
## 5 site_002 LASCIN       1      0      1      0      0      0      1
## 6 site_002 MYOLEI       1      2      1      0      0      2      0


Pivot wide to long

We can reshape the capture data back to long, which will give us a similar data frame as before, except now the 0s are included. For the pivot_longer() function, you have to tell it which columns to pivot on. If you don’t specify, it will turn the entire dataset into 2 long columns, which you typically don’t want. Here I tell R not to pivot on the Site and Year columns, because I know they’re in the data frame and unlikely to change. If I instead specified the sppcodes to pivot on, then if a new species were found in the next year of sampling, I’d have to update this code to include that new species.

bat_long <- bat_wide |> pivot_longer(cols = -c(Site, Year), 
                                     names_to = "sppcode", 
                                     values_to = "num_indiv")
head(bat_long)
View R output
## # A tibble: 6 × 4
##   Site      Year sppcode num_indiv
##   <chr>    <int> <chr>       <int>
## 1 site_001  2019 LASCIN          1
## 2 site_001  2019 MYOLEI          0
## 3 site_001  2019 MYOSEP          0
## 4 site_001  2019 MYOLUC          0
## 5 site_001  2020 LASCIN          0
## 6 site_001  2020 MYOLEI          1

CHALLENGE: Pivot the resulting data frame from the previous question to long on the years columns, and remove the “yr” from the year names using names_prefix = 'yr'.

Answer
bat_long_yr <- pivot_longer(bat_wide_yr, 
                            cols = -c(Site, sppcode),
                            names_to = "Year", 
                            values_to = "num_indiv", 
                            names_prefix = "yr") # drops this string from values


Joining Tables

Joining tables 101

We often need to combine data from separate tables in our work (e.g., relational database tables). In R we do this using either the merge() function in base R or the *_join() functions in dplyr. Because I find the dplyr join functions to be more intuitive and to perform faster than base R’s merge(), I’m going to show how to use dplyr. If you understand the basic concepts of the join functions, you can figure out how to merge in base R.

Joining tables requires that the two datasets to join have at least one column in common, which is referred to as the key. The key is used to match records. The join type will determine whether all rows from both datasets are returned, or if only a portion are returned based on values in either or both of the two datasets. We think of joining as consisting of a left and right dataset. The left is specified first in the function argument, and the right is specified second. It generally doesn’t matter which you make left or right, just that you know which is left or right. In general, I put site-level datasets on the left, and sample event datasets on the right.
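
A small made-up example of a key shared between a left (site-level) and right (sample event) dataset, assuming dplyr is loaded:

```r
library(dplyr)

# both tables share the key column "Site"
sites  <- data.frame(Site = c("A", "B", "C"), Elev_m = c(100, 250, 75))
events <- data.frame(Site = c("A", "A", "C", "D"),
                     Year = c(2023, 2024, 2024, 2024))

joined <- left_join(sites, events, by = "Site")
joined # site A matches twice, B returns NA for Year, and D (right only) is dropped
```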


Types of joins
  • Full Join: keeps all observations that appear in either the left or right dataset. Any key value in the left dataset that is not found in the right dataset will return NAs for columns coming from the right data frame, and vice versa. The full join treats both the left and right datasets equally.
    full join diagram (Figure from R for Data Science)

  • Inner Join: keeps only observations that are matched in both the left and right dataset. Any key value in the left dataset that is not found in the right dataset will be dropped, and vice versa. The inner join treats the left and right datasets equally.
    inner join diagram (Figure from R for Data Science)

  • Left Join: keeps all observations that appear in the left dataset (first specified) and only those matched in the right dataset. Any key value in the left dataset that is not found in the right dataset will return NAs for columns coming from the right dataset. Any key value in the right dataset not found in the left data frame will be dropped.
    left join diagram (Figure from R for Data Science)

  • Right Join: keeps all observations that appear in the right dataset (second specified) and only those matched in the left dataset. Any key value in the right dataset that is not found in the left dataset will return NAs for columns coming from the left dataset. Any key value in the left dataset not found in the right dataset will be dropped.
    right join diagram (Figure from R for Data Science)
  • Anti Join: returns records in the left dataset that are not found in the right dataset. This only works in one direction, so to find all records not in common, you have to run the anti join twice, swapping which dataset is the left and which is the right. I use anti joins when I’m checking whether two datasets have the same sites or years represented.
    anti join diagram (Figure from R for Data Science)


Joins in practice

To demonstrate the different joins, we’ll join the bat_wide capture data frame we just created with a dataset that includes more information about the bat capture sites.

Read in bat site data

bat_sites <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/bat_site_info.csv")
sort(unique(bat_sites$Site)) # Sites 1, 2, 3, 4, 5
## [1] "site_001" "site_002" "site_003" "site_004" "site_005"
sort(unique(bat_wide$Site)) # Sites 1, 2, 3, 5, 6
## [1] "site_001" "site_002" "site_003" "site_005" "site_006"

The key in the two bat datasets is the “Site” column. In the bat_sites data frame, there are 5 unique sites, numbered 1:5. In the bat_wide data there are 5 unique sites, numbered 1, 2, 3, 5, 6. Therefore site_004 is only found in bat_sites and site_006 is only found in bat_wide.

Full join

bat_full <- full_join(bat_sites, bat_wide, by = "Site")
table(bat_full$Site)
## 
## site_001 site_002 site_003 site_004 site_005 site_006 
##        7        7        7        1        7        1
View R output
Site Unit X Y SiteName Year LASCIN MYOLEI MYOSEP MYOLUC
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2019 1 0 0 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2020 0 1 1 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2021 0 1 0 1
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2022 1 2 0 1
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2023 0 1 0 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2024 0 0 0 2
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2025 0 2 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2019 1 1 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2020 0 2 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2021 1 1 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2022 0 0 0 1
site_002 Schoodic 574712 4909721 SERC Campus 2023 0 0 1 0
site_002 Schoodic 574712 4909721 SERC Campus 2024 0 2 0 1
site_002 Schoodic 574712 4909721 SERC Campus 2025 1 0 1 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2019 0 1 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2020 0 1 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2021 0 3 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2022 0 2 0 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2023 0 1 0 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2024 0 2 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2025 0 1 0 1
site_004 Mount Desert Island 549931 4903409 Western Mtns NA NA NA NA NA
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2019 0 1 1 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2020 1 0 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2021 0 1 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2022 0 0 1 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2023 1 0 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2024 0 0 0 2
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2025 0 1 0 0
site_006 NA NA NA NA 2025 0 0 1 0

Note how site_004, which was not in the bat_wide capture data but was in the bat_sites data, is included with NAs for the columns that came from the bat_wide data. Additionally, site_006, which was only in the bat_wide capture data and not in the bat_sites data, has NAs for the columns that came from the bat_sites data.

Inner join

bat_inner <- inner_join(bat_sites, bat_wide, by = "Site")
table(bat_inner$Site)
## 
## site_001 site_002 site_003 site_005 
##        7        7        7        7
View R output
Site Unit X Y SiteName Year LASCIN MYOLEI MYOSEP MYOLUC
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2019 1 0 0 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2020 0 1 1 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2021 0 1 0 1
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2022 1 2 0 1
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2023 0 1 0 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2024 0 0 0 2
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2025 0 2 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2019 1 1 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2020 0 2 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2021 1 1 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2022 0 0 0 1
site_002 Schoodic 574712 4909721 SERC Campus 2023 0 0 1 0
site_002 Schoodic 574712 4909721 SERC Campus 2024 0 2 0 1
site_002 Schoodic 574712 4909721 SERC Campus 2025 1 0 1 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2019 0 1 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2020 0 1 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2021 0 3 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2022 0 2 0 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2023 0 1 0 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2024 0 2 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2025 0 1 0 1
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2019 0 1 1 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2020 1 0 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2021 0 1 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2022 0 0 1 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2023 1 0 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2024 0 0 0 2
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2025 0 1 0 0

The inner join only returns records that have a site in common between the two datasets. Therefore, site_004 in the bat_sites data and site_006 in the bat_wide capture data were dropped.

Left join

bat_left <- left_join(bat_sites, bat_wide, by = "Site")
table(bat_left$Site)
## 
## site_001 site_002 site_003 site_004 site_005 
##        7        7        7        1        7
View R output
Site Unit X Y SiteName Year LASCIN MYOLEI MYOSEP MYOLUC
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2019 1 0 0 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2020 0 1 1 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2021 0 1 0 1
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2022 1 2 0 1
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2023 0 1 0 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2024 0 0 0 2
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2025 0 2 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2019 1 1 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2020 0 2 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2021 1 1 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2022 0 0 0 1
site_002 Schoodic 574712 4909721 SERC Campus 2023 0 0 1 0
site_002 Schoodic 574712 4909721 SERC Campus 2024 0 2 0 1
site_002 Schoodic 574712 4909721 SERC Campus 2025 1 0 1 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2019 0 1 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2020 0 1 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2021 0 3 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2022 0 2 0 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2023 0 1 0 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2024 0 2 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2025 0 1 0 1
site_004 Mount Desert Island 549931 4903409 Western Mtns NA NA NA NA NA
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2019 0 1 1 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2020 1 0 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2021 0 1 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2022 0 0 1 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2023 1 0 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2024 0 0 0 2
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2025 0 1 0 0

The left join takes every row in the left data, bat_sites, and only the rows in the right data, bat_wide, that have a matching site. Note how site_004, which is only in bat_sites, is included with NAs for the bat_wide columns that didn’t have a match. Site_006, which was only in the bat_wide data, was dropped.

Coding tip: I use left joins more than any other join, because I’m usually joining tables that have a 1-to-many relationship, where the left dataset has 1 row for 1 or more rows in the right dataset. For example, say I have a dataset that only includes data for plots where an invasive species was detected, and I want summary statistics that are based on the full number of plots. Using a left join, where the left dataset is a table of all of the plots and the right dataset is the invasive detections, will return the full set of plots to calculate summary statistics from. You may also have to fill the NAs introduced by the join with 0s before generating summary statistics, which should be done thoughtfully.
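
A hypothetical sketch of that workflow (the plot and invasive tables here are made up), assuming dplyr is loaded:

```r
library(dplyr)

# all monitored plots (left) vs. only the plots where invasives were detected (right)
plots     <- data.frame(Plot = 1:5)
invasives <- data.frame(Plot = c(2, 4), num_invasive = c(3, 1))

plot_sum <- left_join(plots, invasives, by = "Plot") |>
  mutate(num_invasive = ifelse(is.na(num_invasive), 0, num_invasive)) # fill non-detections as 0

mean(plot_sum$num_invasive) # 0.8: averaged over all 5 plots, not just the 2 detections
```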

Right join

bat_right <- right_join(bat_sites, bat_wide, by = "Site")
table(bat_right$Site)
## 
## site_001 site_002 site_003 site_005 site_006 
##        7        7        7        7        1
View R output
Site Unit X Y SiteName Year LASCIN MYOLEI MYOSEP MYOLUC
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2019 1 0 0 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2020 0 1 1 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2021 0 1 0 1
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2022 1 2 0 1
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2023 0 1 0 0
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2024 0 0 0 2
site_001 Mount Desert Island 559205 4907461 Jordan Pond 2025 0 2 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2019 1 1 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2020 0 2 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2021 1 1 0 0
site_002 Schoodic 574712 4909721 SERC Campus 2022 0 0 0 1
site_002 Schoodic 574712 4909721 SERC Campus 2023 0 0 1 0
site_002 Schoodic 574712 4909721 SERC Campus 2024 0 2 0 1
site_002 Schoodic 574712 4909721 SERC Campus 2025 1 0 1 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2019 0 1 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2020 0 1 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2021 0 3 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2022 0 2 0 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2023 0 1 0 0
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2024 0 2 0 1
site_003 Mount Desert Island 554607 4895800 Bass Harbor 2025 0 1 0 1
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2019 0 1 1 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2020 1 0 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2021 0 1 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2022 0 0 1 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2023 1 0 0 0
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2024 0 0 0 2
site_005 Mount Desert Island 563101 4912371 Sieur de Monts 2025 0 1 0 0
site_006 NA NA NA NA 2025 0 0 1 0

The right join takes every row in the right data, bat_wide, and only the rows in the left data, bat_sites, that have a matching site. Note how site_006, which is only in bat_wide, is included with NAs for the bat_sites columns that didn’t have a match. Site_004, which was only in the bat_sites data, was dropped.

Anti join to find sites not in bat_wide

anti_join(bat_sites, bat_wide, by = "Site")
##       Site                Unit      X       Y     SiteName
## 1 site_004 Mount Desert Island 549931 4903409 Western Mtns

Anti join to find sites not in bat_sites

anti_join(bat_wide, bat_sites, by = "Site")
## # A tibble: 1 × 6
##   Site      Year LASCIN MYOLEI MYOSEP MYOLUC
##   <chr>    <int>  <int>  <int>  <int>  <int>
## 1 site_006  2025      0      0      1      0


Test your skills!
CHALLENGE: Join the NETN_tree_data.csv and NETN_tree_species_table.csv to connect the common name to the tree data.

Import tree and species tables

spp_tbl <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_species_table.csv")
trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")
Answer
# find the columns in common
intersect(names(spp_tbl), names(trees)) # TSN and ScientificName
View R output
## [1] "TSN"            "ScientificName"

# left join species to trees, because we don't want to include species not found in the tree data
trees_spp <- left_join(trees, 
                       spp_tbl |> select(TSN, ScientificName, CommonName), 
                       by = c("TSN", "ScientificName"))

head(trees_spp)
View R output
##   ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode    TSN ScientificName
## 1     MIMA       12  6/16/2025  FALSE       2025      13 183385  Pinus strobus
## 2     MIMA       12  6/16/2025  FALSE       2025      12  28728    Acer rubrum
## 3     MIMA       12  6/16/2025  FALSE       2025      11  28728    Acer rubrum
## 4     MIMA       12  6/16/2025  FALSE       2025       2  28728    Acer rubrum
## 5     MIMA       12  6/16/2025  FALSE       2025      10  28728    Acer rubrum
## 6     MIMA       12  6/16/2025  FALSE       2025       7  28728    Acer rubrum
##   DBHcm TreeStatusCode CrownClassCode DecayClassCode         CommonName
## 1  24.9             AS              5           <NA> eastern white pine
## 2  10.9             AB              5           <NA>          red maple
## 3  18.8             AS              3           <NA>          red maple
## 4  51.2             AS              3           <NA>          red maple
## 5  38.2             AS              3           <NA>          red maple
## 6  22.5             AS              4           <NA>          red maple

CHALLENGE: Find a species in the NETN tree data that doesn’t have a match in the species table.
Answer
# find the columns in common
intersect(names(spp_tbl), names(trees)) # TSN and ScientificName
View R output
## [1] "TSN"            "ScientificName"

# anti join of trees against species table, selecting only columns of interest
anti_join(trees, spp_tbl, by = c("TSN", "ScientificName")) |> 
  select(ParkUnit, PlotCode, SampleYear, ScientificName)
View R output
##   ParkUnit PlotCode SampleYear ScientificName
## 1     MIMA       16       2025  Quercus robur


Rolling joins

There are a number of other more advanced joins out there, the rolling join being one of them. For more information on all possible joins, refer to Chapter 19 in R for Data Science.

Rolling joins can come in handy when the key values in your two datasets don't perfectly match and you want to join on the closest match. An example of where I've used rolling joins is relating the timing of high tide to the nearest water temperature measurement from a HOBO logger. You can allow the nearest match in either direction or specify a direction (e.g., >= or <=).
Figure from R for Data Science


Unfortunately, dplyr's rolling join doesn't do everything I've needed: it only matches in one direction, like the closest temperature measurement after high tide, or the closest measurement before high tide, but not the nearest in either direction. If you need that kind of rolling join, the data.table package is your best bet. It requires learning a new syntax and coding approach, so I'm not covering it here. But it's helpful to know that if you're working with huge datasets, data.table tends to perform much faster than dplyr and has more features for joining and summarizing your data.
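To give a feel for what a data.table rolling join looks like, here's a minimal sketch of the high-tide example described above. It assumes the data.table package is installed; the column names and values are made up for illustration.

```r
library(data.table)

# temperature readings from a logger (made-up values)
temps <- data.table(
  time = as.POSIXct(c("2026-03-12 01:00", "2026-03-12 02:00"), tz = "UTC"),
  temp = c(8.1, 8.4)
)

# a high tide time that falls between the two readings
tides <- data.table(time = as.POSIXct("2026-03-12 01:40", tz = "UTC"))

# roll = "nearest" matches each tide to the closest reading in either direction
temps[tides, on = "time", roll = "nearest"]
# temp = 8.4 (the 02:00 reading is 20 min away; the 01:00 reading is 40 min away)
```

The syntax `x[i, on, roll]` joins the rows of `i` (the tides) against `x` (the temperatures), rolling on the last join column.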


Dates and Times

Dates and times 101

Dates, times, and date-times are all special types of data in R. When you read in a dataset that contains any of these, they typically come in as character columns, and you have to convert them to a date/time type to do anything meaningful with them. The first place to start is knowing the codes R uses to define year, month, day, hours, minutes, and seconds. For the full list, check out the help for strptime by running: ?strptime. The codes below are the ones you're most likely to come across, either to define a date/time format, or to return a specific piece of it (like day of the week, month written in full, Julian day, etc.).

Code Definition
%a Abbreviated weekday name in the current locale on this platform.
%A Full weekday name in the current locale.
%b Abbreviated month name in the current locale on this platform. Case-insensitive on input.
%B Full month name in the current locale. Case-insensitive on input.
%d Day of the month as decimal number (01-31).
%H Hours as decimal number (00-23). As a special exception, strings such as "24:00:00" are accepted for input.
%I Hours as decimal number (01-12).
%j Day of year (Julian) as decimal number (001-366): For input, 366 is only valid in a leap year.
%m Month as decimal number (01-12).
%M Minute as decimal number (00-59).
%p AM/PM indicator in the locale. Used in conjunction with %I and not with %H. For input the match is case-insensitive.
%S Second as decimal number (00-61), allowing for up to two leap seconds.
%u Weekday as a decimal number (1-7, Monday is 1).
%y Year without century (00-99).
%Y Year with century.

Look at current time and date output.

Sys.time()
## [1] "2026-03-30 13:28:42 EDT"
class(Sys.time()) # POSIXct POSIXt
## [1] "POSIXct" "POSIXt"
Sys.Date()
## [1] "2026-03-30"
class(Sys.Date()) # Date
## [1] "Date"
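One gotcha worth knowing up front: converting a POSIXct date-time to a Date uses UTC by default, not your local timezone, which can shift late-evening times to the next calendar day. A quick sketch:

```r
# 11:30 PM Eastern on March 12 is already March 13 in UTC
now <- as.POSIXct("2026-03-12 23:30:00", tz = "America/New_York")
as.Date(now)                           # "2026-03-13" (default tz = "UTC")
as.Date(now, tz = "America/New_York")  # "2026-03-12"
```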


Dates in R

For date-only columns, you convert to the Date type. A few different ways of defining dates are below, based on the format of the input date. The format argument must match the input exactly: if there are dashes (-) or slashes (/) between day, month, and year, you need to specify the right symbol. If the output returns NA instead of a Date, either the format you specified was wrong, or the column you're converting contains more than one format.
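For example, a mismatched separator silently returns NA rather than throwing an error, so it's worth checking the result of any conversion. A quick sketch:

```r
# separator in the format ("-") doesn't match the data ("/"), so this returns NA
as.Date("3/12/2026", format = "%m-%d-%Y")
## [1] NA

# matching the format exactly works
as.Date("3/12/2026", format = "%m/%d/%Y")
## [1] "2026-03-12"
```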

Example formatting for dates

# date with slashes and full year
date_chr1 <- "3/12/2026"
date1 <- as.Date(date_chr1, format = "%m/%d/%Y")
str(date1)
# date with dashes and 2-digit year
date_chr2 <- "3-12-26"
date2 <- as.Date(date_chr2, format = "%m-%d-%y")
str(date2)
# date written out with full month name
date_chr3 <- "March 12, 2026"
date3 <- as.Date(date_chr3, format = "%B %d, %Y")
str(date3)
##  Date[1:1], format: "2026-03-12"

Extract information about dates

#Julian date as numeric
as.numeric(format(date1, format = "%j"))
## [1] 71
#Return day of week
format(date1, format = "%A") 
## [1] "Thursday"
#Return abbreviated day of week
format(date1, format = "%a") 
## [1] "Thu"
#Return written out date with month name
format(date1, format = "%B %d, %Y") 
## [1] "March 12, 2026"
#Return abbreviated written out date with month name
format(date1, format = "%b %d, %Y") 
## [1] "Mar 12, 2026"

Do math with dates

date1 + 1 # add a day
## [1] "2026-03-13"
date1 + 7 # add a week
## [1] "2026-03-19"
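Subtracting two Dates also works, and returns a difftime in days; wrap it in as.numeric() if you need a plain number. A quick sketch:

```r
d1 <- as.Date("2026-03-12")
d2 <- as.Date("2026-06-01")
d2 - d1              # Time difference of 81 days
as.numeric(d2 - d1)  # 81
```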

Create a vector of evenly spaced dates.

This can be helpful for setting up axis labels where one axis is dates.

date_list <- as.Date(c("01/01/2026", "12/31/2026"), format = "%m/%d/%Y") 
# by 15 days
seq.Date(date_list[1], date_list[2], by = "15 days")
##  [1] "2026-01-01" "2026-01-16" "2026-01-31" "2026-02-15" "2026-03-02"
##  [6] "2026-03-17" "2026-04-01" "2026-04-16" "2026-05-01" "2026-05-16"
## [11] "2026-05-31" "2026-06-15" "2026-06-30" "2026-07-15" "2026-07-30"
## [16] "2026-08-14" "2026-08-29" "2026-09-13" "2026-09-28" "2026-10-13"
## [21] "2026-10-28" "2026-11-12" "2026-11-27" "2026-12-12" "2026-12-27"
# by month
seq.Date(date_list[1], date_list[2], by = "1 month")
##  [1] "2026-01-01" "2026-02-01" "2026-03-01" "2026-04-01" "2026-05-01"
##  [6] "2026-06-01" "2026-07-01" "2026-08-01" "2026-09-01" "2026-10-01"
## [11] "2026-11-01" "2026-12-01"
# by 6 months
seq.Date(date_list[1], date_list[2], by = "6 months")
## [1] "2026-01-01" "2026-07-01"
Question: How would you return date1 as YYYYMMDD (20260312)?
Answer
format(date1, format = "%Y%m%d")
View R output
## [1] "20260312"

Question: How would you create a list of dates in 2026 that are evenly spaced by 3 months?
Answer
date_list <- as.Date(c("01/01/2026", "12/31/2026"), format = "%m/%d/%Y") 
seq.Date(date_list[1], date_list[2], by = "3 months")
View R output
## [1] "2026-01-01" "2026-04-01" "2026-07-01" "2026-10-01"

Question: How would you create a list of dates in 2026 that are evenly spaced by 1 week?
Answer
date_list <- as.Date(c("01/01/2026", "12/31/2026"), format = "%m/%d/%Y") 
seq.Date(date_list[1], date_list[2], by = "1 week")
View R output
##  [1] "2026-01-01" "2026-01-08" "2026-01-15" "2026-01-22" "2026-01-29"
##  [6] "2026-02-05" "2026-02-12" "2026-02-19" "2026-02-26" "2026-03-05"
## [11] "2026-03-12" "2026-03-19" "2026-03-26" "2026-04-02" "2026-04-09"
## [16] "2026-04-16" "2026-04-23" "2026-04-30" "2026-05-07" "2026-05-14"
## [21] "2026-05-21" "2026-05-28" "2026-06-04" "2026-06-11" "2026-06-18"
## [26] "2026-06-25" "2026-07-02" "2026-07-09" "2026-07-16" "2026-07-23"
## [31] "2026-07-30" "2026-08-06" "2026-08-13" "2026-08-20" "2026-08-27"
## [36] "2026-09-03" "2026-09-10" "2026-09-17" "2026-09-24" "2026-10-01"
## [41] "2026-10-08" "2026-10-15" "2026-10-22" "2026-10-29" "2026-11-05"
## [46] "2026-11-12" "2026-11-19" "2026-11-26" "2026-12-03" "2026-12-10"
## [51] "2026-12-17" "2026-12-24" "2026-12-31"


Times in R

Date-time variables (e.g., a HOBO logger timestamp) need to be converted to a POSIX type before you can work with them. POSIX (Portable Operating System Interface) is a family of international standards that comes from Unix and covers, among other things, how dates and times are handled, the idea being that POSIX date-times are transferable across software. There are 2 POSIX date-time types in R.
  1. POSIXct: is lighter weight and only stores the date-time as the number of seconds since January 1, 1970. Times prior to 1970 are stored as negative numbers.
  2. POSIXlt: stores more information that's easily accessible, including sec, min, hour, mday (day of the month), mon (month), year, yday (Julian day), etc.

If your dataset is huge, working with the lighter weight POSIXct may be best. Outside of that, whichever you choose may not matter much in your workflow. We will use the lighter weight POSIXct version for our examples.

Look under the hood of the info stored by the two POSIX types

unclass(as.POSIXct("2026-03-12 01:30:00", format = "%Y-%m-%d %H:%M:%S", tz = "America/New_York"))
## [1] 1773293400
## attr(,"tzone")
## [1] "America/New_York"
unclass(as.POSIXlt("2026-03-12 01:30:00", format = "%Y-%m-%d %H:%M:%S", tz = "America/New_York"))
## $sec
## [1] 0
## 
## $min
## [1] 30
## 
## $hour
## [1] 1
## 
## $mday
## [1] 12
## 
## $mon
## [1] 2
## 
## $year
## [1] 126
## 
## $wday
## [1] 4
## 
## $yday
## [1] 70
## 
## $isdst
## [1] 1
## 
## $zone
## [1] "EDT"
## 
## $gmtoff
## [1] NA
## 
## attr(,"tzone")
## [1] "America/New_York"
## attr(,"balanced")
## [1] TRUE
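Note the offsets in the POSIXlt output above: mon is 0-based, year counts from 1900, and yday is 0-based. If you pull components straight out of a POSIXlt, you need to correct for these. A quick sketch:

```r
lt <- as.POSIXlt("2026-03-12 01:30:00", tz = "America/New_York")
lt$mon + 1      # months are 0-based: 2 + 1 = 3 (March)
lt$year + 1900  # years count from 1900: 126 + 1900 = 2026
lt$yday + 1     # day of year is 0-based: 70 + 1 = 71 (Julian day)
```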
Note the use of a timezone in the code above; here I specified the US Eastern timezone. There are two handy ways to check timezones in R.

Check the timezone of your computer

Sys.timezone()
## [1] "America/New_York"

Check the timezones built into base R

OlsonNames()
View R output
##   [1] "Africa/Abidjan"                   "Africa/Accra"                    
##   [3] "Africa/Addis_Ababa"               "Africa/Algiers"                  
##   [5] "Africa/Asmara"                    "Africa/Asmera"                   
##   [7] "Africa/Bamako"                    "Africa/Bangui"                   
##   [9] "Africa/Banjul"                    "Africa/Bissau"                   
##  [11] "Africa/Blantyre"                  "Africa/Brazzaville"              
##  [13] "Africa/Bujumbura"                 "Africa/Cairo"                    
##  [15] "Africa/Casablanca"                "Africa/Ceuta"                    
##  [17] "Africa/Conakry"                   "Africa/Dakar"                    
##  [19] "Africa/Dar_es_Salaam"             "Africa/Djibouti"                 
##  [21] "Africa/Douala"                    "Africa/El_Aaiun"                 
##  [23] "Africa/Freetown"                  "Africa/Gaborone"                 
##  [25] "Africa/Harare"                    "Africa/Johannesburg"             
##  [27] "Africa/Juba"                      "Africa/Kampala"                  
##  [29] "Africa/Khartoum"                  "Africa/Kigali"                   
##  [31] "Africa/Kinshasa"                  "Africa/Lagos"                    
##  [33] "Africa/Libreville"                "Africa/Lome"                     
##  [35] "Africa/Luanda"                    "Africa/Lubumbashi"               
##  [37] "Africa/Lusaka"                    "Africa/Malabo"                   
##  [39] "Africa/Maputo"                    "Africa/Maseru"                   
##  [41] "Africa/Mbabane"                   "Africa/Mogadishu"                
##  [43] "Africa/Monrovia"                  "Africa/Nairobi"                  
##  [45] "Africa/Ndjamena"                  "Africa/Niamey"                   
##  [47] "Africa/Nouakchott"                "Africa/Ouagadougou"              
##  [49] "Africa/Porto-Novo"                "Africa/Sao_Tome"                 
##  [51] "Africa/Timbuktu"                  "Africa/Tripoli"                  
##  [53] "Africa/Tunis"                     "Africa/Windhoek"                 
##  [55] "America/Adak"                     "America/Anchorage"               
##  [57] "America/Anguilla"                 "America/Antigua"                 
##  [59] "America/Araguaina"                "America/Argentina/Buenos_Aires"  
##  [61] "America/Argentina/Catamarca"      "America/Argentina/ComodRivadavia"
##  [63] "America/Argentina/Cordoba"        "America/Argentina/Jujuy"         
##  [65] "America/Argentina/La_Rioja"       "America/Argentina/Mendoza"       
##  [67] "America/Argentina/Rio_Gallegos"   "America/Argentina/Salta"         
##  [69] "America/Argentina/San_Juan"       "America/Argentina/San_Luis"      
##  [71] "America/Argentina/Tucuman"        "America/Argentina/Ushuaia"       
##  [73] "America/Aruba"                    "America/Asuncion"                
##  [75] "America/Atikokan"                 "America/Atka"                    
##  [77] "America/Bahia"                    "America/Bahia_Banderas"          
##  [79] "America/Barbados"                 "America/Belem"                   
##  [81] "America/Belize"                   "America/Blanc-Sablon"            
##  [83] "America/Boa_Vista"                "America/Bogota"                  
##  [85] "America/Boise"                    "America/Buenos_Aires"            
##  [87] "America/Cambridge_Bay"            "America/Campo_Grande"            
##  [89] "America/Cancun"                   "America/Caracas"                 
##  [91] "America/Catamarca"                "America/Cayenne"                 
##  [93] "America/Cayman"                   "America/Chicago"                 
##  [95] "America/Chihuahua"                "America/Ciudad_Juarez"           
##  [97] "America/Coral_Harbour"            "America/Cordoba"                 
##  [99] "America/Costa_Rica"               "America/Creston"                 
## [101] "America/Cuiaba"                   "America/Curacao"                 
## [103] "America/Danmarkshavn"             "America/Dawson"                  
## [105] "America/Dawson_Creek"             "America/Denver"                  
## [107] "America/Detroit"                  "America/Dominica"                
## [109] "America/Edmonton"                 "America/Eirunepe"                
## [111] "America/El_Salvador"              "America/Ensenada"                
## [113] "America/Fort_Nelson"              "America/Fort_Wayne"              
## [115] "America/Fortaleza"                "America/Glace_Bay"               
## [117] "America/Godthab"                  "America/Goose_Bay"               
## [119] "America/Grand_Turk"               "America/Grenada"                 
## [121] "America/Guadeloupe"               "America/Guatemala"               
## [123] "America/Guayaquil"                "America/Guyana"                  
## [125] "America/Halifax"                  "America/Havana"                  
## [127] "America/Hermosillo"               "America/Indiana/Indianapolis"    
## [129] "America/Indiana/Knox"             "America/Indiana/Marengo"         
## [131] "America/Indiana/Petersburg"       "America/Indiana/Tell_City"       
## [133] "America/Indiana/Vevay"            "America/Indiana/Vincennes"       
## [135] "America/Indiana/Winamac"          "America/Indianapolis"            
## [137] "America/Inuvik"                   "America/Iqaluit"                 
## [139] "America/Jamaica"                  "America/Jujuy"                   
## [141] "America/Juneau"                   "America/Kentucky/Louisville"     
## [143] "America/Kentucky/Monticello"      "America/Knox_IN"                 
## [145] "America/Kralendijk"               "America/La_Paz"                  
## [147] "America/Lima"                     "America/Los_Angeles"             
## [149] "America/Louisville"               "America/Lower_Princes"           
## [151] "America/Maceio"                   "America/Managua"                 
## [153] "America/Manaus"                   "America/Marigot"                 
## [155] "America/Martinique"               "America/Matamoros"               
## [157] "America/Mazatlan"                 "America/Mendoza"                 
## [159] "America/Menominee"                "America/Merida"                  
## [161] "America/Metlakatla"               "America/Mexico_City"             
## [163] "America/Miquelon"                 "America/Moncton"                 
## [165] "America/Monterrey"                "America/Montevideo"              
## [167] "America/Montreal"                 "America/Montserrat"              
## [169] "America/Nassau"                   "America/New_York"                
## [171] "America/Nipigon"                  "America/Nome"                    
## [173] "America/Noronha"                  "America/North_Dakota/Beulah"     
## [175] "America/North_Dakota/Center"      "America/North_Dakota/New_Salem"  
## [177] "America/Nuuk"                     "America/Ojinaga"                 
## [179] "America/Panama"                   "America/Pangnirtung"             
## [181] "America/Paramaribo"               "America/Phoenix"                 
## [183] "America/Port-au-Prince"           "America/Port_of_Spain"           
## [185] "America/Porto_Acre"               "America/Porto_Velho"             
## [187] "America/Puerto_Rico"              "America/Punta_Arenas"            
## [189] "America/Rainy_River"              "America/Rankin_Inlet"            
## [191] "America/Recife"                   "America/Regina"                  
## [193] "America/Resolute"                 "America/Rio_Branco"              
## [195] "America/Rosario"                  "America/Santa_Isabel"            
## [197] "America/Santarem"                 "America/Santiago"                
## [199] "America/Santo_Domingo"            "America/Sao_Paulo"               
## [201] "America/Scoresbysund"             "America/Shiprock"                
## [203] "America/Sitka"                    "America/St_Barthelemy"           
## [205] "America/St_Johns"                 "America/St_Kitts"                
## [207] "America/St_Lucia"                 "America/St_Thomas"               
## [209] "America/St_Vincent"               "America/Swift_Current"           
## [211] "America/Tegucigalpa"              "America/Thule"                   
## [213] "America/Thunder_Bay"              "America/Tijuana"                 
## [215] "America/Toronto"                  "America/Tortola"                 
## [217] "America/Vancouver"                "America/Virgin"                  
## [219] "America/Whitehorse"               "America/Winnipeg"                
## [221] "America/Yakutat"                  "America/Yellowknife"             
## [223] "Antarctica/Casey"                 "Antarctica/Davis"                
## [225] "Antarctica/DumontDUrville"        "Antarctica/Macquarie"            
## [227] "Antarctica/Mawson"                "Antarctica/McMurdo"              
## [229] "Antarctica/Palmer"                "Antarctica/Rothera"              
## [231] "Antarctica/South_Pole"            "Antarctica/Syowa"                
## [233] "Antarctica/Troll"                 "Antarctica/Vostok"               
## [235] "Arctic/Longyearbyen"              "Asia/Aden"                       
## [237] "Asia/Almaty"                      "Asia/Amman"                      
## [239] "Asia/Anadyr"                      "Asia/Aqtau"                      
## [241] "Asia/Aqtobe"                      "Asia/Ashgabat"                   
## [243] "Asia/Ashkhabad"                   "Asia/Atyrau"                     
## [245] "Asia/Baghdad"                     "Asia/Bahrain"                    
## [247] "Asia/Baku"                        "Asia/Bangkok"                    
## [249] "Asia/Barnaul"                     "Asia/Beirut"                     
## [251] "Asia/Bishkek"                     "Asia/Brunei"                     
## [253] "Asia/Calcutta"                    "Asia/Chita"                      
## [255] "Asia/Choibalsan"                  "Asia/Chongqing"                  
## [257] "Asia/Chungking"                   "Asia/Colombo"                    
## [259] "Asia/Dacca"                       "Asia/Damascus"                   
## [261] "Asia/Dhaka"                       "Asia/Dili"                       
## [263] "Asia/Dubai"                       "Asia/Dushanbe"                   
## [265] "Asia/Famagusta"                   "Asia/Gaza"                       
## [267] "Asia/Harbin"                      "Asia/Hebron"                     
## [269] "Asia/Ho_Chi_Minh"                 "Asia/Hong_Kong"                  
## [271] "Asia/Hovd"                        "Asia/Irkutsk"                    
## [273] "Asia/Istanbul"                    "Asia/Jakarta"                    
## [275] "Asia/Jayapura"                    "Asia/Jerusalem"                  
## [277] "Asia/Kabul"                       "Asia/Kamchatka"                  
## [279] "Asia/Karachi"                     "Asia/Kashgar"                    
## [281] "Asia/Kathmandu"                   "Asia/Katmandu"                   
## [283] "Asia/Khandyga"                    "Asia/Kolkata"                    
## [285] "Asia/Krasnoyarsk"                 "Asia/Kuala_Lumpur"               
## [287] "Asia/Kuching"                     "Asia/Kuwait"                     
## [289] "Asia/Macao"                       "Asia/Macau"                      
## [291] "Asia/Magadan"                     "Asia/Makassar"                   
## [293] "Asia/Manila"                      "Asia/Muscat"                     
## [295] "Asia/Nicosia"                     "Asia/Novokuznetsk"               
## [297] "Asia/Novosibirsk"                 "Asia/Omsk"                       
## [299] "Asia/Oral"                        "Asia/Phnom_Penh"                 
## [301] "Asia/Pontianak"                   "Asia/Pyongyang"                  
## [303] "Asia/Qatar"                       "Asia/Qostanay"                   
## [305] "Asia/Qyzylorda"                   "Asia/Rangoon"                    
## [307] "Asia/Riyadh"                      "Asia/Saigon"                     
## [309] "Asia/Sakhalin"                    "Asia/Samarkand"                  
## [311] "Asia/Seoul"                       "Asia/Shanghai"                   
## [313] "Asia/Singapore"                   "Asia/Srednekolymsk"              
## [315] "Asia/Taipei"                      "Asia/Tashkent"                   
## [317] "Asia/Tbilisi"                     "Asia/Tehran"                     
## [319] "Asia/Tel_Aviv"                    "Asia/Thimbu"                     
## [321] "Asia/Thimphu"                     "Asia/Tokyo"                      
## [323] "Asia/Tomsk"                       "Asia/Ujung_Pandang"              
## [325] "Asia/Ulaanbaatar"                 "Asia/Ulan_Bator"                 
## [327] "Asia/Urumqi"                      "Asia/Ust-Nera"                   
## [329] "Asia/Vientiane"                   "Asia/Vladivostok"                
## [331] "Asia/Yakutsk"                     "Asia/Yangon"                     
## [333] "Asia/Yekaterinburg"               "Asia/Yerevan"                    
## [335] "Atlantic/Azores"                  "Atlantic/Bermuda"                
## [337] "Atlantic/Canary"                  "Atlantic/Cape_Verde"             
## [339] "Atlantic/Faeroe"                  "Atlantic/Faroe"                  
## [341] "Atlantic/Jan_Mayen"               "Atlantic/Madeira"                
## [343] "Atlantic/Reykjavik"               "Atlantic/South_Georgia"          
## [345] "Atlantic/St_Helena"               "Atlantic/Stanley"                
## [347] "Australia/ACT"                    "Australia/Adelaide"              
## [349] "Australia/Brisbane"               "Australia/Broken_Hill"           
## [351] "Australia/Canberra"               "Australia/Currie"                
## [353] "Australia/Darwin"                 "Australia/Eucla"                 
## [355] "Australia/Hobart"                 "Australia/LHI"                   
## [357] "Australia/Lindeman"               "Australia/Lord_Howe"             
## [359] "Australia/Melbourne"              "Australia/North"                 
## [361] "Australia/NSW"                    "Australia/Perth"                 
## [363] "Australia/Queensland"             "Australia/South"                 
## [365] "Australia/Sydney"                 "Australia/Tasmania"              
## [367] "Australia/Victoria"               "Australia/West"                  
## [369] "Australia/Yancowinna"             "Brazil/Acre"                     
## [371] "Brazil/DeNoronha"                 "Brazil/East"                     
## [373] "Brazil/West"                      "Canada/Atlantic"                 
## [375] "Canada/Central"                   "Canada/Eastern"                  
## [377] "Canada/Mountain"                  "Canada/Newfoundland"             
## [379] "Canada/Pacific"                   "Canada/Saskatchewan"             
## [381] "Canada/Yukon"                     "CET"                             
## [383] "Chile/Continental"                "Chile/EasterIsland"              
## [385] "CST6CDT"                          "Cuba"                            
## [387] "EET"                              "Egypt"                           
## [389] "Eire"                             "EST"                             
## [391] "EST5EDT"                          "Etc/GMT"                         
## [393] "Etc/GMT-0"                        "Etc/GMT-1"                       
## [395] "Etc/GMT-10"                       "Etc/GMT-11"                      
## [397] "Etc/GMT-12"                       "Etc/GMT-13"                      
## [399] "Etc/GMT-14"                       "Etc/GMT-2"                       
## [401] "Etc/GMT-3"                        "Etc/GMT-4"                       
## [403] "Etc/GMT-5"                        "Etc/GMT-6"                       
## [405] "Etc/GMT-7"                        "Etc/GMT-8"                       
## [407] "Etc/GMT-9"                        "Etc/GMT+0"                       
## [409] "Etc/GMT+1"                        "Etc/GMT+10"                      
## [411] "Etc/GMT+11"                       "Etc/GMT+12"                      
## [413] "Etc/GMT+2"                        "Etc/GMT+3"                       
## [415] "Etc/GMT+4"                        "Etc/GMT+5"                       
## [417] "Etc/GMT+6"                        "Etc/GMT+7"                       
## [419] "Etc/GMT+8"                        "Etc/GMT+9"                       
## [421] "Etc/GMT0"                         "Etc/Greenwich"                   
## [423] "Etc/UCT"                          "Etc/Universal"                   
## [425] "Etc/UTC"                          "Etc/Zulu"                        
## [427] "Europe/Amsterdam"                 "Europe/Andorra"                  
## [429] "Europe/Astrakhan"                 "Europe/Athens"                   
## [431] "Europe/Belfast"                   "Europe/Belgrade"                 
## [433] "Europe/Berlin"                    "Europe/Bratislava"               
## [435] "Europe/Brussels"                  "Europe/Bucharest"                
## [437] "Europe/Budapest"                  "Europe/Busingen"                 
## [439] "Europe/Chisinau"                  "Europe/Copenhagen"               
## [441] "Europe/Dublin"                    "Europe/Gibraltar"                
## [443] "Europe/Guernsey"                  "Europe/Helsinki"                 
## [445] "Europe/Isle_of_Man"               "Europe/Istanbul"                 
## [447] "Europe/Jersey"                    "Europe/Kaliningrad"              
## [449] "Europe/Kiev"                      "Europe/Kirov"                    
## [451] "Europe/Kyiv"                      "Europe/Lisbon"                   
## [453] "Europe/Ljubljana"                 "Europe/London"                   
## [455] "Europe/Luxembourg"                "Europe/Madrid"                   
## [457] "Europe/Malta"                     "Europe/Mariehamn"                
## [459] "Europe/Minsk"                     "Europe/Monaco"                   
## [461] "Europe/Moscow"                    "Europe/Nicosia"                  
## [463] "Europe/Oslo"                      "Europe/Paris"                    
## [465] "Europe/Podgorica"                 "Europe/Prague"                   
## [467] "Europe/Riga"                      "Europe/Rome"                     
## [469] "Europe/Samara"                    "Europe/San_Marino"               
## [471] "Europe/Sarajevo"                  "Europe/Saratov"                  
## [473] "Europe/Simferopol"                "Europe/Skopje"                   
## [475] "Europe/Sofia"                     "Europe/Stockholm"                
## [477] "Europe/Tallinn"                   "Europe/Tirane"                   
## [479] "Europe/Tiraspol"                  "Europe/Ulyanovsk"                
## [481] "Europe/Uzhgorod"                  "Europe/Vaduz"                    
## [483] "Europe/Vatican"                   "Europe/Vienna"                   
## [485] "Europe/Vilnius"                   "Europe/Volgograd"                
## [487] "Europe/Warsaw"                    "Europe/Zagreb"                   
## [489] "Europe/Zaporozhye"                "Europe/Zurich"                   
## [491] "GB"                               "GB-Eire"                         
## [493] "GMT"                              "GMT-0"                           
## [495] "GMT+0"                            "GMT0"                            
## [497] "Greenwich"                        "Hongkong"                        
## [499] "HST"                              "Iceland"                         
## [501] "Indian/Antananarivo"              "Indian/Chagos"                   
## [503] "Indian/Christmas"                 "Indian/Cocos"                    
## [505] "Indian/Comoro"                    "Indian/Kerguelen"                
## [507] "Indian/Mahe"                      "Indian/Maldives"                 
## [509] "Indian/Mauritius"                 "Indian/Mayotte"                  
## [511] "Indian/Reunion"                   "Iran"                            
## [513] "Israel"                           "Jamaica"                         
## [515] "Japan"                            "Kwajalein"                       
## [517] "Libya"                            "MET"                             
## [519] "Mexico/BajaNorte"                 "Mexico/BajaSur"                  
## [521] "Mexico/General"                   "MST"                             
## [523] "MST7MDT"                          "Navajo"                          
## [525] "NZ"                               "NZ-CHAT"                         
## [527] "Pacific/Apia"                     "Pacific/Auckland"                
## [529] "Pacific/Bougainville"             "Pacific/Chatham"                 
## [531] "Pacific/Chuuk"                    "Pacific/Easter"                  
## [533] "Pacific/Efate"                    "Pacific/Enderbury"               
## [535] "Pacific/Fakaofo"                  "Pacific/Fiji"                    
## [537] "Pacific/Funafuti"                 "Pacific/Galapagos"               
## [539] "Pacific/Gambier"                  "Pacific/Guadalcanal"             
## [541] "Pacific/Guam"                     "Pacific/Honolulu"                
## [543] "Pacific/Johnston"                 "Pacific/Kanton"                  
## [545] "Pacific/Kiritimati"               "Pacific/Kosrae"                  
## [547] "Pacific/Kwajalein"                "Pacific/Majuro"                  
## [549] "Pacific/Marquesas"                "Pacific/Midway"                  
## [551] "Pacific/Nauru"                    "Pacific/Niue"                    
## [553] "Pacific/Norfolk"                  "Pacific/Noumea"                  
## [555] "Pacific/Pago_Pago"                "Pacific/Palau"                   
## [557] "Pacific/Pitcairn"                 "Pacific/Pohnpei"                 
## [559] "Pacific/Ponape"                   "Pacific/Port_Moresby"            
## [561] "Pacific/Rarotonga"                "Pacific/Saipan"                  
## [563] "Pacific/Samoa"                    "Pacific/Tahiti"                  
## [565] "Pacific/Tarawa"                   "Pacific/Tongatapu"               
## [567] "Pacific/Truk"                     "Pacific/Wake"                    
## [569] "Pacific/Wallis"                   "Pacific/Yap"                     
## [571] "Poland"                           "Portugal"                        
## [573] "PRC"                              "PST8PDT"                         
## [575] "ROC"                              "ROK"                             
## [577] "Singapore"                        "Turkey"                          
## [579] "UCT"                              "Universal"                       
## [581] "US/Alaska"                        "US/Aleutian"                     
## [583] "US/Arizona"                       "US/Central"                      
## [585] "US/East-Indiana"                  "US/Eastern"                      
## [587] "US/Hawaii"                        "US/Indiana-Starke"               
## [589] "US/Michigan"                      "US/Mountain"                     
## [591] "US/Pacific"                       "US/Samoa"                        
## [593] "UTC"                              "W-SU"                            
## [595] "WET"                              "Zulu"                            
## attr(,"Version")
## [1] "2025a"

If you understand how to set up a Date type in R, setting up date-times isn't much different. It just takes a bit more attention to get the format right. To demonstrate, we'll read in HOBO temperature data and set the timestamp column as a POSIXct date-time. HOBO data usually need a bit of cleaning beyond setting the timestamp as a POSIXct date-time, so I'll show the whole process below.
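Before diving into the HOBO file, here's a minimal sketch (the date-time string is a toy value of my own, not from the dataset) showing how as.POSIXct() parses a string when the format matches, and how a mismatched format silently returns NA instead of throwing an error:

```r
# Parse a toy date-time string with a matching format
dt <- as.POSIXct("7/18/2021 10:26", format = "%m/%d/%Y %H:%M",
                 tz = "America/New_York")
dt  # 2021-07-18 10:26:00 EDT

# A format that doesn't match the string quietly returns NA
as.POSIXct("7/18/2021 10:26", format = "%Y-%m-%d %H:%M",
           tz = "America/New_York")  # NA
```

The silent NA is worth remembering: after converting a real dataset, it's good practice to check `sum(is.na())` on the new column to catch rows that didn't parse.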

Read in temperature data and look at it

temp_data1 <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/HOBO_temp_example.csv")
head(temp_data1)
View R output
##   Plot.Title.HOBO_temp_example.csv                    X
## 1                                # Date Time, GMT-05:00
## 2                                1      7/18/2021 10:26
## 3                                2      7/18/2021 11:26
## 4                                3      7/18/2021 12:26
## 5                                4      7/18/2021 13:26
## 6                                5      7/18/2021 14:26
##                                               X.1
## 1 Temp, °F (LGR S/N: 20672839, SEN S/N: 20672839)
## 2                                          58.842
## 3                                          58.712
## 4                                          58.109
## 5                                          56.208
## 6                                          56.208
##                                    X.2                                  X.3
## 1 Coupler Detached (LGR S/N: 20672839) Coupler Attached (LGR S/N: 20672839)
## 2                               Logged                                     
## 3                                                                          
## 4                                                                          
## 5                                                                          
## 6                                                                          
##                           X.4                             X.5
## 1 Stopped (LGR S/N: 20672839) End Of File (LGR S/N: 20672839)
## 2                                                            
## 3                                                            
## 4                                                            
## 5                                                            
## 6

Note the extra row on top showing the file name; HOBO files often have metadata in the first row. The next code chunk imports a cleaner version of the data by skipping the first row, pulling in only the first 3 columns (we don't care about the logger-event columns like Coupler Detached/Attached), and cleaning up the column names.

Clean up non-date HOBO data

temp_data <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/HOBO_temp_example.csv", skip = 1)[,1:3]
colnames(temp_data) <- c("index", "date_time", "tempF")
View(temp_data)
First 50 rows of temp_data
index date_time tempF
1 7/18/2021 10:26 58.842
2 7/18/2021 11:26 58.712
3 7/18/2021 12:26 58.109
4 7/18/2021 13:26 56.208
5 7/18/2021 14:26 56.208
6 7/18/2021 15:26 55.342
7 7/18/2021 16:26 55.602
8 7/18/2021 17:26 55.949
9 7/18/2021 18:26 55.602
10 7/18/2021 19:26 55.733
11 7/18/2021 20:26 55.819
12 7/18/2021 21:26 55.776
13 7/18/2021 22:26 56.469
14 7/18/2021 23:26 56.642
15 7/19/2021 0:26 56.556
16 7/19/2021 1:26 55.863
17 7/19/2021 2:26 55.819
18 7/19/2021 3:26 55.733
19 7/19/2021 4:26 55.733
20 7/19/2021 5:26 55.733
21 7/19/2021 6:26 55.949
22 7/19/2021 7:26 55.776
23 7/19/2021 8:26 56.035
24 7/19/2021 9:26 56.079
25 7/19/2021 10:26 56.901
26 7/19/2021 11:26 63.090
27 7/19/2021 12:26 63.732
28 7/19/2021 13:26 57.420
29 7/19/2021 14:26 56.685
30 7/19/2021 15:26 56.383
31 7/19/2021 16:26 56.469
32 7/19/2021 17:26 56.512
33 7/19/2021 18:26 56.815
34 7/19/2021 19:26 56.122
35 7/19/2021 20:26 57.074
36 7/19/2021 21:26 56.469
37 7/19/2021 22:26 56.122
38 7/19/2021 23:26 56.772
39 7/20/2021 0:26 57.979
40 7/20/2021 1:26 57.807
41 7/20/2021 2:26 56.469
42 7/20/2021 3:26 56.728
43 7/20/2021 4:26 56.295
44 7/20/2021 5:26 56.035
45 7/20/2021 6:26 56.079
46 7/20/2021 7:26 56.079
47 7/20/2021 8:26 56.165
48 7/20/2021 9:26 56.469
49 7/20/2021 10:26 57.031
50 7/20/2021 11:26 57.979

Convert date_time to POSIXct

We can see that the date is formatted as M/D/YYYY, followed by a space, then the time formatted as HH:MM, with hours running 0-23 and minutes 00-59. There are no seconds.

temp_data$timestamp <- as.POSIXct(temp_data$date_time, 
                                  format = "%m/%d/%Y %H:%M", 
                                  tz = "America/New_York")
head(temp_data)
##   index       date_time  tempF           timestamp
## 1     1 7/18/2021 10:26 58.842 2021-07-18 10:26:00
## 2     2 7/18/2021 11:26 58.712 2021-07-18 11:26:00
## 3     3 7/18/2021 12:26 58.109 2021-07-18 12:26:00
## 4     4 7/18/2021 13:26 56.208 2021-07-18 13:26:00
## 5     5 7/18/2021 14:26 56.208 2021-07-18 14:26:00
## 6     6 7/18/2021 15:26 55.342 2021-07-18 15:26:00

Extract the YYYYMMDD date, month, time, and hour of the timestamp.

temp_data$date <- format(temp_data$timestamp, "%Y%m%d") 
temp_data$month <- format(temp_data$timestamp, "%b")
temp_data$time <- format(temp_data$timestamp, "%I:%M") 
temp_data$hour <- as.numeric(format(temp_data$timestamp, "%I")) 
head(temp_data)
##   index       date_time  tempF           timestamp     date month  time hour
## 1     1 7/18/2021 10:26 58.842 2021-07-18 10:26:00 20210718   Jul 10:26   10
## 2     2 7/18/2021 11:26 58.712 2021-07-18 11:26:00 20210718   Jul 11:26   11
## 3     3 7/18/2021 12:26 58.109 2021-07-18 12:26:00 20210718   Jul 12:26   12
## 4     4 7/18/2021 13:26 56.208 2021-07-18 13:26:00 20210718   Jul 01:26    1
## 5     5 7/18/2021 14:26 56.208 2021-07-18 14:26:00 20210718   Jul 02:26    2
## 6     6 7/18/2021 15:26 55.342 2021-07-18 15:26:00 20210718   Jul 03:26    3
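A quick aside on the hour column above: %I returns the 12-hour clock value, which is why 13:26 shows an hour of 1. A toy example (my own, not from the HOBO data) comparing %I to the 24-hour %H:

```r
# Toy afternoon timestamp to compare 12-hour vs 24-hour formatting
x <- as.POSIXct("2021-07-18 13:26", format = "%Y-%m-%d %H:%M",
                tz = "America/New_York")
format(x, "%I")  # "01" -- 12-hour clock
format(x, "%H")  # "13" -- 24-hour clock
```

If you want hours on a 0-23 scale (e.g., for plotting diurnal patterns), use %H instead of %I.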
CHALLENGE: How would you extract the month as a number ranging from 1-12 in temp_data?
Answer
temp_data$month_num <- as.numeric(format(temp_data$timestamp, "%m"))
head(temp_data)
View R output
##   index       date_time  tempF           timestamp     date month  time hour
## 1     1 7/18/2021 10:26 58.842 2021-07-18 10:26:00 20210718   Jul 10:26   10
## 2     2 7/18/2021 11:26 58.712 2021-07-18 11:26:00 20210718   Jul 11:26   11
## 3     3 7/18/2021 12:26 58.109 2021-07-18 12:26:00 20210718   Jul 12:26   12
## 4     4 7/18/2021 13:26 56.208 2021-07-18 13:26:00 20210718   Jul 01:26    1
## 5     5 7/18/2021 14:26 56.208 2021-07-18 14:26:00 20210718   Jul 02:26    2
## 6     6 7/18/2021 15:26 55.342 2021-07-18 15:26:00 20210718   Jul 03:26    3
##   month_num
## 1         7
## 2         7
## 3         7
## 4         7
## 5         7
## 6         7

CHALLENGE: How would you extract the Julian date in temp_data?
Answer
temp_data$julian <- as.numeric(format(temp_data$timestamp, "%j"))
head(temp_data)
View R output
##   index       date_time  tempF           timestamp     date month  time hour
## 1     1 7/18/2021 10:26 58.842 2021-07-18 10:26:00 20210718   Jul 10:26   10
## 2     2 7/18/2021 11:26 58.712 2021-07-18 11:26:00 20210718   Jul 11:26   11
## 3     3 7/18/2021 12:26 58.109 2021-07-18 12:26:00 20210718   Jul 12:26   12
## 4     4 7/18/2021 13:26 56.208 2021-07-18 13:26:00 20210718   Jul 01:26    1
## 5     5 7/18/2021 14:26 56.208 2021-07-18 14:26:00 20210718   Jul 02:26    2
## 6     6 7/18/2021 15:26 55.342 2021-07-18 15:26:00 20210718   Jul 03:26    3
##   month_num julian
## 1         7    199
## 2         7    199
## 3         7    199
## 4         7    199
## 5         7    199
## 6         7    199


Customizing ggplot

Load and prep data

For this section, we’re going to use NETN water quality data to customize ggplot objects. This is an abbreviated dataset from the NETN water package, containing surface measurements recorded with a YSI in a subset of lakes in Acadia NP. We’ll filter the data to plot different combinations of parameters and sites.

library(dplyr)
library(ggplot2)
library(patchwork) # for arranging ggplot objects
library(RColorBrewer) # for palettes
library(viridis) # for palettes
chem <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_water_chemistry_data.csv")
str(chem)
View R output
## 'data.frame':    6088 obs. of  13 variables:
##  $ SiteCode     : chr  "ACBUBL" "ACBUBL" "ACBUBL" "ACBUBL" ...
##  $ SiteName     : chr  "Bubble Pond" "Bubble Pond" "Bubble Pond" "Bubble Pond" ...
##  $ UnitCode     : chr  "ACAD" "ACAD" "ACAD" "ACAD" ...
##  $ SubUnitCode  : logi  NA NA NA NA NA NA ...
##  $ EventDate    : chr  "5/23/2006" "5/23/2006" "5/23/2006" "5/23/2006" ...
##  $ SiteType     : chr  "Lake" "Lake" "Lake" "Lake" ...
##  $ Project      : chr  "NETN_LS" "NETN_LS" "NETN_LS" "NETN_LS" ...
##  $ QCtype       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SampleDepth_m: num  0.995 0.995 0.995 0.995 0.995 0.995 0.503 0.503 0.503 0.503 ...
##  $ Parameter    : chr  "DO_mgL" "DOsat_pct" "SpCond_uScm" "Temp_C" ...
##  $ Value        : num  10.4 99.2 29 13.4 56.1 ...
##  $ ValueFlag    : logi  NA NA NA NA NA NA ...
##  $ FlagComments : logi  NA NA NA NA NA NA ...

To start, we’re going to plot temperature data for 8 lakes monitored in Acadia NP. Before plotting, we need to convert the EventDate column to a Date type, and we’ll extract year, month, and day-of-year columns for easier plotting later on.

Add date columns to chem, then filter on sites and Temp_F.

chem <- chem |> mutate(date = as.Date(EventDate, "%m/%d/%Y"),
                       year = as.numeric(format(date, "%Y")),
                       mon = as.numeric(format(date, "%m")),
                       doy = as.numeric(format(date, "%j"))) 

ACAD_lakes <- c("ACBUBL", "ACEAGL", "ACECHO", "ACJORD", 
                "ACLONG", "ACSEAL", "ACUHAD", "ACWHOL")

lakes_temp <- chem |> filter(SiteCode %in% ACAD_lakes) |> 
  filter(Parameter %in% "Temp_F") 

head(lakes_temp)
View R output
##   SiteCode    SiteName UnitCode SubUnitCode  EventDate SiteType   Project
## 1   ACBUBL Bubble Pond     ACAD          NA  5/23/2006     Lake   NETN_LS
## 2   ACBUBL Bubble Pond     ACAD          NA  6/21/2006     Lake   NETN_LS
## 3   ACBUBL Bubble Pond     ACAD          NA  7/20/2006     Lake   NETN_LS
## 4   ACBUBL Bubble Pond     ACAD          NA  8/10/2006     Lake   NETN_LS
## 5   ACBUBL Bubble Pond     ACAD          NA  9/26/2006     Lake   NETN_LS
## 6   ACBUBL Bubble Pond     ACAD          NA 10/17/2006     Lake NETN+ACID
##   QCtype SampleDepth_m Parameter  Value ValueFlag FlagComments       date year
## 1      0        0.9950    Temp_F 56.102        NA           NA 2006-05-23 2006
## 2      0        0.5030    Temp_F 67.100        NA           NA 2006-06-21 2006
## 3      0        1.5405    Temp_F 75.281        NA           NA 2006-07-20 2006
## 4      0        0.5020    Temp_F 71.969        NA           NA 2006-08-10 2006
## 5      0        1.0720    Temp_F 61.988        NA           NA 2006-09-26 2006
## 6      0        0.9670    Temp_F 54.482        NA           NA 2006-10-17 2006
##   mon doy
## 1   5 143
## 2   6 172
## 3   7 201
## 4   8 222
## 5   9 269
## 6  10 290


Customizing shapes and colors manually

Now that we have the data set up, we’re going to make a line and point time series plot of temperature for the different sites.

Make a generic plot with the black and white built in theme

ggplot(lakes_temp, aes(x = date, y = Value)) + 
  theme_bw() +
  geom_point() 

Set color and symbol by SiteCode using default colors and shapes

ggplot(lakes_temp, 
       aes(x = date, y = Value, color = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_point() 
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 8 values. Consider specifying shapes manually if you need
##   that many of them.
## Warning: Removed 222 rows containing missing values or values outside the scale range
## (`geom_point()`).

Note the warning that we have 8 groups, but ggplot by default only provides 6 different shapes. The 222 rows removed correspond to the points that were dropped because they belonged to the 7th and 8th sites (ACUHAD and ACWHOL). To use 8 symbols, you have to specify them manually, which we’ll do next.

In addition, the default colors in ggplot aren’t great. Whenever you see them in a publication, you can usually tell the author either barely knows ggplot or didn’t bother to customize. We’re going to start by specifying our own colors and shapes manually. Then we’ll use color palettes from different packages.

Before you start plotting, it’s helpful to know the point symbol (pch) codes. To view them, run ?points, or search “pch in R plot” and you’ll get the figure below. Note that symbols 0-14 are outlines with no fill; to change their color, use the color aesthetic. Symbols 15-20 are solid and also use the color aesthetic. Symbols 21-25 have both a color (outline) and a fill (inside) aesthetic.

R pch symbols

Figure of symbol codes in R.


Specify manual color and shape connected to SiteCode

ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, 
                       shape = SiteCode)) + 
  theme_bw() +
  geom_point() +
  scale_color_manual(values = c("ACBUBL" = "#7FC7E1", "ACEAGL" = "#ED5C5C", 
                                "ACECHO" = "#F59617", "ACJORD" = "#A26CF5", 
                                "ACLONG" = "#1952CF", "ACSEAL" = "#F3F56C", 
                                "ACUHAD" = "#FFA6C8", "ACWHOL" = "#8FD184")) +
  scale_fill_manual(values = c("ACBUBL" = "#7FC7E1", "ACEAGL" = "#ED5C5C", 
                               "ACECHO" = "#F59617", "ACJORD" = "#A26CF5", 
                               "ACLONG" = "#1952CF", "ACSEAL" = "#F3F56C", 
                               "ACUHAD" = "#FFA6C8", "ACWHOL" = "#8FD184")) +
  scale_shape_manual(values = c("ACBUBL" = 21, "ACEAGL" = 23, 
                                "ACECHO" = 24, "ACJORD" = 25, 
                                "ACLONG" = 23, "ACSEAL" = 21, 
                                "ACUHAD" = 25, "ACWHOL" = 24))

The code above gave us a different color/fill and shape for each point, but specifying the values inline like that gets tedious fast. Imagine needing to make a plot for a dozen different parameters. A more efficient approach is to define the colors and shapes outside of ggplot, then reference them in the plot, like the example below.

Specify manual color and shape more efficiently

site_cols <- c("ACBUBL" = "#7FC7E1", "ACEAGL" = "#ED5C5C", 
               "ACECHO" = "#F59617", "ACJORD" = "#A26CF5", 
               "ACLONG" = "#1952CF", "ACSEAL" = "#F3F56C", 
               "ACUHAD" = "#FFA6C8", "ACWHOL" = "#8FD184")
site_shps <- c("ACBUBL" = 21, "ACEAGL" = 23, 
               "ACECHO" = 24, "ACJORD" = 25, 
               "ACLONG" = 23, "ACSEAL" = 21, 
               "ACUHAD" = 25, "ACWHOL" = 24)

ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_point() +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") 

Similar plot, but make the outline the same color for each point and increase point size

ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_point(color = "dimgrey", size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") 
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.

The warning appears because color is set both in the aes() and in the geom_point(), but it’s not a problem. The ggplot2 package is pretty chatty in the console. It’s good to read the warnings to make sure you didn’t drop values you wanted to plot (e.g., ACWHOL dropped in the earlier example), but often they’re harmless.

CHALLENGE: Review labels from Day 2. How would you change the x-axis label to “Year”, and the y-axis label to “Temp. (F)”?
Answer
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_point(color = "dimgrey", size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(x = "Year", y = "Temp. (F)") # answer
View R plot



Add lines and smoothers

Now we’re going to play with adding lines to the graphs. First we’ll add geom_line() to see how it looks. Notice the layer order: the line is drawn before the points, so it doesn’t cross over them.

ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_line() + # new line
  geom_point(color = "dimgrey") +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 

The lines are hard to follow, and connecting every point isn’t really what we’re looking for at this scale of the data (2006-2025). This is where we have to think about what we’re actually interested in: in this case, whether temperature is changing over time. Here geom_smooth() is really helpful. It fits and plots a trend line assuming the formula y ~ x (unless you specify a different formula). By default the method is a LOESS smoother, but you can specify a range of methods, including linear regression by adding method = 'lm' to geom_smooth().

Note that I turned off the standard error ribbon that plots by default using se = FALSE. It’s too busy for this plot, and I don’t report the SE unless I’ve fit an actual model and checked the diagnostics. The stats under the hood of geom_smooth() are also pretty black-boxy, and I don’t always know whether I can trust its SE calculation.

I added a transparency via alpha = 0.5 to the geom_point(), so the lines show up better.

Add a LOESS smoother and make points more transparent

ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5, linewidth = 1) + # new line
  geom_point(color = "dimgrey", alpha = 0.5) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 

CHALLENGE: How would you make the points less transparent and the smoothed line thinner?
Answer
# Need to run these first
site_cols <- c("ACBUBL" = "#7FC7E1", "ACEAGL" = "#ED5C5C", 
               "ACECHO" = "#F59617", "ACJORD" = "#A26CF5", 
               "ACLONG" = "#1952CF", "ACSEAL" = "#F3F56C", 
               "ACUHAD" = "#FFA6C8", "ACWHOL" = "#8FD184")
site_shps <- c("ACBUBL" = 21, "ACEAGL" = 23, 
               "ACECHO" = 24, "ACJORD" = 25, 
               "ACLONG" = 23, "ACSEAL" = 21, 
               "ACUHAD" = 25, "ACWHOL" = 24)

# Now generate the plot
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5, linewidth = 0.5) + # thinner line
  geom_point(color = "dimgrey", alpha = 0.85) + # or remove the alpha
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 
View R plot


CHALLENGE: How would you change the smooth from LOESS to linear?
Answer
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  geom_point(color = "dimgrey", alpha = 0.5) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 
View R plot



Facets and grids
Facets

The plot looks okay so far, but with so many points it’s hard to see what’s going on. This is where facets are useful: if your data have a grouping variable, in this case SiteCode, you can plot a separate panel for each level of the group. The code below plots each site separately. I used ncol = 4 to set the number of columns in the facet wrap.

Facet on SiteCode

p_site <- 
  ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                         fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = F, span = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(y = "Temperature (F)", x = "Year") +
  facet_wrap(~SiteCode, ncol = 4)
p_site

Facet on year instead of site

Faceting on year can also be a handy way to see how consistent seasonal patterns are across years. Note that we changed the x variable from date to doy (day of year) in the code below. I also filtered the dataset within the ggplot() call to include only years after 2015, increased the point size, and switched from the smoother to geom_line() to simply connect the points. We’ll revisit this plot to code more meaningful x-axis labels in a later section.

p_year <- 
  ggplot(lakes_temp |> filter(year > 2015), 
         aes(x = doy, y = Value, color = SiteCode, 
             fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_line(linewidth = 0.7) +
  geom_point(color = "dimgrey", size = 2.5) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(y = "Temperature (F)", x = "Year") +
  facet_wrap(~year, ncol = 3) 
p_year

CHALLENGE: Recreate the plot below. Note that the symbol outline is black and the alpha level is 0.6.

Answer
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = F, span = 0.5) +
  geom_point(color = "black", alpha = 0.6, size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(y = "Temp. (F)", x = "Year") +
  facet_wrap(~SiteCode, ncol = 2) +
  theme(legend.position = 'bottom')
View R plot



Grids in patchwork

Faceting is helpful when your observations are all within the same column. But say you have data in multiple columns (e.g., each water quality parameter is a column) and want to arrange those plots into a grid. Faceting won’t help, because the data to plot are in different columns. Multiple packages make it easy to arrange plots into a grid that looks similar to a faceted plot, including grid (and gridExtra), cowplot, ggpubr, and patchwork. We’re going to use patchwork, a relative newcomer and one of the easiest I’ve found to code and customize. Here we’re going to plot pH, temperature, DO, and conductance for Jordan Pond and arrange them using patchwork.

The patchwork package has a lot of options to customize plot layouts. See the patchwork package website for more information.

Prepare the data to plot

pH <- chem |> filter(SiteCode == "ACJORD") |> filter(Parameter == "pH")
temp <- chem |> filter(SiteCode == "ACJORD") |> filter(Parameter == "Temp_F")
dosat <- chem |> filter(SiteCode == "ACJORD") |> filter(Parameter == "DOsat_pct")
cond <- chem |> filter(SiteCode == "ACJORD") |> filter(Parameter == "SpCond_uScm")

p_pH <-
  ggplot(pH, aes(x = date, y = Value)) + 
  theme_bw() +
  geom_smooth(se = F, span = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) +
  labs(y = "pH", x = "Year")  

p_temp <-
  ggplot(temp, aes(x = date, y = Value)) + 
  theme_bw() +
  geom_smooth(se = F, span = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) +
  labs(y = "Temp (F)", x = "Year")  

p_do <-
  ggplot(dosat, aes(x = date, y = Value)) + 
  theme_bw() +
  geom_smooth(se = F, span = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) +
  labs(y = "DO (%sat.)", x = "Year")  

p_cond <-
  ggplot(cond, aes(x = date, y = Value)) + 
  theme_bw() +
  geom_smooth(se = F, span = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) +
  labs(y = "Spec. Cond. (uScm)", x = "Year")  

Arrange plots using patchwork

This is almost too easy to be true, but it really is this easy with patchwork. The package includes a bunch of options to customize sizes, add annotations, and share axes across plots.

library(patchwork)
p_pH + p_temp + p_do + p_cond

Arrange plots using patchwork in a column of 4 and share the x axis.

You can also collect the legend using a similar approach to collecting the axes.

library(patchwork)
p_pH / p_temp / p_do / p_cond + plot_layout(axes = "collect_x")
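Beyond +, patchwork also has the | (side by side) and / (stacked) operators, and parentheses control grouping. A quick self-contained sketch using toy plots built from the built-in mtcars data (my own example, not the water data):

```r
library(ggplot2)
library(patchwork)

# Toy plots just to demonstrate the layout operators
p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()
p3 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()

# p1 and p2 side by side on top, p3 spanning the bottom row
(p1 | p2) / p3 +
  plot_annotation(tag_levels = "a")  # tags the panels a, b, c
```

The same operators work on the p_pH, p_temp, p_do, and p_cond objects defined above.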


Axes for dates

Starting with the plot faceted on SiteCode, I don’t like that the first year in the data (2006) is missing from the axis. The x-axis is a Date type, which gives us some useful options to set breaks and labels. Below, I set the breaks to every 2 years and label only the year (%Y). I added the theme() call for axis.text.x to make the years plot vertically, and vjust = 0.5 centers the labels on the tick marks. Note that I’m assigning the code below to the object p_site2, so I don’t have to keep retyping the original ggplot code.

Improve x-axis labels

p_site2 <- p_site + 
  scale_x_date(date_labels = "%Y", date_breaks = "2 years") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

p_site2

Returning to the p_year plot faceted by year: instead of day of year on the x-axis, we want to manually set up date axis labels showing the month and day at the beginning of each month. This is a bit trickier because doy isn’t a Date type. It’s still doable, though there may be easier ways to do this.

Manually set up date labels for doy axis

# Find range of months in the data
range_mon <- range(lakes_temp$mon)
range_mon # 5:10
## [1]  5 10
# Set up the date range as a Date type
range_date <- as.Date(c("5/1/2025", "11/01/2025"), format = "%m/%d/%Y")
axis_dates <- seq.Date(range_date[1], range_date[2], by = "1 month")
axis_dates
## [1] "2025-05-01" "2025-06-01" "2025-07-01" "2025-08-01" "2025-09-01"
## [6] "2025-10-01" "2025-11-01"
axis_dates_label <- format(axis_dates, "%b-%d")

# Find the doy value that matches each of the axis dates
axis_doy <- as.numeric(format(axis_dates, "%j"))
axis_doy
## [1] 121 152 182 213 244 274 305
# Pad the x-axis limits by one day on each side, 
# otherwise May 1 gets cut off the axis
axis_doy_limits <- c(min(axis_doy) - 1, max(axis_doy) + 1)
p_year + 
  scale_x_continuous(limits = axis_doy_limits,
                     breaks = axis_doy, 
                     labels = axis_dates_label) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

CHALLENGE: How would you change the plot called p_site2 so that x-axis labels were every 4 years, and angle is 45 degrees instead of 90?
Hint: start with p_site, so you don’t have to write that much code. You’ll also need to tweak vjust and hjust to make the labels line up properly.
Answer
p_site + 
  scale_x_date(date_labels = "%Y", date_breaks = "4 years") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
View R plot



Legends

One of the easier legend tasks is changing its location. In the next plot we’ll put the legend on the bottom. If you don’t want to plot a legend at all (you don’t really need one when lakes are faceted, for example), you can turn it off using legend.position = 'none'.

Move legend to bottom

p_site3 <- p_site2 + theme(legend.position = 'bottom')
p_site3

The plots above made legends seem easy, but legends in ggplot can be really tedious. For example, a legend only shows up for grouping variables set inside aes(). There was no legend in the very first plot we made because everything was the same color and symbol. And if you want to plot threshold lines, the aesthetic you map them to needs its own scale_*_manual() for them to show in the legend. Let’s pretend, for example, that 50F and 75F are lower and upper water quality thresholds that we want to plot.

Add horizontal threshold lines to the plot

p_site3 + geom_hline(yintercept = 75, linetype = 'dashed', linewidth = 1) +
          geom_hline(yintercept = 50, linetype = 'dotted', linewidth = 1)

These lines don’t show in the legend. To make them show, you need to wrap them in aes(), as below. Note the difference in how linetype is used: in the example above, linetype specifies whether the line is dashed or dotted, while inside aes(), linetype labels the line in the legend. We then use scale_linetype_manual() to set the actual line type to plot. Admittedly, it took me a couple of Stack Overflow posts to figure out how to make this work properly. This stuff can be tedious.

Add horizontal threshold lines to the plot and legend

p_site4 <- p_site3 + 
  geom_hline(aes(yintercept = 75, linetype = "Upper"), linewidth = 1) +
  geom_hline(aes(yintercept = 50, linetype = "Lower"), linewidth = 1) +
  scale_linetype_manual(values = c("dashed", "dotted"), name = "WQ Threshold")

p_site4

Another option is to turn certain geoms off in the legend via show.legend = F. Let’s say we don’t want the smoothed lines in the legend; I have to go back to the full code to change that geom’s legend settings. I’m also overriding the alpha level of the symbols in the legend, so they show up better there.

Remove smoothed lines from legend and increase alpha of symbols in legend

ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = F, span = 0.5, show.legend = FALSE) + 
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) + 
  
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(y = "Temperature (F)", x = "Year") +
  facet_wrap(~SiteCode, ncol = 4) + 
  scale_x_date(date_labels = "%Y", date_breaks = "2 years") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5), 
        legend.position = 'bottom') + 
  
  geom_hline(aes(yintercept = 75, linetype = "Upper"), linewidth = 1) +
  geom_hline(aes(yintercept = 50, linetype = "Lower"), linewidth = 1) +
  scale_linetype_manual(values = c("dashed", "dotted"), name = "WQ Threshold") +
  
  guides(fill = guide_legend(override.aes = list(alpha = 1)))

Another option is to remove the SiteCode fill, color, and shape from the legend via guides(). You can also do this by turning a geom off using show.legend = F like we did with the smoother. In the code below we are telling ggplot that any shape, color, or fill used in an aes() should not be included in the legend. The WQ thresholds still show up because their aesthetic is linetype.

Finally, the width of the lines in the WQ thresholds makes it appear that the Lower threshold is solid instead of dashed. I’ll use an option in theme to make the key wider.

Remove SiteCode keys from legend and increase key width of lines.

p_site4 + guides(shape = 'none', color = 'none', fill = 'none') +
          theme(legend.key.width = unit(0.8, "cm"))


ggplot Palettes

External ggplot Palettes A number of R packages provide color palettes that are colorblind friendly. The two most commonly used are RColorBrewer and viridis. Both packages include palettes for three main types of data:
  • Sequential: for continuous variables that grade from low to high or vice versa. These tend to be one hue that increases in saturation as values increase.
  • Diverging: for data where low values have different colors than high values (e.g. starts red, then grades to blue).
  • Qualitative: for categorical variables, where all colors have similar saturation but different hues.
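To get a feel for these three types without any extra packages, here's a minimal base R sketch (grDevices only; the endpoint colors are arbitrary choices for illustration):

```r
# Sequential: one hue grading from light to dark
seq_pal <- colorRampPalette(c("#FFFFCC", "#E31A1C"))(5)

# Diverging: one hue for low values, another for high, neutral midpoint
div_pal <- colorRampPalette(c("#2166AC", "#F7F7F7", "#B2182B"))(5)

# Qualitative: distinct hues at similar saturation
# (Okabe-Ito is a colorblind-friendly palette built into grDevices)
qual_pal <- palette.colors(5, palette = "Okabe-Ito")

seq_pal # five hex codes grading from pale yellow to red
```

Any of these vectors can be passed to scale_color_manual(), the same way as the hand-picked site_cols vector used earlier.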
RColorBrewer palette

The palettes in RColorBrewer can be viewed by running the code below. The first group shows the sequential palettes (e.g. YlOrRd - Yellow Orange Red). The second group shows the qualitative colors. The last group shows the diverging palettes. The main drawback of these palettes is they are limited by the number of levels in your data. So, if you specify Set2 to color code different levels of a factor, there are only 8 colors available to you. If your factor has more than 8 levels (e.g., 9 sites, 10 parks, etc.), then the levels beyond 8 won’t get plotted and you’ll get a warning in the console similar to what we saw for ggplot’s default number of symbols.

View RColorBrewer palettes

display.brewer.all(colorblindFriendly = TRUE)
RColorBrewer palettes
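If you do need more levels than a brewer palette offers, one common workaround is to interpolate the palette with base R's colorRampPalette(). A sketch (it uses grDevices::palette.colors(), which ships the brewer Set 2 colors, so it runs without RColorBrewer; with the package loaded you'd use brewer.pal(8, "Set2") instead):

```r
# Set2 tops out at 8 colors; interpolate the ramp to get 12
set2 <- palette.colors(8, palette = "Set 2")
set2_12 <- colorRampPalette(set2)(12)
length(set2_12) # 12
```

Keep in mind that interpolating a qualitative palette makes neighboring colors less distinct, so this is a workaround rather than a substitute for a larger palette.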


Going back to the temperature plots we made before, we'll use RColorBrewer to color code each site instead of doing this manually. We'll build the base plot in the next chunk, then swap in different color palettes in later plots.


Create basic plot

p_pal <- ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                                fill = SiteCode, shape = SiteCode)) + 
         theme_bw() +
         geom_smooth(se = F, span = 0.5, linewidth = 1) + 
         geom_point(color = "dimgrey", alpha = 0.5, size = 2) + 
  
         scale_shape_manual(values = site_shps, name = "Lake") +
         labs(y = "Temperature (F)", x = "Year") +
  
         facet_wrap(~SiteCode, ncol = 4) + 
  
         scale_x_date(date_labels = "%Y", date_breaks = "2 years") +
         theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + 
  
         geom_hline(aes(yintercept = 75, linetype = "Upper"), linewidth = 1) +
         geom_hline(aes(yintercept = 50, linetype = "Lower"), linewidth = 1) +
         scale_linetype_manual(values = c("dashed", "dotted"), name = "WQ Threshold")


Use Set2 palette on temperature plot and remove transparency of symbols in legend

p_pal +  scale_color_brewer(name = "Lake", palette = "Set2", aesthetics = c("fill", "color"))  +
         guides(fill = guide_legend(override.aes = list(alpha = 1))) # solid symbols in legend

Note how I used the aesthetics argument in scale_color_brewer() to set fill and color at the same time. We could have done this in the code above too. I also used override.aes in the guide to make the symbols in the legend opaque, so they're easier to see.


Use Dark2 palette on temperature plot

p_pal +  scale_color_brewer(name = "Lake", palette = "Dark2", aesthetics = c("fill", "color"))  +
         guides(fill = guide_legend(override.aes = list(alpha = 1))) # makes symbols not transparent in legend


CHALLENGE: How would you specify the ‘RdYlBu’ palette instead of the ones used above?
Hint: Start with p_pal to save time coding.
Answer
p_pal +  scale_color_brewer(name = "Lake", palette = "RdYlBu", aesthetics = c("fill", "color"))  +
         guides(fill = guide_legend(override.aes = list(alpha = 1))) # makes symbols not transparent in legend
View R plot


viridis palettes

The viridis package comes with 8 palettes. The benefit of viridis is that the number of levels isn't limited to 8 like RColorBrewer. The palette options are shown below with 12 levels.

viridis palettes


View viridis palettes with hexcodes

You can view the hexcodes of the different palettes by running the code below. Just change viridis() to one of the other palette names to get the hexcodes for those levels.

# viridis 
scales::show_col(viridis(12), cex_label = 0.45, ncol = 6)
viridis palette

Use viridis default palette on temperature plot

The scale_color_viridis_d() selects the viridis palette option (purple, green, yellow) for discrete values (i.e. categories). For a continuous scale (e.g. temperature), you would specify scale_color_viridis_c().

p_pal + scale_color_viridis_d(name = "Lake", aesthetics = c("fill", "color"))  #default viridis 

Use turbo palette on temperature plot

The option argument selects one of the other viridis palettes; here we use 'turbo'.

p_pal + scale_color_viridis_d(name = "Lake", aesthetics = c("fill", "color"), option = 'turbo') 

Continuous palette with heatmaps

Heatmaps via geom_tile() are a place where viridis's sequential and diverging palettes are especially helpful. We'll use the temperature data to plot heatmaps by month for each site. Heatmaps are a bit different than the other plots we've seen: the x and y values create a discrete grid, and the color of each cell represents the value at that combination of x and y. That means we have to change how the x, y, and color aesthetics are specified. Here we will plot temperature by month and year, faceted on site.

Basic heatmap code

Note the use of base R's month.abb to set the labels on the x-axis. month.abb is a built-in vector of the 12 month names abbreviated to 3 letters; subsetting with [5:10] takes the months May through October.
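As a quick check, here's what that subsetting returns in a fresh R session:

```r
month.abb       # "Jan" "Feb" "Mar" ... "Dec"
month.abb[5:10] # "May" "Jun" "Jul" "Aug" "Sep" "Oct"
```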

p_heat <- 
ggplot(lakes_temp, aes(x = mon, y = year, color = Value, fill = Value)) + 
  theme_bw() +
  geom_tile() + 
  labs(y = "Year", x = "Month") +
  facet_wrap(~SiteCode, ncol = 4) + 
  scale_x_continuous(breaks = c(5, 6, 7, 8, 9, 10),
                     limits = c(4, 11), 
                     labels = month.abb[5:10]) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Plot heatmap with viridis continuous palette

p_heat + scale_color_viridis_c(name = "Temp. (F)", aesthetics = c("fill", "color")) 

Plot heatmap with plasma continuous palette, reverse scale

p_heat + scale_color_viridis_c(name = "Temp. (F)", aesthetics = c("fill", "color"), 
                               option = "plasma", direction = -1) 


Create your own color ramp

You can also create your own color ramp via scale_color_gradient(), which creates a 2-color gradient; scale_color_gradient2(), which creates a diverging color gradient (low-mid-high); and scale_color_gradientn(), which creates an n-color gradient.

Create 2-color gradient

p_heat + scale_color_gradient(low = "#FCFC9A", high = "#F54927", 
                              aesthetics = c("fill", 'color'), 
                              name = "Temp. (F)") 

Create diverging gradient

For the divergent palette to be meaningful, you usually need to set the midpoint if it’s not 0.

p_heat + scale_color_gradient2(low = "navy", mid = "#FCFC9A", high = "#F54927", 
                               aesthetics = c("fill", 'color'),
                               midpoint = mean(lakes_temp$Value), 
                               name = "Temp. (F)") 

Create diverging gradient with multiple colors

Note the change in the legend from using guide = 'legend'; the default is guide = 'colorbar'. I also customized the breaks into 5-degree bins using the breaks argument with seq().

p_heat + scale_color_gradientn(colors = c("#805A91", "#406AC2", "#FBFFAD", "#FFA34A", "#AB1F1F"), 
                               aesthetics = c("fill", 'color'),
                               guide = "legend",
                               breaks = c(seq(40, 85, 5)), 
                               name = "Temp. (F)") 

CHALLENGE: Create your own palette with at least three colors.
Hint: Start with p_heat to save time coding.
Answer
p_heat + scale_color_gradient2(low = "#3E693D", mid = "#FDFFC7", high = "#7A6646", 
                               aesthetics = c("fill", 'color'),
                               midpoint = mean(lakes_temp$Value), 
                               name = "Temp. (F)") 
View R plot



Coding Best Practices

Background

Knowing how to code is only part of being a good coder. Below are general best practices that make code easier to run, easier to understand, and more stable, at a relatively low maintenance cost. Many of these suggestions come from lessons learned working with my own and other people's code. R for Data Science also has a lot of great information on coding best practices.


Tips for good code
Thorough commenting, with dependencies (packages, datasets, and parameters) at the top.
# libraries
library(dplyr) # for mutate and filter

# parameters
analysis_year <- 2017

# data sets
df <- read.csv("./data/ACAD_wetland_data_clean.csv")

# Filtering on RAM sites, create new site as a number column, and only include data from specified year
df2 <- df |> filter(Site_Type == "RAM") |> 
             filter(Year == analysis_year) |> 
             mutate(site_num = as.numeric(substr(Site_Name, 5, 6)))
Descriptive names and consistent case

Object names must start with a letter and can only contain letters, numbers, underscore, and period. Spaces aren’t allowed in object names, and are best avoided in column names of data frames too. Descriptive object names will help you digest code, and often you’ll want more than one word in the name. There are multiple cases that people tend to use, the most common of which tends to be snake_case. Other examples are below.

snake_case # most common in R
camelCase # capitalize new words after the first
period.separation # separate words by periods
whyWOULDyouDOthisTOsomeone # excess capitalization is a pain
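If you inherit data whose column names break these rules (spaces, names starting with numbers), base R's make.names() will convert them to syntactically valid names. A small sketch with made-up column names:

```r
raw_names <- c("plot name", "2021 cover", "Site_Name")
make.names(raw_names)
# "plot.name" "X2021.cover" "Site_Name"

# Applied to a data frame: names(df) <- make.names(names(df))
```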
Thoughtful word ordering

Order words in names so that objects that are similar or derived from each other sort together. This also makes coding easier, because like objects will appear together in the autocomplete popups you see as you code.

# good word order
ACAD_wet <- data.frame(year = 2020:2025, plot = 1:6)
ACAD_wet2 <- ACAD_wet |> filter(year > 2020)
ACAD_wet3 <- ACAD_wet2 |> mutate(plot_type = "RAM")

# bad word order
wet_ACAD <- data.frame(year = 2020:2025, plot = 1:6)
ACAD_after_2020 <- wet_ACAD |> filter(year > 2020)
RAM_ACAD_2020 <- ACAD_after_2020 |> mutate(plot_type = "RAM")
Avoid long names

It’s helpful to balance descriptive names with length. The longer the object name, the more typing you have to do to refer to that object. Coding long names, such as long column names in data frames, is cumbersome and inefficient. Compare the two objects below. While I doubt many would make super long object names like this, I commonly see excessively long column names in data packages. Limiting column names to 12 characters or less is super helpful for coders using those data.

# super long names
ACAD_wetland_sampling_data <- data.frame(years_plots_were_sampled = c(2020:2025), wetland_plots_sampled = c(1:6))
ACAD_wetland_sampling_data2 <- ACAD_wetland_sampling_data |> filter(years_plots_were_sampled > 2020)

# shorter still meaningful
ACAD_wet <- data.frame(year = 2020:2025, plot = 1:6)
ACAD_wet2 <- ACAD_wet |> filter(year > 2020)
Use a consistent code style

Code style refers to consistent use of case, indenting, spacing, line width, etc. There are several style conventions out there. I tend to use the tidyverse style guide, which is based on Google’s R style guide.

Style conventions I follow:
  • Space before and after operators, like <-, =, ==, |>, +, etc.
  • Space after commas
  • Keep line width narrow enough to prevent scrolling to the right to view code
  • Indent code in the same function or list together
  • One pipe per line

Example 1. Style for pipes

# Good code
trees_final <- trees |> 
  mutate(DecayClassCode_num = as.numeric(DecayClassCode),
         Plot_Name = paste(ParkUnit, PlotCode, sep = "-"),
         Date = as.Date(SampleDate, format = "%m/%d/%Y")) |> 
  rename("Species" = "ScientificName") |> 
  filter(IsQAQC == FALSE) |> 
  select(-DecayClassCode) |> 
  arrange(Plot_Name, TagCode)

# Same code, but much harder to follow
trees_final <- trees|>mutate(DecayClassCode_num=as.numeric(DecayClassCode), Plot_Name=paste(ParkUnit,PlotCode,sep = "-"),  Date=as.Date(SampleDate,format="%m/%d/%Y"))|> rename("Species"="ScientificName")|>filter(IsQAQC==FALSE)|>select(-DecayClassCode)|>arrange(Plot_Name,TagCode)

Example 2. Style for ggplot object

# Good code
ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000)) +
  geom_line() + 
  geom_point(color = "black", fill = "#82C2a3", size = 2.5, shape = 24) +
  labs(x = "Year", 
       y = "Annual visitors in 1000's") +
  scale_y_continuous(limits = c(2000, 4500),
                     breaks = seq(2000, 4500, by = 500)) + 
  scale_x_continuous(limits = c(1994, 2024),
                     breaks = c(seq(1994, 2024, by = 5))) + 
  theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(), 
        panel.background = element_rect(fill = 'white', color = 'dimgrey'),
        title = element_text(size = 10) 
        )

# Same code but hard to follow
ggplot(data=visits,aes(x=Year,y=Annual_Visits/1000))+geom_line()+geom_point(color="black",fill="#82C2a3",size=2.5,shape=24) +
labs(x = "Year", y = "Annual visitors in 1000's")+
scale_y_continuous(limits=c(2000,4500),breaks=seq(2000,4500,by=500))+ 
scale_x_continuous(limits=c(1994,2024),breaks=c(seq(1994,2024,by=5)))+ 
theme(axis.text.x=element_text(size=10,angle=45,hjust=1), panel.grid.major=element_blank(), 
panel.grid.minor=element_blank(),panel.background=element_rect(fill='white',color='dimgrey'),
title = element_text(size = 10))

Higher level practices
Logical file naming

Using projects instead of stand-alone scripts helps keep the various pieces of an analysis project in one place and makes them more easily transferable across computers. Logical naming of scripts, so they sort easily, is also helpful.

Order and purpose of file names easy to follow

Logical file naming

Hard to know script order and purpose

Logical file naming
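As a hypothetical example of the first approach, numbering scripts in run order makes the workflow self-documenting (file names invented for illustration):

```
01_import_data.R
02_clean_wetland_data.R
03_summarize_cover.R
04_plot_figures.R
```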

Caution choosing packages for core work R packages add a ton of functionality that isn't available in base R. They save us a lot of work building tasks from scratch and are the product of developers sharing their work for free to the benefit of the rest of us. In that way, R packages are amazing. However, there is a dark side to R packages. While base R code is backwards compatible (anything built in R 1.0 should run without breaking in R 5.0), R packages generally don't come with that promise. The more packages your code uses, the more susceptible your code is to changes in package dependencies that break it. For one-time tasks I don't expect to repeat, or when I don't have time to build the thing a package does for me, I am pretty package promiscuous. For coding tasks I expect to perform repeatedly, and that therefore carry a maintenance cost, I use packages sparingly. Here's how I decide whether or not to trust a package:
  • Package is hosted on CRAN. Hosting packages on CRAN is a high bar: packages have to meet certain standards and go through rigorous testing before they're accepted. This means there's likely a long-term plan for the package to be maintained and updated as needed.
  • If a package is instead on GitHub.com or another coding repository, counting on it is a bit riskier. I still use GitHub packages, but I look for ones that have had active development within the past year or so and that have good help documentation. That usually means there's a good long-term plan for maintaining the package, and it's less likely to disappear or have sloppy code-breaking changes.
  • The tidyverse collection of packages is incredible, but there are pros and cons. While some packages and functions have been really stable, like dplyr and ggplot2, there’s a lot of active development in tidyverse packages. Developers do a good job documenting the lifecycle of functions in help documentation, which I encourage you to pay attention to when you’re coding. I have had updates to tidyverse packages break my code, mostly because functions have become stricter in what they accept over time. The lubridate package, for example, has burned me a couple of times, so I primarily use base R code to work with date times.
  • Packages that have a lot of other dependencies add to the risk of code-breaking changes. You can view dependencies in the DESCRIPTION file of every package; those listed under Imports or Depends are the primary dependencies.
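You can also inspect a package's declared dependencies from within R using base utils; a sketch using the built-in stats package (always installed), so it runs anywhere:

```r
# Read fields straight from the installed package's DESCRIPTION
desc <- utils::packageDescription("stats")
desc$Package # "stats"
desc$Depends # primary dependencies (NULL if the field is absent)
desc$Imports
```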


Challenges

Day 1 Questions

Load Data

If you’re starting a new R session to answer these questions, you’ll need to read in the wetland and tree data frames again.

Read in example ACAD wetland data from url

ACAD_wetland <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_wetland_data_clean.csv"
  )

Read in example NETN tree data from url

trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")


Data Structures

How would you look at the first 4 even rows (2, 4, 6, 8) and the first 2 columns?

Answer
ACAD_wetland[c(2, 4, 6, 8), c(1, 2)]
## # A tibble: 4 × 2
##   Site_Name Site_Type
##   <chr>     <chr>    
## 1 SEN-01    Sentinel 
## 2 SEN-01    Sentinel 
## 3 SEN-01    Sentinel 
## 4 SEN-01    Sentinel

How many unique species are there in the ACAD_wetland data frame?

Answer
# Option 1
length(unique(ACAD_wetland[, "Latin_Name"]))
# Option 2
length(unique(ACAD_wetland$Latin_Name)) # equivalent
## [1] 133

Which sites have species that are considered protected on them (Protected = TRUE)?

Answer
# Option 1 - used unique to just return unique site name
unique(ACAD_wetland$Site_Name[ACAD_wetland$Protected == TRUE])
# Option 2
unique(ACAD_wetland[ACAD_wetland$Protected == TRUE, "Site_Name"])
## # A tibble: 4 × 1
##   Site_Name
##   <chr>    
## 1 SEN-01   
## 2 SEN-02   
## 3 RAM-53   
## 4 RAM-05


Data Exploration
How many trees are on Plot 12?
Answer
mima12 <- subset(trees, PlotCode == 12)
nrow(mima12) # 12
View R output
## [1] 12

How many trees with a TreeStatusCode of “AS” (alive standing) are on Plot 12?
Answer

Option 1. Subset data then calculate number of rows

mima12_as <- subset(trees, PlotCode == 12 & TreeStatusCode == "AS")
nrow(mima12_as) # 6
## [1] 6

Option 2. Subset the data with brackets and use the table() function to tally status codes.

mima12 <- trees[trees$PlotCode == 12, ]
table(mima12$TreeStatusCode) # AS should be 6

Note that the Day 1 trees data frame doesn't have a Plot_Name column (we create that on Day 2), so we subset on PlotCode instead.

Find the record with DBH > 400 cm.

Answer

There are multiple ways to do this. Two examples are below.

Option 1. View the data and sort by DBH.

View(trees)

Option 2. Find the max DBH value and subset the data frame

max_dbh <- max(trees$DBHcm, na.rm = TRUE)
trees[trees$DBHcm == max_dbh,]
##    ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode   TSN ScientificName
## 26     MIMA       16  6/17/2025  FALSE       2025       1 19447  Quercus robur
##    DBHcm TreeStatusCode CrownClassCode DecayClassCode
## 26   443             AS              3           <NA>

What is the exact value of the largest DBH, and which record does it belong to?

Answer

There are multiple ways to do this. Two examples are below.

Option 1. View the data and sort by DBH.

View(trees)

Option 2. Find the max DBH value and subset the data frame

max_dbh <- max(trees$DBHcm, na.rm = TRUE)
max_dbh #443
View R output
## [1] 443

trees[trees$DBHcm == max_dbh,]
View R output
##    ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode   TSN ScientificName
## 26     MIMA       16  6/17/2025  FALSE       2025       1 19447  Quercus robur
##    DBHcm TreeStatusCode CrownClassCode DecayClassCode
## 26   443             AS              3           <NA>

# ParkUnit MIMA, PlotCode 16, TagCode 1.

Fix the DBH typo by replacing 443.0 with 44.3.

Answer

Let’s say that you looked at the datasheet, and the actual DBH for that tree was 44.3 instead of 443.0. You can change that value in the original CSV by hand. But even better is to document that change in code. There are multiple ways to do this. Two examples are below.

But first, it’s good to create a new data frame when modifying the original data frame, so you can refer back to the original if needed. I also use a really specific filter to make sure I’m not accidentally changing other data.

Replace 443 with 44.3

# create copy of trees data
trees_fix <- trees

# find the problematic DBH value, and change it to 44.3
trees_fix$DBHcm[trees_fix$ParkUnit == "MIMA" & trees_fix$PlotCode == 16 & trees_fix$TagCode == 1 & trees_fix$DBHcm == 443] <- 44.3

Check that it worked by showing the range of the original and fixed data frames.

range(trees$DBHcm)
## [1]  10 443
range(trees_fix$DBHcm)
## [1] 10.0 81.5


Basic Plotting

Plot a histogram of percent cover (Ave_Cov) in the ACAD_wetland data frame.

Answer
hist(ACAD_wetland$Ave_Cov)
View R plot



Day 2 Questions

Load Data and Packages If you’re starting a new session to answer these questions, you’ll need to load dplyr and read in the tree and wetland data frames again.

Data and packages for Wrangling, Conditionals, and Summarizing sections

Load dplyr

library(dplyr)

Read in example NETN tree data from url

trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")

Create the tree_final data frame

trees_final <- trees |> 
  mutate(DecayClassCode_num = as.numeric(DecayClassCode),
         Plot_Name = paste(ParkUnit, PlotCode, sep = "-"),
         Date = as.Date(SampleDate, format = "%m/%d/%Y")) |> 
  rename("Species" = "ScientificName") |> 
  filter(IsQAQC == FALSE) |> 
  select(-DecayClassCode)

Read in example ACAD wetland data from url

ACAD_wetland <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_wetland_data_clean.csv"
  )

Data and packages for plotting with ggplot2

Load packages

library(ggplot2)
library(dplyr) # for mutate

Prep the data

visits <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_annual_visits.csv")
visits <- visits |> mutate(Annual_Visits = as.numeric(gsub(",", "", Annual_Visits)))


Data Wrangling with dplyr

How many trees are on Plot MIMA-12 (using trees_final)?

Answer
trees_final |> filter(Plot_Name == "MIMA-12") |> nrow()
View R output
## [1] 12

How many trees with a TreeStatusCode of “AS” (alive standing) are on Plot MIMA-12 (using trees_final)?
Answer
trees_final |> filter(Plot_Name == "MIMA-12" & TreeStatusCode == "AS") |> nrow()
## [1] 6

What is the exact value of the largest DBH, and which record does it belong to?

Answer
# Base R and dplyr combo
max_dbh <- max(trees_final$DBHcm, na.rm = TRUE)
trees_final |> 
  filter(DBHcm == max_dbh) |> 
  select(Plot_Name, SampleYear, TagCode, Species, DBHcm)
View R output
##   Plot_Name SampleYear TagCode       Species DBHcm
## 1   MIMA-16       2025       1 Quercus robur   443

# dplyr with slice
trees_final |> 
  arrange(desc(DBHcm)) |> # arrange DBHcm high to low via desc()
  slice(1) |> # slice the top record
  select(Plot_Name, SampleYear, TagCode, Species, DBHcm)
View R output
##   Plot_Name SampleYear TagCode       Species DBHcm
## 1   MIMA-16       2025       1 Quercus robur   443

Fix the DBH typo by replacing 443.0 with 44.3.

Answer
# Base R
# create copy of trees_final data (it has the Plot_Name column)
trees_fix <- trees_final
# find the problematic DBH value, and change it to 44.3
trees_fix$DBHcm[trees_fix$Plot_Name == "MIMA-16" & trees_fix$TagCode == 1 & trees_fix$DBHcm == 443] <- 44.3

# dplyr via replace
trees_fix <- trees |> mutate(DBHcm = replace(DBHcm, DBHcm == 443.0, 44.3))

Check that it worked by showing the range of the original and fixed data frames.

range(trees$DBHcm)
## [1]  10 443
range(trees_fix$DBHcm)
## [1] 10.0 81.5


Conditionals

Using the ACAD_wetland data, create a new column called Status that has “protected” for Protected = TRUE and “public” values for Protected = FALSE.

Answer
# read in wetland data if you don't already have it loaded.
ACAD_wetland <- read.csv("./data/ACAD_wetland_data_clean.csv")

# Base R using the with() function
ACAD_wetland$Status <- with(ACAD_wetland, ifelse(Protected == TRUE, "protected", "public"))
table(ACAD_wetland$Status, ACAD_wetland$Protected)
# Tidyverse
ACAD_wetland <- ACAD_wetland |> mutate(Status = ifelse(Protected == TRUE, "protected", "public"))
table(ACAD_wetland$Status, ACAD_wetland$Protected) # check your work
View R output
##            
##             FALSE TRUE
##   protected     0    9
##   public      499    0

Using the ACAD_wetland data, create a new column called abundance_cat that has levels High, Medium, Low, based on Ave_Cov, where “High” is >50%, “Medium” is 10-50%, and “Low” is < 10%.
Answer
# Base R using the with() function and nested ifelse()
ACAD_wetland$abundance_cat <- with(ACAD_wetland, ifelse(Ave_Cov < 10, "Low",
                                                        ifelse(Ave_Cov >= 10 & Ave_Cov <= 50, "Medium", "High")))
# Tidyverse using case_when() and between
ACAD_wetland <- ACAD_wetland |> mutate(abundance_cat = case_when(Ave_Cov < 10 ~ "Low",
                                                                 between(Ave_Cov, 10, 50) ~ "Medium", 
                                                                 TRUE ~ "High"))
table(ACAD_wetland$abundance_cat)
View R output
## 
##   High    Low Medium 
##      6    464     38

Note the use of the between() function, which saves typing. It matches inclusively, as >= and <=.
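Under the hood, between(x, left, right) is equivalent to an inclusive base R comparison, which you can verify without dplyr:

```r
x <- c(5, 10, 30, 50, 51)
x >= 10 & x <= 50 # what dplyr::between(x, 10, 50) returns
# FALSE TRUE TRUE TRUE FALSE
```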


Summarizing
CHALLENGE: Using the ACAD_wetland data, sum the percent cover of native vs. invasive species per plot (use the Ave_Cov column). Note that Invasive = TRUE is invasive and FALSE is native.
Answer
# Using group_by()
ACAD_inv <- ACAD_wetland |> group_by(Site_Name, Year, Invasive) |> 
  summarize(Pct_Cov = sum(Ave_Cov), 
            .groups = 'drop') |>  # optional line to keep console from being chatty
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_inv)
View R output
## # A tibble: 6 × 4
##   Site_Name  Year Invasive Pct_Cov
##   <chr>     <int> <lgl>      <dbl>
## 1 RAM-05     2012 FALSE     155.  
## 2 RAM-05     2017 FALSE     152.  
## 3 RAM-05     2017 TRUE        0.06
## 4 RAM-41     2012 FALSE      48.6 
## 5 RAM-41     2017 FALSE     107.  
## 6 RAM-41     2017 TRUE       10.2

# Using summarize(.by)
ACAD_inv2 <- ACAD_wetland |> 
  summarize(Pct_Cov = sum(Ave_Cov), .by = c(Site_Name, Year, Invasive)) |> 
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_inv2) # same values as ACAD_inv, though row order within a site may differ
View R output
## # A tibble: 6 × 4
##   Site_Name  Year Invasive Pct_Cov
##   <chr>     <int> <lgl>      <dbl>
## 1 RAM-05     2012 FALSE     155.  
## 2 RAM-05     2017 FALSE     152.  
## 3 RAM-05     2017 TRUE        0.06
## 4 RAM-41     2017 FALSE     107.  
## 5 RAM-41     2012 FALSE      48.6 
## 6 RAM-41     2017 TRUE       10.2

CHALLENGE: Using the ACAD_wetland data, count the number of native vs. invasive species per plot. Note that Invasive = TRUE is invasive and FALSE is native.
Answer
# Using group_by()
ACAD_spp <- ACAD_wetland |> group_by(Site_Name, Year, Invasive) |> 
  summarize(num_spp = n(), 
            .groups = 'drop') |>  # optional line to keep console from being chatty
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_spp)
View R output
## # A tibble: 6 × 4
##   Site_Name  Year Invasive num_spp
##   <chr>     <int> <lgl>      <int>
## 1 RAM-05     2012 FALSE         44
## 2 RAM-05     2017 FALSE         53
## 3 RAM-05     2017 TRUE           1
## 4 RAM-41     2012 FALSE         33
## 5 RAM-41     2017 FALSE         39
## 6 RAM-41     2017 TRUE           1

# Using summarize(.by)
ACAD_spp2 <- ACAD_wetland |> 
  summarize(num_spp = n(), .by = c(Site_Name, Year, Invasive)) |> 
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_spp2) # same values as ACAD_spp, though row order within a site may differ
View R output
## # A tibble: 6 × 4
##   Site_Name  Year Invasive num_spp
##   <chr>     <int> <lgl>      <int>
## 1 RAM-05     2012 FALSE         44
## 2 RAM-05     2017 FALSE         53
## 3 RAM-05     2017 TRUE           1
## 4 RAM-41     2017 FALSE         39
## 5 RAM-41     2012 FALSE         33
## 6 RAM-41     2017 TRUE           1

Using ACAD_wetland data, calculate relative % cover of species within each site. This one is challenging!
Answer

Most efficient solution figured out during training

# using the .by within mutate (newer solution)
ACAD_wetland <- ACAD_wetland |> 
  mutate(Site_Cover = sum(Ave_Cov), 
         .by = c(Site_Name, Year)) |> 
  mutate(rel_cov = (Ave_Cov/Site_Cover)*100,
         .by = c(Site_Name, Year, Latin_Name, Common))

Original Solution: First sum site-level cover using mutate to return a value for every original row.

# older solution
ACAD_wetland <- ACAD_wetland |> group_by(Site_Name, Year) |> 
  mutate(Site_Cover = sum(Ave_Cov)) |> 
  ungroup() # good practice to ungroup after group.

table(ACAD_wetland$Site_Name, ACAD_wetland$Site_Cover) # check that each site has a unique value.
View R output
##         
##          48.56 70.6 104.78 106.72 111.4 117.24 152.1 153.8 155.42 165.64 178.34
##   RAM-05     0    0      0      0     0      0    54     0     44      0      0
##   RAM-41    33    0      0      0     0     40     0     0      0      0      0
##   RAM-44     0    0     45      0     0      0     0     0      0      0     34
##   RAM-53     0    0      0      0     0      0     0     0      0      0      0
##   RAM-62     0    0      0      0     0      0     0    26      0     26      0
##   SEN-01     0   34      0      0     0      0     0     0      0      0      0
##   SEN-02     0    0      0     41     0      0     0     0      0      0      0
##   SEN-03     0    0      0      0    33      0     0     0      0      0      0
##         
##          188.84 196.52
##   RAM-05      0      0
##   RAM-41      0      0
##   RAM-44      0      0
##   RAM-53     48     50
##   RAM-62      0      0
##   SEN-01      0      0
##   SEN-02      0      0
##   SEN-03      0      0

head(ACAD_wetland)
View R output
## # A tibble: 6 × 15
##   Site_Name Site_Type Latin_Name Common  Year PctFreq Ave_Cov Invasive Protected
##   <chr>     <chr>     <chr>      <chr>  <int>   <int>   <dbl> <lgl>    <lgl>    
## 1 SEN-01    Sentinel  Acer rubr… red m…  2011       0    0.02 FALSE    FALSE    
## 2 SEN-01    Sentinel  Amelanchi… servi…  2011      20    0.02 FALSE    FALSE    
## 3 SEN-01    Sentinel  Andromeda… bog r…  2011      80    2.22 FALSE    FALSE    
## 4 SEN-01    Sentinel  Arethusa … drago…  2011      40    0.04 FALSE    TRUE     
## 5 SEN-01    Sentinel  Aronia me… black…  2011     100    2.64 FALSE    FALSE    
## 6 SEN-01    Sentinel  Carex exi… coast…  2011      60    6.6  FALSE    FALSE    
## # ℹ 6 more variables: X_Coord <dbl>, Y_Coord <dbl>, Status <chr>,
## #   abundance_cat <chr>, Site_Cover <dbl>, rel_cov <dbl>

Next calculate relative cover grouped on Site_Name, Year, Latin_Name, and Common

# Create a new dataset, because summarize() collapses rows to the grouping variables
# Using group_by() and summarize()
ACAD_wetland_relcov <- ACAD_wetland |> group_by(Site_Name, Year, Latin_Name, Common) |> 
  summarize(rel_cov = (Ave_Cov/Site_Cover)*100,
            .groups = 'drop') # .groups = 'drop' already ungroups, so a separate ungroup() isn't needed

Equivalent solution using summarize() with the .by argument, followed by a check that relative cover sums to 100% within each site

# Using summarize(.by = ) 
ACAD_wetland_relcov2 <- ACAD_wetland |> #group_by(Site_Name, Year, Latin_Name, Common) |> 
  summarize(rel_cov = Ave_Cov/Site_Cover, 
            .by = c("Site_Name", "Year", "Latin_Name", "Common"))

# Check that your relative cover sums to 100 for each site
relcov_check <- ACAD_wetland_relcov2 |> group_by(Site_Name, Year) |> 
  summarize(tot_relcov = sum(rel_cov)*100, .groups = 'drop')
table(relcov_check$tot_relcov) # they should all be 100
View R output
## 
## 100 
##  13


Plotting with ggplot2

Recreate the plot below (or customize your own plot). Note that the fill color is “#0080FF”, the shape is 21, and the theme is classic. The linewidth is 0.75 and the linetype is ‘dashed’.

Answer
ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000)) +
  geom_line(linewidth = 0.75, linetype = 'dashed') + 
  geom_point(color = "black", fill = "#0080FF", size = 2.5, shape = 21) +
  labs(x = "Year", y = "Annual visits in 1,000s") +
  scale_y_continuous(limits = c(2000, 4500),
                     breaks = seq(2000, 4500, by = 500)) + 
  scale_x_continuous(limits = c(1994, 2024),
                     breaks = c(seq(1994, 2024, by = 5))) + 
  theme_classic()


Day 3 Questions

Load Data and Packages


Pivoting tables

Load data, dplyr, and tidyr (the pivot functions are in tidyr)

library(dplyr)
library(tidyr) # for pivot_wider() and pivot_longer()

# bat capture data
bat_cap <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/example_bat_capture_data.csv")

Joining tables

Load data and dplyr

library(dplyr)
# tree data
trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")
# tree species table
spp_tbl <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_species_table.csv")

Dates and times

Load and prep data.

# Hobo Temp data
temp_data <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/HOBO_temp_example.csv", skip = 1)[,1:3]
colnames(temp_data) <- c("index", "temp_time", "tempF")
temp_data$timestamp_temp <- as.POSIXct(temp_data$temp_time, 
                                       format = "%m/%d/%Y %H:%M", 
                                       tz = "America/New_York")

ggplot sections

Load packages and prep data for ggplot sections

# Water chemistry data for ggplot section
library(dplyr)
library(ggplot2)
library(patchwork) # for arranging ggplot objects
library(RColorBrewer) # for palettes
library(viridis) # for palettes
chem <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_water_chemistry_data.csv")

chem <- chem |> mutate(date = as.Date(EventDate, "%m/%d/%Y"),
                       year = as.numeric(format(date, "%Y")),
                       mon = as.numeric(format(date, "%m")),
                       doy = as.numeric(format(date, "%j"))) 

ACAD_lakes <- c("ACBUBL", "ACEAGL", "ACECHO", "ACJORD", 
                "ACLONG", "ACSEAL", "ACUHAD", "ACWHOL")

lakes_temp <- chem |> filter(SiteCode %in% ACAD_lakes) |> 
  filter(Parameter %in% "Temp_F") 


Pivoting tables

Pivot the bat_sum data frame on year instead of species, so that you have a column for every year of captures. Remember to avoid column names starting with a number.

Answer
bat_wide_yr <- pivot_wider(bat_sum, names_from = Year, 
                           values_from = num_indiv, 
                           values_fill = 0, 
                           names_prefix = "yr")
head(bat_wide_yr)
## # A tibble: 6 × 9
##   Site     sppcode yr2019 yr2020 yr2021 yr2022 yr2023 yr2024 yr2025
##   <chr>    <chr>    <int>  <int>  <int>  <int>  <int>  <int>  <int>
## 1 site_001 LASCIN       1      0      0      1      0      0      0
## 2 site_001 MYOLEI       0      1      1      2      1      0      2
## 3 site_001 MYOSEP       0      1      0      0      0      0      0
## 4 site_001 MYOLUC       0      0      1      1      0      2      0
## 5 site_002 LASCIN       1      0      1      0      0      0      1
## 6 site_002 MYOLEI       1      2      1      0      0      2      0

Pivot the resulting data frame from the previous question to long on the years columns, and remove the “yr” from the year names using names_prefix = 'yr'.

Answer
bat_long_yr <- pivot_longer(bat_wide_yr, 
                            cols = -c(Site, sppcode),
                            names_to = "Year", 
                            values_to = "num_indiv", 
                            names_prefix = "yr") # drops this string from values


Joining Tables

Join the NETN_tree_data.csv and NETN_tree_species_table.csv to connect the common name to the tree data.

Answer
# find the columns in common
intersect(names(spp_tbl), names(trees)) # TSN and ScientificName
View R output
## [1] "TSN"            "ScientificName"

# left join species to trees, because we don't want to include species not found in the tree data
trees_spp <- left_join(trees, 
                       spp_tbl |> select(TSN, ScientificName, CommonName), 
                       by = c("TSN", "ScientificName"))

head(trees_spp)
View R output
##   ParkUnit PlotCode SampleDate IsQAQC SampleYear TagCode    TSN ScientificName
## 1     MIMA       12  6/16/2025  FALSE       2025      13 183385  Pinus strobus
## 2     MIMA       12  6/16/2025  FALSE       2025      12  28728    Acer rubrum
## 3     MIMA       12  6/16/2025  FALSE       2025      11  28728    Acer rubrum
## 4     MIMA       12  6/16/2025  FALSE       2025       2  28728    Acer rubrum
## 5     MIMA       12  6/16/2025  FALSE       2025      10  28728    Acer rubrum
## 6     MIMA       12  6/16/2025  FALSE       2025       7  28728    Acer rubrum
##   DBHcm TreeStatusCode CrownClassCode DecayClassCode         CommonName
## 1  24.9             AS              5           <NA> eastern white pine
## 2  10.9             AB              5           <NA>          red maple
## 3  18.8             AS              3           <NA>          red maple
## 4  51.2             AS              3           <NA>          red maple
## 5  38.2             AS              3           <NA>          red maple
## 6  22.5             AS              4           <NA>          red maple

Find a species in the NETN tree data that doesn’t have a match in the species table.
Answer
# find the columns in common
intersect(names(spp_tbl), names(trees)) # TSN and ScientificName
View R output
## [1] "TSN"            "ScientificName"

# anti join of trees against species table, selecting only columns of interest
anti_join(trees, spp_tbl, by = c("TSN", "ScientificName")) |> 
  select(ParkUnit, PlotCode, SampleYear, ScientificName)
View R output
##   ParkUnit PlotCode SampleYear ScientificName
## 1     MIMA       16       2025  Quercus robur


Dates and Times
Question: How would you return date1 as YYYYMMDD (20260312)?
Answer
format(date1, format = "%Y%m%d")
View R output
## [1] "20260312"

How would you create a list of dates in 2026 that are evenly spaced by 3 months?
Answer
date_list <- as.Date(c("01/01/2026", "12/31/2026"), format = "%m/%d/%Y") 
seq.Date(date_list[1], date_list[2], by = "3 months")
View R output
## [1] "2026-01-01" "2026-04-01" "2026-07-01" "2026-10-01"

How would you create a list of dates in 2026 that are evenly spaced by 1 week?
Answer
date_list <- as.Date(c("01/01/2026", "12/31/2026"), format = "%m/%d/%Y") 
seq.Date(date_list[1], date_list[2], by = "1 week")
View R output
##  [1] "2026-01-01" "2026-01-08" "2026-01-15" "2026-01-22" "2026-01-29"
##  [6] "2026-02-05" "2026-02-12" "2026-02-19" "2026-02-26" "2026-03-05"
## [11] "2026-03-12" "2026-03-19" "2026-03-26" "2026-04-02" "2026-04-09"
## [16] "2026-04-16" "2026-04-23" "2026-04-30" "2026-05-07" "2026-05-14"
## [21] "2026-05-21" "2026-05-28" "2026-06-04" "2026-06-11" "2026-06-18"
## [26] "2026-06-25" "2026-07-02" "2026-07-09" "2026-07-16" "2026-07-23"
## [31] "2026-07-30" "2026-08-06" "2026-08-13" "2026-08-20" "2026-08-27"
## [36] "2026-09-03" "2026-09-10" "2026-09-17" "2026-09-24" "2026-10-01"
## [41] "2026-10-08" "2026-10-15" "2026-10-22" "2026-10-29" "2026-11-05"
## [46] "2026-11-12" "2026-11-19" "2026-11-26" "2026-12-03" "2026-12-10"
## [51] "2026-12-17" "2026-12-24" "2026-12-31"

How would you extract the month as a number ranging from 1-12 in temp_data?
Answer
temp_data$month_num <- as.numeric(format(temp_data$timestamp_temp, "%m"))
head(temp_data)
View R output
##   index       temp_time  tempF      timestamp_temp month_num
## 1     1 7/18/2021 10:26 58.842 2021-07-18 10:26:00         7
## 2     2 7/18/2021 11:26 58.712 2021-07-18 11:26:00         7
## 3     3 7/18/2021 12:26 58.109 2021-07-18 12:26:00         7
## 4     4 7/18/2021 13:26 56.208 2021-07-18 13:26:00         7
## 5     5 7/18/2021 14:26 56.208 2021-07-18 14:26:00         7
## 6     6 7/18/2021 15:26 55.342 2021-07-18 15:26:00         7

How would you extract the Julian date in temp_data?
Answer
temp_data$julian <- as.numeric(format(temp_data$timestamp_temp, "%j"))
head(temp_data)
View R output
##   index       temp_time  tempF      timestamp_temp month_num julian
## 1     1 7/18/2021 10:26 58.842 2021-07-18 10:26:00         7    199
## 2     2 7/18/2021 11:26 58.712 2021-07-18 11:26:00         7    199
## 3     3 7/18/2021 12:26 58.109 2021-07-18 12:26:00         7    199
## 4     4 7/18/2021 13:26 56.208 2021-07-18 13:26:00         7    199
## 5     5 7/18/2021 14:26 56.208 2021-07-18 14:26:00         7    199
## 6     6 7/18/2021 15:26 55.342 2021-07-18 15:26:00         7    199


Customizing ggplot
CHALLENGE: Review labels from Day 2. How would you change the X-axis label to “Year”, and the Y-axis label to “Temp. (F)”?
Answer
# Will need to run these first
site_cols <- c("ACBUBL" = "#7FC7E1", "ACEAGL" = "#ED5C5C", 
               "ACECHO" = "#F59617", "ACJORD" = "#A26CF5", 
               "ACLONG" = "#1952CF", "ACSEAL" = "#F3F56C", 
               "ACUHAD" = "#FFA6C8", "ACWHOL" = "#8FD184")
site_shps <- c("ACBUBL" = 21, "ACEAGL" = 23, 
               "ACECHO" = 24, "ACJORD" = 25, 
               "ACLONG" = 23, "ACSEAL" = 21, 
               "ACUHAD" = 25, "ACWHOL" = 24)
# Now generate the plot
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_point(color = "dimgrey", size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(x = "Year", y = "Temp. (F)") # answer
View R plot


CHALLENGE: How would you make the points less transparent and the smoothed line thinner?
Answer
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.85) + # or remove the alpha
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 
View R plot




CHALLENGE: How would you change the smooth from LOESS to linear?
Answer
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  geom_point(color = "dimgrey", alpha = 0.5) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 
View R plot


CHALLENGE: Recreate the plot below. Note that the symbol outline is black and the alpha level is 0.6.

Answer
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5) +
  geom_point(color = "black", alpha = 0.6, size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(y = "Temp. (F)", x = "Year") +
  facet_wrap(~SiteCode, ncol = 2) +
  theme(legend.position = 'bottom')
View R plot


CHALLENGE: How would you change the plot called p_site2 so that the x-axis labels appear every 4 years and are angled 45 degrees instead of 90?
Hint: start with p_site, so you don’t have to write that much code. You’ll also need to tweak vjust and hjust to make the labels line up properly.
Answer
p_site + 
  scale_x_date(date_labels = "%Y", date_breaks = "4 years") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
View R plot



ggplot Palettes
CHALLENGE: How would you specify the ‘RdYlBu’ palette instead of the ones used above?
Hint: Start with p_pal to save time coding.
Answer
p_pal +  scale_color_brewer(name = "Lake", palette = "RdYlBu", aesthetics = c("fill", "color"))  +
         guides(fill = guide_legend(override.aes = list(alpha = 1))) # makes symbols not transparent in legend
View R plot


CHALLENGE: Create your own palette with at least three colors.
Hint: Start with p_heat to save time coding.
Answer
p_heat + scale_color_gradient2(low = "#3E693D", mid = "#FDFFC7", high = "#7A6646", 
                               aesthetics = c("fill", 'color'),
                               midpoint = mean(lakes_temp$Value), 
                               name = "Temp. (F)") 
View R plot



Getting Help

Help Documentation

There are a number of options for getting help with R. If you’re trying to figure out how to use a function, you can type ?function_name. For example, ?plot will show the R documentation for that function in the Help panel.

Get help for the functions below

?plot
?dplyr::filter
You can also press F1 while the cursor is on a function name to access the help for that function. Help documents in R are standardized to help you find what you’re looking for.
  • Top left shows the function with its {package}. {base} means it’s a function in the base R install.
  • Description: tells you what the function does.
  • Usage: shows the function’s arguments and their defaults. The “…” means there are other potential arguments, but that isn’t something we need to talk about right now.
  • Arguments: defines each argument and the inputs it takes. For example, whether an argument expects TRUE or FALSE, or a text string.
  • Value: describes what the function returns (not always included).
  • See Also: Sometimes functions build on other functions. This section links to similar or building-block functions.
  • Examples: Functions that provide good examples are invaluable. Sometimes these are afterthoughts or not included in help documentation, which is too bad. Unfortunately, base R functions tend to have some of the most obscure, hard-to-understand examples.
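Beyond ?, base R ships a few other lookup helpers that cover the same help system in different ways. All of the calls below are standard base/utils functions:

```r
# Other ways to pull up documentation, all built into base R
help("filter", package = "dplyr") # same as ?dplyr::filter
help.search("linear model")       # search all installed help pages; same as ??"linear model"
apropos("read")                   # list functions on the search path whose names contain "read"
example("mean")                   # run the Examples section from mean's help page
```

help.search() is handy when you know what you want to do but not the function name; apropos() is handy when you half-remember the name.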


Troubleshooting errors

Great online resources for finding answers include Stack Exchange and Stack Overflow. Google searches are usually my first step, and I include “in R” and the package name (if applicable) in every search related to R code. If you’re troubleshooting an error message, copying and pasting the error message verbatim into a search engine often helps.
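As a quick sketch of that workflow, the snippet below triggers a typical error on purpose so you can see the kind of message you’d paste into a search engine:

```r
# Reproduce the problem in a small example first
sum("a")
## Error in sum("a") : invalid 'type' (character) of argument

# Pasting that message verbatim (plus "in R") into a search engine usually
# finds an explanation quickly. Right after an error, traceback() shows
# which nested function call actually failed:
traceback()
```

Boiling a problem down to a few reproducible lines like this also makes it much easier for a colleague (or a forum) to help you.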

Don’t hesitate to reach out to colleagues for help as well! If you are stuck on something and the answers on Google are more confusing than helpful, don’t be afraid to ask a human. Every experienced R programmer was a beginner once, so chances are they’ve encountered the same problem as you at some point. There is an R-focused Data Science Community of Practice for I&M folks, which anyone working in R (regardless of experience!) is invited and encouraged to join.


Common errors and how to fix them
  1. Unmatched parentheses

    mean_x <- mean(c(1, 3, 5, 7, 8, 21) # missing closing parenthesis
    mean_x <- mean(c(1, 3, 5, 7, 8, 21)) # corrected
  2. Unmatched quotes

    birds <- c("black-capped chickadee", "golden-crowned kinglet, "wood thrush") # missing quote after kinglet
    birds <- c("black-capped chickadee", "golden-crowned kinglet", "wood thrush") # corrected
  3. Missing a comma between elements

    birds <- c("black-capped chickadee", "golden-crowned kinglet" "wood thrush") # missing comma after kinglet
    birds <- c("black-capped chickadee", "golden-crowned kinglet", "wood thrush") # corrected
  4. Misspelled function name

    x_mean <- maen(x) # misspelled mean
    x_mean <- mean(x) # corrected
  5. Incorrect use of dimensions with brackets

    # Missing comma to indicate subsetting rows (records)
    ACAD_wetland2 <- ACAD_wetland[!is.na(ACAD_wetland$Site_Name)]
    ## Error in `ACAD_wetland[!is.na(ACAD_wetland$Site_Name)]`:
    ## ! Can't subset columns with `!is.na(ACAD_wetland$Site_Name)`.
    ## ✖ Logical subscript `!is.na(ACAD_wetland$Site_Name)` must be size 1 or 15, not 508.
    # Corrected
    ACAD_wetland2 <- ACAD_wetland[!is.na(ACAD_wetland$Site_Name), ]



Resources

Online Resources

There’s a lot of great online material for learning new applications of R. The ones we’ve used the most are listed below.

Online Books
  • R for Data Science First author is Hadley Wickham, one of the main programmers behind the tidyverse. There’s a lot of good stuff in here. This book is the first place to look for anything you want to follow up on from this training.
  • ggplot2: Elegant Graphics for Data Analysis A great reference on ggplot2 also by Hadley Wickham.
  • Mastering Software Development in R First author is Roger Peng, a biostatistics professor at Johns Hopkins who has taught a lot of undergrad/grad students how to use R. He’s also one of the hosts of the Not So Standard Deviations podcast. His intro to ggplot is great, and he covers more advanced topics in this book as well, like making functions and packages.
  • R Packages Another book by Hadley Wickham that teaches you how to build, debug, and test R packages.
  • Advanced R Yet another book by Hadley Wickham that helps you understand more about how R works under the hood, how it relates to other programming languages, and how to build packages.
  • Mastering Shiny And another Hadley Wickham book on building shiny apps.
Other useful sites
  • NPS_IMD_Data_Science_and_Visualization > Community of Practice is an IMD work group that meets once a month to talk about R and Data Science. There are also notes, materials, and recordings from previous meetings, a Wiki with helpful tips, and the chat is a great place to post questions or cool tips you’ve come across.
  • STAT545 Jenny Bryan’s site that accompanies the graduate level stats class of the same name. She includes topics on best practices for coding, and not just how to make scripts work. It’s really well done.
  • RStudio home page There’s a ton of info in the Resources tab on this site, including cheat sheets for each package developed by RStudio (i.e., the tidyverse packages), webinars, presentations from past RStudio Conferences, etc.
  • RStudio list of useful R packages by topic
  • patchwork R package tutorial for arranging multiple ggplot figures.
  • R Markdown: The Definitive Guide provides nearly everything you need to know about building R Markdown documents, a really useful way to document your code, output, and notes all in one place. This website was developed in R Markdown, for example.
  • Happy Git with R If you find yourself wanting to go down the path of hosting your code on github, this site will walk you through the process of linking github to RStudio.

Keyboard Shortcuts

Once you get in the swing of coding, you’ll find that minimizing the number of times you have to use your mouse will help you code faster. RStudio has done a great job creating lots of really useful keyboard shortcuts designed to keep your hands on the keyboard instead of having to click through menus. One way to see all of the shortcuts RStudio has built in is to press Alt+Shift+K. A window should appear with a bunch of shortcuts listed. These are also listed on List of RStudio IDE Keyboard Shortcuts. The shortcuts I use the most often are listed below:
  • Undo: Ctrl Z
  • Redo: Ctrl Shift Z
  • Run highlighted code: Ctrl Enter
  • Insert “<-” : Alt -
  • Zoom in to make text bigger: Ctrl roll mouse forward (set in Global Options)
  • Zoom out: Ctrl - or Ctrl roll mouse backward (set in Global Options)
  • Move line of code up or down: Alt arrow up or down
  • Comment out whole line: Ctrl Shift C
  • Duplicate line of code: Ctrl Shift D
  • Move cursor to beginning of line: Home
  • Move cursor to end of line: End
  • View help for a given function: Put cursor on function name and press F1
  • Esc escapes out of the command currently being executed in the console
  • Restart R Session: Ctrl Shift F10
  • Insert pipe (|>): Ctrl Shift M
  • View RStudio’s keyboard shortcuts: Alt Shift K

Advanced topics

Additional skills that can greatly improve your workflow are:
  • Using R Markdown (stand alone websites/docs) or R Shiny (interactive websites) for automated reporting or visualizing your data.
  • Writing your own functions and iterating tasks.
  • Version control using git/GitHub.
  • Building R packages.

While we won’t get to these topics this week, the 2022 Advanced R training has sessions covering all of these topics. The Resources tab includes other online resources that cover these topics as well.
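As a small taste of the “writing your own functions and iterating tasks” topic above, here is a minimal sketch; the function name and values are made up for illustration:

```r
# A user-defined function: convert Fahrenheit to Celsius
f_to_c <- function(temp_f) {
  (temp_f - 32) * 5/9
}

f_to_c(32) # returns 0

# Iterating: apply the function to each element of a vector with sapply()
sapply(c(32, 50, 212), f_to_c) # returns 0, 10, 100
```

Once a calculation lives in a function, you can reuse it across datasets instead of copying and pasting the formula, which is the main habit these advanced topics build on.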

R Markdown wizards
Artwork by @allison_horst

Recordings (NPS only)

Currently, only the folks who participated in the training can view the recordings. If you did not participate and would like access, send Kate Miller an email.


Code printout

knitr::opts_chunk$set(warning=FALSE, message=FALSE)
hooks = knitr::knit_hooks$get()
hook_foldable = function(type) {
  force(type)
  function(x, options) {
    res = hooks[[type]](x, options)
    
    if (isFALSE(options[[paste0("fold.", type)]])) return(res)
    
    paste0(
      "<details><summary class='code2'>View R ", type, "</summary>\n",
      res, "\n\n",
      "</details>",
      "\n\n",
      "<hr style='height:1px; margin-bottom:15px; padding-bottom:15px; padding-top:-15px;margin-top:-15px;visibility:hidden;'>",
      "\n\n"
    
      )
  }
}
knitr::knit_hooks$set(
  output = hook_foldable("output"),
  plot = hook_foldable("plot")
)

library(tidyverse)
#------------------------------------
#        Day 0 - prep code 
#------------------------------------

rm(list = ls())
packages <- c("tidyverse", # for Day 2 and 3 data wrangling
              "RColorBrewer", "viridis", "patchwork", # for Day 3 ggplot
              "readxl", "writexl") # for day 1 importing from excel

install.packages(setdiff(packages, rownames(installed.packages())))  

# Check that installation worked
library(tidyverse) # turns on core tidyverse packages
library(RColorBrewer) # palette generator
library(viridis) # more palettes
library(patchwork) # multipanel plots
library(readxl) # reading xlsx
library(writexl) # writing xlsx
#------------------------------------
#     Day 1: Project Setup Code 
#------------------------------------
# forward slash file path approach
"C:/Users/KMMiller/OneDrive - DOI/data/"

# backward slash file path approach
"C:\\Users\\KMMiller\\OneDrive - DOI\\data\\"

dir.create("data")
list.files() # you should see a data folder listed 
#------------------------------------
#     Day 1: Start Coding Code 
#------------------------------------
# Commented text: try this line to generate some basic text and become familiar with where results will appear:
print("Welcome to R!")

# simple math
1+1

(2*3)/4

sqrt(9)

# calculate basal area of tree with 14.6cm diameter; note pi is built in constant in R
(14.6^2)*pi

# get the cosine of 180 degrees - note that trig functions in R expect angles in radians
cos(pi)

# the value of 12.098 is assigned to variable 'a'
a <- 12.098

# and the value 65.3475 is assigned to variable 'b'
b <- 65.3475

# we can now perform whatever mathematical operations we want using these two 
# variables without having to repeatedly type out the actual numbers:

a*b

(a^b)/((b+a))

sqrt((a^7)/(b*2))

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# equivalent to x <- 1:10

# bad coding
#mean <- mean(x)

# good coding 
mean_x <- mean(x)
mean_x

range_x <- range(x)
range_x
#------------------------------------
#     Day 1: Read and Write Code 
#------------------------------------
# read in the data from ACAD_wetland_data_clean.csv and assign it as a dataframe to the variable "ACAD_wetland"
ACAD_wetland <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_wetland_data_clean.csv"
  )
# View the ACAD_wetland data frame we just created
View(ACAD_wetland)
# Look at the top 6 rows of the data frame
head(ACAD_wetland)
# Look at the bottom 6 rows of the data frame
tail(ACAD_wetland)
# Write the data frame to your data folder using a relative path. 
# By default, write.csv adds a column with row names that are numbers. I don't
# like that, so I turn that off.
write.csv(ACAD_wetland, "./data/ACAD_wetland_data_clean.csv", row.names = FALSE)
# Read the data frame in using a relative path
ACAD_wetland <- read.csv("./data/ACAD_wetland_data_clean.csv")

# Equivalent code to read in the data frame using full path on my computer, but won't match another user.
ACAD_wetland <- read.csv("C:/Users/KMMiller/OneDrive - DOI/NETN/R_Dev/IMD_R_Training_2026/data/ACAD_wetland_data_clean.csv")

install.packages("readxl") # only need to run once. 
install.packages("writexl")
library(writexl) # saving xlsx
library(readxl) # importing xlsx
write_xlsx(ACAD_wetland, "./data/ACAD_wetland_data_clean.xlsx")
ACAD_wetxls <- read_xlsx(path = "./data/ACAD_wetland_data_clean.xlsx", sheet = "Sheet1") 
head(ACAD_wetxls)
#------------------------------------
#       Day 1: Vectors Code 
#------------------------------------
digits <- c(1:10)  # Use x:y to create a sequence of integers starting at x and ending at y
digits
digits + 1 # note how 1 was added to every element of digits. 

is_odd <- rep(c(FALSE, TRUE), 5)  # Use rep(x, n) to create a vector by repeating x n times 
is_odd

tree_dbh <- c(12.5, 20.4, 18.1, 38.5, 19.3)
tree_dbh

bird_ids <- c("black-capped chickadee", "dark-eyed junco", "golden-crowned kinglet", "dark-eyed junco")
bird_ids
second_bird <- bird_ids[2]
second_bird
top_two_birds <- bird_ids[c(1,2)]
top_two_birds
sort(unique(bird_ids))
class(bird_ids)
class(tree_dbh)
class(digits)
class(is_odd)
str(ACAD_wetland)
names(ACAD_wetland)
ACAD_wetland$Site_Name
ACAD_wetland$Latin_Name
dim(ACAD_wetland)
nrow(ACAD_wetland) # first dim
ncol(ACAD_wetland) # second dim
ACAD_wetland[1:5,]
ACAD_wetland[c(1, 2, 3, 4, 5),] #equivalent but more typing
ACAD_wetland[, c("Site_Name", "Latin_Name", "Common", "Year", "PctFreq")]
ACAD_wetland[1:5, c("Site_Name", "Latin_Name", "Common", "Year", "PctFreq")]
ACAD_sub <- ACAD_wetland[ , 1:4] # works, but risky
ACAD_sub2 <- 
  ACAD_wetland[,c("Site_Name", "Site_Type", "Latin_Name", "Common")] #same result, but better
# compare the two data frames to the original
head(ACAD_wetland)
head(ACAD_sub)
head(ACAD_sub2)
ACAD_wetland[c(2, 4, 6, 8), c(1, 2)]
names(ACAD_wetland) # get the names of the first 2 columns
ACAD_wetland[c(2, 4, 6, 8), c("Site_Name", "Site_Type")]
head(ACAD_wetland)

ACAD_nat <- ACAD_wetland[ACAD_wetland$Invasive == FALSE, ]
table(ACAD_wetland$Invasive) # 9 T
table(ACAD_nat$Invasive) # No T
ACAD_wetland$Latin_Name[ACAD_wetland$Invasive == TRUE]
ACAD_wetland[ACAD_wetland$Invasive == TRUE, "Latin_Name"] # equivalent

orchid_spp <- c("Arethusa bulbosa", "Calopogon tuberosus", "Pogonia ophioglossoides")
ACAD_orchid_plots <- ACAD_wetland[ACAD_wetland$Latin_Name %in% orchid_spp, 
                                  c("Site_Name", "Year", "Latin_Name")]
ACAD_orchid_plots
# Return a vector of unique site names, sorted alphabetically
sites_unique <- sort(unique(ACAD_wetland[,"Site_Name"]))
sites_unique
# Returns the number of elements in sites_unique vector
length(sites_unique) # 8
# Option 1
length(unique(ACAD_wetland[, "Latin_Name"]))
# Option 2
length(unique(ACAD_wetland$Latin_Name)) # equivalent
# Option 1 - used unique to just return unique site name
unique(ACAD_wetland$Site_Name[ACAD_wetland$Protected == TRUE])

# Option 2
unique(ACAD_wetland[ACAD_wetland$Protected == TRUE, "Site_Name"])
#-----------------------------------------
#     Day 1: Data Exploration Code 
#-----------------------------------------
trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")
head(trees)
str(trees)
summary(trees)
table(complete.cases(trees[,1:10])) # all true
x <- c(1, 3, 8, 3, 5, NA)
mean(x) # returns NA
mean(x, na.rm = TRUE) 
sort(unique(trees$DecayClassCode)) # sorts the unique values in the column
table(trees$DecayClassCode) # shows the number of records per value - very handy
trees2 <- trees
trees2$DecayClassCode[trees2$DecayClassCode == "PM"] <- NA
trees2$DecayClassCode_num <- as.numeric(trees2$DecayClassCode)

# check that it worked
str(trees2) # DecayClassCode_num is numeric
sort(unique(trees2$DecayClassCode_num)) # Only numbers show in table
trees3 <- subset(trees2, IsQAQC == FALSE, select = -DecayClassCode) # Note the importance of FALSE all caps
trees3 <- subset(trees2, IsQAQC != TRUE, select = -DecayClassCode) # equivalent
trees3 <- trees2[trees2$IsQAQC == FALSE, -12] #equivalent but not as easy to follow
# Look at the sample date format
head(trees3$SampleDate) # month/day/year

# Create new column called Date
trees3$Date <- as.Date(trees3$SampleDate, format = "%m/%d/%Y")
str(trees3)

names(trees3) # original names
names(trees3)[names(trees3) == "ScientificName"] <- "Species"
names(trees3) # check that it worked
trees3$Plot_Name <- paste(trees3$ParkUnit, trees3$PlotCode, sep = "-")
trees3$Plot_Name <- paste0(trees3$ParkUnit, "-", trees3$PlotCode) # equivalent; paste0 has no separator between elements by default.

mima12 <- subset(trees3, Plot_Name == "MIMA-12")
nrow(mima12) # 12
mima12_as <- subset(trees3, Plot_Name == "MIMA-12" & TreeStatusCode == "AS")
nrow(mima12_as) # 6
# OPTION 2
mima12 <- trees3[trees3$Plot_Name == "MIMA-12",]
table(mima12$TreeStatusCode) # the AS count should be 6

View(trees3)
max_dbh <- max(trees3$DBHcm, na.rm = TRUE)
trees3[trees3$DBHcm == max_dbh,]
View(trees)
max_dbh <- max(trees3$DBHcm, na.rm = TRUE)
max_dbh # 443
trees3[trees3$DBHcm == max_dbh, ] # index trees3 with a condition built from trees3, not a mix of data frames
# Plot MIMA-016, TagCode = 1.
# create copy of trees data
trees_fix <- trees3

# find the problematic DBH value, and change it to 44.3
trees_fix$DBHcm[trees_fix$Plot_Name == "MIMA-016" & trees_fix$TagCode == 1 & trees_fix$DBHcm == 443] <- 44.3
range(trees$DBHcm)
range(trees_fix$DBHcm)
#------------------------------------
#     Day 1: Basic Plotting Code 
#------------------------------------
trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")
hist(x = trees$DBHcm)
plot(trees$DBHcm)
plot(trees$DBHcm ~ trees$CrownClassCode)
plot(DBHcm ~ CrownClassCode, data = trees) # equivalent but cleaner axis titles

hist(ACAD_wetland$Ave_Cov)
#------------------------------------
#       Day 2: Tidyverse Code 
#------------------------------------
# install.packages('tidyverse') # only needs to run once per machine
library(tidyverse)
library(dplyr) # dplyr is already attached by tidyverse; shown here for emphasis
#------------------------------------
#     Day 2: Data Wrangling Code 
#------------------------------------
trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")
# Base R
trees2 <- trees
trees2$DecayClassCode[trees2$DecayClassCode == "PM"] <- NA
trees2$DecayClassCode_num <- as.numeric(trees2$DecayClassCode)
# dplyr approach with mutate
trees2 <- mutate(trees, DecayClassCode_num = as.numeric(replace(DecayClassCode, DecayClassCode == "PM", NA)))
str(trees2)
# Base R
trees2$Date <- as.Date(trees2$SampleDate, format = "%m/%d/%Y")
# dplyr approach with mutate
trees3 <- mutate(trees2, Date = as.Date(SampleDate, format = "%m/%d/%Y"))
# Base R code
names(trees2)[names(trees2) == "ScientificName"] <- "Species"
# dplyr approach with rename
trees2 <- rename(trees2, "Species" = "ScientificName")
names(trees2)
# Base R
trees2$Plot_Name <- paste(trees2$ParkUnit, trees2$PlotCode, sep = "-")
# dplyr approach with mutate
trees2 <- mutate(trees2, Plot_Name = paste(ParkUnit, PlotCode, sep = "-"))
# Base R
trees3 <- subset(trees2, IsQAQC == FALSE, select = -DecayClassCode) # Note the importance of FALSE all caps
# dplyr
trees3a <- filter(trees2, IsQAQC == FALSE)
trees3 <- select(trees3a, -DecayClassCode)

head(trees3)
trees_final <- trees |> 
  mutate(DecayClassCode_num = as.numeric(replace(DecayClassCode, DecayClassCode == "PM", NA)),
         Plot_Name = paste(ParkUnit, PlotCode, sep = "-"),
         Date = as.Date(SampleDate, format = "%m/%d/%Y")) |> 
  rename("Species" = "ScientificName") |> 
  filter(IsQAQC == FALSE) |> 
  select(-DecayClassCode) |> 
  arrange(Plot_Name, TagCode)

head(trees_final)  
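A small aside (my addition, not part of the course code): the native pipe `|>` is just function nesting written left to right, which is why the long chain above reads as a sequence of steps.

```r
v <- c(3, 1, 2, 3)
sort(unique(v))         # nested form: 1 2 3
v |> unique() |> sort() # piped form, identical result
```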
trees_final |> filter(Plot_Name == "MIMA-12") |> nrow()
trees_final |> filter(Plot_Name == "MIMA-12" & TreeStatusCode == "AS") |> nrow()
# Base R and dplyr combo
max_dbh <- max(trees_final$DBHcm, na.rm = TRUE)
trees_final |> 
  filter(DBHcm == max_dbh) |> 
  select(Plot_Name, SampleYear, TagCode, Species, DBHcm)

# dplyr with slice
trees_final |> 
  arrange(desc(DBHcm)) |> # arrange DBHcm high to low via desc()
  slice(1) |> # slice the top record
  select(Plot_Name, SampleYear, TagCode, Species, DBHcm)

# Base R
# create copy of trees data
trees_fix <- trees3
# find the problematic DBH value, and change it to 44.3
trees_fix$DBHcm[trees_fix$Plot_Name == "MIMA-016" & trees_fix$TagCode == 1 & trees_fix$DBHcm == 443] <- 44.3

# dplyr via replace; note this replaces every DBHcm equal to 443, not just the MIMA-016 record
trees_fix <- trees |> mutate(DBHcm = replace(DBHcm, DBHcm == 443.0, 44.3))
range(trees$DBHcm)
range(trees_fix$DBHcm)
#------------------------------------
#      Day 2: Conditionals Code 
#------------------------------------
# Check the levels of TreeStatusCode
sort(unique(trees_final$TreeStatusCode))
alive <- c("AB", "AL", "AS", "RS")
dead <- c("DB", "DM", "DS")

trees_final <- trees_final |> 
  mutate(status = ifelse(TreeStatusCode %in% alive, "live", "dead"))

# nested ifelse to make alive, dead, and recruit 
trees_final <- trees_final |> 
  mutate(status2 = ifelse(TreeStatusCode %in% dead, "dead",
                          ifelse(TreeStatusCode %in% "RS", "recruit", 
                                 "live")))

# Check the levels of TreeStatusCode
alive <- c("AB", "AL", "AS", "RS")
dead <- c("DB", "DM", "DS")

trees_final <- trees_final |> 
  mutate(status3 = case_when(TreeStatusCode %in% dead ~ 'dead',
                             TreeStatusCode %in% 'RS' ~ 'recruit',
                             TreeStatusCode %in% alive ~ 'live', 
                             TRUE ~ 'unknown'))

table(trees_final$status2, trees_final$status3) # check that the output is the same
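One behavior worth calling out (a toy sketch, not from the original code): case_when() evaluates its conditions in order and uses the first one that matches, which is why the dead/recruit/live conditions above are listed from most to least specific.

```r
library(dplyr)
x <- c(2, 7, 15)
case_when(x < 5 ~ "low",
          x < 10 ~ "mid", # 2 also satisfies this, but "low" already matched
          TRUE ~ "high")
# "low" "mid" "high"
```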

inv <- ACAD_wetland |> filter(Invasive == TRUE)

if(nrow(inv) > 0){print("Invasive species were detected in the data.")
  } else {print("No invasive species were detected in the data.")}

native_only <- ACAD_wetland |> filter(Invasive == FALSE) 
inv2 <- native_only |> filter(Invasive == TRUE)

if(nrow(inv2) > 0){print("Invasive species were detected in the data.")
  } else if(nrow(inv2) == 0){print("No invasive species were detected in the data.")
    } else {print("Invasive species detections unclear")}

# read in wetland data if you don't already have it loaded.
ACAD_wetland <- read.csv("./data/ACAD_wetland_data_clean.csv")

# Base R using the with() function
ACAD_wetland$Status <- with(ACAD_wetland, ifelse(Protected == TRUE, "protected", "public"))
table(ACAD_wetland$Status, ACAD_wetland$Protected)
# Tidyverse
ACAD_wetland <- ACAD_wetland |> mutate(Status = ifelse(Protected == TRUE, "protected", "public"))
table(ACAD_wetland$Status, ACAD_wetland$Protected) # check your work
# Base R using the with() function and nested ifelse()
ACAD_wetland$abundance_cat <- with(ACAD_wetland, ifelse(Ave_Cov < 10, "Low",
                                                        ifelse(Ave_Cov >= 10 & Ave_Cov <= 50, "Medium", "High")))
# Tidyverse using case_when() and between
ACAD_wetland <- ACAD_wetland |> mutate(abundance_cat = case_when(Ave_Cov < 10 ~ "Low",
                                                                 between(Ave_Cov, 10, 50) ~ "Medium", 
                                                                 TRUE ~ "High"))
table(ACAD_wetland$abundance_cat)
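For completeness, base R's cut() can also bin a numeric column into categories. A sketch mirroring the Low/Medium/High rule above; note that cut() uses half-open intervals, so behavior at exactly 50 differs slightly from the between() version.

```r
cov <- c(2, 9.9, 10, 35, 80)
cut(cov, breaks = c(-Inf, 10, 50, Inf),
    labels = c("Low", "Medium", "High"),
    right = FALSE) # [10, 50) is Medium; exactly 50 would fall in High here
# Low Low Medium Medium High
```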
#------------------------------------
#      Day 2: Summarizing Code 
#------------------------------------
num_trees_mut <- trees_final |> 
  group_by(Plot_Name, SampleYear, Species) |> 
  mutate(num_trees = n()) |> 
  select(Plot_Name, SampleYear, Species, num_trees)

nrow(trees_final) #164
nrow(num_trees_mut) #164
head(num_trees_mut)
num_trees_sum <- trees_final |> 
  group_by(Plot_Name, SampleYear, Species) |> 
  summarize(num_trees = n()) 

nrow(trees_final) #164
nrow(num_trees_sum) # fewer rows than trees_final: summarize() collapses to one row per group
head(num_trees_sum)
tree_dbh <- trees_final |> 
  group_by(Plot_Name, SampleYear) |> 
  summarize(mean_dbh = mean(DBHcm),
            num_trees = n(),
            se_dbh = sd(DBHcm)/sqrt(num_trees),
            .groups = 'drop') # prevents warning in console

tree_dbh2 <- trees_final |> 
  summarize(mean_dbh = mean(DBHcm),
            num_trees = n(),
            se_dbh = sd(DBHcm)/sqrt(num_trees),
           .by = c(Plot_Name, SampleYear))

tree_dbh == tree_dbh2 # element-wise test that the data frames match; assumes both have the same row order
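A hedged alternative to the element-wise == check above: identical() and all.equal() compare whole objects, and all.equal() tolerates tiny floating-point differences that == would flag.

```r
a <- data.frame(x = 1:3, y = c(2.0, 4.0, 6.0))
b <- data.frame(x = 1:3, y = c(2, 4, 6))
identical(a, b)         # TRUE: strict, all-or-nothing comparison
isTRUE(all.equal(a, b)) # TRUE: near-equality within a numeric tolerance
```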

# Using group_by()
ACAD_inv <- ACAD_wetland |> group_by(Site_Name, Year, Invasive) |> 
  summarize(Pct_Cov = sum(Ave_Cov), 
            .groups = 'drop') |>  # optional line to keep console from being chatty
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_inv)

# Using summarize(.by)
ACAD_inv2 <- ACAD_wetland |> 
  summarize(Pct_Cov = sum(Ave_Cov), .by = c(Site_Name, Year, Invasive)) |> 
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_inv2) # should be the same as ACAD_inv
# Using group_by()
ACAD_spp <- ACAD_wetland |> group_by(Site_Name, Year, Invasive) |> 
  summarize(num_spp = n(), 
            .groups = 'drop') |>  # optional line to keep console from being chatty
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_spp)

# Using summarize(.by)
ACAD_spp2 <- ACAD_wetland |> 
  summarize(num_spp = n(), .by = c(Site_Name, Year, Invasive)) |> 
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_spp2) # should be the same as ACAD_spp
# using the .by within mutate (newer solution)
ACAD_wetland <- ACAD_wetland |> 
  mutate(Site_Cover = sum(Ave_Cov), 
         .by = c(Site_Name, Year)) |> 
  mutate(rel_cov = (Ave_Cov/Site_Cover)*100,
         .by = c(Site_Name, Year, Latin_Name, Common))

ACAD_wetland <- ACAD_wetland |> group_by(Site_Name, Year) |> 
  mutate(Site_Cover = sum(Ave_Cov)) |> 
  ungroup() # good practice to ungroup after group.

table(ACAD_wetland$Site_Name, ACAD_wetland$Site_Cover) # check that each site has a unique value.
head(ACAD_wetland)
# Create new dataset because collapsing rows on grouping variables
# Using group_by() and summarize()
ACAD_wetland_relcov <- ACAD_wetland |> group_by(Site_Name, Year, Latin_Name, Common) |> 
  summarize(rel_cov = (Ave_Cov/Site_Cover)*100,
            .groups = 'drop') # .groups = 'drop' already ungroups, so no ungroup() needed

# Using summarize(.by = ) 
ACAD_wetland_relcov2 <- ACAD_wetland |> 
  summarize(rel_cov = Ave_Cov/Site_Cover, 
            .by = c("Site_Name", "Year", "Latin_Name", "Common"))

# Check that your relative cover sums to 100 for each site
relcov_check <- ACAD_wetland_relcov2 |> group_by(Site_Name, Year) |> 
  summarize(tot_relcov = sum(rel_cov)*100, .groups = 'drop')
table(relcov_check$tot_relcov) # they should all be 100

#----------------------------------------------
#     Day 2: Data Viz. Best Practices Code 
#----------------------------------------------
library(knitr)
library(kableExtra)
covid_numbers <- read.csv("./data/covid_numbers.csv")
head(covid_numbers, 7) |> 
  knitr::kable(align = "c", caption = "<h6><b>Table 1.</b> Daily Covid cases and population numbers by state (only showing first 7 records)</h6>") |> 
  kableExtra::kable_styling(full_width = F,  html_font = 'Arial', font_size = 12) |> 
  kableExtra::column_spec(1:4, background = 'white', include_thead = T)
acme_in <- read.csv("./data/acme_sales.csv") |> 
  dplyr::arrange(category, product) 
acme_in |> 
  knitr::kable(align = "c", caption = "<h6><b>Table 2. </b>Average monthly revenue (in $1000's) from Acme product sales, 1950 - 2020</h6>") |> 
  kableExtra::kable_styling(full_width = F, html_font = 'Arial', font_size = 12) |> 
  kableExtra::column_spec(1:14, background = 'white', include_thead = T)

acme <- acme_in |> 
  pivot_longer(-c(category, product), names_to = "month", values_to = "revenue")
acme$month <- factor(acme$month, levels = month.abb)

ggplot(acme, aes(x=month, y=product, fill=revenue)) + 
  geom_raster() +
  geom_text(aes(label=revenue, color = revenue > 1250)) + # color of text conditional on revenue relative to 1250
  scale_color_manual(guide = "none", values = c("black", "white")) + # set color of text
  scale_fill_viridis_c(direction = -1, name = "Monthly revenue,\nin $1000's") +
  scale_y_discrete(limits=rev) + # reverses order of y-axis bc ggplot reverses it from the data
  labs(#title = "Average monthly revenue (in $1000's) from Acme product sales, 1950 - 2020", 
       x = "Month", y = "Product") + 
  theme_bw(base_size = 11) +
  facet_grid(rows = vars(category), scales = "free") # set scales to free so each facet only shows its own levels
ansc <- anscombe |> 
  dplyr::select(x1, y1, x2, y2, x3, y3, x4, y4) 

ansc |> 
  knitr::kable(align = "c", caption = "<h6><b>Table 3.</b> Anscombe's Quartet - Four bivariate datasets with identical summary statistics</h6>") |> 
  kableExtra::column_spec (c(2,4,6),border_left = F, border_right = T) |> 
  kableExtra::kable_styling(full_width = F, html_font = 'Arial', font_size = 12) |> 
  kableExtra::column_spec(1:8, background = 'white', include_thead = T)

sapply(ansc, function(x) c(mean=round(mean(x), 2), var=round(var(x), 2))) |> 
  knitr::kable(align = "c", caption = "<h6><b>Table 4. </b>Means and variances are identical in the four datasets. The correlation between x and y (r = 0.82) is also identical across the datasets.</h6>") |> 
  kableExtra::column_spec (c(1,3,5,7), border_left = F, border_right = T) |> 
  kableExtra::kable_styling(full_width = F, html_font = 'Arial', font_size = 12) |> 
  kableExtra::column_spec(1:9, background = 'white', include_thead = T)
#------------------------------------
#    Day 2: Intro to ggplot Code 
#------------------------------------
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.align = 'center', fig.height = 3, fig.width = 5)
library(ggplot2)
library(dplyr) # for filter 
visits <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_annual_visits.csv")
visits <- read.csv("./data/ACAD_annual_visits.csv")
# Examine the data to understand data structure, data types, and potential problems
head(visits) 
summary(visits)
table(visits$Year) 
table(complete.cases(visits))
str(visits)
# Base R
visits$Annual_Visits <- as.numeric(gsub(",", "", visits$Annual_Visits))

# Tidyverse
library(dplyr) # load package first
visits <- visits |> mutate(Annual_Visits = as.numeric(gsub(",", "", Annual_Visits)))
str(visits) #check that it worked
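The gsub() step above can be sanity-checked on its own. A minimal sketch with made-up numbers (any comma-formatted strings would do): gsub() strips the thousands separators so as.numeric() can parse the result.

```r
as.numeric(gsub(",", "", c("3,519,621", "12,345")))
# 3519621 12345
```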
p <- ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000))
p
p1a <- p + geom_line() + geom_point() # default color and shape to points
p1a

p1 <- p + 
  geom_line(linewidth = 0.6) + 
  geom_point(color = "black", fill = "#82C2a3", size = 2.5, shape = 24)
p1
p2 <- p1 + scale_y_continuous(name = "Annual visitors in 1000's",
                              limits = c(2000, 4500),
                              breaks = seq(2000, 4500, by = 500)) + # label at 2000, 2500, ... up to 4500
           scale_x_continuous(limits = c(1994, 2024),
                              breaks = c(seq(1994, 2024, by = 5))) # label at 1994, 1999, ... up to 2024
p2
p3 <- p2 + labs(x = "Year", 
                title = "Annual visitation/1000 people in Acadia NP 1994 - 2024")
p3
p4 <- p3 + theme_bw() 
p4
p4b <- p3 + theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1), # make x axis text bigger and angle
                  panel.grid.major = element_blank(), # turns off major grids
                  panel.grid.minor = element_blank(), # turns off minor grids
                  panel.background = element_rect(fill = 'white', color = 'dimgrey'), # panel white w/ grey border
                  plot.margin = margin(2, 3, 2, 3), # increase white margin around plot 
                  title = element_text(size = 10) # reduce title size 
                  )

p4b
ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000)) +
  geom_line() + 
  geom_point(color = "black", fill = "#82C2a3", size = 2.5, shape = 24) +
  labs(x = "Year", title = "Annual visitation/1000 people in Acadia NP 1994 - 2024") +
  scale_y_continuous(name = "Annual visitors in 1000's",
                     limits = c(2000, 4500),
                     breaks = seq(2000, 4500, by = 500)) + 
  scale_x_continuous(limits = c(1994, 2024),
                     breaks = c(seq(1994, 2024, by = 5))) + 
  theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1), # make x axis text bigger and angle
        panel.grid.major = element_blank(), # turns off major grids
        panel.grid.minor = element_blank(), # turns off minor grids
        panel.background = element_rect(fill = 'white', color = 'dimgrey'), # make panel white w/ grey border
        plot.margin = margin(2, 3, 2, 3), # increase white margin around plot 
        title = element_text(size = 10) # reduce title size 
        )
  
pq <- 
  ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000)) +
    geom_line(linewidth = 0.75, linetype = 'dashed') + 
    geom_point(color = "black", fill = "#0080FF", size = 2.5, shape = 21) +
    labs(x = "Year", y = "Annual visits in 1,000s") +
    scale_y_continuous(limits = c(2000, 4500),
                       breaks = seq(2000, 4500, by = 500)) + 
    scale_x_continuous(limits = c(1994, 2024),
                       breaks = c(seq(1994, 2024, by = 5))) + 
    theme_classic()
pq
ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000)) +
  geom_line(linewidth = 0.75, linetype = 'dashed') + 
  geom_point(color = "black", fill = "#0080FF", size = 2.5, shape = 21) +
  labs(x = "Year", y = "Annual visits in 1,000s") +
  scale_y_continuous(limits = c(2000, 4500),
                     breaks = seq(2000, 4500, by = 500)) + 
  scale_x_continuous(limits = c(1994, 2024),
                     breaks = c(seq(1994, 2024, by = 5))) + 
  theme_classic()

library(dplyr)
library(ggplot2)
chem <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_water_chemistry_data.csv")
chem <- read.csv("./data/NETN_water_chemistry_data.csv")
jordDO <- chem |> 
  filter(SiteCode == "ACJORD") |> 
  filter(Parameter == "DO_mgL") |> 
  mutate(month = as.numeric(gsub("/", "", substr(EventDate, 1, 2))))
head(jordDO)
unique(jordDO$SiteCode) # check filter worked
unique(jordDO$Parameter) # check filter worked
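A slightly more robust sketch than the substr()/gsub() trick above (my suggestion, assuming EventDate uses the m/d/Y layout shown earlier): parse the full date first, then pull the month out of the Date object.

```r
d <- as.Date("5/12/2020", format = "%m/%d/%Y")
as.numeric(format(d, "%m")) # 5
```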
jordDO_sum <- jordDO |> group_by(month) |> 
  summarize(mean_DO = mean(Value),
            num_meas = n(),
            se_DO = sd(Value)/sqrt(num_meas))
jordDO_sum
ggplot(data = jordDO_sum, aes(x = month, y = mean_DO)) +
  geom_col(fill = "#74AAE3", color = "dimgrey", width = 0.75) +
  geom_errorbar(aes(ymin = mean_DO - 1.96*se_DO, ymax = mean_DO + 1.96*se_DO), 
                width = 0.75) +
  theme_bw() +
  labs(x = NULL, y = "Dissolved Oxygen mg/L") +
  scale_x_continuous(limits = c(4, 11),
                     breaks = c(seq(5, 10, by = 1)),
                     labels = c("May", "Jun", "Jul", 
                                "Aug", "Sep", "Oct"))

ggplot(data = jordDO, aes(x = month, y = Value, group = month)) +
  geom_boxplot(outliers = F) + 
  geom_point(alpha = 0.2) +
  theme_bw() +
  labs(x = NULL, y = "Dissolved Oxygen mg/L") +
  scale_x_continuous(limits = c(4, 11),
                     breaks = c(seq(5, 10, by = 1)),
                     labels = c("May", "Jun", "Jul", 
                                "Aug", "Sep", "Oct"))

#------------------------------------
#         Day 3: Pivot Code 
#------------------------------------
library(dplyr)
library(tidyr) # for pivot_wider() and pivot_longer()
library(stringr) # for word()
bat_cap <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/example_bat_capture_data.csv")
head(bat_cap)
str(bat_cap)
bat_cap <- read.csv("./data/example_bat_capture_data.csv")
bat_cap <- bat_cap |> 
  mutate(genus = toupper(word(Latin, 1)), # capitalize and extract first word in Latin
         species = toupper(word(Latin, 2)), # capitalize and extract second word in Latin
         sppcode = paste0(substr(genus, 1, 3), # combine first 3 characters of genus and species
                            substr(species, 1, 3))) |> 
  select(-genus, -species) # drop temporary columns         

head(bat_cap)
bat_sum <- bat_cap |> 
  summarize(num_indiv = sum(!is.na(sppcode)), # I prefer this over n()
            .by = c("Site", "Year", "sppcode")) |> 
  arrange(Site, Year, sppcode) # helpful for ordering the future wide columns
bat_wide <- bat_sum |> pivot_wider(names_from = sppcode, values_from = num_indiv)
head(bat_wide)
bat_wide <- bat_sum |> pivot_wider(names_from = sppcode, 
                                   values_from = num_indiv, 
                                   values_fill = 0)
head(bat_wide)
table(complete.cases(bat_wide)) # all true; no blanks
bat_wide2 <- bat_sum |> pivot_wider(names_from = sppcode, 
                                    values_from = num_indiv, 
                                    values_fill = 0, 
                                    names_prefix = "spp_")
head(bat_wide2)
bat_wide_yr <- pivot_wider(bat_sum, 
                           names_from = Year, 
                           values_from = num_indiv, 
                           values_fill = 0, 
                           names_prefix = "yr")
head(bat_wide_yr)
bat_long <- bat_wide |> pivot_longer(cols = -c(Site, Year), 
                                     names_to = "sppcode", 
                                     values_to = "num_indiv")
head(bat_long)
bat_long_yr <- pivot_longer(bat_wide_yr, 
                            cols = -c(Site, sppcode),
                            names_to = "Year", 
                            values_to = "num_indiv", 
                            names_prefix = "yr") # drops this string from values
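As a sanity check on these reshapes (a toy sketch, not from the course data): widening and then lengthening should round-trip back to the original rows once the zeros added by values_fill are dropped.

```r
library(dplyr)
library(tidyr)
long <- tibble(Site = c("A", "A", "B"), spp = c("x", "y", "x"), n = c(1, 2, 3))
wide <- pivot_wider(long, names_from = spp, values_from = n, values_fill = 0)
back <- pivot_longer(wide, cols = -Site, names_to = "spp", values_to = "n") |>
  filter(n > 0) # drops the filled zeros; fine here because no true zero counts exist
all.equal(as.data.frame(back), as.data.frame(long)) # TRUE
```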
#------------------------------------
#        Day 3: Join Code 
#------------------------------------
bat_sites <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/bat_site_info.csv")
sort(unique(bat_sites$Site)) # Sites 1, 2, 3, 4, 5
sort(unique(bat_wide$Site)) # Sites 1, 2, 3, 5, 6
bat_full <- full_join(bat_sites, bat_wide, by = "Site")
table(bat_full$Site)
knitr::kable(bat_full, align = 'c') |> 
  kableExtra::scroll_box(height = "300px") |> 
  kableExtra::kable_styling(full_width = F, html_font = 'Arial', font_size = 10) |> 
  kableExtra::column_spec(1:10, background = 'white', include_thead = T)
bat_inner <- inner_join(bat_sites, bat_wide, by = "Site")
table(bat_inner$Site)
knitr::kable(bat_inner) |> 
  kableExtra::scroll_box(height = "300px") |> 
  kableExtra::kable_styling(full_width = F, html_font = 'Arial', font_size = 10) |> 
  kableExtra::column_spec(1:10, background = 'white', include_thead = T)
bat_left <- left_join(bat_sites, bat_wide, by = "Site")
table(bat_left$Site)
knitr::kable(bat_left) |> 
  kableExtra::scroll_box(height = "300px") |> 
  kableExtra::kable_styling(full_width = F, html_font = 'Arial', font_size = 10) |> 
  kableExtra::column_spec(1:10, background = 'white', include_thead = T)
bat_right <- right_join(bat_sites, bat_wide, by = "Site")
table(bat_right$Site)
knitr::kable(bat_right) |> 
  kableExtra::scroll_box(height = "300px") |> 
  kableExtra::kable_styling(full_width = F, html_font = 'Arial', font_size = 10) |> 
  kableExtra::column_spec(1:10, background = 'white', include_thead = T)
anti_join(bat_sites, bat_wide, by = "Site")
anti_join(bat_wide, bat_sites, by = "Site")
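One more join worth knowing (my addition): semi_join() is the filtering complement of anti_join(). It keeps the rows of x that DO have a match in y, without adding any of y's columns. A toy sketch:

```r
library(dplyr)
x <- data.frame(Site = c(1, 2, 3), val = c(10, 20, 30))
y <- data.frame(Site = c(2, 3, 4))
semi_join(x, y, by = "Site") # rows for Sites 2 and 3, columns of x only
anti_join(x, y, by = "Site") # row for Site 1
```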
spp_tbl <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_species_table.csv")
trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")
# find the columns in common
intersect(names(spp_tbl), names(trees)) # TSN and ScientificName
# left join species to trees, because don't want to include species not found in tree data
trees_spp <- left_join(trees, 
                       spp_tbl |> select(TSN, ScientificName, CommonName), 
                       by = c("TSN", "ScientificName"))

head(trees_spp)
# find the columns in common
intersect(names(spp_tbl), names(trees)) # TSN and ScientificName
# anti join of trees against species table, selecting only columns of interest
anti_join(trees, spp_tbl, by = c("TSN", "ScientificName")) |> 
  select(ParkUnit, PlotCode, SampleYear, ScientificName)
#------------------------------------
#     Day 3: Dates and Time Code 
#------------------------------------
codes <- read.csv("./data/datetime_codes.csv", encoding = "Latin-1")
knitr::kable(codes) |> 
  #kableExtra::scroll_box(width = "300px") |> 
  kableExtra::kable_styling(full_width = F, html_font = 'Arial', font_size = 11,
                            bootstrap_options = "condensed") |> 
  kableExtra::column_spec(1:2, background = 'white', include_thead = T)
Sys.time()
class(Sys.time()) # POSIXct POSIXt
Sys.Date()
class(Sys.Date()) # Date
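Under the hood (a small sketch I've added), Date objects are stored as the number of days since 1970-01-01, which is why the date arithmetic later in this section just works.

```r
unclass(as.Date("1970-01-02")) # 1
unclass(as.Date("1971-01-01")) # 365
```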
# date with slashes and full year
date_chr1 <- "3/12/2026"
date1 <- as.Date(date_chr1, format = "%m/%d/%Y")
str(date1)
# date with dashes and 2-digit year
date_chr2 <- "3-12-26"
date2 <- as.Date(date_chr2, format = "%m-%d-%y")
str(date2)
# date written out
date_chr3 <- "March 12, 2026"
date3 <- as.Date(date_chr3, format = "%B %d, %Y") # %B matches the full month name
str(date3)
#Julian date as numeric
as.numeric(format(date1, format = "%j"))

#Return day of week
format(date1, format = "%A") 
#Return abbreviated day of week
format(date1, format = "%a") 

#Return written out date with month name
format(date1, format = "%B %d, %Y") 
#Return abbreviated written out date with month name
format(date1, format = "%b %d, %Y") 

date1 + 1 # add a day
date1 + 7 # add a week
date_list <- as.Date(c("01/01/2026", "12/31/2026"), format = "%m/%d/%Y") 
# by 15 days
seq.Date(date_list[1], date_list[2], by = "15 days")
# by month
seq.Date(date_list[1], date_list[2], by = "1 month")
# by 6 months
seq.Date(date_list[1], date_list[2], by = "6 months")
format(date1, format = "%Y%m%d")
date_list <- as.Date(c("01/01/2026", "12/31/2026"), format = "%m/%d/%Y") 
seq.Date(date_list[1], date_list[2], by = "3 months")
date_list <- as.Date(c("01/01/2026", "12/31/2026"), format = "%m/%d/%Y") 
seq.Date(date_list[1], date_list[2], by = "1 week")
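Subtracting two Dates is also useful (a hedged extra, not in the original): the result is a difftime in days, and difftime() can express it in other units.

```r
d1 <- as.Date("2026-03-12")
d2 <- as.Date("2026-03-26")
d2 - d1                           # Time difference of 14 days
as.numeric(d2 - d1)               # 14
difftime(d2, d1, units = "weeks") # 2 weeks
```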
unclass(as.POSIXct("2026-03-12 01:30:00", format = "%Y-%m-%d %H:%M:%S", tz = "America/New_York"))
unclass(as.POSIXlt("2026-03-12 01:30:00", format = "%Y-%m-%d %H:%M:%S", tz = "America/New_York"))
Sys.timezone()
OlsonNames()
temp_data1 <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/HOBO_temp_example.csv")
head(temp_data1)
temp_data <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/HOBO_temp_example.csv", skip = 1)[,1:3]
colnames(temp_data) <- c("index", "date_time", "tempF")
View(temp_data)
knitr::kable(temp_data[1:50,], caption = "First 50 rows of temp_data") |> 
  kableExtra::scroll_box(height = "300px") |> 
  kableExtra::kable_styling(full_width = F, html_font = 'Arial', font_size = 10) |> 
  kableExtra::column_spec(1:3, background = 'white', include_thead = T)
temp_data$timestamp <- as.POSIXct(temp_data$date_time, 
                                  format = "%m/%d/%Y %H:%M", 
                                  tz = "America/New_York")
head(temp_data)
temp_data$date <- format(temp_data$timestamp, "%Y%m%d") 
temp_data$month <- format(temp_data$timestamp, "%b")
temp_data$time <- format(temp_data$timestamp, "%I:%M") # time on a 12-hour clock
temp_data$hour <- as.numeric(format(temp_data$timestamp, "%I")) # 12-hour clock hour; use "%H" for 24-hour
head(temp_data)
temp_data$month_num <- as.numeric(format(temp_data$timestamp, "%m"))
head(temp_data)
temp_data$julian <- as.numeric(format(temp_data$timestamp, "%j"))
head(temp_data)
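To show why the extracted date parts are handy (a self-contained toy sketch; the same call works on temp_data's date and tempF columns): aggregate logger readings to a daily mean with base R's aggregate().

```r
toy <- data.frame(date = c("20260312", "20260312", "20260313"),
                  tempF = c(40, 44, 50))
aggregate(tempF ~ date, data = toy, FUN = mean)
# 20260312 -> 42, 20260313 -> 50
```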
#------------------------------------
#       Day 2: Adv. ggplot Code 
#------------------------------------
library(dplyr)
library(ggplot2)
library(patchwork) # for arranging ggplot objects
library(RColorBrewer) # for palettes
library(viridis) # for palettes
chem <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_water_chemistry_data.csv")
str(chem)
library(scales)
chem <- chem |> mutate(date = as.Date(EventDate, "%m/%d/%Y"),
                       year = as.numeric(format(date, "%Y")),
                       mon = as.numeric(format(date, "%m")),
                       doy = as.numeric(format(date, "%j"))) 

ACAD_lakes <- c("ACBUBL", "ACEAGL", "ACECHO", "ACJORD", 
                "ACLONG", "ACSEAL", "ACUHAD", "ACWHOL")

lakes_temp <- chem |> filter(SiteCode %in% ACAD_lakes) |> 
  filter(Parameter %in% "Temp_F") 

head(lakes_temp)
ggplot(lakes_temp, aes(x = date, y = Value)) + 
  theme_bw() +
  geom_point() 
ggplot(lakes_temp, 
       aes(x = date, y = Value, color = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_point() 

ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, 
                       shape = SiteCode)) + 
  theme_bw() +
  geom_point() +
  scale_color_manual(values = c("ACBUBL" = "#7FC7E1", "ACEAGL" = "#ED5C5C", 
                                "ACECHO" = "#F59617", "ACJORD" = "#A26CF5", 
                                "ACLONG" = "#1952CF", "ACSEAL" = "#F3F56C", 
                                "ACUHAD" = "#FFA6C8", "ACWHOL" = "#8FD184")) +
  scale_fill_manual(values = c("ACBUBL" = "#7FC7E1", "ACEAGL" = "#ED5C5C", 
                               "ACECHO" = "#F59617", "ACJORD" = "#A26CF5", 
                               "ACLONG" = "#1952CF", "ACSEAL" = "#F3F56C", 
                               "ACUHAD" = "#FFA6C8", "ACWHOL" = "#8FD184")) +
  scale_shape_manual(values = c("ACBUBL" = 21, "ACEAGL" = 23, 
                                "ACECHO" = 24, "ACJORD" = 25, 
                                "ACLONG" = 23, "ACSEAL" = 21, 
                                "ACUHAD" = 25, "ACWHOL" = 24))

site_cols <- c("ACBUBL" = "#7FC7E1", "ACEAGL" = "#ED5C5C", 
               "ACECHO" = "#F59617", "ACJORD" = "#A26CF5", 
               "ACLONG" = "#1952CF", "ACSEAL" = "#F3F56C", 
               "ACUHAD" = "#FFA6C8", "ACWHOL" = "#8FD184")
site_shps <- c("ACBUBL" = 21, "ACEAGL" = 23, 
               "ACECHO" = 24, "ACJORD" = 25, 
               "ACLONG" = 23, "ACSEAL" = 21, 
               "ACUHAD" = 25, "ACWHOL" = 24)

ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_point() +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") 

ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_point(color = "dimgrey", size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") 
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_point(color = "dimgrey", size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(x = "Year", y = "Temp. (F)") # answer
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_line() + # new line
  geom_point(color = "dimgrey") +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5, linewidth = 1) + # new line
  geom_point(color = "dimgrey", alpha = 0.5) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 
# Need to run these first
site_cols <- c("ACBUBL" = "#7FC7E1", "ACEAGL" = "#ED5C5C", 
               "ACECHO" = "#F59617", "ACJORD" = "#A26CF5", 
               "ACLONG" = "#1952CF", "ACSEAL" = "#F3F56C", 
               "ACUHAD" = "#FFA6C8", "ACWHOL" = "#8FD184")
site_shps <- c("ACBUBL" = 21, "ACEAGL" = 23, 
               "ACECHO" = 24, "ACJORD" = 25, 
               "ACLONG" = 23, "ACSEAL" = 21, 
               "ACUHAD" = 25, "ACWHOL" = 24)

# Now generate the plot
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.85) + # or remove the alpha
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  geom_point(color = "dimgrey", alpha = 0.5) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 
p_site <- 
  ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                         fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(y = "Temperature (F)", x = "Year") +
  facet_wrap(~SiteCode, ncol = 4)
p_site
p_year <- 
  ggplot(lakes_temp |> filter(year > 2015), 
         aes(x = doy, y = Value, color = SiteCode, 
             fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_line(linewidth = 0.7) +
  geom_point(color = "dimgrey", size = 2.5) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(y = "Temperature (F)", x = "Day of Year") +
  facet_wrap(~year, ncol = 3) 
p_year
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5) +
  geom_point(color = "black", alpha = 0.6, size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(y = "Temp. (F)", x = "Year") +
  facet_wrap(~SiteCode, ncol = 2) +
  theme(legend.position = 'bottom')

pH <- chem |> filter(SiteCode == "ACJORD") |> filter(Parameter == "pH")
temp <- chem |> filter(SiteCode == "ACJORD") |> filter(Parameter == "Temp_F")
dosat <- chem |> filter(SiteCode == "ACJORD") |> filter(Parameter == "DOsat_pct")
cond <- chem |> filter(SiteCode == "ACJORD") |> filter(Parameter == "SpCond_uScm")

p_pH <-
  ggplot(pH, aes(x = date, y = Value)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) +
  labs(y = "pH", x = "Year")  

p_temp <-
  ggplot(temp, aes(x = date, y = Value)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) +
  labs(y = "Temp (F)", x = "Year")  

p_do <-
  ggplot(dosat, aes(x = date, y = Value)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) +
  labs(y = "DO (%sat.)", x = "Year")  

p_cond <-
  ggplot(cond, aes(x = date, y = Value)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) +
  labs(y = "Spec. Cond. (uScm)", x = "Year")  

library(patchwork)
p_pH + p_temp + p_do + p_cond
library(patchwork)
p_pH / p_temp / p_do / p_cond + plot_layout(axes = "collect_x")
p_site2 <- p_site + 
  scale_x_date(date_labels = "%Y", date_breaks = "2 years") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

p_site2
# Find range of months in the data
range_mon <- range(lakes_temp$mon)
range_mon # 5:10

# Set up the date range as a Date type
range_date <- as.Date(c("5/1/2025", "11/01/2025"), format = "%m/%d/%Y")
axis_dates <- seq.Date(range_date[1], range_date[2], by = "1 month")
axis_dates
axis_dates_label <- format(axis_dates, "%b-%d")

# Find the doy value that matches each of the axis dates
axis_doy <- as.numeric(format(axis_dates, "%j"))
axis_doy
# Extend the x axis limits one day before and after the axis dates,
# otherwise May 1 gets cut off
axis_doy_limits <- c(min(axis_doy) - 1, max(axis_doy) + 1)
p_year + 
  scale_x_continuous(limits = axis_doy_limits,
                     breaks = axis_doy, 
                     labels = axis_dates_label) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

p_site + 
  scale_x_date(date_labels = "%Y", date_breaks = "4 years") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
p_site3 <- p_site2 + theme(legend.position = 'bottom')
p_site3
p_site3 + geom_hline(yintercept = 75, linetype = 'dashed', linewidth = 1) +
          geom_hline(yintercept = 50, linetype = 'dotted', linewidth = 1)
p_site4 <- p_site3 + 
  geom_hline(aes(yintercept = 75, linetype = "Upper"), linewidth = 1) +
  geom_hline(aes(yintercept = 50, linetype = "Lower"), linewidth = 1) +
  scale_linetype_manual(values = c("Upper" = "dashed", "Lower" = "dotted"), name = "WQ Threshold")

p_site4
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5, show.legend = FALSE) + 
  geom_point(color = "dimgrey", alpha = 0.5, size = 2) + 
  
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(y = "Temperature (F)", x = "Year") +
  facet_wrap(~SiteCode, ncol = 4) + 
  scale_x_date(date_labels = "%Y", date_breaks = "2 years") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5), 
        legend.position = 'bottom') + 
  
  geom_hline(aes(yintercept = 75, linetype = "Upper"), linewidth = 1) +
  geom_hline(aes(yintercept = 50, linetype = "Lower"), linewidth = 1) +
  scale_linetype_manual(values = c("Upper" = "dashed", "Lower" = "dotted"), name = "WQ Threshold") +
  
  guides(fill = guide_legend(override.aes = list(alpha = 1)))

p_site4 + guides(shape = 'none', color = 'none', fill = 'none') +
          theme(legend.key.width = unit(0.8, "cm"))

#------------------------------------
#     Day 3: ggplot Palettes Code 
#------------------------------------
display.brewer.all(colorblindFriendly = TRUE)
p_pal <- ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                                fill = SiteCode, shape = SiteCode)) + 
         theme_bw() +
         geom_smooth(se = FALSE, span = 0.5, linewidth = 1) + 
         geom_point(color = "dimgrey", alpha = 0.5, size = 2) + 
  
         scale_shape_manual(values = site_shps, name = "Lake") +
         labs(y = "Temperature (F)", x = "Year") +
  
         facet_wrap(~SiteCode, ncol = 4) + 
  
         scale_x_date(date_labels = "%Y", date_breaks = "2 years") +
         theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + 
  
         geom_hline(aes(yintercept = 75, linetype = "Upper"), linewidth = 1) +
         geom_hline(aes(yintercept = 50, linetype = "Lower"), linewidth = 1) +
         scale_linetype_manual(values = c("Upper" = "dashed", "Lower" = "dotted"), name = "WQ Threshold")
p_pal +  scale_color_brewer(name = "Lake", palette = "Set2", aesthetics = c("fill", "color"))  +
         guides(fill = guide_legend(override.aes = list(alpha = 1))) # solid symbols in legend

p_pal +  scale_color_brewer(name = "Lake", palette = "Dark2", aesthetics = c("fill", "color"))  +
         guides(fill = guide_legend(override.aes = list(alpha = 1))) # makes symbols not transparent in legend


p_pal +  scale_color_brewer(name = "Lake", palette = "RdYlBu", aesthetics = c("fill", "color"))  +
         guides(fill = guide_legend(override.aes = list(alpha = 1))) # makes symbols not transparent in legend
# viridis 
scales::show_col(viridis(12), cex_label = 0.45, ncol = 6)
p_pal + scale_color_viridis_d(name = "Lake", aesthetics = c("fill", "color")) # default viridis
p_pal + scale_color_viridis_d(name = "Lake", aesthetics = c("fill", "color"), option = 'turbo') 
p_heat <- 
ggplot(lakes_temp, aes(x = mon, y = year, color = Value, fill = Value)) + 
  theme_bw() +
  geom_tile() + 
  labs(y = "Year", x = "Month") +
  facet_wrap(~SiteCode, ncol = 4) + 
  scale_x_continuous(breaks = c(5, 6, 7, 8, 9, 10),
                     limits = c(4, 11), 
                     labels = month.abb[5:10]) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

p_heat + scale_color_viridis_c(name = "Temp. (F)", aesthetics = c("fill", "color")) 
p_heat + scale_color_viridis_c(name = "Temp. (F)", aesthetics = c("fill", "color"), 
                               option = "plasma", direction = -1) 
p_heat + scale_color_gradient(low = "#FCFC9A", high = "#F54927", 
                              aesthetics = c("fill", 'color'), 
                              name = "Temp. (F)") 

p_heat + scale_color_gradient2(low = "navy", mid = "#FCFC9A", high = "#F54927", 
                               aesthetics = c("fill", 'color'),
                               midpoint = mean(lakes_temp$Value), 
                               name = "Temp. (F)") 

p_heat + scale_color_gradientn(colors = c("#805A91", "#406AC2", "#FBFFAD", "#FFA34A", "#AB1F1F"), 
                               aesthetics = c("fill", 'color'),
                               guide = "legend",
                               breaks = c(seq(40, 85, 5)), 
                               name = "Temp. (F)") 

p_heat + scale_color_gradient2(low = "#3E693D", mid = "#FDFFC7", high = "#7A6646", 
                               aesthetics = c("fill", 'color'),
                               midpoint = mean(lakes_temp$Value), 
                               name = "Temp. (F)") 
#------------------------------------
#     Day 3: Best Practices Code 
#------------------------------------
# libraries
library(dplyr) # for mutate and filter

# parameters
analysis_year <- 2017

# data sets
df <- read.csv("./data/ACAD_wetland_data_clean.csv")

# Filter to RAM sites, create a numeric site column, and keep only the specified year
df2 <- df |> filter(Site_Type == "RAM") |> 
             filter(Year == analysis_year) |> 
             mutate(site_num = as.numeric(substr(Site_Name, 5, 6)))
snake_case # most common in R
camelCase # capitalize new words after the first
period.separation # separate words by periods
whyWOULDyouDOthisTOsomeone # excess capitalization is a pain
# good word order
ACAD_wet <- data.frame(year = 2020:2025, plot = 1:6)
ACAD_wet2 <- ACAD_wet |> filter(year > 2020)
ACAD_wet3 <- ACAD_wet2 |> mutate(plot_type = "RAM")

# bad word order
wet_ACAD <- data.frame(year = 2020:2025, plot = 1:6)
ACAD_after_2020 <- wet_ACAD |> filter(year > 2020)
RAM_ACAD_2020 <- ACAD_after_2020 |> mutate(plot_type = "RAM")

# super long names
ACAD_wetland_sampling_data <- data.frame(years_plots_were_sampled = c(2020:2025), wetland_plots_sampled = c(1:6))
ACAD_wetland_sampling_data2 <- ACAD_wetland_sampling_data |> filter(years_plots_were_sampled > 2020)

# shorter still meaningful
ACAD_wet <- data.frame(year = 2020:2025, plot = 1:6)
ACAD_wet2 <- ACAD_wet |> filter(year > 2020)
# Good code
trees_final <- trees |> 
  mutate(DecayClassCode_num = as.numeric(DecayClassCode),
         Plot_Name = paste(ParkUnit, PlotCode, sep = "-"),
         Date = as.Date(SampleDate, format = "%m/%d/%Y")) |> 
  rename("Species" = "ScientificName") |> 
  filter(IsQAQC == FALSE) |> 
  select(-DecayClassCode) |> 
  arrange(Plot_Name, TagCode)

# Same code, but much harder to follow
trees_final <- trees|>mutate(DecayClassCode_num=as.numeric(DecayClassCode), Plot_Name=paste(ParkUnit,PlotCode,sep = "-"),  Date=as.Date(SampleDate,format="%m/%d/%Y"))|> rename("Species"="ScientificName")|>filter(IsQAQC==FALSE)|>select(-DecayClassCode)|>arrange(Plot_Name,TagCode)
# Good code
ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000)) +
  geom_line() + 
  geom_point(color = "black", fill = "#82C2a3", size = 2.5, shape = 24) +
  labs(x = "Year", 
       y = "Annual visitors in 1000's") +
  scale_y_continuous(limits = c(2000, 4500),
                     breaks = seq(2000, 4500, by = 500)) + 
  scale_x_continuous(limits = c(1994, 2024),
                     breaks = c(seq(1994, 2024, by = 5))) + 
  theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1), 
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(), 
        panel.background = element_rect(fill = 'white', color = 'dimgrey'),
        title = element_text(size = 10) 
        )

# Same code but hard to follow
ggplot(data=visits,aes(x=Year,y=Annual_Visits/1000))+geom_line()+geom_point(color="black",fill="#82C2a3",size=2.5,shape=24) +
labs(x = "Year", y = "Annual visitors in 1000's")+
scale_y_continuous(limits=c(2000,4500),breaks=seq(2000,4500,by=500))+ 
scale_x_continuous(limits=c(1994,2024),breaks=c(seq(1994,2024,by=5)))+ 
theme(axis.text.x=element_text(size=10,angle=45,hjust=1), panel.grid.major=element_blank(), 
panel.grid.minor=element_blank(),panel.background=element_rect(fill='white',color='dimgrey'),
title = element_text(size = 10))
#------------------------------------
#     Day 1: Challenges Code 
#------------------------------------
ACAD_wetland <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_wetland_data_clean.csv"
  )
trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")
ACAD_wetland[c(2, 4, 6, 8), c(1, 2)]
# Option 1
length(unique(ACAD_wetland[, "Latin_Name"]))
# Option 2
length(unique(ACAD_wetland$Latin_Name)) # equivalent
# Option 1 - used unique to just return unique site name
unique(ACAD_wetland$Site_Name[ACAD_wetland$Protected == TRUE])

# Option 2
unique(ACAD_wetland[ACAD_wetland$Protected == TRUE, "Site_Name"])
mima12 <- subset(trees, PlotCode == 12)
nrow(mima12) # 12
mima12_as <- subset(trees, PlotCode == 12 & TreeStatusCode == "AS")
nrow(mima12_as) # 6
mima12 <- trees[trees$Plot_Name == "MIMA-012",]
table(mima12$TreeStatusCode) # 6

View(trees)
max_dbh <- max(trees$DBHcm, na.rm = TRUE)
max_dbh # 443
trees[trees$DBHcm == max_dbh, ]
# Plot MIMA-016, TagCode = 1
# create copy of trees data
trees_fix <- trees

# find the problematic DBH value, and change it to 44.3
trees_fix$DBHcm[trees_fix$Plot_Name == "MIMA-016" & trees_fix$TagCode == 1 & trees_fix$DBHcm == 443] <- 44.3
range(trees$DBHcm)
range(trees_fix$DBHcm)
hist(ACAD_wetland$Ave_Cov)
#---- Day 2: Challenges Code ----
library(dplyr)
trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")
trees_final <- trees |> 
  mutate(DecayClassCode_num = as.numeric(DecayClassCode),
         Plot_Name = paste(ParkUnit, PlotCode, sep = "-"),
         Date = as.Date(SampleDate, format = "%m/%d/%Y")) |> 
  rename("Species" = "ScientificName") |> 
  filter(IsQAQC == FALSE) |> 
  select(-DecayClassCode)
ACAD_wetland <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_wetland_data_clean.csv"
  )
library(ggplot2)
library(dplyr) # for mutate
visits <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/ACAD_annual_visits.csv")
visits <- visits |> mutate(Annual_Visits = as.numeric(gsub(",", "", Annual_Visits)))
trees_final |> filter(Plot_Name == "MIMA-12") |> nrow()
trees_final |> filter(Plot_Name == "MIMA-12" & TreeStatusCode == "AS") |> nrow()
# Base R and dplyr combo
max_dbh <- max(trees_final$DBHcm, na.rm = TRUE)
trees_final |> 
  filter(DBHcm == max_dbh) |> 
  select(Plot_Name, SampleYear, TagCode, Species, DBHcm)

# dplyr with slice
trees_final |> 
  arrange(desc(DBHcm)) |> # arrange DBHcm high to low via desc()
  slice(1) |> # slice the top record
  select(Plot_Name, SampleYear, TagCode, Species, DBHcm)

# Base R
# create copy of trees data
trees_fix <- trees
# find the problematic DBH value, and change it to 44.3
trees_fix$DBHcm[trees_fix$Plot_Name == "MIMA-016" & trees_fix$TagCode == 1 & trees_fix$DBHcm == 443] <- 44.3

# dplyr via replace
trees_fix <- trees |> mutate(DBHcm = replace(DBHcm, DBHcm == 443.0, 44.3))
range(trees$DBHcm)
range(trees_fix$DBHcm)
# read in wetland data if you don't already have it loaded.
ACAD_wetland <- read.csv("./data/ACAD_wetland_data_clean.csv")

# Base R using the with() function
ACAD_wetland$Status <- with(ACAD_wetland, ifelse(Protected == TRUE, "protected", "public"))
table(ACAD_wetland$Status, ACAD_wetland$Protected)
# Tidyverse
ACAD_wetland <- ACAD_wetland |> mutate(Status = ifelse(Protected == TRUE, "protected", "public"))
table(ACAD_wetland$Status, ACAD_wetland$Protected) # check your work
# Base R using the with() function and nested ifelse()
ACAD_wetland$abundance_cat <- with(ACAD_wetland, ifelse(Ave_Cov < 10, "Low",
                                                        ifelse(Ave_Cov >= 10 & Ave_Cov <= 50, "Medium", "High")))
# Tidyverse using case_when() and between
ACAD_wetland <- ACAD_wetland |> mutate(abundance_cat = case_when(Ave_Cov < 10 ~ "Low",
                                                                 between(Ave_Cov, 10, 50) ~ "Medium", 
                                                                 TRUE ~ "High"))
table(ACAD_wetland$abundance_cat)
# Using group_by()
ACAD_inv <- ACAD_wetland |> group_by(Site_Name, Year, Invasive) |> 
  summarize(Pct_Cov = sum(Ave_Cov), 
            .groups = 'drop') |>  # optional line to keep console from being chatty
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_inv)

# Using summarize(.by)
ACAD_inv2 <- ACAD_wetland |> 
  summarize(Pct_Cov = sum(Ave_Cov), .by = c(Site_Name, Year, Invasive)) |> 
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_inv2) # should be the same as ACAD_inv
# Using group_by()
ACAD_spp <- ACAD_wetland |> group_by(Site_Name, Year, Invasive) |> 
  summarize(num_spp = n(), 
            .groups = 'drop') |>  # optional line to keep console from being chatty
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_spp)

# Using summarize(.by)
ACAD_spp2 <- ACAD_wetland |> 
  summarize(num_spp = n(), .by = c(Site_Name, Year, Invasive)) |> 
  arrange(Site_Name) # sort by Site_Name for easier comparison

head(ACAD_spp2) # should be the same as ACAD_spp
# using the .by within mutate (newer solution)
ACAD_wetland <- ACAD_wetland |> 
  mutate(Site_Cover = sum(Ave_Cov), 
         .by = c(Site_Name, Year)) |> 
  mutate(rel_cov = (Ave_Cov/Site_Cover)*100,
         .by = c(Site_Name, Year, Latin_Name, Common))

# older solution
ACAD_wetland <- ACAD_wetland |> group_by(Site_Name, Year) |> 
  mutate(Site_Cover = sum(Ave_Cov)) |> 
  ungroup() # good practice to ungroup after grouping

table(ACAD_wetland$Site_Name, ACAD_wetland$Site_Cover) # check that each site has a unique value.
head(ACAD_wetland)
# Create a new dataset, since summarize() collapses rows to the grouping variables
# Using group_by() and summarize()
ACAD_wetland_relcov <- ACAD_wetland |> group_by(Site_Name, Year, Latin_Name, Common) |> 
  summarize(rel_cov = (Ave_Cov/Site_Cover)*100,
            .groups = 'drop') # .groups = 'drop' already ungroups the result

# Using summarize(.by = ) 
ACAD_wetland_relcov2 <- ACAD_wetland |> 
  summarize(rel_cov = (Ave_Cov/Site_Cover)*100, 
            .by = c("Site_Name", "Year", "Latin_Name", "Common"))

# Check that your relative cover sums to 100 for each site
relcov_check <- ACAD_wetland_relcov2 |> group_by(Site_Name, Year) |> 
  summarize(tot_relcov = sum(rel_cov), .groups = 'drop')
table(relcov_check$tot_relcov) # they should all be 100

pq <- 
  ggplot(data = visits, aes(x = Year, y = Annual_Visits/1000)) +
    geom_line(linewidth = 0.75, linetype = 'dashed') + 
    geom_point(color = "black", fill = "#0080FF", size = 2.5, shape = 21) +
    labs(x = "Year", y = "Annual visits in 1,000s") +
    scale_y_continuous(limits = c(2000, 4500),
                       breaks = seq(2000, 4500, by = 500)) + 
    scale_x_continuous(limits = c(1994, 2024),
                       breaks = c(seq(1994, 2024, by = 5))) + 
    theme_classic()
pq

#------------------------------------
#       Day 3: Challenges Code 
#------------------------------------
library(dplyr)

# bat capture data
bat_cap <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/example_bat_capture_data.csv")
library(dplyr)
# tree data
trees <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_data.csv")
# tree species table
spp_tbl <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_tree_species_table.csv")
# Hobo Temp data
temp_data <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/HOBO_temp_example.csv", skip = 1)[,1:3]
colnames(temp_data) <- c("index", "temp_time", "tempF")
temp_data$timestamp_temp <- as.POSIXct(temp_data$temp_time, 
                                       format = "%m/%d/%Y %H:%M", 
                                       tz = "America/New_York")
# Water chemistry data for ggplot section
library(dplyr)
library(ggplot2)
library(patchwork) # for arranging ggplot objects
library(RColorBrewer) # for palettes
library(viridis) # for palettes
chem <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_2026/refs/heads/main/data/NETN_water_chemistry_data.csv")

chem <- chem |> mutate(date = as.Date(EventDate, "%m/%d/%Y"),
                       year = as.numeric(format(date, "%Y")),
                       mon = as.numeric(format(date, "%m")),
                       doy = as.numeric(format(date, "%j"))) 

ACAD_lakes <- c("ACBUBL", "ACEAGL", "ACECHO", "ACJORD", 
                "ACLONG", "ACSEAL", "ACUHAD", "ACWHOL")

lakes_temp <- chem |> filter(SiteCode %in% ACAD_lakes) |> 
  filter(Parameter %in% "Temp_F") 
library(tidyr) # for pivot_wider/pivot_longer
# bat_sum is assumed to be a site x species x year summary of bat_cap from an earlier challenge
bat_wide_yr <- pivot_wider(bat_sum, names_from = Year, 
                           values_from = num_indiv, 
                           values_fill = 0, 
                           names_prefix = "yr")
head(bat_wide_yr)
bat_long_yr <- pivot_longer(bat_wide_yr, 
                            cols = -c(Site, sppcode),
                            names_to = "Year", 
                            values_to = "num_indiv", 
                            names_prefix = "yr") # drops this string from values
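The pivot chunks above depend on a `bat_sum` object built earlier. As a minimal, self-contained sketch of the same wide/long round trip (toy data frame and made-up counts, not the real bat captures):

```r
library(tidyr)

# toy summary: one row per site x species x year (hypothetical values)
toy <- data.frame(Site = c("A", "A", "B"),
                  sppcode = c("MYLU", "MYLU", "EPFU"),
                  Year = c(2020, 2021, 2020),
                  num_indiv = c(3, 5, 2))

# wide: one column per year; zeros fill combinations never observed
toy_wide <- pivot_wider(toy, names_from = Year, values_from = num_indiv,
                        values_fill = 0, names_prefix = "yr")

# long again: names_prefix strips "yr" so Year matches the original values
toy_long <- pivot_longer(toy_wide, cols = -c(Site, sppcode),
                         names_to = "Year", values_to = "num_indiv",
                         names_prefix = "yr")
```

Note that `pivot_longer()` returns `Year` as character; wrap it in `as.numeric()` if you need it numeric again.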
# find the columns in common
intersect(names(spp_tbl), names(trees)) # TSN and ScientificName
# left join the species table to trees; we don't want species that aren't in the tree data
trees_spp <- left_join(trees, 
                       spp_tbl |> select(TSN, ScientificName, CommonName), 
                       by = c("TSN", "ScientificName"))

head(trees_spp)
# find the columns in common
intersect(names(spp_tbl), names(trees)) # TSN and ScientificName
# anti join of trees against species table, selecting only columns of interest
anti_join(trees, spp_tbl, by = c("TSN", "ScientificName")) |> 
  select(ParkUnit, PlotCode, SampleYear, ScientificName)
# assumes date1 is an existing Date object
format(date1, format = "%Y%m%d")
date_list <- as.Date(c("01/01/2026", "12/31/2026"), format = "%m/%d/%Y") 
seq.Date(date_list[1], date_list[2], by = "3 months")
date_list <- as.Date(c("01/01/2026", "12/31/2026"), format = "%m/%d/%Y") 
seq.Date(date_list[1], date_list[2], by = "1 week")
temp_data$month_num <- as.numeric(format(temp_data$timestamp_temp, "%m"))
head(temp_data)
temp_data$julian <- as.numeric(format(temp_data$timestamp_temp, "%j"))
head(temp_data)
# Will need to run these first
site_cols <- c("ACBUBL" = "#7FC7E1", "ACEAGL" = "#ED5C5C", 
               "ACECHO" = "#F59617", "ACJORD" = "#A26CF5", 
               "ACLONG" = "#1952CF", "ACSEAL" = "#F3F56C", 
               "ACUHAD" = "#FFA6C8", "ACWHOL" = "#8FD184")
site_shps <- c("ACBUBL" = 21, "ACEAGL" = 23, 
               "ACECHO" = 24, "ACJORD" = 25, 
               "ACLONG" = 23, "ACSEAL" = 21, 
               "ACUHAD" = 25, "ACWHOL" = 24)
# Now generate the plot
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_point(color = "dimgrey", size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(x = "Year", y = "Temp. (F)") # answer
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.5) +
  geom_point(color = "dimgrey", alpha = 0.85) + # or remove the alpha
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  geom_point(color = "dimgrey", alpha = 0.5) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake")  +
  labs(x = "Year", y = "Temp. (F)") 
ggplot(lakes_temp, aes(x = date, y = Value, color = SiteCode, 
                       fill = SiteCode, shape = SiteCode)) + 
  theme_bw() +
  geom_smooth(se = FALSE, span = 0.5) +
  geom_point(color = "black", alpha = 0.6, size = 2) +
  scale_color_manual(values = site_cols, name = "Lake") +
  scale_fill_manual(values = site_cols, name = "Lake") +
  scale_shape_manual(values = site_shps, name = "Lake") +
  labs(y = "Temp. (F)", x = "Year") +
  facet_wrap(~SiteCode, ncol = 2) +
  theme(legend.position = 'bottom')

p_site + 
  scale_x_date(date_labels = "%Y", date_breaks = "4 years") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
p_pal +  scale_color_brewer(name = "Lake", palette = "RdYlBu", aesthetics = c("fill", "color"))  +
         guides(fill = guide_legend(override.aes = list(alpha = 1))) # makes symbols not transparent in legend
p_heat + scale_color_gradient2(low = "#3E693D", mid = "#FDFFC7", high = "#7A6646", 
                               aesthetics = c("fill", 'color'),
                               midpoint = mean(lakes_temp$Value), 
                               name = "Temp. (F)") 
?plot
?dplyr::filter
mean_x <- mean(c(1, 3, 5, 7, 8, 21) # missing closing parenthesis
mean_x <- mean(c(1, 3, 5, 7, 8, 21)) # corrected
               
birds <- c("black-capped chickadee", "golden-crowned kinglet, "wood thrush") # missing quote after kinglet
birds <- c("black-capped chickadee", "golden-crowned kinglet", "wood thrush") # corrected

birds <- c("black-capped chickadee", "golden-crowned kinglet" "wood thrush") # missing comma after kinglet
birds <- c("black-capped chickadee", "golden-crowned kinglet", "wood thrush") # corrected
x_mean <- maen(x) # misspelled mean
x_mean <- mean(x) # Corrected

# Missing comma to indicate subsetting rows (records)
ACAD_wetland2 <- ACAD_wetland[!is.na(ACAD_wetland$Site_Name)]
# Correct
ACAD_wetland2 <- ACAD_wetland[!is.na(ACAD_wetland$Site_Name), ]