NPS Inventory and Monitoring Division Intro to R Training: January 31 – February 2, 2022

Prep for Training

Installing required software

The only thing you really need to do to prepare for R training is to install the latest version of R and RStudio. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed. Directions for installing R/RStudio are below. If you run into any problems, check the instructions at R for Data Science Section 1.4.1 and 1.4.2.

NOTE: If you don’t have Administrative privileges on your computer, you will have to submit an IT HelpDesk Ticket (need to be on VPN or NPS network to access this link) to install R and RStudio, which could take a while. PLEASE PLAN AHEAD!

Even if you already have R or RStudio installed, please install the latest versions of both programs. R recently went through a major version change from 3.x to 4.x with some potential code-breaking changes. The latest versions are needed to ensure everyone’s code behaves the same way.

Install the latest version of R (4.1.2 as of January 2022) Assuming you’re on a PC, the installer for the latest version of R can be downloaded from The Comprehensive R Archive Network (CRAN). You’ll want to download the R installer by clicking on “Download R 4.1.2 for Windows”, or whatever the latest version number happens to be. After it downloads, open and run the installer with default settings. Note that there’s no need to pin R to your taskbar or add a shortcut to your desktop. For 99% of your needs, you’ll run R within RStudio, and won’t ever open R directly.


Install RStudio The installer for RStudio for PCs can be downloaded here. You’ll need to click on the large blue “DOWNLOAD RSTUDIO FOR WINDOWS” button to download the installer. After it downloads, open and run the installer with default settings. I like having RStudio pinned to my taskbar, so it’s easier to find/open, but it’s up to you and your preferences.


Day 3: Required Packages

A number of packages are required to follow along with the Data Visualization session. Please try to install these ahead of time by running the code below, so you’re able to keep up.

packages <- c("tidyverse", "ggthemes", "GGally", "RColorBrewer", 
              "viridis", "scales", "plotly", "patchwork", 
              "sf", "tmap", "leaflet", "spsurvey")

# install any packages from the list above that aren't already installed
install.packages(setdiff(packages, rownames(installed.packages())))

If folks are having trouble installing tmap due to an issue with one of its dependencies, terra, try running the following code, and then reinstall tmap.

install.packages('terra', repos='https://rspatial.r-universe.dev')
Optional Reading

We know you’re all busy and have to prioritize how you spend your time. If you have any time before training starts, we highly recommend reading Chapter 2: R basics and workflows in STAT545. This is an online book based on a graduate-level statistics class that Dr. Jenny Bryan teaches and is turning into this book. She’s a stats professor turned senior programmer at RStudio, and I really like how she approaches programming in R in this book.


Structure of training

The training will take place over 4 half days. Each day will run from 10-2 MST via MS Teams. At least one trainer will also hold office hours during the hour before and the hour after each session, in case there are questions that couldn’t be handled during training.

Each training session has three trainers assigned, two of whom will be the main instructors. The third trainer will manage the chat. For most of the training, one of the trainers will share their screen to walk through the website and then demo with live coding. It will help a lot if you have 2 monitors, so you can have the shared screen on one monitor and your own session of RStudio open on the other.

Four days is barely enough to scratch the surface on what you can do in R. Our goals with this training are to help you get beyond the initial learning curve that can be really challenging to climb on your own, and to expose you to some of the really useful things R can do for you in your work. The trainers put a lot of thought and time into designing this training. We did it because we enjoy coding in R and it has greatly increased our efficiency and productivity. We hope you have a similar experience as you start out on this R learning curve. As you learn more, please share your skills and code with others to help us build a larger community of R users in IMD who can learn from each other.

Finally, to help us improve this training for future sessions, please leave feedback in the training feedback form.


Day 1: Intro to R

Intro to R & RStudio

Some Notes on Coding

Welcome to the class!

Learning (or re-learning) how to program can feel like an intimidating learning curve. Congrats on taking the first step! Before we delve into technical topics, let’s talk a little about coding in general.

It’s okay to make mistakes

Errors and broken code are not only part of the learning process, they’re part of the coding process. If you think your code is running perfectly on the first try, you probably didn’t test it thoroughly enough. It can be frustrating and discouraging when you can’t get your code to work the way you want it to, but getting stuck doesn’t make you a bad programmer. Try to approach broken code as a puzzle and a learning opportunity.

Use the resources available to you

As the saying goes,

Good coders write good code; great coders steal great code.

Google and Stack Overflow are great resources when you don’t know off the top of your head how to do something. And reach out to your colleagues too - there’s no reason to waste time writing code that someone else has already written! Just give credit where credit is due, and take some time to make sure you understand what the copied code does. And especially don’t hesitate to reach out to more experienced colleagues with questions.

About R

R is a programming language that was originally developed by statisticians for statistical computing and graphics. R is free and open source, which means you will never need a paid license to use it, and anyone who wants to can view the underlying source code and suggest fixes and improvements. Since its first official release in 1995, R has remained one of the leading programming languages for statistics and data visualization, and its capabilities continue to grow.

Every programming language has strengths and weaknesses. We like R because:

  • It is great for statistics and data exploration
  • It is good at producing publication-quality plots and tables
  • There are many well-maintained add-ons (called packages) that add useful features to R (this website was built using the rmarkdown package)
  • R is well-documented, with many learning resources available whether you are a beginner or an advanced user

For more information on the history of R, visit the R Project website.

About RStudio

RStudio is what’s called an integrated development environment (IDE). When you install R, it comes with a simple user interface that lets you write and execute code. Writing code in this interface is kind of like doing word processing in Notepad: it’s simple and straightforward, but soon you’ll start wishing for more features. This is where RStudio comes in. RStudio makes programming in R easier by color coding different types of code, autocompleting code, flagging mistakes (like spellcheck for code), and providing many useful tools with the push of a button or key stroke (e.g. viewing help info).

When you open RStudio, you typically see 4 panes:

RStudio

  • Source:
    This is where the code gets written. When you create a new script or open an existing one, it displays here. In the screenshot above, there’s a script called bat_data_wrangling.R open in the source pane. Note that if you haven’t yet opened or created a new script, you won’t see this pane until you do.
    The source pane will helpfully color-code your code to make it easier to read. It will also detect syntax errors (the coding equivalent of a grammar checker). These are flagged with a red “x” next to the line number and a squiggly line under the offending code. When you’re ready to run all or part of your script:

    • Highlight the line(s) of code you want to run
    • Either click the “Run” button (top right of the source pane) or press Ctrl+Enter. At this point, the code gets sent to the console (the bottom left pane). You’ll see your code appear in the console, and then you’ll see the output if there is any.
  • Console:
    This is where the code actually runs. When you first open RStudio, the console will tell you the version of R that you’re running (should be R 4.1.0 or greater). We talked above about how you can run a script at the console. You can also type code directly into the console and run it that way. That code won’t get saved to a file, but it’s a great way to experiment and test out lines of code before you add them to your script in the source pane.
    The console is also where errors appear if your code breaks. Deciphering errors can be a challenge sometimes, but Googling them is a good place to start.

  • Environment/History/Connections:

    • Environment: This is where you can see what is currently in your environment. Think of the environment as temporary storage for objects - things like datasets and stored values - that you are using in your script(s). You can also click on objects and view them.
      Remember that anything you see in your environment is temporary and it will disappear when you restart R. If there is something in your environment that you want to access in the future, make sure that your script is able to reproduce it (or save it to a file).
    • History: This shows the code you’ve run in the current session. It’s not good to rely on it, but it can be a way to recover code you ran at the console and later realized you needed in your script.
    • Connections: This is one way to connect R to a database. We’ll talk about this more later on.
    • Git: If you have installed Git on your computer, you may see a Git tab. We won’t talk much about it this week, but this is where you’ll keep track of changes to your code. You are encouraged to attend the Git/GitHub session during Week 2 if you’d like to learn more!
    • Tutorial: This has some interactive tutorials that you can check out if you are interested.
  • Files/Plots/Packages/Help/Viewer:

    • Files: This tab shows the files within your working directory (typically the folder where your current code project lives). More on this later.
    • Plots: This tab will show plots that you create.
    • Packages: This tab allows you to install, load, and update packages, and also view the help files within each package. This is kind of like the R equivalent of an app store. You can also do this with code.
    • Help: Allows you to search for and view documentation for packages that are installed on your computer.
    • Viewer: Shows HTML outputs produced by R Markdown, R Shiny, and some plotting and mapping packages.
Global Options

There are several settings in the Global Options that everyone should check to make sure we all have consistent settings. Go to Tools -> Global Options and follow the steps below.

  1. Under the General tab, you should see that your R Version is [64-bit] and the version is R-4.1.0 or greater. If it’s not, you probably need to update R. Let an instructor know if you have questions about this.
  2. Make sure you are not saving your environment. To do this, uncheck the Restore .RData into your workspace at startup option.
    When this option is checked, RStudio saves your current working environment (the stuff in the Environment tab) when you exit. The next time you open RStudio, it restores that same environment. This seems like a good thing, but the whole point of using R is that your code should return the same results every time you run it. Clearing your environment every time you close RStudio forces you to run your code with a clean slate so that you can be sure that it will work for everyone, not just you and your specific environment.
    If you sometimes work with huge datasets that take a long time to load and process, you may want to set Save workspace to .RData on exit to “Ask”. When you close RStudio, it will ask you if you want to save your workspace image. If you have a huge dataset (or the products of a huge dataset) in your environment, select “Yes” and then run load(".RData") at the console the next time you open RStudio to restore everything.

Most other settings are whatever you prefer, including everything in the Appearance window.

Starting a Project

When you’re working on a code project it’s important to stay organized. Keeping code, input data, and results together in one place will protect your sanity and the sanity of the person who inherits the project. RStudio projects help with this. Creating a new RStudio project for each new code project makes it easier to manage settings and file paths.

Before we create a project, take a look at the Console tab. Notice that at the top of the console there is a folder path. That path is your current working directory.

default working directory

If you refer to a file in R using a relative path, for example ./data/my_data_file.csv, R will look in your current working directory for a folder called data containing a file called my_data_file.csv. Using relative paths is a good idea because the full path is specific to your computer and likely won’t work on a different computer. But there’s no guarantee that everyone has the same default R working directory. This is where projects come in.
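You can also check the working directory from code. Here’s a quick sketch (the read.csv() line uses a made-up path purely for illustration):

# print the current working directory
getwd()

# list the files and folders in the working directory
list.files()

# a relative path like "data/my_data_file.csv" is looked up inside the working directory
# (hypothetical file - swap in a file that actually exists in your project)
# my_data <- read.csv("data/my_data_file.csv")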
Let’s make a project for this class. Click File > New Project. In the window that appears, select New Directory, then select New Project. You will be prompted for a directory name. This is the name of your project folder. For this class, call it imd-r-training-intro. Next, you’ll select what folder to keep your project folder in. Documents/R is a good place to store all of your R projects but it’s up to you. When you are done, click on Create Project.

Create new project step 1 Create new project step 2 Create new project step 3

If you successfully started a project named imd-r-training-intro, you should see it listed at the very top right of your screen. As you start new projects, you’ll want to check that you’re working in the right one before you start coding. Take a look at the Console tab again. Notice that your current working directory is now your project folder. When you look in the Files tab of the bottom right pane, you’ll see that it also defaults to the project folder.

Let’s start coding

First we need to create a new R script file. Make sure you are working in the imd-r-training-intro project that you just created. Click on the New File icon new file icon in the top left corner. In the dropdown, select R Script. The source pane will appear with an untitled empty script. Go ahead and save it by clicking the Save icon (and make sure the Source on Save checkbox is deselected). Call it my_first_script.R.

The basics

We’ll start with something simple. Basic math in R is pretty straightforward and the syntax is similar to simply using a graphing calculator. Try it out! You can use the examples below or come up with your own. Even if you’re using the examples, try to actually type the code instead of copy-pasting - you’ll learn to code faster that way.

To run a single line of code in your script, place your cursor anywhere in that line and press CTRL+ENTER (or click the Run button in the top right of the script pane). To run multiple lines of code, highlight the lines you want to run and hit CTRL+ENTER or click Run.

Text following a hashtag/pound sign (#) will be ignored - use this to leave comments about what your code is doing. Commenting your code is one of the best habits you can form, and you should start now! Comments are a gift to your future self and anyone else who tries to use your code.

# By using this hashtag/pound sign, you are telling R to ignore the text afterwards. This is useful for leaving annotations, notes, or instructions for yourself or someone else using your code

# try this line to generate some basic text and become familiar with where results will appear:
print("Hello, lets do some basic math. Results of operations will appear here")
## [1] "Hello, lets do some basic math. Results of operations will appear here"
# one plus one
1+1
## [1] 2
# two times three, divided by four
(2*3)/4
## [1] 1.5
# basic mathematical and trigonometric functions are fairly similar to what they would be in excel

# calculate the square root of 9
sqrt(9)
## [1] 3
# calculate the cube root of 8 (remember that x^(1/n) gives you the nth root of x)
8^(1/3)
## [1] 2
# get the cosine of 180 degrees - note that trig functions in R expect angles in radians
# also note that pi is a built-in constant in R
cos(pi)
## [1] -1
# calculate 5 to the tenth power
5^10
## [1] 9765625

Notice that when you run a line of code, the code and the result appear in the console. You can also type code directly into the console, but it won’t be saved anywhere. As you get more comfortable with R, it can be helpful to use the console as a “scratchpad” for experimenting and troubleshooting. For now, it’s best to err on the side of saving your code as a script so that you don’t accidentally lose useful work.

Variables and functions

Variables

Occasionally, it’s enough to just run a line of code and display the result in the console. But typically the code we write is more complex than adding one plus one, and when we execute a line of code, we want to be able to store the result and use it later in the script. This is where variables come in. Variables allow you to assign a value (whether that’s a number, a data table, a chunk of text, or any other type of data that R can handle) to a short, human-readable name. Anywhere you put a variable in your code, R will replace it with its value when your code runs.

R uses the <- symbol for variable assignment. If you’ve used other programming languages (or remember high school algebra), you may be tempted to use = instead. It will work, but there are subtle differences between <- and =, so you should get in the habit of using <-.

# the value of 12.098 is assigned to variable 'a'
a <- 12.098

# and the value 65.3475 is assigned to variable 'b'
b <- 65.3475

# we can now perform whatever mathematical operations we want using these two variables without having to repeatedly type out the actual numbers:

a*b
## [1] 790.5741
(a^b)/((b+a))
## [1] 7.305156e+68
sqrt((a^7)/(b*2))
## [1] 538.7261

In the code above, we assign the variables a and b once. We can then reuse them as often as we want. This is helpful because we save ourselves some typing, reduce the chances of making a typo somewhere, and if we need to change the value of a or b, we only have to do it in one place.

Also notice that when you assign variables, you can see them listed in your Environment tab (top right pane). Remember, everything you see in the environment is just in R’s temporary memory and won’t be saved when you close out of RStudio.

All of the examples you’ve seen so far are fairly contrived for the sake of simplicity. Let’s take a look at some code that everyone here will make use of at some point: reading data from a CSV.

Functions

It’s hard to get very far in R without making use of functions. Think of a function as a black box that takes some kind of input (the argument(s)) from the user and outputs a result (the return value). Can you identify the functions in the above examples?

anatomy of a function
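As a quick sketch (not tied to any dataset), the built-in round() function takes two arguments and returns a value:

round(x = 3.14159, digits = 2)  # arguments: the number to round and how many decimal places to keep
## [1] 3.14
sqrt(16)  # one argument in, one return value out
## [1] 4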

Using functions - reading data from CSV

The read.csv function is used to bring in a dataset from a CSV file, and it stores the data in one of the fundamental data structures in R: a data frame. read.csv() takes the file path to the CSV as input, and it outputs a data frame containing the data from the CSV.

We’ll get very familiar with data frames in this class, but for the moment just know that a data frame is a table of data with rows and columns. Data frames are typically organized with rows being records or observations (e.g. sampling locations, individual critters, etc.), and columns being variables that characterize those observations (e.g., species, size, date collected, X/Y coordinates). Once you have read the data in, you can take a quick look at its structure by typing the name of the variable it’s stored in.

# read in the data from tree_growth.csv and assign it as a dataframe to the variable "tree_data"
tree_data <- read.csv("data/tree_growth.csv")

# display the 'tree_data' data frame we just created
tree_data
##               Species DBH_in Height_ft Age_yr
## 1   arboreas madeupis    4.0         6      3
## 2   arboreas madeupis    6.0        10      5
## 3   arboreas madeupis    8.0         5      6
## 4   arboreas madeupis   11.0         7      6
## 5   arboreas madeupis   12.0        16      8
## 6   arboreas madeupis   13.0        19      8
## 7   arboreas madeupis   12.0        24     11
## 8   arboreas madeupis   20.0        22     11
## 9   arboreas madeupis   20.0        32     12
## 10  arboreas madeupis   11.0        31     13
## 11  arboreas madeupis   27.0        35     15
## 12  arboreas madeupis   25.0        33     15
## 13  arboreas madeupis   30.0        40     17
## 14  arboreas madeupis   35.0        67     18
## 15  arboreas madeupis   29.0        60     21
## 16  arboreas madeupis   37.0        65     30
## 17  arboreas madeupis   31.0        78     32
## 18  arboreas madeupis   38.0        76     36
## 19  arboreas madeupis   45.0        91     37
## 20  arboreas madeupis   55.0        83     38
## 21  arboreas madeupis   60.0        88     40
## 22  arboreas madeupis   75.0        86     43
## 23  arboreas madeupis   76.0        89     51
## 24  arboreas madeupis   80.0        75     55
## 25  arboreas madeupis   99.0       100     61
## 26 planteus falsinius    9.0         6      2
## 27 planteus falsinius    9.3         9      4
## 28 planteus falsinius   10.1         8      5
## 29 planteus falsinius   10.3        14      4
## 30 planteus falsinius   10.7        12      6
## 31 planteus falsinius   10.8        10      8
## 32 planteus falsinius   11.0        10      7
## 33 planteus falsinius   11.3        11      7
## 34 planteus falsinius   11.4        13      9
## 35 planteus falsinius   11.6        13     12
## 36 planteus falsinius   11.6        15     15
## 37 planteus falsinius   11.9        23     17
## 38 planteus falsinius   12.3        16     18
## 39 planteus falsinius   12.4        21     15
## 40 planteus falsinius   12.6        23     17
## 41 planteus falsinius   12.6        18     21
## 42 planteus falsinius   12.8        25     20
## 43 planteus falsinius   15.0        35     30

Examining datasets

Compared to scrolling through a dataset in Excel, examining a dataset in R can feel unintuitive at first. Stick with it, though. Once you get used to exploring data in R, you’ll find that R will help you learn more about your dataset in a more efficient manner.

Let’s see what we can learn about the dataset as a whole. The most useful functions for this are:

  • head() Show only the first few lines of the dataframe in the console. This function is useful for taking a quick glance at column names and the first few values to make sure the data were read in properly, without printing the whole dataset.
  • summary() Show a basic statistical summary of the dataset.
  • str() Show information about the structure of the dataset, including number of rows and columns, column data types, and the first few values in each column.
  • View() In a new tab, display a spreadsheet-like view of the full dataset with options to sort and filter. Note that you can’t change the data in this view - it is read-only.
head(tree_data)  # Show the first few lines of the dataframe. Defaults to showing the first 6 rows
##             Species DBH_in Height_ft Age_yr
## 1 arboreas madeupis      4         6      3
## 2 arboreas madeupis      6        10      5
## 3 arboreas madeupis      8         5      6
## 4 arboreas madeupis     11         7      6
## 5 arboreas madeupis     12        16      8
## 6 arboreas madeupis     13        19      8
head(tree_data, n = 10)  # Show the first 10 rows
##              Species DBH_in Height_ft Age_yr
## 1  arboreas madeupis      4         6      3
## 2  arboreas madeupis      6        10      5
## 3  arboreas madeupis      8         5      6
## 4  arboreas madeupis     11         7      6
## 5  arboreas madeupis     12        16      8
## 6  arboreas madeupis     13        19      8
## 7  arboreas madeupis     12        24     11
## 8  arboreas madeupis     20        22     11
## 9  arboreas madeupis     20        32     12
## 10 arboreas madeupis     11        31     13
summary(tree_data)  # View a basic statistical summary
##    Species              DBH_in        Height_ft          Age_yr     
##  Length:43          Min.   : 4.00   Min.   :  5.00   Min.   : 2.00  
##  Class :character   1st Qu.:11.00   1st Qu.: 12.50   1st Qu.: 7.50  
##  Mode  :character   Median :12.60   Median : 23.00   Median :15.00  
##                     Mean   :24.78   Mean   : 35.35   Mean   :18.81  
##                     3rd Qu.:30.50   3rd Qu.: 62.50   3rd Qu.:25.50  
##                     Max.   :99.00   Max.   :100.00   Max.   :61.00
str(tree_data)  # Get info about the structure of the data
## 'data.frame':    43 obs. of  4 variables:
##  $ Species  : chr  "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" ...
##  $ DBH_in   : num  4 6 8 11 12 13 12 20 20 11 ...
##  $ Height_ft: int  6 10 5 7 16 19 24 22 32 31 ...
##  $ Age_yr   : int  3 5 6 6 8 8 11 11 12 13 ...
View(tree_data)  # Open a new tab with a spreadsheet view of the data

Data structures

We’re going to take a little detour into data structures at this point. It’ll all tie back in to our tree data.

The data frame we just examined is a type of data structure. A data structure is what it sounds like: it’s a structure that holds data in an organized way. There are multiple data structures in R, including vectors, lists, arrays, matrices, data frames, and tibbles (more on this unfortunately-named data structure later). Today we’ll focus on vectors and data frames.

Vectors

Vectors are the simplest data structure in R. You can think of vectors as one column of data in an Excel spreadsheet, and the elements are each row in the column. Here are some examples of vectors:

digits <- 0:9  # Use x:y to create a sequence of integers starting at x and ending at y
digits
##  [1] 0 1 2 3 4 5 6 7 8 9
is_odd <- rep(c(FALSE, TRUE), 5)  # Use rep(x, n) to create a vector by repeating x n times 
is_odd
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
shoe_sizes <- c(7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5)
shoe_sizes
##  [1]  7.0  7.5  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5
favorite_birds <- c("greater roadrunner", "Costa's hummingbird", "kakapo")
favorite_birds
## [1] "greater roadrunner"  "Costa's hummingbird" "kakapo"

Note the use of c(). The c() function stands for combine, and it combines elements into a single vector. The c() function is a fairly universal way to combine multiple elements in R, and you’re going to see it over and over.
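As a quick sketch, c() also combines existing vectors, not just individual values:

more_birds <- c(favorite_birds, "common loon")  # combine an existing vector with one more element
length(more_birds)
## [1] 4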

Let’s play around with vectors a little more. We can use is.vector() to test whether something is a vector. We can get the length of a vector with length(). Note that single values in R are just vectors of length one.

is.vector(digits)  # Should be TRUE
## [1] TRUE
is.vector(favorite_birds)  # Should also be TRUE
## [1] TRUE
length(digits)  # Hopefully this is 10
## [1] 10
length(shoe_sizes)
## [1] 10
# Even single values in R are stored as vectors
length_one_chr <- "length one vector"
length_one_int <- 4
is.vector(length_one_chr)
## [1] TRUE
is.vector(length_one_int)
## [1] TRUE
length(length_one_chr)
## [1] 1
length(length_one_int)
## [1] 1

In the examples above, each vector contains a different type of data. digits contains integers, is_odd contains logical (true/false) values, favorite_birds contains text, and shoe_sizes contains decimal numbers. That’s because a given vector can only contain a single type of data. In R, there are four data types that we will typically encounter:

  • character Regular text, denoted with double or single quotation marks (e.g. "hello", "3", "R is my favorite programming language")
  • numeric Decimal numbers (e.g. 23, 3.1415)
  • integer Integers. If you want to explicitly denote a number as an integer in R, append L to it or use as.integer() (e.g. 5L, as.integer(30)).
  • logical True or false values (TRUE, FALSE). Note that TRUE and FALSE must be all-uppercase.

There are two more data types, complex and raw, but you are unlikely to encounter these so we won’t cover them here.

You can use the class() function to get the data type of a vector:

class(favorite_birds)
## [1] "character"
class(shoe_sizes)
## [1] "numeric"
class(digits)
## [1] "integer"
class(is_odd)
## [1] "logical"

If you need to access a single element of a vector, you can use the syntax my_vector[x] where x is the element’s index (the number corresponding to its position in the vector). You can also use a vector of indices to extract multiple elements from the vector. Note that in R, indexing starts at 1 (i.e. my_vector[1] is the first element of my_vector). If you’ve coded in other languages, you may be used to indexing starting at 0.

second_favorite_bird <- favorite_birds[2]
second_favorite_bird
## [1] "Costa's hummingbird"
top_two_birds <- favorite_birds[c(1,2)]
top_two_birds
## [1] "greater roadrunner"  "Costa's hummingbird"

Logical vectors can also be used to subset a vector. The logical vector must be the same length as the vector you are subsetting.

odd_digits <- digits[is_odd]
odd_digits
## [1] 1 3 5 7 9
Data frames

Let’s revisit our trees data frame. We’ve explored the data frame as a whole, but it’s often useful to look at one column at a time. To do this, we’ll use the $ syntax:

tree_data$Species
##  [1] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
##  [4] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
##  [7] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [10] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [13] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [16] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [19] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [22] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [25] "arboreas madeupis"  "planteus falsinius" "planteus falsinius"
## [28] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [31] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [34] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [37] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [40] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [43] "planteus falsinius"
tree_data$Age_yr
##  [1]  3  5  6  6  8  8 11 11 12 13 15 15 17 18 21 30 32 36 37 38 40 43 51 55 61
## [26]  2  4  5  4  6  8  7  7  9 12 15 17 18 15 17 21 20 30

You can also use square brackets [] to access data frame columns. You can specify columns by name or by index (integer indicating position of column). It’s almost always best to refer to columns by name when possible - it makes your code easier to read, and it prevents your code from breaking if columns get reordered.

The square bracket syntax allows you to select multiple columns at a time and to select a subset of rows.

tree_data[, "Species"]  # Same as tree_data$Species
##  [1] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
##  [4] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
##  [7] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [10] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [13] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [16] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [19] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [22] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [25] "arboreas madeupis"  "planteus falsinius" "planteus falsinius"
## [28] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [31] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [34] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [37] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [40] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [43] "planteus falsinius"
tree_data[, c("Species", "Age_yr")]  # Data frame with only Species and Age_yr columns
##               Species Age_yr
## 1   arboreas madeupis      3
## 2   arboreas madeupis      5
## 3   arboreas madeupis      6
## 4   arboreas madeupis      6
## 5   arboreas madeupis      8
## 6   arboreas madeupis      8
## 7   arboreas madeupis     11
## 8   arboreas madeupis     11
## 9   arboreas madeupis     12
## 10  arboreas madeupis     13
## 11  arboreas madeupis     15
## 12  arboreas madeupis     15
## 13  arboreas madeupis     17
## 14  arboreas madeupis     18
## 15  arboreas madeupis     21
## 16  arboreas madeupis     30
## 17  arboreas madeupis     32
## 18  arboreas madeupis     36
## 19  arboreas madeupis     37
## 20  arboreas madeupis     38
## 21  arboreas madeupis     40
## 22  arboreas madeupis     43
## 23  arboreas madeupis     51
## 24  arboreas madeupis     55
## 25  arboreas madeupis     61
## 26 planteus falsinius      2
## 27 planteus falsinius      4
## 28 planteus falsinius      5
## 29 planteus falsinius      4
## 30 planteus falsinius      6
## 31 planteus falsinius      8
## 32 planteus falsinius      7
## 33 planteus falsinius      7
## 34 planteus falsinius      9
## 35 planteus falsinius     12
## 36 planteus falsinius     15
## 37 planteus falsinius     17
## 38 planteus falsinius     18
## 39 planteus falsinius     15
## 40 planteus falsinius     17
## 41 planteus falsinius     21
## 42 planteus falsinius     20
## 43 planteus falsinius     30
tree_data[1:5, "Species"]  # First five elements of Species column
## [1] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [4] "arboreas madeupis" "arboreas madeupis"

Now that we’ve learned about vectors, these data frame columns might look familiar. A data frame is just a collection of vectors of the same length, where each vector is a column.

is.vector(tree_data$Species)
## [1] TRUE
class(tree_data$Species)
## [1] "character"
is.vector(tree_data$Age_yr)
## [1] TRUE
class(tree_data$Age_yr)
## [1] "integer"
str(tree_data)  # Check out the abbreviations next to the column names: chr, num, and int. These are the data types of the columns.
## 'data.frame':    43 obs. of  4 variables:
##  $ Species  : chr  "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" ...
##  $ DBH_in   : num  4 6 8 11 12 13 12 20 20 11 ...
##  $ Height_ft: int  6 10 5 7 16 19 24 22 32 31 ...
##  $ Age_yr   : int  3 5 6 6 8 8 11 11 12 13 ...

Working with columns

R is great at working with vectors because they are the fundamental data structure of the language. Since data frame columns are vectors, they are typically easy to operate on.

Basic statistical summary functions aren’t too different from what you’d do in Excel.

# to calculate the mean of the 'Age_yr' column in the original dataframe:
mean(tree_data$Age_yr)
## [1] 18.81395
# or we can pull the DBH column into its own vector, then calculate its mean:
dbh <- tree_data$DBH_in
mean(dbh)
## [1] 24.78372
# likewise for the median, standard deviation, and minimum:
median(tree_data$Age_yr)
## [1] 15
sd(tree_data$Age_yr)
## [1] 15.02261
min(tree_data$Age_yr)
## [1] 2

Another useful way to summarize a column is to get the unique values it contains. This is typically most useful with character columns (e.g. species) but works with any data type.

unique(tree_data$Species)  # in this line, we are printing all of the unique species names in the dataset (in this case 2).
## [1] "arboreas madeupis"  "planteus falsinius"
species_names <- unique(tree_data$Species)  # assign unique species names to a variable

Summarizing data is useful but we also want to be able to modify it. So let’s say we want to convert the ‘Height_ft’ column in tree_data from feet to inches. We’ll start by assigning a conversion factor to the variable in_per_ft.

in_per_ft <- 12  # Since there are 12 inches in a foot

We can then use this conversion factor to convert the ‘Height_ft’ column to inches, and place this converted value in a new column we will create in the data frame. Notice below that if you assign a vector to a column that doesn’t exist yet, R will automatically create that column. We’ll use the head() function to verify that the Height_in column was created correctly.

# here we are specifying the new column on the left side of the '<-', and telling R what we want to put into it on the right side of the '<-'.
tree_data$Height_in <- tree_data$Height_ft * in_per_ft

# we can now use the head() function to check our work, and make sure the proper conversion was carried out:
head(tree_data)
##             Species DBH_in Height_ft Age_yr Height_in
## 1 arboreas madeupis      4         6      3        72
## 2 arboreas madeupis      6        10      5       120
## 3 arboreas madeupis      8         5      6        60
## 4 arboreas madeupis     11         7      6        84
## 5 arboreas madeupis     12        16      8       192
## 6 arboreas madeupis     13        19      8       228

Basic data visualization

It’s often easier to get a sense of our data by visualizing it, so let’s explore the basic tools for taking a first glance at our data.

Histogram

We’ll start with a histogram. The code below generates a basic histogram plot of a specific column in the dataframe (remember from earlier that this is denoted by the $ sign) using the hist() function. We’ll start by creating a histogram for the ‘Age_yr’ column.

hist(x = tree_data$Age_yr)

Looking at the histogram, we can see that most trees in the dataset fall into the 0-20 year age class.
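If you want a slightly more polished plot, hist() also accepts optional arguments for a title and axis labels. A quick sketch (adjust the text to whatever suits your data):

hist(x = tree_data$Age_yr,
     main = "Distribution of tree ages",  # plot title
     xlab = "Age (years)")                # x-axis label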

Scatter plot

The data frame also includes information on DBH and height of the trees. What if we want to quickly take a look at how these two variables relate to one another? To get a basic plot, you can use the plot() function. A simple call to plot() takes two arguments: x (a vector of x coordinates) and y (a vector of y coordinates). You can learn more about plot() by typing ?plot at the console.

# this should generate an ugly but effective scatterplot of tree height vs age. It would appear that older trees are taller!
plot(x = tree_data$Age_yr, y = tree_data$Height_ft)

# let's try the same, but with DBH as the dependent variable on the y axis:
# again, we should see a plot below showing that, even in a make-believe forest, older trees of both species tend to be thicker!
plot(x = tree_data$Age_yr, y = tree_data$DBH_in)

Getting Help

Getting Help

There are a number of options to get help with R. If you’re trying to figure out how to use a function, you can type ?function_name. For example ?plot will show the R documentation for that function in the Help panel.

?plot
?dplyr::filter
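If you don’t remember a function’s exact name, a couple of other built-in options can help (a quick sketch):

??"histogram"               # search all installed help files for a keyword
help(package = "graphics")  # list the help pages available in a specific package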

If you have a question, online resources like Stack Exchange and Stack Overflow are extremely helpful. Google searches are a great first step, and it helps to include “in R” in every search related to R code.

Don’t hesitate to reach out to colleagues for help as well! If you are stuck on something and the answers on Google are more confusing than helpful, don’t be afraid to ask a human. Every experienced R programmer was a beginner once, so chances are they’ve encountered the same problem as you at some point. There is an R-focused data science community of practice for I&M folks, which anyone working in R (regardless of experience!) is invited and encouraged to join. The Teams join code will be shared with course participants.

Day 2: Data Wrangling

Intro and Packages

Data Wrangling Intro

Goals of this Day

  1. Learn how to install and load packages, and understand how they work
  2. Understand what data wrangling is and some ways to do it.
  3. Learn how to clean your imported data so that R understands what it is
  4. Get data in the format you need for analysis using the tidyverse.
  5. Combine data from different data.frames so that they are in the correct format

Disclaimer: While R provides lots of ways to check and clean up data, these tools are no replacement for sound data management that prevents data issues from happening in the first place.


Working with Packages: Installing a Package

What is a package? A package is a group of functions that someone has put together. Packages come with documentation that tells you what each function does.

First, install the package

  1. Go to packages tab in lower right quadrant
  2. Click on “install”
  3. Type name of package ‘janitor’
  4. Click “install”
  5. Code will pop up in your console
install.packages("janitor")
Calling a Package

To use a package, call it by using the function “library”

library('janitor')

This is called “loading a package” or “calling a package”.


Using a Package

Look at the vignette

  • Go to packages window of lower right quadrant
  • Click on ‘janitor’ name in list of packages
  • Vignette will show up in the ‘help’ tab

Take some time to click on a function from the vignette list and walk through the helpful information it provides.
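You can also pull up the same documentation with code instead of clicking through the Packages tab (a quick sketch):

help(package = "janitor")   # list all help pages in the janitor package
?clean_names                # open the help page for a single function
browseVignettes("janitor")  # open the package's vignettes in a web browser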

Did you get “error” text?

Text often pops up in the console after you run code. You must decide whether it means you have to change something or not. The message below, for example, is informational only and doesn’t require any action.

library('janitor')
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test


Challenge: Install a Package

Challenge: install + call the package ‘tidyverse’


Solution

Solution: install + call the package ‘tidyverse’

install.packages("tidyverse")
library('tidyverse')


Data Cleaning

Data Cleaning

Data wrangling is all the steps you have to take to prepare your data for analysis or plotting. Data munging or data cleaning are other terms used to describe this process. Material presented today can be found in Hadley Wickham’s book “R for Data Science”, which is free and online. https://r4ds.had.co.nz/wrangle-intro.html

If you’re starting from this tab of the lesson, you’ll need to load two packages before proceeding.

library('janitor')
library('tidyverse')

Major Components to Data Wrangling

  • Step 1: Import (covered yesterday)
  • Step 2+3: Tidy & Transform (get data into a useful format)


Column Names

Sometimes you get data where the column names are messy. For an example of data that needs tidying before exploration and analyses, we’ll use fake data. Copy + paste the code chunk below into your script, and run it in the console to generate a dataframe called “messy_data”.

Here, we’ll clarify that we use the words “data.frame” or “dataframe” to describe the data. The word “tibble” also shows up; a tibble is the tidyverse’s version of a data frame, and we’ll use these terms interchangeably.

messy_data <- tibble(TrailName = c('Bayside', 'Coastal', 'Tidepools', 
                                   'Monument', 'Tidepools', NA),
                     Visitor_count = c(200, 300, 400, 500, 400, NA),
                     `empty row` = NA)

This data.frame captures the hypothetical number of visitors across four trails at a park. There are a few qualities that make it difficult to interpret. Seeing this data, what would you change to make it most consistent?

messy_data
## # A tibble: 6 x 3
##   TrailName Visitor_count `empty row`
##   <chr>             <dbl> <lgl>      
## 1 Bayside             200 NA         
## 2 Coastal             300 NA         
## 3 Tidepools           400 NA         
## 4 Monument            500 NA         
## 5 Tidepools           400 NA         
## 6 <NA>                 NA NA

There are four things I suggest fixing with these data:

  1. The column names are inconsistently formatted
  2. There is an empty row
  3. There is an empty column
  4. There is a duplicated row

The janitor package has functions to fix all of these data situations and aid in data cleaning. Let’s proceed with changing the column names.

The clean_names() function in the janitor package will consistently format column names.

clean_data <- clean_names(messy_data)
clean_data
## # A tibble: 6 x 3
##   trail_name visitor_count empty_row
##   <chr>              <dbl> <lgl>    
## 1 Bayside              200 NA       
## 2 Coastal              300 NA       
## 3 Tidepools            400 NA       
## 4 Monument             500 NA       
## 5 Tidepools            400 NA       
## 6 <NA>                  NA NA

Empty Rows and Columns

Now, we’ll deal with the empty row and empty column. To remove empty rows or columns, we can use the remove_empty() function.

The arguments for the function are as follows:

remove_empty(
  # first argument is the data
  dat, 
  # the next specifies if you want to get rid of empty rows, columns, or both
  which = c("rows", "cols"), 
  # this last one asks whether (TRUE) or not (FALSE) you want R to tell you which rows or columns were removed
  quiet = TRUE)

Let’s give it a try

Removing empty rows

clean_data2 <- remove_empty(clean_data, which = c('rows'), quiet = FALSE)
clean_data2
## # A tibble: 5 x 3
##   trail_name visitor_count empty_row
##   <chr>              <dbl> <lgl>    
## 1 Bayside              200 NA       
## 2 Coastal              300 NA       
## 3 Tidepools            400 NA       
## 4 Monument             500 NA       
## 5 Tidepools            400 NA

Removing empty rows and columns

clean_data2 <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data2
## # A tibble: 5 x 2
##   trail_name visitor_count
##   <chr>              <dbl>
## 1 Bayside              200
## 2 Coastal              300
## 3 Tidepools            400
## 4 Monument             500
## 5 Tidepools            400

The Pipe - Sequences of Functions

What happens if you’d like to both clean_names() and remove_empty() on the same data?

Solution 1 - run as a sequence

clean_data <- clean_names(messy_data)
clean_data <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data
## # A tibble: 5 x 2
##   trail_name visitor_count
##   <chr>              <dbl>
## 1 Bayside              200
## 2 Coastal              300
## 3 Tidepools            400
## 4 Monument             500
## 5 Tidepools            400

Solution 2 - use the pipe operator

Pipe operators (%>% or |>) can be read like the word “then”. For example, clean_names() %>% remove_empty() can be read as “clean names, then remove empty”. They allow for a more clean workflow.

clean_data <- clean_names(messy_data) %>%
              remove_empty(which = c('rows', 'cols'), quiet = TRUE)
clean_data
## # A tibble: 5 x 2
##   trail_name visitor_count
##   <chr>              <dbl>
## 1 Bayside              200
## 2 Coastal              300
## 3 Tidepools            400
## 4 Monument             500
## 5 Tidepools            400
# note that replacing %>% with |> also works

Note that there is no data argument in the remove_empty() function. This is because, when using a pipe, the data argument is inherited from the prior function.
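In other words, the pipe takes whatever is on its left-hand side and feeds it in as the first argument of the function on its right. These two lines do exactly the same thing (a quick sketch):

# without the pipe: the data is the first argument
remove_empty(clean_names(messy_data), which = c('rows', 'cols'), quiet = TRUE)

# with the pipe: the data is passed along automatically
clean_names(messy_data) %>% remove_empty(which = c('rows', 'cols'), quiet = TRUE)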


Removing Duplicate Entries

A final function we’ll discuss is distinct() from the dplyr package, which is part of the tidyverse. It removes duplicate rows from data.

clean_data <- clean_names(messy_data) %>%
              remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
              distinct()
clean_data
## # A tibble: 4 x 2
##   trail_name visitor_count
##   <chr>              <dbl>
## 1 Bayside              200
## 2 Coastal              300
## 3 Tidepools            400
## 4 Monument             500

A word of caution - sometimes you may not want to remove duplicate rows, or may not want to remove empty rows, because they are important to the data. You know your data best. Use functions responsibly.


Challenge

The code chunk below creates messy data, called “messy_data2”. Use functions from the janitor and dplyr packages to tidy the data, and name the result “clean_data2”. See if you can use the pipe operator to streamline the script.

messy_data2 <- tibble(`Park Code` = c(NA, NA, 'CABR', 'CHIS', 'SAMO', 'SAMO'),
                     visitor_count = c(NA, NA, 400, 500, 400, 400),
                     `empty row` = NA)

Solution
clean_data2 <- clean_names(messy_data2) %>%
               remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
               distinct()

Dplyr functions

Data Wrangling

In this section, we’ll introduce functions in the dplyr package. This package is one of the packages that make up the tidyverse, so you won’t have to install and call another package. The dplyr package is a group of functions that help to manipulate and summarize data. https://www.rdocumentation.org/packages/dplyr/versions/0.7.8

You can think of these functions like verbs that accomplish tasks, and are strung together by the pipe operator (%>% or |>).

We’ll cover the following functions:

  1. select() - selects or removes columns
  2. rename() - renames columns
  3. arrange() - arranges a data.frame by column values
  4. filter() - filters data for observations that meet specific criteria
  5. mutate() - adds new column(s) to a data.frame
  6. pull() %>% table() - isolates and summarizes by a column
  7. group_by() %>% summarize() - summary calculations by groups/factor levels and returns data for each group
  8. group_by() %>% mutate() - summary calculations by groups/factor levels and returns data for each observation

The data we’ll use for modules 3-5 is from the Tidy Tuesday community on GitHub. Each week on Tuesday, the community posts data for exploratory data analysis and visualization. You can add data from this page to your RStudio environment by copying + pasting the “Get the data!” code chunk into your script and running it in the console.

https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-09-17

The data we’ll use is in csv format, so we’ll need to call the readr package, as well as the tidyverse and janitor for this lesson (if you’ve already called tidyverse and janitor, don’t run those lines).

library(readr)
library(janitor)
library(tidyverse)

Next, we’ll load the data from the R for Data Science Tidy Tuesday GitHub page. There are three lines which will load three data.frames.

  1. park_visits - which contains data on visits to US National Parks
  2. state_pop - which contains information on populations of US states
  3. gas_price - which contains information on gas prices over space and time

For now, we’ll focus on the park_visits data.frame, and will circle back to the other two later. There are descriptions of each column for each data.frame on the Tidy Tuesday website linked above if you’d like more information.

park_visits <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")

Working with dplyr

dplyr is a package in the tidyverse that is designed for working with data.frames. It makes tasks like filtering and summarizing data easier. Many of the functions within the package are “verb-like” so they are easy to understand and remember.


dplyr - select() and rename()

select() - keeps the columns you indicate and drops the rest. Since select() puts the columns in the order you type them, you can also use select() to reorder columns.

rename() - keeps all the columns, but renames the ones you indicate

Since we won’t be using geographical information, other than state, in our module, we’ll select some columns from the park_visits data to use today.

Note that the name to the left of the arrow is the resulting data.frame, and the information to the right, connected by pipes, describes the steps to get to that result. I normally name the result something different than the original data.frame so that the original data aren’t affected. However, since we won’t need the geographical information later on, I will name this one “park_visits”.

# let's see the column names in park_visits
colnames(park_visits)
##  [1] "year"              "gnis_id"           "geometry"         
##  [4] "metadata"          "number_of_records" "parkname"         
##  [7] "region"            "state"             "unit_code"        
## [10] "unit_name"         "unit_type"         "visitors"
# we can also look at the first few rows by using head()

head(park_visits)
##   year gnis_id     geometry metadata number_of_records               parkname
## 1 1904 1163670      POLYGON     <NA>                 1            Crater Lake
## 2 1941 1531834 MULTIPOLYGON     <NA>                 1         Lake Roosevelt
## 3 1961 2055170 MULTIPOLYGON     <NA>                 1        Lewis and Clark
## 4 1935 1530459 MULTIPOLYGON     <NA>                 1                Olympic
## 5 1982  277263      POLYGON     <NA>                 1 Santa Monica Mountains
## 6 1919  578853 MULTIPOLYGON     <NA>                 1                   <NA>
##   region state unit_code                                       unit_name
## 1     PW    OR      CRLA                       Crater Lake National Park
## 2     PW    WA      LARO         Lake Roosevelt National Recreation Area
## 3     PW    WA      LEWI        Lewis and Clark National Historical Park
## 4     PW    WA      OLYM                           Olympic National Park
## 5     PW    CA      SAMO Santa Monica Mountains National Recreation Area
## 6     NE    ME      ACAD                            Acadia National Park
##                  unit_type visitors
## 1            National Park     1500
## 2 National Recreation Area        0
## 3 National Historical Park    69000
## 4            National Park     2200
## 5 National Recreation Area   468144
## 6            National Park    64000
# you can either select for the columns you want
park_visits1 <- park_visits %>% select(year, parkname, unit_code, state, visitors)

# or against the columns you don't want, by using the minus sign before the column name
park_visits2 <- park_visits %>% select(-gnis_id, -geometry, -metadata, -number_of_records, -region, -unit_name, -unit_type)

# you can check if you have the same result by using colnames()
colnames(park_visits1)
## [1] "year"      "parkname"  "unit_code" "state"     "visitors"
colnames(park_visits2)
## [1] "year"      "parkname"  "state"     "unit_code" "visitors"
# to clear up our environment, we'll get rid of park_visits1 and 2 using remove()
remove(park_visits1, park_visits2)

# since we won't need all of the columns in park_visits, we'll change that file (note that park_visits is on both the left and right sides of the pipe)
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)

Taking a look at the park_visits data, there are some column name inconsistencies. For example, some are in snake case (unit_code), but some are not (parkname). To rename a column, use rename(). The structure of this argument is as follows:

rename(dataset, new_name = old_name) OR dataset %>% rename(new_name = old_name)

# rename parkname column to make it snake case
park_visits <- park_visits %>% rename(park_name = parkname)

# we can check if we've done it correctly by looking at the column names
colnames(park_visits)
## [1] "year"      "park_name" "unit_code" "state"     "visitors"

dplyr- filter()

filter() - keeps the rows of a data.frame whose column values meet the conditions you specify. You can combine criteria using & (and), | (or), and ! (not).

If you’d like to get visitation data for just parks in the state of Washington (WA), you can use the filter function. The structure is as follows:

wa_park_visits <- filter(
                        # name of data.frame
                        park_visits, 
                        # condition
                        state == 'WA'
                        )

# head() shows the first few rows of data
head(wa_park_visits)

# you can use the unique() to check you've done the filtering correctly
unique(wa_park_visits$state)

Now that you’ve got parks in the state of Washington, perhaps you’d like to know which parks in WA had over 5 million total visitors. This will require more intensive filtering. We’ll have to filter for parks where state == 'WA' AND year == 'Total' AND visitors > 5,000,000.

wa_park_visits <- park_visits %>%
                  # filter for parks in WA
                  filter(state == 'WA' &
                  # filter for total visitation
                  year == 'Total' & 
                  # filter for over 5 million visitors
                  visitors > 5000000)

# head() shows the first few rows of data
head(wa_park_visits)

You also might be wondering why we use “==” instead of “=”. In R, the single equal sign “=” is used to assign something. For example, if you want to make some variable equal to a number, you could tell R “x = 5”, and it would remember that x is equal to 5. To evaluate something as TRUE/FALSE, you use “==”. If you type “5 == 3” into the console, it will return the answer “FALSE”. The filtering argument uses this TRUE/FALSE evaluation - filtering for rows where the stated condition - such as year == ‘Total’ - is TRUE.
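
If it helps, here is a minimal sketch you can run in the console to see the difference between assignment and evaluation (the object name x is just for illustration):

# "=" assigns a value (equivalent to x <- 5)
x = 5
# "==" evaluates a comparison and returns TRUE or FALSE
x == 5
## [1] TRUE
5 == 3
## [1] FALSE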


dplyr - challenge 1

Time for the first challenge. Using the park_visits data, create a new data.frame containing total visitation for a state of your choice. Note that some “states” aren’t states. For example, VI refers to the Virgin Islands. How can you check your work?


dplyr - solution 1
pr_park_visits <- park_visits %>%
                  # filter for parks in Puerto Rico
                  filter(state == 'PR' &
                  # filter for total visitation
                  year == 'Total')

# head() shows the first few rows of data
head(pr_park_visits)
# San Juan National Historic Site (Puerto Rico) had > 55 million visitors!

# remove result from environment
remove(pr_park_visits)

dplyr - arrange()

arrange() reorders rows. You can think of this like “sort” in Excel. While this is generally not needed in R, it can be helpful for exploring your data, and when working with time series data and lagged values.

As an example, we’ll re-visit total park visitation in WA, sorting by the number of visitors.

wa_park_visits <- park_visits %>%
                  # filter for parks in WA
                  filter(state == 'WA' &
                  # filter for total visitation
                  year == 'Total' & 
                  # filter for over 5 million visitors
                  visitors > 5000000) %>%
                  # arrange by visitation - default is ascending order
                  arrange(visitors)

# look at result (df is short so head() not needed)
wa_park_visits
##    year       park_name unit_code state  visitors
## 1 Total Lewis and Clark      LEWI    WA   9515314
## 2 Total            <NA>      NEPE    WA   9566104
## 3 Total       Ross Lake      ROLA    WA  11335924
## 4 Total  North Cascades      NOCA    WA  14809679
## 5 Total  Fort Vancouver      FOVA    WA  20238294
## 6 Total San Juan Island      SAJH    WA  23066952
## 7 Total  Lake Roosevelt      LARO    WA  62247752
## 8 Total   Mount Rainier      MORA    WA  94488801
## 9 Total         Olympic      OLYM    WA 160737962
# to arrange by descending order, we can use desc()
wa_park_visits2 <- wa_park_visits %>%
                  arrange(desc(visitors))

# look at result 
wa_park_visits2
##    year       park_name unit_code state  visitors
## 1 Total         Olympic      OLYM    WA 160737962
## 2 Total   Mount Rainier      MORA    WA  94488801
## 3 Total  Lake Roosevelt      LARO    WA  62247752
## 4 Total San Juan Island      SAJH    WA  23066952
## 5 Total  Fort Vancouver      FOVA    WA  20238294
## 6 Total  North Cascades      NOCA    WA  14809679
## 7 Total       Ross Lake      ROLA    WA  11335924
## 8 Total            <NA>      NEPE    WA   9566104
## 9 Total Lewis and Clark      LEWI    WA   9515314
# remove one result
remove(wa_park_visits2)

dplyr - pull()

pull() takes data from a column and returns it as a vector

pull() %>% table() is a handy way to see a summary of categorical data (e.g. how many observations there are for each state in your data)

# two ways to get the number of observations per state

# option 1 - use pull() to isolate a column and table() to get # of observations
state_n <- park_visits %>%
           # pull the state column only
           pull(state) %>%
           # get # of observations by state in table output
           table()

# option 2 - use group_by() to isolate column(s) and tally() to get # of obs
state_n2 <- park_visits %>%
           # group by state
           group_by(state) %>%
           # get tally of observations by state in output
           tally()

# remove created objects
remove(state_n, state_n2)

dplyr - mutate()

mutate() makes new columns by doing calculations on old columns

As an example, let’s calculate the total visitation for parks with over 5 million total visitors in WA in millions, by dividing the number of total visitors by 1,000,000.

wa_park_visits_millions <- wa_park_visits %>%
                           mutate(visitors_mil = visitors/1000000)

wa_park_visits_millions
##    year       park_name unit_code state  visitors visitors_mil
## 1 Total Lewis and Clark      LEWI    WA   9515314     9.515314
## 2 Total            <NA>      NEPE    WA   9566104     9.566104
## 3 Total       Ross Lake      ROLA    WA  11335924    11.335924
## 4 Total  North Cascades      NOCA    WA  14809679    14.809679
## 5 Total  Fort Vancouver      FOVA    WA  20238294    20.238294
## 6 Total San Juan Island      SAJH    WA  23066952    23.066952
## 7 Total  Lake Roosevelt      LARO    WA  62247752    62.247752
## 8 Total   Mount Rainier      MORA    WA  94488801    94.488801
## 9 Total         Olympic      OLYM    WA 160737962   160.737962

dplyr - group_by() and mutate()

group_by() indicates that data is in groups, so you can do calculations separately for each group

group_by() %>% mutate() will add a column and return the same number of rows as the original data

This is helpful when you want to get group summary information without sacrificing the original values. For example, we might want to know what percent of a park’s total visitation occurred in each year, and add that percentage to the original data.

park_visits2 <- park_visits %>%
                # remove the rows with Totals per park
                filter(year != "Total")  %>% 
                # do calculations by park
                group_by(park_name) %>%
                # add visit_percent column
                # visits for each year divided by the sum of total visits for each park
                # na.rm = TRUE means that NA values are not included in the calculation
                # round(2) rounds the result to 2 decimal places for easier reading
                mutate(visit_percent = (100*visitors/sum(visitors, na.rm = TRUE)) %>%
                         round(2)) 

# take a look at the result
head(park_visits2)
## # A tibble: 6 x 6
## # Groups:   park_name [6]
##   year  park_name              unit_code state visitors visit_percent
##   <chr> <chr>                  <chr>     <chr>    <int>         <dbl>
## 1 1904  Crater Lake            CRLA      OR        1500          0   
## 2 1941  Lake Roosevelt         LARO      WA           0          0   
## 3 1961  Lewis and Clark        LEWI      WA       69000          0.73
## 4 1935  Olympic                OLYM      WA        2200          0   
## 5 1982  Santa Monica Mountains SAMO      CA      468144          2.66
## 6 1919  <NA>                   ACAD      ME       64000          0
# remove from environment
remove(park_visits2)

dplyr - group_by() and summarize()

group_by() and summarize() (or summarise()) also work with grouped data. Here you calculate a summary for each group. The output data frame will have the columns that define the groups plus the summary columns, and will have one row for each group.

I find this combination most helpful for calculating means and standard deviations for data. You can also find minimum and maximum values. We’ll filter for total visitation numbers for each park, then calculate a visitation summary for each state.

state_summary <- park_visits %>%
                # filter for total visitation numbers (to avoid double counts)
                filter(year == 'Total') %>%
                # do calculations by state
                group_by(state) %>%
                # calculate summaries
                summarize(
                  # mean number of total visitors across parks in each state
                  mean_visit = mean(visitors, na.rm = TRUE),
                  # sd of total visitors across parks in each state
                  sd_visit = sd(visitors, na.rm = TRUE),
                  # min total visitors across parks in each state
                  min_visit = min(visitors, na.rm = TRUE),
                  # max total visitors across parks in each state
                  max_visit = max(visitors, na.rm = TRUE),
                  # get number of park units w data in each state
                  n_parks = n()
                )

# take a look at the result
head(state_summary)
## # A tibble: 6 x 6
##   state mean_visit   sd_visit min_visit max_visit n_parks
##   <chr>      <dbl>      <dbl>     <int>     <int>   <int>
## 1 AK      4661539    6737110.     10490  19476522      22
## 2 AL      2998848.   2162876.    365883   5487784       5
## 3 AR     20647567.  34237894.     60563  92028933       7
## 4 AS       113694         NA     113694    113694       1
## 5 AZ     23655713.  46310344.    348811 205486894      19
## 6 CA     61322519. 120602434.      6349 611031225      26
# if you want to continue analyzing this summarized data, you must first ungroup()
# for example:
state_summary <- state_summary %>%
                 ungroup() %>%
                 mutate(dif = max_visit - min_visit)

# remove from environment
remove(state_summary)

dplyr - challenge 2

For this second challenge, you’ll start with the park_visits data. Calculate the average visitation divided by the number of parks for each state. The final data.frame should be arranged in descending order by a numeric column of your choice.


dplyr - solution 2
challenge2 <- park_visits %>%
                # filter for total visitation numbers (to avoid double counts)
                filter(year == 'Total') %>%
                # do calculations by state
                group_by(state) %>%
                # calculate summaries
                summarize(
                  # mean number of total visitors across parks in each state
                  mean_visit = mean(visitors, na.rm = TRUE),
                  # get number of park units w data in each state
                  n_parks = n()
                ) %>%
                # ungroup to do another calculation
                ungroup() %>%
                # calculate visit/n
                mutate(visit_per_park = mean_visit/n_parks) %>%
                # select relevant columns
                select(state, visit_per_park) %>%
                # arrange in descending order
                arrange(desc(visit_per_park))
                

# take a look at the result
head(challenge2)
## # A tibble: 6 x 2
##   state visit_per_park
##   <chr>          <dbl>
## 1 NV        103867788.
## 2 PR         55928192 
## 3 ME         40890226.
## 4 NC         19591218.
## 5 MS         19160525 
## 6 OK         18206483.
# remove from environment
remove(challenge2)


Pivoting Tables

Pivoting Tables

This section will go over the use of functions to change data from wide to long format and vice-versa.

We will:

  1. Explain what these formats are and why anyone cares
  2. pivot_longer()
  3. pivot_wider()

If you’re starting with this section, run the following code chunk to get caught up.

# call packages
library("readr")
library("janitor")
library("tidyverse")

# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")

# cut some columns from park_visits data
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)

Why do we care?

Depending on your goal, you may want to have your data formatted in multiple ways. For example, it is much easier for humans to read wide format data, while most R functions rely on long format data. Here is a quick demonstration using park visits data:

# we'll first take a subset of the park visits data
medn_visits <- park_visits %>%
  # get parks in MEDN and no totals, years 2010-2015
  filter(unit_code %in% c("CABR", "CHIS", "SAMO") & year != "Total") %>%
  # make year a number, since no more text
  mutate(year = as.numeric(year)) %>%
  # get years 2010:2015
  filter(year >= 2010 & year <= 2015) %>%
  # arrange in ascending order
  arrange(year) %>%
  # select relevant columns
  select(unit_code, year, visitors)

# if we take a look at the data, it is in long format, each observation is a row
# this isn't the most human-friendly way to look at data, but it is really helpful if we want to use dplyr group_by() functions like mutate() and summarize()
medn_visits
## # A tibble: 18 x 3
##    unit_code  year visitors
##    <chr>     <dbl>    <dbl>
##  1 SAMO       2010   568371
##  2 CABR       2010   763140
##  3 CHIS       2010   277515
##  4 SAMO       2011   609636
##  5 CABR       2011   813351
##  6 CHIS       2011   242756
##  7 SAMO       2012   649471
##  8 CABR       2012   877951
##  9 CHIS       2012   249594
## 10 SAMO       2013   633054
## 11 CABR       2013   856128
## 12 CHIS       2013   212029
## 13 SAMO       2014   694714
## 14 CABR       2014   893434
## 15 CHIS       2014   342161
## 16 SAMO       2015   797126
## 17 CABR       2015   981825
## 18 CHIS       2015   324816
# conversely, we can pivot wider to make the data wide format
# it is more human-friendly but you can't use dplyr functions on it easily
medn_visits_wide <- pivot_wider(medn_visits, names_from = year, values_from = visitors)
medn_visits_wide
## # A tibble: 3 x 7
##   unit_code `2010` `2011` `2012` `2013` `2014` `2015`
##   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 SAMO      568371 609636 649471 633054 694714 797126
## 2 CABR      763140 813351 877951 856128 893434 981825
## 3 CHIS      277515 242756 249594 212029 342161 324816

There are two functions we’ll discuss today:

  1. pivot_wider() : converts from long –> wide
  2. pivot_longer() : converts from wide –> long

pivot_wider()

Full disclosure, I almost always have to google the arguments that go into the pivot_ functions. Here’s the website for pivot_wider() https://tidyr.tidyverse.org/reference/pivot_wider.html

Here’s a short example - pivoting the MEDN visits from long to wide, with more detail than the prior code chunk.

# check out the original data (medn_visits) in long format
medn_visits
## # A tibble: 18 x 3
##    unit_code  year visitors
##    <chr>     <dbl>    <dbl>
##  1 SAMO       2010   568371
##  2 CABR       2010   763140
##  3 CHIS       2010   277515
##  4 SAMO       2011   609636
##  5 CABR       2011   813351
##  6 CHIS       2011   242756
##  7 SAMO       2012   649471
##  8 CABR       2012   877951
##  9 CHIS       2012   249594
## 10 SAMO       2013   633054
## 11 CABR       2013   856128
## 12 CHIS       2013   212029
## 13 SAMO       2014   694714
## 14 CABR       2014   893434
## 15 CHIS       2014   342161
## 16 SAMO       2015   797126
## 17 CABR       2015   981825
## 18 CHIS       2015   324816
# if we want to make each row a park and each column a year, we can pivot wider
medn_visits_wide <- medn_visits %>%
  pivot_wider(
    # names from is the thing you want to be the columns
    names_from = year,
    # values from is what you want to fill the columns with
    values_from = visitors
  )

# check out the result
medn_visits_wide
## # A tibble: 3 x 7
##   unit_code `2010` `2011` `2012` `2013` `2014` `2015`
##   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 SAMO      568371 609636 649471 633054 694714 797126
## 2 CABR      763140 813351 877951 856128 893434 981825
## 3 CHIS      277515 242756 249594 212029 342161 324816
# we can also make park units the columns and fill it with visitor data for each year
medn_visits_wide <- medn_visits %>%
  pivot_wider(names_from = unit_code, values_from = visitors)
medn_visits_wide
## # A tibble: 6 x 4
##    year   SAMO   CABR   CHIS
##   <dbl>  <dbl>  <dbl>  <dbl>
## 1  2010 568371 763140 277515
## 2  2011 609636 813351 242756
## 3  2012 649471 877951 249594
## 4  2013 633054 856128 212029
## 5  2014 694714 893434 342161
## 6  2015 797126 981825 324816

pivot_longer()

Now that we have our data in wide format, it is time to revert back to pivoting longer. Here’s the reference material for pivot_longer() https://tidyr.tidyverse.org/reference/pivot_longer.html.

# check out the original data (medn_visits_wide) in wide format
medn_visits_wide
## # A tibble: 6 x 4
##    year   SAMO   CABR   CHIS
##   <dbl>  <dbl>  <dbl>  <dbl>
## 1  2010 568371 763140 277515
## 2  2011 609636 813351 242756
## 3  2012 649471 877951 249594
## 4  2013 633054 856128 212029
## 5  2014 694714 893434 342161
## 6  2015 797126 981825 324816
# if we want to make each row an observation, with an associated park and year, we'll pivot_longer
medn_visits_long <- medn_visits_wide %>%
  pivot_longer(
    # first argument is the columns you'd like to transform
    SAMO:CHIS,
    # next is what you'd like to name the former name cols
    names_to = "unit_code",
    # last is what you'd like to name the former values cols
    values_to = "visitors"
  )

# check out the result
medn_visits_long
## # A tibble: 18 x 3
##     year unit_code visitors
##    <dbl> <chr>        <dbl>
##  1  2010 SAMO        568371
##  2  2010 CABR        763140
##  3  2010 CHIS        277515
##  4  2011 SAMO        609636
##  5  2011 CABR        813351
##  6  2011 CHIS        242756
##  7  2012 SAMO        649471
##  8  2012 CABR        877951
##  9  2012 CHIS        249594
## 10  2013 SAMO        633054
## 11  2013 CABR        856128
## 12  2013 CHIS        212029
## 13  2014 SAMO        694714
## 14  2014 CABR        893434
## 15  2014 CHIS        342161
## 16  2015 SAMO        797126
## 17  2015 CABR        981825
## 18  2015 CHIS        324816
# this should look familiar - like the original medn_visits data

pivot challenge

This is a more involved challenge that will draw on your knowledge of both dplyr (from Module 3) and pivoting.

Using the state_pop data, accomplish the following tasks:

  1. Get data for 3 states of your choice for the years 2010-2015 (note - there are no ‘Total’ values in the year column)
  2. Create a data.frame in long format
  3. Pivot wider to make a human-friendly table with states as rows and years as columns
  4. Pivot longer, returning to the original layout

Hint: if your column names are numbers, like years, you must refer to them using back-ticks. A column called 2020 would be `2020`
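
For example, here is a minimal sketch using a tiny made-up table (the data.frame name and values are just for illustration):

# a small made-up wide table with years as column names
toy_wide <- tibble(state = c("CA", "FL"), `2010` = c(1, 3), `2011` = c(2, 4))

# numeric column names must be wrapped in back-ticks
toy_wide %>% select(state, `2010`)
toy_wide %>% pivot_longer(`2010`:`2011`, names_to = "year", values_to = "pop")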


pivot solution
# get 3 states of choice (FL, CA, AK) for years 2010-2015
solution_original <- state_pop %>%
  filter(state %in% c("FL", "AK", "CA") &
    year >= 2010 & year <= 2015)
# pivot wider
solution_wide <- solution_original %>%
  pivot_wider(
    names_from = "year",
    values_from = "pop"
  )

# pivot longer to return to format of solution_original
solution_long <- solution_wide %>%
  pivot_longer(`2010`:`2015`, names_to = "year", values_to = "pop")



Joining Tables

Joining Tables

Oftentimes we want to use data that is split into two or more data.frames. This section will go over the use of functions to “join” or combine data from two tables into one.

We will:

  1. Use bind_rows() and bind_cols() for simple cases

  2. Use dplyr commands for more complicated joins:

     1. left_join() and right_join()
     2. full_join()
     3. inner_join()

  3. Use filtering joins to filter data in one data.frame based on what is in another data.frame:

     1. semi_join()
     2. anti_join()

If you’re starting with this section, run the following code chunk to get caught up.

# call packages
library("readr")
library("janitor")
library("tidyverse")

# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")

# cut some columns from park_visits data and make sure year is numeric data only
park_visits <- park_visits %>%
  select(year, parkname, unit_code, state, visitors) %>%
  filter(year != "Total") %>%
  mutate(year = as.integer(year))


Binding Rows and Columns

bind_rows() and bind_cols() simply merge two data.frames directly.

  1. bind_rows()

bind_rows() adds the rows in the second data.frame to the bottom of the first data.frame. This generally only makes sense if they have many identical columns. Make sure that columns with the same data have the same name!

# we'll first make two subsets of the park_visits data
SHEN_visits <- park_visits %>%
  # get data from Shenandoah.
  filter(unit_code == "SHEN")

SHEN_visits
## # A tibble: 81 x 5
##     year parkname unit_code state visitors
##    <int> <chr>    <chr>     <chr>    <dbl>
##  1  1936 <NA>     SHEN      VA      694098
##  2  2016 <NA>     SHEN      VA     1437341
##  3  2015 <NA>     SHEN      VA     1321873
##  4  2014 <NA>     SHEN      VA     1255321
##  5  2013 <NA>     SHEN      VA     1136505
##  6  2012 <NA>     SHEN      VA     1210200
##  7  2011 <NA>     SHEN      VA     1209883
##  8  2010 <NA>     SHEN      VA     1253386
##  9  2009 <NA>     SHEN      VA     1120981
## 10  2008 <NA>     SHEN      VA     1075878
## # ... with 71 more rows
ACAD_visits <- park_visits %>%
  # get data from Acadia.
  filter(unit_code == "ACAD")

ACAD_visits
## # A tibble: 98 x 5
##     year parkname unit_code state visitors
##    <int> <chr>    <chr>     <chr>    <dbl>
##  1  1919 <NA>     ACAD      ME       64000
##  2  2016 <NA>     ACAD      ME     3303393
##  3  2015 <NA>     ACAD      ME     2811184
##  4  2014 <NA>     ACAD      ME     2563129
##  5  2013 <NA>     ACAD      ME     2254922
##  6  2012 <NA>     ACAD      ME     2431052
##  7  2011 <NA>     ACAD      ME     2374645
##  8  2010 <NA>     ACAD      ME     2504208
##  9  2009 <NA>     ACAD      ME     2227698
## 10  2008 <NA>     ACAD      ME     2075857
## # ... with 88 more rows
# now we will combine the 2 data.frames into one.
SHEN_ACAD_visits <- bind_rows(SHEN_visits, ACAD_visits)

# check that data from both parks is there
SHEN_ACAD_visits %>%
  pull(unit_code) %>%
  table()
## .
## ACAD SHEN 
##   98   81
rm(SHEN_ACAD_visits, SHEN_visits, ACAD_visits)

You generally would not split up a data.frame and then put it together again like this, but it can be useful if you have data that has the same data structure but starts out broken up by site, date, park etc. and you wish to combine it into one data.frame.
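
For example, here is a minimal sketch of that workflow; the file names are hypothetical and just for illustration:

# hypothetical per-park files that all share the same column structure
park_files <- c("SHEN_visits.csv", "ACAD_visits.csv", "OLYM_visits.csv")

# read each file into a list of data.frames, then stack them into one
# (bind_rows() accepts a list of data.frames as well as individual ones)
all_visits <- park_files %>%
  lapply(read_csv) %>%
  bind_rows()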

  2. bind_cols()

bind_cols() adds the columns from the second data.frame to the right of the first data.frame.

# we'll first make two subsets of the park_visits data
visits_left <- park_visits %>%
  # get first 2 columns.
  select(year, parkname)

visits_left
## # A tibble: 21,174 x 2
##     year parkname              
##    <int> <chr>                 
##  1  1904 Crater Lake           
##  2  1941 Lake Roosevelt        
##  3  1961 Lewis and Clark       
##  4  1935 Olympic               
##  5  1982 Santa Monica Mountains
##  6  1919 <NA>                  
##  7  1969 <NA>                  
##  8  1967 <NA>                  
##  9  1944 <NA>                  
## 10  1989 <NA>                  
## # ... with 21,164 more rows
visits_right <- park_visits %>%
  #  get last 3 columns.
  select(unit_code, state, visitors)

visits_right
## # A tibble: 21,174 x 3
##    unit_code state visitors
##    <chr>     <chr>    <dbl>
##  1 CRLA      OR        1500
##  2 LARO      WA           0
##  3 LEWI      WA       69000
##  4 OLYM      WA        2200
##  5 SAMO      CA      468144
##  6 ACAD      ME       64000
##  7 AMIS      TX      448000
##  8 ASIS      MD      738700
##  9 BIBE      TX        1409
## 10 BICY      FL       81157
## # ... with 21,164 more rows
visits_bound <- bind_cols(visits_left, visits_right)

visits_bound
## # A tibble: 21,174 x 5
##     year parkname               unit_code state visitors
##    <int> <chr>                  <chr>     <chr>    <dbl>
##  1  1904 Crater Lake            CRLA      OR        1500
##  2  1941 Lake Roosevelt         LARO      WA           0
##  3  1961 Lewis and Clark        LEWI      WA       69000
##  4  1935 Olympic                OLYM      WA        2200
##  5  1982 Santa Monica Mountains SAMO      CA      468144
##  6  1919 <NA>                   ACAD      ME       64000
##  7  1969 <NA>                   AMIS      TX      448000
##  8  1967 <NA>                   ASIS      MD      738700
##  9  1944 <NA>                   BIBE      TX        1409
## 10  1989 <NA>                   BICY      FL       81157
## # ... with 21,164 more rows
rm(visits_left, visits_right, visits_bound)

bind_cols() assumes that the rows are in the same order in each data.frame, and that there are no rows in one data.frame that are missing from the other. Unless you are sure both of these are true, you should avoid using this function.
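
To see why, here is a minimal sketch with made-up values showing what can go wrong when the rows are sorted differently - bind_cols() pairs rows by position and silently produces a mismatched table, while a join matches on the key column:

# made-up example: the two data.frames are sorted in different orders
left_df  <- tibble(unit_code = c("ACAD", "SHEN"), state = c("ME", "VA"))
right_df <- tibble(code = c("SHEN", "ACAD"), visitors = c(100, 200))

# bind_cols() pairs row 1 with row 1, row 2 with row 2 - the visitor counts end up on the wrong parks
bind_cols(left_df, right_df)

# left_join() matches on the key column instead, so the result is correct
left_join(left_df, right_df, by = c("unit_code" = "code"))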


bind_rows() challenge

Using the park_visits data, make 2 smaller data.frames:

  1. Atlantic will have just data from Puerto Rico (PR) and the Virgin Islands (VI).
  2. Pacific will just have data from American Samoa (AS), Guam (GU) and Hawaii (HI)

Then, using bind_rows() combine the 2 data.frames.


bind_rows Solution
# Atlantic islands
Atlantic <- park_visits %>%
  filter(state %in% c("PR", "VI"))
# Pacific Islands
Pacific <- park_visits %>%
  filter(state %in% c("AS", "GU", "HI"))

# All islands
Islands <- bind_rows(Atlantic, Pacific)


rm(Atlantic, Pacific, Islands)


Basic Joins - left_join() and right_join()

left_join() and right_join() are a safer way to join columns from one data.frame to the columns in another.

For either of these functions, you need to specify one or more key columns using the by argument. Key columns are the data used to determine which rows in one data.frame match the rows in the other. The order of the rows in the data.frames does not matter.

The syntax is left_join(x, y, by = "name"), where x and y are data.frames.

The only difference between left_join() and right_join() is what happens when the rows in the data.frames don’t match up one to one. left_join() keeps all the rows in x and drops any rows in y that don’t match, while right_join() keeps all the rows in y and drops any rows in x that don’t match.

Let’s combine the gas_price data with the state_pop data. The gas price data does not vary by state, so we just have to match the data by year. In order to make the way these functions work a little clearer, we will first filter the state_pop data to only include data before 2000, so that there is less of an overlap between the 2 data sources.

# filter the state data
state_pop2 <- state_pop %>% filter(year < 2000)

# left_join example
state_pop_gas1 <- left_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas1
## # A tibble: 5,100 x 5
##     year state     pop gas_current gas_constant
##    <dbl> <chr>   <dbl>       <dbl>        <dbl>
##  1  1900 AL    1830000          NA           NA
##  2  1901 AL    1907000          NA           NA
##  3  1902 AL    1935000          NA           NA
##  4  1903 AL    1957000          NA           NA
##  5  1904 AL    1978000          NA           NA
##  6  1905 AL    2012000          NA           NA
##  7  1906 AL    2045000          NA           NA
##  8  1907 AL    2058000          NA           NA
##  9  1908 AL    2070000          NA           NA
## 10  1909 AL    2108000          NA           NA
## # ... with 5,090 more rows

A couple of things to notice. The state_pop_gas1 data.frame has 5100 rows because that is what is in the state_pop2 data.frame. Every year is repeated 51 times (once for each state + DC), so the gas price data for a given year gets repeated 51 times as well. The gas_price data only goes back to 1929, so for years before that, the data just has NA for gas prices.
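
A couple of quick checks (a sketch) to confirm those observations:

# the row count matches state_pop2, the x data.frame in the left join
nrow(state_pop_gas1) == nrow(state_pop2)

# the rows with NA gas prices should all come from years before 1929
state_pop_gas1 %>%
  filter(is.na(gas_constant)) %>%
  pull(year) %>%
  range()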

Here is the exact same thing, but this time using right_join()

# right_join example
state_pop_gas2 <- right_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas2
## # A tibble: 3,637 x 5
##     year state     pop gas_current gas_constant
##    <dbl> <chr>   <dbl>       <dbl>        <dbl>
##  1  1929 AL    2644000        0.21         2.38
##  2  1930 AL    2647000        0.2          2.3 
##  3  1931 AL    2649000        0.17         2.18
##  4  1932 AL    2653000        0.18         2.61
##  5  1933 AL    2661000        0.18         2.66
##  6  1934 AL    2685000        0.19         2.67
##  7  1935 AL    2719000        0.19         2.62
##  8  1936 AL    2743000        0.19         2.67
##  9  1937 AL    2762000        0.2          2.63
## 10  1938 AL    2787000        0.2          2.64
## # ... with 3,627 more rows

Now the data is very different. It has 3637 rows, which seems pretty random. What has happened is:

  1. All data pre-1929 was dropped.
  2. Data for the 71 years from 1929 to 1999 has been repeated 51 times to match the 51 states (=3621 records).
  3. The years 2000 to 2015 (16 records) each appear once, as there is no matching data in the state_pop2 data.frame.

So all the data from gas_price is present; where necessary, it was repeated to match up with the state_pop2 data.
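
A quick sketch to verify that row count:

# years that matched state_pop2 (1929-1999) should appear 51 times each;
# the unmatched years (2000-2015) should appear exactly once
state_pop_gas2 %>%
  count(year) %>%
  filter(n != 51)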


Basic Joins - inner_join() and full_join()

In the previous section we used left and right joins to get all of the data from one data.frame with matching data from a second data.frame. Now we will look at 2 other joins:

  1. inner_join() - only rows that have matches in both data.frames are kept
  2. full_join() - you get all the rows from both datasets even if some don’t match.

inner join:

# inner_join example 

state_pop_gas3 <- inner_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas3
## # A tibble: 3,621 x 5
##     year state     pop gas_current gas_constant
##    <dbl> <chr>   <dbl>       <dbl>        <dbl>
##  1  1929 AL    2644000        0.21         2.38
##  2  1930 AL    2647000        0.2          2.3 
##  3  1931 AL    2649000        0.17         2.18
##  4  1932 AL    2653000        0.18         2.61
##  5  1933 AL    2661000        0.18         2.66
##  6  1934 AL    2685000        0.19         2.67
##  7  1935 AL    2719000        0.19         2.62
##  8  1936 AL    2743000        0.19         2.67
##  9  1937 AL    2762000        0.2          2.63
## 10  1938 AL    2787000        0.2          2.64
## # ... with 3,611 more rows


Now the data.frame has 3621 rows (71 years of data x 51 states). This is the smallest data.frame from these joins. The years in state_pop_gas3 are only those that are found in both data.frames.

full join:

# full_join example 

state_pop_gas4 <- full_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas4
## # A tibble: 5,116 x 5
##     year state     pop gas_current gas_constant
##    <dbl> <chr>   <dbl>       <dbl>        <dbl>
##  1  1900 AL    1830000          NA           NA
##  2  1901 AL    1907000          NA           NA
##  3  1902 AL    1935000          NA           NA
##  4  1903 AL    1957000          NA           NA
##  5  1904 AL    1978000          NA           NA
##  6  1905 AL    2012000          NA           NA
##  7  1906 AL    2045000          NA           NA
##  8  1907 AL    2058000          NA           NA
##  9  1908 AL    2070000          NA           NA
## 10  1909 AL    2108000          NA           NA
## # ... with 5,106 more rows
rm(state_pop_gas1, state_pop_gas2, state_pop_gas3, state_pop_gas4)

This data.frame has 5116 rows, the most of all these joins. It includes all data from either data.frame, regardless of whether there is a match.


Basic Joins Challenge

Using the park_visits and the state_pop data create a new data.frame with all the data from park_visits and any matching data from state_pop.

Hint: if you need to use 2 columns as keys, then you write the by = argument like so: by = c("Column1", "Column2").


Basic Join Solution
park_visits_pop <- left_join(park_visits, state_pop, by = c("state", "year"))

rm(park_visits_pop)


Filtering Joins

Sometimes you don’t want to combine two tables, but you do want to filter one table so that it only has data that can be matched to data in another table. semi_join() is used for this.

semi join:

# semi_join example 

state_pop_gas5 <- semi_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas5
## # A tibble: 3,621 x 3
##     year state     pop
##    <dbl> <chr>   <dbl>
##  1  1929 AL    2644000
##  2  1930 AL    2647000
##  3  1931 AL    2649000
##  4  1932 AL    2653000
##  5  1933 AL    2661000
##  6  1934 AL    2685000
##  7  1935 AL    2719000
##  8  1936 AL    2743000
##  9  1937 AL    2762000
## 10  1938 AL    2787000
## # ... with 3,611 more rows

The columns in state_pop_gas5 are the same as those in state_pop2. However, all the data from years prior to 1929 has now been removed, because pre-1929 data is not in the gas_price data.

anti_join() is the opposite of semi_join(). It keeps the data in the first data.frame that does not have a match in the second data.frame.

anti_join:

# anti_join example 

state_pop_gas6 <- anti_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas6
## # A tibble: 1,479 x 3
##     year state     pop
##    <dbl> <chr>   <dbl>
##  1  1900 AL    1830000
##  2  1901 AL    1907000
##  3  1902 AL    1935000
##  4  1903 AL    1957000
##  5  1904 AL    1978000
##  6  1905 AL    2012000
##  7  1906 AL    2045000
##  8  1907 AL    2058000
##  9  1908 AL    2070000
## 10  1909 AL    2108000
## # ... with 1,469 more rows
# clean up
rm(state_pop2, state_pop_gas5, state_pop_gas6)

state_pop_gas6 only has 1479 rows, because it only contains data from before 1929, when the gas_price data starts.


Filtering Joins Challenge

Filter the park_visits data so that it only contains data that can be matched to data in both the state_pop and gas_price data.frames.

Note that the state_pop data has some NAs for population for Alaska and Hawaii prior to 1949. For a bigger challenge, don’t include data for those state / year combinations in your final data set.

Hint: the function is.na(data) will be TRUE when a value is NA. If you put an exclamation mark in front !is.na(data) it will tell you when the values are not NA.


Filtering Joins Solution
## In this case we filter out the NA values inside the semi_join() call. You could also filter them out first, save that to a new data.frame, and then do the joins.

park_visits_filtered <- semi_join(park_visits, state_pop %>% filter(!is.na(pop)), by = c("state", "year")) %>% 
  semi_join(gas_price, by = "year")

park_visits_filtered
## # A tibble: 19,964 x 5
##     year parkname               unit_code state visitors
##    <int> <chr>                  <chr>     <chr>    <dbl>
##  1  1941 Lake Roosevelt         LARO      WA           0
##  2  1961 Lewis and Clark        LEWI      WA       69000
##  3  1935 Olympic                OLYM      WA        2200
##  4  1982 Santa Monica Mountains SAMO      CA      468144
##  5  1969 <NA>                   AMIS      TX      448000
##  6  1967 <NA>                   ASIS      MD      738700
##  7  1944 <NA>                   BIBE      TX        1409
##  8  1989 <NA>                   BICY      FL       81157
##  9  1988 <NA>                   BISO      TN      613011
## 10  1941 <NA>                   BLRI      NC      895874
## # ... with 19,954 more rows
rm(park_visits_filtered)


Day 3: Data Viz

Intro to Data Visualizations

Packages Used in this Section
# Packages used in this section
library(tidyverse)
library(viridis) # for colorblind-friendly palettes


Introduction to Data Visualizations

Goals of this Day

  1. Learn rules for effective data visualizations
  2. Create simple plots, using base R functions
  3. Create customized and interactive plots, using ggplot functions
  4. Arrange multiple plots on a page, using cowplot and patchwork functions
  5. Create interactive maps, using sf and tmap functions


The Power of (Good) Data Visualizations

Data are useful only when used. Data are used only when understood.


Consider three examples:

Example 1. Plots convey messages faster than tables

Most people can understand this…

Average daily Covid cases per 100k people, by region (Sources: State and local health agencies [cases]; Census Bureau [population data])


faster than they can understand this…

Table of Covid cases
Daily Covid cases and population numbers by state (only showing first 7 records)
state timestamp cases total_population
AK 2022-01-25 04:00:00 203110 731545
AL 2022-01-25 04:00:00 1153149 4903185
AR 2022-01-25 04:00:00 738652 3017804
AZ 2022-01-25 04:00:00 1767303 7278717
CA 2022-01-25 04:00:00 7862003 39512223
CO 2022-01-25 04:00:00 1207991 5758736
CT 2022-01-25 04:00:00 683731 3565287

Example 2. Plots reveal patterns and highlight extremes

This table shows average monthly revenue for Acme products.

Average monthly revenue (in $1000’s) from Acme product sales, 1950 - 2020
category product Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
party supplies balloons 892 1557 1320 972 1309 1174 1153 1138 1275 1178 1325 1422
party supplies confetti 1271 1311 829 1020 1233 1061 1088 1395 1376 1152 1568 1412
party supplies party hats 1338 1497 1445 956 1372 1482 1048 877 1404 1030 1458 1547
party supplies wrapping paper 1396 1026 932 891 1364 896 900 1221 1146 967 1394 1507
school supplies backpacks 1802 1773 1611 1723 1799 1730 1813 1676 1748 1652 1819 1759
school supplies notebooks 1153 1471 1541 1371 1592 1514 1725 1702 1457 1604 1729 1279
school supplies pencils 1679 1304 1054 1259 1425 1608 1972 1811 1610 1004 1417 1283
school supplies staplers 1074 1708 1439 1154 1551 1099 1793 1601 1647 1666 1389 1511

Use the table above to answer these questions:

  1. What product and month had the highest average monthly revenue?
  2. What product and month had the lowest average monthly revenue?
Table as Heat Map

Now let’s display the same table as a heat map, with larger numbers represented by darker color cells. How quickly can we answer those same two questions? What patterns can we see in the heat map that were not obvious in the table above?


Example 3. Plots provide insights that statistics obscure

In 1973, Francis Anscombe published “Graphs in statistical analysis”, a paper describing four bivariate datasets with identical means, variances, and correlations.

Anscombe’s Quartet - Four bivariate datasets with identical summary statistics
x1 y1 x2 y2 x3 y3 x4 y4
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.39 19 12.50
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89
Means and variances are identical in the four datasets. The correlation between x and y (r = 0.82) is also identical across the datasets.
x1 y1 x2 y2 x3 y3 x4 y4
mean 9 7.50 9 7.50 9 7.50 9 7.50
var 11 4.13 11 4.13 11 4.12 11 4.12


Anscombe data as plots

Despite their identical statistics, when we plot the data we see the four datasets are actually very different.


Anscombe’s point? TO UNDERSTAND THE DATA, WE MUST PLOT THE DATA!


Rules for Effective Data Visualizations

Anscombe used clever thinking and simple plots to demonstrate the importance of data visualizations. But it’s not enough to just plot the data. To have impact, a plot must convey a clear message. How can we do this?

  • Have a purpose. Every data visualization should tell a story. What story does the Covid plot tell? What message does Anscombe’s quartet of plots convey?

  • Consider your audience. Avoid scientific names, acronyms, or jargon unless your audience is well-versed in that language. Use color-blind friendly colors.

  • Use an appropriate visualization. For example:

    • Line graphs work well for showing changes in continuous data over time
    • Bar charts compare counts or proportions in categorical data (pie charts get a bad rap in the data viz world, but can be useful in certain situations)
    • For statistics (e.g., means) with confidence intervals, point plots with error bars are preferred over bar charts (nice explanation here)
    • Scatterplots are useful for showing the relationship (correlation) between two continuous variables.
    • Matrix heat maps can efficiently compare the magnitude of numbers when we have lots of data structured in table format, especially when colors have a clear connection to the numbers (e.g., scorecard data)
    • Box plots, violin plots, and histograms show distributions and outliers for continuous data. Dot plots are a useful alternative when sample sizes are small.

  • Keep it simple!!!

    • Every plot element and aesthetic should have a purpose.
    • Avoid 3D charts unless you have a good reason to use them.
    • Don’t try to cram everything into one plot (e.g., juxtapose two plots instead of adding a secondary y-axis. Nice explanation here).

  • Use informative text and arrows wisely. Clear, meaningful titles, subtitles, axes titles/labels, and annotations help convey the message of a plot. Use lines and arrows (sparingly but effectively) to emphasize important thresholds, data points or other plot features. Fonts should be large enough with good contrast (against the background) and sufficient white space to be easily readable.


Let’s Start Coding

For today’s data visualizations we will use the park visits data you’re already familiar with from the Data Wrangling module.

From subsets of the park visits and gas price data, we will create plots to:

  • compare the distribution of park visitor counts among unit types
  • show temporal trends and unusual years in park visitor patterns
  • show the relationship between gas price and park visits

Make sure you’ve installed and loaded the packages.

  1. Import and view the park visits and gas price data
  2. Extract a subset of park data and join with gas price data
  3. Get the data in ‘tidy’ format
  4. Convert region to a factor data type
  5. Make sure every park unit has one and only one data record for each year

Step 1. Import and view the data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv") # visits to US National Parks

gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv") # gas prices over space and time

View() the data to get an initial sense of what we’re working with.

# Examine the data to understand data structure, data types, and potential problems
View(park_visits); summary(park_visits)
View(gas_price); summary(gas_price)
Step 2. Extract a subset of park data and join with gas price data

Before we run more detailed data checks, let’s extract a subset of the park_visits data to work with and then join in the gas_price$gas_constant data, using year as the joining column.

The new data frame should be called joined_dat and should only include park_visits data that meet all of these criteria:

  • region is IM, PW or SE
  • unit_type is National Monument, National Historic Site, or National Park
  • year is in the range 2000 - 2015, inclusive

We also only need to keep a subset of the data columns from park_visits. The new data frame should have these columns (in this order): region, unit_type, unit_name, unit_code, year, visitors

NOTE: Even though park_visits$year is classified as a character data type, R recognizes the numeric order of quoted numbers (e.g., “6”) and puts quoted numbers before alphabetic characters. Type sort(c("q", "1", "7", "3", "b")) in your console and see how R sorts those values. Because of these sorting rules for quoted numbers, we can filter year for “2000”:“2015” and R will correctly give us all records for which year (even though it’s a character data type) is between 2000 and 2015, inclusive.
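
For example, running that sort() call shows the digits sorting ahead of the letters (the exact ordering can depend on your locale, but this is the typical result):

sort(c("q", "1", "7", "3", "b"))
## [1] "1" "3" "7" "b" "q"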

# NOTES: 
# 1. When we preface function names with the package name, we avoid potential issues with duplicate function names among packages
# 2. The dplyr::select() function can simultaneously select columns, rename them, and order them as specified. See `?select`
# 3. When we `dplyr::filter()` by multiple conditions, we can use a comma (,) or an ampersand (&) to combine the multiple conditions
# 4. `gas_price$gas_constant` is gas price in constant 2015 dollars/gallon

joined_dat <- park_visits %>%
  dplyr::filter( # filter by multiple conditions
    region %in% c("IM", "PW", "SE"), 
    unit_type %in% c("National Monument", "National Historic Site", "National Park"),
    year %in% "2000":"2015") %>%
  dplyr::select(region, unit_type, unit_name, unit_code, year, visitors) %>% # keep these columns only, and arrange them in this order
  dplyr::mutate(year = as.numeric(year)) %>% # convert `park_visits$year` to numeric, to match the data type of `gas_price$year`
  dplyr::left_join(gas_price[c("year", "gas_constant")], by = "year")

# summary(joined_dat) # always check the data before and after wrangling!

The new data frame should look like this when we type glimpse(joined_dat) in the console:

## Rows: 1,919
## Columns: 7
## $ region       <chr> "PW", "SE", "IM", "IM", "PW", "IM", "PW", "PW", "PW", "IM~
## $ unit_type    <chr> "National Historic Site", "National Historic Site", "Nati~
## $ unit_name    <chr> "Manzanar National Historic Site", "Tuskegee Airmen Natio~
## $ unit_code    <chr> "MANZ", "TUAI", "SAND", "WABA", "CECH", "WACO", "NPSA", "~
## $ year         <dbl> 2000, 2003, 2010, 2002, 2013, 2015, 2002, 2015, 2015, 201~
## $ visitors     <dbl> 38010, 11085, 4063, 15374, 8157, 20551, 1938, 614712, 326~
## $ gas_constant <dbl> 2.02, 2.01, 3.02, 1.75, 3.62, 2.45, 1.75, 2.45, 2.45, 2.4~

We now have all the data we need in a single data frame.

Step 3. Get the data in ‘tidy’ format

We are primarily using dplyr to wrangle data and ggplot2 to plot data. These two packages are part of the Tidyverse of packages for data science. Tidyverse packages share a common design philosophy centered on ideas like breaking down complex tasks into a series of simpler functions executed in sequence (e.g., piping %>%). Functions from tidyverse packages work together most seamlessly and efficiently when data are in a tidy format. Remember that a ‘tidy’ format means data have one variable per column and one observation per row.

Look at the joined_dat data frame. Is it in ‘tidy’ format? Yes, it is! The visitor count for each park unit and year is in a separate data frame row. No need to pivot_longer() anything today. But let’s make two final changes to the data before we begin plotting. First, we will convert region to an ordered factor with more descriptive labels. Second, we will remove park units that don’t have visitor data for every year from 2000 to 2015. We will end with a final check of the data to make sure it’s ready for plotting.
Step 4. Convert region to a factor data type

Convert the variable region to a factor data type. Relabel and re-order the levels as Pacific West < Intermountain < and Southeast (the default order is alphabetical). This way, bar plots and color legends will order the region levels from west to east instead of alphabetically.

# NOTES:
# 1. The factor data type is similar to (and often interchangeable with) the character data type, but limits entries to pre-specified values and allows us to specify an order to those values (other than the default alphabetical order)
# 2. Factors are especially useful for ordered categories (e.g., low < medium < high) with a small number of possible values
# 3. Under the hood, `R` stores factors as integer vectors representing the order of the value (with character string labels)
# 4. I usually classify categorical variables as character data types, unless I want to limit entries to a few pre-specified values, or if I want to enforce a particular order to the category levels

unique(joined_dat$region) # check what values are in `region`
## [1] "PW" "SE" "IM"
joined_dat$region <- 
  factor(joined_dat$region, 
         levels = c("PW", "IM", "SE"), # specifies the factor levels and an order for them. If we omit this argument, the factor levels will be the unique set of values in the column, ordered alphabetically
         labels = c("Pacific West", "Intermountain", "Southeast")) # relabels the levels with more descriptive names, keeping the same order as the `levels` argument

# Check the data
glimpse(joined_dat) # `region` is now a factor data type
## Rows: 1,919
## Columns: 7
## $ region       <fct> Pacific West, Southeast, Intermountain, Intermountain, Pa~
## $ unit_type    <chr> "National Historic Site", "National Historic Site", "Nati~
## $ unit_name    <chr> "Manzanar National Historic Site", "Tuskegee Airmen Natio~
## $ unit_code    <chr> "MANZ", "TUAI", "SAND", "WABA", "CECH", "WACO", "NPSA", "~
## $ year         <dbl> 2000, 2003, 2010, 2002, 2013, 2015, 2002, 2015, 2015, 201~
## $ visitors     <dbl> 38010, 11085, 4063, 15374, 8157, 20551, 1938, 614712, 326~
## $ gas_constant <dbl> 2.02, 2.01, 3.02, 1.75, 3.62, 2.45, 1.75, 2.45, 2.45, 2.4~
levels(joined_dat$region) # shows the (renamed) factor levels and their order
## [1] "Pacific West"  "Intermountain" "Southeast"
str(joined_dat$region) # shows the values are coded as integers indicating factor level order
##  Factor w/ 3 levels "Pacific West",..: 1 3 2 2 1 2 1 1 1 2 ...
summary(joined_dat) # the summary for `region` now counts the number of times each factor level occurs
##            region     unit_type          unit_name          unit_code        
##  Pacific West :527   Length:1919        Length:1919        Length:1919       
##  Intermountain:965   Class :character   Class :character   Class :character  
##  Southeast    :427   Mode  :character   Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##       year         visitors         gas_constant  
##  Min.   :2000   Min.   :       0   Min.   :1.750  
##  1st Qu.:2004   1st Qu.:   57344   1st Qu.:2.320  
##  Median :2008   Median :  187397   Median :3.000  
##  Mean   :2008   Mean   :  602483   Mean   :2.828  
##  3rd Qu.:2012   3rd Qu.:  652166   3rd Qu.:3.610  
##  Max.   :2015   Max.   :10712674   Max.   :3.800
Step 5. Make sure every park unit has one and only one data record for each year

We will be creating plots to see if trends in park visits from 2000 to 2015 differ by region or unit type. To avoid drawing improper conclusions from the data, make sure every park unit has one and only one data record for each year. Specifically, check for:

  • park-year combinations with more than one data record–could be a duplicate or another type of error
  • park-year combinations with no data record–indicating incomplete data for a park unit

Determining if park units have incomplete data is not as simple as filtering for NA’s because missing park unit-years are not represented by records with NA for the visitor count–the missing data are simply not in the data frame.

We also cannot just count the number of records for each park unit and assume that if a park unit has 16 records then the data are fine. A park unit could, for example, be missing one year of data and have a duplicate record for another year. It would then have 16 data records but not one per year from 2000 to 2015.

We present one–but not the only–way to solve this issue.

# NOTES: 
# 1. Check out `?expand_grid` for an alternative approach
# 2. The `table` class stores a count for every combination of the variables passed to table()

# Run the code in pieces to see what the different lines do. For example, copy `table(uc = park_sub$unit_code, yr =park_sub$year)` and run it in the console to see the output. Then copy `table(uc = park_sub$unit_code, yr =park_sub$year) %>% as.data.frame()` and run it in the console to see the output. Repeat until the end of the code chunk.
remove_units <- 
  table(unit = joined_dat$unit_code, yr = joined_dat$year) %>% # build a contingency table of counts for every combination of `unit_code` and `year`
  as.data.frame() %>% # convert from `table` to `data frame` class
  dplyr::filter(Freq != 1) %>% # filter for unit-year combinations that are less than or greater than 1
  distinct(unit) # identify the corresponding park units

# Remove those park units from the data
viz_dat <- joined_dat %>%
  dplyr::filter(!unit_code %in% remove_units$unit)
  
# View(viz_dat) to make sure the park units have been removed. The final data frame should have 1856 rows and 7 columns.
Step 6. Save viz_dat as an R object (optional)
# If you want to save `viz_dat` to your current working directory, run the code below. This code saves `viz_dat` as an .RDS file, which just means it saves it as an `R` object instead of as a .csv file. 

# When we save a data frame as an `R` object we retain the accompanying attribute information (e.g., the order of factor levels for `region`)--this information would not be saved in a .csv file.

saveRDS(viz_dat, "viz_dat.RDS")

# To read the RDS file from your working directory and assign it to `viz_dat`, use the code below. You can assign it to any name you want. If you're reading it from a location other than your current working directory, provide the file path in the function argument.
viz_dat <- readRDS("viz_dat.RDS")

Note that a report of our findings should explain how and why we selected this subset of data to work with.

Step 7. Final data check

Always do a final check of the data to look for missing data, duplicated records, funny numbers or category levels, etc. It helps to create summary tables that decompose the data in various ways. These tables can identify oddities (e.g., category levels with only one record) and also give us an idea of the sample sizes we’ll be working with if we create plots with subsets of the data.

CHALLENGE: Create a table (data frame) that shows the number of park units in each combination of region and unit_type. Extra credit if you can reformat the table to wide instead of long format.

The new data frame (if you have it in wide format) should look like this:

## # A tibble: 3 x 4
##   unit_type              `Pacific West` Intermountain Southeast
##   <chr>                           <int>         <int>     <int>
## 1 National Historic Site              7             7        11
## 2 National Monument                   9            34         7
## 3 National Park                      16            18         7

We see from this table that we have data for 34 National Monuments in the Intermountain region and that the smallest number of park units in any region-type is 7. We’re ready to start plotting.

Answer to challenge question
tab_count <- viz_dat %>% 
  dplyr::distinct(region, unit_type, unit_code) %>%
  dplyr::count(region, unit_type) %>%
  tidyr::pivot_wider(names_from = region, values_from = n)

# View(viz_dat) to double-check the numbers

Plotting with Base R

Data Used in this Section
# This section requires the `viz_dat` data frame created earlier in the module. If you don't have this data frame in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument


Packages Used in this Section
# Packages used in this section
library(tidyverse)


Why plot in base R?

Plots can reveal data issues that we had overlooked with our data checks. I sometimes use base R plotting functions to create quick-and-dirty plots for this purpose. If I want to do a lot of plot customizations, I find ggplot functions easier to remember and use.

In addition to quick plots, base R plotting functions have another important use. Many R packages have functions that create specific data visualizations when R’s generic plot() is applied to an object of a particular class. For example, lm() is a statistical function that fits a general linear model to data and saves the model as an object of class lm. Run the code below to see what happens when we apply plot() to an object of class lm.

# NOTE: `lm` is part of the `stats` package that is automatically installed with `R`. This code comes from the Examples section of `?lm`.

# Create fake data
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

# Simple linear regression of `weight` as a function of group
mod <- lm(weight ~ group) # assign model results to the name `mod`
summary(mod) # model summary
class(mod) # class of `mod` is `lm`
plot(mod) # residual plots
We aren’t going to discuss statistics or residual plots today, but it’s useful to know these common uses for plot().


Feeding Data to Base R Plotting Functions

Look in the help file for the R graphics package that is automatically installed with R (in RStudio’s ‘Packages’ tab, click the hyperlink for the package called ‘graphics’). Here you will find many functions for base R graphics. For example, the boxplot() function draws box plots.

Typically, we can feed data to most base R plotting functions in one of two ways:

  1. Provide the data as a formula of the form y ~ x where y is a vector of data for the y-axis and x is a vector of data for the x-axis. For example, plot(y ~ x, data = my_data) or plot(my_data$y ~ my_data$x).

  2. Provide the vectors as separate arguments. For example, plot(x = my_data$x, y = my_data$y). When providing vectors as separate arguments, each vector must be a standalone vector. That is, we can’t just give column names and provide the data frame via a data argument.

With the non-formula approach, the argument names may not be x and y. For example, with bar plots, the argument name height is used instead of y. Refer to the help (e.g., ?barplot) for each plotting function to see what arguments it accepts.

I tend to use the formula approach for feeding data to base R plotting functions, because it is more consistent across the different types of plots. In other words, I’m too lazy to look up what argument names to use with each plot type for the non-formula approach.
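To make the two approaches concrete, here is a minimal sketch using a small made-up data frame (the names my_data, x, and y are hypothetical and not part of the training data):

my_data <- data.frame(x = 1:10, y = (1:10)^2)

plot(y ~ x, data = my_data)        # formula approach: column names plus a `data` argument
plot(x = my_data$x, y = my_data$y) # non-formula approach: each vector passed separately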


Create a Box Plot

Let’s use base R functions to create a box plot of park visitors by year, for a subset of viz_dat that meets these criteria:

  • region is Intermountain
  • unit_type is National Monument
  • year is 2000, 2005, 2010, or 2015 (each year will be a separate box plot)

We want to know how much spread there is (across park units) in the number of visitors to National Monuments of the Intermountain region. We also want to know if visitor counts differed much among the four years.

# Check out `?boxplot`

# Subset the data
subdat <- viz_dat %>%
  dplyr::filter(
    year %in% c(2000, 2005, 2010, 2015),
    unit_type == "National Monument",
    region == "Intermountain")

boxplot(visitors ~ year, data = subdat) # the basic boxplot, with data provided as a formula

# An alternative way to feed the data as a formula
# boxplot(subdat$visitors ~ subdat$year) # no need for `data` argument

This plot is okay, but even for a basic plot I would want to make the y-axis labels more understandable. Because the visitor numbers are large, R is using scientific notation for the y-axis labels. So instead of 200,000 we see 2e+05. Let’s change that in the box plot and add plot and axes titles.

# Save this boxplot to `p_box` so we can look at the underlying structure
p_box <- boxplot(visitors/1000 ~ year, # so now, 200,000 will be shown as 200 on the y-axis
        data = subdat,
        main = "National Monuments in the Intermountain Region", # plot title
        xlab = "Year", # x-axis title
        ylab = "Number of visitors, in 1000's") # y-axis title

Not bad for basic box plots. Now look at the data underlying the box plots. In ?boxplot, the section titled ‘Value’ explains what these numbers mean.

p_box
## $stats
##         [,1]    [,2]     [,3]    [,4]
## [1,]   3.131   2.919   4.3500   9.492
## [2,]  61.170  53.521  52.5660  53.165
## [3,] 112.611 103.570 103.8875  94.931
## [4,] 258.306 280.068 221.0830 222.723
## [5,] 550.657 505.158 470.9210 416.635
## 
## $n
## [1] 34 34 34 34
## 
## $conf
##          [,1]      [,2]      [,3]      [,4]
## [1,]  59.1935  42.18307  58.22483  48.98625
## [2,] 166.0285 164.95693 149.55017 140.87575
## 
## $out
##  [1] 789.131 848.348 622.320 830.253 525.831 578.554 827.247 793.601 497.506
## [10] 478.833 813.686 588.006
## 
## $group
##  [1] 1 1 2 2 3 3 3 4 4 4 4 4
## 
## $names
## [1] "2000" "2005" "2010" "2015"

From the boxplots and numbers, it looks like visits were generally down in 2010 and 2015 compared to 2000 and 2005. The spread of visitor counts across these 34 National Monuments is huge, though.


Graphical Parameters for Base R Plots

Type ?par in the console to see the many graphical parameters (arguments) that can be used to modify base R plots. Unless you plan to use base R functions to create complex plots, you can probably get by with remembering a small handful of useful graphical parameters, such as pch = to change scatter plot point types, col = to set color, and a variety of cex arguments to change font size and point size.
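As a quick illustration of a few of these parameters, here is a sketch using the built-in cars data set (the parameter values are arbitrary, not recommendations):

plot(dist ~ speed, data = cars,
     pch = 17,          # solid triangle point type
     col = "steelblue", # point color
     cex = 1.5,         # point size, 1.5 times the default
     cex.axis = 0.8)    # shrink the axis tick labels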

Let’s create a basic scatter plot and use some of these graphical parameters to gussy up the plot.


Create a scatter plot

CHALLENGE: Create a scatter plot showing the relationship between park visits (y-axis) and gas price (x-axis) for Arches National Park (unit code is ARCH).

We want to know if annual visitor counts at Arches National Park is correlated with gas price (annual national average).

The data for this plot should be the 16 records of data for Arches National Park (one data point per year). Assign this subset of data to the name ‘arch_dat’ because we will be using it again for later plots.

Make these modifications to the default scatter plot:

  • Show visitor count as 1000’s of visitors (i.e., divide the visitor count by 1000) as we did for the box plots.

  • Use the graphical parameter pch to change the point type from the default open circle to a solid circle (see ?points to find the number corresponding to a filled circle. The examples at the end of ?points show how to use pch as a plot argument)

  • Increase point size to 1.5 instead of the default of 1

Remember, the Internet is a great source of help when coding in R!

The final plot should show this relationship:

Answer to Challenge Question
# Subset the data
arch_dat <- subset(viz_dat, unit_code == "ARCH")

# Visitors vs. gas price
plot(visitors/1000 ~ gas_constant, 
     data = arch_dat, 
     pch = 19, # change point shape to a solid circle
     cex = 1.5, # increase point size
     main = "Annual Gas Prices vs Visitor Counts at Arches National Park, 2000 - 2015", 
     xlab = "Gas price (constant 2015 dollars/gallon)", 
     ylab = "Number of visitors, in 1000's")

This plot shows that annual visitor counts increased as gas prices increased. That’s not the relationship we might have expected to see. But correlation is not causation, and the resolution of data influences the pattern we see. Maybe if we had used average June - July gas prices within 330 km of Arches National Park (instead of annual national averages) we might have seen a different relationship with visitor counts. Maybe, maybe not.

The scatter plot shows another interesting bit of information. What is that funny point that is marching to its own beat?! Gas price was low (~2.50 per gallon in 2015 dollars) and the visitor count was really high, ~ 1.4 million. A quick look at the data shows that this funny point was for the year 2015.

It would be interesting to see if 2015 was a high visitor count year for other park units as well. We will save that question for ggplot.

For now, let’s see if we gain any additional insight from plots showing trends in visits to Arches National Park and trends in national gas prices. Try creating these plots on your own by modifying the code slightly.

# Visitors vs. year
plot(visitors/1000 ~ year, data = arch_dat, pch = 19, cex = 1.5, main = "Visits to Arches National Park, 2000 - 2015", xlab = "Year", ylab = "Number of visitors, in 1000's")

The plot above shows that from 2004 to 2015, visitor numbers at Arches National Park increased every year. Visitor counts REALLY jumped in 2014 and 2015.

# Gas price vs. year
plot(gas_constant ~ year, data = arch_dat, pch = 19, cex = 1.5, main = "Average National Gas Prices, 2000 - 2015", xlab = "Year", ylab = "Gas price (constant 2015 dollars/gallon)")

The plot above shows that gas prices have gone up and down between 2000 and 2015. There was a huge drop in gas price between 2008 and 2009 and again between 2014 and 2015.

These scatter plots give interesting insight on trends in national gas prices and trends in visits to Arches National Park. The pattern we saw in our initial scatter plot–positive correlation between annual visitor counts and national average gas price–was almost certainly a spurious correlation. It’s important to look at data from many different directions and to not jump to conclusions based on a narrow exploration of the data.


Create a Line Graph

Line graphs are often used for showing changes in continuous data over time. Let’s make one last plot in base R before we move on to ggplot.

CHALLENGE: Re-create the plot of trends in visitor counts at Arches National Park, using a line graph instead of a scatter plot. If you’re feeling ambitious, make it a line graph with over-plotted points.

The final plot (with over-plotted points) should show this relationship:

Answer to Challenge Question
# Line graph
plot(visitors/1000 ~ year, data = arch_dat, type = "o", main = "Visits to Arches National Park, 2000 - 2015", xlab = "Year", ylab = "Number of visitors, in 1000's") # can use `type = "o"` to get line plots with points

Building a Basic ggplot

Data Used in this Section
# This section requires the `viz_dat` and `arch_dat` data frames created earlier in the module. If you don't have these data frames in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument

arch_dat <- subset(viz_dat, unit_code == "ARCH") # subset of data that is for Arches National park


Packages Used in this Section
# Packages used in this section
library(tidyverse)
library(ggthemes)


Why ggplot2?

If you search the Internet for information on plotting in R, you will quickly learn that ggplot2 is the most popular R package for plotting. It takes a little effort to learn how the pieces of a ggplot object fit together. But once you get the hang of it, you will be able to create a large variety of attractive plots with just a few lines of R code.

Reference documents and cheatsheets can be very helpful while you are learning to use the ggplot2 package.

So let’s get started! We will build our first ggplot object step-by-step to demonstrate how each component contributes to the final plot. Our first ggplot object will re-create the line graph of trends in visits to Arches National Park.


Step 1. Create a “Master” Template for the Plot
  • identify the data (must be a data frame or tibble)
  • set default aesthetic mappings

An aesthetic mapping describes how variables in the data frame should be represented in the plot. Any column of data that will be used to create the plot must be mapped to an aesthetic. Examples of aesthetics include x, y, color, fill, point size, point shape, and line type.

From the ‘arch_dat’ data we will map ‘visitors’ (in 1000’s) to the y variable and ‘year’ to the x variable. This will set up default axes and axes titles for the plot.

Master template with default aesthetics
# NOTES:
# 1. The first argument provided to the `ggplot()` function is assumed to be the data frame, so it's not necessary to include the argument name `data =`.
# 2. We are assigning the plot to the name `p` so we can build on this `ggplot()` master template in the next step.
# 3. The parentheses around the line of code are a shortcut for `print()`. Without them, the plot would be assigned to `p` but not printed in the plots pane.

(p <- ggplot(data = arch_dat, aes(x = year, y = visitors/1000)))

summary(p) # summary of the information contained in the plot object
## data: region, unit_type, unit_name, unit_code, year, visitors,
##   gas_constant [16x7]
## mapping:  x = ~year, y = ~visitors/1000
## faceting: <ggproto object: Class FacetNull, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetNull, Facet, gg>
p$data # the plot object is self-contained. The underlying data frame is saved as a list element in the plot object. 
## # A tibble: 16 x 7
##    region        unit_type     unit_name   unit_code  year visitors gas_constant
##    <fct>         <chr>         <chr>       <chr>     <dbl>    <dbl>        <dbl>
##  1 Intermountain National Park Arches Nat~ ARCH       2015  1399247         2.45
##  2 Intermountain National Park Arches Nat~ ARCH       2014  1284767         3.4 
##  3 Intermountain National Park Arches Nat~ ARCH       2013  1082866         3.62
##  4 Intermountain National Park Arches Nat~ ARCH       2012  1070577         3.8 
##  5 Intermountain National Park Arches Nat~ ARCH       2011  1040758         3.75
##  6 Intermountain National Park Arches Nat~ ARCH       2010  1014405         3.02
##  7 Intermountain National Park Arches Nat~ ARCH       2009   996312         2.58
##  8 Intermountain National Park Arches Nat~ ARCH       2008   928795         3.61
##  9 Intermountain National Park Arches Nat~ ARCH       2007   860181         3.16
## 10 Intermountain National Park Arches Nat~ ARCH       2006   833049         3   
## 11 Intermountain National Park Arches Nat~ ARCH       2005   781670         2.74
## 12 Intermountain National Park Arches Nat~ ARCH       2004   733131         2.32
## 13 Intermountain National Park Arches Nat~ ARCH       2003   757781         2.01
## 14 Intermountain National Park Arches Nat~ ARCH       2002   769672         1.75
## 15 Intermountain National Park Arches Nat~ ARCH       2001   754026         1.91
## 16 Intermountain National Park Arches Nat~ ARCH       2000   786429         2.02
Step 2. Add Geometry Layer(s)

The geometry layer determines how the data will be represented in the plot. We can generally think of this as the plot type. For example, geom_histogram() creates a histogram and geom_line() creates a line graph.

If we have set default data and aesthetic mappings, the geometry layer will use that information unless additional (overriding) data or aesthetics are set in the layer itself. If we have NOT set default data and aesthetics, they will need to be set within the layer. At a minimum we need to specify the data frame and which data frame columns hold the x and (for most plot types) y aesthetics. The ggplot object will use its own default values for any other aesthetics (e.g., color, size…) if not defined by the user.

Each geometry layer has its own set of required and accepted aesthetics. For example, geom_point() (point plot) REQUIRES x and y aesthetics and ACCEPTS aesthetics that modify point color and size. It does NOT accept the linetype aesthetic. We cover plot aesthetics in more detail in the section titled ‘Customizing a ggplot Object’.

Add a line graph geometry layer
# NOTE: We combine ggplot object layers with `+`, not with `%>%`. With a ggplot object we are not piping sequential results but rather layering pieces. In this example we are building our ggplot object in a particular order, but we could actually reorder the pieces in any way (as long as the master template is set first). The only difference would be that if different layers have conflicting information, information in later layers overrides information in earlier layers.

p + geom_line()

If we had NOT set default data and x and y aesthetics in the plot’s master template we could have set them within the geometry layer like this:

# Example of setting aesthetic mappings within the layer only
ggplot() +
  geom_line(data = arch_dat, aes(x = year, y = visitors/1000))
Add multiple geometry layers

We can include multiple geometry layers in a single ggplot object. For example, to overlay a point plot on the line graph we would add a geom_point() layer. If we had set default data and aesthetic mappings, this additional geometry would use that information unless additional (overriding) information were set in the geom_point() layer itself.

# This works because we had defined default data and aesthetics in `p`, so don't need to provide additional information to the geometry layers
p + geom_line() + geom_point() 
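
# If we want, we can also set accepted aesthetics to constant values inside a layer.
# A minimal sketch, building on `p` (the color and size values here are arbitrary):
p + geom_line() + geom_point(color = "steelblue", size = 2) # constants are set outside aes(), so they apply to every point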

The code below will fail to add the points layer (current versions of ggplot2 throw an error) because geom_point() has no data or aesthetics defined, and we did not define default values in the ggplot object.

ggplot() +
  geom_line(data = arch_dat, aes(x = year, y = visitors/1000)) +
  geom_point()
Geometry layers are shortcuts for the layer() function

Geometry layers can alternatively be specified with the generic plot layer() function. Geometry layers are basically the layer() function with some pre-defined (default) arguments. For example, geom_point() is the same as layer(geom = "point", stat = "identity", position = "identity"). See ?layer for more information.

p + layer(geom = "line", stat = "identity", position = "identity") # this creates the same line graph we had created using `geom_line()`
Step 3. Use scale_...() Functions to Fine-Tune Aesthetics

All aesthetic mappings can be fine-tuned with a scale_...() function. For example, the aesthetic mapping for x can be modified with scale_x_continuous() if the data are continuous, scale_x_discrete() if the data are categorical, and scale_x_date() if the data class is date/time. ggplot2 provides default scale arguments for all aesthetic mappings. If we don’t explicitly define a scale for an aesthetic, ggplot2 will use its default scale settings.

Modify the y-axis with scale_y_continuous()
# NOTES:
# 1. Our plot code is getting long, so we will separate out the components to improve readability
# 2. We will assign this plot object to the name `p2` so we can just add to `p2` in the next step instead of re-typing all these lines
# 3. `ggplot2` also provides shortcuts for some scaling arguments. For example, if we just want to set y-axis limits we can add the layer `ylim(600, 1500)` instead of setting limits via `scale_y_continuous()`

# Check out `?scale_y_continuous` for ways to modify the y-axis
(p2 <- p + 
  geom_line() + 
  geom_point() +
  scale_y_continuous(
    name = "Number of visitors, in 1000's", # y-axis title
    limits = c(600, 1500), # minimum and maximum values for y-axis
    breaks = seq(600, 1500, by = 100)) # label the axis at 600, 700, 800... up to 1500
)
Step 4. Use labs() to Specify Plot Labels

This useful layer allows us to set various plot labels, such as the plot title and subtitle, x- and y-axis titles, and alternative text and plot tags. Some of these labels can be set in other ways rather than via labs(). For example, we had already set the y-axis title in the first argument of scale_y_continuous().

If we want to remove a default label from the plot, we can set the argument to NULL. For example, labs(x = NULL) removes the default x-axis title.

Add plot labels
# NOTE: See `?ggtitle`, `?xlab`, and `?ylab` for alternative ways to set plot labels

(p3 <- p2 + 
  labs(
    title = "Visitor Counts at Arches National Park, 2000 - 2015", # plot title
    subtitle = "Visits to Arches National Park have increased each year since 2004", # plot subtitle
    x = "Year") # x-axis title. We had already set the y-axis title with `scale_y_continous()`.
 )
Step 5. Set a Complete Theme and Plot Element Themes

Use theme() to customize all non-data components of the plot. Conveniently, the ggplot2 package provides complete themes that change several elements of the plot’s appearance at once. The default complete theme is theme_grey. See ?theme for all the ways we can modify the non-data plot elements. This help page also provides a hyperlink to a help page for complete themes (I’m sure there is a way to get there more directly, but I couldn’t figure it out).

Explore ggplot2 complete themes
p3 + 
  theme_minimal() # try out different themes! theme_classic(), theme_void(), theme_linedraw()... 
Create a FiveThirtyEight-inspired plot

For more fun, install and load the ggthemes package, which allows us to apply plot styles inspired by the Economist and the popular FiveThirtyEight website, among other choices.

p3 + 
  ggthemes::theme_fivethirtyeight()
Modify theme elements

In addition to setting complete themes, we can individually modify theme elements. If the theme element layer is set AFTER the complete theme layer, the element settings will override any conflicting default arguments in the complete theme. Some of these elements can alternatively be set in a scale_...() function. For example, to move the y-axis to the right side of the plot, use scale_y_continuous(position = "right") or the corresponding theme() setting.

It’s a bit difficult to understand the help page for ?theme. I usually resort to an Internet search to figure out how to set an element theme.

p3 + 
  ggthemes::theme_fivethirtyeight() +
  theme(
    plot.subtitle = element_text(color = "red", size = 14, hjust=0.5)) # change plot subtitle font color and size, and center the subtitle
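
Picking up the y-axis example mentioned above, here is a minimal sketch of moving the axis via the scale, built on the earlier p object (note that adding a second scale_y_continuous() to p2 or p3 would replace the y scale we already customized):

p + geom_line() + scale_y_continuous(position = "right") # y-axis is now drawn on the right side of the plot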
Step 6. Split a Plot into Many Subplots

We can use facet_grid() or facet_wrap() to split a ggplot object into multiple subplots, each representing a different group level. This is a nice way to simplify data-heavy plots with a single line of code.

With facet_grid we can specify different grouping variables for a row of plots versus a column of plots. For example, we can say that each row of subplots should represent a different region and each column of subplots should represent a different unit type. With facet_grid, facet column titles are horizontal at the top of each facet and facet row titles will be vertical to the right of each facet. With facet_wrap, subplots are “wrapped” in order from left-to-right and top-to-bottom, just like text is wrapped on a page. The facet titles are horizontal at the top of each facet.

Create a faceted plot
# NOTES:
# 1. For this plot we need to use the `viz_dat` dataset so we can facet by park unit name
# 2. In the `facet_wrap` arguments, `ncol = 1` sets a single column so plots are stacked on top of each other and `scales = "free_y"` allows the y-axis range to differ across the subplots (this is useful if the range of y-axis values is very different across subplots and we are more interested in comparing trends than in comparing absolute numbers among the subplots)

subdat <- subset(viz_dat, unit_code %in% c("ARCH", "GRCA", "YOSE")) # plot data for Arches National Park (ARCH), Grand Canyon National Park (GRCA), and Yosemite National Park (YOSE)

ggplot(subdat, aes(x = year, y = visitors/1000)) + 
  geom_line() + 
  geom_point() +
  labs(
    title = "Annual Visitor Counts at Three National Parks, 2000 - 2015",
    x = "Year", 
    y = "Number of visitors, in 1000's") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5)) + # center the plot title
  facet_wrap(vars(unit_name), ncol = 1, scales = "free_y")

We were wondering if 2015 was a very high year for visitor counts in parks other than Arches National Park. It looks like it was for Grand Canyon and Yosemite National Parks too!
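
For comparison with facet_wrap(), here is a minimal facet_grid() sketch laying out region as rows and unit type as columns, as described above (one line per park unit; all plot polish omitted):

ggplot(viz_dat, aes(x = year, y = visitors/1000, group = unit_code)) +
  geom_line() +
  facet_grid(rows = vars(region), cols = vars(unit_type))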

Summary of Useful ggplot Object Components

These are not the ONLY ggplot components, but they are the ones most useful to know (IMHO)

  • Data - a data frame (or tibble). Specified as the first ggplot() argument. Any aesthetics set at this level are part of the “master” template for the plot.

  • Aesthetic mapping - the variables represented in the plot (and how), e.g., x, y, color, fill, size, shape, line type. These aesthetics, except x & y, can also be set to a constant value by specifying them outside aes(). Aesthetics can be defined in the master template or in a geometry layer.

  • Geometry layer - the plot/mark layers, e.g., geom_point, geom_line, geom_boxplot. Can specify multiple such layers in a plot.

  • Scales - used to fine-tune aesthetics. Takes the form ‘scale_xxx_yyy’, where ‘xxx’ refers to the aesthetic and ‘yyy’ to how the aesthetic values will be specified, e.g., discrete, continuous, manual. The associated legend for the aesthetic can also be modified here.

  • Labs - sets the plot labels, including plot title, subtitle, caption, and axis labels. Importantly, this is also where you would set alternative text and plot tags, for Section 508 Compliance.

  • Themes - complete themes (e.g., theme_bw, theme_classic) can be used to control all non-data display. Themes can also be used to more specifically modify aspects of the panel background, legend elements, label text size and color, etc.

  • Facets - subset the data by grouping variables and generate a separate subplot for each grouping variable, putting the graphs all on the same page.

Customizing a ggplot Object

Data Used in this Section
# This section requires the `viz_dat` data frame created earlier in the module. If you don't have this data frame in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument

Packages Used in this Section
# Packages used in this section

# NOTE: This is a useful chunk of code that installs any missing packages and then loads all of them. Useful if we're sharing code with others who may not have some packages installed. Also increasingly useful if the list of packages required to run your script is somewhat long (I don't consider this a long list, but this code is handy to know.)

pkgs <- c("tidyverse",
          "ggthemes", # nice plot themes
          "GGally", # for stat_prop
          "RColorBrewer", # to display Brewer palettes
          "viridis", # for colorblind-friendly palettes
          "scales", # breaks_pretty(), etc.
          "plotly", # to make plots interactive
          "patchwork") # for page layout

installed_pkgs <- pkgs %in% installed.packages() # logical vector: TRUE if the package is already installed
if (length(pkgs[!installed_pkgs]) > 0) install.packages(pkgs[!installed_pkgs],dep=TRUE) # if some packages are not yet installed, go ahead and install them...
lapply(pkgs, library, character.only = TRUE) # ...then load all the packages and their dependencies

Get Ready to be Challenged

Although ggplot2 default options create decent plots, we have extensive options for customizing every aspect of a ggplot object. In this section of the data visualization module we will practice building and customizing ggplot objects.

The format of this section is that for each plot type we:

  • Describe the plot to be created
  • List ggplot2 coding elements that will be introduced (we introduce a lot of new elements with histograms, so you can use them with subsequent plots)
  • Create an initial plot
  • Repeat the steps of examining the code output, identifying ways to improve the plot, and modifying the code, until we are happy with a final plot

We include Challenge Questions that ask you to figure out some coding on your own. These are coding challenges that should be “do-able” at this stage of learning, and will help you hone your skills of searching the Internet and R help documents for answers to your R-related questions–an absolute MUST for continuing to improve your coding skills after this training is done.

Embrace the challenge. Happy coding!


Histograms

Create a histogram to show the distribution of annual visits in the year 2015, by NPS unit. This plot will give us an idea if the distribution of visitor counts across NPS units is highly skewed, and if visitor distributions were similar across different unit types in 2015.

Learn to:

  • Build a ggplot() frequency histogram, density histogram, and dot plot
  • Set color, fill, and transparency aesthetics
  • Examine the data underlying a ggplot() object
  • Add vertical lines on a plot
  • Customize x-axis breaks
  • Customize facet labels to include useful summary statistics
  • Organize multiple plots on a page

General histogram suggestions:

  • Here are guidelines for selecting an appropriate number of bins for histograms, mostly applicable to fairly normally distributed data.
  • For small sample sizes, consider dot plots instead of histograms.
  • Smoothed density histograms with geom_density() can be a useful alternative for comparing multiple histograms with different sample sizes, if the actual counts per bin are not of interest.
Step 1. Basic histogram

Subset the data for this plot and create a basic histogram for a first look at the data.

# Subset the data, only keeping data for the year 2015
# To improve readability, show visitor count as number of visitors in 1000's. This conversion can be done during initial data formatting, or can be calculated directly in the `ggplot()` function. I'm doing the conversion in the data wrangling stage so I don't have to keep retyping it in the plot iterations.

dat_2015 <- subset(viz_dat, year == 2015) %>% 
  dplyr::mutate(visitors_k = visitors/1000)

head(dat_2015) # always check the data before plotting
## # A tibble: 6 x 8
##   region  unit_type  unit_name  unit_code  year visitors gas_constant visitors_k
##   <fct>   <chr>      <chr>      <chr>     <dbl>    <dbl>        <dbl>      <dbl>
## 1 Pacifi~ National ~ Crater La~ CRLA       2015   614712         2.45       615.
## 2 Pacifi~ National ~ Olympic N~ OLYM       2015  3263761         2.45      3264.
## 3 Interm~ National ~ Big Bend ~ BIBE       2015   381747         2.45       382.
## 4 Pacifi~ National ~ Craters o~ CRMO       2015   246826         2.45       247.
## 5 Interm~ National ~ El Malpai~ ELMA       2015   174433         2.45       174.
## 6 Southe~ National ~ Everglade~ EVER       2015  1077427         2.45      1077.
# Basic histogram
ggplot(dat_2015, aes(x = visitors_k)) + # specify the variable to plot
  geom_histogram()

# Compare the plot to the data to make sure things look about right

This histogram has a really long, discontinuous tail. It looks like a very small number of NPS units had much higher visitor counts in 2015 than the bulk of the units did. Check the underlying data to make sure this observed pattern is correct. For now, don’t worry too much about the message that says we are using a default of bins = 30. This message is reminding us that we should thoughtfully select an appropriate bin size for a histogram instead of relying on a default. If we explicitly set the default number of bins to 30 (geom_histogram(bins = 30)) the message goes away, even though we produce the same histogram. For highly skewed data like we have here, typical rules-of-thumb for the appropriate number of bins may not apply.
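
For example, setting the bin count explicitly reproduces the same histogram but silences the message (a minimal sketch):

ggplot(dat_2015, aes(x = visitors_k)) +
  geom_histogram(bins = 30) # same plot as before, but the default number of bins is now set explicitly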

Step 2. Group by unit type

We can split out the unit types in this histogram by assigning a different fill to each unit type or by faceting the histogram by unit type. When we map the ggplot() fill aesthetic to a categorical variable, ggplot() automatically maps that variable to a grouping aesthetic as well. That is, each category level will be represented by a separate histogram (with a different fill) in the plot.

The difference between the fill and color aesthetics in ggplot2 is that fill changes the color in the inside of a point or bar, and color changes the outline.

Let’s try it both ways (fill aesthetic and faceting) and also compare the frequency histogram to a density histogram and dot plot.

First try – use different fill colors for different unit types.

# NOTES: 
# 1. We can set aesthetics to constant values by excluding the `aes()` function. For example, `aes(fill = unit_type)` groups and fills histograms based on the `unit_type` variable (different fill color for each unit type). In contrast, `fill = "red"` creates a single histogram that is red.
# 2. Instead of adding the `fill` aesthetic to the master template, we could have added it directly in the `geom_histogram()` layer

ggplot(dat_2015, aes(x = visitors_k, fill = unit_type)) + 
  geom_histogram(alpha = 0.3, color = "black") # histograms for the different unit types overlap, so apply transparency (alpha = 0.3) and add a black outline on the bars so we can more easily distinguish them in the histograms.

Well, transparency didn’t help much here. We can’t really see what the distribution for National Historic Sites looks like. Try faceting. Save the faceted frequency histogram to a variable named hist1 so we can look at the underlying summary counts in a later step.

# Facet wrap
(hist1 <- ggplot(dat_2015, aes(x = visitors_k)) + 
  geom_histogram() +
  facet_wrap(vars(unit_type), ncol = 1)) # facet by unit type, put all plots in a single column

# Now try using `facet_grid()` instead to see what the difference is

Compare to density histograms and dot plots.

# Density histogram
ggplot(dat_2015, aes(x = visitors_k)) + 
  geom_density() + # <<< only changed this line
  facet_wrap(vars(unit_type), ncol = 1)

# Dot plot
ggplot(dat_2015, aes(x = visitors_k)) + 
  geom_dotplot() + # <<< changed this line
  facet_wrap(vars(unit_type), ncol = 1)

The dot plot is actually nicely informative (though it could use tweaking), given the moderately small sample size of our data set. But the reader is forced to manually count the taller bin stacks, which can be slower than reading the y-axis of the frequency histogram. We could do some hack coding to change the y-axis labels to show counts instead of density, but instead let’s customize the frequency histogram so it is more readable and informative.

Step 3. Customize bin sizes and x-axis labels

Look at the count summaries used to build hist1

View(ggplot_build(hist1)$data[[1]]) 
y count x xmin xmax density ncount ndensity flipped_aes PANEL group ymin ymax colour fill size linetype alpha
20 20 0.0000 -184.6333 184.6333 0.0021665 1.00 1.00 FALSE 1 -1 0 20 NA grey35 0.5 1 NA
1 1 369.2666 184.6333 553.8999 0.0001083 0.05 0.05 FALSE 1 -1 0 1 NA grey35 0.5 1 NA
2 2 738.5332 553.8999 923.1666 0.0002166 0.10 0.10 FALSE 1 -1 0 2 NA grey35 0.5 1 NA
0 0 1107.7999 923.1666 1292.4332 0.0000000 0.00 0.00 FALSE 1 -1 0 0 NA grey35 0.5 1 NA
2 2 1477.0665 1292.4332 1661.6998 0.0002166 0.10 0.10 FALSE 1 -1 0 2 NA grey35 0.5 1 NA
0 0 1846.3331 1661.6998 2030.9664 0.0000000 0.00 0.00 FALSE 1 -1 0 0 NA grey35 0.5 1 NA
0 0 2215.5997 2030.9664 2400.2330 0.0000000 0.00 0.00 FALSE 1 -1 0 0 NA grey35 0.5 1 NA

In each record, y (same as count) tells us the height of the histogram bar, xmin and xmax tell us the x-axis boundaries of that bar, PANEL tells us which facet plot the record belongs to, and fill shows the fill color applied.

We can use the ggplot_build function to examine the information underlying any ggplot object. In this case, we were only interested in the count summaries so we only viewed the data[[1]] list element.

Notice that the first bin goes from -185 to +185. That’s due to geom_histogram()’s default binning (and bin centering) behavior. We could use the boundary and/or center arguments in geom_histogram() to change the bin centering.
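
As a minimal sketch of that idea, the boundary argument can force the bins to start at zero instead of straddling it (the binwidth here is chosen arbitrarily):

ggplot(dat_2015, aes(x = visitors_k)) +
  geom_histogram(binwidth = 200, boundary = 0) + # first bin now spans 0-200 instead of being centered on 0
  facet_wrap(vars(unit_type), ncol = 1)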

Let’s customize the bin sizes and x-axis labels. Save the result as hist2.

(hist2 <- ggplot(dat_2015, aes(x = visitors_k)) +
   geom_histogram(binwidth = 200) + # each bin now spans 200k visitors
   scale_x_continuous(breaks = seq(0, 11000, by = 2000)) + # label the x-axis from 0 to 11,000 at intervals of 2000
   facet_wrap(vars(unit_type), ncol = 1)
)

# Use `ggplot_build()` to see the new histogram bins
Step 4. Add median lines and more informative facet labels
# This is not the only way, but we are going to add new data columns that include the unit type median and a facet label. 
# For alternative ideas, check out the `labeller` argument in `facet_wrap()`

dat_2015_label <- dat_2015 %>%
              group_by(unit_type) %>% 
              mutate(N = n(),
                     type_med = round(median(visitors_k)),
                     type_label = paste0(unique(unit_type),"  (N = ", N, " NPS units; median = ", type_med, "k visitors)")) # `paste` and `paste0` are nice functions for combining strings and variables
                 
# View(dat_2015_label) # look at the new columns    
ggplot(dat_2015_label, aes(x = visitors_k)) +
  geom_histogram( binwidth = 200) +
  scale_x_continuous(breaks = seq(0, 11000, by = 2000)) +
  geom_vline(aes(xintercept = type_med)) + # add median lines
   facet_wrap(vars(type_label), ncol = 1) # we are faceting by `type_label`, which splits the data out the same way as `unit_type`. This is not the only way to customize facet labels.

Step 5. Customize a final plot and arrange plots in a single graphic

CHALLENGE: Customize a final plot however you wish and save it as hist3. Then use the R package patchwork (make sure you have it installed and loaded) to arrange hist1 and hist3 into a single graphic.

In the graphic, arrange the two histograms side-by-side and label them (A) and (B). Give hist3 more room (a wider column) in the graphic so x-axis labels are not squished. Add a title to the graphic.

Here is an example of a final histogram.

Here is an example of two histograms combined in a single graphic.

Answer to challenge question
# Final histogram
(hist3 <- ggplot(dat_2015_label, aes(x = visitors_k)) +
    geom_histogram(fill = "lightblue", color = "black", binwidth = 200) +
    scale_x_continuous(breaks = seq(0, 11000, by = 2000)) +
    geom_vline(aes(xintercept = type_med), color = "red", size = 1, linetype = "dashed") +
    labs(x = "Number of visitors, in 1000's", y = "Number of NPS units", title = "Distribution of Visitor Counts (Yr 2015)",
       subtitle = "Red dashed line shows median count") +
   theme(plot.title = element_text(hjust = 0.5),
         plot.subtitle = element_text(hjust = 0.5)) +
 facet_wrap(vars(type_label), ncol = 1)
 )
# Combine two histograms in a single graphic
(hist_page <- hist1 + hist3 +
  patchwork::plot_layout(ncol = 2, widths = c(1, 2)) + 
  patchwork::plot_annotation(
    title = "Basic (A) and Customized (B) Histograms",
    tag_prefix = "(",
    tag_levels = "1",
    tag_suffix = ")"
)
)
Further improvements?

We are going to stop here with the histograms, but it’s useful to think about ways we can further improve our plots.

  1. It would help to add x-axis labels on every facet, not just to the bottom facet. I found an easy way to do this with the lemon package, but it caused problems with the median lines (showed all three median lines on each facet). I also found some complicated coding solutions, but would prefer an easy fix. So this is something I will have to tinker around with when I have more time (ha!). In general, I try to avoid relying on “new-ish” packages (like lemon) unless I think the package will likely be maintained and updated into the future (e.g., Hadley’s packages).

  2. In these histograms, the really long tail in the national parks panel makes it difficult to see differences in median counts for the three unit types. Depending on the key message we want to convey with the figure, we may consider an alternative plot (perhaps setting the upper x-axis limit to 6000 and adding an annotation that gives GRSM’s visitor count and explains why it isn’t shown).

Look at your final plot and think about ways to improve it (better yet, actually write the code to do so!)


Bar Charts

Create a bar chart to see if regions differ much in % visitor counts (out of the regional total for that year) across the three park unit types. Compare results for year 2000 vs. 2015, and facet by region.

Learn to:

  • Build ggplot() stacked and side-by-side bar charts
  • Easily convert proportion to % on an axis
  • Use brewer color palettes
  • Manually assign your own colors in a plot
  • Label bar chart segments with the % they represent
  • Save a ggplot object as a .pdf file

General bar chart suggestions:

  • Bar charts are commonly used for counts (or proportions) and should have a y-axis that starts at 0 (in contrast to scatterplots, where it is fine to start the x- or y-axis at a different number)
  • With stacked bar charts, it can be difficult to compare across charts for group levels that are not at the bottom of the stack, because this bottom level is the only one with a consistent starting point.
Step 1. Basic stacked bar chart

Look at ?geom_bar to understand the difference between geom_bar() and geom_col().

The code below is copied directly from the first Example in ?geom_bar. Run this code, then View() the mpg data set to try to figure how the data were used to create the resulting bar chart. The mpg data are automatically provided with the ggplot2 install. Check out ?mpg for details.

# geom_bar is designed to make it easy to create bar charts that show
# counts (or sums of weights)
g <- ggplot(mpg, aes(class))
# Number of cars in each class:
g + geom_bar()
# Total engine displacement of each class
g + geom_bar(aes(weight = displ))
# Map class to y instead to flip the orientation
ggplot(mpg) + geom_bar(aes(y = class)) 

# NOTE: We can also use `coord_flip()` to flip the axes

Let’s try building a bar chart with our data. We are going to start off by doing all of the data wrangling within geom_bar() instead of wrangling a plot-specific data set. It’s useful to know how to wrangle within ggplot2 functions because it comes in handy sometimes.

NOTE: In the bar chart we will map the x aesthetic to year as a factor data class (instead of numeric). This isn’t absolutely necessary, but will avoid errors later when we build on this basic bar chart. Bar charts are not meant to display two continuous variables, so with as.factor(year) we are making it clear that year should be considered a categorical variable.

# Stacked bar chart, using the handy `weight` argument
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type)) +
  geom_bar(aes(weight = visitors)) + # we don't need to divide visitors by 1000 for this one because we will be converting to proportions anyway
  facet_grid(cols = vars(region))

Notice that the order of levels in the legend is alphabetical from top to bottom, and the order of colors in the stacked bar chart matches the legend order. If unit_type were a factor data type (instead of character), then the legend and chart colors would follow the factor level orders (remember that we had manually set factor level orders for region)
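
If we wanted the legend (and stack order) to follow a custom order, a minimal sketch would be to convert unit_type to a factor before plotting (the level order below is arbitrary, for illustration only):

viz_dat <- viz_dat %>%
  dplyr::mutate(unit_type = factor(unit_type,
                                   levels = c("National Park", "National Monument", "National Historic Site")))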

Step 2. Create a stacked bar chart of proportions (instead of counts)

Instead of wrangling the data set to calculate group proportions, we can convert count data to proportion data by simply changing the position setting in geom_bar().

# NOTE: Setting the `position` argument to "dodge" would change the stacked bar to a side-by-side bar chart

# Set the `position` argument to "fill" to convert the counts to proportions
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type)) +
  geom_bar(aes(weight = visitors), position = "fill") + # <<< changed this line
  facet_grid(cols = vars(region))
Step 3. Apply a colorblind-friendly palette

The RColorBrewer and viridis packages have many popular color palettes for:

  • sequential data (continuous variables ordered from low to high, e.g., age)
  • diverging data (e.g., pH, where there may be a mid-value and we may want different colors for values below vs. above that mid-value)
  • qualitative data (categorical data like: cat, dog, rat)

The viridis palettes are known for being colorblind-friendly. RColorBrewer has a subset of palettes that is colorblind-friendly.

Apply one of the colorblind-friendly palettes to the bar chart.

# These are colorblind-friendly palettes in `RColorBrewer`
RColorBrewer::display.brewer.all(n=NULL, type="all", select=NULL, exact.n=TRUE, colorblindFriendly=TRUE)

ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type)) +
  geom_bar(aes(weight = visitors), position = "fill") +
  scale_fill_brewer(palette = "Set2") + # pick a palette to apply
  facet_grid(cols = vars(region))

To manually assign colors to each variable:

# Google search a cheat sheet of "R ggplot2 colors" to see hundreds of named colors to choose from. Manually set fill for each unit type by replacing `scale_fill_brewer()` with:

scale_fill_manual(values = c("National Historic Site " = "cadetblue", "National Monument" = "mediumpurple2", "National Park" = "burlywood1"))

To create your own palette of colors, and call it from the plot…

# A colorblind-friendly palette, with colors identified by hexadecimal code instead of by name (you can replace these with color names)
my_colors <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

scales::show_col(my_colors) # the `scales` package has a `show_col()` function to display the colors corresponding to a hexadecimal code

# To use your custom palette, replace `scale_fill_brewer()` with:
  scale_fill_manual(values = my_colors) # make sure the number of colors in your custom palette is greater than or equal to the number of levels in your grouping variable (in our case, 3 unit types; the unit types, in alphabetical or factor level order, are assigned the first 3 colors in the custom palette)
Step 4. Label each bar segment with the proportion represented

A potential problem with stacked bar charts (depending on the message we want to convey) is that it can be difficult to determine differences between two bar charts, for a grouping level that is not grounded at the bottom of the stack. For example, it may be hard to tell if the proportion of visitors to National Monuments in the Southeast was higher or lower in 2015 compared to 2000. That is because the lower boundaries of the National Monuments bar segments have different starting points. If it’s important for our plot to convey this kind of information to the reader, one option would be to label each bar segment with the % of the bar it represents.

Below is some pseudo-complicated code to add bar segment labels as %. Ultimately, an easier option (at least, easier to remember) would have been to wrangle the data with columns that calculate % visitors grouped by region, unit type, and year. Then use geom_col() to just show the pre-calculated percentages, and geom_text() to add those pre-calculated values as labels.

ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type, by = as.factor(year))) + # <<< this `by =` argument is required for the `stat` argument in `geom_text()` to calculate proportion
  geom_bar(aes(weight = visitors), position = "fill") + 
  geom_text(stat = "prop", fontface = "bold", size = 3.5, aes(weight = visitors), position = position_fill(0.5)) + # <<< this line creates the bar segment labels. `stat = "prop"` requires package `GGally`
  scale_fill_brewer(palette = "Set2") +
  facet_grid(cols = vars(region))
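
As mentioned above, an easier-to-remember alternative is to pre-calculate the percentages and then use geom_col() and geom_text(). Here is a minimal sketch of that approach (the bar_dat and pct names are made up for this example):

bar_dat <- viz_dat %>%
  dplyr::filter(year %in% c(2000, 2015)) %>%
  dplyr::group_by(region, year, unit_type) %>%
  dplyr::summarize(visitors = sum(visitors), .groups = "drop_last") %>%
  dplyr::mutate(pct = visitors/sum(visitors)) # proportion of the region-year total

ggplot(bar_dat, aes(x = as.factor(year), y = pct, fill = unit_type)) +
  geom_col() + # bars drawn directly from the pre-calculated proportions
  geom_text(aes(label = scales::percent(pct, accuracy = 1)), position = position_stack(0.5), size = 3.5) +
  scale_fill_brewer(palette = "Set2") +
  facet_grid(cols = vars(region))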

Step 5. Finish it off

CHALLENGE: Finalize your bar chart and save the resulting plot as a .pdf file.

The final bar chart should convert the y-axis labels to percentages instead of proportions (e.g., 50% instead of 0.50) and should include appropriate plot labels. Feel free to add other personal tweaks as well! Then save your ggplot object as a .pdf file in your working directory (test your Internet search skills!)

Here is an example of a final bar chart. Yours may look similar (or not!)

Answer to challenge question
# Assign the final bar chart to the name 'final_bar' so we can save it as a .pdf file in the next step
(final_bar <- 
   ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type, by = as.factor(year))) +
   geom_bar(aes(weight = visitors), position = "fill") + 
   labs(fill = "NPS unit type", title = "Proportion of visits in each NPS unit type and region", subtitle = "Comparison of Years 2000 & 2015", y = "% of visits", x = "Year") +
   scale_y_continuous(labels = scales::percent) + # <<< this line converts proportion to % on the y-axis
   geom_text(stat = "prop", fontface = "bold", size = 3.5, aes(weight = visitors), position = position_fill(0.5)) + # <<< this line creates the bar segment labels. `stat = "prop"` requires package `GGally`
   scale_fill_brewer(palette = "Set2") +
   ggthemes::theme_pander() +
   theme(plot.margin = margin(rep(15, 4)), # increase white margin around plot. This particular theme needs it, others may not.
         plot.title = element_text(hjust = 0.5),
         plot.subtitle = element_text(hjust = 0.5)) +
   facet_grid(cols = vars(region))
)

# And here is the code to save it as a .pdf file
ggsave(filename = "final_bar.pdf", plot = final_bar) 

# NOTE: `ggsave()` uses the file extension to guess at the file type to save as. There are function arguments to set figure width and height (and the associated measurement unit). The default is to save the image based on how we manually sized it in the RStudio plots pane. When we save the image this way, we will see a message in the console about the width and height of the saved image. For reproducibility, it's usually a good idea to hard code the output figure dimensions (if you plan to reuse or share a script)
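
# For example, a minimal sketch that hard-codes the output size (the width and height values here are arbitrary):
ggsave(filename = "final_bar.pdf", plot = final_bar, width = 10, height = 6, units = "in")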
Further improvements?

Spend some time thinking about how you might improve your final bar chart. Here are a few things to consider.

  1. Regions differ in the numbers and relative proportions of unit types. For example, the Intermountain region has very few Historic Sites, and that fact certainly contributes to the very low proportion of visits to Historic Sites in that region. Depending on what message we want to convey with these bar charts, it may be more appropriate to summarize the data in a different way (for example, comparing average number of visits per unit, across unit types).

  2. A potential problem with stacked bar charts (depending on the message we want to convey) is that it can be difficult to determine differences between two bar charts, for a grouping level that is not grounded at the bottom of the stack. For example, it may be hard to tell if the proportion of visitors to National Monuments in the Southeast was higher or lower in 2015 compared to 2000. That is because the lower boundaries of the National Monuments bar segments have different starting points. If it’s important for our plot to convey this kind of information to the reader, one option would be to label each bar segment with the % of the bar it represents, and perhaps also the actual number of visits.


Line Graphs

Create a line graph to show trends in visitor counts, separately for each unit type and region. Do not include Southeast region data in this figure.

Learn to:

  • Build a line graph
  • Make it interactive (i.e., hovering over a line will cause the data to show in a pop-up)

Step 1. Basic line graph

Subset the data for this plot and create a basic faceted line graph for a first look at the data.

# Subset the data
line_dat <- viz_dat %>%
  dplyr::filter(region != "Southeast") %>%
  dplyr::mutate(visitors_k = visitors/1000)

ggplot(line_dat, aes(x = year, y = visitors_k)) +
  geom_line() +
  facet_grid(rows = vars(unit_type), cols = vars(region))
Step 2. Fix and finalize the line graph

CHALLENGE: This one’s all yours! Create a final line graph that is interactive.

After you fix the mis-connected lines issue, learn about and apply two functions in your final line graph:

  1. breaks_pretty() from the scales package automatically sets attractive axis breaks–use this function to improve the spacing of your y-axis labels.
  2. ggplotly() from the plotly package makes it super easy to make a ggplot object interactive.
Make sure you have scales and plotly installed and loaded before you begin.

Here is an example of a final line graph. Feel free to personalize your graph with your own favorite theme and customizations.

Answer to challenge question
# NOTES: 
# 1. If we had mapped unit_code to a color or linetype aesthetic we would not have needed to separately specify the group aesthetic
# 2. `ggplotly` doesn't support subtitles created in a ggplot object. To add subtitles to a ggplotly object, we can hack the `ggplotly` title text feature, or use `ggplotly` annotations (but then we have to specify position of text on plot)

(final_line <- ggplot(line_dat, aes(x = year, y = visitors_k, group = unit_code)) + # without a group aesthetic, ggplot doesn't know to draw a separate line per unit. We could instead map unit_code to a color or linetype aesthetic, which would automatically map it to the group aesthetic as well.
  geom_line() + 
  labs(x = "Year", y = "Number of visitors, in 1000's") +
  scale_y_continuous(breaks = scales::breaks_pretty()) +
  facet_grid(rows = vars(unit_type), cols = vars(region)) + # each row is a unit type and each column is a region; with facet_grid, a freed y scale would vary by row but be shared within a row, which is a good way to decide which variable should be the facet row vs. column
  ggthemes::theme_fivethirtyeight(base_size = 14) +
    theme(plot.margin=unit(c(2,1,1,1), 'cm')) # needed extra space at top to fit ggplotly hack subtitle
  ) %>%
  ggplotly() %>%
  layout(title = list(
    text = paste0("Trends in visitor counts (in 1000's), by NPS unit type and region", '<br>', '<sup>', 'Each line represents an NPS unit', '</sup>'))) 
Further improvements?

This line graph was really meant to add a few more skills to your R plotting toolkit. It can be greatly improved, and we’ll leave it at that.

Matrix Heat Maps

Create a matrix heat map to examine differences in annual visitor counts among National Parks and trends over time

Learn to:

  • Build a matrix heat map from a data frame
  • Reverse the order of axes
  • Add text labels to heat maps
  • Change the length and position of a legend
  • Order heat map rows based on average rank over time
Step 1. Basic matrix heat map

Subset the data for this plot and create a basic matrix heat map for a first look at the data.

# Subset the data, only keeping data for National Parks
# Instead of showing visitors in 1000's we will show visitors in millions because annual National Park visitor counts are generally higher
tab_visits <- viz_dat %>%
  dplyr::filter(unit_type == "National Park") %>%
  dplyr::mutate(visitors_m = visitors/1000000,
                year_f = as.factor(year)) 

head(tab_visits) # always check the data before plotting
## # A tibble: 6 x 9
##   region  unit_type unit_name   unit_code  year visitors gas_constant visitors_m
##   <fct>   <chr>     <chr>       <chr>     <dbl>    <dbl>        <dbl>      <dbl>
## 1 Pacifi~ National~ Crater Lak~ CRLA       2015   614712         2.45      0.615
## 2 Pacifi~ National~ Olympic Na~ OLYM       2015  3263761         2.45      3.26 
## 3 Interm~ National~ Big Bend N~ BIBE       2015   381747         2.45      0.382
## 4 Southe~ National~ Everglades~ EVER       2015  1077427         2.45      1.08 
## 5 Southe~ National~ Great Smok~ GRSM       2015 10712674         2.45     10.7  
## 6 Pacifi~ National~ Redwood Na~ REDW       2015   527143         2.45      0.527
## # ... with 1 more variable: year_f <fct>
# Starting plot
ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) + 
  geom_tile()

# Try setting `x = year` instead of `x = year_f`, to see what changes in the heat map
# Compare the plot to the data to make sure things look about right
Step 2. Improve the visibility of matrix cells

The matrix cell colors all run into each other. To make the individual cells more obvious, we need to change the cell color from the default black to white. We also don’t get much color range with many shades of a single color. It’s easier to see differences among heat map values if we use a palette with a wider color range, e.g., a palette that blends two different colors.

ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) + 
  geom_tile(colour = "white", size = 0.3) + # <<< white border with thicker border than default
  scale_fill_viridis_c() # <<< colorblind-friendly palette with a larger range of colors

What patterns do we see in the heat map so far? With the viridis palette it’s a bit easier to see that there are a few National Parks with really high visitor counts (Great Smoky Mountains and Grand Canyon National Parks, for example).

Step 3. Label cells with their values

If we are not just trying to show “big picture” patterns in the heat map, but want the reader to be able to distinguish smaller differences, it is useful to label the heat map cells with their actual values. Before we do that here, we will also reverse the color order so the lighter colors correspond with smaller values (to me, that order seems to make more sense).

ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) +
  geom_tile(colour = "white", size = 0.3) +
  scale_fill_viridis_c(direction = -1) + # <<< Reverse color order
  geom_text(aes(label=round(visitors_m, 1)), size = 3) # <<< add the cell labels, rounded to one digit. Adjust font size.

With the numbers added, we can now more clearly see that annual visitor counts increased at some parks (e.g., Arches National Park) over time, and declined at others (e.g., Mammoth Cave National Park). Those patterns aren’t that easy to discern with the colors alone.

Step 4. Make text color conditional on value

It’s difficult to see black numbers against the darker colors. One way to fix this is to set the text color conditional on the matrix value.

ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) + 
  geom_tile(colour = "white", size = 0.3) +
  scale_fill_viridis_c(direction = -1) + 
  geom_text(aes(label=round(visitors_m, 1), color = visitors_m > 7), size = 3) + # <<< make color conditional on value. Choose a value that makes sense for the color palette you use.
  scale_color_manual(values = c("black", "white")) # set color of TEXT as FALSE(0) = black (for <=7) and TRUE(1) = white (for >7)
Step 5. Finalize the heat map

This is a decent start. The rest is up to you!

CHALLENGE: Create a final matrix heat map.

See how many of these Challenge tasks you can figure out.

  1. Reverse the order of the y-axis so it’s alphabetical from top to bottom (i.e., Arches NP at top, Zion NP at bottom).
  2. Remove the legend that was created by default for the text color (the one with TRUE / FALSE levels) - we don’t need that legend.
  3. Change the title of the visitor counts legend to “Visitors, in millions” (right now it’s “visitors_m”)
  4. Move the legend so it’s above the plot and make it wider horizontally
  5. Change the legend breaks so they show the numbers 2 - 10, in increments of 2
  6. Customize the heat map with labels and any other tweaks you would like

SUPERSTAR CHALLENGES!!!

SUPERSTAR 1. Facet by region and fix any formatting problems that you get (the problems will be obvious)

SUPERSTAR 2. Change the Park name labels so they say “NP” instead of “National Park” (e.g., “Yosemite NP” instead of “Yosemite National Park”) – probably easiest to do this with the initial data wrangling, before the data are provided to ggplot()

Here is an example of a final matrix heat map with all the challenges met.

With this final heat map it’s easy to see regional patterns in visitor counts. In all regions there are a few National Parks with really high visitor counts. The Intermountain region has many National Parks that typically exceed 2 million visitors per year.

Answer to challenge question
tab_visits <- viz_dat %>%
  dplyr::filter(unit_type == "National Park") %>%
  dplyr::mutate(
    visitors_m = visitors/1000000,
    year_f = as.factor(year),
    unit_name_abbrev = gsub("National Park", "NP", unit_name)) # <<<< change to "NP". Could have also used `sub()`

head(tab_visits) # Check the results

(final_heat <- ggplot(tab_visits, aes(y = unit_name_abbrev, x = year_f, fill = visitors_m)) + 
  geom_tile(colour = "white", size = 0.3) +
  scale_y_discrete(limits=rev) + # <<<< reverse y-axis order
  scale_fill_viridis_c(direction = -1, name = "Visitors, in millions", breaks = seq(2,10, by = 2)) + # <<< add legend name, set legend breaks
  geom_text(aes(label=round(visitors_m, 1), color = visitors_m > 7), size = 3, show.legend = FALSE) + # <<< turn off text legend. Other ways to do it include `guides(color = "none")`
  scale_color_manual(values = c("black", "white")) +
  labs(x = "Year", y = "NPS Unit", title = "Trends in Annual Visitor Counts by NPS Unit, 2000 - 2015", subtitle = "Higher visitor counts shown as darker cells") + 
  theme_minimal(base_size = 12) + # <<< my theme of choice
  theme(legend.position = "top", # <<<< move legend to top
        legend.key.width = unit(2.5, "cm")) + # <<< make legend longer
  ggforce::facet_col(vars(region), scales = "free_y", space = "free") # <<< fix facet format. Could also do facet_grid (without `ggforce`) but I like the fact that `facet_col` keeps the facet labels on top
)
Further improvements?

Matrix heat maps are great for showing a lot of information in a highly condensed figure, but they have their limits. Some thoughts on our final matrix heat map:

  1. This isn’t a great way to convey a message about trends in visitor counts at an individual National Park. The much higher counts at some National Parks (GRSM!) mean that the plot highlights differences among National Parks more than it highlights changes over time (which were a much smaller difference). Check out the oob_squish() function as a possible fix for this problem (see the sketch after this list).

  2. We could make it easier to identify the most visited National Parks per region by ordering the data within region so National Parks with highest average annual visitor counts are at the top, and the lowest are at the bottom. Add a caption to explain the ordering.

  3. As before, it would be useful to add x-axis labels (year) on every facet instead of just at the bottom of the page.

  4. Note that we have zeros in some of the cells because we showed counts as millions of visitors, rounded only to one digit past the decimal. We could fix this issue by rounding to two digits, but that would clutter the heat map even more than it already is.

  5. A potential problem is that readers may think the different text colors have some special meaning other than just for increasing visibility. Not sure how to deal with that.

  6. Font sizes are pretty small. Perhaps we should split this into three separate heat maps (by region) and use larger font sizes.
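
To illustrate item 1 above, here is a minimal sketch of how oob_squish() could be used to cap the fill scale so that within-park changes are easier to see. The cutoff of 5 million is just an illustrative value; oob_squish() comes from the scales package and keeps out-of-range values on the plot by squishing them to the nearest limit instead of dropping them.

ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) + 
  geom_tile(colour = "white", size = 0.3) +
  scale_fill_viridis_c(direction = -1,
                       limits = c(0, 5),           # cap the color scale at 5 million visitors
                       oob = scales::oob_squish) + # values above 5 are shown as 5 instead of being dropped
  geom_text(aes(label = round(visitors_m, 1)), size = 3)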

Visualizing Spatial Data

GIS in R

Packages used in this section:

library(sf) # for working with simple features
library(tmap) # for mapping spatial data
library(leaflet) # Another package for mapping spatial data

If folks are having trouble installing tmap due to an issue with one of its dependencies, terra, try running the following code, and then reinstall tmap.

install.packages('terra', repos='https://rspatial.r-universe.dev')

If you tried to do GIS and make maps in R even a few years ago, you probably encountered the same frustrations I did. There were a ton of packages out there, each with its own file format and coding convention, and each package rarely had all the features I needed. It was not easy to navigate, and I often found myself just exporting my work out of R and doing the rest in ArcGIS… Enter the sf and tmap packages, which are the latest and greatest R packages devoted to GIS and map making! Most of the frustrations I had with earlier packages have been resolved with these two packages.

The sf package is the workhorse for anything you need to do with spatial vector data. File types with sf are called simple features, which follow a set of GIS standards that are becoming the universal data model for vector-based spatial data in R and that most GIS-based packages in R now employ. Simple features are also now the standard for open source GIS in general. That means if you’re trying to troubleshoot something with simple features, you’ll need to specify that it’s for R, rather than PostGIS or some other implementation. The sf package is also superseding the rgdal package, which used to be the more common data model in R and open source GIS. The more I use this package, the more impressed I am with how intuitive it is to use, and how well documented the functions are. For vector data, I have yet to need to perform a task that sf couldn’t do.

The main application of the tmap package is making maps, and it was designed using a grammar of graphics philosophy similar to ggplot2. In practice for tmap, this means that maps are built as a collection of many small functions/layers that get added together with +, and order matters. There are also tmap-enabled functions that you can use in ggplot2 plots too, but you can do a lot more in tmap. I also prefer the look of tmap’s built-in compass, legends, etc. over the ones available in ggspatial, which is an add-on package for plotting spatial data in ggplot2.

The raster package is also excellent for analyzing/processing raster data. For large jobs, I find it easier to work with than ESRI tools, and it tends to run a lot faster (I haven’t compared it with python).

Finally, the leaflet package in R allows you to create interactive maps as html widgets. These are often included in R shiny apps, but they can also be used in R Markdown with HTML output (for more on that, attend next Wednesday’s session on R Markdown). An internet connection is required for the maps to be interactive. Without an internet connection the map will show the default extent defined by the code.

The leaflet package is relatively easy to use for basic mapping. For more advanced features or to customize it further, you often end up having to code in JavaScript, which is the language leaflet was originally programmed in. There’s also a lot more online help available for the JavaScript version of the leaflet library than the R version. If you’re really stumped about something, you may find your answer with the JavaScript help.


Using sf and tmap

My two favorite features of sf are 1) the attribute table of a simple feature (sf’s equivalent of a shapefile) is a dataframe in R, and 2) package functions are pipeable with tidyverse functions. That means, if you want to delete points in your layer, you can use dplyr’s filter() function to filter out those points. The sf package will update the geometry of the layer to remove the points that were filtered out.

To demonstrate the use of sf and tmap, I’m going to generate a simple GRTS sample using spsurvey, which now connects to sf instead of sp and rgdal. Then we’ll filter out points that don’t fall in a forest habitat to demonstrate how to work with sf data in R. Finally we’ll plot the results using tmap.

I wasn’t able to figure out how to post a shapefile that you could easily download in R with a url. I emailed them to you as data.zip the previous week. I also posted all of the files to our Scientists Team, which can be downloaded using the following link: https://doimspp.sharepoint.com/:f:/s/ScientistsTraining2022/EiXOTYV1l4RCk1sMr5yXhZUB1ZFEaAlAN4elvsYbBfYHRg?e=ktVy5n. To follow along, you’ll need to download (and unzip if you’re using the email attachment) the shapefiles and save them to a “data” folder in your working directory.

Step 1. Load and prep shapefiles using sf

The code chunk below has a lot going on, but bear with me. This is the easiest way I’ve found to customize the layers in tmap plots. The basic steps in the code chunk below are:
  1. Load libraries
  2. Set a seed for GRTS sample. As long as you save the seed and layers used to generate the GRTS sample, you should be able to recreate the same GRTS sample.
  3. Read in park boundary and vegetation map layers.
  4. View quick plot of layer in base R, just to make sure it looks right.
  5. Clip the veg layer to the park boundary using intersection, so map looks better.
  6. Look at the attribute tables for each layer.
  7. Add 2 columns to the veg layer to simplify the habitats (simp_veg) and set their colors (fill_col)

Once those steps are completed, we’re ready to generate a GRTS sample and start making a map. Note that I’m using 6-digit hex colors (e.g., “#ffffff” is white) to define the fill color for each habitat type. To find your own colors or see what these ones look like, check out htmlcolorcodes.com

#install.packages(c("sf", "spsurvey"))
library(dplyr) # for filter, select, mutate and %>%  
library(sf)
library(tmap)
library(spsurvey)

# Generate a random number for the seed
set.seed(62051)
sample(1:100000, 1) #62051

# Read in shapefiles from teams data folder
sara_bound1 <- st_read('./data/SARA_boundary_4269.shp')
sara_veg1 <- st_read('./data/SARA_Veg.shp')

# Check that the projections match; fix the one that doesn't match
st_crs(sara_bound1) == st_crs(sara_veg1) # FALSE- projections don't match.
# sara_bound1 needs to be reprojected to UTM Zone 18N NAD83. 
sara_bound <- st_transform(sara_bound1, crs = 26918)
st_crs(sara_bound) == st_crs(sara_veg1) # TRUE

# Quick plot of first column in attribute table
plot(sara_bound[1])
plot(sara_veg1[1]) # bigger extent than boundary

# Intersect boundary and veg so they have the same extent
sara_veg <- st_intersection(sara_veg1, sara_bound)
plot(sara_veg[1])
# View attribute table of layers
head(sara_bound) # 1 feature with 95 fields

str(sara_veg)
head(sara_veg)
names(sara_veg)
table(sara_veg$ANDERSONII)

# Simplify vegetation types for easier plotting
dev <- c('1. Urban or Built-up Land', '11. Residential', 
         '12. Commercial and Services', '13. Industrial',
         '14. Transportation, Communications, and Utilities', 
         '17. Other Urban or Built-up Land')
crop <- c('21. Cropland and Pasture', 
          '22. Orchards, Groves, Vineyards, and Nurseries', 
          '31. Herbaceous Rangeland')
shrubland <- c('32. Shrub and Brush Rangeland')
forest_up <- c('41. Deciduous Forest Land', '42. Evergreen Forest Land', 
               '43. Mixed Forest Land')
forest_wet <- c('61. Forested Wetland')
open_wet <- c('62. Nonforested wetland', '62. Nonforested Wetland')
water <- c('5. Water', '51. Streams and Canals', '53. Reservoirs')
unk <- 'Unclassified'

# Create 2 fields in the veg attribute table: simp_veg, and fills
sara_veg <- sara_veg %>% 
  mutate(simp_veg = case_when(ANDERSONII %in% dev ~ 'Developed',
                              ANDERSONII %in% crop ~ 'Open field',
                              ANDERSONII %in% shrubland ~ 'Shrublands',
                              ANDERSONII %in% forest_up ~ 'Forest',
                              ANDERSONII %in% forest_wet ~ 'Forested wetland',
                              ANDERSONII %in% open_wet ~ 'Open wetland',
                              ANDERSONII %in% water ~ 'Open water',
                              ANDERSONII %in% unk ~ 'Unclassified',
                              TRUE ~ 'Unknown'),
         fill_col = case_when(simp_veg == 'Developed' ~ '#D8D8D8',
                              simp_veg == 'Open field' ~ '#f5f0b0',
                              simp_veg == 'Shrublands' ~ '#F29839',
                              simp_veg == 'Powerline ROW' ~ '#F9421D',
                              simp_veg == 'Forest' ~ '#55785e',
                              simp_veg == 'Forested wetland' ~ '#9577a6',
                              simp_veg == 'Open wetland' ~ '#c497d4',
                              simp_veg == 'Open water' ~ '#AFD0F2',
                              simp_veg == 'Unclassified' ~ '#ffffff'))


Sidebar on simple features and shapefiles

Before moving to the next step, I wanted to show how easy it is to create simple features from dataframes that contain X/Y coordinates. We’ll read in a fake dataset in the GitHub repo for this training, and call it wetdat. The dataset contains fake species composition data for wetland monitoring sites in ACAD and includes X and Y coordinates. We’ll use dplyr functions to calculate the number of invasive and protected species in each site by year combination. Then we’ll make it a simple feature, and save it as a shapefile. Note that there are multiple formats you can save simple features as. I only show the shapefile version, in case you find yourself going between R and ArcGIS.

library(dplyr)
# Import data
wetdat <- read.csv(
  'https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv')

# Summarize so that each site x year combination has 1 row
wetsum <- wetdat %>% group_by(Site_Name, Year, X_Coord, Y_Coord) %>% 
  summarize(num_inv = sum(Invasive), num_prot = sum(Protected), 
            .groups = 'drop') # keeps dplyr quiet in console

# Check first 6 rows of output
head(wetsum)

# Convert dataframe to sf
wetsum_sf <- st_as_sf(wetsum, coords = c('X_Coord', 'Y_Coord'), crs = 26919) 
  # ACAD is UTM Zone 19N NAD83, hence the difference from SARA, which is in Zone 18N.

# Write sf to disk as shapefile
st_write(wetsum_sf, 'ACAD_wetland_data.shp')


Step 2. Generate GRTS Sample

The spsurvey package has been updated to point to sf instead of rgdal. It’s a code-breaking change if you have old R scripts that generated GRTS samples. However, the process is even easier now, as you can see below. Here we’re just going to use the simplest of GRTS designs. The output from grts() has multiple slots. The one we want is sites_base, and you can see how we get that slot in the code below.

Note: While I followed the approach documented in the spsurvey vignette to generate reproducible GRTS samples, it does not appear to be working as I’m testing it. Despite running set.seed(62051) after loading spsurvey and then running the code chunk below, each sample appears to be different.

sara_grts <- grts(sara_bound, n_base = 100) # generate 100 points within SARA boundary
sara_grts$sites_base$priority <- as.numeric(row.names(sara_grts$sites_base)) # add priority number (same as row.name)
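
If reproducibility matters, one thing worth trying (untested here) is calling set.seed() again immediately before the grts() call. Assuming grts() draws from R’s built-in random number generator and nothing else consumes random numbers in between, re-seeding right before the call should return the same sample each time.

# Untested sketch: re-seed immediately before generating the sample
set.seed(62051)
sara_grts <- grts(sara_bound, n_base = 100)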


Step 3. Spatial join and filter

The first step here is to spatially join the sara_grts layer to the sara_veg layer. Here we only care about sara_veg’s simp_veg field, so I used dplyr’s select() function to keep just that column. After joining, you should see a simp_veg field added to the end of the grts layer (it’s actually to the left of geometry, which is always the last column). In the next step, we then filter the points in the newly created grts_veg layer to only include points that fall in forest habitat.

# Spatial join
grts_veg <- st_join(sara_grts$sites_base, sara_veg %>% select(simp_veg))
names(grts_veg)

# Create list of forest habitats
sort(unique(sara_veg$simp_veg))
forest_veg <- c('Forest', 'Forested wetland')

# Filter out non-forest points
grts_forest <- grts_veg %>% filter(simp_veg %in% forest_veg)
nrow(grts_veg) # 100 points
nrow(grts_forest) # fewer points

Challenge: Filter GRTS by priority

How would you filter out GRTS points with a priority higher than 50?
Answer
grts_50 <- grts_veg %>% filter(priority <= 50)
nrow(grts_50)
## [1] 50


Step 4. Create map

Now it’s time to plot. The code below may seem a bit much, but we’ll break it down piece by piece. The great thing about building maps this way is that you’re essentially building a template in code that you can steal from and adapt for future maps. You can even turn these into a function that’s even easier to use in future maps. That’s beyond what we can cover here, but it’s another great benefit of making maps in R instead of ArcGIS.

The code below is broken into the various pieces that make up the map. The way tmap works is that first, you have to add the layer via tm_shape(), and then you specify how you want that layer to look, like tm_fill(), or tm_border(). Each piece has its own legend as well. This is similar to how you start each ggplot2 graph defining the data with ggplot(data, aes(x = xvar, y = yvar)), and then start adding geom_point(), or whatever else to define how the data are plotted. The difference with tmap is that every layer you want in the map has to be coded this way. Finally tm_layout() is similar to theme() in ggplot2, and is where you can customize the map layout.

A couple of notes about the code below:
  • The first line that defines for_legend makes a list of the habitat types in simp_veg and their associated colors. That saves time having to type them all over again, and potentially spelling them wrong.
  • The first tm_shape() in the map sets the projection and the bounding box. If you don’t set the bounding box, the map will show the largest extent of your layers. So if you have a roads layer at the state-level, your map will be zoomed at that extent, instead of the park boundary.
  • Note the difference between tm_fill(), which fills in colors, and tm_borders(), which only plots outlines. If you want both borders and fills, use tm_polygons() instead.
  • The z value in the tm_add_legend() allows you to set the order that each legend appears in the map. Otherwise, they’ll appear in the order you specify the layers.
  • The code for tm_symbols() allows you to change the symbol format in a similar way to ggplot2. We then added tm_text() to label each point using the numbers in the priority column of the grts_forest attribute table. The xmod and ymod arguments allow you to offset labels from the points horizontally or vertically. In this case, a negative xmod moves the label to the left, and a negative ymod moves the label below the point. The default location for labels is directly on top of the point.
  • The tm_layout() code is where you can change the default settings of the figure, including font size, placement of the legend, and margins. The title name and position are also set in the tm_layout().
  • The tmap_save() allows you to write the map to disk and to specify the height and width of the map. I prefer to write maps to disk to see what the size looks like before changing any of the layout and font size settings, because the figure on your disk will look different (and often better) than the plot view in your R Studio session.
# Creating list of simp_veg types and their fill colors for easier legend
for_legend <- unique(data.frame(simp_veg = sara_veg$simp_veg, fill_col = sara_veg$fill_col)) 

sara_map <- 
  # Vegetation map
  tm_shape(sara_veg, projection = 26918, bbox = sara_bound) +
  tm_fill('fill_col') +
  tm_add_legend(type = 'fill', labels = for_legend$simp_veg, 
                col = for_legend$fill_col, z = 3) +

  # Park boundary
  tm_shape(sara_bound) +
  tm_borders('black', lwd = 2) +
  tm_add_legend(type = 'line', labels = 'Park Boundary', col = 'black',
                z = 2)+
    
  # GRTS points  
  tm_shape(grts_forest) +
  tm_symbols(shapes = 21, col = '#EAFF16', border.lwd = 0.5, size = 0.3) + 
  tm_text('priority', size = 0.9, xmod = -0.4, ymod = -0.4) +
  tm_add_legend(type = 'symbol', labels = 'GRTS points', shape = 21, 
                col = '#EAFF16', border.lwd = 1, size = 0.5, 
                z = 1) +
  
  # Other map features
  tm_compass(size = 2, type = 'arrow', 
             text.size = 1, position = c('left', 'bottom')) +
  tm_scale_bar(text.size = 1.25, position = c('center', 'bottom')) + 
  tm_layout(inner.margins = c(0.2, 0.02, 0.02, 0.02), # make room for legend
            outer.margins = 0,
            legend.text.size = 1.25,
            legend.just = 'right',
            legend.position = c('right', 'bottom'),
            title = 'Saratoga NHP GRTS points',
            title.position = c('center', 'top'))

tmap_save(sara_map, 'SARA_GRTS.png', height = 10.5, width = 8, 
          units = 'in', dpi = 600, outer.margins = 0)

Challenge: Change the map

  1. How would you change the vegmap layers to have an outline, instead of just fill?
  2. How would you make the park boundary red instead of black?
Answer
sara_map2 <- 
  # Vegetation map
  tm_shape(sara_veg, projection = 26918, bbox = sara_bound) +
  tm_polygons('fill_col') + # A1: changed tm_fill() to tm_polygons()
  tm_add_legend(type = 'fill', labels = for_legend$simp_veg, 
                col = for_legend$fill_col, z = 3) +

  # Park boundary
  tm_shape(sara_bound) +
  tm_borders('red', lwd = 2) + # A2A: changed 'black' to 'red'
  tm_add_legend(type = 'line', labels = 'Park Boundary', col = 'red', # A2B
                z = 2)+
    
  # GRTS points  
  tm_shape(grts_forest) +
  tm_symbols(shapes = 21, col = '#EAFF16', border.lwd = 0.5, size = 0.3) + 
  tm_text('priority', size = 0.9, xmod = -0.5, ymod = -0.5) +
  tm_add_legend(type = 'symbol', labels = 'GRTS points', shape = 21, 
                col = '#EAFF16', border.lwd = 1, size = 0.5, 
                z = 1) +
  
  # Other map features
  tm_compass(size = 2, type = 'arrow', 
             text.size = 1, position = c('left', 'bottom')) +
  tm_scale_bar(text.size = 1.25, position = c('center', 'bottom')) + 
  tm_layout(inner.margins = c(0.2, 0.02, 0.02, 0.02), 
            outer.margins = 0,
            legend.text.size = 1.25,
            legend.just = 'right',
            legend.position = c('right', 'bottom'),
            title = 'Saratoga NHP GRTS points',
            title.position = c('center', 'top'))

Map dimensions

The dimensions and scaling in the plot viewer of your R session tend to be a bit funky. Instead of trying to make the map perfect in the plot viewer, I usually save it to disk and tweak that version to look the way I want it to. To demonstrate, here’s what sara_map looks like in your plot viewer:

sara_map



Here’s what the map looks like after it’s written to disk and the dimensions are set using tmap_save():


Interactive mode

The last cool thing to show with tmap is that you can include interactive HTML widgets similar to what leaflet can do (coming next). With the interactive mode, you can change baselayers, turn your layers off/on, and zoom to different extents. The legend in the interactive mode is limited to only showing fills, but it’s still pretty cool. You can also add custom basemaps (e.g. parktiles) to the interactive mode either by adding tm_basemap(url), or by changing the default basemap via tmap_options(), which I show below.

# Load park tiles
NPSbasic = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSimagery = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSslate = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSlight = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'

# Set parktiles as default basemaps for all interactive maps
tmap_options(basemaps = c(Map = NPSbasic,
                          Imagery = NPSimagery,
                          Light = NPSlight,
                          Slate = NPSslate))

# Switch to interactive mode, and plot the sara_map
tmap_mode('view') # turn interactive mode on
sara_map
tmap_mode('plot') # return to default plot view


Using leaflet and parktiles

Now I’ll show you how to create a simple leaflet map including NPS parktiles as a basemap. By default, layers and basemaps that feed into leaflet need to be specified in latitude and longitude using WGS84 projection (EPSG:4326). This projection may introduce some inaccuracies if projecting UTM to WGS84, such as plots that are at the edge of park boundaries appearing outside of the boundary. It has to do with the geographic transformation used to go from UTM to WGS. Recent improvements, including the Proj4Leaflet plugin, allow you to change coordinate systems, which may improve this. For more on that see the Projections section of R Studio’s leaflet page. For the purposes here, we’ll use the default WGS84, and reproject our data to match the projection.

We’re going to use some of the SARA layers we used with tmap, but we have to reproject the layers to WGS84. First we’ll define the four different park tiles that are available. These are the same tiles you see if you view the map on a given park’s website.

# Load library
library(leaflet)

# Load park tiles
NPSbasic = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSimagery = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSslate = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSlight = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'

# Reproject SARA layers
sara_bnd_wgs <- st_transform(sara_bound, crs = 4326)
grts_forest_wgs <- st_transform(grts_forest, crs = 4326)

# Calculate centroid for leaflet map view
sara_cent <- st_centroid(sara_bnd_wgs)$geometry
str(sara_cent) # -73.63459, 42.99543


Now we add the pieces together. The leaflet package operates similarly to tmap and ggplot2, in that you build the map by adding consecutive layers, connecting the code with pipes.

A couple of notes on the code below:
  • The setView() function allows you to set the center (in WGS84) and the zoom of the map. A zoom of 0 is the entire world; the higher the zoom level, the closer the view. You can also set the min/max zoom a user can view in a map, along with a maximum bounding box that a user can pan or zoom within.
  • addTiles() is how you add individual tiles. The group = “text” names the tiles, which are then shown in the map’s layer control via addLayersControl().
  • addCircleMarkers() is how the GRTS points are added to the map. I also had to specify the lat/lng coordinates for each point, and I used sf’s st_coordinates() to extract them.
  • The radius of the point is how you specify the size of the markers.
  • The stroke refers to the outline. I turned it off in this case.
  • I also added a scalebar and set the width and units of it using addScaleBar(), and scaleBarOptions().
  • There are all kinds of bells and whistles you can add to a leaflet plot, including popups on hover or click, custom colors and symbols based on the data, etc. (a short popup sketch follows the map code below). For more examples, check out Leaflet for R.
sara_map_lf <-
  leaflet() %>% 
  setView(lng = -73.63, lat = 42.99, zoom = 13) %>% 
  # parktiles
  addTiles(group = 'Map',
           urlTemplate = NPSbasic) %>%
  addTiles(group = 'Imagery',
           urlTemplate = NPSimagery) %>%
  addTiles(group = 'Light',
           urlTemplate = NPSlight) %>%
  addTiles(group = 'Slate',
           urlTemplate = NPSslate) %>% 
  addLayersControl(map = .,
    baseGroups = c('Map', 'Imagery', 'Light', 'Slate'),
    options = layersControlOptions(collapsed = T)) %>% 
  # points
  addCircleMarkers(data = grts_forest_wgs,
                   lng = st_coordinates(grts_forest_wgs)[,1],
                   lat = st_coordinates(grts_forest_wgs)[,2],
                   label = grts_forest_wgs$priority,
                   labelOptions = labelOptions(noHide=T, textOnly = TRUE, 
                      direction = 'bottom', textsize = '12px'),
                   fillColor = '#EAFF16',
                   radius = 4,
                   stroke = FALSE, # turn off outline
                   fillOpacity = 1) %>% 
  # scale bar and settings
  addScaleBar(position = 'bottomright',
              options = scaleBarOptions(maxWidth = 10, metric = TRUE))
sara_map_lf
Note that I can’t figure out why I’m not able to make this map render on the website, but I’ll demonstrate it in R Studio.
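
As one example of those extra features, here is a short sketch (using the layers created above) that adds the park boundary with addPolygons() and attaches a click popup to each GRTS point via the popup argument:

leaflet() %>% 
  setView(lng = -73.63, lat = 42.99, zoom = 13) %>% 
  addTiles(group = 'Map', urlTemplate = NPSbasic) %>% 
  addPolygons(data = sara_bnd_wgs, color = 'black', weight = 2, fill = FALSE) %>% 
  addCircleMarkers(data = grts_forest_wgs,
                   lng = st_coordinates(grts_forest_wgs)[,1],
                   lat = st_coordinates(grts_forest_wgs)[,2],
                   popup = paste("GRTS point", grts_forest_wgs$priority), # popup appears on click
                   fillColor = '#EAFF16', radius = 4, stroke = FALSE, fillOpacity = 1)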


Resources

Using ggplot2


Visualizing Spatial Data
  • Geocomputation in R is an excellent online book on the sf and tmap packages.
  • R Studio’s leaflet for R page shows the basics of how to use leaflet.
  • R-spatial mapping in ggplot The r-spatial.org site is a blog with lots of interesting topics covered. The blog I linked to helped me figure out how to map shapefiles in ggplot.



Day 4: Best Practices and Iteration

Good Coding Practices

A note about best practices

As you continue to work in R, your code projects will quickly become bigger and more complex than the examples we’ve seen this week. The larger the project, the harder it is to keep your thoughts organized when you are working on it - especially if you have to pick it back up after some time away. Fortunately, there are some things you can do to keep your code tidy and understandable.

Organize your files

As we learned earlier in this class, you should always use RStudio projects to organize your code projects. If you have a lot of files, it’s often helpful to include sub-folders within your project folder for things like raw data and final outputs. Avoid using setwd() in your code, as this will likely cause your code to break when it’s run from a different computer or user account. Instead, make sure any file paths in your code are relative to the project folder. That means that if you have a folder called data inside your project folder, you would just refer to it using the path ./data (the ‘.’ is shorthand for your current working directory). A file inside the data folder called my_raw_data.csv would have a relative path of ./data/my_raw_data.csv.

example of project folder
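
For example, reading the hypothetical file above from anywhere inside the project would look like this:

# Relative path starts at the project folder, so this works for anyone who opens the project
my_raw_data <- read.csv("./data/my_raw_data.csv")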

Comment your code

Comments in your code are a gift to your future self and others. You don’t need to explain every single line of code, but if it’s not obvious what your code is doing, it’s a good idea to leave a note. If it took you a while to figure out how to write a chunk of code, or if it was hard to get it to work, that’s a good sign that you should take the time to write some comments.

For longer scripts, you can use comments to break your code up into sections (e.g. Setup, Analysis, Plots). It’s often helpful to make an outline in comments before you even start coding.
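
As a small example of that outline idea: in RStudio, a comment that ends in four or more dashes is treated as a code section, which shows up in the editor’s outline and can be collapsed. The section names and file name below are just placeholders.

# Setup ------------------------------------------------------------------
library(dplyr)

# Load data ---------------------------------------------------------------
my_data <- read.csv("./data/my_raw_data.csv")  # hypothetical file

# Analysis ----------------------------------------------------------------
# (summaries, models, etc. go here)

# Plots -------------------------------------------------------------------
# (figures go here)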

Style your code consistently

When writing for an audience, it’s helpful to follow a style guide so that inconsistencies and poor grammar don’t distract readers from what you’re trying to say. The National Park Service even has its own style guide. The same goes for code. If the “grammar” of your code is tidy and organized, it’ll be easier for you and others to focus on what the code actually does.

There are multiple R style guides out there, but the Tidyverse style guide is a good place to start. Using the styler package, you can even style your code automatically!

If you find yourself getting bogged down in the finer details of coding style, take a step back and just focus on being generally tidy and consistent. Especially with tools like styler, conforming to a style guide shouldn’t have to feel like a burden.
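
For example, here is a minimal sketch of using styler (the code snippet and file name are made up): style_text() restyles a chunk of code and prints the result, and style_file() restyles an entire script in place.

library(styler)

# Restyle a messy snippet
styler::style_text("my_var<-c( 1,2 ,3)")

# Restyle a whole script in place (uncomment and point at your own file)
# styler::style_file("./scripts/my_analysis.R")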

Be efficient*
# Retrieve the latest water quality data from the full dataset
pH_2020 <- dplyr::filter(pH_Data_All, WaterYear == 2020)
DissolvedOxygen_2020 <- dplyr::filter(DO_Data_All, WaterYear == 2020)
SpecificConductance_2020 <- dplyr::filter(SpCond_Data_All, WaterYear == 2020)

The code above should work fine, assuming that the datasets pH_Data_All, DO_Data_All, and SpCond_Data_All are defined somewhere. But let’s imagine what this code would look like if we translated it into an email. It would go something like this:

Dear Dr. Hydrologist,

I would like a copy of your pH data from 2020. Please send me the pH dataset from 2020. I need your dissolved oxygen data from 2020. Please send me the dissolved oxygen dataset from 2020. I need your specific conductance data from 2020. Please send me the specific conductance dataset from 2020.

An aspiring R programmer

Well, we certainly got the information across. But our collaborator might look at us funny the next time we see them. We only wanted data from one year, but we wrote it six times! Let’s make some minor edits.

Dear Dr. Hydrologist,

I need a copy of your most current pH, dissolved oxygen, and specific conductance data. Please send me the 2020 datasets for pH, dissolved oxygen, and specific conductance.

An aspiring R programmer

That’s a big improvement. We could keep editing but let’s call that good enough for now. Translating back into R code, we get something like this:

current_water_year <- 2020

pH_Current <- dplyr::filter(pH_Data_All, WaterYear == current_water_year)
DissolvedOxygen_Current <- dplyr::filter(DO_Data_All, WaterYear == current_water_year)
SpecificConductance_Current <- dplyr::filter(SpCond_Data_All, WaterYear == current_water_year)

Notice how the water year is only specified in one place at the top of the script, instead of being hardcoded in multiple places throughout the code. Not only does this reduce the chance of typos, it makes it much easier to update the water year next year. This doesn’t seem like a big deal in an example that’s only three lines of code, but imagine having to search through 100 or 1000 lines of code for every occurrence of the current water year or field season. Hardcoding values (e.g. field season, park code) throughout your code, especially when they appear more than once, will cause headaches down the line. Instead, assign values to variables at the top of your code so that you can keep track of them in a single place.

This is an example of the DRY (Don’t Repeat Yourself) Principle. If you find yourself hardcoding duplicate information (like the water year above) or repeatedly copy-pasting the same piece of code, it’s probably worth editing your code to make it less redundant.

*The caveat: Don’t optimize your code too early. As with most types of writing, it’s usually best to produce a working draft and then edit it until you’re satisfied with the result. Trying to get it perfect on the first try usually just leads to analysis paralysis! But the more you practice writing DRY code, the more naturally it will come.

There are a number of ways you can structure your code to make it less redundant and save yourself some work. We’ll talk more about this in Control Structures.

Vectorization

On the first day of this class, we introduced vectors. Here are some examples:

# Numeric vectors
numbers <- 1:10
other_numbers <- 6:15

# Character vectors
code_lang <- c("R", "C++", "Java", "Python", "C")
human_lang <- c("English", "Spanish", "Chinese", "Russian")

What if we want to multiply numbers by two? The brute-force method looks something like this:

new_numbers <- c()  # Create a new NULL vector

new_numbers[1] <- 2 * numbers[1]
new_numbers[2] <- 2 * numbers[2]
new_numbers[3] <- 2 * numbers[3]
new_numbers[4] <- 2 * numbers[4]
new_numbers[5] <- 2 * numbers[5]
new_numbers[6] <- 2 * numbers[6]
new_numbers[7] <- 2 * numbers[7]
new_numbers[8] <- 2 * numbers[8]
new_numbers[9] <- 2 * numbers[9]
new_numbers[10] <- 2 * numbers[10]

new_numbers
##  [1]  2  4  6  8 10 12 14 16 18 20

Obviously we’re making this harder than it needs to be. Many functions and operations in R are vectorized, meaning that they will be applied in parallel to each element of a vector. In fact, you’ve already seen examples of vectorization earlier in this class. Because data frame columns are just vectors, much of the data wrangling you’ve done so far takes advantage of vectorization. Here, we’ll talk about vectorization in a little more detail, but a lot of this may look familiar!

As you probably know, all we need to do to multiply each element in numbers by two is:

new_numbers <- numbers * 2
new_numbers
##  [1]  2  4  6  8 10 12 14 16 18 20

Much less tedious. In general, if you feel like you’re doing something by brute force in R, that’s a good indicator that there’s a better way to do it.

As you might expect, division, addition, and subtraction are also vectorized. In fact, this even works with two or more vectors.

numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
other_numbers
##  [1]  6  7  8  9 10 11 12 13 14 15
other_numbers - numbers
##  [1] 5 5 5 5 5 5 5 5 5 5
other_numbers * numbers
##  [1]   6  14  24  36  50  66  84 104 126 150

So far we’ve only looked at operations between vectors and single values or two vectors of the same length. But what happens if we try to add two vectors of different lengths? Let’s try it out.

ones <- rep(1, 6)  # Vector of 1's, length 6
ones
## [1] 1 1 1 1 1 1
add_this <- c(1, 2)
this_one_breaks <- c(1, 2, 3, 4)

ones + add_this  # This works fine
## [1] 2 3 2 3 2 3
ones + this_one_breaks  # This generates a warning
## Warning in ones + this_one_breaks: longer object length is not a multiple of
## shorter object length
## [1] 2 3 4 5 2 3

What happened here? When we added the vector [1, 2] to the vector [1, 1, 1, 1, 1, 1], we got [2, 3, 2, 3, 2, 3]. When we added [1, 2, 3, 4] to [1, 1, 1, 1, 1, 1], we got a result, but we also got a warning along with it. The text of the warning actually gives a clue as to what R is doing. It says “longer object length is not a multiple of shorter object length” because when R tries to operate on vectors of different lengths, it recycles the shorter vector until it matches the length of the longer vector. ones is a vector of length 6. Since add_this has length 2 and 6 is a multiple of 2, R adds [1, 1, 1, 1, 1, 1] + [1, 2, 1, 2, 1, 2], repeating add_this exactly three times. However, this_one_breaks is a vector of length 4, and 6 is not a multiple of 4. Therefore, we get a warning because R cannot recycle this_one_breaks into a vector of length 6 without truncating it. Operating on vectors of different lengths is often useful. However, it’s easy to make mistakes if you forget that vectors in R work this way! Keep this in mind when you are debugging.

As you may remember from working with data frames, lots of operations in R are vectorized, not just basic arithmetic. This includes comparison operators (<, >, ==, <=, >=, !=), logical operators (&, |, !), and many functions (is.na(), gsub(), grep(), and too many others to list).

Here’s another example. It’s fairly common to end up with data that was entered as text and has extra whitespace:

messy_data <- tibble::tibble(plot_id = c("1 ", " 2 ", " ", " 4", "5."),
                             year_monitored = c(2019, 2019, 2019, 2020, 2020))
messy_data
## # A tibble: 5 x 2
##   plot_id year_monitored
##   <chr>            <dbl>
## 1 "1 "              2019
## 2 " 2 "             2019
## 3 " "               2019
## 4 " 4"              2020
## 5 "5."              2020

The dplyr way to fix this might look something like this:

clean_data <- messy_data %>% dplyr::mutate(plot_id = trimws(plot_id),  # Get rid of leading and trailing spaces
                                           plot_id = gsub(pattern = ".", replacement = "", x = plot_id, fixed = TRUE)) %>%  # Replace the stray "." with nothing
  dplyr::filter(plot_id != "") %>%
  dplyr::mutate(plot_id = as.integer(plot_id))
clean_data
## # A tibble: 4 x 2
##   plot_id year_monitored
##     <int>          <dbl>
## 1       1           2019
## 2       2           2019
## 3       4           2020
## 4       5           2020

trimws, gsub, and != are all vectorized operations being applied to the column plot_id. Let’s see what’s happening to just the plot_id column step by step under the hood.

plot_id_vec <- messy_data$plot_id
plot_id_vec
## [1] "1 "  " 2 " " "   " 4"  "5."
plot_id_vec <- trimws(plot_id_vec)
plot_id_vec
## [1] "1"  "2"  ""   "4"  "5."
plot_id_vec <- gsub(pattern = ".", replacement = "", x = plot_id_vec, fixed = TRUE)
plot_id_vec
## [1] "1" "2" ""  "4" "5"
id_is_valid <- plot_id_vec != ""
id_is_valid
## [1]  TRUE  TRUE FALSE  TRUE  TRUE
plot_id_vec <- plot_id_vec[id_is_valid]
plot_id_vec
## [1] "1" "2" "4" "5"
plot_id_vec <- as.integer(plot_id_vec)

Vectorization in R is a powerful tool. However, not everything is vectorized. One common example is read.csv(). read.csv() takes the path to a .csv file and returns a data frame of the data in the csv. But what if the data is stored across multiple .csv files? read.csv(c(file_path_1, file_path_2, file_path_3)) doesn’t work because read.csv isn’t vectorized. Fortunately, vectorization isn’t the only way to avoid repeating yourself in R.
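
As a preview of where we’re headed, here is one minimal sketch (with made-up file paths) of working around a non-vectorized function: loop over the paths, read each file, and stack the results into one data frame. We’ll cover this pattern properly in the next section.

# Made-up file paths -- replace with your own
file_paths <- c("./data/plots_2019.csv", "./data/plots_2020.csv")

all_data <- data.frame()
for (f in file_paths) {
  all_data <- rbind(all_data, read.csv(f))  # read each file and append it
}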

Control Structures

What are control structures? The simplest way to think about this is that these are programming elements that control the flow of data in your code depending on the value of logical conditions. There are six basic control structures and you can find them with help("Control").

There are two control structures we are going to focus on: for loops and if/else statements. for loops are a control structure that allow you to repeat or iterate a section of code multiple times (without copy-pasting it over and over!). if/else statements control whether or not a section of code is run based on certain conditions. If we have time we are going to touch on while at the end, mostly for fun.

if/else

if/else Statement Overview

Most of you are probably familiar with this. It has 3 pieces (in order):

  1. A conditional statement (“if”) to evaluate as TRUE / FALSE, e.g. (x > 1)
  2. An action to perform if it is TRUE
  3. An action to perform if it is FALSE

On any individual data point or set of data, only two of these code pieces will be applied. The conditional statement (1) will always be evaluated. Then only the appropriate TRUE or FALSE action (2 or 3) will be applied - the other will be skipped. There are two different ways to invoke this in R and they do different things:

This first method if(){}else{} is a true control structure. This is useful for allowing your code to do different things for different data situations. However, it only evaluates a single TRUE/FALSE value.

if(...logical statement...){...TRUE...}else{...FALSE...} (hereafter if/else)

You will also see ifelse show up often.

ifelse() is the ‘vectorized’ version - this is most useful when you need to change the values in a vector (i.e., recoding). However, you can make it work on single logical values; this is not recommended.

ifelse(test= , yes= , no= ) (hereafter ifelse)

if/else
if/else Examples

The first thing to know about the if/else construction in R is that it is very sensitive to bracket placement: else must always sit between the closing and opening curly brackets, like this: }else{. If it isn’t, you will end up with errors that are hard to track down.

R will not comprehend this:

if(...){
  "thing to do if TRUE"}
else{
  "thing to do if FALSE"
}

R will only comprehend this (for now):

if (...) {
  "thing to do if TRUE"
} else {
  "thing to do if FALSE"
}

This statement expects a single boolean as the result of the if() statement. Sometimes I like to use this to create switches or bypasses when I am writing loops/iteration. For example: if I have a data frame that is non-empty, then do the analysis; if the data frame is empty, do something else.

d <- c(1:4, 1:4 * 100, NA)

if (all(is.finite(d))) {
  "no NA, NaN, or Inf present"
} else {
  "NA present"
}
## [1] "NA present"
if/else Nesting

This also nests nicely and we will get back to that later, but here is what that formally looks like:

d <- c(1:4, 1:4 * 100)

if (all(is.na(d))) {
  "NA present"
} else if (all(d < 100)) {
  "all items less than 100"
} else if (any(d > 100)) {
  paste(sum(d > 100), "items more than 100")
} else {
  "I'm stumped"
}
## [1] "3 items more than 100"

This also works:

d <- c(1:4, 1:4 * 100)

if (all(is.na(d))) {
  "NA present"
}
if (all(d < 100)) {
  "all items less than 100"
}
if (any(d > 100)) {
  paste(sum(d > 100), "items more than 100")
} else {
  "I'm stumped"
}
## [1] "3 items more than 100"

Sarah - “You do not need to say else each time, you can leave that off and it is fine. In fact you see that a lot in the underlying code, just execute dplyr::bind_rows in the console and see for yourself.”

Thomas - {hyperventilating} “I know, but I can’t handle that - it makes it hard to read for me.”

ifelse()
ifelse() Examples

Let’s see how ifelse works to recode some data. Let me build this up starting from the data:

d <- c(1:4, 100*1:4)
d
## [1]   1   2   3   4 100 200 300 400

A simple vector of numbers.

Now let’s evaluate that vector with a logical expression:

q_logic<- d < (mean(d) / 10)
q_logic
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

This will evaluate the expression and will return TRUE for all values of d that are less than the mean of d divided by 10. Do you remember PEMDAS? Well in R it is PEMDASL. Logicals are evaluated after the math expression.

Under the hood, this vector of logicals is what ifelse is actually evaluating. In fact, we can just drop that into the ifelse directly if we want.

ifelse(q_logic, yes="verify", no="OK") #yes is the true condition no is the false condition
## [1] "verify" "verify" "verify" "verify" "OK"     "OK"     "OK"     "OK"

It goes through and replaces every TRUE value in that vector of logicals with “verify” and all of the FALSE values with “OK”.

In reality, we seldom develop this kind of statement step by step like that. We would normally just write it on a single line.

q<-ifelse(d < (mean(d) / 10), "verify", "OK")

ifelse() Missing Data

Now, let’s see what happens if we have some missing data.

d <- c(1:4, 100*1:4, NA)
ifelse(d < mean(d, na.rm = T) / 10, "verify", "OK")
## [1] "verify" "verify" "verify" "verify" "OK"     "OK"     "OK"     "OK"    
## [9] NA
That NA gets passed through. Maybe we want that, maybe we don’t. In mean() and many other functions we were able to set na.rm=TRUE and omit the NA value, but reading through help(ifelse), we don’t really have that option. One option is to nest two ifelse conditions.

ifelse() Nesting

‘Nesting’ refers to the situation where you use a function, and then you use it again within itself. This frequently shows up in iteration tasks.

d <- c(1:4, 100*1:4, NA)
ifelse(is.na(d) == T, "Missing",
  ifelse(d < mean(d, na.rm = T) / 10, "verify", "OK")
)
## [1] "verify"  "verify"  "verify"  "verify"  "OK"      "OK"      "OK"     
## [8] "OK"      "Missing"
is.na() returns a logical vector with FALSE for all non-missing elements and TRUE for all NA and NaN values (but not Inf). In this nested structure, the FALSE cases are further evaluated and labeled by the second statement. Nesting is useful, but chances are if you find yourself nesting more than 2 or 3 conditions deep, there is probably a better way to do things.

case_when

That better way is dplyr::case_when(). If you look at the code we used above to demonstrate if/else constructions, we flagrantly violated ‘DRY’ principles. You might also remark that it could get a little confusing if we added any more elses or cases to those statements. You would be correct on all counts. When your only application of an if/else construct is to recode/reclassify data and you have more than 1 set of criteria for recoding, you should really be using case_when(). It behaves like an if/else statement, but under the hood it is a pair of clever for loops, which we will talk about shortly.

d <- c(1:4, 100*1:4, NA)
dplyr::case_when(
  is.na(d) ~ "Missing",
  d < mean(d, na.rm = T) / 10 ~ "Verify",
  d > mean(d, na.rm = T) / 10 ~ "OK"
)
## [1] "Verify"  "Verify"  "Verify"  "Verify"  "OK"      "OK"      "OK"     
## [8] "OK"      "Missing"
There are a few things to note here. First, this is a clear statement of what is happening and there are few {} or () to track. Also, there is no ‘else’ option. Because there is no else option, we have to specify all of the cases of interest; if we do not fully specify the cases, the resultant vector will have NA for those cases that are not specified. It will evaluate each logical statement in order. Finally, we don’t show it here, but it passes NA values into the output if they are present. These are not bad things; sometimes these are very useful behaviors. You just need to know what they are doing so you can decide when to take advantage of them. We will come back to this function in a for loop below if we have time.
Wrap up

Wrapping up, I most commonly use the if/else constructions to create a switch in the code for special handling of undesirable data. However, I usually only go to this if there isn’t a more direct solution. Also, getting the {} placement right can be infuriating. In my mind, ifelse gives a false sense of security. What all is in that ‘else’?

Iteration

Iteration Overview

This brings us to iteration. Iteration is a control structure that tells code to repeat itself until some condition is satisfied. It is really good at doing the same thing over and over again on datasets that are pretty similar to one another. In other words, you can get a lot of work done fast. For today, we will focus on the for loop as our main approach to iteration.

Iteration consists of a few pieces:

  1. The data you need to do something to
  2. The method of iteration (for)
  3. The task you need to do
  4. The result of your task
What are some kinds of repetitive tasks you find yourself doing that you are wanting to do more easily in R?

for() Loops

The first iterator you may be familiar with is the for loop. Before we dive in, I want to pause and note that you will hear us tell you that there are better ways to iterate than a for loop. However, for loops are a fundamental skill for several reasons.
  • They reinforce a fundamental understanding of iterative processing.
  • Everyone (probably) falls back on a for loop (eventually).
  • They are the precursor to writing your own functions.

How does it work? You give it a vector of numbers or characters, tell it what you want it to do, and it will do that thing for each item in the vector. In simplest terms that looks like:

for(i in vec){ 
  do thing 1
  do thing 2
  ...
  append result to results of previous iterations
}

A few things to pick apart here:

i - i is an arbitrary variable; you could call it q or anything else. It takes on the value of the first item in the vector on the first step of the iteration, is available inside the loop as a variable, and takes on each successive value in vec with each step of the iteration.

vec - vec is the vector of items, aka the conditions for iteration. These could be numeric, or they could be characters or strings. Numeric vectors are common if you need to run through all of the cases. Character vectors are great if you have groups of data you want to work through (we’ll show a character vector example a bit further down). All vec does is give the loop some instructions on what it should work on and, effectively, when to stop.

Let’s make sure we understand those two things

vec<-1:9 #vector over which we will iterate
for (i in vec) { # for the number 1 through 9
         i
}   

Nothing appears to have happened in the console. Why? Because I did not actually tell it to return something from that loop.

Also, that isn’t entirely true… something happened:

i 
## [1] 9

i was created in the environment and it has a value of 9 – the last value of vec. Logically, that means the for loop must have made it to the end of vec.

An important thing to learn here is that for will not return information after each step of its own accord. You have to tell it to do that.

Getting information out of a for loop

Getting information out of a for loop is harder than it sounds and doing it right can make all of the difference in your code execution. First let’s just get that information into the console so we can see it.

for (i in vec) { # for the number 1 through 9
         print(i+1)
}   
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

We can see it clearly walking though that loop and adding 1 to each value. But, what we really want is a vector of values out of it.

#alternatively predefine a container to receive the data
OutDat <- NULL # define an empty variable. 
for (i in 1:9) {
  OutDat[i] <- i + 1 # store that in the ith position of the variable OutDat
}
OutDat
## [1]  2  3  4  5  6  7  8  9 10
This has now put the result of i + 1 into the ith position of the vector OutDat.
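
As mentioned earlier, you can also loop over a character vector. Here is a small sketch (with a made-up set of park codes) that also preallocates the output vector to its full length, which is a bit more efficient than growing it one element at a time:

parks <- c("ACAD", "SARA", "GRSM")                  # made-up set of park codes
park_labels <- vector("character", length(parks))  # preallocate the output vector

for (i in seq_along(parks)) {
  park_labels[i] <- paste(parks[i], "monitoring data")  # do something with each park code
}
park_labels
## [1] "ACAD monitoring data" "SARA monitoring data" "GRSM monitoring data"
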
for loops for real IMD data tasks

Preparing the data and the loop

So let’s make a useful for loop. We are going to use raw hobo air temperature files. Just a little bit of background here: there are ~5 columns in most of these files.
  • Col. 1 is a sequential index column
  • Col. 2 is a date and time string
  • Col. 3 is the air temperature in degrees F
  • Col. 4 may be light measured as lux or lumens per ft2 (if it was measured)
  • Col. 5 and higher are comment columns
We will want to do 4 things with those data. These should sound familiar:
  • Look at the data
  • We want to calculate some summary statistics.
  • We want to do a rudimentary QAQC check and flag potentially bad data
  • We want to save it to a file so we can do some further visual assessment.
library(magrittr)
library(ggplot2)

#create a list of file names that we will read in
fPaths <- paste0(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/",
  c(
    "APIS01_20548905_2021_temp.csv",
    "APIS02_20549198_2021_temp.csv",
    "APIS03_20557246_2021_temp.csv",
    "APIS04_20597702_2021_temp.csv",
    "APIS05_20597703_2021_temp.csv"
  )
)
fPaths
## [1] "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS01_20548905_2021_temp.csv"
## [2] "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS02_20549198_2021_temp.csv"
## [3] "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS03_20557246_2021_temp.csv"
## [4] "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS04_20597702_2021_temp.csv"
## [5] "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS05_20597703_2021_temp.csv"

Note, paste0() is a version of paste() that uses no separator (sep = ""), so nothing is inserted between the pieces it joins.
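For example, with a couple of made-up strings:

paste("APIS01", "2021")            # "APIS01 2021" - paste() defaults to a space separator
paste("APIS01", "2021", sep = "_") # "APIS01_2021"
paste0("APIS01", "2021")           # "APIS012021" - no separator at all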

Let’s get some idea of what we are dealing with in the data so we can plan the code. It is risky to read in data files blindly, so let’s preview one first.

#read in and inspect the first element (for training purposes)
temp<-read.csv(fPaths[1], skip = 1, header = T) #hobo data has an extra line at the top we need to skip
head(temp)
##   X. Date.Time..GMT.05.00
## 1  1    03/14/21 17:00:00
## 2  2    03/14/21 18:00:00
## 3  3    03/14/21 19:00:00
## 4  4    03/14/21 20:00:00
## 5  5    03/14/21 21:00:00
## 6  6    03/14/21 22:00:00
##   Temp..Â.F..LGR.S.N..20548905..SEN.S.N..20548905..LBL..APIS01.
## 1                                                        78.274
## 2                                                        36.531
## 3                                                        28.762
## 4                                                        26.668
## 5                                                        26.031
## 6                                                        25.390
##   Bad.Battery..LGR.S.N..20548905. Coupler.Attached..LGR.S.N..20548905.
## 1                                                                     
## 2                                                                     
## 3                                                                     
## 4                                                                     
## 5                                                                     
## 6                                                                     
##   Host.Connected..LGR.S.N..20548905. Stopped..LGR.S.N..20548905.
## 1                                                               
## 2                                                               
## 3                                                               
## 4                                                               
## 5                                                               
## 6                                                               
##   End.Of.File..LGR.S.N..20548905.
## 1                                
## 2                                
## 3                                
## 4                                
## 5                                
## 6

Gross, but we can fix that. Because I know what is in there and what the order is, let’s redo this with header set to F. (Note: I am changing to skip = 2, otherwise read.csv() will treat the header row as data.)

temp<-read.csv(fPaths[1], skip = 2, header = F) #Set header to F so read.csv doesn't treat data as a header.
names(temp)[1:3]<-c("idx", "DateTime", "T_F") #let's just assign some column names really quick
head(temp)
##   idx          DateTime    T_F V4 V5 V6 V7 V8
## 1   1 03/14/21 17:00:00 78.274               
## 2   2 03/14/21 18:00:00 36.531               
## 3   3 03/14/21 19:00:00 28.762               
## 4   4 03/14/21 20:00:00 26.668               
## 5   5 03/14/21 21:00:00 26.031               
## 6   6 03/14/21 22:00:00 25.390
str(temp) #look at data types
## 'data.frame':    4554 obs. of  8 variables:
##  $ idx     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DateTime: chr  "03/14/21 17:00:00" "03/14/21 18:00:00" "03/14/21 19:00:00" "03/14/21 20:00:00" ...
##  $ T_F     : num  78.3 36.5 28.8 26.7 26 ...
##  $ V4      : chr  "" "" "" "" ...
##  $ V5      : chr  "" "" "" "" ...
##  $ V6      : chr  "" "" "" "" ...
##  $ V7      : chr  "" "" "" "" ...
##  $ V8      : chr  "" "" "" "" ...
rm(temp) #do a little cleanup 

We are only interested in the first 3 columns today.

So, things we are aware of for coding so far:
  • Three columns of data is all we want right now
  • Two of those columns are numeric/integer
  • Lastly, because I know these data, one of the files is empty except for the headers.

A Loop to Calculate Summary Statistics

What do we want to do?
  • We are going to create a simple summary statistics dataframe.
  • The loop will need to handle the file with no data somehow.
  • We will need to initialize an empty variable so we can get data out of the for loop. We will use a list for reasons discussed later on.
len<-length(fPaths) #calculate the number of iterations

SummaryData <- list() #Generate an empty list that we will put output into. Always rerun prior to loop so that it is empty.

for (i in 1:len) {
  # 1 Read in and generate the data to use
  # 1a extract the filename from a filepath and parse the metadata
  fName <- basename(fPaths[i])
  Site <- stringr::str_sub(fName, 1, 6)
  Year <- as.numeric(stringr::str_sub(fName, -13, -10))
  # 1b read in the data, change headers,
  d <- read.csv(fPaths[i], skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F"))
  # 2 summarize and 3 output the data
  SummaryData[[i]] <- data.frame( #output the data
    # 2 what to summarize for all files 
    Site = Site,
    Year = Year,
    FileName = fName,
    # Variable output
    # 2a What to summarize so long as there is 1 non-NA record
    if (any(is.numeric(d[, 3]))) { # check that there is at least 1 non-NA record
      list(
        T_F_mean = mean(d$T_F, na.rm = T),
        T_F_min = min(d$T_F, na.rm = T),
        T_F_max = max(d$T_F, na.rm = T),
        nObs = nrow(d),
        Flag = NA
      ) 
      # 2b what to summarize if the data are missing.
    } else {
      list(
        T_F_mean = NA,
        T_F_min = NA,
        T_F_max = NA,
        nObs = nrow(d),
        nDays = NA,
        Flag = "Hobo File Present But No Data"
      )
    }
  )
}
dplyr::bind_rows(SummaryData)
##     Site Year                      FileName T_F_mean T_F_min T_F_max nObs
## 1 APIS01 2021 APIS01_20548905_2021_temp.csv 57.55168  14.608 100.283 4554
## 2 APIS02 2021 APIS02_20549198_2021_temp.csv 47.90618 -25.123  92.737 6287
## 3 APIS03 2021 APIS03_20557246_2021_temp.csv       NA      NA      NA    0
## 4 APIS04 2021 APIS04_20597702_2021_temp.csv 49.05190 -15.628  99.885 6286
## 5 APIS05 2021 APIS05_20597703_2021_temp.csv 47.68413 -22.549  91.612 6287
##                            Flag nDays
## 1                          <NA>    NA
## 2                          <NA>    NA
## 3 Hobo File Present But No Data    NA
## 4                          <NA>    NA
## 5                          <NA>    NA

Let’s take some time to pick this apart and understand what it is doing.

  1. The read and generate data step. This is a pretty standard part of a loop where you define the data you want to use. It might be some variables that you are going to reuse later, or actual data you are reading in. Here, we are parsing some metadata from the filename and reading in the csv.

  2. The getting stuff done step. This step is setting up where the data goes – “list element i” – and what data to put there. The nifty thing is that we are calling an if/else statement within the construction of the dataframe. There are dataframe elements that will be fixed and dataframe elements that will be variable.

  3. The output step. Here, the output step is integrated into the creation of the output dataframe.

A QA/QC Data Flagging Loop

The next thing we want to do is some data QA/QC. Here we want to apply some rules from our SOP for flagging data that requires follow-up verification. We don’t know if the data are good or bad, but we do know we don’t want to waste time trying to flag it in an Excel spreadsheet.

OutPath <- paste0(getwd(), "/hobo_ouputs/") # or put your preferred path here (we will use this later)
# set up some QA thresholds
TempUpper <- 40
TempLower <- (-40)

for (i in fPaths) {
  # 1 Read in/generate data
  fName <- basename(i)
  d <- read.csv(i, skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F"))

  # 2 Generate your output in this case, some QA/QC flagging
  d_com <- d %>%
    dplyr::mutate(
      # checks for too big of a temp change
      TChangeFlag = ifelse(c(abs(diff(T_F, lag = 1)) > 10, FALSE), "D10", NA),
      # flag if a temp is too high or too low
      Flag = dplyr::case_when(
        is.na(T_F) ~ "MISSING",
        T_F > TempUpper ~ "High",
        T_F < TempLower ~ "Low"
      )
    ) %>%
    # merge so you can have multiple flags applied
    tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "_", remove = TRUE, na.rm = TRUE)

  # 3 output the data
  dir.create(OutPath, showWarnings = F)
  write.csv(d_com[, c("DateTime", "T_F", "Flag")],
    paste0(OutPath, gsub(".csv", "_QC.csv", fName)),
    row.names = F, quote = F
  )
}

We used some new functions there, but let’s start at a higher level and work our way down to the lower level functions.

What changed with the for loop? A for loop can accept any vector as input. It could be a vector of names or it could be a vector of numbers. In the first loop we did, I wanted a vector of numbers so I had an automatic index to use. In this case, I just need the name; I don’t need that index.

  1. This is mostly the same as before.
  2. There is new stuff here, but it boils down to: “I want to create 2 variables and then merge them.” mutate() is creating the variables Flag and TChangeFlag. Flag uses a dplyr version of ifelse() called case_when() that lets you specify what to return when certain conditions are true. TChangeFlag is based on the first difference of the temperature data - the rate of change between successive readings - and if the rate of change is too high, we flag it. This could also have been done with case_when(), but we are talking about control structures today. (See the short sketch of diff() and case_when() after this list.)
  3. I want to point out the ‘on the fly’ file name generation. This takes some practice in paste(). This is just string manipulation. All this is doing is slightly modifying the old string, and attaching a new string to create the desired filepath.
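Here is a small sketch of those two new functions on a made-up temperature vector, so you can see what each returns on its own:

x <- c(10, 12, 25, 24, NA) # made-up temperatures
diff(x, lag = 1) # change between successive values: 2 13 -1 NA (one shorter than x)
dplyr::case_when(
  is.na(x) ~ "MISSING",
  x > 20 ~ "High"
) # returns NA NA "High" "High" "MISSING"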
A Plotting Loop

Let’s do our last common for loop. That is the ‘barf out a bunch of plots’ loop. I am visual and can quickly diagnose data by looking at it. I like to do this so I have some PDFs on hand.

OutPath <- paste0(getwd(), "/hobo_ouputs/") # or put your preferred path here

for (i in fPaths) {
  #1 Read in/generate data
  fName <- basename(i)
  d <- read.csv(i, skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F")) %>%
    dplyr::mutate(., DateTime2 = as.POSIXct(DateTime, "%m/%d/%y %H:%M:%S", tz = "UCT"))
  #2 Do the thing you want to do, in this case make and save plots
  p <- ggplot(d, aes(x = DateTime2, y = T_F)) +
    geom_point(size = 0.5) +
    ggtitle(fName)
  #3 Output the results outside of the for loop
  dir.create(OutPath, showWarnings = F)
  ggsave(
    filename = gsub(".csv", "_plot.pdf", fName), 
    path = OutPath, device = "pdf",
    plot = p, width = 5, height = 5, units = "in"
  )
}

Let’s break this down because practice makes perfect - we will get into the nitty gritty details in class, but this is the basic outline of what is happening.

  1. We are iterating based on the vector fPaths. There will be as many iterations as there are items in fPaths. In our case fPaths is a vector of file paths. This means that on each iteration, i will become a different file path.
  2. Read in the data and do a little data wrangling.
  3. Create a simple plot so we can look at the data.
  4. Use the ggsave() function to save a pdf. What I want to point out here is that we don’t need to build the full file path outside of ggsave. We specify the file name as discussed before and feed it a path separately. ggsave glues them together by calling some base R functions.

Recap and final thoughts on for loops and control structures What did we do?
  • Learned the basic structure of most for loops. Probably the most important thing.
  • Used some for loops to do some actual tasks on multiple datasets.
  • Read files into and wrote files out of a for loop.
  • Wrote data calculated in a for loop to a variable in the environment.
  • Used ifelse() logic to flag some data for checking.
  • Used if...else to create a switch for NA handling in a dataframe in a for loop.
  • Maybe learned a few new functions.
  • …?

At this stage you may be thinking: OK, but you have read the data in multiple times and repeated a lot of code. Couldn’t we just wrap this in one giant for loop?

  • We could have written the loop to read in the data, store it in memory, and then loop only on the stored data for flagging, statistics, and plotting. (We are actually going to do that next.) But, loading stuff in R takes up memory. Depending on the size of your datasets, you may not want unused information floating around hogging memory. Continuous data files can become large objects. So, it depends.

  • There are only rare instances where you want to wrap stuff up in a complicated for loop. The more complex the loop the harder it is to debug. Complex for loops are also hard to build flexibility into (e.g. handling NA values). In general you should shoot for one task or task family per loop.

Preallocation

Preallocating Output Objects

Before we move on let’s take time to discuss a more advanced R topic: preallocating the for loop output. This primarily refers to dataframe-like loop output variables and it makes a huge difference in how quickly your code executes. There are many ways to initialize this, I have shown two ways – there are more.

In the first method we defined OutDat <- NULL and in the body of the loop proceeded to assign each iteration’s output to a specific location in OutDat using indexing: OutDat[i] <- i + 1. We did something similar in the example above, where we created a list SummaryData <- list() and then added a new list element for each iteration’s output: SummaryData[[i]] <- data.frame(...). These are both acceptable ways to capture output.

But most of us think in dataframes most of the time, so we expect a dataframe out. This means we need an extra step to ensure we have dataframe output. This leads some people to say ‘hey, why don’t we just put it into a dataframe to begin with?’ The most natural thing in the world is to do:

#alternatively predefine a container to receive the data
nIter=10000
OutDat <- NULL # define an empty variable. 
system.time(for (i in 1:nIter) {
  d <- data.frame(x=i + 1,y=i)
  OutDat<-rbind(OutDat,d)# append the previous steps' outputs with this step's output
}) #4.47 seconds on my system

#similarly you could try append() - it works
nIter=10000
OutDat <- NULL # define an empty variable. 
system.time(for (i in 1:nIter) {
  d <- data.frame(x=i + 1,y=i)
  OutDat<-append(OutDat,d)# append the previous steps' outputs with this step's output
}) #4.93 seconds on my system

Instant dataframe and everyone is happy. Or are they? They aren’t. This is called growing an object. Patrick Burns also refers to this as the “Second Circle of R Hell” in his book The R Inferno (Dante’s Inferno… the R community has a wonderful sense of humor). Look it up, it is great (no really, it is). Doing this 10,000 times took about 5 seconds; 1,000,000 times took… well, I quit after 13 minutes. If this were a strictly linear process, 1,000,000 iterations should have taken around 8 minutes - clearly it did not.

Why? When you use rbind() or append() in this manner, R copies the dataframe each time and has to go look for/request another big block of memory instead of using the one that is already there. This slows your processing way down. So… don’t do that. Instead you should preallocate your dataframe, especially if it is going to be big - it even makes a difference on smaller dataframes.

What does preallocate mean? x <- data.frame() just tells R you need an object that is a dataframe. What you really want to do is tell R the exact dimensions of the object you will need. If you want dataframe output, just generate a blank dataframe the same dimensions as your output. This is sometimes simple. The number of rows is probably the number of iterations (the length of vec), and you are already defining the columns, so you could just start out with x <- data.frame(Mean = rep(NA, length(vec)), Min = rep(NA, length(vec))) and use the iteration index to tell R which row to drop stuff in.

Our example above, becomes:

nIter=10000 # number of iterations
OutDat <- data.frame(x=rep(NA,nIter),y=rep(NA,nIter)) # define an preallocated dataframe
system.time(for (i in 1:nIter) {
  OutDat[i,] <- data.frame(x=i + 1, y=i)
}) #2.42 seconds on my system

In this simple example, preallocating the dataframe roughly halves the loop execution time. A ~2 second gain seems small here. However, across larger or multiple datasets this becomes a significant time cost if you don’t preallocate.

Preallocation Wrapup

    When do we preallocate? Pretty much just when we use for loops to create data objects. It is important to note: apply family and purrr will take care of that preallocation for you.

    Should you always preallocate? If you are using a for loop to output a data object, probably.

    Do you really need to? Not until your code execution takes longer than you are willing to wait.

    Why not install more memory? That will appear to solve the problem. But datasets will just get bigger over time.

    So I shouldn’t use a for loop? No, you can when you need to, but there is an art to it. This course is all about equipping you with tools to help avoid some frustration.

apply Family

apply Family AKA better ways to iterate in R

Now that you have the basics down, we are going to forge ahead with better and better ways to iterate. One family of functions that often replaces for() is the apply family of functions. We will get you started here and then dive deeper during the advanced R course on “Functional Programming” next week. For now:

apply - operates on things resembling a dataframe. Okay, really it operates on a matrix, where all values are the same type (all numbers or all characters). It can operate by row or by column, and the output will be a vector (or a matrix/list if the function returns more than one value per row or column).

mapply - Is a way to vectorize a function that isn’t vectorized. I don’t think I have used this in the past year, but it is in the apply family and you might encounter it.

lapply - Apparently, this is a special case of mapply and it operates on lists of things and stores the output in a list of things. sapply and vapply are variations on this theme.

sapply - A ‘wrapper’ for lapply that will coerce the list output into a matrix or array.

tapply - apply by groups. This lets you apply functions by a group. So, quick summary statistics, but you will quickly outgrow this and there are better ways to do this.
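To make those descriptions a bit more concrete, here is a tiny sketch on made-up numbers (we will use lapply() on real data below):

temps <- list(site1 = c(54, 61, 58), site2 = c(40, 42)) # a made-up list
lapply(temps, mean) # returns a list of means, one element per site
sapply(temps, mean) # same values, simplified to a named numeric vector
tapply(c(54, 61, 58, 40, 42), c("a", "a", "a", "b", "b"), mean) # means by group: a = 57.67, b = 41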

mtcars

To explore using simple functions in apply, let’s look at mtcars. This is a wonderful little built-in dataset that I like to use for developing and demoing code (see help(mtcars) for a description). It has continuous and categorical variables that cover a good bit of what I end up needing to do. Another reason to know about mtcars is that a lot of answers on Stack Overflow use this dataset.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

apply()

First let’s calculate the mean across the rows. It doesn’t make a whole lot of sense to do this, because the columns contain values with different units, but we can if we set MARGIN = 1.

apply(mtcars,MARGIN=1, FUN=mean) #calculate mean across the row
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##            29.90727            29.98136            23.59818            38.73955 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##            53.66455            35.04909            59.72000            24.63455 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##            27.23364            31.86000            31.78727            46.43091 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##            46.50000            46.35000            66.23273            66.05855 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##            65.97227            19.44091            17.74227            18.81409 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##            24.88864            47.24091            46.00773            58.75273 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##            57.37955            18.92864            24.77909            24.88027 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##            60.97182            34.50818            63.15545            26.26273

Note: apply() will operate on all the data it is provided. If your data had a non-numeric element in the first column, for the function mean() to work you will have to exclude that column from the data provided to apply.
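For example, the built-in iris dataset has a non-numeric Species column (column 5), so we exclude it before handing the rest to apply():

apply(iris[, 1:4], MARGIN = 2, FUN = mean) # means of the four numeric columns only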

apply() is preserving/using the row names. That is nice in this case. The make/model of the vehicle is stored as row names, so it is not technically data being passed to mean() for analysis. We get out a vector with the (meaningless) average value for each make and model of vehicle.

It might make more sense to calculate the mean value for each of the attributes of a car. We can do that by setting MARGIN = 2.

apply(mtcars, MARGIN=2, FUN=mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500
That returns the mean of each variable. What is happening is that apply() is disassembling the data frame and passing it row by row (MARGIN = 1) or column by column (MARGIN = 2) into mean(), then collecting the output into a named vector.

lapply()
lapply() overview

Let’s change gears a little bit and return to those hobo data to take a look at lapply(). First, a quick review of what a list is. A list is an R object that can hold a bunch of different elements. A list is OK if one element is a character string and the next element is a number, or if one element has 100 columns and 10 rows while the next element has 10 columns and 100 rows. It can even hold a dataframe as one element and a matrix as another element without any trouble. A list says you’re OK just the way you are.
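A quick sketch of that flexibility:

my_list <- list(a = "some text", b = 1:100, c = head(mtcars)) # three very different things
str(my_list, max.level = 1) # one character string, one integer vector, one dataframe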

We are going to use our list to read in data that are mostly the same. If we wanted to store them in a single dataframe: 1) it would get unwieldy and 2) you would have to write code that ‘fixed’ inconsistencies between the dataframes. By storing them as independent list elements we don’t necessarily need to write code that ‘fixes’ the inconsistencies.

We discussed this earlier: in the previous for loop we operated on each of our hobo csvs one at a time. What we are going to do this time is read them all into a single list object with 5 different elements (1 for each csv file).

lapply() to read in data

You should still have fPaths hanging out somewhere. This is that vector of hobo file paths.

fPaths

What we will do next is iterate through that vector of file names and store the results in a list. We do that with lapply(). What does lapply() do? It takes in a list or vector of data and returns a list of the same length. lapply(X, FUN, ...) has 3 arguments:
  • X - the data, i.e. the list or vector to pass to FUN
  • FUN - what to do, i.e. the function to apply to each element
  • ... - the ellipsis; this is a bit of an odd one. This is where you put whatever other arguments you want passed to FUN. Looking at the examples below will help.

Get the data

HoboList<-lapply(fPaths,read.csv, skip=1, header=T)
HoboList

This is passing, sequentially, file paths into read.csv(). Just like our for loop more or less. It is also passing skip and header values to read.csv(). It returned a list of all the csvs with everything in them.

Simplify the Data

HoboList<-lapply(HoboList,"[",,1:3)
HoboList  

This then takes that list as input and passes each of the dataframes contained within it to the indexing function. That is right, indexing [] is a function and we can pass it into FUN by calling "[". What fun! Note the empty commas there: "[",,1:3 is intentional and will be interpreted the same as saying x[,1:3], or… go get columns 1 through 3.
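You can see that [ really is a function by calling it directly on a single dataframe (mtcars here is just a stand-in):

head(`[`(mtcars, , 1:3)) # identical to head(mtcars[, 1:3])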

Clean those headers

HoboList<-lapply(HoboList,setNames, c("idx","DateTime","T_F"))
HoboList

Then we simply rename the headers, because lapply() passes each dataframe in the list to setNames().

All together now with the piping!

HoboList<-lapply(fPaths,read.csv, skip=1, header=T)%>% #1. read hobo data into a list
          lapply(.,"[",,1:3)%>% #2. Grab only first 3 columns
          lapply(.,setNames, c("idx","DateTime","T_F")) #3. fix column names
Now all of those hobo files are available in the HoboList object that we could summarize, do some flagging, or plot from. How do we do that? We could do for loops again, or we could keep going with lapply.

lapply() to summarize data

Now, let’s summarize that object.

HoboSummary<-lapply(HoboList[-3], FUN=function(x){ #-3 drop that blank file for now
  d<-data.frame(
    Mean = mean(x$T_F, na.rm=T),
    Min = min(x$T_F, na.rm=T),
    Max = max(x$T_F, na.rm=T),
    nObs = nrow(x)
  )
})
do.call(rbind, HoboSummary) #equivalent to dplyr::bind_rows(HoboSummary)

This is a stripped down version of the for loop we did before. We then use do.call(rbind, ...), the base equivalent to dplyr::bind_rows(), to generate a dataframe. do.call() takes a function (here rbind) and passes it a list of arguments (here the list elements). So, it passes all of the individual elements of HoboSummary to rbind() as individual arguments to glue together.
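If do.call() is new to you, here is the idea on a toy list:

args_list <- list(data.frame(x = 1), data.frame(x = 2), data.frame(x = 3))
do.call(rbind, args_list) # same as rbind(args_list[[1]], args_list[[2]], args_list[[3]])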

We could then do the flagging step and the plotting steps. However, we are going to stop here and pick those up in the advanced R session because we really need to talk about how to make a function() before we go a whole lot further and that is a topic for next time.


while()

Another type of loop you should be aware of is the while loop. It probably won’t come up much when you’re working with data in R, but as a programmer you should have it in your toolbox. While the apply family of functions is specific to R, for loops and while loops are present in most programming languages.

In a for loop, you decide ahead of time how many times the code inside your loop should execute. The for loop iterates through a vector of a certain length and stops when it reaches the end.

In a while loop, on the other hand, your loop executes while some condition is true. When that condition is no longer true, the loop stops. while loops are good for situations when you can’t predict ahead of time how many times the loop needs to execute.
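A minimal sketch (made-up numbers) before we get to the game:

x <- 1
while (x < 100) { # keep going as long as x is under 100
  x <- x * 2
}
x # 128 - we didn't have to know ahead of time how many doublings it would take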

As an example, we’ll make a simple guessing game. The code below selects a random integer between 1 and 10. It prompts the user to type a guess at the console. If they guess correctly within three tries, they win.

guess <- 0  # Initialize the guess variable
number <- sample(1:10, 1)  # Get our random number
guess_count <- 0  # Number of guesses so far

while (guess != number && guess_count < 3) {  # Keep looping if the guess is wrong and we haven't hit our 3 guesses yet
  guess_count <- guess_count + 1
  message <- paste0("Guess ", guess_count, ": what is the random number?  ")
  guess <- as.integer(readline(prompt = message))
  
  if (guess == number) {
    print("Woohoo! You won the game!")
  } else if (guess_count < 3) {
    print("Nope, try again.")
  } else {
    print("You lost :(")
  }
}

When you are writing a while loop, make sure that it will eventually terminate! If the test expression (in this case guess != number && guess_count < 3) never becomes false, your while loop will go on forever! An infinite while loop will ungracefully crash your code, so be sure to test your loop thoroughly, especially if it’s more complex.

Troubleshooting

Overview

Believe it or not, there are only two types of errors, and both are human. 1) Your code is written incorrectly (let’s go find that missing ‘)’…). 2) You are telling your code to do something with data that it can’t do with that data (let’s figure out where we need to set na.rm = T). Another branch of troubleshooting is optimization - “why is my code so slow?” We will cover that next week.

Troubleshooting iteration is often challenging and frustrating. Sometimes errors are obvious; sometimes errors only show up when you start digging into outputs. Sometimes it works fine on the test data and fails on the real data. Ultimately, debugging is an exercise in hypothesis testing: “If I do this, I should get this result. Where did I not get the expected result, and why?” We present some basic manual approaches and resources here that should help you debug iteration. As you code, you will find the approach that works best for you in each situation.

Clues you need to troubleshoot

Warnings - Warnings are your prompt to check if there is a problem. Code will seldom fail to provide output on a warning; the question is whether or not it is the output you are after. When you see warnings you should run warnings() and see if it makes sense why they appeared. ggplot, for example, will issue warnings for each incomplete record (e.g. a value for y, but not x) and will continue to plot everything that is complete. It is always good to pay attention to the number of warnings and periodically check that they are expected. If they are unexpected, you may wish to dig deeper.

Errors - In short, errors occur when the functions used in your code encounter data that they can’t use. Sometimes an error tells you what the problem is; more often it doesn’t. Many errors are general and not specific to a function, so you will want to Google the function and the error to see what people are saying. It may take a few iterations to get the right search terms.

Unexpected behavior - If you expect that 1+1=2 but your code returns 3, that is a good indicator you need to troubleshoot. It is always good to feed your code some data for which you already know the answer or output. Both positive and negative examples should be used. A positive example is one where you know it should work normally and a negative example is one where you know it should break. I often do this during code development as I am building different modules or chunks. Fundamentally I have two checks: 1) Is it doing what I expect it to do? 2) What happens if I slip an NA/NaN/Inf or an empty dataframe in there?
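For example, a quick negative check on one of the summary functions we used earlier:

mean(c(68, 70, NA)) # returns NA, which would quietly propagate through a loop
mean(c(68, 70, NA), na.rm = TRUE) # returns 69, which is probably what we wanted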

Finding Errors

Tracking down the error

Tracking the error down is usually the first step in fixing it. Once you track it down, you can see which functions are involved or which data are involved. There are generally two approaches to this: manual and R-assisted. I am going to focus on the manual approach, because at some point you will have to fall back on it, so it is good to have a solid understanding of it before you branch out to R-assisted approaches such as traceback(), browser(), and RStudio breakpoints.

  1. Don’t try and pack too much into your code. Writing ‘everything’ loops means that you will have a lot of functions. Different functions have different inputs and outputs. The more data structures and outputs your iteration has to handle, the more likely it is to break or provide unexpected results. Iteration works really well when it has 1 input structure and 1 output structure.

  2. Make your code verbose - Iteration usually works until it doesn’t - it will loop up to the point or dataset that breaks it. Depending on how you have written your code, you likely don’t know where it actually failed. That is a fact of life with R and iteration. Iteration tends to hide errors and warnings and makes it difficult to see where things went wrong. Adding a print() command that prints a dataset name or a piece of data lets you evaluate progress and quickly determine whether the code is functioning properly. When you are done testing and developing, you can then remove it. A few rules of thumb: if it returns an error on the first step of iteration, then it is a code or a data-code match problem. If it can iterate properly at least once, then the code ‘works’, but there is a code-data mismatch somewhere.

  3. Step away from the iteration - The problem is seldom the iteration itself. In this approach you set the input data to the known issue or case you want to test and don’t attempt to iterate over all the data. Without invoking the iteration, you then run each step of your code line by line to determine where the error occurs. This is used for fixing a code error and for troubleshooting a dataset. What does that look like?

#effectively these are manual "breakpoints"

i = fPaths[3] # code dev/debugging fragment
# rm(i) # debug/dev frag
SummaryData <- NULL # make sure this gets cleared before rerunning this code
# for(i in fPaths){
d <- read.csv(i, skip = 1, header = T)[1:3]
d # check output
fName <- basename(i)
# print(fName)#dev/debug frag
names(d) <- c("idx", "DateTime", "T_F")
# summarize the data
OutDat <- data.frame(
  FileName = fName,
  T_F_mean = mean(d$T_F, na.rm = T),
  T_F_min = min(d$T_F, na.rm = T),
  T_F_max = max(d$T_F, na.rm = T),
  nObs = nrow(d)
) # days
# Append new summary data to old summary data
SummaryData <- rbind(SummaryData, OutDat)
#  }
  4. Step away from your data - Sometimes we let our data over-complicate the coding task. There are few coding sessions where I don’t end up trying to see if I can make something work on a built-in dataset (e.g. mtcars). Often when trying to do the same thing, but with different data, I end up seeing and solving the problem. This approach also forces you to reduce your code to the bare basics… which you may need when you go to ask for help.

Lastly, don’t spin your wheels by yourself for too long. Coding is a team effort.

Reproducible Examples

Ask for help

We have talked about getting help before, but here we want to cover the “Minimal Reproducible Example (MRE)” and asking for help on Stack Overflow. This is actually a crucial part of debugging. Often the process of developing the MRE will answer your question.

When you ask for help (on Stack Overflow or in your office) you should provide an example of the problem you are having, complete with data. This is called the minimal reproducible example (MRE). It is easiest on you and the people trying to help you if it does not use your data. In the end you want to be able to make a fairly simple statement: “I am trying to _____. Here is the code I have so far. I want it to produce output that looks like ____. However, it is not currently doing that; instead it is providing this output _____. What did I do wrong?”

Steps for getting help / developing an MRE on Stackoverflow:

  1. Identify a common built-in dataset that has a structure that is vaguely similar to your data, or write some simple code that mimics what you are trying to deal with (see the short sketch after this list). A good one to use is often mtcars, which is built into R. It has integer data, continuous data, and categorical data. If you are using a random number generator for your data (e.g. rnorm()), then using set.seed() is advisable. Don’t try to reproduce your entire dataset; reproduce only what your code is using. If you have 40 variables and your code only operates on 8, provide a dataframe with 8 variables (or 9 in the unlikely event the problem is in the unused variable).

  2. Strip your code down to the essentials - sometimes the bigger context of your code is relevant, but usually it isn’t. A. Remove variable names that are special to your data; rename them simple things like x and y. B. Decide which steps are ‘bells and whistles’ and which steps are essential to the problem you are having or to what the code needs to do. Remove the bells and whistles, and check whether the code behavior changes with each removed step.

  3. Be able to clearly provide an example of what the code should produce as output. Your code does not need to actually be producing that output yet.

  4. Type your question in plain English, avoiding jargon. Use proper punctuation, grammar, etc. Keep it short, but provide details.

  5. Assign some useful R tags, usually r and the name of the function you think is the issue.

  6. Reply to people, thank them for their suggestions, provide clarification, and select an answer. Avoid initiating a wholly new question in the comments/discussion.
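For step 1, a tiny mock dataset might look something like this (the column names and values are made up):

set.seed(123) # so anyone running the example gets the same random numbers
fake_dat <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = rnorm(10)
)
fake_dat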

Here is an example of a question I posted in preparation for this training. I had a way for it to work, but knew there was a cleaner way.

https://stackoverflow.com/questions/70747292/assigning-multiple-text-flags-to-an-observation/70748841#70748841

Is that solution better than what I had? Is it better than the other answer? Maybe. In the end I think the solution I selected will make it easier to add other flags that may come up. Maybe there is slightly less code here. It is visually more appealing. I do like the direct handling of NA values.

Resources

There’s a lot of great online material for learning new applications of R. The ones we’ve used the most are listed below.

Online Books

  • R for Data Science First author is Hadley Wickham, the programmer behind the tidyverse. There’s a lot of good stuff in here, including a chapter on using R Markdown, which is what we used to generate this training website.
  • ggplot2: Elegant Graphics for Data Analysis A great reference on ggplot2 also by Hadley Wickham.
  • Mastering Software Development in R First author is Roger Peng, a biostatistics professor at Johns Hopkins, who has taught a lot of undergrad/grad students how to use R. He’s also one of the hosts of the Not So Standard Deviations podcast. His intro to ggplot is great. He’s also got a lot of more advanced topics in this book, like making functions and packages.
  • R Packages Another book by Hadley Wickham that teaches you how to build, debug, and test R packages.
  • Advanced R Yet another book by Hadley Wickham that helps you understand more about how R works under the hood, how it relates to other programming languages, and how to build packages.
  • Mastering Shiny And another Hadley Wickham book on building shiny apps.

Other useful sites

  • NPS_IMD_Data_Science_and_Visualization > Community of Practice is an IMD work group that meets once a month to talk about R and Data Science. There are also notes, materials, and recordings from previous meetings, a Wiki with helpful tips, and the chat is a great place to post questions or cool tips you’ve come across.
  • STAT545 Jenny Bryan’s site that accompanies the graduate level stats class of the same name. She includes topics on best practices for coding, and not just how to make scripts work. It’s really well done.
  • RStudio home page There’s a ton of info in the Resources tab on this site, including cheatsheets for each package developed by RStudio (i.e. tidyverse packages), webinars, presentations from past RStudio Conferences, etc.
  • RStudio list of useful R packages by topic
  • Happy Git with R If you find yourself wanting to go down the path of hosting your code on github, this site will walk you through the process of linking github to RStudio.

Resources by Day:

Day 3: Data Visualization

Using ggplot2


Visualizing Spatial Data
  • Geocomputation in R is an excellent online book on the sf and tmap packages.
  • R Studio’s leaflet for R page shows the basics of how to use leaflet.
  • R-spatial mapping in ggplot The r-spatial.org site is a blog with lots of interesting topics covered. The blog I linked to helped me figure out how to map shapefiles in ggplot.


Code printout

This tab prints all of the code chunks in this file in one long file.

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

rm(list = ls())
packages <- c("tidyverse", "ggthemes", "GGally", "RColorBrewer", 
              "viridis", "scales", "plotly", "patchwork", 
              "sf", "tmap", "leaflet", "spsurvey")

install.packages(setdiff(packages, rownames(installed.packages())))  
install.packages('terra', repos='https://rspatial.r-universe.dev')
# By using this hashtag/pound sign, you are telling R to ignore the text afterwards. This is useful for leaving annotation of notes or instructions for yourself, or someone else using your code

# try this line to generate some basic text and become familiar with where results will appear:
print("Hello, lets do some basic math. Results of operations will appear here")

# one plus one
1+1

# two times three, divided by four
(2*3)/4

# basic mathematical and trigonometric functions are fairly similar to what they would be in excel

# calculate the square root of 9
sqrt(9)

# calculate the cube root of 8 (remember that x^(1/n) gives you the nth root of x)
8^(1/3)

# get the cosine of 180 degrees - note that trig functions in R expect angles in radians
# also note that pi is a built-in constant in R
cos(pi)

# calculate 5 to the tenth power
5^10
# the value of 12.098 is assigned to variable 'a'
a <- 12.098

# and the value 65.3475 is assigned to variable 'b'
b <- 65.3475

# we can now perform whatever mathematical operations we want using these two variables without having to repeatedly type out the actual numbers:

a*b

(a^b)/((b+a))

sqrt((a^7)/(b*2))
# read in the data from tree_growth.csv and assign it as a dataframe to the variable "tree_data"
tree_data <- read.csv("data/tree_growth.csv")

# display the 'tree_data' data frame we just created
tree_data
head(tree_data)  # Show the first few lines of the dataframe. Defaults to showing the first 6 rows
head(tree_data, n = 10)  # Show the first 10 rows

summary(tree_data)  # View a basic statistical summary

str(tree_data)  # Get info about the structure of the data
View(tree_data)  # Open a new tab with a spreadsheet view of the data
digits <- 0:9  # Use x:y to create a sequence of integers starting at x and ending at y
digits
is_odd <- rep(c(FALSE, TRUE), 5)  # Use rep(x, n) to create a vector by repeating x n times 
is_odd
shoe_sizes <- c(7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5)
shoe_sizes
favorite_birds <- c("greater roadrunner", "Costa's hummingbird", "kakapo")
favorite_birds
is.vector(digits)  # Should be TRUE
is.vector(favorite_birds)  # Should also be TRUE

length(digits)  # Hopefully this is 10
length(shoe_sizes)

# Even single values in R are stored as vectors
length_one_chr <- "length one vector"
length_one_int <- 4
is.vector(length_one_chr)
is.vector(length_one_int)
length(length_one_chr)
length(length_one_int)
class(favorite_birds)
class(shoe_sizes)
class(digits)
class(is_odd)
second_favorite_bird <- favorite_birds[2]
second_favorite_bird
top_two_birds <- favorite_birds[c(1,2)]
top_two_birds
odd_digits <- digits[is_odd]
odd_digits
tree_data$Species
tree_data$Age_yr
tree_data[, "Species"]  # Same as tree_data$Species
tree_data[, c("Species", "Age_yr")]  # Data frame with only Species and Age_yr columns
tree_data[1:5, "Species"]  # First five elements of Species column
is.vector(tree_data$Species)
class(tree_data$Species)
is.vector(tree_data$Age_yr)
class(tree_data$Age_yr)

str(tree_data)  # Check out the abbreviations next to the column names: chr, num, and int. These are the data types of the columns.
#to calculate the mean of the 'age' column in the original dataframe:
mean(tree_data$Age_yr)

#or to calculate the mean of the DBH vector we created:
dbh <- tree_data$DBH_in
mean(dbh)

#like-wise for the median, standard deviation, and minimum:
median(tree_data$Age_yr)
sd(tree_data$Age_yr)
min(tree_data$Age_yr)
unique(tree_data$Species)  # in this line, we are printing all of the unique species names in the dataset (in this case 2).
species_names <- unique(tree_data$Species)  # assign unique species names to a variable
in_per_ft <- 12  # Since there are 12 inches in a foot
# here we are specifying the new column on the left side of the '<-', and telling R what we want put into it on the right side of the '<-'.
tree_data$Height_in <- tree_data$Height_ft * in_per_ft

# we can now use the head() function to check our work, and make sure the proper conversion was carried out:
head(tree_data)
hist(x = tree_data$Age_yr)
# this should generate an ugly but effective scatterplot of tree height vs age. It would appear that older trees are taller!
plot(x = tree_data$Age_yr, y = tree_data$Height_ft)
# let's try the same, but with DBH as the dependent variable on the y axis:
# again, we should see a plot below showing that, even in a make-believe forest, older trees of both species tend to be thicker!
plot(x = tree_data$Age_yr, y = tree_data$DBH_in)
?plot
?dplyr::filter
install.packages("janitor")
library('janitor')
library('janitor')
install.packages("tidyverse")
library('tidyverse')
library('janitor')
library('tidyverse')
messy_data <- tibble(TrailName = c('Bayside', 'Coastal', 'Tidepools', 
                                   'Monument', 'Tidepools', NA),
                     Visitor_count = c(200, 300, 400, 500, 400, NA),
                     `empty row` = NA)
messy_data
clean_data <- clean_names(messy_data)
clean_data
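# note: the next call just lays out remove_empty()'s arguments; 'dat' stands in for your data frame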
remove_empty(
  # first argument is the data
  dat, 
  # the next specifies if you want to get rid of empty rows, columns, or both
  which = c("rows", "cols"), 
  # this last one asks whether (TRUE) or not (FALSE) you want R to tell you which rows or columns were removed
  quiet = TRUE)
clean_data2 <- remove_empty(clean_data, which = c('rows'), quiet = FALSE)
clean_data2
clean_data2 <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data2
clean_data <- clean_names(messy_data)
clean_data <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data
clean_data <- clean_names(messy_data) %>%
              remove_empty(which = c('rows', 'cols'), quiet = TRUE)
clean_data
# note that replacing %>% with |> also works
clean_data <- clean_names(messy_data) %>%
              remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
              distinct()
clean_data
messy_data2 <- tibble(`Park Code` = c(NA, NA, 'CABR', 'CHIS', 'SAMO', 'SAMO'),
                     visitor_count = c(NA, NA, 400, 500, 400, 400),
                     `empty row` = NA)
clean_data2 <- clean_names(messy_data2) %>%
               remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
               distinct()
library(readr)
library(janitor)
library(tidyverse)
park_visits <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")
# let's see the column names in park_visits
colnames(park_visits)

# we can also look at the first few rows by using head()

head(park_visits)

# you can either select for the columns you want
park_visits1 <- park_visits %>% select(year, parkname, unit_code, state, visitors)

# or against the columns you don't want, by using the minus sign before the column name
park_visits2 <- park_visits %>% select(-gnis_id, -geometry, -metadata, -number_of_records, -region, -unit_name, -unit_type)

# you can check if you have the same result by using colnames()
colnames(park_visits1)
colnames(park_visits2)

# to clear up our environment, we'll get rid of park_visits1 and 2 using remove()
remove(park_visits1, park_visits2)

# since we won't need all of the columns in park_visits, we'll change that file (note that park_visits is on both the left and right sides of the pipe)
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)
# rename parkname column to make it snake case
park_visits <- park_visits %>% rename(park_name = parkname)

# we can check if we've done it correctly by looking at the column names
colnames(park_visits)
wa_park_visits <- filter(
                        # name of data.frame
                        park_visits, 
                        # condition
                        state == 'WA'
                        )

# head() shows the first few rows of data
head(wa_park_visits)

# you can use the unique() to check you've done the filtering correctly
unique(wa_park_visits$state)
wa_park_visits <- park_visits %>%
                  # filter for parks in WA
                  filter(state == 'WA' &
                  # filter for total visitation
                  year == 'Total' & 
                  # filter for over 5 million visitors
                  visitors > 5000000)

# head() shows the first few rows of data
head(wa_park_visits)
pr_park_visits <- park_visits %>%
                  # filter for parks in Puerto Rico
                  filter(state == 'PR' &
                  # filter for total visitation
                  year == 'Total')

# head() shows the first few rows of data
head(pr_park_visits)
# San Juan National Historic Site (Puerto Rico) had > 55 million visitors!

# remove result from environment
remove(pr_park_visits)
wa_park_visits <- park_visits %>%
                  # filter for parks in WA
                  filter(state == 'WA' &
                  # filter for total visitation
                  year == 'Total' & 
                  # filter for over 5 million visitors
                  visitors > 5000000) %>%
                  # arrange by visitation - default is ascending order
                  arrange(visitors)

# look at result (df is short so head() not needed)
wa_park_visits

# to arrange by descending order, we can use desc()
wa_park_visits2 <- wa_park_visits %>%
                  arrange(desc(visitors))

# look at result 
wa_park_visits2

# remove one result
remove(wa_park_visits2)
# two ways to get the number of observations per state

# option 1 - use pull() to isolate a column and table() to get # of observations
state_n <- park_visits %>%
           # pull the state column only
           pull(state) %>%
           # get # of observations by state in table output
           table()

# option 2 - use group_by() to isolate column(s) and tally() to get # of obs
state_n2 <- park_visits %>%
           # group by state
           group_by(state) %>%
           # get tally of observations by state in output
           tally()

# remove created objects
remove(state_n, state_n2)
wa_park_visits_millions <- wa_park_visits %>%
                           mutate(visitors_mil = visitors/1000000)

wa_park_visits_millions
park_visits2 <- park_visits %>%
                # remove the rows with Totals per park
                filter(year != "Total")  %>% 
                # do calculations by park
                group_by(park_name) %>%
                # add visit_percent column
                # visit_percent = visits for each year divided by the sum of total visits for each park
                # the na.rm = TRUE means that NA values are not included in calculations
                # round(2) rounds the result to 2 decimal places for easier reading
                mutate(visit_percent = (100*visitors/sum(visitors, na.rm = TRUE)) %>%
                         round(2)) 

# take a look at the result
head(park_visits2)

# remove from environment
remove(park_visits2)
state_summary <- park_visits %>%
                # filter for total visitation numbers (to avoid double counts)
                filter(year == 'Total') %>%
                # do calculations by state
                group_by(state) %>%
                # calculate summaries
                summarize(
                  # mean number of total visitors across parks in each state
                  mean_visit = mean(visitors, na.rm = TRUE),
                  # sd of total visitors across parks in each state
                  sd_visit = sd(visitors, na.rm = TRUE),
                  # min total visitors across parks in each state
                  min_visit = min(visitors, na.rm = TRUE),
                  # max total visitors across parks in each state
                  max_visit = max(visitors, na.rm = TRUE),
                  # get number of park units w data in each state
                  n_parks = n()
                )

# take a look at the result
head(state_summary)

# if you want to continue analyzing this summarized data, you must first ungroup()
# for example:
state_summary <- state_summary %>%
                 ungroup() %>%
                 mutate(dif = max_visit - min_visit)

# remove from environment
remove(state_summary)
challenge2 <- park_visits %>%
                # filter for total visitation numbers (to avoid double counts)
                filter(year == 'Total') %>%
                # do calculations by state
                group_by(state) %>%
                # calculate summaries
                summarize(
                  # mean number of total visitors across parks in each state
                  mean_visit = mean(visitors, na.rm = TRUE),
                  # get number of park units w data in each state
                  n_parks = n()
                ) %>%
                # ungroup to do another calculation
                ungroup() %>%
                # calculate visit/n
                mutate(visit_per_park = mean_visit/n_parks) %>%
                # select relevant columns
                select(state, visit_per_park) %>%
                # arrange in descending order
                arrange(desc(visit_per_park))
                

# take a look at the result
head(challenge2)

# remove from environment
remove(challenge2)
# call packages
library("readr")
library("janitor")
library("tidyverse")

# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")

# cut some columns from park_visits data
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)

# we'll first take a subset of the park visits data
medn_visits <- park_visits %>%
  # get parks in MEDN and no totals, years 2010-2015
  filter(unit_code %in% c("CABR", "CHIS", "SAMO") & year != "Total") %>%
  # make year a number, since no more text
  mutate(year = as.numeric(year)) %>%
  # get years 2010:2015
  filter(year >= 2010 & year <= 2015) %>%
  # arrange in ascending order
  arrange(year) %>%
  # select relevant columns
  select(unit_code, year, visitors)

# if we take a look at the data, it is in long format, each observation is a row
# this isn't the most human-friendly way to look at data, but it is really helpful if we want to use dplyr group_by() functions like mutate() and summarize()
medn_visits

# conversely, we can pivot wider to make the data wide format
# it is more human-friendly but you can't use dplyr functions on it easily
medn_visits_wide <- pivot_wider(medn_visits, names_from = year, values_from = visitors)
medn_visits_wide
# check out the original data (medn_visits) in long format
medn_visits

# if we want to make each row a park and each column a year, we can pivot wider
medn_visits_wide <- medn_visits %>%
  pivot_wider(
    # names from is the thing you want to be the columns
    names_from = year,
    # values from is what you want to fill the columns with
    values_from = visitors
  )

# check out the result
medn_visits_wide

# we can also make park units the columns and fill it with visitor data for each year
medn_visits_wide <- medn_visits %>%
  pivot_wider(names_from = unit_code, values_from = visitors)
medn_visits_wide
# take another look at the wide-format data (medn_visits_wide) before pivoting it back
medn_visits_wide

# if we want to make each row an observation, with an associated park and year, we'll pivot_longer
medn_visits_long <- medn_visits_wide %>%
  pivot_longer(
    # first argument is the columns you'd like to transform
    SAMO:CHIS,
    # next is what you'd like to name the former name cols
    names_to = "unit_code",
    # last is what you'd like to name the former values cols
    values_to = "visitors"
  )

# check out the result
medn_visits_long

# this should look familiar - like the original medn_visits data
# get 3 states of choice (FL, CA, AK) for years 2010-2015
solution_original <- state_pop %>%
  filter(state %in% c("FL", "AK", "CA") &
    year >= 2010 & year <= 2015)
# pivot wider
solution_wide <- solution_original %>%
  pivot_wider(
    names_from = "year",
    values_from = "pop"
  )

# pivot longer to return to format of solution_original
solution_long <- solution_wide %>%
  pivot_longer(`2010`:`2015`, names_to = "year", values_to = "pop")
# call packages
library("readr")
library("janitor")
library("tidyverse")

# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")

# cut some columns from park_visits data and make sure year is numeric data only
park_visits <- park_visits %>%
  select(year, parkname, unit_code, state, visitors) %>%
  filter(year != "Total") %>%
  mutate(year = as.integer(year))

# we'll first make two subsets of the park_visits data
SHEN_visits <- park_visits %>%
  # get data from Shenandoah.
  filter(unit_code == "SHEN")

SHEN_visits

ACAD_visits <- park_visits %>%
  # get data from Acadia.
  filter(unit_code == "ACAD")

ACAD_visits

# now we will combine the 2 data.frames into one.
SHEN_ACAD_visits <- bind_rows(SHEN_visits, ACAD_visits)

# check that data from both parks is there
SHEN_ACAD_visits %>%
  pull(unit_code) %>%
  table()

rm(SHEN_ACAD_visits, SHEN_visits, ACAD_visits)

# we'll first make two subsets of the park_visits data
visits_left <- park_visits %>%
  # get first 2 columns.
  select(year, parkname)

visits_left

visits_right <- park_visits %>%
  #  get last 3 columns.
  select(unit_code, state, visitors)

visits_right

visits_bound <- bind_cols(visits_left, visits_right)

visits_bound

rm(visits_left, visits_right, visits_bound)

# Atlantic islands
Atlantic <- park_visits %>%
  filter(state %in% c("PR", "VI"))
# Pacific Islands
Pacific <- park_visits %>%
  filter(state %in% c("AS", "GU", "HI"))

# All islands
Islands <- bind_rows(Atlantic, Pacific)


rm(Atlantic, Pacific, Islands)

# filter the state data
state_pop2 <- state_pop %>% filter(year < 2000)

# left_join example
state_pop_gas1 <- left_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas1

# right_join example
state_pop_gas2 <- right_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas2

# inner_join example 

state_pop_gas3 <- inner_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas3

# full_join example 

state_pop_gas4 <- full_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas4

rm(state_pop_gas1, state_pop_gas2, state_pop_gas3, state_pop_gas4)


park_visits_pop <- left_join(park_visits, state_pop, by = c("state", "year"))

rm(park_visits_pop)

# semi_join example 

state_pop_gas5 <- semi_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas5

# anti_join example 

state_pop_gas6 <- anti_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas6

# clean up
rm(state_pop2, state_pop_gas5, state_pop_gas6) # the earlier join results were already removed above


## In this case we filter out the NA values inside the semi_join() call. You could also filter them out first, save the result to a new data.frame, and then do the join.

park_visits_filtered <- semi_join(park_visits, state_pop %>% filter(!is.na(pop)), by = c("state", "year")) %>% 
  semi_join(gas_price, by = "year")

park_visits_filtered


rm(park_visits_filtered)
library(knitr)
library(kableExtra)
library(here)

# Packages used in this section
library(tidyverse)
library(viridis) # for colorblind-friendly palettes
covid_numbers <- readr::read_csv(here::here("Data", "covid_numbers.csv"))
head(covid_numbers, 7) %>%
  knitr::kable(align = "c", caption = "Daily Covid cases and population numbers by state (only showing first 7 records)") %>%
  kableExtra::kable_styling(full_width = F)
acme_in <- read_csv(here::here("Data", "acme_sales.csv")) %>%
  dplyr::arrange(category, product) 
acme_in %>%
  knitr::kable(align = "c", caption = "Average monthly revenue (in $1000's) from Acme product sales, 1950 - 2020") %>%
  kableExtra::kable_styling(full_width = F)

acme <- acme_in %>%
  pivot_longer(-c(category, product), names_to = "month", values_to = "revenue")
acme$month <- factor(acme$month, levels = month.abb)

ggplot(acme, aes(x=month, y=product, fill=revenue)) + 
  geom_raster() +
  geom_text(aes(label=revenue, color = revenue > 1250)) + # color of text conditional on revenue relative to 1250
  scale_color_manual(guide = "none", values = c("black", "white")) + # set color of text
  scale_fill_viridis_c(direction = -1, name = "Monthly revenue,\nin $1000's") +
  scale_y_discrete(limits=rev) + # reverses order of y-axis bc ggplot reverses it from the data
  labs(title = "Average monthly revenue (in $1000's) from Acme product sales, 1950 - 2020", x = "Month", y = "Product") + 
  theme_bw(base_size = 12) +
  facet_grid(rows = vars(category), scales = "free") # set scales to free so each facet only shows its own levels
ansc <- anscombe %>%
  dplyr::select(x1, y1, x2, y2, x3, y3, x4, y4) 

ansc %>%
  knitr::kable(align = "c", caption = "Anscombe's Quartet - Four bivariate datasets with identical summary statistics") %>%
  kableExtra::column_spec (c(2,4,6),border_left = F, border_right = T) %>%
  kableExtra::kable_styling(full_width = F)

sapply(ansc, function(x) c(mean=round(mean(x), 2), var=round(var(x), 2))) %>%
  knitr::kable(align = "c", caption = "Means and variances are identical in the four datasets. The correlation between x and y (r = 0.82) is also identical across the datasets.") %>%
  kableExtra::column_spec (c(1,3,5,7),border_left = F, border_right = T) %>%
  kableExtra::kable_styling(full_width = F)
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv") # visits to US National Parks

gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv") # gas prices over space and time
# Examine the data to understand data structure, data types, and potential problems
View(park_visits); summary(park_visits)
View(gas_price); summary(gas_price)

# NOTES: 
# 1. When we preface function names with the package name, we avoid potential issues with duplicate function names among packages
# 2. The dplyr::select() function can simultaneously select columns, rename them, and order them as specified. See `?select`
# 3. When we `dplyr::filter()` by multiple conditions, we can use a comma (,) or and-sign (&) to combine the multiple conditions
# 4. `gas_price$gas_constant` is gas price in constant 2015 dollars/gallon

joined_dat <- park_visits %>%
  dplyr::filter( # filter by multiple conditions
    region %in% c("IM", "PW", "SE"), 
    unit_type %in% c("National Monument", "National Historic Site", "National Park"),
    year %in% "2000":"2015") %>%
  dplyr::select(region, unit_type, unit_name, unit_code, year, visitors) %>% # keep these columns only, and arrange them in this order
  dplyr::mutate(year = as.numeric(year)) %>% # convert `park_visits$year` to numeric, to match the data type of `gas_price$year`
  dplyr::left_join(gas_price[c("year", "gas_constant")], by = "year")

# summary(joined_dat) # always check the data before and after wrangling!
glimpse(joined_dat)

# NOTES:
# 1. The factor data type is similar to (and often interchangeable with) the character data type, but limits entries to pre-specified values and allows us to specify an order to those values (other than the default alphabetical order)
# 2. Factors are especially useful for ordered categories (e.g., low < medium < high) with a small number of possible values
# 3. Under the hood, `R` stores factors as integer vectors representing the order of the values (with character string labels)
# 4. I usually classify categorical variables as character data types, unless I want to limit entries to a few pre-specified values, or if I want to enforce a particular order to the category levels
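# Small illustration of NOTE 2 (a hypothetical example, not part of our data): an
# ordered factor keeps "Low" < "Medium" < "High" instead of alphabetical order
risk <- factor(c("High", "Low", "Medium", "Low"),
               levels = c("Low", "Medium", "High"), ordered = TRUE)
levels(risk) # "Low" "Medium" "High"
risk < "High" # ordered factors support comparisons against a level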

unique(joined_dat$region) # check what values are in `region`

joined_dat$region <- 
  factor(joined_dat$region, 
         levels = c("PW", "IM", "SE"), # specifies the factor levels and an order for them. If we omit this argument, the factor levels will be the unique set of values in the column, ordered alphabetically
         labels = c("Pacific West", "Intermountain", "Southeast")) # relabels the levels with more descriptive names, keeping the same order as the `levels` argument

# Check the data
glimpse(joined_dat) # `region` is now a factor data type
levels(joined_dat$region) # shows the (renamed) factor levels and their order
str(joined_dat$region) # shows the values are coded as integers indicating factor level order
summary(joined_dat) # the summary for `region` now counts the number of times each factor level occurs
# NOTES: 
# 1. Check out `?expand_grid` for an alternative approach (a sketch using it appears after the next code chunk)
# 2. The `table` data class stores counts for every combination of the variables passed to `table()`

# Run the code in pieces to see what the different lines do. For example, copy `table(uc = park_sub$unit_code, yr =park_sub$year)` and run it in the console to see the output. Then copy `table(uc = park_sub$unit_code, yr =park_sub$year) %>% as.data.frame()` and run it in the console to see the output. Repeat until the end of the code chunk.
remove_units <- 
  table(unit = joined_dat$unit_code, yr = joined_dat$year) %>% # build a contingency table of counts for every combination of `unit_code` and `year`
  as.data.frame() %>% # convert from `table` to `data frame` class
  dplyr::filter(Freq != 1) %>% # filter for unit-year combinations that are less than or greater than 1
  distinct(unit) # identify the corresponding park units

# Remove those park units from the data
viz_dat <- joined_dat %>%
  dplyr::filter(!unit_code %in% remove_units$unit)
  
# View(viz_dat) to make sure the park units have been removed. The final data frame should have 1856 rows and 7 columns.
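# Sketch of the `expand_grid()` alternative mentioned in NOTE 1 above: build every
# unit-by-year combination, then anti_join() against the data to find combinations
# that are missing. (Unlike the table() approach, this only catches missing years,
# not duplicated ones.)
all_combos <- tidyr::expand_grid(unit_code = unique(joined_dat$unit_code),
                                 year = unique(joined_dat$year))
missing_combos <- dplyr::anti_join(all_combos, joined_dat, by = c("unit_code", "year"))
dplyr::distinct(missing_combos, unit_code) # park units with at least one missing year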
# If you want to save `viz_dat` to your current working directory, run the code below. This code saves `viz_dat` as an .RDS file, which just means it saves it as an `R` object instead of as a .csv file. 

# When we save a data frame as an `R` object we retain the accompanying attribute information (e.g., the order of factor levels for `region`)--this information would not be saved in a .csv file.

saveRDS(viz_dat, "viz_dat.RDS")

# To read the RDS file from your working directory and assign it to `viz_dat`, use the code below. You can assign it to any name you want. If you're reading it from a location other than your current working directory, provide the file path in the function argument.
viz_dat <- readRDS("viz_dat.RDS")
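# Optional check of the point above (a sketch; "viz_dat_check.csv" is just a throwaway
# file name for this example): a .csv round trip loses the factor ordering we set for
# `region`, while the .RDS round trip keeps it
write.csv(viz_dat, "viz_dat_check.csv", row.names = FALSE)
levels(read.csv("viz_dat_check.csv")$region) # NULL -- region comes back as plain character
levels(readRDS("viz_dat.RDS")$region) # the factor levels, in the order we set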

tab_count <- viz_dat %>% 
  dplyr::distinct(region, unit_type, unit_code) %>%
  dplyr::count(region, unit_type) %>%
  tidyr::pivot_wider(names_from = region, values_from = n)

# View(viz_dat) to double-check the numbers
tab_count


viz_dat <- readRDS("Data/viz_dat.RDS")

# This section requires the `viz_dat` data frame created earlier in the module. If you don't have this data frame in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument

# Packages used in this section
library(tidyverse)

# NOTE: `lm` is part of the `stats` package that is automatically installed with `R`. This code comes from the Examples section of `?lm`.

# Create fake data
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

# Simple linear regression of `weight` as a function of group
mod <- lm(weight ~ group) # assign model results to the name `mod`
summary(mod) # model summary
class(mod) # class of `mod` is `lm`
plot(mod) # residual plots
# Check out `?boxplot`

# Subset the data
subdat <- viz_dat %>%
  dplyr::filter(
    year %in% c(2000, 2005, 2010, 2015),
    unit_type == "National Monument",
    region == "Intermountain")

boxplot(visitors ~ year, data = subdat) # the basic boxplot, with data provided as a formula

# An alternative way to feed the data as a formula
# boxplot(subdat$visitors ~ subdat$year) # no need for `data` argument

# Save this boxplot to `p_box` so we can look at the underlying structure
p_box <- boxplot(visitors/1000 ~ year, # so now, 200,000 will be shown as 200 on the y-axis
        data = subdat,
        main = "National Monuments in the Intermountain Region", # plot title
        xlab = "Year", # x-axis title
        ylab = "Number of visitors, in 1000's") # y-axis title
p_box

# Subset the data
arch_dat <- subset(viz_dat, unit_code == "ARCH")

# Visitors vs. gas price
plot(visitors/1000 ~ gas_constant, 
     data = arch_dat, 
     pch = 19, # change point shape to a solid circle
     cex = 1.5, # increase point size
     main = "Annual Gas Prices vs Visitor Counts at Arches National Park, 2000 - 2015", 
     xlab = "Gas price (constant 2015 dollars/gallon)", 
     ylab = "Number of visitors, in 1000's")

# Visitors vs. year
plot(visitors/1000 ~ year, data = arch_dat, pch = 19, cex = 1.5, main = "Visits to Arches National Park, 2000 - 2015", xlab = "Year", ylab = "Number of visitors, in 1000's")
# Gas price vs. year
plot(gas_constant ~ year, data = arch_dat, pch = 19, cex = 1.5, main = "Average National Gas Prices, 2000 - 2015", xlab = "Year", ylab = "Gas price (constant 2015 dollars/gallon)")

# Line graph
plot(visitors/1000 ~ year, data = arch_dat, type = "o", main = "Visits to Arches National Park, 2000 - 2015", xlab = "Year", ylab = "Number of visitors, in 1000's") # can use `type = "o"` to get line plots with points


viz_dat <- readRDS("Data/viz_dat.RDS")

arch_dat <- subset(viz_dat, unit_code == "ARCH") # subset of data that is for Arches National park

# This section requires the `viz_dat` and `arch_dat` data frames created earlier in the module. If you don't have these data frames in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument

arch_dat <- subset(viz_dat, unit_code == "ARCH") # subset of data that is for Arches National park

# Packages used in this section
library(tidyverse)
library(ggthemes)

# NOTES:
# 1. The first argument provided to the `ggplot()` function is assumed to be the data frame, so it's not necessary to include the argument name `data =`.
# 2. We are assigning the plot to the name `p` so we can build on this `ggplot()` master template in the next step.
# 3. The parentheses around the line of code are a shortcut for `print()`. Without them, the plot would be assigned to `p` but not printed in the plots pane.

(p <- ggplot(data = arch_dat, aes(x = year, y = visitors/1000)))

summary(p) # summary of the information contained in the plot object

p$data # the plot object is self-contained. The underlying data frame is saved as a list element in the plot object. 
# NOTE: We combine ggplot object layers with `+`, not with `%>%`. With a ggplot object we are not piping sequential results but rather layering pieces. In this example we are building our ggplot object in a particular order, but we could actually reorder the pieces in any way (as long as the master template is set first). The only difference would be that if different layers have conflicting information, information in later layers overrides information in earlier layers.

p + geom_line()
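# Quick illustration of the layering NOTE above (a sketch): when two layers provide
# conflicting information, the later layer wins -- here the second title overrides the first
p + geom_line() + labs(title = "First title") + labs(title = "This title wins")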
# Example of setting aesthetic mappings within the layer only
ggplot() +
  geom_line(data = arch_dat, aes(x = year, y = visitors/1000))
# This works because we had defined default data and aesthetics in `p`, so we don't need to provide additional information to the geometry layers
p + geom_line() + geom_point() 
# NOTE: in contrast, the version below errors, because the data and aesthetics were set only inside geom_line(), so geom_point() has nothing to inherit
ggplot() +
  geom_line(data = arch_dat, aes(x = year, y = visitors/1000)) +
  geom_point()
p + layer(geom = "line", stat = "identity", position = "identity") # this creates the same line graph we had created using `geom_line()`
# NOTES:
# 1. Our plot code is getting long, so we will separate out the components to improve readability
# 2. We will assign this plot object to the name `p2` so we can just add to `p2` in the next step instead of re-typing all these lines
# 3. `ggplot2` also provides shortcuts for some scaling arguments. For example, if we just want to set y-axis limits we can add the layer `ylim (600, 1500)` instead of setting limits via `scale_y_continuous()`

# Check out `?scale_y_continuous` for ways to modify the y-axis
(p2 <- p + 
  geom_line() + 
  geom_point() +
  scale_y_continuous(
    name = "Number of visitors, in 1000's", # y-axis title
    limits = c(600, 1500), # minimum and maximum values for y-axis
    breaks = seq(600, 1500, by = 100)) # label the axis at 600, 700, 800... up to 1500
)
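# The shortcut mentioned in NOTE 3 (a sketch): set only the y-axis limits with ylim(),
# without the custom breaks or axis title used in `p2` above
p + geom_line() + geom_point() + ylim(600, 1500)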

# NOTE: See `?ggtitle`, `?xlab`, and `?ylab` for alternative ways to set plot labels

(p3 <- p2 + 
  labs(
    title = "Visitor Counts at Arches National Park, 2000 - 2015", # plot title
    subtitle = "Visits to Arches National Park have increased each year since 2004", # plot subtitle
    x = "Year") # x-axis title. We had already set the y-axis title with `scale_y_continuous()`.
 )
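# Alternative labeling with the helpers mentioned in the NOTE above (a sketch; produces
# the same labels as `p3`)
p2 +
  ggtitle("Visitor Counts at Arches National Park, 2000 - 2015",
          subtitle = "Visits to Arches National Park have increased each year since 2004") +
  xlab("Year")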
p3 + 
  theme_minimal() # try out different themes! theme_classic(), theme_void(), theme_linedraw()... 
p3 + 
  ggthemes::theme_fivethirtyeight()

p3 + 
  ggthemes::theme_fivethirtyeight() +
  theme(
    plot.subtitle = element_text(color = "red", size = 14, hjust=0.5)) # change plot subtitle font color and size, and center the subtitle
# NOTES:
# 1. For this plot we need to use the `viz_dat` dataset so we can facet by park unit name
# 2. In the `facet_wrap` arguments, `ncol = 1` sets a single column so plots are stacked on top of each other and `scales = "free_y"` allows the y-axis range to differ across the subplots (this is useful if the range of y-axis values is very different across subplots and we are more interested in comparing trends than in comparing absolute numbers among the subplots)

subdat <- subset(viz_dat, unit_code %in% c("ARCH", "GRCA", "YOSE")) # plot data for Arches National Park (ARCH), Grand Canyon National Park (GRCA), and Yosemite National Park (YOSE)

ggplot(subdat, aes(x = year, y = visitors/1000)) + 
  geom_line() + 
  geom_point() +
  labs(
    title = "Annual Visitor Counts at Three National Parks, 2000 - 2015",
    x = "Year", 
    y = "Number of visitors, in 1000's") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5)) + # center the plot title
  facet_wrap(vars(unit_name), ncol = 1, scales = "free_y")

library(kableExtra)
viz_dat <- readRDS("Data/viz_dat.RDS")


# This section requires the `viz_dat` data frame created earlier in the module. If you don't have this data frame in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument
# Packages used in this section

# NOTE: This is a useful chunk of code that installs any missing packages and then loads all of them. Useful if we're sharing code with others who may not have some packages installed. Also increasingly useful if the list of packages required to run your script is somewhat long (I don't consider this a long list, but this code is handy to know.)

pkgs <- c("tidyverse",
          "ggthemes", # nice plot themes
          "GGally", # for stat_prop
          "RColorBrewer", # to display Brewer palettes
          "viridis", # for colorblind-friendly palettes
          "scales", # breaks_pretty(), etc.
          "plotly", # to make plots interactive
          "patchwork") # for page layout

installed_pkgs <- pkgs %in% installed.packages() # check which packages are not yet installed
if (length(pkgs[!installed_pkgs]) > 0) install.packages(pkgs[!installed_pkgs],dep=TRUE) # if some packages are not yet installed, go ahead and install them...
lapply(pkgs, library, character.only = TRUE) # ...then load all the packages and their dependencies

# Subset the data, only keeping data for the year 2015
# To improve readability, show visitor count as number of visitors in 1000's. This conversion can be done during initial data formatting, or calculated directly in the `ggplot()` call. I'm doing the conversion in the data wrangling stage so I don't have to keep retyping it in the plot iterations.

dat_2015 <- subset(viz_dat, year == 2015) %>% 
  dplyr::mutate(visitors_k = visitors/1000)

head(dat_2015) # always check the data before plotting

# Basic histogram
ggplot(dat_2015, aes(x = visitors_k)) + # specify the variable to plot
  geom_histogram()

# Compare the plot to the data to make sure things look about right

# NOTES: 
# 1. We can set aesthetics to constant values by excluding the `aes()` function. For example, `aes(fill = unit_type)` groups and fills histograms based on the `unit_type` variable (different fill color for each unit type). In contrast, `fill = "red"` creates a single histogram that is red.
# 2. Instead of adding the `fill` aesthetic to the master template, we could have added it directly in the `geom_histogram()` layer

ggplot(dat_2015, aes(x = visitors_k, fill = unit_type)) + 
  geom_histogram(alpha = 0.3, color = "black") # histograms for the different unit types overlap, so we apply transparency (alpha = 0.3) and a black outline to make the bars easier to distinguish
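# For contrast with NOTE 1 above: setting fill to a constant value (outside `aes()`)
# gives a single red histogram with no grouping
ggplot(dat_2015, aes(x = visitors_k)) + 
  geom_histogram(fill = "red")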
# Facet wrap
(hist1 <- ggplot(dat_2015, aes(x = visitors_k)) + 
  geom_histogram() +
  facet_wrap(vars(unit_type), ncol = 1)) # facet by unit type, put all plots in a single column

# Now try using `facet_grid()` instead to see what the difference is
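# The facet_grid() comparison suggested above (a sketch): the panels end up side by
# side in one row rather than stacked in one column
ggplot(dat_2015, aes(x = visitors_k)) + 
  geom_histogram() +
  facet_grid(cols = vars(unit_type))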
# Density plot
ggplot(dat_2015, aes(x = visitors_k)) + 
  geom_density() + # <<< only changed this line
  facet_wrap(vars(unit_type), ncol = 1)

# Dot plot
ggplot(dat_2015, aes(x = visitors_k)) + 
  geom_dotplot() + # <<< changed this line
  facet_wrap(vars(unit_type), ncol = 1)
View(ggplot_build(hist1)$data[[1]]) 
head(ggplot_build(hist1)$data[[1]], 7)  %>%
  knitr::kable(align = "c") %>%
  kableExtra::kable_styling(full_width = F)

(hist2 <- ggplot(dat_2015, aes(x = visitors_k)) +
   geom_histogram(binwidth = 200) + # each bin now spans 200k visitors
   scale_x_continuous(breaks = seq(0, 11000, by = 2000)) + # label the x-axis from 0 to 11,000 at intervals of 2000
   facet_wrap(vars(unit_type), ncol = 1)
)

# Use `ggplot_build()` to see the new histogram bins
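# As suggested above (a sketch), peek at the new bins for `hist2`
head(ggplot_build(hist2)$data[[1]], 7)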

# This is not the only way, but we are going to add new data columns that include the unit type median and a facet label. 
# For alternative ideas, check out the `labeller` argument in `facet_wrap()`

dat_2015_label <- dat_2015 %>%
              group_by(unit_type) %>% 
              mutate(N = n(),
                     type_med = round(median(visitors_k)),
                     type_label = paste0(unique(unit_type),"  (N = ", N, " NPS units; median = ", type_med, "k visitors)")) # `paste` and `paste0` are nice functions for combining strings and variables
                 
# View(dat_2015_label) # look at the new columns    
ggplot(dat_2015_label, aes(x = visitors_k)) +
  geom_histogram( binwidth = 200) +
  scale_x_continuous(breaks = seq(0, 11000, by = 2000)) +
  geom_vline(aes(xintercept = type_med)) + # add median lines
   facet_wrap(vars(type_label), ncol = 1) # we are faceting by `type_label`, which splits the data out the same way as `unit_type`. This is not the only way to customize facet labels.
# Final histogram
(hist3 <- ggplot(dat_2015_label, aes(x = visitors_k)) +
    geom_histogram(fill = "lightblue", color = "black", binwidth = 200) +
    scale_x_continuous(breaks = seq(0, 11000, by = 2000)) +
    geom_vline(aes(xintercept = type_med), color = "red", size = 1, linetype = "dashed") +
    labs(x = "Number of visitors, in 1000's", y = "Number of NPS units", title = "Distribution of Visitor Counts (Yr 2015)",
       subtitle = "Red dashed line shows median count") +
   theme(plot.title = element_text(hjust = 0.5),
         plot.subtitle = element_text(hjust = 0.5)) +
 facet_wrap(vars(type_label), ncol = 1)
 )
# Combine two histograms in a single graphic
(hist_page <- hist1 + hist3 +
  patchwork::plot_layout(ncol = 2, widths = c(1, 2)) + 
  patchwork::plot_annotation(
    title = "Basic (A) and Customized (B) Histograms",
    tag_prefix = "(",
    tag_levels = "1",
    tag_suffix = ")"
)
)
# geom_bar is designed to make it easy to create bar charts that show
# counts (or sums of weights)
g <- ggplot(mpg, aes(class))
# Number of cars in each class:
g + geom_bar()
# Total engine displacement of each class
g + geom_bar(aes(weight = displ))
# Map class to y instead to flip the orientation
ggplot(mpg) + geom_bar(aes(y = class)) 

# NOTE: We can also use `coord_flip()` to flip the axes
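# The coord_flip() alternative mentioned in the NOTE above (a sketch)
g + geom_bar() + coord_flip()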

# Stacked bar chart, using the handy `weight` argument
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type)) +
  geom_bar(aes(weight = visitors)) + # we don't need to divide visitors by 1000 for this one because we will be converting to proportions anyway
  facet_grid(cols = vars(region))
# NOTE: Setting the `position` argument to "dodge" would change the stacked bar to a side-by-side bar chart
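# Side-by-side ("dodged") version mentioned in the NOTE above (a sketch)
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type)) +
  geom_bar(aes(weight = visitors), position = "dodge") +
  facet_grid(cols = vars(region))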

# Set the `position` argument to "fill" to convert the counts to proportions
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type)) +
  geom_bar(aes(weight = visitors), position = "fill") + # <<< changed this line
  facet_grid(cols = vars(region))

# These are colorblind-friendly palettes in `RColorBrewer`
RColorBrewer::display.brewer.all(n=NULL, type="all", select=NULL, exact.n=TRUE, colorblindFriendly=TRUE)

ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type)) +
  geom_bar(aes(weight = visitors), position = "fill") +
  scale_fill_brewer(palette = "Set2") + # pick a palette to apply
  facet_grid(cols = vars(region))

# Google search a cheat sheet of "R ggplot2 colors" to see hundreds of named colors to choose from. Manually set fill for each unit type by replacing `scale_fill_brewer()` with:

scale_fill_manual(values = c("National Historic Site" = "cadetblue", "National Monument" = "mediumpurple2", "National Park" = "burlywood1"))

# A colorblind-friendly palette, with colors identified by hexadecimal code instead of by name (you can replace these with color names)
my_colors <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

scales::show_col(my_colors) # the `scales` package has a `show_col()` function to display the colors corresponding to a hexadecimal code

# To use your custom palette, replace `scale_fill_brewer()` with:
  scale_fill_manual(values = my_colors) # make sure the number of colors in your custom palette is greater than or equal to the number of levels in your grouping variable (in our case, 3 unit types, so the unit types will be assigned the first 3 colors in the custom palette, in alphabetical or factor-level order)
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type, by = as.factor(year))) + # <<< this `by =` argument is required for the `stat` argument in `geom_text()` to calculate proportion
  geom_bar(aes(weight = visitors), position = "fill") + 
  geom_text(stat = "prop", fontface = "bold", size = 3.5, aes(weight = visitors), position = position_fill(0.5)) + # <<< this line creates the bar segment labels. `stat = "prop"` requires package `GGally`
  scale_fill_brewer(palette = "Set2") +
  facet_grid(cols = vars(region))
# Assign the final bar chart to the name 'final_bar' so we can save it as a .pdf file in the next step
(final_bar <- 
   ggplot(subset(viz_dat, year %in% c(2000, 2015)),
       aes(x = as.factor(year), fill = unit_type, by = as.factor(year))) +
   geom_bar(aes(weight = visitors), position = "fill") + 
   labs(fill = "NPS unit type", title = "Proportion of visits in each NPS unit type and region", subtitle = "Comparison of Years 2000 & 2015", y = "% of visits", x = "Year") +
   scale_y_continuous(labels = scales::percent) + # <<< this line converts proportion to % on the y-axis
   geom_text(stat = "prop", fontface = "bold", size = 3.5, aes(weight = visitors), position = position_fill(0.5)) + # <<< this line creates the bar segment labels. `stat = "prop"` requires package `GGally`
   scale_fill_brewer(palette = "Set2") +
   ggthemes::theme_pander() +
   theme(plot.margin = margin(rep(15, 4)), # increase white margin around plot. This particular theme needs it, others may not.
         plot.title = element_text(hjust = 0.5),
         plot.subtitle = element_text(hjust = 0.5)) +
   facet_grid(cols = vars(region))
)

# And here is the code to save it as a .pdf file
# ggsave(filename = "final_bar.pdf", plot = final_bar) 

# NOTE: `ggsave()` uses the file extension to guess at the file type to save as. There are function arguments to set figure width and height (and the associated measurement unit). The default is to save the image based on how we manually sized it in the RStudio plots pane. When we save the image this way, we will see a message in the console about the width and height of the saved image. For reproducibility, it's usually a good idea to hard code the output figure dimensions (if you plan to reuse or share a script)
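# Same save call with hard-coded dimensions, as recommended in the NOTE above
# (a sketch; adjust width and height to suit your layout)
# ggsave(filename = "final_bar.pdf", plot = final_bar, width = 10, height = 6, units = "in")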

# Subset the data
line_dat <- viz_dat %>%
  dplyr::filter(region != "Southeast") %>%
  dplyr::mutate(visitors_k = visitors/1000)

ggplot(line_dat, aes(x = year, y = visitors_k)) +
  geom_line() +
  facet_grid(rows = vars(unit_type), cols = vars(region))
# NOTES: 
# 1. If we had mapped unit_code to a color or linetype aesthetic we would not have needed to separately specify the group aesthetic
# 2. `ggplotly` doesn't support subtitles created in a ggplot object. To add subtitles to a ggplotly object, we can hack the `ggplotly` title text feature, or use `ggplotly` annotations (but then we have to specify position of text on plot)

(final_line <- ggplot(line_dat, aes(x = year, y = visitors_k, group = unit_code)) + # the group aesthetic tells ggplot to draw a separate line per unit; mapping unit_code to color or linetype instead would set the group automatically
  geom_line() + 
  labs(x = "Year", y = "Number of visitors, in 1000's") +
  scale_y_continuous(breaks = scales::breaks_pretty()) +
  facet_grid(rows = vars(unit_type), cols = vars(region)) + # facet unit types as rows and regions as columns; if we set `scales = "free_y"`, each unit type (row) would get its own y-axis range, which helps decide what should be a facet row vs. column
  ggthemes::theme_fivethirtyeight(base_size = 14) +
    theme(plot.margin=unit(c(2,1,1,1), 'cm')) # needed extra space at top to fit ggplotly hack subtitle
  ) %>%
  ggplotly() %>%
  layout(title = list(
    text = paste0("Trends in visitor counts (in 1000's), by NPS unit type and region", '<br>', '<sup>', 'Each line represents an NPS unit', '</sup>'))) 
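# Sketch of NOTE 1 above: mapping unit_code to the color aesthetic also sets the group,
# so each unit gets its own line without an explicit `group =`
ggplot(line_dat, aes(x = year, y = visitors_k, color = unit_code)) +
  geom_line() +
  facet_grid(rows = vars(unit_type), cols = vars(region)) +
  theme(legend.position = "none") # too many units for a readable legend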


# Subset the data, only keeping data for National Parks
# Instead of showing visitors in 1000's we will show visitors in millions because annual National Park visitor counts are generally higher
tab_visits <- viz_dat %>%
  dplyr::filter(unit_type == "National Park") %>%
  dplyr::mutate(visitors_m = visitors/1000000,
                year_f = as.factor(year)) 

head(tab_visits) # always check the data before plotting

# Starting plot
ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) + 
  geom_tile()

# Try setting `x = year` instead of `x = year_f`, to see what changes in the heat map
# Compare the plot to the data to make sure things look about right
ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) + 
  geom_tile(colour = "white", size = 0.3) + # <<< add a white border, slightly thicker than the default
  scale_fill_viridis_c() # <<< colorblind-friendly palette with a larger range of colors
ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) +
  geom_tile(colour = "white", size = 0.3) +
  scale_fill_viridis_c(direction = -1) + # <<< Reverse color order
  geom_text(aes(label=round(visitors_m, 1)), size = 3) # <<< add the cell labels, rounded to one digit. Adjust font size.
ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) + 
  geom_tile(colour = "white", size = 0.3) +
  scale_fill_viridis_c(direction = -1) + 
  geom_text(aes(label=round(visitors_m, 1), color = visitors_m > 7), size = 3) + # <<< make color conditional on value. Choose a value that makes sense for the color palette you use.
  scale_color_manual(values = c("black", "white")) # set color of TEXT as FALSE(0) = black (for <=7) and TRUE(1) = white (for >7)

tab_visits <- viz_dat %>%
  dplyr::filter(unit_type == "National Park") %>%
  dplyr::mutate(
    visitors_m = visitors/1000000,
    year_f = as.factor(year),
    unit_name_abbrev = gsub("National Park", "NP", unit_name)) # <<<< change to "NP". Could have also used `sub()`

# head(tab_visits) # Check the results

(final_heat <- ggplot(tab_visits, aes(y = unit_name_abbrev, x = year_f, fill = visitors_m)) + 
  geom_tile(colour = "white", size = 0.3) +
  scale_y_discrete(limits=rev) + # <<<< reverse y-axis order
  scale_fill_viridis_c(direction = -1, name = "Visitors, in millions", breaks = seq(2,10, by = 2)) + # <<< add legend name, set legend breaks
  geom_text(aes(label=round(visitors_m, 1), color = visitors_m > 7), size = 3, show.legend = FALSE) + # <<< turn off text legend. Other ways to do it include `guides(color = "none")`
  scale_color_manual(values = c("black", "white")) +
  labs(x = "Year", y = "NPS Unit", title = "Trends in Annual Visitor Counts by NPS Unit, 2000 - 2015", subtitle = "Higher visitor counts shown as darker cells") + 
  theme_minimal(base_size = 12) + # <<< my theme of choice
  theme(legend.position = "top", # <<<< move legend to top
        legend.key.width = unit(2.5, "cm")) + # <<< make legend longer
  ggforce::facet_col(vars(region), scales = "free_y", space = "free") # <<< fix facet format. Could also do facet_grid (without `ggforce`) but I like the fact that `facet_col` keeps the facet labels on top
)

#------------------------
#        Day 3- GIS
#------------------------
library(sf) # for working with simple features
library(tmap) # for mapping spatial data
library(leaflet) # Another package for mapping spatial data
install.packages('terra', repos='https://rspatial.r-universe.dev')
#install.packages(c("sf", "spsurvey"))
library(dplyr) # for filter, select, mutate and %>%  
library(sf)
library(tmap)
library(spsurvey)

# Pick a random number to use as the seed: run `sample(1:100000, 1)` once in the console
# (it returned 62051), then hard-code that value so the GRTS draw below is reproducible
set.seed(62051)

# Read in shapefiles from teams data folder
sara_bound1 <- st_read('./data/SARA_boundary_4269.shp')
sara_veg1 <- st_read('./data/SARA_Veg.shp')

# Check that the projections match; fix the one that doesn't match
st_crs(sara_bound1) == st_crs(sara_veg1) # FALSE- projections don't match.
# sara_bound1 needs to be reprojected to UTM Zone 18N NAD83. 
sara_bound <- st_transform(sara_bound1, crs = 26918)
st_crs(sara_bound) == st_crs(sara_veg1) # TRUE

# Quick plot of first column in attribute table
plot(sara_bound[1])
plot(sara_veg1[1]) # bigger extent than boundary

# Intersect boundary and veg so they cover the same extent
sara_veg <- st_intersection(sara_veg1, sara_bound)
plot(sara_veg[1])
# View attribute table of layers
head(sara_bound) # 1 feature with 95 fields

str(sara_veg)
head(sara_veg)
names(sara_veg)
table(sara_veg$ANDERSONII)

# Simplify vegetation types for easier plotting
dev <- c('1. Urban or Built-up Land', '11. Residential', 
         '12. Commercial and Services', '13. Industrial',
         '14. Transportation, Communications, and Utilities', 
         '17. Other Urban or Built-up Land')
crop <- c('21. Cropland and Pasture', 
          '22. Orchards, Groves, Vineyards, and Nurseries', 
          '31. Herbaceous Rangeland')
shrubland <- c('32. Shrub and Brush Rangeland')
forest_up <- c('41. Deciduous Forest Land', '42. Evergreen Forest Land', 
               '43. Mixed Forest Land')
forest_wet <- c('61. Forested Wetland')
open_wet <- c('62. Nonforested wetland', '62. Nonforested Wetland')
water <- c('5. Water', '51. Streams and Canals', '53. Reservoirs')
unk <- 'Unclassified'

# Create 2 fields in the veg attribute table: simp_veg, and fills
sara_veg <- sara_veg %>% 
  mutate(simp_veg = case_when(ANDERSONII %in% dev ~ 'Developed',
                              ANDERSONII %in% crop ~ 'Open field',
                              ANDERSONII %in% shrubland ~ 'Shrublands',
                              ANDERSONII %in% forest_up ~ 'Forest',
                              ANDERSONII %in% forest_wet ~ 'Forested wetland',
                              ANDERSONII %in% open_wet ~ 'Open wetland',
                              ANDERSONII %in% water ~ 'Open water',
                              ANDERSONII %in% unk ~ 'Unclassified',
                              TRUE ~ 'Unknown'),
         fill_col = case_when(simp_veg == 'Developed' ~ '#D8D8D8',
                              simp_veg == 'Open field' ~ '#f5f0b0',
                              simp_veg == 'Shrublands' ~ '#F29839',
                              simp_veg == 'Powerline ROW' ~ '#F9421D',
                              simp_veg == 'Forest' ~ '#55785e',
                              simp_veg == 'Forested wetland' ~ '#9577a6',
                              simp_veg == 'Open wetland' ~ '#c497d4',
                              simp_veg == 'Open water' ~ '#AFD0F2',
                              simp_veg == 'Unclassified' ~ '#ffffff'))

library(dplyr)
# Import data
wetdat <- read.csv(
  'https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv')

# Summarize so that each site x year combination has 1 row
wetsum <- wetdat %>% group_by(Site_Name, Year, X_Coord, Y_Coord) %>% 
  summarize(num_inv = sum(Invasive), num_prot = sum(Protected), 
            .groups = 'drop') # keeps dplyr quiet in console

# Check first 6 rows of output
head(wetsum)

# Convert dataframe to sf
wetsum_sf <- st_as_sf(wetsum, coords = c('X_Coord', 'Y_Coord'), crs = 26919) 
  # ACAD is UTM Zone 19N NAD83, hence the difference from SARA, which is in Zone 18N.

# Write sf to disk as shapefile
st_write(wetsum_sf, 'ACAD_wetland_data.shp')
sara_grts <- grts(sara_bound, n_base = 100) # generate 100 points within SARA boundary
sara_grts$sites_base$priority <- as.numeric(row.names(sara_grts$sites_base)) # add priority number (same as row.name)
# Spatial join
grts_veg <- st_join(sara_grts$sites_base, sara_veg %>% select(simp_veg))
names(grts_veg)

# Create list of forest habitats
sort(unique(sara_veg$simp_veg))
forest_veg <- c('Forest', 'Forested wetland')

# Filter out non-forest points
grts_forest <- grts_veg %>% filter(simp_veg %in% forest_veg)
nrow(grts_veg) # 100 points
nrow(grts_forest) # fewer points

grts_50 <- grts_veg %>% filter(priority <= 50)
nrow(grts_50)
# Creating list of simp_veg types and their fill colors for easier legend
for_legend <- unique(data.frame(simp_veg = sara_veg$simp_veg, fill_col = sara_veg$fill_col)) 

sara_map <- 
  # Vegetation map
  tm_shape(sara_veg, projection = 26918, bbox = sara_bound) +
  tm_fill('fill_col') +
  tm_add_legend(type = 'fill', labels = for_legend$simp_veg, 
                col = for_legend$fill_col, z = 3) +

  # Park boundary
  tm_shape(sara_bound) +
  tm_borders('black', lwd = 2) +
  tm_add_legend(type = 'line', labels = 'Park Boundary', col = 'black',
                z = 2)+
    
  # GRTS points  
  tm_shape(grts_forest) +
  tm_symbols(shape = 21, col = '#EAFF16', border.lwd = 0.5, size = 0.3) + 
  tm_text('priority', size = 0.9, xmod = -0.4, ymod = -0.4) +
  tm_add_legend(type = 'symbol', labels = 'GRTS points', shape = 21, 
                col = '#EAFF16', border.lwd = 1, size = 0.5, 
                z = 1) +
  
  # Other map features
  tm_compass(size = 2, type = 'arrow', 
             text.size = 1, position = c('left', 'bottom')) +
  tm_scale_bar(text.size = 1.25, position = c('center', 'bottom')) + 
  tm_layout(inner.margins = c(0.2, 0.02, 0.02, 0.02), # make room for legend
            outer.margins = 0,
            legend.text.size = 1.25,
            legend.just = 'right',
            legend.position = c('right', 'bottom'),
            title = 'Saratoga NHP GRTS points',
            title.position = c('center', 'top'))

tmap_save(sara_map, 'SARA_GRTS.png', height = 10.5, width = 8, 
          units = 'in', dpi = 600, outer.margins = 0)
sara_map2 <- 
  # Vegetation map
  tm_shape(sara_veg, projection = 26918, bbox = sara_bound) +
  tm_polygons('fill_col') + # A1: changed tm_fill() to tm_polygons()
  tm_add_legend(type = 'fill', labels = for_legend$simp_veg, 
                col = for_legend$fill_col, z = 3) +

  # Park boundary
  tm_shape(sara_bound) +
  tm_borders('red', lwd = 2) + # A2A: changed 'black' to 'red'
  tm_add_legend(type = 'line', labels = 'Park Boundary', col = 'red', # A2B
                z = 2)+
    
  # GRTS points  
  tm_shape(grts_forest) +
  tm_symbols(shape = 21, col = '#EAFF16', border.lwd = 0.5, size = 0.3) + 
  tm_text('priority', size = 0.9, xmod = -0.5, ymod = -0.5) +
  tm_add_legend(type = 'symbol', labels = 'GRTS points', shape = 21, 
                col = '#EAFF16', border.lwd = 1, size = 0.5, 
                z = 1) +
  
  # Other map features
  tm_compass(size = 2, type = 'arrow', 
             text.size = 1, position = c('left', 'bottom')) +
  tm_scale_bar(text.size = 1.25, position = c('center', 'bottom')) + 
  tm_layout(inner.margins = c(0.2, 0.02, 0.02, 0.02), 
            outer.margins = 0,
            legend.text.size = 1.25,
            legend.just = 'right',
            legend.position = c('right', 'bottom'),
            title = 'Saratoga NHP GRTS points',
            title.position = c('center', 'top'))
sara_map
knitr::include_graphics('SARA_GRTS.png')
# Load park tiles
NPSbasic = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSimagery = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSslate = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSlight = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'

# Set parktiles as default basemaps for all interactive maps
tmap_options(basemaps = c(Map = NPSbasic,
                          Imagery = NPSimagery,
                          Light = NPSlight,
                          Slate = NPSslate))

# Switch to interactive mode, and plot the sara_map
tmap_mode('view') # turn interactive mode on
sara_map
tmap_mode('plot') # return to default plot view
# Load library
library(leaflet)

# Load park tiles
NPSbasic = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSimagery = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSslate = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSlight = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'

# Reproject SARA layers
sara_bnd_wgs <- st_transform(sara_bound, crs = 4326)
grts_forest_wgs <- st_transform(grts_forest, crs = 4326)

# Calculate centroid for leaflet map view
sara_cent <- st_centroid(sara_bnd_wgs)$geometry
str(sara_cent) # -73.63459, 42.99543
sara_map_lf <-
  leaflet() %>% 
  setView(lng = -73.63, lat = 42.99, zoom = 13) %>% 
  # parktiles
  addTiles(group = 'Map',
           urlTemplate = NPSbasic) %>%
  addTiles(group = 'Imagery',
           urlTemplate = NPSimagery) %>%
  addTiles(group = 'Light',
           urlTemplate = NPSlight) %>%
  addTiles(group = 'Slate',
           urlTemplate = NPSslate) %>% 
  addLayersControl(map = .,
    baseGroups = c('Map', 'Imagery', 'Light', 'Slate'),
    options = layersControlOptions(collapsed = T)) %>% 
  # points
  addCircleMarkers(data = grts_forest_wgs,
                   lng = st_coordinates(grts_forest_wgs)[,1],
                   lat = st_coordinates(grts_forest_wgs)[,2],
                   label = grts_forest_wgs$priority,
                   labelOptions = labelOptions(noHide=T, textOnly = TRUE, 
                      direction = 'bottom', textsize = '12px'),
                   fillColor = '#EAFF16',
                   radius = 4,
                   stroke = FALSE, # turn off outline
                   fillOpacity = 1) %>% 
  # scale bar and settings
  addScaleBar(position = 'bottomright',
              options = scaleBarOptions(maxWidth = 10, metric = TRUE))

sara_map_lf

# Retrieve the latest water quality data from the full dataset
# Hard-coding the year works, but it must be updated in several places every year:
pH_2020 <- dplyr::filter(pH_Data_All, WaterYear == 2020)
DissolvedOxygen_2020 <- dplyr::filter(DO_Data_All, WaterYear == 2020)
SpecificConductance_2020 <- dplyr::filter(SpCond_Data_All, WaterYear == 2020)

# Better: store the current water year in one variable and reuse it below
current_water_year <- 2020

pH_Current <- dplyr::filter(pH_Data_All, WaterYear == current_water_year)
DissolvedOxygen_Current <- dplyr::filter(DO_Data_All, WaterYear == current_water_year)
SpecificConductance_Current <- dplyr::filter(SpCond_Data_All, WaterYear == current_water_year)
# Numeric vectors
numbers <- 1:10
other_numbers <- 6:15

# Character vectors
code_lang <- c("R", "C++", "Java", "Python", "C")
human_lang <- c("English", "Spanish", "Chinese", "Russian")
new_numbers <- c()  # Create a new NULL vector

new_numbers[1] <- 2 * numbers[1]
new_numbers[2] <- 2 * numbers[2]
new_numbers[3] <- 2 * numbers[3]
new_numbers[4] <- 2 * numbers[4]
new_numbers[5] <- 2 * numbers[5]
new_numbers[6] <- 2 * numbers[6]
new_numbers[7] <- 2 * numbers[7]
new_numbers[8] <- 2 * numbers[8]
new_numbers[9] <- 2 * numbers[9]
new_numbers[10] <- 2 * numbers[10]

new_numbers
new_numbers <- numbers * 2
new_numbers
numbers
other_numbers

other_numbers - numbers
other_numbers * numbers
ones <- rep(1, 6)  # Vector of 1's, length 6
ones

add_this <- c(1, 2)
this_one_breaks <- c(1, 2, 3, 4)

ones + add_this  # This works fine
ones + this_one_breaks  # This generates a warning
messy_data <- tibble::tibble(plot_id = c("1 ", " 2 ", " ", " 4", "5."),
                             year_monitored = c(2019, 2019, 2019, 2020, 2020))
messy_data
clean_data <- messy_data %>% dplyr::mutate(plot_id = trimws(plot_id),  # Get rid of leading and trailing spaces
                                           plot_id = gsub(pattern = ".", replacement = "", x = plot_id, fixed = TRUE)) %>%  # Replace the stray "." with nothing
  dplyr::filter(plot_id != "") %>%
  dplyr::mutate(plot_id = as.integer(plot_id))
clean_data
plot_id_vec <- messy_data$plot_id
plot_id_vec
plot_id_vec <- trimws(plot_id_vec)
plot_id_vec
plot_id_vec <- gsub(pattern = ".", replacement = "", x = plot_id_vec, fixed = TRUE)
plot_id_vec
id_is_valid <- plot_id_vec != ""
id_is_valid
plot_id_vec <- plot_id_vec[id_is_valid]
plot_id_vec
plot_id_vec <- as.integer(plot_id_vec)
if(...){
  "thing to do if TRUE"}
else{
  "thing to do if FALSE"
}
if (...) {
  "thing to do if TRUE"
} else {
  "thing to do if FALSE"
}
d <- c(1:4, 1:4 * 100, NA)

if (all(is.finite(d))) {
  "no NA, NaN, or Inf present"
} else {
  "NA present"
}
d <- c(1:4, 1:4 * 100)

if (any(is.na(d))) {
  "NA present"
} else if (all(d < 100)) {
  "all items less than 100"
} else if (any(d > 100)) {
  paste(sum(d > 100), "items more than 100")
} else {
  "I'm stumped"
}
d <- c(1:4, 1:4 * 100)

if (any(is.na(d))) {
  "NA present"
}
if (all(d < 100)) {
  "all items less than 100"
}
if (any(d > 100)) {
  paste(sum(d > 100), "items more than 100")
} else {
  "I'm stumped"
}
d <- c(1:4, 100*1:4)
d
q_logic <- d < (mean(d) / 10)
q_logic
ifelse(q_logic, yes = "verify", no = "OK") # `yes` is returned where the condition is TRUE, `no` where it is FALSE
q <- ifelse(d < (mean(d) / 10), "verify", "OK")
d <- c(1:4, 100*1:4, NA)
ifelse(d < mean(d, na.rm = T) / 10, "verify", "OK")
d <- c(1:4, 100*1:4, NA)
ifelse(is.na(d) == T, "Missing",
  ifelse(d < mean(d, na.rm = T) / 10, "verify", "OK")
)
d <- c(1:4, 100*1:4, NA)
dplyr::case_when(
  is.na(d) ~ "Missing",
  d < mean(d, na.rm = T) / 10 ~ "Verify",
  d > mean(d, na.rm = T) / 10 ~ "OK"
)
for(i in vec){ 
  do thing 1
  do thing 2
  ...
  append result to results of previous iterations
}
vec <- 1:9 # vector over which we will iterate
for (i in vec) { # for the numbers 1 through 9
  i # evaluated but not auto-printed inside a loop, so nothing shows
}
i # after the loop, `i` still holds the last value (9)
for (i in vec) { # for the number 1 through 9
         print(i+1)
}   
#alternatively predefine a container to receive the data
OutDat <- NULL # define an empty variable. 
for (i in 1:9) {
  OutDat[i] <- i + 1 # store that in the ith position of the variable OutDat
}
OutDat
library(magrittr)
library(ggplot2)

#create a list of file names that we will read in
fPaths <- paste0(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/",
  c(
    "APIS01_20548905_2021_temp.csv",
    "APIS02_20549198_2021_temp.csv",
    "APIS03_20557246_2021_temp.csv",
    "APIS04_20597702_2021_temp.csv",
    "APIS05_20597703_2021_temp.csv"
  )
)
fPaths
#read in and inspect the first element (for training purposes)
temp<-read.csv(fPaths[1], skip = 1, header = T) #hobo data has an extra line at the top we need to skip
head(temp)
temp<-read.csv(fPaths[1], skip = 2, header = F) #Set header to F so read.csv doesn't treat data as a header.
names(temp)[1:3]<-c("idx", "DateTime", "T_F") #let's just assign some column names really quick
head(temp)
str(temp) #look at data types
rm(temp) #do a little cleanup 
len<-length(fPaths) #calculate the number of iterations

SummaryData <- list() #Generate an empty list that we will put output into. Always rerun prior to loop so that it is empty.

for (i in 1:len) {
  # 1 Read in and generate the data to use
  # 1a extract the filename from a filepath and parse the metadata
  fName <- basename(fPaths[i])
  Site <- stringr::str_sub(fName, 1, 6)
  Year <- as.numeric(stringr::str_sub(fName, -13, -10))
  # 1b read in the data, change headers,
  d <- read.csv(fPaths[i], skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F"))
  # 2 summarize and 3 output the data
  SummaryData[[i]] <- data.frame( #output the data
    # 2 what to summarize for all files 
    Site = Site,
    Year = Year,
    FileName = fName,
    # Variable output
    # 2a What to summarize so long as there is 1 non-NA record
    if (any(is.numeric(d[, 3]))) { # check that the temperature column was read in as numeric (i.e., the file actually contains data)
      list(
        T_F_mean = mean(d$T_F, na.rm = T),
        T_F_min = min(d$T_F, na.rm = T),
        T_F_max = max(d$T_F, na.rm = T),
        nObs = nrow(d),
        Flag = NA
      ) 
      # 2b what to summarize if the data are missing.
    } else {
      list(
        T_F_mean = NA,
        T_F_min = NA,
        T_F_max = NA,
        nObs = nrow(d),
        nDays = NA,
        Flag = "Hobo File Present But No Data"
      )
    }
  )
}
dplyr::bind_rows(SummaryData)
OutPath <- paste0(getwd(), "/hobo_outputs/") # or put your preferred path here (we will use this later)
# set up some QA thresholds
TempUpper <- 40
TempLower <- (-40)

for (i in fPaths) {
  # 1 Read in/generate data
  fName <- basename(i)
  d <- read.csv(i, skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F"))

  # 2 Generate your output in this case, some QA/QC flagging
  d_com <- d %>%
    dplyr::mutate(
      # checks for too big of a temp change
      TChangeFlag = ifelse(c(abs(diff(T_F, lag = 1)) > 10, FALSE), "D10", NA),
      # flag if a temp is too high or too low
      Flag = dplyr::case_when(
        is.na(T_F) ~ "MISSING",
        T_F > TempUpper ~ "High",
        T_F < TempLower ~ "Low"
      )
    ) %>%
    # merge so you can have multiple flags applied
    tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "_", remove = TRUE, na.rm = TRUE)

  # 3 output the data
  dir.create(OutPath, showWarnings = F)
  write.csv(d_com[, c("DateTime", "T_F", "Flag")],
    paste0(OutPath, gsub(".csv", "_QC.csv", fName)),
    row.names = F, quote = F
  )
}
OutPath <- paste0(getwd(), "/hobo_outputs/") # or put your preferred path here (we will use this later)

for (i in fPaths) {
  #1 Read in/generate data
  fName <- basename(i)
  d <- read.csv(i, skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F")) %>%
    dplyr::mutate(., DateTime2 = as.POSIXct(DateTime, "%m/%d/%y %H:%M:%S", tz = "UCT"))
  #2 Do the thing you want to do, in this case make and save plots
  p <- ggplot(d, aes(x = DateTime2, y = T_F)) +
    geom_point(size = 0.5) +
    ggtitle(fName)
  #3 Output the results outside of the for loop
  dir.create(OutPath, showWarnings = F)
  ggsave(
    filename = gsub(".csv", "_plot.pdf", fName), 
    path = OutPath, device = "pdf",
    plot = p, width = 5, height = 5, units = "in"
  )
}
# Option 1: start with an empty variable and grow it with rbind() each iteration
nIter <- 10000
OutDat <- NULL # define an empty variable
system.time(for (i in 1:nIter) {
  d <- data.frame(x = i + 1, y = i)
  OutDat <- rbind(OutDat, d) # append this step's output to the previous steps' outputs
}) # 4.47 seconds on my system

# Option 2: append() also runs, but note the result is a list of columns, not a data frame
nIter <- 10000
OutDat <- NULL # define an empty variable
system.time(for (i in 1:nIter) {
  d <- data.frame(x = i + 1, y = i)
  OutDat <- append(OutDat, d) # append this step's output to the previous steps' outputs
}) # 4.93 seconds on my system

# Option 3: preallocate a data frame of the right size and fill it in as you go
nIter <- 10000 # number of iterations
OutDat <- data.frame(x = rep(NA, nIter), y = rep(NA, nIter)) # define a preallocated data frame
system.time(for (i in 1:nIter) {
  OutDat[i, ] <- data.frame(x = i + 1, y = i)
}) # 2.42 seconds on my system
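Another pattern worth knowing (a minimal sketch, not part of the timing comparison above, so the object names are just for illustration) is to collect each iteration's result in a pre-sized list and bind everything together once after the loop. This avoids growing any object inside the loop at all:

nIter <- 10000
OutList <- vector("list", nIter) # preallocate an empty list with one slot per iteration
system.time({
  for (i in 1:nIter) {
    OutList[[i]] <- data.frame(x = i + 1, y = i) # fill the ith slot
  }
  OutDat <- do.call(rbind, OutList) # bind all the pieces together once, at the end
})

This "bind once at the end" pattern is usually as fast or faster than preallocating the data frame, and it is the same idea behind the lapply() + do.call(rbind) approach shown below.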

str(mtcars)
head(mtcars)
apply(mtcars, MARGIN = 1, FUN = mean) # MARGIN = 1: calculate the mean of each row
apply(mtcars, MARGIN = 2, FUN = mean) # MARGIN = 2: calculate the mean of each column
fPaths
HoboList <- lapply(fPaths, read.csv, skip = 1, header = T) # read each hobo file into a list element
HoboList
HoboList <- lapply(HoboList, "[", , 1:3) # "[" is the subsetting function: the empty argument keeps all rows, 1:3 keeps the first 3 columns
HoboList
HoboList <- lapply(HoboList, setNames, c("idx", "DateTime", "T_F")) # fix the column names
HoboList
HoboList <- lapply(fPaths, read.csv, skip = 1, header = T) %>% # 1. read hobo data into a list
  lapply(., "[", , 1:3) %>% # 2. grab only the first 3 columns
  lapply(., setNames, c("idx", "DateTime", "T_F")) # 3. fix the column names
HoboSummary <- lapply(HoboList[-3], FUN = function(x) { # [-3] drops the blank file for now
  data.frame(
    Mean = mean(x$T_F, na.rm = T),
    Min = min(x$T_F, na.rm = T),
    Max = max(x$T_F, na.rm = T),
    nObs = nrow(x)
  )
})
do.call(rbind, HoboSummary) #equivalent to dplyr::bind_rows(HoboSummary)
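As an aside (a sketch that assumes purrr and dplyr, both installed with the tidyverse, are available), purrr::map_dfr() applies a function over a list and row-binds the results in one step, so it can stand in for the lapply() plus do.call(rbind) combination above:

HoboSummary_tbl <- purrr::map_dfr(HoboList[-3], function(x) { # again dropping the blank file
  data.frame(
    Mean = mean(x$T_F, na.rm = T),
    Min = min(x$T_F, na.rm = T),
    Max = max(x$T_F, na.rm = T),
    nObs = nrow(x)
  )
})
HoboSummary_tbl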
guess <- 0  # Initialize the guess variable
number <- sample(1:10, 1)  # Get our random number
guess_count <- 0  # Number of guesses so far

while (guess != number && guess_count < 3) {  # Keep looping if the guess is wrong and we haven't hit our 3 guesses yet
  guess_count <- guess_count + 1
  message <- paste0("Guess ", guess_count, ": what is the random number?  ")
  guess <- as.integer(readline(prompt = message))
  
  if (guess == number) {
    print("Woohoo! You won the game!")
  } else if (guess_count < 3) {
    print("Nope, try again.")
  } else {
    print("You lost :(")
  }
}
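Note that readline() only works in an interactive R session. If you want to test the same while-loop logic from a script (a sketch using a made-up vector of guesses, purely for illustration), you can swap readline() for a preset vector:

guesses <- c(4, 7, 2) # hypothetical guesses, for illustration only
number <- sample(1:10, 1)
guess <- 0
guess_count <- 0

while (guess != number && guess_count < 3) {
  guess_count <- guess_count + 1
  guess <- guesses[guess_count] # take the next preset guess instead of prompting
  if (guess == number) {
    print("Woohoo! You won the game!")
  } else if (guess_count < 3) {
    print("Nope, try again.")
  } else {
    print("You lost :(")
  }
}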

#effectively these are manual "breakpoints"

i <- fPaths[3] # code dev/debugging fragment: set i manually to one file path
# rm(i) # debug/dev fragment
SummaryData <- NULL # make sure this gets cleared before rerunning this code
# for(i in fPaths){
d <- read.csv(i, skip = 1, header = T)[1:3]
d # check output
fName <- basename(i)
# print(fName)#dev/debug frag
names(d) <- c("idx", "DateTime", "T_F")
# summarize the data
OutDat <- data.frame(
  FileName = fName,
  T_F_mean = mean(d$T_F, na.rm = T),
  T_F_min = min(d$T_F, na.rm = T),
  T_F_max = max(d$T_F, na.rm = T),
  nObs = nrow(d)
) # days
# Append new summary data to old summary data
SummaryData <- rbind(SummaryData, OutDat)
#  }

Meet the Instructors

We are the people who designed and led this week's training. We hope you found it useful, and that you keep at it!

Andrew Birch
WRD/IMD Water Quality Program Lead



Ellen Cheng
SER Quantitative Ecologist



Kate Miller
NETN/MIDN Quantitative Ecologist



Lauren Pandori
CABR Marine Biologist - MEDN



Thomas Parr
GLKN Program Manager



John Paul Schmit
NCRN Quantitative Ecologist



Sarah Wright
MOJN Data Scientist