The only thing you really need to do to prepare for R training is to install the latest version of R and RStudio. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed. Directions for installing R/RStudio are below. If you run into any problems, check the instructions in R for Data Science Sections 1.4.1 and 1.4.2.
NOTE: If you don’t have Administrative privileges on your computer, you will have to submit an IT HelpDesk Ticket (need to be on VPN or NPS network to access this link) to install R and RStudio, which could take a while. PLEASE PLAN AHEAD!
Even if you already have R or RStudio installed, please install the latest versions of both programs. R recently went through a major version change from 3.x to 4.x with some potential code-breaking changes. The latest versions are needed to ensure everyone’s code behaves the same way.
A number of packages are required to follow along with the Data Visualization session. Please try to install these ahead of time by running the code below, so you’re able to keep up.
packages <- c("tidyverse", "ggthemes", "GGally", "RColorBrewer",
"viridis", "scales", "plotly", "patchwork",
"sf", "tmap", "leaflet", "spsurvey")
install.packages(setdiff(packages, rownames(installed.packages())))
If folks are having trouble installing tmap due to an issue with one of its dependencies, terra, try running the following code, and then reinstall tmap.
install.packages('terra', repos='https://rspatial.r-universe.dev')
The training will take place over 4 half days. Each day will run from 10-2 MST via MS Teams. The hour before and the hour after training will also be covered by at least one trainer as office hours, in case there are questions that couldn’t be handled during training.
Each training session has three trainers assigned, two of which will be the main instructors. The third trainer will manage the chat. For most of the training, one of the trainers will share their screen to walk through the website and then demo with live coding. It will help a lot if you have 2 monitors, so you can have the screen being shared on one monitor and your own session of RStudio open on the other.
Four days is barely enough to scratch the surface of what you can do in R. Our goals with this training are to help you get beyond the initial learning curve that can be really challenging to climb on your own, and to expose you to some of the really useful things R can do for you in your work. The trainers put a lot of thought and time into designing this training. We did it because we enjoy coding in R and it has greatly increased efficiency and productivity in our jobs. We hope you have a similar experience as you start out on this R learning curve. As you learn more, please share your skills and code with others to help us build a larger community of R users in IMD who can learn from each other.
Finally, to help us improve this training for future sessions, please leave feedback in the training feedback form.
Welcome to the class!
Learning (or re-learning) how to program can feel like an intimidating learning curve. Congrats on taking the first step! Before we delve into technical topics, let’s talk a little about coding in general.
Errors and broken code are not only part of the learning process, they’re part of the coding process. If you think your code is running perfectly on the first try, you probably didn’t test it thoroughly enough. It can be frustrating and discouraging when you can’t get your code to work the way you want it to, but getting stuck doesn’t make you a bad programmer. Try to approach broken code as a puzzle and a learning opportunity.
As the saying goes,
Good coders write good code; great coders steal great code.
Google and Stack Overflow are great resources when you don’t know off the top of your head how to do something. And reach out to your colleagues too - there’s no reason to waste time writing code that someone else has already written! Just give credit where credit is due, and take some time to make sure you understand what the copied code does. And especially don’t hesitate to reach out to more experienced colleagues with questions.
R is a programming language that was originally developed by statisticians for statistical computing and graphics. R is free and open source, which means you will never need a paid license to use it, and anyone who wants to can view the underlying source code and suggest fixes and improvements. Since its first official release in 1995, R has remained one of the leading programming languages for statistics and data visualization, and its capabilities continue to grow.
Every programming language has strengths and weaknesses. We like R because:
It can generate reports, documents, and websites (e.g. with the rmarkdown package).
For more information on the history of R, visit the R Project website.
RStudio is what’s called an integrated development environment (IDE). When you install R, it comes with a simple user interface that lets you write and execute code. Writing code in this interface is kind of like doing word processing in Notepad: it’s simple and straightforward, but soon you’ll start wishing for more features. This is where RStudio comes in. RStudio makes programming in R easier by color coding different types of code, autocompleting code, flagging mistakes (like spellcheck for code), and providing many useful tools with the push of a button or key stroke (e.g. viewing help info).
When you open RStudio, you typically see 4 panes:
Source:
This is where the code gets written. When you create a new script or open an existing one, it displays here. In the screenshot above, there’s a script called bat_data_wrangling.R open in the source pane. Note that if you haven’t yet opened or created a new script, you won’t see this pane until you do.
The source pane will helpfully color-code your code to make it easier to read. It will also detect syntax errors (the coding equivalent of a grammar checker). These are flagged with a red “x” next to the line number and a squiggly line under the offending code. When you’re ready to run all or part of your script, place your cursor in a line (or highlight the lines you want to run) and press CTRL+ENTER, or click the Run button at the top of the source pane.
Console:
This is where the code actually runs. When you first open RStudio, the console will tell you the version of R that you’re running (should be R 4.1.0 or greater). We talked above about how you can run a script at the console. You can also type code directly into the console and run it that way. That code won’t get saved to a file, but it’s a great way to experiment and test out lines of code before you add them to your script in the source pane.
The console is also where errors appear if your code breaks. Deciphering errors can be a challenge sometimes, but Googling them is a good place to start.
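For example (a made-up snippet, not part of the exercises), asking R for an object that doesn’t exist yet produces an error like the one below. Copying the text of an error into a search engine will usually turn up an explanation.
nonexistent_object
# Error: object 'nonexistent_object' not found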
Environment/History/Connections: The Environment tab lists the objects (variables, data frames, etc.) currently loaded in R’s memory, and the History tab keeps a record of the code you have run.
Files/Plots/Packages/Help/Viewer: This pane includes a file browser, a Plots tab where figures appear, a list of installed packages, and the Help tab where documentation is displayed.
There are several settings in the Global Options that everyone should check to make sure we all have consistent settings. Go to Tools -> Global Options and follow the steps below.
If you do end up saving your workspace to an .RData file, you can type load(".RData") at the console the next time you open RStudio to restore everything.
Most other settings are whatever you prefer, including everything in the Appearance window.
When you’re working on a code project it’s important to stay organized. Keeping code, input data, and results together in one place will protect your sanity and the sanity of the person who inherits the project. RStudio projects help with this. Creating a new RStudio project for each new code project makes it easier to manage settings and file paths.
Before we create a project, take a look at the Console tab. Notice that at the top of the console there is a folder path. That path is your current working directory.
If you refer to a file in R using a relative path, for example ./data/my_data_file.csv, R will look in your current working directory for a folder called data containing a file called my_data_file.csv. Using relative paths is a good idea because the full path is specific to your computer and likely won’t work on a different computer. But there’s no guarantee that everyone has the same default R working directory. This is where projects come in.
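If you want to check this yourself, a couple of built-in functions are handy (the file path below is just the hypothetical example from above):
getwd()                                 # print your current working directory
file.exists("./data/my_data_file.csv")  # TRUE if R can find this file relative to the working directory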
Let’s make a project for this class. Click File > New Project. In the window that appears, select New Directory, then select New Project. You will be prompted for a directory name. This is the name of your project folder. For this class, call it imd-r-training-intro. Next, you’ll select what folder to keep your project folder in. Documents/R is a good place to store all of your R projects, but it’s up to you. When you are done, click on Create Project.
If you successfully started a project named imd-r-training-intro, you should see it listed at the very top right of your screen. As you start new projects, you’ll want to check that you’re working in the right one before you start coding. Take a look at the Console tab again. Notice that your current working directory is now your project folder. When you look in the Files tab of the bottom right pane, you’ll see that it also defaults to the project folder.
First we need to create a new R script file. Make sure you are working in the imd-r-training-intro project that you just created. Click on the New File icon in the top left corner. In the dropdown, select R Script. The source pane will appear with an untitled empty script. Go ahead and save it by clicking the Save icon (and make sure the Source on Save checkbox is deselected). Call it my_first_script.R.
We’ll start with something simple. Basic math in R is pretty straightforward and the syntax is similar to simply using a graphing calculator. Try it out! You can use the examples below or come up with your own. Even if you’re using the examples, try to actually type the code instead of copy-pasting - you’ll learn to code faster that way.
To run a single line of code in your script, place your cursor anywhere in that line and press CTRL+ENTER (or click the Run button in the top right of the script pane). To run multiple lines of code, highlight the lines you want to run and hit CTRL+ENTER or click Run.
Text following a hashtag/pound sign (#) will be ignored - use this to leave comments about what your code is doing. Commenting your code is one of the best habits you can form, and you should start now! Comments are a gift to your future self and anyone else who tries to use your code.
# By using this hashtag/pound sign, you are telling R to ignore the text that follows. This is useful for leaving notes or instructions for yourself, or someone else using your code
# try this line to generate some basic text and become familiar with where results will appear:
print("Hello, lets do some basic math. Results of operations will appear here")
## [1] "Hello, lets do some basic math. Results of operations will appear here"
# one plus one
1+1
## [1] 2
# two times three, divided by four
(2*3)/4
## [1] 1.5
# basic mathematical and trigonometric functions are fairly similar to what they would be in excel
# calculate the square root of 9
sqrt(9)
## [1] 3
# calculate the cube root of 8 (remember that x^(1/n) gives you the nth root of x)
8^(1/3)
## [1] 2
# get the cosine of 180 degrees - note that trig functions in R expect angles in radians
# also note that pi is a built-in constant in R
cos(pi)
## [1] -1
# calculate 5 to the tenth power
5^10
## [1] 9765625
Notice that when you run a line of code, the code and the result appear in the console. You can also type code directly into the console, but it won’t be saved anywhere. As you get more comfortable with R, it can be helpful to use the console as a “scratchpad” for experimenting and troubleshooting. For now, it’s best to err on the side of saving your code as a script so that you don’t accidentally lose useful work.
Occasionally, it’s enough to just run a line of code and display the result in the console. But typically the code we write is more complex than adding one plus one, and when we execute a line of code, we want to be able to store the result and use it later in the script. This is where variables come in. Variables allow you to assign a value (whether that’s a number, a data table, a chunk of text, or any other type of data that R can handle) to a short, human-readable name. Anywhere you put a variable in your code, R will replace it with its value when your code runs.
R uses the <- symbol for variable assignment. If you’ve used other programming languages (or remember high school algebra), you may be tempted to use = instead. It will work, but there are subtle differences between <- and =, so you should get in the habit of using <-.
# the value of 12.098 is assigned to variable 'a'
a <- 12.098
# and the value 65.3475 is assigned to variable 'b'
b <- 65.3475
# we can now perform whatever mathematical operations we want using these two variables without having to repeatedly type out the actual numbers:
a*b
## [1] 790.5741
(a^b)/((b+a))
## [1] 7.305156e+68
sqrt((a^7)/(b*2))
## [1] 538.7261
In the code above, we assign the variables a and b once. We can then reuse them as often as we want. This is helpful because we save ourselves some typing, reduce the chances of making a typo somewhere, and if we need to change the value of a or b, we only have to do it in one place.
Also notice that when you assign variables, you can see them listed in your Environment tab (top right pane). Remember, everything you see in the environment is just in R’s temporary memory and won’t be saved when you close out of RStudio.
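As an optional aside, you can also inspect or clean up that temporary memory from the console instead of the Environment pane:
ls()              # list the objects currently in your environment (e.g. "a" and "b")
rm(a)             # remove the object 'a' from the environment
# rm(list = ls()) # removes everything, much like the broom icon in the Environment pane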
All of the examples you’ve seen so far are fairly contrived for the sake of simplicity. Let’s take a look at some code that everyone here will make use of at some point: reading data from a CSV.
It’s hard to get very far in R without making use of functions. Think of a function as a black box that takes some kind of input (the argument(s)) from the user and outputs a result (the return value). Can you identify the functions in the above examples?
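To make the input/output idea concrete, here is a minimal sketch using the sqrt() function from the earlier examples: the argument is the number we pass in, and the return value is what we get back and can store in a variable.
result <- sqrt(16) # input (argument): 16; output (return value): 4
result
## [1] 4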
The read.csv() function is used to bring in a dataset from a CSV file, and it stores the data in one of the fundamental data structures in R: a data frame. read.csv() takes the file path to the CSV as input, and it outputs a data frame containing the data from the CSV.
We’ll get very familiar with data frames in this class, but for the moment just know that a data frame is a table of data with rows and columns. Data frames are typically organized with rows being records or observations (e.g. sampling locations, individual critters, etc.), and columns being variables that characterize those observations (e.g. species, size, date collected, x/y coordinates). Once you have read the data in, you can take a quick look at its structure by typing the name of the variable it’s stored in.
# read in the data from tree_growth.csv and assign it as a dataframe to the variable "tree_data"
tree_data <- read.csv("data/tree_growth.csv")
# display the 'tree_data' data frame we just created
tree_data
## Species DBH_in Height_ft Age_yr
## 1 arboreas madeupis 4.0 6 3
## 2 arboreas madeupis 6.0 10 5
## 3 arboreas madeupis 8.0 5 6
## 4 arboreas madeupis 11.0 7 6
## 5 arboreas madeupis 12.0 16 8
## 6 arboreas madeupis 13.0 19 8
## 7 arboreas madeupis 12.0 24 11
## 8 arboreas madeupis 20.0 22 11
## 9 arboreas madeupis 20.0 32 12
## 10 arboreas madeupis 11.0 31 13
## 11 arboreas madeupis 27.0 35 15
## 12 arboreas madeupis 25.0 33 15
## 13 arboreas madeupis 30.0 40 17
## 14 arboreas madeupis 35.0 67 18
## 15 arboreas madeupis 29.0 60 21
## 16 arboreas madeupis 37.0 65 30
## 17 arboreas madeupis 31.0 78 32
## 18 arboreas madeupis 38.0 76 36
## 19 arboreas madeupis 45.0 91 37
## 20 arboreas madeupis 55.0 83 38
## 21 arboreas madeupis 60.0 88 40
## 22 arboreas madeupis 75.0 86 43
## 23 arboreas madeupis 76.0 89 51
## 24 arboreas madeupis 80.0 75 55
## 25 arboreas madeupis 99.0 100 61
## 26 planteus falsinius 9.0 6 2
## 27 planteus falsinius 9.3 9 4
## 28 planteus falsinius 10.1 8 5
## 29 planteus falsinius 10.3 14 4
## 30 planteus falsinius 10.7 12 6
## 31 planteus falsinius 10.8 10 8
## 32 planteus falsinius 11.0 10 7
## 33 planteus falsinius 11.3 11 7
## 34 planteus falsinius 11.4 13 9
## 35 planteus falsinius 11.6 13 12
## 36 planteus falsinius 11.6 15 15
## 37 planteus falsinius 11.9 23 17
## 38 planteus falsinius 12.3 16 18
## 39 planteus falsinius 12.4 21 15
## 40 planteus falsinius 12.6 23 17
## 41 planteus falsinius 12.6 18 21
## 42 planteus falsinius 12.8 25 20
## 43 planteus falsinius 15.0 35 30
Compared to scrolling through a dataset in Excel, examining a dataset in R can feel unintuitive at first. Stick with it, though. Once you get used to exploring data in R, you’ll find that R will help you learn more about your dataset in a more efficient manner.
Let’s see what we can learn about the dataset as a whole. The most useful functions for this are:
head() - Show only the first few lines of the dataframe in the console. This function is useful for taking a quick glance to ensure that column names were properly assigned, and to take a quick look at column names without printing the whole dataset.
summary() - Show a basic statistical summary of the dataset.
str() - Show information about the structure of the dataset, including number of rows and columns, column data types, and the first few values in each column.
View() - In a new tab, display a spreadsheet-like view of the full dataset with options to sort and filter. Note that you can’t change the data in this view - it is read-only.
head(tree_data) # Show the first few lines of the dataframe. Defaults to showing the first 6 rows
## Species DBH_in Height_ft Age_yr
## 1 arboreas madeupis 4 6 3
## 2 arboreas madeupis 6 10 5
## 3 arboreas madeupis 8 5 6
## 4 arboreas madeupis 11 7 6
## 5 arboreas madeupis 12 16 8
## 6 arboreas madeupis 13 19 8
head(tree_data, n = 10) # Show the first 10 rows
## Species DBH_in Height_ft Age_yr
## 1 arboreas madeupis 4 6 3
## 2 arboreas madeupis 6 10 5
## 3 arboreas madeupis 8 5 6
## 4 arboreas madeupis 11 7 6
## 5 arboreas madeupis 12 16 8
## 6 arboreas madeupis 13 19 8
## 7 arboreas madeupis 12 24 11
## 8 arboreas madeupis 20 22 11
## 9 arboreas madeupis 20 32 12
## 10 arboreas madeupis 11 31 13
summary(tree_data) # View a basic statistical summary
## Species DBH_in Height_ft Age_yr
## Length:43 Min. : 4.00 Min. : 5.00 Min. : 2.00
## Class :character 1st Qu.:11.00 1st Qu.: 12.50 1st Qu.: 7.50
## Mode :character Median :12.60 Median : 23.00 Median :15.00
## Mean :24.78 Mean : 35.35 Mean :18.81
## 3rd Qu.:30.50 3rd Qu.: 62.50 3rd Qu.:25.50
## Max. :99.00 Max. :100.00 Max. :61.00
str(tree_data) # Get info about the structure of the data
## 'data.frame': 43 obs. of 4 variables:
## $ Species : chr "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" ...
## $ DBH_in : num 4 6 8 11 12 13 12 20 20 11 ...
## $ Height_ft: int 6 10 5 7 16 19 24 22 32 31 ...
## $ Age_yr : int 3 5 6 6 8 8 11 11 12 13 ...
View(tree_data) # Open a new tab with a spreadsheet view of the data
We’re going to take a little detour into data structures at this point. It’ll all tie back in to our tree data.
The data frame we just examined is a type of data structure. A data structure is what it sounds like: it’s a structure that holds data in an organized way. There are multiple data structures in R, including vectors, lists, arrays, matrices, data frames, and tibbles (more on this unfortunately-named data structure later). Today we’ll focus on vectors and data frames.
Vectors are the simplest data structure in R. You can think of a vector as one column of data in an Excel spreadsheet, where each element is one row of that column. Here are some examples of vectors:
digits <- 0:9 # Use x:y to create a sequence of integers starting at x and ending at y
digits
## [1] 0 1 2 3 4 5 6 7 8 9
is_odd <- rep(c(FALSE, TRUE), 5) # Use rep(x, n) to create a vector by repeating x n times
is_odd
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
shoe_sizes <- c(7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5)
shoe_sizes
## [1] 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5
favorite_birds <- c("greater roadrunner", "Costa's hummingbird", "kakapo")
favorite_birds
## [1] "greater roadrunner" "Costa's hummingbird" "kakapo"
Note the use of c(). The c() function stands for combine, and it combines elements into a single vector. It is a fairly universal way to combine multiple elements in R, and you’re going to see it over and over.
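A small sketch (with made-up bird names) showing that c() will also stitch existing vectors together into one longer vector:
more_birds <- c("burrowing owl", "common loon")
all_birds <- c(favorite_birds, more_birds) # combine two character vectors into one
length(all_birds)                          # 3 + 2 elements
## [1] 5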
Let’s play around with vectors a little more. We can use is.vector() to test whether something is a vector. We can get the length of a vector with length(). Note that single values in R are just vectors of length one.
is.vector(digits) # Should be TRUE
## [1] TRUE
is.vector(favorite_birds) # Should also be TRUE
## [1] TRUE
length(digits) # Hopefully this is 10
## [1] 10
length(shoe_sizes)
## [1] 10
# Even single values in R are stored as vectors
length_one_chr <- "length one vector"
length_one_int <- 4
is.vector(length_one_chr)
## [1] TRUE
is.vector(length_one_int)
## [1] TRUE
length(length_one_chr)
## [1] 1
length(length_one_int)
## [1] 1
In the examples above, each vector contains a different type of data. digits contains integers, is_odd contains logical (true/false) values, favorite_birds contains text, and shoe_sizes contains decimal numbers. That’s because a given vector can only contain a single type of data. In R, there are four data types that we will typically encounter:
Character - text values (e.g. "hello", "3", "R is my favorite programming language")
Numeric - decimal numbers (e.g. 23, 3.1415)
Integer - whole numbers. To make a number an integer, append L to it or use as.integer() (e.g. 5L, as.integer(30)).
Logical - true/false values (TRUE, FALSE). Note that TRUE and FALSE must be all-uppercase.
There are two more data types, complex and raw, but you are unlikely to encounter these so we won’t cover them here.
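Because a vector can only hold one data type, R will quietly convert (“coerce”) mixed values to the most flexible type involved. A quick sketch of what that looks like:
mixed <- c(1, 2, "three") # numbers and text mixed in one vector
class(mixed)              # everything was coerced to character
## [1] "character"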
You can use the class() function to get the data type of a vector:
class(favorite_birds)
## [1] "character"
class(shoe_sizes)
## [1] "numeric"
class(digits)
## [1] "integer"
class(is_odd)
## [1] "logical"
If you need to access a single element of a vector, you can use the syntax my_vector[x], where x is the element’s index (the number corresponding to its position in the vector). You can also use a vector of indices to extract multiple elements from the vector. Note that in R, indexing starts at 1 (i.e. my_vector[1] is the first element of my_vector). If you’ve coded in other languages, you may be used to indexing starting at 0.
second_favorite_bird <- favorite_birds[2]
second_favorite_bird
## [1] "Costa's hummingbird"
top_two_birds <- favorite_birds[c(1,2)]
top_two_birds
## [1] "greater roadrunner" "Costa's hummingbird"
Logical vectors can also be used to subset a vector. The logical vector must be the same length as the vector you are subsetting.
odd_digits <- digits[is_odd]
odd_digits
## [1] 1 3 5 7 9
Let’s revisit our trees data frame. We’ve explored the data frame as a whole, but it’s often useful to look at one column at a time. To do this, we’ll use the $ syntax:
tree_data$Species
## [1] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [4] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [7] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [10] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [13] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [16] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [19] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [22] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [25] "arboreas madeupis" "planteus falsinius" "planteus falsinius"
## [28] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [31] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [34] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [37] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [40] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [43] "planteus falsinius"
tree_data$Age_yr
## [1] 3 5 6 6 8 8 11 11 12 13 15 15 17 18 21 30 32 36 37 38 40 43 51 55 61
## [26] 2 4 5 4 6 8 7 7 9 12 15 17 18 15 17 21 20 30
You can also use square brackets [] to access data frame columns. You can specify columns by name or by index (an integer indicating the position of the column). It’s almost always best to refer to columns by name when possible - it makes your code easier to read, and it prevents your code from breaking if columns get reordered.
The square bracket syntax allows you to select multiple columns at a time and to select a subset of rows.
tree_data[, "Species"] # Same as tree_data$Species
## [1] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [4] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [7] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [10] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [13] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [16] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [19] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [22] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [25] "arboreas madeupis" "planteus falsinius" "planteus falsinius"
## [28] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [31] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [34] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [37] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [40] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [43] "planteus falsinius"
tree_data[, c("Species", "Age_yr")] # Data frame with only Species and Age_yr columns
## Species Age_yr
## 1 arboreas madeupis 3
## 2 arboreas madeupis 5
## 3 arboreas madeupis 6
## 4 arboreas madeupis 6
## 5 arboreas madeupis 8
## 6 arboreas madeupis 8
## 7 arboreas madeupis 11
## 8 arboreas madeupis 11
## 9 arboreas madeupis 12
## 10 arboreas madeupis 13
## 11 arboreas madeupis 15
## 12 arboreas madeupis 15
## 13 arboreas madeupis 17
## 14 arboreas madeupis 18
## 15 arboreas madeupis 21
## 16 arboreas madeupis 30
## 17 arboreas madeupis 32
## 18 arboreas madeupis 36
## 19 arboreas madeupis 37
## 20 arboreas madeupis 38
## 21 arboreas madeupis 40
## 22 arboreas madeupis 43
## 23 arboreas madeupis 51
## 24 arboreas madeupis 55
## 25 arboreas madeupis 61
## 26 planteus falsinius 2
## 27 planteus falsinius 4
## 28 planteus falsinius 5
## 29 planteus falsinius 4
## 30 planteus falsinius 6
## 31 planteus falsinius 8
## 32 planteus falsinius 7
## 33 planteus falsinius 7
## 34 planteus falsinius 9
## 35 planteus falsinius 12
## 36 planteus falsinius 15
## 37 planteus falsinius 17
## 38 planteus falsinius 18
## 39 planteus falsinius 15
## 40 planteus falsinius 17
## 41 planteus falsinius 21
## 42 planteus falsinius 20
## 43 planteus falsinius 30
tree_data[1:5, "Species"] # First five elements of Species column
## [1] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [4] "arboreas madeupis" "arboreas madeupis"
Now that we’ve learned about vectors, these data frame columns might look familiar. A data frame is just a collection of vectors of the same length, where each vector is a column.
is.vector(tree_data$Species)
## [1] TRUE
class(tree_data$Species)
## [1] "character"
is.vector(tree_data$Age_yr)
## [1] TRUE
class(tree_data$Age_yr)
## [1] "integer"
str(tree_data) # Check out the abbreviations next to the column names: chr, num, and int. These are the data types of the columns.
## 'data.frame': 43 obs. of 4 variables:
## $ Species : chr "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" ...
## $ DBH_in : num 4 6 8 11 12 13 12 20 20 11 ...
## $ Height_ft: int 6 10 5 7 16 19 24 22 32 31 ...
## $ Age_yr : int 3 5 6 6 8 8 11 11 12 13 ...
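Going the other direction, you can build a small data frame from vectors yourself, which makes the column-as-vector relationship easy to see. This is just an illustrative sketch with made-up values:
# each argument to data.frame() becomes a column, so the vectors must all be the same length
mini_trees <- data.frame(Species = c("arboreas madeupis", "planteus falsinius"),
                         DBH_in = c(4, 15))
str(mini_trees) # two observations of two variables: a character column and a numeric column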
R is great at working with vectors because they are the fundamental data structure of the language. Since data frame columns are vectors, they are typically easy to operate on.
Basic statistical summary functions aren’t too different from what you’d do in Excel.
#to calculate the mean of the 'age' column in the original dataframe:
mean(tree_data$Age_yr)
## [1] 18.81395
#or to calculate the mean of the DBH vector we created:
dbh <- tree_data$DBH_in
mean(dbh)
## [1] 24.78372
#like-wise for the median, standard deviation, and minimum:
median(tree_data$Age_yr)
## [1] 15
sd(tree_data$Age_yr)
## [1] 15.02261
min(tree_data$Age_yr)
## [1] 2
Another useful way to summarize a column is to get the unique values it contains. This is typically most useful with character columns (e.g. species) but works with any data type.
unique(tree_data$Species) # in this line, we are printing all of the unique species names in the dataset (in this case 2).
## [1] "arboreas madeupis" "planteus falsinius"
species_names <- unique(tree_data$Species) # assign unique species names to a variable
Summarizing data is useful, but we also want to be able to modify it. So let’s say we want to convert the ‘Height_ft’ column in tree_data from feet to inches. We’ll start by assigning a conversion factor to the variable in_per_ft.
in_per_ft <- 12 # Since there are 12 inches in a foot
We can then use this conversion factor to convert the ‘Height_ft’ column to inches, and place the converted values in a new column in the data frame. Notice below that if you assign a vector to a column that doesn’t exist yet, R will automatically create that column. We’ll use the head() function to verify that the Height_in column was created correctly.
# here we are specifying the new column on the left side of the '<-', and telling R what we want put into it on the right side of the '<-'.
tree_data$Height_in <- tree_data$Height_ft * in_per_ft
# we can now use the head() function to check our work, and make sure the proper conversion was carried out:
head(tree_data)
## Species DBH_in Height_ft Age_yr Height_in
## 1 arboreas madeupis 4 6 3 72
## 2 arboreas madeupis 6 10 5 120
## 3 arboreas madeupis 8 5 6 60
## 4 arboreas madeupis 11 7 6 84
## 5 arboreas madeupis 12 16 8 192
## 6 arboreas madeupis 13 19 8 228
It’s often easier to get a sense of our data by visualizing it, so let’s explore the basic tools for taking a first glance at our data.
We’ll start with a histogram. The code below uses the hist() function to generate a basic histogram of a specific column in the dataframe (remember from earlier that columns are accessed with the $ sign) - in this case, the ‘Age_yr’ column.
hist(x = tree_data$Age_yr)
Looking at the histogram, we can see that the age of most trees in the dataset fall in the 0-20 year old age class.
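hist() also accepts optional arguments for a title, axis labels, and colors if you want a more polished figure; the argument values below are just suggestions:
hist(x = tree_data$Age_yr,
     main = "Distribution of tree ages", # plot title
     xlab = "Age (years)",               # x axis label
     col = "darkgreen")                  # bar color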
The data frame also includes information on the DBH and height of the trees. What if we want to quickly take a look at how these variables relate to one another? To get a basic plot, you can use the plot() function. A simple call to plot() takes two arguments: x (a vector of x coordinates) and y (a vector of y coordinates). You can learn more about plot() by typing ?plot at the console.
# this should generate an ugly but effective scatterplot of tree height vs age. It would appear that older trees are taller!
plot(x = tree_data$Age_yr, y = tree_data$Height_ft)
# let's try the same, but with DBH as the dependent variable on the y axis:
# again, we should see a plot below showing that, even in a make-believe forest, older trees of both species tend to be thicker!
plot(x = tree_data$Age_yr, y = tree_data$DBH_in)
There are a number of options to get help with R. If you’re trying to figure out how to use a function, you can type ?function_name. For example ?plot will show the R documentation for that function in the Help panel.
?plot
?dplyr::filter
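If you don’t know the exact function name, a couple of other help shortcuts can be useful (assuming the package in question is installed):
??histogram              # search the help pages of all installed packages for a keyword
help(package = "dplyr")  # browse the full list of help topics for a package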
If you have a question, online resources such as Stack Exchange and Stack Overflow are extremely helpful. Google searches are a great first step. It helps to include “in R” in every search related to R code.
Don’t hesitate to reach out to colleagues for help as well! If you are stuck on something and the answers on Google are more confusing than helpful, don’t be afraid to ask a human. Every experienced R programmer was a beginner once, so chances are they’ve encountered the same problem as you at some point. There is an R-focused data science community of practice for I&M folks, which anyone working in R (regardless of experience!) is invited and encouraged to join. The Teams join code will be shared with course participants.
Goals of this Day
Disclaimer: While R provides lots of ways to check and clean up data, these tools are no replacement for sound data management that prevents data issues from happening in the first place.
What is a package? A package is a group of functions that someone has put together. They come with documentation that tells you about what each function does.
First, install the package:
install.packages("janitor")
To use a package, call it by using the function “library”
library('janitor')
This is called “loading a package” or “calling a package”.
Look at the vignette: take time to click on a function from the vignette list and walk through the helpful information it provides.
Did you get “error” text? Text often pops up in the console after you run code, and you must decide whether it means you have to change something or not. The message below, for example, is not an error - it just tells you that janitor provides its own versions of two functions from the stats package.
library('janitor')
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Challenge: install + call the package ‘tidyverse’
Solution: install + call the package ‘tidyverse’
install.packages("tidyverse")
library('tidyverse')
Data wrangling is all the steps you have to take to prepare your data for analysis or plotting. Data munging and data cleaning are other terms used to describe this process. The material presented today can be found in Hadley Wickham’s book “R for Data Science”, which is free and online: https://r4ds.had.co.nz/wrangle-intro.html
If you’re starting from this tab of the lesson, you’ll need to load two packages before proceeding.
library('janitor')
library('tidyverse')
Major Components to Data Wrangling
Sometimes you get data where the column names are messy. For an example of data that needs tidying before exploration and analyses, we’ll use fake data. Copy + paste the code chunk below into your script, and run it in the console to generate a dataframe called “messy_data”.
Here, we’ll clarify that we are using the word “data.frame” or “dataframe” to describe the data. The word “tibble” also shows up. We’ll use these terms interchangeably.
messy_data <- tibble(TrailName = c('Bayside', 'Coastal', 'Tidepools',
'Monument', 'Tidepools', NA),
Visitor_count = c(200, 300, 400, 500, 400, NA),
`empty row` = NA)
This data.frame captures the hypothetical number of visitors across four trails at a park. There are a few qualities that make it difficult to interpret. Seeing this data, what would you change to make it most consistent?
messy_data
## # A tibble: 6 x 3
## TrailName Visitor_count `empty row`
## <chr> <dbl> <lgl>
## 1 Bayside 200 NA
## 2 Coastal 300 NA
## 3 Tidepools 400 NA
## 4 Monument 500 NA
## 5 Tidepools 400 NA
## 6 <NA> NA NA
There are four things I suggest fixing with these data:
1. The column names are inconsistently formatted (TrailName, Visitor_count, and `empty row`).
2. There is an entirely empty column (`empty row`).
3. There is an entirely empty row (the last row is all NA).
4. There is a duplicate row (the Tidepools row appears twice).
The Janitor package has ways to fix all of these data situations and aid in data cleaning. Let’s proceed with changing the column names.
The clean_names() function in the janitor package will consistently format column names.
clean_data <- clean_names(messy_data)
clean_data
## # A tibble: 6 x 3
## trail_name visitor_count empty_row
## <chr> <dbl> <lgl>
## 1 Bayside 200 NA
## 2 Coastal 300 NA
## 3 Tidepools 400 NA
## 4 Monument 500 NA
## 5 Tidepools 400 NA
## 6 <NA> NA NA
Now, we’ll deal with the empty row and empty column. To remove empty rows or columns, we can use the remove_empty() function.
The arguments for the function are as follows:
remove_empty(
# first argument is the data
dat,
# the next specifies if you want to get rid of empty rows, columns, or both
which = c("rows", "cols"),
# this last one asks whether (TRUE) or not (FALSE) you want R to tell you which rows or columns were removed
quiet = TRUE)
Let’s give it a try
Removing empty rows
clean_data2 <- remove_empty(clean_data, which = c('rows'), quiet = FALSE)
clean_data2
## # A tibble: 5 x 3
## trail_name visitor_count empty_row
## <chr> <dbl> <lgl>
## 1 Bayside 200 NA
## 2 Coastal 300 NA
## 3 Tidepools 400 NA
## 4 Monument 500 NA
## 5 Tidepools 400 NA
Removing empty rows and columns
clean_data2 <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data2
## # A tibble: 5 x 2
## trail_name visitor_count
## <chr> <dbl>
## 1 Bayside 200
## 2 Coastal 300
## 3 Tidepools 400
## 4 Monument 500
## 5 Tidepools 400
What happens if you’d like to both clean_names() and remove_empty() on the same data?
Solution 1 - run as a sequence
clean_data <- clean_names(messy_data)
clean_data <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data
## # A tibble: 5 x 2
## trail_name visitor_count
## <chr> <dbl>
## 1 Bayside 200
## 2 Coastal 300
## 3 Tidepools 400
## 4 Monument 500
## 5 Tidepools 400
Solution 2 - use the pipe operator
Pipe operators (%>% or |>) can be read like the word “then”. For example, clean_names() %>% remove_empty() can be read as “clean names, then remove empty”. They allow for a cleaner workflow.
clean_data <- clean_names(messy_data) %>%
remove_empty(which = c('rows', 'cols'), quiet = TRUE)
clean_data
## # A tibble: 5 x 2
## trail_name visitor_count
## <chr> <dbl>
## 1 Bayside 200
## 2 Coastal 300
## 3 Tidepools 400
## 4 Monument 500
## 5 Tidepools 400
# note that replacing %>% with |> also works
Note that there is no data argument in the remove_empty() function. This is because, when using a pipe, the data argument is inherited from the prior function.
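In other words, the two calls below do the same thing; the pipe simply supplies the result from its left-hand side as the first argument of the next function. A small sketch using the clean_data object from above:
# without the pipe: the data is the first argument
remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
# with the pipe: the data argument is inherited from the left-hand side
clean_data %>% remove_empty(which = c('rows', 'cols'), quiet = TRUE)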
A final function we’ll discuss is distinct() from the dplyr package, which is part of the tidyverse. It removes duplicate rows from data.
clean_data <- clean_names(messy_data) %>%
remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
distinct()
clean_data
## # A tibble: 4 x 2
## trail_name visitor_count
## <chr> <dbl>
## 1 Bayside 200
## 2 Coastal 300
## 3 Tidepools 400
## 4 Monument 500
A word of caution - sometimes you may not want to remove duplicate rows, or may not want to remove empty rows, because they are important to the data. You know your data best. Use functions responsibly.
The code chunk below creates messy data, called “messy_data2”. Use functions from the Janitor package to tidy the data, and name the result “clean_data2”. See if you can use the pipe operator to streamline the script.
messy_data2 <- tibble(`Park Code` = c(NA, NA, 'CABR', 'CHIS', 'SAMO', 'SAMO'),
visitor_count = c(NA, NA, 400, 500, 400, 400),
`empty row` = NA)
clean_data2 <- clean_names(messy_data2) %>%
remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
distinct()
In this section, we’ll introduce functions in the dplyr package. This package is one of the packages that make up the tidyverse, so you won’t have to install and call another package. The dplyr package is a group of functions that help to manipulate and summarize data. https://www.rdocumentation.org/packages/dplyr/versions/0.7.8
You can think of these functions like verbs that accomplish tasks, and are strung together by the pipe operator (%>% or |>).
We’ll cover the following functions: select(), rename(), filter(), arrange(), pull(), mutate(), group_by(), and summarize().
The data we’ll use for modules 3-5 is from the Tidy Tuesday community on GitHub. Each week on Tuesday, the community posts data for exploratory data analysis and visualization. You can add data from this page to your RStudio environment by copying + pasting the “Get the data!” code chunk into your script and running it in the console.
https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-09-17
The data we’ll use is in csv format, so we’ll need to call the readr package, as well as the tidyverse and janitor for this lesson (if you’ve already called tidyverse and janitor, don’t run those lines).
library(readr)
library(janitor)
library(tidyverse)
Next, we’ll load the data from the R for Data Science Tidy Tuesday GitHub page. There are three lines which will load three data.frames.
For now, we’ll focus on the park_visits data.frame, and will circle back to the other two later. There are descriptions of each column for each data.frame on the Tidy Tuesday website linked above if you’d like more information.
park_visits <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")
Dplyr is a package that is part of the tidyverse, which is used with data.frames. It makes tasks like filtering and summarizing data easier. Many of the functions within the package are “verb-like” so they are easy to understand and remember.
select() - keeps the columns you indicate and drops the rest. Since select() puts the columns in the order you type them, you can also use select() to reorder columns.
rename() - keeps all the columns, but renames the ones you indicate
Since we won’t be using geographical information, other than state, in our module, we’ll select some columns from the park_visits data to use today.
Note that the name to the left of the arrow is the resulting data.frame, and the information to the right, connected by pipes, is the sequence of steps used to get to the result. I normally name the result something different than the original data.frame so that the original data aren’t affected. However, since we won’t need the geographical information later on, I will name this one “park_visits”.
# let's see the column names in park_visits
colnames(park_visits)
## [1] "year" "gnis_id" "geometry"
## [4] "metadata" "number_of_records" "parkname"
## [7] "region" "state" "unit_code"
## [10] "unit_name" "unit_type" "visitors"
# we can also look at the first few rows by using head()
head(park_visits)
## year gnis_id geometry metadata number_of_records parkname
## 1 1904 1163670 POLYGON <NA> 1 Crater Lake
## 2 1941 1531834 MULTIPOLYGON <NA> 1 Lake Roosevelt
## 3 1961 2055170 MULTIPOLYGON <NA> 1 Lewis and Clark
## 4 1935 1530459 MULTIPOLYGON <NA> 1 Olympic
## 5 1982 277263 POLYGON <NA> 1 Santa Monica Mountains
## 6 1919 578853 MULTIPOLYGON <NA> 1 <NA>
## region state unit_code unit_name
## 1 PW OR CRLA Crater Lake National Park
## 2 PW WA LARO Lake Roosevelt National Recreation Area
## 3 PW WA LEWI Lewis and Clark National Historical Park
## 4 PW WA OLYM Olympic National Park
## 5 PW CA SAMO Santa Monica Mountains National Recreation Area
## 6 NE ME ACAD Acadia National Park
## unit_type visitors
## 1 National Park 1500
## 2 National Recreation Area 0
## 3 National Historical Park 69000
## 4 National Park 2200
## 5 National Recreation Area 468144
## 6 National Park 64000
# you can either select for the columns you want
park_visits1 <- park_visits %>% select(year, parkname, unit_code, state, visitors)
# or against the columns you don't want, by using the minus sign before the column name
park_visits2 <- park_visits %>% select(-gnis_id, -geometry, -metadata, -number_of_records, -region, -unit_name, -unit_type)
# you can check if you have the same result by using colnames()
colnames(park_visits1)
## [1] "year" "parkname" "unit_code" "state" "visitors"
colnames(park_visits2)
## [1] "year" "parkname" "state" "unit_code" "visitors"
# to clear up our environment, we'll get rid of park_visits1 and 2 using remove()
remove(park_visits1, park_visits2)
# since we won't need all of the columns in park_visits, we'll change that file (note that park_visits is on both the left and right sides of the pipe)
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)
Taking a look at the park_visits data, there are some column name inconsistencies. For example, some are in snake case (unit_code), but some are not (parkname). To rename a column, use rename(). The structure of this argument is as follows:
rename(dataset, new_name = old_name) OR dataset %>% rename(new_name = old_name)
# rename parkname column to make it snake case
park_visits <- park_visits %>% rename(park_name = parkname)
# we can check if we've done it correctly by looking at the column names
colnames(park_visits)
## [1] "year" "park_name" "unit_code" "state" "visitors"
filter() - keeps the rows of a data.frame that meet conditions based on values in columns. You can combine criteria using & (and), | (or), and ! (not).
If you’d like to get visitation data for just parks in the state of Washington (WA), you can use the filter function. The structure is as follows:
wa_park_visits <- filter(
# name of data.frame
park_visits,
# condition
state == 'WA'
)
# head() shows the first few rows of data
head(wa_park_visits)
# you can use the unique() to check you've done the filtering correctly
unique(wa_park_visits$state)
Now that you’ve got parks in the state of Washington, perhaps you’d like to know which parks in WA had over 5 million total visitors. This will require more intensive filtering. We’ll have to filter for parks where state == ‘WA’ AND year == ‘Total’ AND visitors > 5,000,000.
wa_park_visits <- park_visits %>%
# filter for parks in WA
filter(state == 'WA' &
# filter for total visitation
year == 'Total' &
# filter for over 5 million visitors
visitors > 5000000)
# head() shows the first few rows of data
head(wa_park_visits)
You also might be wondering why we use “==” instead of “=”. In R, the single equal sign “=” is used to assign something. For example, if you want to make some variable equal to a number, you could tell R “x = 5”, and it would remember that x is equal to 5. To evaluate something as TRUE/FALSE, you use “==”. If you type “5 == 3” into the console, it will return the answer “FALSE”. The filtering argument uses this TRUE/FALSE evaluation - filtering for rows where the stated condition, such as year == ‘Total’, is TRUE.
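A quick illustration you can try at the console (the variable name is arbitrary):
x = 5   # assignment: x now holds the value 5 (though <- is the preferred assignment operator)
x == 5  # comparison: is x equal to 5?
## [1] TRUE
5 == 3  # comparison: is 5 equal to 3?
## [1] FALSE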
Time for the first challenge. Using the park_visits data, create a new data.frame containing total visitation for a state of your choice. Note that some “states” aren’t states. For example, VI refers to the Virgin Islands. How can you check your work?
pr_park_visits <- park_visits %>%
# filter for parks in Puerto Rico
filter(state == 'PR' &
# filter for total visitation
year == 'Total')
# head() shows the first few rows of data
head(pr_park_visits)
# San Juan National Historic Site (Puerto Rico) had > 55 million visitors!
# remove result from environment
remove(pr_park_visits)
arrange() reorders rows. You can think of this like “sort” in Excel. While this is generally not needed in R, it can be helpful for exploring your data, and when working with time series data and lagged values.
As an example, we’ll re-visit total park visitation in WA, sorting by the number of visitors.
wa_park_visits <- park_visits %>%
# filter for parks in WA
filter(state == 'WA' &
# filter for total visitation
year == 'Total' &
# filter for over 5 million visitors
visitors > 5000000) %>%
# arrange by visitation - default is ascending order
arrange(visitors)
# look at result (df is short so head() not needed)
wa_park_visits
## year park_name unit_code state visitors
## 1 Total Lewis and Clark LEWI WA 9515314
## 2 Total <NA> NEPE WA 9566104
## 3 Total Ross Lake ROLA WA 11335924
## 4 Total North Cascades NOCA WA 14809679
## 5 Total Fort Vancouver FOVA WA 20238294
## 6 Total San Juan Island SAJH WA 23066952
## 7 Total Lake Roosevelt LARO WA 62247752
## 8 Total Mount Rainier MORA WA 94488801
## 9 Total Olympic OLYM WA 160737962
# to arrange by descending order, we can use desc()
wa_park_visits2 <- wa_park_visits %>%
arrange(desc(visitors))
# look at result
wa_park_visits2
## year park_name unit_code state visitors
## 1 Total Olympic OLYM WA 160737962
## 2 Total Mount Rainier MORA WA 94488801
## 3 Total Lake Roosevelt LARO WA 62247752
## 4 Total San Juan Island SAJH WA 23066952
## 5 Total Fort Vancouver FOVA WA 20238294
## 6 Total North Cascades NOCA WA 14809679
## 7 Total Ross Lake ROLA WA 11335924
## 8 Total <NA> NEPE WA 9566104
## 9 Total Lewis and Clark LEWI WA 9515314
# remove one result
remove(wa_park_visits2)
pull() takes data from a column and returns it in vector form
pull() %>% table() is a handy way to see a summary of categorical data (e.g. how many observations there are for each state in your data)
# two ways to get the number of observations per state
# option 1 - use pull() to isolate a column and table() to get # of observations
state_n <- park_visits %>%
# pull the state column only
pull(state) %>%
# get # of observations by state in table output
table()
# option 2 - use group_by() to isolate column(s) and tally() to get # of obs
state_n2 <- park_visits %>%
# group by state
group_by(state) %>%
# get tally of observations by state in output
tally()
# remove created objects
remove(state_n, state_n2)
mutate() makes new columns by doing calculations on old columns
As an example, let’s calculate the total visitation for parks with over 5 million total visitors in WA in millions, by dividing the number of total visitors by 1,000,000.
wa_park_visits_millions <- wa_park_visits %>%
mutate(visitors_mil = visitors/1000000)
wa_park_visits_millions
## year park_name unit_code state visitors visitors_mil
## 1 Total Lewis and Clark LEWI WA 9515314 9.515314
## 2 Total <NA> NEPE WA 9566104 9.566104
## 3 Total Ross Lake ROLA WA 11335924 11.335924
## 4 Total North Cascades NOCA WA 14809679 14.809679
## 5 Total Fort Vancouver FOVA WA 20238294 20.238294
## 6 Total San Juan Island SAJH WA 23066952 23.066952
## 7 Total Lake Roosevelt LARO WA 62247752 62.247752
## 8 Total Mount Rainier MORA WA 94488801 94.488801
## 9 Total Olympic OLYM WA 160737962 160.737962
group_by() indicates that data is in groups, so you can do calculations separately for each group
group_by() %>% mutate() will add a column and return the same number of rows as the original data
This is helpful when you want to get group summary information without sacrificing the original values. For example, if we want to know what percent of the total visitation of a park occurs in each year, and add this to the original data.
park_visits2 <- park_visits %>%
# remove the rows with Totals per park
filter(year != "Total") %>%
# do calculations by park
group_by(park_name) %>%
# add visit_percent column
  # visits for each year divided by the sum of total visits for each park
  # the na.rm = TRUE means that NA values are not included in calculations
  # round(2) rounds the result to 2 decimal places for easier reading
mutate(visit_percent = (100*visitors/sum(visitors, na.rm = TRUE)) %>%
round(2))
# take a look at the result
head(park_visits2)
## # A tibble: 6 x 6
## # Groups: park_name [6]
## year park_name unit_code state visitors visit_percent
## <chr> <chr> <chr> <chr> <int> <dbl>
## 1 1904 Crater Lake CRLA OR 1500 0
## 2 1941 Lake Roosevelt LARO WA 0 0
## 3 1961 Lewis and Clark LEWI WA 69000 0.73
## 4 1935 Olympic OLYM WA 2200 0
## 5 1982 Santa Monica Mountains SAMO CA 468144 2.66
## 6 1919 <NA> ACAD ME 64000 0
# remove from environment
remove(park_visits2)
group_by() and summarize() (or summarise()) also work with grouped data. Here you calculate a summary for each group. The output data frame will have the columns that define the groups plus the summary columns, with one row for each group.
I find this combination most helpful for calculating means and standard deviations for data. You can also find minimum and maximum values. We’ll filter for total visitation numbers for each park, then calculate a visitation summary for each state.
state_summary <- park_visits %>%
# filter for total visitation numbers (to avoid double counts)
filter(year == 'Total') %>%
# do calculations by state
group_by(state) %>%
# calculate summaries
summarize(
# mean number of total visitors across parks in each state
mean_visit = mean(visitors, na.rm = TRUE),
# sd of total visitors across parks in each state
sd_visit = sd(visitors, na.rm = TRUE),
# min total visitors across parks in each state
min_visit = min(visitors, na.rm = TRUE),
# max total visitors across parks in each state
max_visit = max(visitors, na.rm = TRUE),
# get number of park units w data in each state
n_parks = n()
)
# take a look at the result
head(state_summary)
## # A tibble: 6 x 6
## state mean_visit sd_visit min_visit max_visit n_parks
## <chr> <dbl> <dbl> <int> <int> <int>
## 1 AK 4661539 6737110. 10490 19476522 22
## 2 AL 2998848. 2162876. 365883 5487784 5
## 3 AR 20647567. 34237894. 60563 92028933 7
## 4 AS 113694 NA 113694 113694 1
## 5 AZ 23655713. 46310344. 348811 205486894 19
## 6 CA 61322519. 120602434. 6349 611031225 26
# if you want to continue analyzing this summarized data, you must first ungroup()
# for example:
state_summary <- state_summary %>%
ungroup() %>%
mutate(dif = max_visit - min_visit)
# remove from environment
remove(state_summary)
For this second challenge, you’ll start with the park_visits data. Calculate the average visitation divided by the number of parks for each state. The final data.frame should be arranged in descending order by a numeric column of your choice.
challenge2 <- park_visits %>%
# filter for total visitation numbers (to avoid double counts)
filter(year == 'Total') %>%
# do calculations by state
group_by(state) %>%
# calculate summaries
summarize(
# mean number of total visitors across parks in each state
mean_visit = mean(visitors, na.rm = TRUE),
# get number of park units w data in each state
n_parks = n()
) %>%
# ungroup to do another calculation
ungroup() %>%
# calculate visit/n
mutate(visit_per_park = mean_visit/n_parks) %>%
# select relevant columns
select(state, visit_per_park) %>%
# arrange in descending order
arrange(desc(visit_per_park))
# take a look at the result
head(challenge2)
## # A tibble: 6 x 2
## state visit_per_park
## <chr> <dbl>
## 1 NV 103867788.
## 2 PR 55928192
## 3 ME 40890226.
## 4 NC 19591218.
## 5 MS 19160525
## 6 OK 18206483.
# remove from environment
remove(challenge2)
This section will go over the use of functions to change data from wide to long format and vice versa. We will cover:
pivot_longer()
pivot_wider()
If you’re starting with this section, run the following code chunk to get caught up.
# call packages
library("readr")
library("janitor")
library("tidyverse")
# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")
# cut some columns from park_visits data
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)
Depending on your goal, you may want to have your data formatted in multiple ways. For example, it is much easier for humans to read wide format data, while most R functions rely on long format data. Here is a quick demonstration using park visits data:
# we'll first take a subset of the park visits data
medn_visits <- park_visits %>%
# get parks in MEDN and no totals, years 2010-2015
filter(unit_code %in% c("CABR", "CHIS", "SAMO") & year != "Total") %>%
# make year a number, since no more text
mutate(year = as.numeric(year)) %>%
# get years 2010:2015
filter(year >= 2010 & year <= 2015) %>%
# arrange in ascending order
arrange(year) %>%
# select relevant columns
select(unit_code, year, visitors)
# if we take a look at the data, it is in long format, each observation is a row
# this isn't the most human-friendly way to look at data, but it is really helpful if we want to use dplyr group_by() functions like mutate() and summarize()
medn_visits
## # A tibble: 18 x 3
## unit_code year visitors
## <chr> <dbl> <dbl>
## 1 SAMO 2010 568371
## 2 CABR 2010 763140
## 3 CHIS 2010 277515
## 4 SAMO 2011 609636
## 5 CABR 2011 813351
## 6 CHIS 2011 242756
## 7 SAMO 2012 649471
## 8 CABR 2012 877951
## 9 CHIS 2012 249594
## 10 SAMO 2013 633054
## 11 CABR 2013 856128
## 12 CHIS 2013 212029
## 13 SAMO 2014 694714
## 14 CABR 2014 893434
## 15 CHIS 2014 342161
## 16 SAMO 2015 797126
## 17 CABR 2015 981825
## 18 CHIS 2015 324816
# conversely, we can pivot wider to make the data wide format
# it is more human-friendly but you can't use dplyr functions on it easily
medn_visits_wide <- pivot_wider(medn_visits, names_from = year, values_from = visitors)
medn_visits_wide
## # A tibble: 3 x 7
## unit_code `2010` `2011` `2012` `2013` `2014` `2015`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SAMO 568371 609636 649471 633054 694714 797126
## 2 CABR 763140 813351 877951 856128 893434 981825
## 3 CHIS 277515 242756 249594 212029 342161 324816
There are two functions we’ll discuss today:
pivot_wider() - converts from long to wide
pivot_longer() - converts from wide to long
Full disclosure, I almost always have to google the arguments that go into the pivot_ functions. Here’s the reference page for pivot_wider(): https://tidyr.tidyverse.org/reference/pivot_wider.html
Here’s a short example - pivoting the MEDN visits from long to wide, with more detail than the prior code chunk.
# check out the original data (medn_visits) in long format
medn_visits
## # A tibble: 18 x 3
## unit_code year visitors
## <chr> <dbl> <dbl>
## 1 SAMO 2010 568371
## 2 CABR 2010 763140
## 3 CHIS 2010 277515
## 4 SAMO 2011 609636
## 5 CABR 2011 813351
## 6 CHIS 2011 242756
## 7 SAMO 2012 649471
## 8 CABR 2012 877951
## 9 CHIS 2012 249594
## 10 SAMO 2013 633054
## 11 CABR 2013 856128
## 12 CHIS 2013 212029
## 13 SAMO 2014 694714
## 14 CABR 2014 893434
## 15 CHIS 2014 342161
## 16 SAMO 2015 797126
## 17 CABR 2015 981825
## 18 CHIS 2015 324816
# if we want to make each row a park and each column a year, we can pivot wider
medn_visits_wide <- medn_visits %>%
pivot_wider(
# names_from is the column whose values become the new column names
names_from = year,
# values_from is the column whose values fill those new columns
values_from = visitors
)
# check out the result
medn_visits_wide
## # A tibble: 3 x 7
## unit_code `2010` `2011` `2012` `2013` `2014` `2015`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SAMO 568371 609636 649471 633054 694714 797126
## 2 CABR 763140 813351 877951 856128 893434 981825
## 3 CHIS 277515 242756 249594 212029 342161 324816
# we can also make park units the columns and fill it with visitor data for each year
medn_visits_wide <- medn_visits %>%
pivot_wider(names_from = unit_code, values_from = visitors)
medn_visits_wide
## # A tibble: 6 x 4
## year SAMO CABR CHIS
## <dbl> <dbl> <dbl> <dbl>
## 1 2010 568371 763140 277515
## 2 2011 609636 813351 242756
## 3 2012 649471 877951 249594
## 4 2013 633054 856128 212029
## 5 2014 694714 893434 342161
## 6 2015 797126 981825 324816
Now that we have our data in wide format, it is time to revert back to pivoting longer. Here’s the reference material for pivot_longer() https://tidyr.tidyverse.org/reference/pivot_longer.html.
# check out the original data (medn_visits_wide) in wide format
medn_visits_wide
## # A tibble: 6 x 4
## year SAMO CABR CHIS
## <dbl> <dbl> <dbl> <dbl>
## 1 2010 568371 763140 277515
## 2 2011 609636 813351 242756
## 3 2012 649471 877951 249594
## 4 2013 633054 856128 212029
## 5 2014 694714 893434 342161
## 6 2015 797126 981825 324816
# if we want to make each row an observation, with an associated park and year, we'll pivot_longer
medn_visits_long <- medn_visits_wide %>%
pivot_longer(
# first argument is the columns you'd like to transform
SAMO:CHIS,
# next is what you'd like to name the former name cols
names_to = "unit_code",
# last is what you'd like to name the former values cols
values_to = "visitors"
)
# check out the result
medn_visits_long
## # A tibble: 18 x 3
## year unit_code visitors
## <dbl> <chr> <dbl>
## 1 2010 SAMO 568371
## 2 2010 CABR 763140
## 3 2010 CHIS 277515
## 4 2011 SAMO 609636
## 5 2011 CABR 813351
## 6 2011 CHIS 242756
## 7 2012 SAMO 649471
## 8 2012 CABR 877951
## 9 2012 CHIS 249594
## 10 2013 SAMO 633054
## 11 2013 CABR 856128
## 12 2013 CHIS 212029
## 13 2014 SAMO 694714
## 14 2014 CABR 893434
## 15 2014 CHIS 342161
## 16 2015 SAMO 797126
## 17 2015 CABR 981825
## 18 2015 CHIS 324816
# this should look familiar - like the original medn_visits data
This is a more involved challenge that will draw on your knowledge of both dplyr (from Module 3) and pivoting.
Using the state_pop data, accomplish the following tasks:
Filter the data to 3 states of your choice for the years 2010-2015
Pivot the filtered data to wide format, with one column per year
Pivot the wide data back to a data.frame in long format
Hint: if your column names are numbers, like years, you must refer to them using back-ticks. A column called 2020 would be `2020`
# get 3 states of choice (FL, CA, AK) for years 2010-2015
solution_original <- state_pop %>%
filter(state %in% c("FL", "AK", "CA") &
year >= 2010 & year <= 2015)
# pivot wider
solution_wide <- solution_original %>%
pivot_wider(
names_from = "year",
values_from = "pop"
)
# pivot longer to return to format of solution_original
solution_long <- solution_wide %>%
pivot_longer(`2010`:`2015`, names_to = "year", values_to = "pop")
Oftentimes we want to use data that is split into two or more data.frames. This section will go over the use of functions to “join” or combine data from two tables into one.
We will:
Use bind_rows() and bind_cols() for simple cases
Use dplyr commands for more complicated joins:
left_join() and right_join()
full_join()
inner_join()
semi_join()
anti_join()
If you’re starting with this section, run the following code chunk to get caught up.
# call packages
library("readr")
library("janitor")
library("tidyverse")
# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")
# cut some columns from park_visits data and make sure year is numeric data only
park_visits <- park_visits %>%
select(year, parkname, unit_code, state, visitors) %>%
filter(year != "Total") %>%
mutate(year = as.integer(year))
bind_rows() and bind_cols() simply merge two data.frames directly.
bind_rows()
bind_rows() adds the rows in the second data.frame to the bottom of the first data.frame. This generally only makes sense if they have many identical columns. Make sure that columns with the same data have the same name!
# we'll first make two subsets of the park_visits data
SHEN_visits <- park_visits %>%
# get data from Shenandoah.
filter(unit_code == "SHEN")
SHEN_visits
## # A tibble: 81 x 5
## year parkname unit_code state visitors
## <int> <chr> <chr> <chr> <dbl>
## 1 1936 <NA> SHEN VA 694098
## 2 2016 <NA> SHEN VA 1437341
## 3 2015 <NA> SHEN VA 1321873
## 4 2014 <NA> SHEN VA 1255321
## 5 2013 <NA> SHEN VA 1136505
## 6 2012 <NA> SHEN VA 1210200
## 7 2011 <NA> SHEN VA 1209883
## 8 2010 <NA> SHEN VA 1253386
## 9 2009 <NA> SHEN VA 1120981
## 10 2008 <NA> SHEN VA 1075878
## # ... with 71 more rows
ACAD_visits <- park_visits %>%
# get data from Acadia.
filter(unit_code == "ACAD")
ACAD_visits
## # A tibble: 98 x 5
## year parkname unit_code state visitors
## <int> <chr> <chr> <chr> <dbl>
## 1 1919 <NA> ACAD ME 64000
## 2 2016 <NA> ACAD ME 3303393
## 3 2015 <NA> ACAD ME 2811184
## 4 2014 <NA> ACAD ME 2563129
## 5 2013 <NA> ACAD ME 2254922
## 6 2012 <NA> ACAD ME 2431052
## 7 2011 <NA> ACAD ME 2374645
## 8 2010 <NA> ACAD ME 2504208
## 9 2009 <NA> ACAD ME 2227698
## 10 2008 <NA> ACAD ME 2075857
## # ... with 88 more rows
# now we will combine the 2 data.frames into one.
SHEN_ACAD_visits <- bind_rows(SHEN_visits, ACAD_visits)
# check that data from both parks is there
SHEN_ACAD_visits %>%
pull(unit_code) %>%
table()
## .
## ACAD SHEN
## 98 81
rm(SHEN_ACAD_visits, SHEN_visits, ACAD_visits)
You generally would not split up a data.frame and then put it together again like this, but it can be useful if you have data that has the same data structure but starts out broken up by site, date, park etc. and you wish to combine it into one data.frame.
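To see why matching column names matter, here is a tiny made-up example (not from the park data): bind_rows() matches columns by name, and any column present in only one data.frame gets filled with NA.
df1 <- data.frame(park = c("ACAD", "SHEN"), visitors = c(100, 200))
df2 <- data.frame(park = "CRLA", visits = 300) # note the differently named column
bind_rows(df1, df2) # result has both a `visitors` and a `visits` column, padded with NAs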
bind_cols()
bind_cols() adds the columns from the second data.frame to the right of the first data.frame.
# we'll first make two subsets of the park_visits data
visits_left <- park_visits %>%
# get first 2 columns.
select(year, parkname)
visits_left
## # A tibble: 21,174 x 2
## year parkname
## <int> <chr>
## 1 1904 Crater Lake
## 2 1941 Lake Roosevelt
## 3 1961 Lewis and Clark
## 4 1935 Olympic
## 5 1982 Santa Monica Mountains
## 6 1919 <NA>
## 7 1969 <NA>
## 8 1967 <NA>
## 9 1944 <NA>
## 10 1989 <NA>
## # ... with 21,164 more rows
visits_right <- park_visits %>%
# get last 3 columns.
select(unit_code, state, visitors)
visits_right
## # A tibble: 21,174 x 3
## unit_code state visitors
## <chr> <chr> <dbl>
## 1 CRLA OR 1500
## 2 LARO WA 0
## 3 LEWI WA 69000
## 4 OLYM WA 2200
## 5 SAMO CA 468144
## 6 ACAD ME 64000
## 7 AMIS TX 448000
## 8 ASIS MD 738700
## 9 BIBE TX 1409
## 10 BICY FL 81157
## # ... with 21,164 more rows
visits_bound <- bind_cols(visits_left, visits_right)
visits_bound
## # A tibble: 21,174 x 5
## year parkname unit_code state visitors
## <int> <chr> <chr> <chr> <dbl>
## 1 1904 Crater Lake CRLA OR 1500
## 2 1941 Lake Roosevelt LARO WA 0
## 3 1961 Lewis and Clark LEWI WA 69000
## 4 1935 Olympic OLYM WA 2200
## 5 1982 Santa Monica Mountains SAMO CA 468144
## 6 1919 <NA> ACAD ME 64000
## 7 1969 <NA> AMIS TX 448000
## 8 1967 <NA> ASIS MD 738700
## 9 1944 <NA> BIBE TX 1409
## 10 1989 <NA> BICY FL 81157
## # ... with 21,164 more rows
rm(visits_left, visits_right, visits_bound)
bind_cols() assumes that the rows are in the same order in each data.frame, and that there are no rows in one data.frame that are missing in the other. Unless you are sure that both of these are true, you should avoid using this function.
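Here is a small made-up sketch (not from the park data) of how bind_cols() can silently scramble data when row order differs:
left_part  <- data.frame(site = c("A", "B", "C"))
right_part <- data.frame(count = c(10, 20, 30)) # intended to line up with sites A, B, C
right_shuffled <- right_part[c(2, 1, 3), , drop = FALSE] # same rows, different order
bind_cols(left_part, right_shuffled) # rows are combined by position, so counts no longer match their sites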
bind_rows() challenge
Using the park_visits data, make 2 smaller data.frames:
one with data from the Atlantic islands (states PR and VI)
one with data from the Pacific islands (states AS, GU, and HI)
Then, using bind_rows(), combine the 2 data.frames.
# Atlantic islands
Atlantic <- park_visits %>%
filter(state %in% c("PR", "VI"))
# Pacific Islands
Pacific <- park_visits %>%
filter(state %in% c("AS", "GU", "HI"))
# All islands
Islands <- bind_rows(Atlantic, Pacific)
rm(Atlantic, Pacific, Islands)
left_join() and right_join()
left_join() and right_join() are a safer way to join columns from one data.frame to the columns in another.
For either of these functions, you need to specify one or more key columns using the by argument. Key columns are the data used to determine which rows in one data.frame match the rows in the other. The order of the rows in the data.frames does not matter.
The syntax is left_join(x, y, by = "name") where x and y are data.frames.
The only difference between left_join() and right_join() is what happens when the rows in the data.frames don’t match up one to one. left_join() keeps all the rows in x and drops any rows in y that don’t match, while right_join() keeps all the rows in y and drops any rows in x that don’t match.
Let’s combine the gas_price
data with the state_pop
data. The gas price data does not vary by state, so we just have to match the data by year. In order to make the way these functions work a little clearer, we will first filter the state_pop
data to only include data before 2000, so that there is less of an overlap between the 2 data sources.
# filter the state data
state_pop2 <- state_pop %>% filter(year < 2000)
# left_join example
state_pop_gas1 <- left_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas1
## # A tibble: 5,100 x 5
## year state pop gas_current gas_constant
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1900 AL 1830000 NA NA
## 2 1901 AL 1907000 NA NA
## 3 1902 AL 1935000 NA NA
## 4 1903 AL 1957000 NA NA
## 5 1904 AL 1978000 NA NA
## 6 1905 AL 2012000 NA NA
## 7 1906 AL 2045000 NA NA
## 8 1907 AL 2058000 NA NA
## 9 1908 AL 2070000 NA NA
## 10 1909 AL 2108000 NA NA
## # ... with 5,090 more rows
A couple of things to notice. The state_pop_gas1 data.frame has 5100 rows because that is how many rows are in the state_pop2 data.frame. Every year is repeated 51 times (once for each state + DC), so the gas price data for a given year gets repeated 51 times as well. The gas_price data only goes back to 1929, so for years before that, the data just has NA for gas prices.
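A quick way to confirm this (a sketch, assuming state_pop_gas1 is still in your environment) is to check which years have missing gas prices:
state_pop_gas1 %>%
  filter(is.na(gas_constant)) %>% # rows with no matching gas price
  summarize(first_year = min(year), last_year = max(year)) # should span 1900 through 1928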
Here is the exact same thing, but this time using right_join()
# right_join example
state_pop_gas2 <- right_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas2
## # A tibble: 3,637 x 5
## year state pop gas_current gas_constant
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1929 AL 2644000 0.21 2.38
## 2 1930 AL 2647000 0.2 2.3
## 3 1931 AL 2649000 0.17 2.18
## 4 1932 AL 2653000 0.18 2.61
## 5 1933 AL 2661000 0.18 2.66
## 6 1934 AL 2685000 0.19 2.67
## 7 1935 AL 2719000 0.19 2.62
## 8 1936 AL 2743000 0.19 2.67
## 9 1937 AL 2762000 0.2 2.63
## 10 1938 AL 2787000 0.2 2.64
## # ... with 3,627 more rows
Now the data is very different. It has 3637 rows, which seems pretty random at first. What has happened is:
Years before 1929 were dropped, because they do not appear in the gas_price data.
Years 1929-1999 matched rows in both data sources, so the gas price for each of those years is repeated 51 times (once per state + DC).
Years in gas_price after 1999 have no match in the state_pop2 data.frame, so they appear once each, with NA for state and pop.
So, all the data from gas_price is present; if necessary it was repeated to match up with the state_pop2 data.
inner_join() and full_join()
In the previous section we used left and right joins to get all of the data from one data.frame with matching data from a second data.frame. Now we will look at 2 other joins:
inner_join() - only rows that have matches in both data.frames are kept
full_join() - you get all the rows from both datasets, even if some don’t match
inner join:
# inner_join example
state_pop_gas3 <- inner_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas3
## # A tibble: 3,621 x 5
## year state pop gas_current gas_constant
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1929 AL 2644000 0.21 2.38
## 2 1930 AL 2647000 0.2 2.3
## 3 1931 AL 2649000 0.17 2.18
## 4 1932 AL 2653000 0.18 2.61
## 5 1933 AL 2661000 0.18 2.66
## 6 1934 AL 2685000 0.19 2.67
## 7 1935 AL 2719000 0.19 2.62
## 8 1936 AL 2743000 0.19 2.67
## 9 1937 AL 2762000 0.2 2.63
## 10 1938 AL 2787000 0.2 2.64
## # ... with 3,611 more rows
Now the data.frame has 3621 rows, because there are 71 years of data from 51 states (71 x 51 = 3621). This is the smallest data.frame from these joins. The years in state_pop_gas3 are only those that are found in both data.frames.
full join:
# full_join example
state_pop_gas4 <- full_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas4
## # A tibble: 5,116 x 5
## year state pop gas_current gas_constant
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1900 AL 1830000 NA NA
## 2 1901 AL 1907000 NA NA
## 3 1902 AL 1935000 NA NA
## 4 1903 AL 1957000 NA NA
## 5 1904 AL 1978000 NA NA
## 6 1905 AL 2012000 NA NA
## 7 1906 AL 2045000 NA NA
## 8 1907 AL 2058000 NA NA
## 9 1908 AL 2070000 NA NA
## 10 1909 AL 2108000 NA NA
## # ... with 5,106 more rows
rm(state_pop_gas1, state_pop_gas2, state_pop_gas3, state_pop_gas4)
This data.frame has 5116 rows, the most of any of these joins. It includes all data from either data.frame, regardless of whether there is a match.
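To recap how the four joins differ, here is a small sketch (not in the original materials) that re-runs each join on state_pop2 and gas_price and compares the number of rows returned:
c(left  = nrow(left_join(state_pop2, gas_price, by = "year")),
  right = nrow(right_join(state_pop2, gas_price, by = "year")),
  inner = nrow(inner_join(state_pop2, gas_price, by = "year")),
  full  = nrow(full_join(state_pop2, gas_price, by = "year")))
# expect 5100, 3637, 3621, and 5116 rows, matching the examples above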
Using the park_visits and the state_pop data, create a new data.frame with all the data from park_visits and any matching data from state_pop.
Hint: if you need to use 2 columns as keys, then you write the by= argument like so: by = c("Column1", "Column2").
park_visits_pop <- left_join(park_visits, state_pop, by = c("state", "year"))
rm(park_visits_pop)
Sometimes you don’t want to combine two tables, but you do want to filter one table so that it only has data that can match data in another table. semi_join() is used for this.
semi join:
# semi_join example
state_pop_gas5 <- semi_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas5
## # A tibble: 3,621 x 3
## year state pop
## <dbl> <chr> <dbl>
## 1 1929 AL 2644000
## 2 1930 AL 2647000
## 3 1931 AL 2649000
## 4 1932 AL 2653000
## 5 1933 AL 2661000
## 6 1934 AL 2685000
## 7 1935 AL 2719000
## 8 1936 AL 2743000
## 9 1937 AL 2762000
## 10 1938 AL 2787000
## # ... with 3,611 more rows
The columns in state_pop_gas5 are the same as those in state_pop2. However, all the data from years prior to 1929 has now been removed, because pre-1929 data is not in the gas_price data.
anti_join() is the opposite of semi_join(). It will keep the data in one data.frame that does not have a match in the second data.frame.
anti_join:
# anti_join example
state_pop_gas6 <- anti_join(
x = state_pop2, # first data.frame
y = gas_price, # second data.frame
by = "year" # key column
)
state_pop_gas6
## # A tibble: 1,479 x 3
## year state pop
## <dbl> <chr> <dbl>
## 1 1900 AL 1830000
## 2 1901 AL 1907000
## 3 1902 AL 1935000
## 4 1903 AL 1957000
## 5 1904 AL 1978000
## 6 1905 AL 2012000
## 7 1906 AL 2045000
## 8 1907 AL 2058000
## 9 1908 AL 2070000
## 10 1909 AL 2108000
## # ... with 1,469 more rows
# clean up (the first four join results were already removed above)
rm(state_pop2, state_pop_gas5, state_pop_gas6)
state_pop_gas6 only has 1479 rows, because it only contains data from before 1929, which is when the gas_price data starts.
Filter the park_visits data so that it only contains data that can be matched to data in both the state_pop and gas_price data.
Note that the state_pop data has some NAs for population for Alaska and Hawaii prior to 1949. For a bigger challenge, don’t include data from those state / year combinations in your final data set.
Hint: the function is.na(data) will be TRUE when a value is NA. If you put an exclamation mark in front, !is.na(data), it will tell you when the values are not NA.
## In this case we filter out the NA values inside the semi_join() call. You could also filter them out first, save that to a new data.frame, and then do the join.
park_visits_filtered<-semi_join(park_visits, state_pop %>% filter(!is.na(pop)), by=c("state", "year")) %>%
semi_join(gas_price, by="year")
park_visits_filtered
## # A tibble: 19,964 x 5
## year parkname unit_code state visitors
## <int> <chr> <chr> <chr> <dbl>
## 1 1941 Lake Roosevelt LARO WA 0
## 2 1961 Lewis and Clark LEWI WA 69000
## 3 1935 Olympic OLYM WA 2200
## 4 1982 Santa Monica Mountains SAMO CA 468144
## 5 1969 <NA> AMIS TX 448000
## 6 1967 <NA> ASIS MD 738700
## 7 1944 <NA> BIBE TX 1409
## 8 1989 <NA> BICY FL 81157
## 9 1988 <NA> BISO TN 613011
## 10 1941 <NA> BLRI NC 895874
## # ... with 19,954 more rows
rm(park_visits_filtered)
# Packages used in this section
library(tidyverse)
library(viridis) # for colorblind-friendly palettes
Goals of this Day
Data are useful only when used. Data are used only when understood.
Consider three examples:
Most people can understand this…
faster than they can understand this…
state | timestamp | cases | total_population |
---|---|---|---|
AK | 2022-01-25 04:00:00 | 203110 | 731545 |
AL | 2022-01-25 04:00:00 | 1153149 | 4903185 |
AR | 2022-01-25 04:00:00 | 738652 | 3017804 |
AZ | 2022-01-25 04:00:00 | 1767303 | 7278717 |
CA | 2022-01-25 04:00:00 | 7862003 | 39512223 |
CO | 2022-01-25 04:00:00 | 1207991 | 5758736 |
CT | 2022-01-25 04:00:00 | 683731 | 3565287 |
This table shows average monthly revenue for Acme products.
category | product | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
party supplies | balloons | 892 | 1557 | 1320 | 972 | 1309 | 1174 | 1153 | 1138 | 1275 | 1178 | 1325 | 1422 |
party supplies | confetti | 1271 | 1311 | 829 | 1020 | 1233 | 1061 | 1088 | 1395 | 1376 | 1152 | 1568 | 1412 |
party supplies | party hats | 1338 | 1497 | 1445 | 956 | 1372 | 1482 | 1048 | 877 | 1404 | 1030 | 1458 | 1547 |
party supplies | wrapping paper | 1396 | 1026 | 932 | 891 | 1364 | 896 | 900 | 1221 | 1146 | 967 | 1394 | 1507 |
school supplies | backpacks | 1802 | 1773 | 1611 | 1723 | 1799 | 1730 | 1813 | 1676 | 1748 | 1652 | 1819 | 1759 |
school supplies | notebooks | 1153 | 1471 | 1541 | 1371 | 1592 | 1514 | 1725 | 1702 | 1457 | 1604 | 1729 | 1279 |
school supplies | pencils | 1679 | 1304 | 1054 | 1259 | 1425 | 1608 | 1972 | 1811 | 1610 | 1004 | 1417 | 1283 |
school supplies | staplers | 1074 | 1708 | 1439 | 1154 | 1551 | 1099 | 1793 | 1601 | 1647 | 1666 | 1389 | 1511 |
Use the table above to answer these questions:
Now let’s display the same table as a heat map, with larger numbers represented by darker color cells. How quickly can we answer those same two questions? What patterns can we see in the heat map that were not obvious in the table above?
In 1973, Francis Anscombe published “Graphs in statistical analysis”, a paper describing four bivariate datasets with identical means, variances, and correlations.
x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
---|---|---|---|---|---|---|---|
10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
14 | 9.96 | 14 | 8.10 | 14 | 8.84 | 8 | 7.04 |
6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
4 | 4.26 | 4 | 3.10 | 4 | 5.39 | 19 | 12.50 |
12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |
x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 | |
---|---|---|---|---|---|---|---|---|
mean | 9 | 7.50 | 9 | 7.50 | 9 | 7.50 | 9 | 7.50 |
var | 11 | 4.13 | 11 | 4.13 | 11 | 4.12 | 11 | 4.12 |
Despite their identical statistics, when we plot the data we see the four datasets are actually very different.
Anscombe used clever thinking and simple plots to demonstrate the importance of data visualizations. But it’s not enough to just plot the data. To have impact, a plot must convey a clear message. How can we do this?
Have a purpose. Every data visualization should tell a story. What story does the Covid plot tell? What message does Anscombe’s quartet of plots convey?
Consider your audience. Avoid scientific names, acronyms, or jargon unless your audience is well-versed in that language. Use color-blind friendly colors.
Use an appropriate visualization. For example:
Keep it simple!!!
Use informative text and arrows wisely. Clear, meaningful titles, subtitles, axes titles/labels, and annotations help convey the message of a plot. Use lines and arrows (sparingly but effectively) to emphasize important thresholds, data points or other plot features. Fonts should be large enough with good contrast (against the background) and sufficient white space to be easily readable.
For today’s data visualizations we will use the park visits data you’re already familiar with from the Data Wrangling module.
From subsets of the park visits and gas price data, we will create plots to:
Make sure you’ve installed and loaded the packages.
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv") # visits to US National Parks
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv") # gas prices over space and time
View() the data to get an initial sense of what we’re working with.
# Examine the data to understand data structure, data types, and potential problems
View(park_visits); summary(park_visits)
View(gas_price); summary(gas_price)
Before we run more detailed data checks, let’s extract a subset of the park_visits
data to work with and then join in the gas_price$gas_constant
data, using year
as the joining column.
The new data frame should be called joined_dat and should only include park_visits data that meet all of these criteria:
region is IM, PW, or SE
unit_type is National Monument, National Historic Site, or National Park
year is 2000 through 2015
We also only need to keep a subset of the data columns from park_visits
. The new data frame should have these columns (in this order): region, unit_type, unit_name, unit_code, year, visitors
NOTE: Even though park_visits$year
is classified as a character data type, R
recognizes the numeric order of quoted numbers (e.g., “6”) and puts quoted numbers before alphabetic characters. Type sort(c("q", "1", "7", "3", "b"))
in your console and see how R
sorts those values. Because of these sorting rules for quoted numbers, we can filter year for “2000”:“2015” and R
will correctly give us all records for which year (even though it’s a character data type) is between 2000 and 2015, inclusive.
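For example, here is the sort the note asks you to try (a quick check you can run in the console):
sort(c("q", "1", "7", "3", "b")) # quoted numbers come before letters
## [1] "1" "3" "7" "b" "q"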
# NOTES:
# 1. When we preface function names with the package name, we avoid potential issues with duplicate function names among packages
# 2. The dplyr::select() function can simultaneously select columns, rename them, and order them as specified. See `?select`
# 3. When we `dplyr::filter()` by multiple conditions, we can use a comma (,) or and-sign (&) to combine the multiple conditions
# 4. `gas_price$gas_constant` is gas price in constant 2015 dollars/gallon
joined_dat <- park_visits %>%
dplyr::filter( # filter by multiple conditions
region %in% c("IM", "PW", "SE"),
unit_type %in% c("National Monument", "National Historic Site", "National Park"),
year %in% "2000":"2015") %>%
dplyr::select(region, unit_type, unit_name, unit_code, year, visitors) %>% # keep these columns only, and arrange them in this order
dplyr::mutate(year = as.numeric(year)) %>% # convert `park_visits$year` to numeric, to match the data type of `gas_price$year`
dplyr::left_join(gas_price[c("year", "gas_constant")], by = "year")
# summary(joined_dat) # always check the data before and after wrangling!
The new data frame should look like this when we type glimpse(joined_dat)
in the console:
## Rows: 1,919
## Columns: 7
## $ region <chr> "PW", "SE", "IM", "IM", "PW", "IM", "PW", "PW", "PW", "IM~
## $ unit_type <chr> "National Historic Site", "National Historic Site", "Nati~
## $ unit_name <chr> "Manzanar National Historic Site", "Tuskegee Airmen Natio~
## $ unit_code <chr> "MANZ", "TUAI", "SAND", "WABA", "CECH", "WACO", "NPSA", "~
## $ year <dbl> 2000, 2003, 2010, 2002, 2013, 2015, 2002, 2015, 2015, 201~
## $ visitors <dbl> 38010, 11085, 4063, 15374, 8157, 20551, 1938, 614712, 326~
## $ gas_constant <dbl> 2.02, 2.01, 3.02, 1.75, 3.62, 2.45, 1.75, 2.45, 2.45, 2.4~
We now have all the data we need in a single data frame.
We are primarily using dplyr
to wrangle data and ggplot2
to plot data. These two packages are part of the Tidyverse of packages for data science. Tidyverse packages share a common design philosophy centered on ideas like breaking down complex tasks into a series of simpler functions executed in sequence (e.g., piping %>%
). Functions from tidyverse packages work together most seamlessly and efficiently when data are in a tidy format. Remember that a ‘tidy’ format means data have one variable per column and one observation per row.
Take a look at the joined_dat data frame. Is it in ‘tidy’ format? Yes, it is! The visitor count for each park unit and year is in a separate data frame row. No need to pivot_longer() anything today. But let’s make two final changes to the data before we begin plotting. First, we will convert region to an ordered factor with more descriptive labels. Second, we will remove park units that don’t have visitor data for every year from 2000 to 2015. We will end with a final check of the data to make sure it’s ready for plotting.
Convert region to a factor data type
Convert the variable region to a factor data type. Relabel and re-order the levels as Pacific West < Intermountain < Southeast (the default order is alphabetical). This way, bar plots and color legends will order the region levels from west to east instead of alphabetically.
# NOTES:
# 1. The factor data type is similar to (and often interchangeable with) the character data type, but limits entries to pre-specified values and allows us to specify an order to those values (other than the default alphabetical order)
# 2. Factors are especially useful for ordered categories (e.g., low < medium < high) with a small number of possible values
# 3. Under the hood, `R` stores factors as integer vectors representing the order of the value (with character string labels)
# 4. I usually classify categorical variables as character data types, unless I want to limit entries to a few pre-specified values, or if I want to enforce a particular order to the category levels
unique(joined_dat$region) # check what values are in `region`
## [1] "PW" "SE" "IM"
joined_dat$region <-
factor(joined_dat$region,
levels = c("PW", "IM", "SE"), # specifies the factor levels and an order for them. If we omit this argument, the factor levels will be the unique set of values in the column, ordered alphabetically
labels = c("Pacific West", "Intermountain", "Southeast")) # relabels the levels with more descriptive names, keeping the same order as the `levels` argument
# Check the data
glimpse(joined_dat) # `region` is now a factor data type
## Rows: 1,919
## Columns: 7
## $ region <fct> Pacific West, Southeast, Intermountain, Intermountain, Pa~
## $ unit_type <chr> "National Historic Site", "National Historic Site", "Nati~
## $ unit_name <chr> "Manzanar National Historic Site", "Tuskegee Airmen Natio~
## $ unit_code <chr> "MANZ", "TUAI", "SAND", "WABA", "CECH", "WACO", "NPSA", "~
## $ year <dbl> 2000, 2003, 2010, 2002, 2013, 2015, 2002, 2015, 2015, 201~
## $ visitors <dbl> 38010, 11085, 4063, 15374, 8157, 20551, 1938, 614712, 326~
## $ gas_constant <dbl> 2.02, 2.01, 3.02, 1.75, 3.62, 2.45, 1.75, 2.45, 2.45, 2.4~
levels(joined_dat$region) # shows the (renamed) factor levels and their order
## [1] "Pacific West" "Intermountain" "Southeast"
str(joined_dat$region) # shows the values are coded as integers indicating factor level order
## Factor w/ 3 levels "Pacific West",..: 1 3 2 2 1 2 1 1 1 2 ...
summary(joined_dat) # the summary for `region` now counts the number of times each factor level occurs
## region unit_type unit_name unit_code
## Pacific West :527 Length:1919 Length:1919 Length:1919
## Intermountain:965 Class :character Class :character Class :character
## Southeast :427 Mode :character Mode :character Mode :character
##
##
##
## year visitors gas_constant
## Min. :2000 Min. : 0 Min. :1.750
## 1st Qu.:2004 1st Qu.: 57344 1st Qu.:2.320
## Median :2008 Median : 187397 Median :3.000
## Mean :2008 Mean : 602483 Mean :2.828
## 3rd Qu.:2012 3rd Qu.: 652166 3rd Qu.:3.610
## Max. :2015 Max. :10712674 Max. :3.800
We will be creating plots to see if trends in park visits from 2000 to 2015 differ by region or unit type. To avoid drawing improper conclusions from the data, make sure every park unit has one and only one data record for each year. Specifically, check for:
park units that are missing visitor data for one or more years between 2000 and 2015
park units that have more than one record for any year
Determining if park units have incomplete data is not as simple as filtering for NA’s because missing park unit-years are not represented by records with NA
for the visitor count–the missing data are simply not in the data frame.
We also cannot just count the number of records for each park unit and assume that if a park unit has 16 records then the data are fine. A park unit could, for example, be missing one year of data and have a duplicate record for another year. It would then have 16 data records but not one per year from 2000 to 2015.
We present one–but not the only–way to solve this issue.
# NOTES:
# 1. Check out `?expand_grid` for an alternative approach
# 2. The `table` data class holds a contingency table of counts for every combination of the supplied variables
# Run the code in pieces to see what the different lines do. For example, copy `table(uc = park_sub$unit_code, yr =park_sub$year)` and run it in the console to see the output. Then copy `table(uc = park_sub$unit_code, yr =park_sub$year) %>% as.data.frame()` and run it in the console to see the output. Repeat until the end of the code chunk.
remove_units <-
table(unit = joined_dat$unit_code, yr = joined_dat$year) %>% # build a contingency table of counts for every combination of `unit_code` and `year`
as.data.frame() %>% # convert from `table` to `data frame` class
dplyr::filter(Freq != 1) %>% # keep unit-year combinations whose count is not exactly 1 (missing or duplicated)
distinct(unit) # identify the corresponding park units
# Remove those park units from the data
viz_dat <- joined_dat %>%
dplyr::filter(!unit_code %in% remove_units$unit)
# View(viz_dat) to make sure the park units have been removed. The final data frame should have 1856 rows and 7 columns.
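As an alternative sketch (an assumption on our part, not the approach used above), the same incomplete or duplicated park units can be found with dplyr and tidyr instead of table():
remove_units_alt <- joined_dat %>%
  dplyr::count(unit_code, year) %>% # number of records for each unit-year combination
  tidyr::complete(unit_code, year, fill = list(n = 0)) %>% # add rows (with n = 0) for unit-years that are missing entirely
  dplyr::filter(n != 1) %>% # keep combinations with anything other than exactly one record
  dplyr::distinct(unit_code) # the park units to remove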
Save viz_dat as an R object (optional)
# If you want to save `viz_dat` to your current working directory, run the code below. This code saves `viz_dat` as an .RDS file, which just means it saves it as an `R` object instead of as a .csv file.
# When we save a data frame as an `R` object we retain the accompanying attribute information (e.g., the order of factor levels for `region`)--this information would not be saved in a .csv file.
saveRDS(viz_dat, "viz_dat.RDS")
# To read the RDS file from your working directory and assign it to `viz_dat`, use the code below. You can assign it to any name you want. If you're reading it from a location other than your working current directory, provide the file path in the function argument.
viz_dat <- readRDS("viz_dat.RDS")
Note that a report of our findings should explain how and why we selected this subset of data to work with.
Always do a final check of the data to look for missing data, duplicated records, funny numbers or category levels, etc. It helps to create summary tables that decompose the data in various ways. These tables can identify oddities (e.g., category levels with only one record) and also give us an idea of the sample sizes we’ll be working with if we create plots with subsets of the data.
CHALLENGE: Create a table (data frame) that shows the number of park units in each combination of region and unit_type. Extra credit if you can reformat the table to wide instead of long format.
The new data frame (if you have it in wide format) should look like this:
## # A tibble: 3 x 4
## unit_type `Pacific West` Intermountain Southeast
## <chr> <int> <int> <int>
## 1 National Historic Site 7 7 11
## 2 National Monument 9 34 7
## 3 National Park 16 18 7
We see from this table that we have data for 34 National Monuments in the Intermountain region and that the smallest number of park units in any region-type is 7. We’re ready to start plotting.
tab_count <- viz_dat %>%
dplyr::distinct(region, unit_type, unit_code) %>%
dplyr::count(region, unit_type) %>%
tidyr::pivot_wider(names_from = region, values_from = n)
# View(viz_dat) to double-check the numbers
# This section requires the `viz_dat` data frame created earlier in the module. If you don't have this data frame in the global environment, run the code below.
viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument
# Packages used in this section
library(tidyverse)
Why Base R?
Plots can reveal data issues that we had overlooked with our data checks. I sometimes use base R plotting functions to create quick-and-dirty plots for this purpose. If I want to do a lot of plot customizations, I find ggplot
functions easier to remember and use.
In addition to quick plots, base R plotting functions have another important use. Many R
packages have functions that create specific data visualizations when R
’s generic plot()
is applied to an object of a particular class. For example, lm()
is a statistical function that fits a general linear model to data and saves the model as an object of class lm
. Run the code below to see what happens when we apply plot()
to an object of class lm
.
# NOTE: `lm` is part of the `stats` package that is automatically installed with `R`. This code comes from the Examples section of `?lm`.
# Create fake data
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
# Simple linear regression of `weight` as a function of group
mod <- lm(weight ~ group) # assign model results to the name `mod`
summary(mod) # model summary
class(mod) # class of `mod` is `lm`
plot(mod) # residual plots
We aren’t going to discuss statistics or residual plots today, but it’s useful to know these common uses for plot()
.
Look in the help file for the R graphics package that is automatically installed with R
(in RStudio’s ‘Packages’ tab, click the hyperlink for the package called ‘graphics’). Here you will find many functions for base R graphics. For example, the boxplot()
function draws box plots.
Typically, we can feed data to most base R plotting functions in one of two ways:
Provide the data as a formula of the form y ~ x
where y
is a vector of data for the y-axis and x
is a vector of data for the x-axis. For example, plot(y ~ x, data = my_data)
or plot(my_data$y ~ my_data$x)
.
Provide the vectors as separate arguments. For example, plot(x = my_data$x, y = my_data$y)
When providing vectors as separate arguments, each vector must be a standalone vector. That is, we can’t just give column names and provide the data frame via a data argument.
With the non-formula approach, the argument names may not be x
and y
. For example, with bar plots, the argument name height
is used instead of y
. Refer to the help (e.g., ?barplot
) for each plotting function to see what arguments it accepts.
I tend to use the formula approach for feeding data to base R plotting functions, because it is more consistent across the different types of plots. In other words, I’m too lazy to look up what argument names to use with each plot type for the non-formula approach.
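A minimal sketch of the two approaches, using the built-in mtcars data rather than the park data:
plot(mpg ~ wt, data = mtcars) # formula interface: y ~ x, plus a `data` argument
plot(x = mtcars$wt, y = mtcars$mpg) # non-formula interface: standalone vectors for x and y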
Let’s use base R functions to create a box plot of park visitors by year, for a subset of viz_dat that meets these criteria:
year is 2000, 2005, 2010, or 2015
unit_type is National Monument
region is Intermountain
We want to know how much spread there is (across park units) in the number of visitors to National Monuments of the Intermountain region. We also want to know if visitor counts differed much among the four years.
# Check out `?boxplot`
# Subset the data
subdat <- viz_dat %>%
dplyr::filter(
year %in% c(2000, 2005, 2010, 2015),
unit_type == "National Monument",
region == "Intermountain")
boxplot(visitors ~ year, data = subdat) # the basic boxplot, with data provided as a formula
# An alternative way to feed the data as a formula
# boxplot(subdat$visitors ~ subdat$year) # no need for `data` argument
This plot is okay, but even for a basic plot I would want to make the y-axis labels more understandable. Because the visitor numbers are large, R
is using scientific notation for the y-axis labels. So instead of 200,000 we see 2e+05. Let’s change that in the box plot and add plot and axes titles.
# Save this boxplot to `p_box` so we can look at the underlying structure
p_box <- boxplot(visitors/1000 ~ year, # so now, 200,000 will be shown as 200 on the y-axis
data = subdat,
main = "National Monuments in the Intermountain Region", # plot title
xlab = "Year", # x-axis title
ylab = "Number of visitors, in 1000's") # y-axis title
Not bad for basic box plots. Now look at the data underlying the box plots. In ?boxplot, the section titled ‘Value’ explains what these numbers mean. For any box plot saved to an object, we can inspect these values directly, as we do below with p_box.
p_box
## $stats
## [,1] [,2] [,3] [,4]
## [1,] 3.131 2.919 4.3500 9.492
## [2,] 61.170 53.521 52.5660 53.165
## [3,] 112.611 103.570 103.8875 94.931
## [4,] 258.306 280.068 221.0830 222.723
## [5,] 550.657 505.158 470.9210 416.635
##
## $n
## [1] 34 34 34 34
##
## $conf
## [,1] [,2] [,3] [,4]
## [1,] 59.1935 42.18307 58.22483 48.98625
## [2,] 166.0285 164.95693 149.55017 140.87575
##
## $out
## [1] 789.131 848.348 622.320 830.253 525.831 578.554 827.247 793.601 497.506
## [10] 478.833 813.686 588.006
##
## $group
## [1] 1 1 2 2 3 3 3 4 4 4 4 4
##
## $names
## [1] "2000" "2005" "2010" "2015"
From the boxplots and numbers, it looks like visits were generally down in 2010 and 2015 compared to 2000 and 2005. The spread of visitor counts across these 34 National Monuments is huge, though.
Type ?par
in the console to see the many graphical parameters (arguments) that can be used to modify base R plots. Unless you plan to use base R functions to create complex plots, you can probably get by with remembering a small handful of useful graphical parameters, such as pch =
to change scatter plot point types, col =
to set color, and a variety of cex
arguments to change font size and point size.
Let’s create a basic scatter plot and use some of these graphical parameters to gussy up the plot.
CHALLENGE: Create a scatter plot showing the relationship between park visits (y-axis) and gas price (x-axis) for Arches National Park (unit code is ARCH).
We want to know if annual visitor counts at Arches National Park is correlated with gas price (annual national average).
The data for this plot should be the 16 records of data for Arches National Park (one data point per year). Assign this subset of data to the name ‘arch_dat’ because we will be using it again for later plots.
Make these modifications to the default scatter plot:
Show visitor count as 1000’s of visitors (i.e., divide the visitor count by 1000) as we did for the box plots.
Use the graphical parameter pch
to change the point type from the default open circle to a solid circle (see ?points
to find the number corresponding to a filled circle. The examples at the end of ?points
show how to use pch
as a plot argument)
Increase point size to 1.5 instead of the default of 1
The final plot should show this relationship:
# Subset the data
arch_dat <- subset(viz_dat, unit_code == "ARCH")
# Visitors vs. gas price
plot(visitors/1000 ~ gas_constant,
data = arch_dat,
pch = 19, # change point shape to a solid circle
cex = 1.5, # increase point size
main = "Annual Gas Prices vs Visitor Counts at Arches National Park, 2000 - 2015",
xlab = "Gas price (constant 2015 dollars/gallon)",
ylab = "Number of visitors, in 1000's")
This plot shows that annual visitor counts increased as gas prices increased. That’s not the relationship we might have expected to see. But correlation is not causation, and the resolution of data influences the pattern we see. Maybe if we had used average June - July gas prices within 330 km of Arches National Park (instead of annual national averages) we might have seen a different relationship with visitor counts. Maybe, maybe not.
The scatter plot shows another interesting bit of information. What is that funny point that is marching to its own beat?! Gas price was low (~2.50 per gallon in 2015 dollars) and the visitor count was really high, ~ 1.4 million. A quick look at the data shows that this funny point was for the year 2015.
It would be interesting to see if 2015 was a high visitor count year for other park units as well. We will save that question for ggplot.
For now, let’s see if we gain any additional insight from plots showing trends in visits to Arches National Park and trends in national gas prices. Try creating these plots on your own by modifying the code slightly.
# Visitors vs. year
plot(visitors/1000 ~ year, data = arch_dat, pch = 19, cex = 1.5, main = "Visits to Arches National Park, 2000 - 2015", xlab = "Year", ylab = "Number of visitors, in 1000's")
The plot above shows that from 2004 to 2015, visitor numbers at Arches National Park increased every year. Visitor counts REALLY jumped in 2014 and 2015.
# Gas price vs. year
plot(gas_constant ~ year, data = arch_dat, pch = 19, cex = 1.5, main = "Average National Gas Prices, 2000 - 2015", xlab = "Year", ylab = "Gas price (constant 2015 dollars/gallon)")
The plot above shows that gas prices have gone up and down between 2000 and 2015. There was a huge drop in gas price between 2008 and 2009 and again between 2014 and 2015.
These scatter plots give interesting insight on trends in national gas prices and trends in visits to Arches National Park. The pattern we saw in our initial scatter plot–positive correlation between annual visitor counts and national average gas price–was almost certainly a spurious correlation. It’s important to look at data from many different directions and to not jump to conclusions based on a narrow exploration of the data.
Line graphs are often used for showing changes in continuous data over time. Let’s make one last plot in base R before we move on to ggplot.
.
CHALLENGE: Re-create the plot of trends in visitor counts at Arches National Park, using a line graph instead of a scatter plot. If you’re feeling ambitious, make it a line graph with over-plotted points.
The final plot (with over-plotted points) should show this relationship:
# Line graph
plot(visitors/1000 ~ year, data = arch_dat, type = "o", main = "Visits to Arches National Park, 2000 - 2015", xlab = "Year", ylab = "Number of visitors, in 1000's") # can use `type = "o" to get line plots with points
# This section requires the `viz_dat` and `arch_dat` data frames created earlier in the module. If you don't have these data frames in the global environment, run the code below.
viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument
arch_dat <- subset(viz_dat, unit_code == "ARCH") # subset of data that is for Arches National park
# Packages used in this section
library(tidyverse)
library(ggthemes)
Why ggplot2?
If you search the Internet for information on plotting in R
, you will quickly learn that ggplot2
is the most popular R
package for plotting. It takes a little effort to learn how the pieces of a ggplot object fit together. But once you get the hang of it, you will be able to create a large variety of attractive plots with just a few lines of R
code.
Reference documents and cheatsheets can be very helpful while you are learning to use the ggplot2
package.
An aesthetic mapping describes how variables in the data frame should be represented in the plot. Any column of data that will be used to create the plot must be mapped to an aesthetic. Examples of aesthetics include x, y, color, fill, point size, point shape, and line type.
From the ‘arch_dat’ data we will map ‘visitors’ (in 1000’s) to the y variable and ‘year’ to the x variable. This will set up default axes and axes titles for the plot.
# NOTES:
# 1. The first argument provided to the `ggplot()` function is assumed to be the data frame, so it's not necessary to include the argument name `data =`.
# 2. We are assigning the plot to the name `p` so we can build on this `ggplot()` master template in the next step.
# 3. The parentheses around the line of code are a shortcut for `print()`. Without them, the plot would be assigned to `p` but not printed in the plots pane.
(p <- ggplot(data = arch_dat, aes(x = year, y = visitors/1000)))
summary(p) # summary of the information contained in the plot object
## data: region, unit_type, unit_name, unit_code, year, visitors,
## gas_constant [16x7]
## mapping: x = ~year, y = ~visitors/1000
## faceting: <ggproto object: Class FacetNull, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet, gg>
p$data # the plot object is self-contained. The underlying data frame is saved as a list element in the plot object.
## # A tibble: 16 x 7
## region unit_type unit_name unit_code year visitors gas_constant
## <fct> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Intermountain National Park Arches Nat~ ARCH 2015 1399247 2.45
## 2 Intermountain National Park Arches Nat~ ARCH 2014 1284767 3.4
## 3 Intermountain National Park Arches Nat~ ARCH 2013 1082866 3.62
## 4 Intermountain National Park Arches Nat~ ARCH 2012 1070577 3.8
## 5 Intermountain National Park Arches Nat~ ARCH 2011 1040758 3.75
## 6 Intermountain National Park Arches Nat~ ARCH 2010 1014405 3.02
## 7 Intermountain National Park Arches Nat~ ARCH 2009 996312 2.58
## 8 Intermountain National Park Arches Nat~ ARCH 2008 928795 3.61
## 9 Intermountain National Park Arches Nat~ ARCH 2007 860181 3.16
## 10 Intermountain National Park Arches Nat~ ARCH 2006 833049 3
## 11 Intermountain National Park Arches Nat~ ARCH 2005 781670 2.74
## 12 Intermountain National Park Arches Nat~ ARCH 2004 733131 2.32
## 13 Intermountain National Park Arches Nat~ ARCH 2003 757781 2.01
## 14 Intermountain National Park Arches Nat~ ARCH 2002 769672 1.75
## 15 Intermountain National Park Arches Nat~ ARCH 2001 754026 1.91
## 16 Intermountain National Park Arches Nat~ ARCH 2000 786429 2.02
The geometry layer determines how the data will be represented in the plot. We can generally think of this as the plot type. For example, geom_histogram() creates a histogram and geom_line() creates a line graph.
If we had set default data and aesthetic mappings, the geometry layer will use that information unless additional (overriding) data or aesthetics are set in the layer itself. If we had NOT set default data and aesthetics, they will need to be set within the layer. At a minimum we need to specify the data frame and which data frame columns hold the x
and (for most plot types) y
aesthetics. The ggplot object will use its own default values for any other aesthetics (e.g., color, size…) if not defined by the user.
Each geometry layer has its own set of required and accepted aesthetics. For example, geom_point()
(point plot) REQUIRES x
and y
aesthetics and ACCEPTS aesthetics that modify point color and size. It does NOT accept the linetype
aesthetic. We cover plot aesthetics in more detail in the section titled ‘Customizing a ggplot Object’.
# NOTE: We combine ggplot object layers with `+`, not with `%>%`. With a ggplot object we are not piping sequential results but rather layering pieces. In this example we are building our ggplot object in a particular order, but we could actually reorder the pieces in any way (as long as the master template is set first). The only difference would be that if different layers have conflicting information, information in later layers overrides information in earlier layers.
p + geom_line()
If we had NOT set default data and x
and y
aesthetics in the plot’s master template we could have set them within the geometry layer like this:
# Example of setting aesthetic mappings within the layer only
ggplot() +
geom_line(data = arch_dat, aes(x = year, y = visitors/1000))
We can include multiple geometry layers in a single ggplot object. For example, to overlay a point plot on the line graph we would add a geom_point()
layer. If we had set default data and aesthetic mappings, this additional geometry would use that information unless additional (overriding) information were set in the geom_point()
layer itself.
# This works because we had defined default data and aesthetics in `p`, so don't need to provide additional information to the geometry layers
p + geom_line() + geom_point()
The code below will NOT add the points layer because geom_point()
has no data or aesthetics defined, and we did not define default values in the ggplot object.
ggplot() +
geom_line(data = arch_dat, aes(x = year, y = visitors/1000)) +
geom_point()
layer()
function
Geometry layers can alternatively be specified with the generic plot layer()
function. Geometry layers are basically the layer()
function with some pre-defined (default) arguments. For example, geom_point()
is the same as layer(geom = "point", stat = "identity", position = "identity")
. See ?layer
for more information.
p + layer(geom = "line", stat = "identity", position = "identity") # this creates the same line graph we had created using `geom_line()`
scale_...() Functions to Fine-Tune Aesthetics
All aesthetic mappings can be fine-tuned with a scale_...() function. For example, the aesthetic mapping for x can be modified with scale_x_continuous() if the data are continuous, scale_x_discrete() if the data are categorical, and scale_x_date() if the data class is date/time. ggplot2 provides default scale arguments for all aesthetic mappings. If we don’t explicitly define a scale for an aesthetic, ggplot2 will use its default scale settings.
scale_y_continuous()
# NOTES:
# 1. Our plot code is getting long, so we will separate out the components to improve readability
# 2. We will assign this plot object to the name `p2` so we can just add to `p2` in the next step instead of re-typing all these lines
# 3. `ggplot2` also provides shortcuts for some scaling arguments. For example, if we just want to set y-axis limits we can add the layer `ylim (600, 1500)` instead of setting limits via `scale_y_continuous()`
# Check out `?scale_y_continuous` for ways to modify the y-axis
(p2 <- p +
geom_line() +
geom_point() +
scale_y_continuous(
name = "Number of visitors, in 1000's", # y-axis title
limits = c(600, 1500), # minimum and maximum values for y-axis
breaks = seq(600, 1500, by = 100)) # label the axis at 600, 700, 800... up to 1500
)
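The x-axis can be fine-tuned the same way. For instance, a small sketch building on p2 (not part of the original sequence):
p2 + scale_x_continuous(name = "Year", breaks = seq(2000, 2015, by = 5)) # label the x-axis at 2000, 2005, 2010, and 2015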
labs() to Specify Plot Labels
This useful layer allows us to set various plot labels, such as the plot title and subtitle, x- and y-axis titles, and alternative text and plot tags. Some of these labels can be set in other ways rather than via labs(). For example, we had already set the y-axis title in the first argument of scale_y_continuous().
If we want to remove a default label from the plot, we can set the argument to NULL. For example, labs(x = NULL)
removes the default x-axis title.
# NOTE: See `?ggtitle`, `?xlab`, and `?ylab` for alternative ways to set plot labels
(p3 <- p2 +
labs(
title = "Visitor Counts at Arches National Park, 2000 - 2015", # plot title
subtitle = "Visits to Arches National Park have increased each year since 2004", # plot subtitle
x = "Year") # x-axis title. We had already set the y-axis title with `scale_y_continuous()`.
)
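And to see labs(... = NULL) in action (a quick sketch), we can drop the subtitle we just added:
p3 + labs(subtitle = NULL) # removes the subtitle from the plot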
Use theme()
to customize all non-data components of the plot. Conveniently, the ggplot2
package provides complete themes that change several elements of the plot’s appearance at once. The default complete theme is theme_grey
. See ?theme
for all the ways we can modify the non-data plot elements. This help page also provides a hyperlink to a help page for complete themes (I’m sure there is a way to get there more directly, but I couldn’t figure it out).
ggplot2 complete themes
p3 +
theme_minimal() # try out different themes! theme_classic(), theme_void(), theme_linedraw()...
For more fun, install and load the ggthemes
package, which allows us to apply plot styles inspired by the Economist and the popular FiveThirtyEight website, among other choices.
p3 +
ggthemes::theme_fivethirtyeight()
In addition to setting complete themes, we can individually modify element themes. If the element themes layer is set AFTER the complete themes layer, the element themes will override any conflicting default arguments in the complete theme. Some of these element themes can alternatively be set in a scale_...() function. For example, to move the y-axis to the right side of the plot, we can use scale_y_continuous(position = "right") instead of theme().
It’s a bit difficult to understand the help page for ?theme. I usually resort to an Internet search to figure out how to set an element theme.
p3 +
ggthemes::theme_fivethirtyeight() +
theme(
plot.subtitle = element_text(color = "red", size = 14, hjust=0.5)) # change plot subtitle font color and size, and center the subtitle
We can use facet_grid()
or facet_wrap()
to split a ggplot object into multiple subplots, each representing a different group level. This is a nice way to simplify data-heavy plots with a single line of code.
With facet_grid we can specify different grouping variables for a row of plots versus a column of plots. For example, we can say that each row of subplots should represent a different region and each column of subplots should represent a different unit type. With facet_grid, facet column titles are horizontal at the top of each facet and facet row titles will be vertical to the right of each facet. With facet_wrap, subplots are “wrapped” in order from left-to-right and top-to-bottom, just like text is wrapped on a page. The facet titles are horizontal at the top of each facet.
# NOTES:
# 1. For this plot we need to use the `viz_dat` dataset so we can facet by park unit name
# 2. In the `facet_wrap` arguments, `ncol = 1` sets a single column so plots are stacked on top of each other and `scales = "free_y"` allows the y-axis range to differ across the subplots (this is useful if the range of y-axis values is very different across subplots and we are more interested in comparing trends than in comparing absolute numbers among the subplots)
subdat <- subset(viz_dat, unit_code %in% c("ARCH", "GRCA", "YOSE")) # plot data for Arches National Park (ARCH), Grand Canyon National Park (GRCA), and Yosemite National Park (YOSE)
ggplot(subdat, aes(x = year, y = visitors/1000)) +
geom_line() +
geom_point() +
labs(
title = "Annual Visitor Counts at Three National Parks, 2000 - 2015",
x = "Year",
y = "Number of visitors, in 1000's") +
theme_bw() +
theme(plot.title = element_text(hjust=0.5)) + # center the plot title
facet_wrap(vars(unit_name), ncol = 1, scales = "free_y")
We were wondering if 2015 was a very high year for visitor counts in parks other than Arches National Park. It looks like it was for Grand Canyon and Yosemite National Parks too!
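The example above uses facet_wrap(). For completeness, here is a hedged sketch (not one of the plots built in this module) of facet_grid(), with regions as rows and unit types as columns:
ggplot(viz_dat, aes(x = year, y = visitors/1000)) +
  geom_line(aes(group = unit_code), alpha = 0.3) + # one faint line per park unit
  facet_grid(rows = vars(region), cols = vars(unit_type), scales = "free_y") +
  labs(x = "Year", y = "Number of visitors, in 1000's") +
  theme_bw()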
These are not the ONLY ggplot components, but they are the ones most useful to know (IMHO)
Data - a data frame (or tibble). Specified as the first ggplot()
argument. Any aesthetics set at this level are part of the “master” template for the plot.
Aesthetic mapping - the variables represented in the plot (and how), e.g., x, y, color, fill, size, shape, line type. These aesthetics, except x & y, can also be set to a constant value by specifying them outside of aes(). Aesthetics can be defined in the master template or in a geometry layer.
Geometry layer - the plot/mark layers, e.g., geom_point, geom_line, geom_boxplot. Can specify multiple such layers in a plot.
Scales - used to fine-tune aesthetics. Takes the form ‘scale_xxx_yyy’, where ‘xxx’ refers to the aesthetic and ‘yyy’ to how the aesthetic values will be specified, e.g., discrete, continuous, manual. The associated legend for the aesthetic can also be modified here.
Labs - sets the plot labels, including plot title, subtitle, caption, and axis labels. Importantly, this is also where you would set alternative text and plot tags, for Section 508 Compliance.
Themes - complete themes (e.g., theme_bw, theme_classic) can be used to control all non-data display. Themes can also be used to more specifically modify aspects of the panel background, legend elements, label text size and color, etc.
Facets - subset the data by grouping variables and generate a separate subplot for each grouping variable, putting the graphs all on the same page.
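To see how these components fit together, here is a bare-bones skeleton (not a polished plot) using the mpg data set that ships with ggplot2:
ggplot(mpg, aes(x = displ, y = hwy, color = class)) + # data + aesthetic mapping
  geom_point() +                                      # geometry layer
  scale_color_brewer(palette = "Set2") +              # scale to fine-tune the color aesthetic
  labs(title = "Highway mileage vs. engine displacement",
       x = "Engine displacement (L)",
       y = "Highway miles per gallon") +              # labels
  theme_bw() +                                        # complete theme
  facet_wrap(vars(drv))                               # facets, one subplot per drive type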
# This section requires the `viz_dat` data frame created earlier in the module. If you don't have this data frame in the global environment, run the code below.
viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument
# Packages used in this section
# NOTE: This is a useful chunk of code that installs any missing packages and then loads all of them. Useful if we're sharing code with others who may not have some packages installed. Also increasingly useful if the list of packages required to run your script is somewhat long (I don't consider this a long list, but this code is handy to know.)
pkgs <- c("tidyverse",
"ggthemes", # nice plot themes
"GGally", # for stat_prop
"RColorBrewer", # to display Brewer palettes
"viridis", # for colorblind-friendly palettes
"scales", # breaks_pretty(), etc.
"plotly", # to make plots interactive
"patchwork") # for page layout
installed_pkgs <- pkgs %in% installed.packages() # check which packages are not yet installed
if (length(pkgs[!installed_pkgs]) > 0) install.packages(pkgs[!installed_pkgs],dep=TRUE) # if some packages are not yet installed, go ahead and install them...
lapply(pkgs, library, character.only = TRUE) # ...then load all the packages and their dependencies
Although the ggplot2 defaults produce decent plots, we can customize nearly every aspect of a ggplot object. In this section of the data visualization module we will practice building and customizing ggplot objects.
The format of this section is that, for each plot type, we list the ggplot2 coding elements that will be introduced (we introduce a lot of new elements with histograms, so you can use them with subsequent plots). We also include Challenge Questions that ask you to figure out some coding on your own. These are coding challenges that should be “do-able” at this stage of learning, and will help you hone your skills of searching the Internet and R help documents for answers to your R-related questions–an absolute MUST for continuing to improve your coding skills after this training is done.
Embrace the challenge. Happy coding!
Create a histogram to show the distribution of annual visits in the year 2015, by NPS unit. This plot will give us an idea if the distribution of visitor counts across NPS units is highly skewed, and if visitor distributions were similar across different unit types in 2015.
Learn to: build a ggplot() frequency histogram, density histogram, and dot plot.
General histogram suggestions: geom_density() can be a useful alternative for comparing multiple histograms with different sample sizes, if the actual counts per bin are not of interest.
Subset the data for this plot and create a basic histogram for a first look at the data.
# Subset the data, only keeping data for the year 2015
# To improve readability, show visitor count as number of visitors in 1000's. This conversion can be done during initial data formatting, or can be calculated directly in the `ggplot()` function. I'm doing the conversion in the data wrangling stage so I don't have to keep retyping it in the plot iterations.
dat_2015 <- subset(viz_dat, year == 2015) %>%
dplyr::mutate(visitors_k = visitors/1000)
head(dat_2015) # always check the data before plotting
## # A tibble: 6 x 8
## region unit_type unit_name unit_code year visitors gas_constant visitors_k
## <fct> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Pacifi~ National ~ Crater La~ CRLA 2015 614712 2.45 615.
## 2 Pacifi~ National ~ Olympic N~ OLYM 2015 3263761 2.45 3264.
## 3 Interm~ National ~ Big Bend ~ BIBE 2015 381747 2.45 382.
## 4 Pacifi~ National ~ Craters o~ CRMO 2015 246826 2.45 247.
## 5 Interm~ National ~ El Malpai~ ELMA 2015 174433 2.45 174.
## 6 Southe~ National ~ Everglade~ EVER 2015 1077427 2.45 1077.
# Basic histogram
ggplot(dat_2015, aes(x = visitors_k)) + # specify the variable to plot
geom_histogram()
# Compare the plot to the data to make sure things look about right
This histogram has a really long, discontinuous tail. It looks like a very small number of NPS units had much higher visitor counts in 2015 than the bulk of the units did. Check the underlying data to make sure this observed pattern is correct. For now, don’t worry too much about the message that says we are using a default of bins = 30. This message is reminding us that we should thoughtfully select an appropriate bin size for a histogram instead of relying on a default. If we explicitly set the number of bins to 30 (geom_histogram(bins = 30)), the message goes away, even though we produce the same histogram. For highly skewed data like we have here, typical rules-of-thumb for the appropriate number of bins may not apply.
We can split out the unit types in this histogram by assigning a different fill to each unit type or by faceting the histogram by unit type. When we map the ggplot()
fill aesthetic to a categorical variable, ggplot()
automatically maps that variable to a grouping aesthetic as well. That is, each category level will be represented by a separate histogram (with a different fill) in the plot.
The difference between the fill and color aesthetics in ggplot2 is that fill changes the color of the inside of a point or bar, and color changes the outline.
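A quick illustration of the difference, reusing the dat_2015 data from above (both values set as constants, outside aes()):
ggplot(dat_2015, aes(x = visitors_k)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "darkblue") # fill = inside of the bars, color = bar outlines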
Let’s try it both ways (fill
aesthetic and faceting) and also compare the frequency histogram to a density histogram and dot plot.
First try – use different fill colors for different unit types.
# NOTES:
# 1. We can set aesthetics to constant values by excluding the `aes()` function. For example, `aes(fill = unit_type)` groups and fills histograms based on the `unit_type` variable (different fill color for each unit type). In contrast, `fill = "red"` creates a single histogram that is red.
# 2. Instead of adding the `fill` aesthetic to the master template, we could have added it directly in the `geom_histogram()` layer
ggplot(dat_2015, aes(x = visitors_k, fill = unit_type)) +
geom_histogram(alpha = 0.3, color = "black") # histograms for the different unit types overlap, so apply transparency (alpha = 0.3) and add a black outline on the bars so we can more easily distinguish them in the histograms.
Well, transparency didn’t help much here. We can’t really see what the distribution for National Historic Sites looks like. Try faceting. Save the faceted frequency histogram to a variable named hist1
so we can look at the underlying summary counts in a later step.
# Facet wrap
(hist1 <- ggplot(dat_2015, aes(x = visitors_k)) +
geom_histogram() +
facet_wrap(vars(unit_type), ncol = 1)) # facet by unit type, put all plots in a single column
# Now try using `facet_grid()` instead to see what the difference is
Compare to density histograms and dot plots.
# Density histogram
ggplot(dat_2015, aes(x = visitors_k)) +
geom_density() + # <<< only changed this line
facet_wrap(vars(unit_type), ncol = 1)
# Dot plot
ggplot(dat_2015, aes(x = visitors_k)) +
geom_dotplot() + # <<< changed this line
facet_wrap(vars(unit_type), ncol = 1)
The dot plot is actually nicely informative (though it could use tweaking), given the moderately small sample size of our data set. But the reader is forced to manually count the taller bin stacks, which can be slower than reading the y-axis of the frequency histogram. We could do some hack coding to change the y-axis labels to show counts instead of density, but instead let's customize the frequency histogram so it is more readable and informative.
Look at the count summaries used to build hist1
View(ggplot_build(hist1)$data[[1]])
y | count | x | xmin | xmax | density | ncount | ndensity | flipped_aes | PANEL | group | ymin | ymax | colour | fill | size | linetype | alpha |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20 | 20 | 0.0000 | -184.6333 | 184.6333 | 0.0021665 | 1.00 | 1.00 | FALSE | 1 | -1 | 0 | 20 | NA | grey35 | 0.5 | 1 | NA |
1 | 1 | 369.2666 | 184.6333 | 553.8999 | 0.0001083 | 0.05 | 0.05 | FALSE | 1 | -1 | 0 | 1 | NA | grey35 | 0.5 | 1 | NA |
2 | 2 | 738.5332 | 553.8999 | 923.1666 | 0.0002166 | 0.10 | 0.10 | FALSE | 1 | -1 | 0 | 2 | NA | grey35 | 0.5 | 1 | NA |
0 | 0 | 1107.7999 | 923.1666 | 1292.4332 | 0.0000000 | 0.00 | 0.00 | FALSE | 1 | -1 | 0 | 0 | NA | grey35 | 0.5 | 1 | NA |
2 | 2 | 1477.0665 | 1292.4332 | 1661.6998 | 0.0002166 | 0.10 | 0.10 | FALSE | 1 | -1 | 0 | 2 | NA | grey35 | 0.5 | 1 | NA |
0 | 0 | 1846.3331 | 1661.6998 | 2030.9664 | 0.0000000 | 0.00 | 0.00 | FALSE | 1 | -1 | 0 | 0 | NA | grey35 | 0.5 | 1 | NA |
0 | 0 | 2215.5997 | 2030.9664 | 2400.2330 | 0.0000000 | 0.00 | 0.00 | FALSE | 1 | -1 | 0 | 0 | NA | grey35 | 0.5 | 1 | NA |
In each record, y (same as count) tells us the height of the histogram bar, xmin and xmax tell us the x-axis boundaries of that bar, PANEL tells us which facet plot the record belongs to, and fill shows the fill color applied.
We can use the ggplot_build
function to examine the information underlying any ggplot object. In this case, we were only interested in the count summaries so we only viewed the data[[1]]
list element.
Notice that the first bin goes from -185 to +185. That’s due to geom_histogram()
’s default binning (and bin centering) behavior. We could use the boundary
and/or center
arguments in geom_histogram()
to change the bin centering.
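For example, here is a hedged sketch that anchors the bins at zero, so the first bin runs from 0 to 200 instead of being centered on 0:
ggplot(dat_2015, aes(x = visitors_k)) +
  geom_histogram(binwidth = 200, boundary = 0) + # `boundary = 0` places a bin edge exactly at 0
  facet_wrap(vars(unit_type), ncol = 1)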
Let’s customize the bin sizes and x-axis labels. Save the result as hist2
.
(hist2 <- ggplot(dat_2015, aes(x = visitors_k)) +
geom_histogram(binwidth = 200) + # each bin now spans 200k visitors
scale_x_continuous(breaks = seq(0, 11000, by = 2000)) + # label the x-axis from 0 to 11,000 at intervals of 2000
facet_wrap(vars(unit_type), ncol = 1)
)
# Use `ggplot_build()` to see the new histogram bins
# This is not the only way, but we are going to add new data columns that include the unit type median and a facet label.
# For alternative ideas, check out the `labeller` argument in `facet_wrap()`
dat_2015_label <- dat_2015 %>%
group_by(unit_type) %>%
mutate(N = n(),
type_med = round(median(visitors_k)),
type_label = paste0(unique(unit_type)," (N = ", N, " NPS units; median = ", type_med, "k visitors)")) # `paste` and `paste0` are nice functions for combining strings and variables
# View(dat_2015_label) # look at the new columns
ggplot(dat_2015_label, aes(x = visitors_k)) +
geom_histogram( binwidth = 200) +
scale_x_continuous(breaks = seq(0, 11000, by = 2000)) +
geom_vline(aes(xintercept = type_med)) + # add median lines
facet_wrap(vars(type_label), ncol = 1) # we are faceting by `type_label`, which splits the data out the same way as `unit_type`. This is not the only way to customize facet labels.
CHALLENGE: Customize a final plot however you wish and save it as hist3
. Then use the R package patchwork
(make sure you have it installed and loaded) to arrange hist1
and hist3
into a single graphic.
In the graphic, arrange the two histograms side-by-side and label them (A) and (B). Give hist3
more room (a wider column) in the graphic so x-axis labels are not squished. Add a title to the graphic.
Here is an example of a final histogram.
Here is an example of two histograms combined in a single graphic.
# Final histogram
(hist3 <- ggplot(dat_2015_label, aes(x = visitors_k)) +
geom_histogram(fill = "lightblue", color = "black", binwidth = 200) +
scale_x_continuous(breaks = seq(0, 11000, by = 2000)) +
geom_vline(aes(xintercept = type_med), color = "red", size = 1, linetype = "dashed") +
labs(x = "Number of visitors, in 1000's", y = "Number of NPS units", title = "Distribution of Visitor Counts (Yr 2015)",
subtitle = "Red dashed line shows median count") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
facet_wrap(vars(type_label), ncol = 1)
)
# Combine two histograms in a single graphic
(hist_page <- hist1 + hist3 +
patchwork::plot_layout(ncol = 2, widths = c(1, 2)) +
patchwork::plot_annotation(
title = "Basic (A) and Customized (B) Histograms",
tag_prefix = "(",
tag_levels = "A",
tag_suffix = ")"
)
)
We are going to stop here with the histograms, but it’s useful to think about ways we can further improve our plots.
It would help to add x-axis labels on every facet, not just to the bottom facet. I found an easy way to do this with the lemon
package, but it caused problems with the median lines (showed all three median lines on each facet). I also found some complicated coding solutions, but would prefer an easy fix. So this is something I will have to tinker around with when I have more time (ha!). In general, I try to avoid relying on “new-ish” packages (like lemon
) unless I think the package will likely be maintained and updated into the future (e.g., Hadley’s packages).
In these histograms, the really long tail in the national parks panel makes it difficult to see differences in median counts for the three unit types. Depending on the key message we want to convey with the figure, we may consider an alternative plot (perhaps setting the upper x-axis limit to 6000 and adding an annotation with GRSM's visitor count and the reason it isn't shown).
Look at your final plot and think about ways to improve it (better yet, actually write the code to do so!)
Create a bar chart to see if regions differ much in % visitor counts (out of the regional total for that year) across the three park unit types. Compare results for year 2000 vs. 2015, and facet by region.
Learn to: build ggplot() stacked and side-by-side bar charts.
General bar chart suggestions: look at ?geom_bar to understand the difference between geom_bar() and geom_col().
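In short, geom_bar() counts the rows for you, while geom_col() plots values you have already calculated. Here is a small sketch of the geom_col() side of that comparison, using the mpg data set that ggplot2 provides (introduced just below):
class_counts <- dplyr::count(mpg, class) # pre-calculate the counts ourselves
ggplot(class_counts, aes(x = class, y = n)) +
  geom_col() # plots the `n` values as given; geom_bar() would have done the counting internally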
The code below is copied directly from the first example in ?geom_bar. Run this code, then View() the mpg data set to try to figure out how the data were used to create the resulting bar chart. The mpg data are automatically provided with the ggplot2 install. Check out ?mpg for details.
# geom_bar is designed to make it easy to create bar charts that show
# counts (or sums of weights)
g <- ggplot(mpg, aes(class))
# Number of cars in each class:
g + geom_bar()
# Total engine displacement of each class
g + geom_bar(aes(weight = displ))
# Map class to y instead to flip the orientation
ggplot(mpg) + geom_bar(aes(y = class))
# NOTE: We can also use `coord_flip()` to flip the axes
Let’s try building a bar chart with our data. We are going to start off by doing all of the data wrangling within geom_bar()
instead of wrangling a plot-specific data set. It’s useful to know how to wrangle within ggplot2
functions because it comes in handy sometimes.
NOTE: In the bar chart we will map the x
aesthetic to year
as a factor data class (instead of numeric). This isn’t absolutely necessary, but will avoid errors later when we build on this basic bar chart. Bar charts are not meant to display two continuous variables, so with as.factor(year)
we are making it clear that year
should be considered a categorical variable.
# Stacked bar chart, using the handy `weight` argument
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
aes(x = as.factor(year), fill = unit_type)) +
geom_bar(aes(weight = visitors)) + # we don't need to divide visitors by 1000 for this one because we will be converting to proportions anyway
facet_grid(cols = vars(region))
Notice that the order of levels in the legend is alphabetical from top to bottom, and the order of colors in the stacked bar chart matches the legend order. If unit_type were a factor data type (instead of character), then the legend and chart colors would follow the factor level order (remember that we had manually set factor level orders for region).
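As a hedged sketch, if we wanted the legend and stack order to follow a custom order, we could convert unit_type to a factor first (the level order below is just an example, and assumes the level names match the data values exactly):
viz_dat_fct <- viz_dat %>% # hypothetical copy, so we don't overwrite viz_dat
  dplyr::mutate(unit_type = factor(unit_type,
                                   levels = c("National Park", "National Monument", "National Historic Site")))
# Rebuilding the bar chart from viz_dat_fct would then order the legend and the stacks by these levels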
Instead of wrangling the data set to calculate group proportions, we can convert count data to proportion data by simply changing the position
setting in geom_bar()
.
# NOTE: Setting the `position` argument to "dodge" would change the stacked bar to a side-by-side bar chart
# Set the `position` argument to "fill" to convert the counts to proportions
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
aes(x = as.factor(year), fill = unit_type)) +
geom_bar(aes(weight = visitors), position = "fill") + # <<< changed this line
facet_grid(cols = vars(region))
The RColorBrewer and viridis packages have many popular color palettes for sequential, diverging, and qualitative data. The viridis palettes are known for being colorblind-friendly. RColorBrewer has a subset of palettes that is colorblind-friendly.
Apply one of the colorblind-friendly palettes to the bar chart.
# These are colorblind-friendly palettes in `RColorBrewer`
RColorBrewer::display.brewer.all(n=NULL, type="all", select=NULL, exact.n=TRUE, colorblindFriendly=TRUE)
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
aes(x = as.factor(year), fill = unit_type)) +
geom_bar(aes(weight = visitors), position = "fill") +
scale_fill_brewer(palette = "Set2") + # pick a palette to apply
facet_grid(cols = vars(region))
To manually assign colors to each variable:
# Google search a cheat sheet of "R ggplot2 colors" to see hundreds of named colors to choose from. Manually set fill for each unit type by replacing `scale_fill_brewer()` with:
scale_fill_manual(values = c("National Historic Site" = "cadetblue", "National Monument" = "mediumpurple2", "National Park" = "burlywood1"))
To create your own palette of colors, and call it from the plot…
# A colorblind-friendly palette, with colors identified by hexadecimal code instead of by name (you can replace these with color names)
my_colors <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
scales::show_col(my_colors) # the `scales` package has a `show_col()` function to display the colors corresponding to a hexadecimal code
# To use your custom palette, replace `scale_fill_brewer()` with:
scale_fill_manual(values = my_colors) # make sure the number of colors in your custom palette is greater than or equal to the number of levels in your grouping variable (in our case, 3 unit types--so unit types (alphabetically or by factor level order) will be assigned the first 3 colors in the custom palette)
A potential problem with stacked bar charts (depending on the message we want to convey) is that it can be difficult to compare bars for a grouping level that is not anchored at the bottom of the stack. For example, it may be hard to tell if the proportion of visitors to National Monuments in the Southeast was higher or lower in 2015 compared to 2000. That is because the lower boundaries of the National Monuments bar segments have different starting points. If it's important for our plot to convey this kind of information to the reader, one option would be to label each bar segment with the % of the bar it represents.
Below is some pseudo-complicated code to add bar segment labels as %. Ultimately, an easier option (at least, easier to remember) would have been to wrangle the data with columns that calculate % visitors grouped by region, unit type, and year. Then use geom_col()
to just show the pre-calculated percentages, and geom_text()
to add those pre-calculated values as labels.
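Before that, here is a rough sketch of the easier-to-remember alternative just described--pre-calculating the percentages and plotting them with geom_col() and geom_text(). The object names are just illustrative; the GGally-based version follows.
bar_dat <- viz_dat %>%
  dplyr::filter(year %in% c(2000, 2015)) %>%
  dplyr::group_by(region, year, unit_type) %>%
  dplyr::summarize(visitors = sum(visitors), .groups = "drop") %>%
  dplyr::group_by(region, year) %>%
  dplyr::mutate(pct = visitors/sum(visitors)) %>% # % of each region x year total
  dplyr::ungroup()

ggplot(bar_dat, aes(x = as.factor(year), y = pct, fill = unit_type)) +
  geom_col(position = "fill") +
  geom_text(aes(label = scales::percent(pct, accuracy = 1)),
            position = position_fill(vjust = 0.5), size = 3.5) +
  facet_grid(cols = vars(region))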
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
aes(x = as.factor(year), fill = unit_type, by = as.factor(year))) + # <<< this `by =` argument is required for the `stat` argument in `geom_text()` to calculate proportion
geom_bar(aes(weight = visitors), position = "fill") +
geom_text(stat = "prop", fontface = "bold", size = 3.5, aes(weight = visitors), position = position_fill(0.5)) + # <<< this line creates the bar segment labels. `stat = "prop"` requires package `GGally`
scale_fill_brewer(palette = "Set2") +
facet_grid(cols = vars(region))
CHALLENGE: Finalize your bar chart and save the resulting plot as a .pdf file.
The final bar chart should convert the y-axis labels to percentages instead of proportions (e.g., 50% instead of 0.50) and should include appropriate plot labels. Feel free to add other personal tweaks as well! Then save your ggplot object as a .pdf file in your working directory (test your Internet search skills!).
Here is an example of a final bar chart. Yours may look similar (or not!)
# Assign the final bar chart to the name 'final_bar' so we can save it as a .pdf file in the next step
(final_bar <-
ggplot(subset(viz_dat, year %in% c(2000, 2015)),
aes(x = as.factor(year), fill = unit_type, by = as.factor(year))) +
geom_bar(aes(weight = visitors), position = "fill") +
labs(fill = "NPS unit type", title = "Proportion of visits in each NPS unit type and region", subtitle = "Comparison of Years 2000 & 2015", y = "% of visits", x = "Year") +
scale_y_continuous(labels = scales::percent) + # <<< this line converts proportion to % on the y-axis
geom_text(stat = "prop", fontface = "bold", size = 3.5, aes(weight = visitors), position = position_fill(0.5)) + # <<< this line creates the bar segment labels. `stat = "prop"` requires package `GGally`
scale_fill_brewer(palette = "Set2") +
ggthemes::theme_pander() +
theme(plot.margin = margin(rep(15, 4)), # increase white margin around plot. This particular theme needs it, others may not.
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
facet_grid(cols = vars(region))
)
# And here is the code to save it as a .pdf file
ggsave(filename = "final_bar.pdf", plot = final_bar)
# NOTE: `ggsave()` uses the file extension to guess at the file type to save as. There are function arguments to set figure width and height (and the associated measurement unit). The default is to save the image based on how we manually sized it in the RStudio plots plane. When we save the image this way, we will see a message in the console about the width and height of the saved image. For reproducibility, it's usually a good idea to hard code the output figure dimensions (if you plan to reuse or share a script)
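For example, a version with hard-coded dimensions might look like this (the numbers are just illustrative):
ggsave(filename = "final_bar.pdf", plot = final_bar, width = 10, height = 6, units = "in")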
Spend some time thinking about how you might improve your final bar chart. Here are a few things to consider.
Regions differ in the numbers and relative proportions of unit types. For example, the Intermountain region has very few Historic Sites, and that fact certainly contributes to the very low proportion of visits to Historic Sites in that region. Depending on what message we want to convey with these bar charts, it may be more appropriate to summarize the data in a different way (for example, comparing average number of visits per unit, across unit types).
As noted above, it can be difficult to compare bars for a grouping level that is not anchored at the bottom of the stack (e.g., whether the proportion of visitors to National Monuments in the Southeast was higher or lower in 2015 than in 2000). If it's important for our plot to convey this kind of information, one option would be to label each bar segment with the % of the bar it represents, and perhaps also the actual number of visits.
Create a line graph to show trends in visitor counts, separately for each unit type and region. Do not include Southeast region data in this figure.
Learn to: build a faceted line graph and make it interactive with ggplotly().
Subset the data for this plot and create a basic faceted line graph for a first look at the data.
# Subset the data
line_dat <- viz_dat %>%
dplyr::filter(region != "Southeast") %>%
dplyr::mutate(visitors_k = visitors/1000)
ggplot(line_dat, aes(x = year, y = visitors_k)) +
geom_line() +
facet_grid(rows = vars(unit_type), cols = vars(region))
CHALLENGE: This one’s all yours! Create a final line graph that is interactive.
After you fix the mis-connected lines issue, learn about and apply two functions in your final line graph:
breaks_pretty() from the scales package automatically sets attractive axis breaks–use this function to improve the spacing of your y-axis labels.
ggplotly() from the plotly package makes it super easy to make a ggplot object interactive.
Make sure you have scales and plotly installed and loaded before you begin.
Here is an example of a final line graph. Feel free to personalize your graph with your own favorite theme and customizations.
# NOTES:
# 1. If we had mapped unit_code to a color or linetype aesthetic we would not have needed to separately specify the group aesthetic
# 2. `ggplotly` doesn't support subtitles created in a ggplot object. To add subtitles to a ggplotly object, we can hack the `ggplotly` title text feature, or use `ggplotly` annotations (but then we have to specify position of text on plot)
(final_line <- ggplot(line_dat, aes(x = year, y = visitors_k, group = unit_code)) + # the group aesthetic tells ggplot to draw a separate line per unit; without it the lines are mis-connected. Mapping unit_code to a color or linetype aesthetic would also set the group aesthetic automatically.
geom_line() +
labs(x = "Year", y = "Number of visitors, in 1000's") +
scale_y_continuous(breaks = scales::breaks_pretty()) +
facet_grid(rows = vars(unit_type), cols = vars(region)) + # if we free the y scales, facet_grid frees them by row (unit type) but keeps them the same within a row--a good way to decide what should be a facet row vs. column
ggthemes::theme_fivethirtyeight(base_size = 14) +
theme(plot.margin=unit(c(2,1,1,1), 'cm')) # needed extra space at top to fit ggplotly hack subtitle
) %>%
ggplotly() %>%
layout(title = list(
text = paste0("Trends in visitor counts (in 1000's), by NPS unit type and region", '<br>', '<sup>', 'Each line represents an NPS unit', '</sup>')))
This final line graph is a very basic use of the interactive side of the R plotting toolkit. It can be greatly improved, and we'll leave it at that.
Create a matrix heat map to examine differences in annual visitor counts among National Parks and trends over time.
Learn to: build a heat map with geom_tile() and customize its fill scale and cell labels.
Subset the data for this plot and create a basic matrix heat map for a first look at the data.
# Subset the data, only keeping data for National Parks
# Instead of showing visitors in 1000's we will show visitors in millions because annual National Park visitor counts are generally higher
tab_visits <- viz_dat %>%
dplyr::filter(unit_type == "National Park") %>%
dplyr::mutate(visitors_m = visitors/1000000,
year_f = as.factor(year))
head(tab_visits) # always check the data before plotting
## # A tibble: 6 x 9
## region unit_type unit_name unit_code year visitors gas_constant visitors_m
## <fct> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Pacifi~ National~ Crater Lak~ CRLA 2015 614712 2.45 0.615
## 2 Pacifi~ National~ Olympic Na~ OLYM 2015 3263761 2.45 3.26
## 3 Interm~ National~ Big Bend N~ BIBE 2015 381747 2.45 0.382
## 4 Southe~ National~ Everglades~ EVER 2015 1077427 2.45 1.08
## 5 Southe~ National~ Great Smok~ GRSM 2015 10712674 2.45 10.7
## 6 Pacifi~ National~ Redwood Na~ REDW 2015 527143 2.45 0.527
## # ... with 1 more variable: year_f <fct>
# Starting plot
ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) +
geom_tile()
# Try setting `x = year` instead of `x = year_f`, to see what changes in the heat map
# Compare the plot to the data to make sure things look about right
The matrix cell colors all run into each other. To make the individual cells more obvious, we can add a white border around each cell (the colour argument in geom_tile()). We also don’t get much color range with many shades of a single color. It’s easier to see differences among heat map values if we use a palette with a wider color range, e.g., a palette that blends two different colors.
ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) +
geom_tile(colour = "white", size = 0.3) + # <<< white cell borders, slightly thicker than the default
scale_fill_viridis_c() # <<< colorblind-friendly and a larger range of colors
What patterns do we see in the heat map so far? With the viridis
palette it’s a bit easier to see that there are a few National Parks with really high visitor counts– Great Smoky Mountains and Grand Canyon National Parks, for example.
If we are not just trying to show “big picture” patterns in the heat map, but want the reader to be able to distinguish smaller differences, it is useful to label the heat map cells with their actual values. Before we do that here, we will also reverse the color order so the lighter colors correspond with smaller values (to me, that order seems to make more sense).
ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) +
geom_tile(colour = "white", size = 0.3) +
scale_fill_viridis_c(direction = -1) + # <<< Reverse color order
geom_text(aes(label=round(visitors_m, 1)), size = 3) # <<< add the cell labels, rounded to one digit. Adjust font size.
With the numbers added, we can now more clearly see that annual visitor counts increased at some parks (e.g., Arches National Park) over time, and declined at others (e.g., Mammoth Cave National Park). Those patterns aren’t that easy to discern with the colors alone.
It’s difficult to see black numbers against the darker colors. One way to fix this is to set the text color conditional on the matrix value.
ggplot(tab_visits, aes(y = unit_name, x = year_f, fill = visitors_m)) +
geom_tile(colour = "white", size = 0.3) +
scale_fill_viridis_c(direction = -1) +
geom_text(aes(label=round(visitors_m, 1), color = visitors_m > 7), size = 3) + # <<< make color conditional on value. Choose a value that makes sense for the color palette you use.
scale_color_manual(values = c("black", "white")) # set color of TEXT as FALSE(0) = black (for <=7) and TRUE(1) = white (for >7)
This is a decent start. The rest is up to you!
CHALLENGE: Create a final matrix heat map.
See how many of these Challenge tasks you can figure out.
SUPERSTAR CHALLENGES!!!
SUPERSTAR 1. Facet by region and fix any formatting problems that you get (the problems will be obvious)
SUPERSTAR 2. Change the Park name labels so they say “NP” instead of “National Park” (e.g., “Yosemite NP” instead of “Yosemite National Park”) – probably easiest to do this with the initial data wrangling, before the data are provided to ggplot()
Here is an example of a final matrix heat map with all the challenges met.
With this final heat map it’s easy to see regional patterns in visitor counts. In all regions there are a few National Parks with really high visitor counts. The Intermountain region has many National Parks that typically exceed 2 million visitors per year.
tab_visits <- viz_dat %>%
dplyr::filter(unit_type == "National Park") %>%
dplyr::mutate(
visitors_m = visitors/1000000,
year_f = as.factor(year),
unit_name_abbrev = gsub("National Park", "NP", unit_name)) # <<<< change to "NP". Could have also used `sub()`
head(tab_visits) # Check the results
(final_heat <- ggplot(tab_visits, aes(y = unit_name_abbrev, x = year_f, fill = visitors_m)) +
geom_tile(colour = "white", size = 0.3) +
scale_y_discrete(limits=rev) + # <<<< reverse y-axis order
scale_fill_viridis_c(direction = -1, name = "Visitors, in millions", breaks = seq(2,10, by = 2)) + # <<< add legend name, set legend breaks
geom_text(aes(label=round(visitors_m, 1), color = visitors_m > 7), size = 3, show.legend = FALSE) + # <<< turn off text legend. Other ways to do it include `guides(color = "none")`
scale_color_manual(values = c("black", "white")) +
labs(x = "Year", y = "NPS Unit", title = "Trends in Annual Visitor Counts by NPS Unit, 2000 - 2015", subtitle = "Higher visitor counts shown as darker cells") +
theme_minimal(base_size = 12) + # <<< my theme of choice
theme(legend.position = "top", # <<<< move legend to top
legend.key.width = unit(2.5, "cm")) + # <<< make legend longer
ggforce::facet_col(vars(region), scales = "free_y", space = "free") # <<< fix facet format. Could also do facet_grid (without `ggforce`) but I like the fact that `facet_col` keeps the facet labels on top
)
Matrix heat maps are great for showing a lot of information in a highly condensed figure, but they have their limits. Some thoughts on our final heat map:
This isn’t a great way to convey a message about trends in visitor counts at an individual National Park. The much higher counts at some National Parks (GRSM!) mean that the plot highlights differences among National Parks more than it highlights changes over time (which were a much smaller difference). Check out the oob_squish() function from the scales package as a possible fix for this problem (see the sketch after this list).
We could make it easier to identify the most visited National Parks per region by ordering the data within region so National Parks with the highest average annual visitor counts are at the top, and the lowest are at the bottom (also sketched below). Add a caption to explain the ordering.
As before, it would be useful to add x-axis labels (year) on every facet instead of just at the bottom of the page.
Note that we have zeros in some of the cells because we showed counts as millions of visitors, rounded to only one digit past the decimal. We could fix this issue by rounding to two digits, but that would clutter the heat map even more than it already is.
A potential problem is that readers may think the different text colors have some special meaning other than just for increasing visibility. Not sure how to deal with that.
Font sizes are pretty small. Perhaps we should split this into three separate heat maps (by region) and use larger font sizes.
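Here is a rough sketch combining the first two ideas above--squishing the fill scale and ordering parks by their average annual visitor count. The 6-million cutoff is just an example, and the object name is illustrative.
tab_visits_ord <- tab_visits %>%
  dplyr::mutate(unit_name_abbrev = forcats::fct_reorder(unit_name_abbrev, visitors_m, .fun = mean)) # order parks by mean annual visitors

ggplot(tab_visits_ord, aes(y = unit_name_abbrev, x = year_f, fill = visitors_m)) +
  geom_tile(colour = "white", size = 0.3) +
  scale_fill_viridis_c(direction = -1, limits = c(0, 6),
                       oob = scales::oob_squish) + # values above 6 million are "squished" to the darkest color
  labs(caption = "Parks ordered by mean annual visitor count within each region") +
  ggforce::facet_col(vars(region), scales = "free_y", space = "free")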
library(sf) # for working with simple features
library(tmap) # for mapping spatial data
library(leaflet) # Another package for mapping spatial data
If folks are having trouble installing tmap
due to an issue with one of its dependencies, terra
, try running the following code, and then reinstall tmap
.
install.packages('terra', repos='https://rspatial.r-universe.dev')
If you tried to do GIS and make maps in R even a few years ago, you probably encountered the same frustrations I did. There were a ton of packages out there, each with its own file format and coding convention, and rarely did one package have all the features I needed. It was not easy to navigate, and I often found myself just exporting my work out of R and doing the rest in ArcGIS… Enter the sf and tmap packages, which are the latest and greatest R packages devoted to GIS and map making! Most of the frustrations I had with earlier packages have been resolved with these two packages.
The sf package is the workhorse for anything you need to do with spatial vector data. Spatial objects in sf are called simple features, which follow a set of GIS standards that are becoming the universal data model for vector-based spatial data in R and that most GIS-based packages in R now employ. Simple features are also now the standard for open source GIS in general. That means if you’re trying to troubleshoot something with simple features, you’ll need to specify that it’s for R, rather than PostGIS or some other implementation. The sf package is also superseding the rgdal package, which used to be the more common approach in R and open source GIS. The more I use this package, the more impressed I am with how intuitive it is to use, and how well documented the functions are. For vector data, I have yet to find a task that sf couldn’t handle.
The main application of the tmap package is making maps, and it was designed using a grammar of graphics philosophy similar to ggplot2. In practice for tmap, this means that maps are built as a collection of many small functions/layers that get added together with +, and order matters. There are also tmap-enabled functions that you can use in ggplot2 plots too, but you can do a lot more in tmap. I also prefer the look of tmap’s built-in compass, legends, etc. over the ones available in ggspatial, which is an add-on package for plotting spatial data in ggplot2.
The raster package is also an excellent package for analyzing/processing raster data. For large jobs, I find it easier to work with than ESRI tools, and it tends to run a lot faster than ESRI's built-in tools (I haven’t compared it with Python).
Finally, the leaflet package in R allows you to create interactive maps as HTML widgets. These are often included in R Shiny apps, but they can also be used in R Markdown with HTML output (for more on that, attend next Wednesday’s session on R Markdown). An internet connection is required for the maps to be interactive. Without an internet connection the map will show the default extent defined by the code.
The leaflet package is relatively easy to use for basic mapping. For more advanced features or to customize it further, you often end up having to code in JavaScript, which is the language leaflet was originally programmed in. There’s also a lot more online help available for the JavaScript version of the leaflet library than the R version. If you’re really stumped about something, you may find your answer in the JavaScript help.
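To give a flavor of the R syntax before we move on, here is a minimal, self-contained sketch (the coordinates are arbitrary and just for illustration):
library(leaflet)

leaflet() %>%
  addTiles() %>% # default OpenStreetMap basemap
  addMarkers(lng = -73.64, lat = 43.0, popup = "Example point") # arbitrary coordinates, just for illustration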
My two favorite features of sf are 1) the attribute table of a simple feature (sf’s equivalent of a shapefile) is a dataframe in R, and 2) package functions are pipeable with tidyverse functions. That means, if you want to delete points in your layer, you can use dplyr’s filter()
function to filter out those points. The sf package will update the geometry of the layer to remove the points that were filtered out.
To demonstrate the use of sf and tmap, I’m going to generate a simple GRTS sample using spsurvey
, which now connects to sf
instead of sp
and rgdal
. Then we’ll filter out points that don’t fall in a forest habitat to demonstrate how to work with sf data in R. Finally we’ll plot the results using tmap
.
I wasn’t able to figure out how to post a shapefile that you could easily download in R with a url. I emailed them to you as data.zip the previous week. I also posted all of the files to our Scientists Team, which can be downloaded using the following link: https://doimspp.sharepoint.com/:f:/s/ScientistsTraining2022/EiXOTYV1l4RCk1sMr5yXhZUB1ZFEaAlAN4elvsYbBfYHRg?e=ktVy5n. To follow along, you’ll need to download (and unzip if you’re using the email attachment) the shapefiles and save them to a “data” folder in your working directory.
Once those steps are completed, we’re ready to generate a GRTS sample and start making a map. Note that I’m using 6-digit hex colors (e.g., “#ffffff” is white) to define the fill color for each habitat type. To find your own or see what these colors look like, check out htmlcolorcodes.com
#install.packages(c("sf", "spsurvey"))
library(dplyr) # for filter, select, mutate and %>%
library(sf)
library(tmap)
library(spsurvey)
# Generate a random number for the seed
set.seed(62051)
sample(1:100000, 1) #62051
# Read in shapefiles from teams data folder
sara_bound1 <- st_read('./data/SARA_boundary_4269.shp')
sara_veg1 <- st_read('./data/SARA_Veg.shp')
# Check that the projections match; fix the one that doesn't match
st_crs(sara_bound1) == st_crs(sara_veg1) # FALSE- projections don't match.
# sara_bound1 needs to be reprojected to UTM Zone 18N NAD83.
sara_bound <- st_transform(sara_bound1, crs = 26918)
st_crs(sara_bound) == st_crs(sara_veg1) # TRUE
# Quick plot of first column in attribute table
plot(sara_bound[1])
plot(sara_veg1[1]) # bigger extent than boundary
# Intersect boundary and veg to be the same extent
sara_veg <- st_intersection(sara_veg1, sara_bound)
plot(sara_veg[1])
# View attribute table of layers
head(sara_bound) # 1 feature with 95 fields
str(sara_veg)
head(sara_veg)
names(sara_veg)
table(sara_veg$ANDERSONII)
# Simplify vegetation types for easier plotting
dev <- c('1. Urban or Built-up Land', '11. Residential',
'12. Commercial and Services', '13. Industrial',
'14. Transportation, Communications, and Utilities',
'17. Other Urban or Built-up Land')
crop <- c('21. Cropland and Pasture',
'22. Orchards, Groves, Vineyards, and Nurseries',
'31. Herbaceous Rangeland')
shrubland <- c('32. Shrub and Brush Rangeland')
forest_up <- c('41. Deciduous Forest Land', '42. Evergreen Forest Land',
'43. Mixed Forest Land')
forest_wet <- c('61. Forested Wetland')
open_wet <- c('62. Nonforested wetland', '62. Nonforested Wetland')
water <- c('5. Water', '51. Streams and Canals', '53. Reservoirs')
unk <- 'Unclassified'
# Create 2 fields in the veg attribute table: simp_veg, and fills
sara_veg <- sara_veg %>%
mutate(simp_veg = case_when(ANDERSONII %in% dev ~ 'Developed',
ANDERSONII %in% crop ~ 'Open field',
ANDERSONII %in% shrubland ~ 'Shrublands',
ANDERSONII %in% forest_up ~ 'Forest',
ANDERSONII %in% forest_wet ~ 'Forested wetland',
ANDERSONII %in% open_wet ~ 'Open wetland',
ANDERSONII %in% water ~ 'Open water',
ANDERSONII %in% unk ~ 'Unclassified',
TRUE ~ 'Unknown'),
fill_col = case_when(simp_veg == 'Developed' ~ '#D8D8D8',
simp_veg == 'Open field' ~ '#f5f0b0',
simp_veg == 'Shrublands' ~ '#F29839',
simp_veg == 'Powerline ROW' ~ '#F9421D',
simp_veg == 'Forest' ~ '#55785e',
simp_veg == 'Forested wetland' ~ '#9577a6',
simp_veg == 'Open wetland' ~ '#c497d4',
simp_veg == 'Open water' ~ '#AFD0F2',
simp_veg == 'Unclassified' ~ '#ffffff'))
Before moving to the next step, I wanted to show how easy it is to create simple features from dataframes that contain X/Y coordinates. We’ll read in a fake dataset in the GitHub repo for this training, and call it wetdat. The dataset contains fake species composition data for wetland monitoring sites in ACAD and includes X and Y coordinates. We’ll use dplyr
functions to calculate the number of invasive and protected species in each site by year combination. Then we’ll make it a simple feature, and save it as a shapefile. Note that there are multiple formats you can save simple features as. I only show the shapefile version, in case you find yourself going between R and ArcGIS.
library(dplyr)
# Import data
wetdat <- read.csv(
'https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv')
# Summarize so that each site x year combination has 1 row
wetsum <- wetdat %>% group_by(Site_Name, Year, X_Coord, Y_Coord) %>%
summarize(num_inv = sum(Invasive), num_prot = sum(Protected),
.groups = 'drop') # keeps dplyr quiet in console
# Check first 6 rows of output
head(wetsum)
# Convert dataframe to sf
wetsum_sf <- st_as_sf(wetsum, coords = c('X_Coord', 'Y_Coord'), crs = 26919)
# ACAD is UTM Zone 19N NAD83, hence the difference between SARA, which is 18N.
# Write sf to disk as shapefile
st_write(wetsum_sf, 'ACAD_wetland_data.shp')
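As an aside, st_write() infers the output format from the file extension, so the same object could be written to other formats just by changing the extension; for example, a GeoPackage (file name is illustrative):
st_write(wetsum_sf, 'ACAD_wetland_data.gpkg') # same layer, GeoPackage format instead of shapefile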
The spsurvey
package has been updated to point to sf
instead of rgdal
. It’s a code-breaking change if you have old R scripts that generated GRTS samples. However, the process is even easier now, as you can see below. Here we’re just going to use the simplest of GRTS designs. The output from grts()
has multiple slots. The one we want is sites_base
, and you can see how we get that slot in the code below.
Note: While I followed the approach documented in the spsurvey vignette to generate reproducible GRTS samples, it does not appear to be working as I test it. Despite running set.seed(62051) after loading spsurvey and then running the code chunk below, each sample appears to be different.
sara_grts <- grts(sara_bound, n_base = 100) # generate 100 points within SARA boundary
sara_grts$sites_base$priority <- as.numeric(row.names(sara_grts$sites_base)) # add priority number (same as row.name)
The first step here is to spatially join the sara_grts layer to the sara_veg layer. Here we only care about sara_veg’s simp_veg field, so I used dplyr’s select() function. After joining, you should see a simp_veg field added to the end of the grts layer (it’s actually to the left of geometry, which is the last column). In the next step, we then filter the points in the newly created grts_veg layer to only include points that fall in forest habitat.
# Spatial join
grts_veg <- st_join(sara_grts$sites_base, sara_veg %>% select(simp_veg))
names(grts_veg)
# Create list of forest habitats
sort(unique(sara_veg$simp_veg))
forest_veg <- c('Forest', 'Forested wetland')
# Filter out non-forest points
grts_forest <- grts_veg %>% filter(simp_veg %in% forest_veg)
nrow(grts_veg) # 100 points
nrow(grts_forest) # fewer points
grts_50 <- grts_veg %>% filter(priority <= 50)
nrow(grts_50)
## [1] 50
Now it’s time to plot. The code below may seem a bit much, but we’ll break it down piece by piece. The great thing about building maps this way is that you’re essentially building a template in code that you can steal from and adapt for future maps. You can even turn these pieces into a function that makes future maps easier still. That’s beyond what we can cover here, but it’s another great benefit of making maps in R instead of ArcGIS.
The code below is broken into the various pieces that make up the map. The way tmap works is that you first add a layer via tm_shape(), and then specify how you want that layer to look, e.g., with tm_fill() or tm_borders(). Each piece has its own legend as well. This is similar to how you start each ggplot2 graph by defining the data with ggplot(data, aes(x = xvar, y = yvar)), and then add geom_point(), or whatever else, to define how the data are plotted. The difference with tmap is that every layer you want in the map has to be coded this way. Finally, tm_layout() is similar to theme() in ggplot2, and is where you can customize the map layout.
for_legend makes a list of the habitat types in simp_veg and their associated colors. That saves us from having to type them all over again (and potentially spelling them wrong).
tm_shape() in the map sets the projection and the bounding box. If you don’t set the bounding box, the map will show the largest extent of your layers. So if you have a roads layer at the state level, your map will be zoomed to that extent, instead of to the park boundary.
Next are tm_fill(), which fills in colors, and tm_borders(), which only plots outlines. If you want both borders and fills, use tm_polygons() instead.
tm_add_legend() allows you to set the order in which each legend appears in the map (via the z argument). Otherwise, legends appear in the order you specify the layers.
tm_symbols() allows you to change the symbol format in a similar way to ggplot2. We then added tm_text() to label each point using the numbers in the priority column of the grts_forest attribute table. The xmod and ymod arguments allow you to offset labels from the points horizontally and vertically. In this case, a negative xmod moves the label to the left, and a negative ymod moves the label below the point. The default location for labels is directly on top of the point.
The tm_layout() code is where you can change the default settings of the figure, including font size, legend placement, and margins. The title text and position are also set in tm_layout().
tmap_save() allows you to write the map to disk and to specify the height and width of the map. I prefer to write maps to disk to see what the sizing looks like before changing any of the layout and font size settings, because the figure on disk will look different (and often better) than the plot view in your RStudio session.
# Creating list of simp_veg types and their fill colors for easier legend
for_legend <- unique(data.frame(simp_veg = sara_veg$simp_veg, fill_col = sara_veg$fill_col))
sara_map <-
# Vegetation map
tm_shape(sara_veg, projection = 26918, bbox = sara_bound) +
tm_fill('fill_col') +
tm_add_legend(type = 'fill', labels = for_legend$simp_veg,
col = for_legend$fill_col, z = 3) +
# Park boundary
tm_shape(sara_bound) +
tm_borders('black', lwd = 2) +
tm_add_legend(type = 'line', labels = 'Park Boundary', col = 'black',
z = 2)+
# GRTS points
tm_shape(grts_forest) +
tm_symbols(shape = 21, col = '#EAFF16', border.lwd = 0.5, size = 0.3) +
tm_text('priority', size = 0.9, xmod = -0.4, ymod = -0.4) +
tm_add_legend(type = 'symbol', labels = 'GRTS points', shape = 21,
col = '#EAFF16', border.lwd = 1, size = 0.5,
z = 1) +
# Other map features
tm_compass(size = 2, type = 'arrow',
text.size = 1, position = c('left', 'bottom')) +
tm_scale_bar(text.size = 1.25, position = c('center', 'bottom')) +
tm_layout(inner.margins = c(0.2, 0.02, 0.02, 0.02), # make room for legend
outer.margins = 0,
legend.text.size = 1.25,
legend.just = 'right',
legend.position = c('right', 'bottom'),
title = 'Saratoga NHP GRTS points',
title.position = c('center', 'top'))
tmap_save(sara_map, 'SARA_GRTS.png', height = 10.5, width = 8,
units = 'in', dpi = 600, outer.margins = 0)
sara_map2 <-
# Vegetation map
tm_shape(sara_veg, projection = 26918, bbox = sara_bound) +
tm_polygons('fill_col') + # A1: changed tm_fill() to tm_polygons()
tm_add_legend(type = 'fill', labels = for_legend$simp_veg,
col = for_legend$fill_col, z = 3) +
# Park boundary
tm_shape(sara_bound) +
tm_borders('red', lwd = 2) + # A2A: changed 'black' to 'red'
tm_add_legend(type = 'line', labels = 'Park Boundary', col = 'red', # A2B
z = 2)+
# GRTS points
tm_shape(grts_forest) +
tm_symbols(shape = 21, col = '#EAFF16', border.lwd = 0.5, size = 0.3) +
tm_text('priority', size = 0.9, xmod = -0.5, ymod = -0.5) +
tm_add_legend(type = 'symbol', labels = 'GRTS points', shape = 21,
col = '#EAFF16', border.lwd = 1, size = 0.5,
z = 1) +
# Other map features
tm_compass(size = 2, type = 'arrow',
text.size = 1, position = c('left', 'bottom')) +
tm_scale_bar(text.size = 1.25, position = c('center', 'bottom')) +
tm_layout(inner.margins = c(0.2, 0.02, 0.02, 0.02),
outer.margins = 0,
legend.text.size = 1.25,
legend.just = 'right',
legend.position = c('right', 'bottom'),
title = 'Saratoga NHP GRTS points',
title.position = c('center', 'top'))
The dimensions and scaling in the plot viewer of your R session tend to be a bit funky. Instead of trying to make the map perfect in the plot viewer, I usually save it to disk and tweak that version to look the way I want it to. To demonstrate, here’s what sara_map looks like in your plot viewer:
sara_map
And here’s the version written to disk with tmap_save():
The last cool thing to show with tmap is that you can create interactive HTML widgets similar to what leaflet can do (coming next). With the interactive mode, you can change base layers, turn your layers off/on, and zoom to different extents. The legend in the interactive mode is limited to only showing fills, but it’s still pretty cool. You can also add custom basemaps (e.g., park tiles) to the interactive mode either by adding tm_basemap(url), or by changing the default basemap via tmap_options(), which I show below.
# Load park tiles
NPSbasic = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
NPSimagery = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
NPSslate = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
NPSlight = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
# Set parktiles as default basemaps for all interactive maps
tmap_options(basemaps = c(Map = NPSbasic,
Imagery = NPSimagery,
Light = NPSlight,
Slate = NPSslate))
# Switch to interactive mode, and plot the sara_map
tmap_mode('view') # turn interactive mode on
sara_map