Last updated: 23 Apr 22 10:06:50 (UTC)
Frank's Outline of HOPR (Hands-on Programming with R)
https://rstudio-education.github.io/hopr/
2 - The Very Basics
- 2.1 The R user interface:
- Basic arithmetic at the console
- Comments with `#`
- Control-C to cancel a command
- 2.2 Objects
- Colon operator `:`
- Objects and assignment with `<-`
- RStudio environment pane
- 2.3 Functions
- Function and argument: `mean()`, `round()`, `factorial()`
- `sample()` with and without replacement
- Simulate rolling a pair of dice with `sample()`
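The sampling ideas from section 2.3 can be sketched as follows. This is a minimal illustration, not the book's exact code; the object names `die`, `draw_no_replace`, `dice`, and `total` are assumptions chosen for the example.

```r
# A six-sided die represented as the integers 1 to 6 (built with the colon operator)
die <- 1:6

# Without replacement (the default): no face can be drawn twice
draw_no_replace <- sample(die, size = 6)

# With replacement: each draw is independent, like rolling two real dice
dice <- sample(die, size = 2, replace = TRUE)
total <- sum(dice)
total
```

Sampling two values without replacement would never produce doubles, which is why `replace = TRUE` matters for the dice simulation.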
- 2.4 Writing Your Own Functions
- `sum()`
- The function constructor `function()`
- A function to roll a fair die twice and return the sum
- 2.5 Arguments
- Function with arguments and/or default values
- Function to roll a pair of dice where the faces on each die are an argument
- Details of functions: name, body, arguments, default values, last line of code (return)
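A small sketch of sections 2.4 and 2.5 taken together: a function with a name, one argument with a default value, a body, and a last line whose value is returned. The names `roll` and `faces` are assumptions made for this illustration.

```r
# `faces` defaults to a standard six-sided die
roll <- function(faces = 1:6) {
  dice <- sample(faces, size = 2, replace = TRUE)
  sum(dice)  # the value of the last line is what the function returns
}

roll()             # two standard dice
roll(faces = 1:4)  # two four-sided dice
```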
- 2.6 Scripts
- Creating a script in RStudio: editor pane
- Control + Return / Command + Return to run lines from script
- The “Run” and “Source” buttons in RStudio
3 - Packages and Help Files
- 3.1 Packages
- `install.packages()`
- `library()`
- Directs readers to an appendix with more details about package management
- `c()` to combine values into a vector
- Scatterplot with `qplot()` from ggplot2
- Histogram with `qplot()`, setting the `binwidth` parameter
- `replicate()` to repeat the simulation of rolling two dice
- Make a histogram of the result from `replicate()` to check the dice rolling simulation code
- Asks the reader to weight the dice by using an option in the `sample()` function. Doesn’t say which, so you need to look in the help file!
- 3.2 Getting Help
- `?` to get help
- Can’t get help about package commands unless you’ve loaded the package with `library()`
- Parts of a help file
- Advice on how to read a help file
- The argument for `sample()` used to weight the dice is `prob`
- Other places to get help: R help list, Stack Overflow, and <community.rstudio.com>
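One way the weighted dice exercise from chapter 3 might look, using the `prob` argument of `sample()` together with `replicate()`. The weights, the seed, and the names `weights`, `roll_weighted`, and `rolls` are assumptions for illustration only. The sketch checks the simulation numerically so that it runs without ggplot2.

```r
# Hypothetical weights: a six is three times as likely as any other face
weights <- c(1, 1, 1, 1, 1, 3) / 8

roll_weighted <- function() {
  dice <- sample(1:6, size = 2, replace = TRUE, prob = weights)
  sum(dice)
}

set.seed(1234)  # arbitrary seed so the run is reproducible
rolls <- replicate(10000, roll_weighted())

mean(rolls)     # for fair dice this would be close to 7
table(rolls)    # counts for each total, a text version of the histogram
```

With ggplot2 loaded, the same check could be done visually with a histogram of `rolls`.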
Note that chapter 4 doesn’t have any content: it just outlines the next group of chapters.
5 - Objects
- 5.1 Atomic Vectors
- An atomic vector is one-dimensional and stores data of a single type.
- Six basic types of atomic vector: double, integer, character, logical, complex, and raw.
- Make an atomic vector with `c()`
- Test whether something is an atomic vector with `is.vector()`
- `length()` to find the length of an atomic vector
- An atomic vector can be length one: R doesn’t actually have scalar types!
- Use the suffix `L` to create an atomic vector of integer type, e.g. `c(1L, 2L)`
- `typeof()` to find out what type of atomic vector you have
- numeric is a synonym for double in R
- Arithmetic with integers is exact, but with doubles it may not be. Doubles store whole numbers exactly and cover a wider range of them than the integer type does, so you rarely see integer types in practice.
- Character data and the distinction between `1` and `"1"` as well as between `x` and `"x"`
- Logical aka Boolean means `TRUE` or `FALSE`
- Shorthand `T` and `F`, but don’t use them because they can be re-defined by the user, e.g. `T <- FALSE`
- We’re not going to work with raw or complex types so don’t worry about them.
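The points in section 5.1 can be checked directly at the console. The vector names below are assumptions chosen for this sketch.

```r
dbl <- c(1, 2, 3)      # double (also called numeric)
int <- c(1L, 2L, 3L)   # the L suffix requests integer storage
chr <- c("1", "x")     # character: quoted, so neither a number nor an object name
lgl <- c(TRUE, FALSE)  # logical

typeof(dbl); typeof(int); typeof(chr); typeof(lgl)

is.vector(dbl)  # TRUE
length(5)       # 1: a "scalar" is just a vector of length one

# Doubles use binary floating point, so some results are only approximate
sqrt(2)^2 == 2                    # FALSE
isTRUE(all.equal(sqrt(2)^2, 2))   # TRUE: comparison with a small tolerance
```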
- 5.2 Attributes
- An attribute is a piece of “extra information” that can be “attached” to any R object, including an atomic vector
- Use `attributes()` to view an object’s attributes: it returns `NULL` if it has none
- `NULL` is how R denotes the empty set aka the null set. You can create an empty set: `x <- NULL`
- Common attributes for atomic vectors: names, dimensions (dim), and class
- `names()` to give an object a names attribute; can set to `NULL` or change existing names
- Use `dim()` to convert an atomic vector to an n-dimensional array by giving it dimensions, ordered as (rows, columns, slices)
- 5.3 Matrices; 5.4 Arrays
- `matrix()` to create a 2-dimensional array; fills by column by default but can change with `byrow`
- `array()` to create an n-dimensional array; first argument is an atomic vector of values, second is a vector of dimensions
- Exercises: make a few matrices and fill them by row versus column
- 5.5 Class
- An object’s class isn’t the same thing as its type
- `class()` to check or change an object’s class; but changing it is usually a bad idea!
- When you change the dimensions of an atomic vector, its class changes but not its type; examples
- Key point: matrices and arrays are simply atomic vectors with attributes! This means that, like an atomic vector, they can only hold data of a single type
- Dates are an interesting example of a class. There’s no date type in R: instead R handles dates by setting a class attribute for a double
- You can see the underlying representation by using `unclass()`: `x <- Sys.time(); typeof(x); class(x); unclass(x)`
- R uses factors to represent categorical variables; there’s no factor type, so instead this is handled by setting attributes
- Use `factor()` to create a factor; use `attributes()` to see the `levels` and `class` attributes; use `unclass()` to see the underlying representation
- TLDR: factors basically represent character data with a limited number of possible values as integers. This is convenient for statistical models e.g. regression but can create problems in other settings.
- Before version 4.0.0, R often read in character data as a factor by default. If you need to undo this, you can use `as.character()`
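The distinction between class and type in section 5.5 is easiest to see by inspecting a date-time and a factor. The data in `gender` is made up for this sketch.

```r
now <- Sys.time()
typeof(now)   # "double"
class(now)    # "POSIXct" "POSIXt"
unclass(now)  # seconds since 1970-01-01 00:00:00 UTC

# Hypothetical categorical data
gender <- factor(c("male", "female", "female", "male"))
typeof(gender)        # "integer"
attributes(gender)    # the levels and class attributes
unclass(gender)       # integer codes 2 1 1 2 plus the levels attribute
as.character(gender)  # back to plain character data
```

The levels are sorted alphabetically by default, which is why "female" is coded as 1 and "male" as 2.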
- 5.6 Coercion
- If you try to store different types in an atomic vector, R will convert them to a single type.
- This is called coercion
- E.g. `x <- c(1, TRUE, '1')`
- Coercion follows this precedence relation: character > numeric > logical
- The type with the highest precedence “wins” and everything else gets converted to this type.
- When logical types are converted to numeric, TRUE becomes 1 and FALSE becomes 0.
- This is handy for doing mathematics, e.g. `mean(c(TRUE, FALSE, TRUE))`
- To convert manually use `as.character()`, `as.numeric()`, `as.logical()`
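A few lines illustrating the coercion rules from section 5.6. The name `flags` is an assumption for this example.

```r
typeof(c(1, TRUE))       # "double": TRUE becomes 1
typeof(c(1, TRUE, "1"))  # "character": character has the highest precedence
c(1, TRUE, "1")          # "1" "TRUE" "1"

flags <- c(TRUE, FALSE, TRUE)
sum(flags)   # 2: the number of TRUE values
mean(flags)  # the proportion of TRUE values

as.numeric("3.5")   # 3.5
as.logical(0)       # FALSE
as.character(TRUE)  # "TRUE"
```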
- 5.7 Lists
- The advantage of atomic vectors / matrices / arrays is that working with them is fast, memory-efficient, and mathematically convenient because they contain elements of a single type.
- We use a list when we need an object that contains elements of arbitrary types.
- The elements of a list can be *any R object whatsoever*: atomic vectors, other lists, you name it!
- `list()` creates a list just like `c()` creates an atomic vector: use commas to separate the elements.
- Lists may have multiple levels of indexing, since they can contain elements that themselves have indices
- You can give names to list items when you create a list, e.g. `list(dogs = c('Fido', 'Spot', 'Lassie'), currencies = c('Sterling', 'Dollars'), primes = c(2, 3, 5, 7, 11))`
- Use `str()` to get an overview of what’s contained in a list
- 5.8 Data Frames
- These are just a special case of a list: a 2-dimensional list in which each element is an atomic vector of the same length but potentially a different type.
- You can think of a data frame as a “spreadsheet”
- You can create a data frame using `data.frame()` and give names to the columns, e.g. `data.frame(name = c('Waldo', 'Wanda', 'Wilfred'), age = c(24, 32, 43))`
- Before version 4.0.0, R stored character data in a data frame as a factor by default. To disable this, use the `stringsAsFactors = FALSE` option with `data.frame()`
- A data frame `df` is an object whose type is list, and whose class is data frame. To see this, and look inside a data frame, use `typeof(df); class(df); str(df)`
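The list and data frame examples from sections 5.7 and 5.8 can be run as a single sketch. The extra list element `flag` is an assumption added to show mixed types.

```r
lst <- list(
  dogs   = c("Fido", "Spot", "Lassie"),
  primes = c(2, 3, 5, 7, 11),
  flag   = TRUE  # hypothetical extra element of a third type
)
str(lst)

df <- data.frame(
  name = c("Waldo", "Wanda", "Wilfred"),
  age  = c(24, 32, 43),
  stringsAsFactors = FALSE  # already the default from R 4.0.0 onwards
)
typeof(df)  # "list"
class(df)   # "data.frame"
str(df)
```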
- 5.9 Reading Data
- Shows how to use the RStudio GUI to read in a text file.
- More details are provided in the appendix “Loading and Saving Data in R” including `read.table()`, setting a path, etc.
- 5.10 Saving Data
- Shows how to use `write.csv()` with `row.names = FALSE`
- More details in the appendix “Loading and Saving Data in R”
6 - R Notation
This chapter is about how to access elements of R objects. It mainly focuses on data frames but much of the material also applies to matrices, arrays, and lists.
- 6.1 Selecting Values
- Use the syntax `x[row_indices, column_indices]` to extract elements of a data frame `x`. (All of this works for a matrix and generalises to higher-dimensional arrays)
- There are six ways of specifying the indices. In each case you supply an atomic vector, possibly of length one:
- Positive integers: specify which row and column indices to select
- Negative integers: specify which row and column indices to not select
- Zero: if you put a zero for a dimension, R extracts nothing for that dimension. Not very useful
- Blank space: if blank, everything in that dimension is extracted e.g. all rows, or all columns, or all slices
- Logical values: need to match the length of the dimension or R will recycle. TRUE if you want the element and FALSE if you don’t
- Names: if you’ve given names to the rows and/or columns you can use them. Less error prone than matching indices
- R begins indexing with 1 rather than 0; its notation follows linear algebra books rather than languages like C or Python
- If you ask for a single row or column e.g. `x[1, 1:3]` R will convert the result from a matrix to an atomic vector. To prevent this use `drop = FALSE` e.g. `x[1, 1:3, drop = FALSE]`
- See Radford Neal’s blog post about how `drop = FALSE` can be considered an R design flaw
- You can supply the same row or column index multiple times, in which case R will give you repeated elements! E.g. `x[c(1, 1), 1:3]`. This is helpful for the bootstrap
- You can’t mix negative integers and positive integers as subscripts! But you can mix zeros with either.
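The six ways of indexing from section 6.1, applied to a small made-up data frame. The four-card data frame `x` is an assumption standing in for the book's full deck.

```r
# A small hypothetical data frame standing in for a deck of cards
x <- data.frame(
  face  = c("king", "queen", "jack", "ten"),
  suit  = c("spades", "spades", "hearts", "hearts"),
  value = c(13, 12, 11, 10)
)

x[1, c(1, 3)]                     # positive integers: row 1, columns 1 and 3
x[-1, ]                           # negative integers: everything except row 1
x[0, ]                            # zero: no rows, only the column names
x[, "value"]                      # blank row index: all rows of one column
x[c(TRUE, FALSE, TRUE, FALSE), ]  # logical values: keep rows 1 and 3
x[2, c("face", "suit")]           # names

x[, "value", drop = FALSE]        # stays a data frame instead of a vector
x[c(1, 1), ]                      # repeated index: row 1 twice
```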
- 6.2 Deal a Card
- Write a function that uses brackets to select the first card (row) in a deck (data frame)
- 6.3 Shuffle the Deck
- Use `head()` and `tail()` to look at a data frame that represents a deck of cards
- Use `sample()` to shuffle the deck by returning all of the rows in a random order (a random row permutation)
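A possible version of the deal and shuffle functions described in sections 6.2 and 6.3. The four-card `deck` and the function names `deal` and `shuffle` are assumptions for this sketch; a full deck would have 52 rows.

```r
# Hypothetical miniature deck
deck <- data.frame(
  face  = c("king", "queen", "jack", "ten"),
  value = c(13, 12, 11, 10)
)

# Deal: return the first row of whatever deck is supplied
deal <- function(cards) {
  cards[1, ]
}

# Shuffle: return all rows in a random order
shuffle <- function(cards) {
  random_order <- sample(1:nrow(cards), size = nrow(cards))
  cards[random_order, ]
}

deal(deck)           # always the king in this unshuffled deck
deal(shuffle(deck))  # a random card
```

Both functions return new data frames and leave `deck` itself unchanged.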
- 6.4 Dollar Signs and Double Brackets
- These allow us to extract elements from a list. Since a data frame is a list, we can use them on a data frame too.
- `$` to extract a column from a data frame by name: example of taking the median and mean of a column of numeric data
- The “usual” way to access elements of a list is the same as for an atomic vector: `x[1]` etc. Here R will return the same type: a list from a list and an atomic vector from an atomic vector
- Example of trying to take the sum of an atomic vector stored in a list: `sum(x[1])` doesn’t work.
- If the elements of a list are named we can use `$` to extract a list element by name and R will not necessarily return a list: it will return the object stored in the list that has that name.
- Using `$` the example with a sum works.
- But what if we don’t have named list elements? Then we can use double brackets: `x[[1]]` doesn’t return a list unless the first element of `x` is itself a list. This fixes the problem in the sum example: `sum(x[[1]])`.
- We can combine the single and double bracket notation for lists with any of the six ways of specifying indices from above: positive integers, negative, logical, names, etc.
- Example: `lst <- list(numbers = c(1, 3, 5), boolean = TRUE, strings = c('a', 'b', 'c'))`
- `lst["numbers"]` returns a list while `lst[["numbers"]]` returns an atomic vector
- Think of a list as a train that has many cars. Each car contains an R object.
- Single brackets return another train with the cars you have selected: the objects are still inside the train cars.
- Double brackets unload the train cars you’ve specified, and return the contents without the car itself
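The train analogy can be checked with the example list from section 6.4. This sketch only uses the list defined in the outline.

```r
lst <- list(numbers = c(1, 3, 5), boolean = TRUE, strings = c("a", "b", "c"))

lst["numbers"]    # single brackets: still a list (a shorter train)
lst[["numbers"]]  # double brackets: the numeric vector itself
lst$numbers       # same result as lst[["numbers"]]

# sum(lst["numbers"]) fails because sum() cannot add up a list
sum(lst[["numbers"]])  # 9
sum(lst$numbers)       # 9
```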
7 - Modifying Values
- 7.1 - Changing Values in Place
- Describe the values you want to modify on the left-hand side of `<-` and supply the new values on the right-hand side
- E.g. `vec <- rep(0, 6); vec[1] <- 1000`
- Can replace multiple values at once as long as the number of new values equals the number of selected values (shorter replacements are recycled): `vec[c(1, 3, 5)] <- rep(1, 3)`
- Can expand a vector: set values of elements that don’t (yet) exist, e.g. `vec[7] <- 0`
- A good use case is adding a column to an existing dataframe: `deck2$new <- 1:52`
- A bad use case is adding elements to a vector within a loop. Don’t do this! It’s really slow! Much better to pre-allocate. (This is discussed below)
- Can remove a column from a dataframe or an element from a list by assigning `NULL` to it, e.g. `deck2$new <- NULL`
- Simple example of recycling: `vec[c(1, 3, 5)] <- -999`
- When we change values in place in this way, note that we are overwriting the original object rather than creating a modified copy of that object.
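The in-place modifications from section 7.1 in one runnable sketch. The three-row `deck2` is an assumption standing in for the 52-card deck, so the new column uses `1:3` rather than `1:52`.

```r
vec <- rep(0, 6)
vec[1] <- 1000                 # change a single value
vec[c(1, 3, 5)] <- c(1, 2, 3)  # three new values for three positions
vec[c(2, 4, 6)] <- -999        # one value recycled across three positions
vec[7] <- 0                    # expand the vector by one element
vec

# Hypothetical three-card stand-in for the deck
deck2 <- data.frame(face = c("king", "queen", "jack"))
deck2$new <- 1:3    # add a column
deck2$new <- NULL   # remove it again
```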
- 7.2 - Logical Subsetting
- Some key logical operators in R: `>`, `>=`, `<`, `<=`, `==`, `!=`
- These return `TRUE` or `FALSE`.
- If you apply any of these to vectors, R interprets your intention elementwise. This is exactly how R works with numeric vectors.
- Another logical operator: `%in%`
- `%in%` works differently from the other logical operators from above: `x %in% y` returns a logical vector of the same length as `x`. Each element indicates whether the corresponding element of `x` is among the elements of `y`.
- Be careful about `==` versus `=`. The former is a logical operator while the latter is an assignment operator: a synonym for `<-`
- Exercise: count how many of the cards in a deck are `ace` cards
- Using logical subsetting to modify values: `deck3$value[deck3$face == 'ace'] <- 14`
- Boolean Operators: `&`, `|`, `xor`, `!`, `any`, `all`
- Common error: forgetting to “put a complete test on either side of a Boolean operator,” e.g. `x > 2 & < 9` is WRONG; write `x > 2 & x < 9` instead
- Examples of selecting particular rows of a dataframe using compound statements that combine Boolean and logical operators
- Selecting rows of a dataframe with `%in%`
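Logical subsetting as described in section 7.2, using a made-up four-card `deck3`. The card data and the vector `x` are assumptions for illustration.

```r
# Hypothetical miniature deck
deck3 <- data.frame(
  face  = c("ace", "king", "ace", "two"),
  suit  = c("spades", "hearts", "hearts", "clubs"),
  value = c(1, 13, 1, 2)
)

deck3$face == "ace"                     # TRUE FALSE TRUE FALSE
sum(deck3$face == "ace")                # 2: count the aces
deck3$value[deck3$face == "ace"] <- 14  # modify only the matching rows

x <- c(1, 5, 10)
x > 2 & x < 9  # a complete test on each side of &: FALSE TRUE FALSE

deck3[deck3$suit == "hearts" & deck3$value > 13, ]  # compound condition
deck3[deck3$face %in% c("king", "two"), ]           # membership test
```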
- 7.3 - Missing Information
- `NA` is a reserved word in R. It means “not available” and serves as a “placeholder for missing information”
- `NA` doesn’t behave as you might expect! But the best way to reason about it is to replace `NA` with the phrase “a value I don’t know.”
- For example: `NA + 1` means “what is the sum of 1 and a value I don’t know?” The answer is “I don’t know” i.e. `NA`
- In other words: `NA` values tend to “propagate.”
- Why does R behave this way? It wants to confront you with the fact that you have missing data, to prevent you from making silly mistakes.
- Many functions, e.g. `mean()`, have an `na.rm` option. Setting this equal to `TRUE` drops the missing values. Otherwise taking a mean that includes any missing values will return `NA`.
- `NA == 1` returns `NA` and `NA == NA` also returns `NA`. This means we can’t test for `NA` values using `==`
- `is.na()` allows us to test for missing values
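The behaviour of `NA` from section 7.3 in a few lines. The vector `x` is an assumption for this example.

```r
x <- c(1, NA, 3)

NA + 1                 # NA: the unknown value propagates
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2: missing values dropped first

x == NA                # NA NA NA, so == cannot detect missing values
is.na(x)               # FALSE TRUE FALSE
sum(is.na(x))          # 1: the number of missing values
```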
8 - Environments
This chapter isn’t so important for core ERM. There’s some helpful material on scope, but it may be better to revisit this after you already have a bit more familiarity with the “nuts and bolts” of R.
9 - Programs
Build a working slot machine using R functions. The point of the chapter is to think about how to create programs by combining functions.
- 9.1 - Strategy
- How to solve complicated problems by breaking them up into sub-tasks
- sequential steps: some things need to be done in order, with the output of one step serving as the input to the next
- parallel cases: in some settings our program needs to work in a different way depending on the kind of input that is supplied
- Make a flowchart of the slot machine example to clarify which steps are sequential and which cases are parallel
- 9.2 - `if` Statements
- Basic syntax of `if` in R
- Condition of `if` must evaluate to a single TRUE or FALSE. Older versions of R throw a warning if you supply a longer logical vector and then use only its first element; since R 4.2.0 this is an error.
- Some quiz questions to check basic understanding of `if`
- 9.3 - `else` Statements
- Basic syntax of `if ... else` in R.
- `trunc()` function to extract the decimal part: `x - trunc(x)`
- `if ... else` with `trunc` to round a numeric value
- Syntax of `else if`
- R starts at the top of an `if` block and continues until it finds a logical condition that evaluates to true. It then runs the corresponding code block and skips any remaining `if` or `else` blocks. When none of the `if` or `else if` blocks has a true condition, R evaluates the code block associated with `else` if one is present.
- Use `if ... else if ... else` for the parallel cases in the slot machine example
- `unique()` function to create a subvector with no repeat elements
- `&&` and `||` work like `&` and `|` except that they are lazy: they evaluate their arguments from left to right and stop as soon as the result is known. They are meant for single logical values (older versions of R silently used only the first element of a longer vector; since R 4.3.0 this is an error), so they return a single logical value rather than a vector. This makes them handy for use in `if ... else if ... else` statements.
- Replacing multiple `|` operators with a single `%in%` to yield code that is more efficient and easier to read and maintain
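A sketch of the control flow ideas in sections 9.2 and 9.3. The functions `round_half_up` and `describe` and the `symbols` vector are hypothetical examples, and `round_half_up` assumes a single non-negative number as input.

```r
# if ... else with trunc(): round a single non-negative number
round_half_up <- function(x) {
  decimal <- x - trunc(x)  # the decimal part
  if (decimal >= 0.5) {
    trunc(x) + 1
  } else {
    trunc(x)
  }
}

# if ... else if ... else for parallel cases
describe <- function(x) {
  if (is.numeric(x) && x > 0) {  # && never evaluates x > 0 when x is not numeric
    "positive number"
  } else if (is.numeric(x)) {
    "non-positive number"
  } else {
    "not a number"
  }
}

# One %in% instead of several | comparisons
symbols <- c("B", "BB", "BBB")
all(symbols %in% c("B", "BB", "BBB"))  # TRUE
```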
- 9.4 - Lookup Tables
- How to avoid an `if ... else if ... else` statement with lots and lots of `else if` blocks?
- Pro tip: subsetting often provides the simplest way to achieve a complicated task in R.
- Create a named vector called `payouts` with names that correspond to the symbols that appear on the slot machine and values that give the associated payoffs, e.g. `payouts <- c('DD' = 100, '7' = 80, 'BBB' = 40, ... )`
- Now it’s easy to find the prize for a given symbol: `payouts['DD']`
- This is a lookup table. We create one in R by subsetting a named object in a clever way.
- By feeding a variable into `payouts[...]` we can look up a payout that arises at runtime based on the randomly generated symbols on the slot machine.
- Using R’s coercion rules to count occurrences of a particular value, e.g. `sum(symbols == 'DD')`
- When to use `if` trees versus lookup tables? Trees are for running different code in each branch; lookup tables are for assigning different values in each branch.
- How to convert from an `if` tree to a lookup table? Find the values that are being assigned and store them as a vector. Then find the conditions used to choose between them. If the conditions are based on character strings, use name-based subsetting; otherwise use integer-based subsetting.
- Double the prize for every diamond present: nice vectorized solution `prize * (2^diamonds)`
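A minimal lookup table in the spirit of section 9.4. Only the three payouts quoted in the outline are included; the `symbols` vectors and the way the pieces are combined are assumptions for illustration, not the book's full scoring rules.

```r
# Lookup table with only the entries given in the outline
payouts <- c("DD" = 100, "7" = 80, "BBB" = 40)

# Hypothetical spin: three of a kind
symbols <- c("7", "7", "7")
prize <- unname(payouts[symbols[1]])  # name-based lookup at runtime
prize                                 # 80

# Counting occurrences relies on TRUE being coerced to 1
other_spin <- c("DD", "7", "DD")      # hypothetical spin with two diamonds
diamonds <- sum(other_spin == "DD")   # 2

# Doubling a prize once per diamond, without any if statements
prize * (2^diamonds)                  # 320
```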
- 9.5 - Code comments
- When to use comments
- How to write functions: start with the “meat” inside and once it’s working, encapsulate into a function. You can either do this by hand or with the “Extract Function” option in RStudio (menu bar under “Code”)
10 - S3
This is less important. After you have a solid foundation in R, you may find it helpful to come back and read this.
11 - Loops
- 11.1 - Expected values
- Explains how this is defined for a discrete RV
- Illustrates using the weighted die example from earlier
- 11.2 - `expand.grid`
- Use `expand.grid()` to create a dataframe whose rows contain the Cartesian product of two vectors. In other words: every element of the first vector is paired with every element of the second.
- Example of `expand.grid()` to take the Cartesian product of a vector with itself: `rolls <- expand.grid(die, die)` to get all pairs when rolling two dice.
- Elementwise operations to convert `rolls` into the totals when rolling two dice: `rolls$value <- rolls$Var1 + rolls$Var2`
- Steps to calculate expected value of sum when rolling two dice:
- `Var1` in `rolls` is the first die and `Var2` is the second. The sum is `rolls$value`
- Create a lookup table for probabilities with names that match the values in `Var1` and use subsetting to get the desired probs: `probs1`
- Repeat for `Var2` to get `probs2`
- Joint probabilities: `rolls$prob <- probs1 * probs2`
- Expected value: `sum(rolls$value * rolls$prob)`
- Apply the preceding steps to the slot machine example.
- CAREFUL: set `stringsAsFactors = FALSE` with `expand.grid()` when using it with character data to avoid the result being stored as a factor.
- Example: use `expand.grid()` to create all combinations of three elements: `expand.grid(wheel, wheel, wheel, stringsAsFactors = FALSE)`
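The expected value steps from section 11.2 can be sketched as follows. The weights in `prob` are an assumption (a six is three times as likely as any other face), and the name `expected` is hypothetical.

```r
die <- 1:6
rolls <- expand.grid(die, die)          # all 36 pairs, columns Var1 and Var2
rolls$value <- rolls$Var1 + rolls$Var2  # total for each pair

# Hypothetical weighted die as a lookup table named by face
prob <- c("1" = 1/8, "2" = 1/8, "3" = 1/8, "4" = 1/8, "5" = 1/8, "6" = 3/8)

# Name-based lookup: convert the face to a character string first
probs1 <- unname(prob[as.character(rolls$Var1)])
probs2 <- unname(prob[as.character(rolls$Var2)])

rolls$prob <- probs1 * probs2           # joint probability of each pair
expected <- sum(rolls$value * rolls$prob)
expected                                # 8.25
```

Converting with `as.character()` makes the lookup go by name. Indexing with the integer faces directly would go by position, which happens to give the same answer here only because the faces are 1 to 6 in order.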
- 11.3 - `for` Loops
- “Do this for every value of that”: `for (value in that) { this }`
- Creates an object called `value` and assigns it a new value for each run of the loop
- `for` can loop over elements of any atomic vector and we can call the index variable whatever we like, e.g. `for (value in c('my', 'second', 'for', 'loop')) { ... }`
- Careful about scope: R runs the loop in the environment that called it. Objects will be overwritten if you’re not careful: `value` is actually an object that R creates, so you could accidentally overwrite an existing object.
- R is weird: its `for` loops iterate over elements of sets. In most other programming languages, `for` loops iterate over sequences of integers. In R we could choose to create a set of integers and loop over this, but we could also loop over some other kind of set.
- What Happens in Vegas Stays in Vegas and the same is true of a `for` loop. To use the results of the calculations in a `for` loop, we need to store them somewhere.
- Create an empty vector of a certain length: `chars <- vector(length = 4)` then fill it up within a `for` loop by indexing the elements of some other vector.
- Create an empty column in a dataframe using R’s recycling rules: `combos$prize <- NA` and then loop over the rows with `nrow()` to fill in the missing values.
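Two small loops that store their results, as section 11.3 recommends. The `combos` data frame here is a hypothetical 2-by-2 grid with a `total` column, standing in for the slot machine combinations and their prizes.

```r
words <- c("my", "second", "for", "loop")

# Pre-allocate storage, then fill it by index
chars <- vector(mode = "character", length = 4)
for (i in 1:4) {
  chars[i] <- words[i]
}
chars

# Empty column created by recycling NA, then filled row by row
combos <- expand.grid(1:2, 1:2)
combos$total <- NA
for (i in 1:nrow(combos)) {
  combos$total[i] <- combos$Var1[i] + combos$Var2[i]
}
combos
```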
- 11.4 - `while` Loops
- Two other kinds of loops in R: `while` and `repeat`
- `while` keeps looping as long as the condition is true; this is useful when you don’t know in advance how many times you need to do something, e.g. keep playing the slot machine until you run out of money.
- 11.5 - `repeat` Loops
- Repeat a chunk of code until you hit escape or a `break` statement is reached.
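A sketch of `while` and `repeat` as described in sections 11.4 and 11.5. The game inside `plays_till_broke` is a made-up stand-in for the slot machine: each play costs 1 and pays 2 with probability 0.4, so the player loses money on average and eventually goes broke.

```r
# while: keep playing until the money runs out
plays_till_broke <- function(start_cash) {
  cash <- start_cash
  n <- 0
  while (cash > 0) {
    winnings <- sample(c(0, 2), size = 1, prob = c(0.6, 0.4))  # hypothetical odds
    cash <- cash - 1 + winnings
    n <- n + 1
  }
  n
}

set.seed(42)  # arbitrary seed so the run is reproducible
n_plays <- plays_till_broke(10)
n_plays

# repeat: loops until break is reached
count <- 0
repeat {
  count <- count + 1
  if (count >= 5) break
}
count  # 5
```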
12 - Speed
- 12.1 - Vectorized Code
- Two examples of a function that returns the absolute value of all elements of a numeric vector: one using a `for` loop and another using logical subsetting
- The `for` loop version is much slower.
- `system.time(...)` to see how long it takes to run a command; don’t confuse with `Sys.time()` which instead returns the current system time!
- Compare the timings of the vectorized absolute value function to R’s built-in function `abs()`. Guess who wins? The built-in function!
- 12.2 - How to Write Vectorized Code
- A general pattern for creating vectorized code:
- Replace sequential steps with existing vectorized functions.
- Use logical subsetting for parallel cases; handle everything at once.
- Example: vectorize a function that renames the slot machine symbols by combining a `for` loop with a giant `if` tree. Time the two versions of the code.
- Best solution to the preceding: use a lookup table and `unname()` to clean up the result: `unname(tb[vec])` where `vec` is the input vector whose names we want to change and `tb` is the lookup table. The result is 40 times faster!
- R is not a compiled language. This is why `for` loops suffer a speed penalty relative to subsetting approaches: the subsetting operations effectively get “handed over” to extremely fast compiled C code “under the hood.”
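The comparisons in sections 12.1 and 12.2 might look like this. The function names `abs_loop` and `abs_sets`, the test vector `long`, and the contents of the lookup table `tb` are assumptions for this sketch; actual timings depend on the machine, so none are shown.

```r
# Loop version: handles one element at a time
abs_loop <- function(vec) {
  for (i in seq_along(vec)) {
    if (vec[i] < 0) {
      vec[i] <- -vec[i]
    }
  }
  vec
}

# Vectorized version: logical subsetting handles all negatives at once
abs_sets <- function(vec) {
  negs <- vec < 0
  vec[negs] <- vec[negs] * -1
  vec
}

long <- rep(c(-1, 1), 50000)
system.time(abs_loop(long))
system.time(abs_sets(long))
system.time(abs(long))  # built-in function

# Renaming symbols with a lookup table instead of a loop plus if tree
tb <- c("DD" = "joker", "C" = "ace", "7" = "king")  # hypothetical mapping
vec <- c("DD", "C", "7", "C")
unname(tb[vec])
```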
- 12.3 - How to Write Fast `for` Loops in R
- Sometimes you can’t avoid a `for` loop, e.g. when the calculation in step (k+1) of a problem depends on the value computed in step k.
- How can we make our `for` loops faster? Two principles:
- Do as much as you can outside the loop: anything inside the loop is run many many times.
- Pre-allocate storage for any results you want to store. If you know you’re trying to calculate 1000 values, start by creating an empty 1000-vector outside the loop. Increasing the size of an object “on the fly” incurs a huge performance penalty because it requires R to make copies of an object in memory so it can erase and replace.
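Both principles from section 12.3 in one sketch: a calculation where step k+1 depends on step k, with the constant computed outside the loop and storage pre-allocated. The compound growth example and the names `growth` and `balance` are assumptions for illustration.

```r
n <- 1000
growth <- 1 + 1 / 100   # computed once, outside the loop

balance <- numeric(n)   # pre-allocated vector of 1000 zeros
balance[1] <- 100

for (k in 1:(n - 1)) {
  balance[k + 1] <- balance[k] * growth  # step k + 1 depends on step k
}

length(balance)
```

Growing the vector inside the loop, e.g. with `balance <- c(balance, new_value)`, would give the same numbers but force R to copy the vector on every iteration.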
- 12.4 - Vectorized Code in Practice
- Apply some of the preceding ideas to the slot machine example.
A - Installing R and RStudio
Not crucial: we’ll use RStudio cloud and the TAs can help you with installation on your local machine if needed during their office hours.
B - R Packages
- “base R” refers to the functions that are always available to you whenever you start R. You can use these “out of the box” as it were.
- Other packages need to be installed and then loaded. You only need to install once; you need to load whenever you restart R.
- `install.packages('<package name>')` to install a single package at the R console
- `install.packages(c('ggplot2', 'reshape2', 'dplyr'))` to install multiple packages at once
- You’ll be prompted to choose a mirror the first time you install packages. When in doubt, choose Austria: new things appear here first since it’s the main CRAN repo.
- `install_github()` (from the devtools package) etc. to install development versions of packages or packages that aren’t (yet) on CRAN.
- `library(<package name>)` to load a package; you need to do this before using it and repeat whenever you restart your R session.
- `library()` to see the packages you currently have installed
- Learn about new packages using task views: http://cran.r-project.org/web/views
C - Updating R and its Packages
- `update.packages(oldPkgs = c('ggplot2', 'reshape2', 'dplyr'))` to check whether you have the most recent versions of these packages and, if not, install them. (The package names go in `oldPkgs` because the first argument of `update.packages()` is the library location.)
- Start a new R session after updating; close R session and open new one if you already have the old version loaded.
D - Loading and Saving Data in R
Less important since we’ll talk about this in Core ERM.
- D.1 - Data Sets in Base R
- D.2 - Working Directory
- D.3 - Plain-text Files
- D.4 - R Files
- D.5 - Excel Spreadsheets
- D.6 - Loading Files from Other Programs
E - Debugging R Code
Worth reading, but not essential for Core ERM. You can come back to it later.
- E.1 - traceback
- E.2 - browser
- E.3 - Break Points
- E.4 - debug
- E.5 - trace
- E.6 - recover