Last updated: 23 Apr 22 10:06:50 (UTC)
Frank's Outline of HOPR (Hands-on Programming with R)
https://rstudio-education.github.io/hopr/
2 - The Very Basics
- 2.1 The R user interface:
- Basic arithmetic at the console
- Comments: #
- Control-C to cancel a command
- 2.2 Objects
- Colon operator :
- Objects and assignment with <-
- RStudio environment pane
- 2.3 Functions
- function and argument
- mean(), round(), factorial()
- sample() with and without replacement
- Simulate rolling a pair of dice with sample
- 2.4 Writing Your Own Functions
- sum()
- The function constructor
- A function to roll a fair die twice and return the sum
- 2.5 Arguments
- Function with arguments and/or default values
- Function to roll a pair of dice where the faces on each die are an argument
- Details of functions: name, body, arguments, default values, last line of code (return)
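A minimal sketch of the kind of function 2.5 describes. The name roll and the default faces 1:6 are illustrative assumptions, not necessarily the book's exact code.

```r
# Roll a pair of dice; the faces on each die are an argument with a default value.
roll <- function(faces = 1:6) {
  dice <- sample(faces, size = 2, replace = TRUE)
  sum(dice)  # the value of the last line is what the function returns
}

roll()                 # uses the default faces 1:6
roll(faces = c(1, 2))  # a pair of two-sided "dice"
```

Passing a vector of faces rather than a single number matters here: sample(n, ...) with a single number n draws from 1:n.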
- 2.6 Scripts
- Creating a script in RStudio: editor pane
- Control + Return / Command + Return to run lines from script
- The “Run” and “Source” buttons in RStudio
3 - Packages and Help Files
- 3.1 Packages
- install.packages()
- library()
- Directs readers to an appendix with more details about package management
- c() function
- Scatterplot with qplot() from ggplot2
- Histogram with qplot(), setting the binwidth parameter
- replicate() to repeat the simulation of rolling two dice
- Make a histogram of the result from replicate() to check the dice-rolling simulation code
- Asks the reader to weight the dice by using an option in the sample() function. Doesn’t say which, so you need to look in the help file!
- 3.2 Getting Help
- ? to get help
- Can’t get help about package commands unless you’ve loaded the package with library()
- Parts of a help file
- Advice on how to read a help file
- The argument for sample() used to weight the dice is prob
- Other places to get help: R help list, Stack Overflow, and <community.rstudio.com>
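A rough sketch of the weighted-dice simulation using only base R. The function name roll_weighted and the particular weights (a six comes up 3/8 of the time) are assumptions for illustration.

```r
# Weighted dice: the prob argument of sample() sets the chance of each face.
roll_weighted <- function() {
  die <- 1:6
  dice <- sample(die, size = 2, replace = TRUE,
                 prob = c(1/8, 1/8, 1/8, 1/8, 1/8, 3/8))  # assumed weights
  sum(dice)
}

set.seed(1)                                 # make the simulation reproducible
rolls <- replicate(10000, roll_weighted())  # repeat the simulation many times
table(rolls) / length(rolls)                # relative frequency of each total
```

With these weights the high totals should show up noticeably more often than they would for fair dice.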
Note that chapter 4 doesn’t have any content: it just outlines the next group of chapters.
5 - Objects
- 5.1 Atomic Vectors
- An atomic vector is one-dimensional and stores data of a single type.
- Six basic types of atomic vector: double, integer, character, logical, complex, and raw.
- Make an atomic vector with
c() - Test whether something is an atomic vector with
is.vector() length()to find the length of an atomic vector- An atomic vector can be length one: R doesn’t actually have scalar types!
- Use
Lto create an atomic vector of integer type typeof()to find out what type of atomic vector you have- numeric is a synonym for double in R
- Arithmetic with integers is always exact, but with doubles it may not be. Doubles are exact when used to store integers and store a wider range of integers so you rarely see integer types in practice.
- Character data and the distinction between
1and”1”as well as betweenxand”x” - Logical aka Boolean means
TRUEorFALSE - Shorthand
TandFbut don’t use them because they can be re-defined by the user, e.g.T <- FALSE - We’re not going to work with raw or complex types so don’t worry about them.
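A short sketch of the points in 5.1, using made-up vectors x and y:

```r
x <- c(1, 2, 3)     # numbers are stored as doubles by default
y <- c(1L, 2L, 3L)  # the L suffix asks for integers
typeof(x)           # "double"
typeof(y)           # "integer"
length(5)           # 1: a single number is an atomic vector of length one
is.vector(x)        # TRUE

# Doubles have finite precision, so arithmetic with them may not be exact
sqrt(2)^2 - 2 == 0               # FALSE
isTRUE(all.equal(sqrt(2)^2, 2))  # TRUE: compare with a tolerance instead

typeof('1')         # "character"
typeof(TRUE)        # "logical"
```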
- 5.2 Attributes
- An attribute is a piece of “extra information” that can be “attached” to any R object, including an atomic vector
- Use attributes() to view an object’s attributes: it returns NULL if it has none
- NULL is how R denotes the empty set aka the null set. You can create an empty set: x <- NULL
- Common attributes for atomic vectors: names, dimensions (dim), classes
- names() to give an object a names attribute; can set to NULL or change existing names
- Use dim() to convert an atomic vector to an n-dimensional array by giving it dimensions
- Dimension order is (rows, columns, slices) with dim()
- 5.3 Matrices; 5.4 Arrays
- matrix() to create a 2-dimensional array; fills by column by default but can change with byrow
- array() to create an n-dimensional array; first argument is an atomic vector of values, second is a vector of dimensions
- Exercises: make a few matrices and fill them by row versus column
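A small sketch tying together attributes, dim(), matrix() and array(). The object names are arbitrary.

```r
die <- 1:6
attributes(die)                     # NULL: no attributes yet
names(die) <- c('one', 'two', 'three', 'four', 'five', 'six')
attributes(die)                     # now there is a names attribute
names(die) <- NULL                  # remove the names again

m <- 1:6
dim(m) <- c(2, 3)                   # 2 rows, 3 columns, filled column by column
by_col <- matrix(1:6, nrow = 2)     # same layout as setting dim by hand
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # fill row by row instead
a <- array(1:24, dim = c(2, 3, 4))  # rows, columns, slices
```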
- 5.5 Class
- An object’s class isn’t the same thing as its type
- class() to check or change an object’s class; but changing it is usually a bad idea!
- When you change the dimensions of an atomic vector, its class changes but not its type; examples
- Key point: matrices and arrays are simply atomic vectors with attributes! This means that, like an atomic vector, they can only hold data of a single type
- Dates are an interesting example of a class. There’s no date type in R: instead R handles dates by setting a class attribute for a double
- You can see the underlying representation by using unclass()
- x <- Sys.time(); typeof(x); class(x); unclass(x)
- R uses factors to represent categorical variables; there’s no factor type: instead this is handled by setting attributes
- Use factor() to create a factor; use attributes() to see the levels and class attributes; use unclass() to see the underlying representation
- TLDR: factors basically represent character data with a limited number of possible values as integers. This is convenient for statistical models, e.g. regression, but can create problems in other settings.
- In versions of R before 4.0.0, R often read in character data as a factor by default. If you need to undo this, you can use as.character()
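A quick sketch of how class sits on top of type for dates and factors. The variable names and the example values are illustrative.

```r
now <- Sys.time()
typeof(now)           # "double"
class(now)            # "POSIXct" "POSIXt"
unclass(now)          # the underlying number: seconds since 1970-01-01 UTC

gender <- factor(c('male', 'female', 'female', 'male'))
typeof(gender)        # "integer"
attributes(gender)    # the levels and class attributes
unclass(gender)       # integer codes plus the levels attribute
as.character(gender)  # convert back to character data
```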
- 5.6 Coercion
- If you try to store different types in an atomic vector, R will convert them to a single type.
- This is called coercion
- E.g. x <- c(1, TRUE, '1')
- Coercion follows this precedence relation: character > numeric > logical
- The type with the highest precedence “wins” and everything else gets converted to this type.
- When logical types are converted to numeric, TRUE becomes 1 and FALSE becomes 0.
- This is handy for doing mathematics, e.g. mean(c(TRUE, FALSE, TRUE))
- To convert manually use as.character(); as.numeric(); as.logical()
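The coercion rules above can be checked directly at the console. A minimal sketch:

```r
typeof(c(1, TRUE, '1'))     # "character": character has the highest precedence
typeof(c(1, TRUE))          # "double": TRUE is converted to 1
sum(c(TRUE, FALSE, TRUE))   # 2: counts the TRUE values
mean(c(TRUE, FALSE, TRUE))  # the proportion of TRUE values

# Manual conversion
as.numeric('3.14')          # 3.14
as.logical(0)               # FALSE
as.character(TRUE)          # "TRUE"
```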
- 5.7 Lists
- The advantage of atomic vectors / matrices / arrays is that working with them is fast, memory-efficient, and mathematically convenient because they contain elements of a single type.
- We use a list when we need an object that contains elements of arbitrary types.
- The elements of a list can be any R object whatsoever: atomic vectors, other lists, you name it!
- list() creates a list just like c() creates an atomic vector: use commas to separate the elements.
- Lists may have multiple levels of indexing, since they can contain elements that themselves have indices
- You can give names to list items when you create a list, e.g. list(dogs = c('Fido', 'Spot', 'Lassie'), currencies = c('Sterling', 'Dollars'), primes = c(2, 3, 5, 7, 11))
- Use str() to get an overview of what’s contained in a list
- 5.8 Data Frames
- These are just a special case of a list: a 2-dimensional list in which each element is an atomic vector of the same length but potentially a different type.
- You can think of a data frame as a “spreadsheet”
- You can create a data frame using data.frame() and give names to the columns e.g. data.frame(name = c('Waldo', 'Wanda', 'Wilfred'), age = c(24, 32, 43))
- Before R 4.0.0, R stored character data in a data frame as a factor by default. To disable this, use the stringsAsFactors = FALSE option with data.frame()
- A data frame df is an object whose type is list, and whose class is data.frame. To see this, and look inside a data frame, use typeof(df); class(df); str(df)
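A minimal sketch of the data frame example from 5.8. Setting stringsAsFactors = FALSE explicitly is harmless on recent versions of R and keeps the behaviour the same on older ones.

```r
df <- data.frame(name = c('Waldo', 'Wanda', 'Wilfred'),
                 age = c(24, 32, 43),
                 stringsAsFactors = FALSE)
typeof(df)  # "list": a data frame is a list of equal-length columns
class(df)   # "data.frame"
str(df)     # compact overview of the columns and their types
```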
- 5.9 Reading Data
- Shows how to use the RStudio GUI to read in a text file.
- More details are provided in the appendix “Loading and Saving Data in R” including read.table(), setting a path, etc.
- 5.10 Saving Data
- Shows how to use write.csv() with row.names = FALSE
- More details in the appendix “Loading and Saving Data in R”
6 - R Notation
This chapter is about how to access elements of R objects. It mainly focuses on data frames but much of the material also applies to matrices, arrays, and lists.
- 6.1 Selecting Values
- Use the syntax x[row_indices, column_indices] to extract elements of a data frame x. (All of this works for a matrix and generalises to higher-dimensional arrays.)
- There are six ways of specifying the indices. In each case you supply an atomic vector, possibly of length one:
- Positive integers: specify which row and column indices to select
- Negative integers: specify which row and column indices to not select
- Zero: if you put a zero for a dimension, R extracts nothing for that dimension. Not very useful
- Blank space: if blank, everything in that dimension is extracted e.g. all rows, or all columns, or all slices
- Logical values: need to match the length of the dimension or R will recycle. TRUE if you want the element and FALSE if you don’t
- Names: if you’ve given names to the rows and/or columns you can use them. Less error prone than matching indices
- R begins indexing with 1 rather than 0; its notation follows linear algebra books rather than languages like C or Python
- If you ask for a single row or column e.g. x[1, 1:3] R will convert the result from a matrix to an atomic vector. To prevent this use drop = FALSE e.g. x[1, 1:3, drop = FALSE]
- See Radford Neal’s blog post about how needing drop = FALSE can be considered an R design flaw
- You can supply the same row or column index multiple times, in which case R will give you repeated elements! E.g. x[c(1, 1), 1:3]. This is helpful for the bootstrap
- You can’t mix negative integers and positive integers as subscripts! But you can mix zeros with either.
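A sketch of the six ways of specifying indices, applied to a small made-up matrix m (the row and column names are arbitrary):

```r
m <- matrix(1:12, nrow = 3,
            dimnames = list(c('a', 'b', 'c'), c('p', 'q', 'r', 's')))

m[1, 2]                           # positive integers: row 1, column 2
m[-1, ]                           # negative integer: everything except row 1
m[0, ]                            # zero: selects no rows
m[, 2]                            # blank: all rows of column 2
m[, c(TRUE, FALSE, TRUE, FALSE)]  # logical: keep columns 1 and 3
m['b', 's']                       # names

m[1, 1:3]                         # a single row is simplified to an atomic vector
m[1, 1:3, drop = FALSE]           # stays a 1 x 3 matrix
m[c(1, 1), 1:3]                   # the same row selected twice
```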
- 6.2 Deal a Card
- Write a function that uses brackets to select the first card (row) in a deck (data frame)
- 6.3 Shuffle Card
- Use head() and tail() to look at a data frame that represents a deck of cards
- Use sample() to shuffle the deck by returning all of the rows in a random order (a random row permutation)
- 6.4 Dollar Signs and Double Brackets
- These allow us to extract elements from a list. Since a data frame is a list, we can use them on a data frame too.
- $ to extract a column from a data frame by name: example of taking the median and mean of a column of numeric data
- The “usual” way to access elements of a list is the same as for an atomic vector: x[1] etc. Here R will return the same type: list from a list and atomic vector from atomic vector
- Example of trying to take the sum of an atomic vector stored in a list: sum(x[1]) doesn’t work.
- If the elements of a list are named we can use $ to extract a list element by name and R will not necessarily return a list: it will return the object stored in the list that has that name.
- Using $ the example with a sum works.
- But what if we don’t have named list elements? Then we can use double brackets: x[[1]] doesn’t return a list unless the first element of x is itself a list. Fixes the problem in the sum example: sum(x[[1]]).
- We can combine the single and double bracket notation for lists with any of the six ways of specifying indices from above: positive integers, negative, logical, names, etc.
- Example: lst = list(numbers = c(1, 3, 5), boolean = TRUE, strings = c('a', 'b', 'c'))
- lst["numbers"] returns a list while lst[["numbers"]] returns an atomic vector
- Think of a list as a train that has many cars. Each car contains an R object.
- Single brackets return another train with the cars you have selected: the objects are still inside the train cars.
- Double brackets unload the train cars you’ve specified, and return the contents without the car itself
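The train analogy can be tried out on the example list from this section. A minimal sketch:

```r
lst <- list(numbers = c(1, 3, 5), boolean = TRUE, strings = c('a', 'b', 'c'))

class(lst['numbers'])    # "list": single brackets return a smaller train
class(lst[['numbers']])  # "numeric": double brackets unload the car
sum(lst$numbers)         # 9: $ also returns the contents
sum(lst[[1]])            # 9: same thing by position

# sum(lst[1]) fails because lst[1] is still a list
result <- try(sum(lst[1]), silent = TRUE)
class(result)            # "try-error"
```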
7 - Modifying Values
- 7.1 - Changing Values in Place
- LHS: describe values you want to modify; RHS: use <- to overwrite
- E.g. vec <- rep(0, 6); vec[1] <- 1000
- Can replace multiple values at once as long as the number of new values matches the number of selected values (shorter replacements are recycled): vec[c(1, 3, 5)] <- rep(1, 3)
- Can expand a vector: set values of elements that don’t (yet) exist, e.g. vec[7] <- 0
- A good use case is adding a column to an existing dataframe: deck2$new <- 1:52
- A bad use case is adding elements to a vector within a loop. Don’t do this! It’s really slow! Much better to pre-allocate. (This is discussed below)
- Can remove a column from a dataframe or an element from a list by assigning NULL to it, e.g. deck2$new <- NULL
- Simple example of recycling: vec[c(1, 3, 5)] <- -999
- When we change values in place in this way, note that we are overwriting the original object rather than creating a modified copy of that object.
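A sketch of changing values in place. The three-card deck2 is a hypothetical stand-in for the full 52-card deck used in the book.

```r
vec <- rep(0, 6)
vec[1] <- 1000               # overwrite one value
vec[c(1, 3, 5)] <- rep(1, 3) # overwrite several values at once
vec[7] <- 0                  # expand the vector: it now has length 7
vec[c(1, 3, 5)] <- -999      # recycling: one value fills all three slots

deck2 <- data.frame(face = c('king', 'queen', 'ace'))  # hypothetical small deck
deck2$new <- 1:3             # add a column
deck2$new <- NULL            # remove it again
```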
- 7.2 - Logical Subsetting
- Some key logical operators in R: >, >=, <, <=, ==, !=
- These return TRUE or FALSE.
- If you apply any of these to vectors, R interprets your intention elementwise. This is exactly how R works with numeric vectors.
- Another logical operator: %in%
- %in% works differently from the other logical operators from above: x %in% y returns a logical vector of the same length as x. Each element indicates whether the corresponding element of x is among the elements of y.
- Be careful about == versus =. The former is a logical operator while the latter is an assignment operator: a synonym for <-
- Exercise: count how many of the cards in a deck are ace cards
- Using logical subsetting to modify values: deck3$value[deck3$face == 'ace'] <- 14
- Boolean Operators: &, |, xor, !, any, all
- Common error: forgetting to “put a complete test on either side of a Boolean operator,” e.g. x > 2 & < 9 is WRONG; write x > 2 & x < 9 instead
- Examples of selecting particular rows of a dataframe using compound statements that combine Boolean and logical operators
- Selecting rows of a dataframe with %in%
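A sketch of logical subsetting on a hypothetical four-card deck3 (the real deck in the book has 52 rows):

```r
deck3 <- data.frame(face = c('king', 'ace', 'two', 'ace'),
                    suit = c('spades', 'hearts', 'hearts', 'clubs'),
                    value = c(13, 1, 2, 1),
                    stringsAsFactors = FALSE)

n_aces <- sum(deck3$face == 'ace')      # count the aces
deck3$value[deck3$face == 'ace'] <- 14  # modify only the matching rows

# Compound test: a complete comparison on each side of &
high_hearts <- deck3[deck3$suit == 'hearts' & deck3$value > 10, ]

# %in% replaces a chain of == tests joined by |
royals <- deck3[deck3$face %in% c('king', 'ace'), ]

x <- 5
x > 2 & x < 9                           # TRUE
```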
- 7.3 - Missing Information
- NA is a reserved word in R. It means “not available” and serves as a “placeholder for missing information”
- NA doesn’t behave as you might expect! But the best way to reason about it is to replace NA with the phrase “a value I don’t know.”
- For example: NA + 1 means “what is the sum of 1 and a value I don’t know?” The answer is “I don’t know” i.e. NA
- In other words: NA values tend to “propagate.”
- Why does R behave this way? It wants to confront you with the fact that you have missing data, to prevent you from making silly mistakes.
- Many functions, e.g. mean(), have an na.rm option. Setting this equal to TRUE drops the missing values. Otherwise taking a mean that includes any missing values will be an NA.
- NA == 1 returns NA and NA == NA also returns NA. This means we can’t test for NA values using ==
- is.na() allows us to test for missing values
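A short sketch of how NA behaves, using a made-up vector x:

```r
NA + 1                 # NA: the unknown value propagates
x <- c(1, NA, 3)
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2: missing values are dropped first
NA == NA               # NA, so == cannot detect missing values
is.na(x)               # FALSE TRUE FALSE
```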
8 - Environments
This chapter isn’t so important for core ERM. There’s some helpful material on scope, but it may be better to revisit this after you already have a bit more familiarity with the “nuts and bolts” of R.
9 - Programs
Build a working slot machine using R functions. The point of the chapter is to think about how to create programs by combining functions.
- 9.1 - Strategy
- How to solve complicated problems by breaking them up into sub-tasks
- sequential steps: some things need to be done in order, with the output of one step serving as the input to the next
- parallel cases: in some settings our program needs to work in a different way depending on the kind of input that is supplied
- Make a flowchart of the slot machine example to clarify which steps are sequential and which cases are parallel
- 9.2 - if Statements
- Basic syntax of if in R
- Condition of if must evaluate to a single TRUE or FALSE. In older versions of R, supplying a longer logical vector throws a warning and R uses the first element of the vector; from R 4.2.0 onwards it is an error.
- Some quiz questions to check basic understanding of if
- 9.3 - else Statements
- Basic syntax of if ... else in R.
- trunc() function to extract the decimal part: x - trunc(x)
- if ... else with trunc to round a numeric value
- Syntax of else if
- R starts at the top of an if block and continues until it finds a logical condition that evaluates to true. It then runs the corresponding code block and skips any remaining if or else blocks. When none of the if or else if blocks has a true condition, R evaluates the code block associated with else if one is present.
- Use if ... else if ... else for the parallel cases in the slot machine example
- unique() function to create a subvector with no repeat elements
- && and || work like & and | except that they are lazy: they evaluate the second test only when the first does not already settle the answer, and they operate on single values rather than whole vectors. You can always be sure that they will return a single logical TRUE or FALSE rather than a vector. This makes them handy for use in if ... else if ... else statements.
- Replacing multiple | operators with a single %in% to yield code that is more efficient and easier to read and maintain
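A minimal sketch of if ... else with trunc(), plus the %in% simplification. The function name round_half_up and the example symbols are illustrative assumptions, and the rounding function is only meant for non-negative inputs.

```r
# Round a single non-negative number using its decimal part
round_half_up <- function(x) {
  decimal <- x - trunc(x)
  if (decimal >= 0.5) {
    trunc(x) + 1
  } else {
    trunc(x)
  }
}

symbols <- c('B', 'BB', 'BBB')  # hypothetical slot machine result

# Several | operators ...
all_bars_long <- all(symbols == 'B' | symbols == 'BB' | symbols == 'BBB')
# ... replaced by a single %in%
all_bars <- all(symbols %in% c('B', 'BB', 'BBB'))

# unique() drops repeats, so one unique symbol means three of a kind
three_of_a_kind <- length(unique(symbols)) == 1
```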
- 9.4 - Lookup Tables
- How to avoid an if ... else if ... else statement with lots and lots of else if blocks?
- Pro tip: subsetting often provides the simplest way to achieve a complicated task in R.
- Create a named vector called payouts with names that correspond to the symbols that appear on the slot machine and values that give the associated payoffs, e.g. payouts <- c('DD' = 100, '7' = 80, 'BBB' = 40, ... )
- Now it’s easy to find the prize for a given symbol: payouts['DD']
- This is a lookup table. We create one in R by subsetting a named object in a clever way.
- By feeding a variable into payouts[...] we can look up a payout that arises at runtime based on the randomly generated symbols on the slot machine.
- Using R’s coercion rules to count occurrences of a particular value, e.g. sum(symbols == 'DD')
- When to use if trees versus lookup tables? Trees are for running different code in each branch; lookup tables are for assigning different values in each branch.
- How to convert from an if tree to a lookup table? Find the values that are being assigned and store them as a vector. Then find the conditions used to choose between them. If the conditions are based on character strings, use name-based subsetting; otherwise use integer-based subsetting.
- Double the prize for every diamond present: nice vectorized solution prize * (2^diamonds)
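A sketch of the lookup-table idea. The table below is deliberately partial: it holds only the three payouts quoted above, and the spin in symbols is made up.

```r
# Partial lookup table: names are symbols, values are payoffs
payouts <- c('DD' = 100, '7' = 80, 'BBB' = 40)

payouts['DD']                     # look up a prize by name

symbols <- c('BBB', 'DD', 'BBB')  # hypothetical spin
key <- 'BBB'                      # a value only known at runtime
prize <- unname(payouts[key])     # unname() drops the label from the result

diamonds <- sum(symbols == 'DD')  # TRUE is coerced to 1, so this counts diamonds
final_prize <- prize * (2^diamonds)  # double the prize once per diamond
```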
- 9.5 - Code comments
- When to use comments
- How to write functions: start with the “meat” inside and once it’s working, encapsulate into a function. You can either do this by hand or with the “Extract Function” option in RStudio (menu bar under “Code”)
10 - S3
This is less important. After you have a solid foundation in R, you may find it helpful to come back and read this.
11 - Loops
- 11.1 - Expected values
- Explains how this is defined for a discrete RV
- Illustrates using the weighted die example from earlier
- 11.2 - expand.grid
- Use expand.grid() to create a dataframe whose rows contain the Cartesian product of two vectors. In other words: every element of the first vector is paired with every element of the second.
- Example of expand.grid() to take the Cartesian product of a vector with itself: rolls <- expand.grid(die, die) to get all pairs when rolling two dice.
- Elementwise operations to convert rolls into the totals when rolling two dice: rolls$value <- rolls$Var1 + rolls$Var2
- Steps to calculate expected value of sum when rolling two dice:
- Var1 in rolls is the first die and Var2 is the second. The sum is rolls$value
- Create a lookup table for probabilities with names that match the values in Var1 and use subsetting to get the desired probs: probs1
- Repeat for Var2 to get probs2
- Joint probabilities: rolls$prob <- probs1 * probs2
- sum(rolls$value * rolls$prob)
- Apply the preceding steps to the slot machine example.
- CAREFUL: set stringsAsFactors = FALSE with expand.grid() when using it with character data to avoid the result being stored as a factor.
- Example: use expand.grid() to create all combinations of three elements: expand.grid(wheel, wheel, wheel, stringsAsFactors = FALSE)
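The steps for the expected value of the sum of two dice can be sketched as follows. The weights in prob (a six comes up 3/8 of the time) are an assumption for illustration.

```r
die <- 1:6
rolls <- expand.grid(die, die)          # all 36 pairs, in columns Var1 and Var2
rolls$value <- rolls$Var1 + rolls$Var2  # total for each pair

# Lookup table of probabilities, named by face (assumed weights)
prob <- c('1' = 1/8, '2' = 1/8, '3' = 1/8, '4' = 1/8, '5' = 1/8, '6' = 3/8)

probs1 <- prob[as.character(rolls$Var1)]  # name-based lookup for the first die
probs2 <- prob[as.character(rolls$Var2)]  # and for the second
rolls$prob <- unname(probs1 * probs2)     # joint probability of each pair

expected <- sum(rolls$value * rolls$prob)
expected
```

With these weights a single die has expected value 33/8 = 4.125, so the sum of two independent dice should come out at 8.25.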
- 11.3 - for Loops
- “Do this for every value of that”: for (value in that) { this }
- Creates an object called value and assigns it a new value for each run of the loop
- for can loop over elements of any atomic vector and we can call the index variable whatever we like, e.g. for (value in c('my', 'second', 'for', 'loop')) { ... }
- Careful about scope: R runs the loop in the environment that called it. Objects will be overwritten if you’re not careful: value is actually an object that R creates, so you could accidentally overwrite an existing object.
- R is weird: its for loops iterate over elements of sets. In most other programming languages, for loops iterate over sequences of integers. In R we could choose to create a set of integers and loop over this, but we could also loop over some other kind of set.
- What Happens in Vegas Stays in Vegas and the same is true of a for loop. To use the results of the calculations in a for loop, we need to store them somewhere.
- Create an empty vector of a certain length: chars <- vector(length = 4) then fill it up within a for loop by indexing the elements of some other vector.
- Create an empty column in a dataframe using R’s recycling rules: combos$prize <- NA and then loop over the rows with nrow() to fill in the missing values.
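A small sketch of storing results from a for loop. The combos data frame here is a hypothetical two-column stand-in, and the "prize" is just the row sum.

```r
words <- c('my', 'second', 'for', 'loop')

# Empty character vector of the right length, filled inside the loop
chars <- vector(mode = 'character', length = length(words))
for (i in seq_along(words)) {
  chars[i] <- toupper(words[i])
}

# Empty column created by recycling NA, then filled row by row
combos <- data.frame(a = 1:3, b = 4:6)  # hypothetical data
combos$prize <- NA
for (i in 1:nrow(combos)) {
  combos$prize[i] <- combos$a[i] + combos$b[i]
}
```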
- 11.4 - while Loops
- Two other kinds of loops in R: while and repeat
- while keeps looping as long as the condition is true; this is useful when you don’t know in advance how many times you need to do something, e.g. keep playing the slot machine until you run out of money.
- 11.5 - repeat Loops
- Repeat a chunk of code until you hit escape or a break statement is reached.
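A sketch of the same task written with while and with repeat. To keep it deterministic, the slot machine is replaced by a hypothetical game in which every play costs 1 and never pays out.

```r
# while: check the condition before each play
plays_till_broke <- function(start_with) {
  cash <- start_with
  n <- 0
  while (cash > 0) {
    cash <- cash - 1  # stand-in for one play of the slot machine
    n <- n + 1
  }
  n
}

# repeat: loop forever until break is reached
plays_till_broke_repeat <- function(start_with) {
  cash <- start_with
  n <- 0
  repeat {
    cash <- cash - 1
    n <- n + 1
    if (cash <= 0) {
      break
    }
  }
  n
}

plays_till_broke(5)
plays_till_broke_repeat(5)
```

One difference worth noting: the repeat version always runs its body at least once, whereas the while version runs zero times if the condition is false at the start.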
12 - Speed
- 12.1 - Vectorized Code
- Two examples of a function that returns the absolute value of all elements of a numeric vector: one using a for loop and another using logical subsetting
- The for loop version is much slower.
- system.time(...) to see how long it takes to run a command; don’t confuse with Sys.time() which instead returns the current system time!
- Compare the timings of the vectorized absolute value function to R’s built-in function abs(). Guess who wins? The built-in function!
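A sketch of the two absolute value functions and how to time them. The names abs_loop and abs_sets and the test vector are assumptions, and the timings will vary by machine and R version.

```r
# Version 1: loop over the elements one at a time
abs_loop <- function(vec) {
  for (i in seq_along(vec)) {
    if (vec[i] < 0) {
      vec[i] <- -vec[i]
    }
  }
  vec
}

# Version 2: logical subsetting handles all the negative values at once
abs_sets <- function(vec) {
  negs <- vec < 0
  vec[negs] <- vec[negs] * -1
  vec
}

long <- rep(c(-1, 1), 500000)
system.time(abs_loop(long))
system.time(abs_sets(long))
system.time(abs(long))  # the built-in function
```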
- 12.2 - How to Write Vectorized Code
- A general pattern for creating vectorized code:
- Replace sequential steps with existing vectorized functions.
- Use logical subsetting for parallel cases; handle everything at once.
- Example: vectorize a function that renames the slot machine symbols by combining a for loop with a giant if tree. Time the two versions of the code.
- Best solution to the preceding: use a lookup table and unname() to clean up the result: unname(tb[vec]) where vec is the input vector whose names we want to change and tb is the lookup table. The result is 40 times faster!
- R is not a compiled language. This is why for loops suffer a speed penalty relative to subsetting approaches: the subsetting operations effectively get “handed over” to extremely fast compiled C code “under the hood.”
- 12.3 - How to Write Fast for Loops in R
- Sometimes you can’t avoid a for loop, e.g. when the calculation in step (k+1) of a problem depends on the value computed in step k.
- How can we make our for loops faster? Two principles:
- Do as much as you can outside the loop: anything inside the loop is run many many times.
- Pre-allocate storage for any results you want to store. If you know you’re trying to calculate 1000 values, start by creating an empty 1000-vector outside the loop. Increasing the size of an object “on the fly” incurs a huge performance penalty because it requires R to make copies of an object in memory so it can erase and replace.
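A sketch of both principles on a calculation where each step depends on the previous one. The recursion x[k + 1] = a * x[k] + 1 is a made-up example.

```r
n <- 1000
a <- 0.5         # computed once, outside the loop
x <- numeric(n)  # pre-allocate all 1000 slots before the loop starts
x[1] <- 1

for (k in 1:(n - 1)) {
  x[k + 1] <- a * x[k] + 1  # step k + 1 depends on step k
}

x[1:3]           # 1.00 1.50 1.75
```

The slow alternative would start from x <- 1 and grow the vector with x <- c(x, ...) on every pass, forcing R to copy it each time.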
- 12.4 - Vectorized Code in Practice
- Apply some of the preceding ideas to the slot machine example.
A - Installing R and RStudio
Not crucial: we’ll use RStudio cloud and the TAs can help you with installation on your local machine if needed during their office hours.
B - R Packages
- “base R” refers to the functions that are always available to you whenever you start R. You can use these “out of the box” as it were.
- Other packages need to be installed and then loaded. You only need to install once; you need to load whenever you restart R.
- install.packages('<package name>') to install a single package at the R console
- install.packages(c('ggplot2', 'reshape2', 'dplyr')) to install multiple packages at once
- You’ll be prompted to choose a mirror the first time you install packages. When in doubt, choose Austria: new things appear here first since it’s the main CRAN repo.
- install_github() etc. to install development versions of packages or packages that aren’t (yet) on CRAN.
- library(<package name>) to load a package; you need to do this before using it and repeat whenever you restart your R session.
- library() to see the packages you currently have installed
- Learn about new packages using task views: http://cran.r-project.org/web/views
C - Updating R and its Packages
- update.packages(c('ggplot2', 'reshape2', 'dplyr')) to check whether you have the most recent versions of these packages and, if not, install them.
- Start a new R session after updating; close R session and open new one if you already have the old version loaded.
D - Loading and Saving Data in R
Less important since we’ll talk about this in Core ERM.
- D.1 - Data Sets in Base R
- D.2 - Working Directory
- D.3 - Plain-text Files
- D.4 - R Files
- D.5 - Excel Spreadsheets
- D.6 - Loading Files from Other Programs
E - Debugging R Code
Worth reading, but not essential for Core ERM. You can come back to it later.
- E.1 - traceback
- E.2 - browser
- E.3 - Break Points
- E.4 - debug
- E.5 - trace
- E.6 - recover