Last updated: 23 Apr 22 10:06:50 (UTC)

Frank's Outline of HOPR (Hands-on Programming with R)

https://rstudio-education.github.io/hopr/

2 - The Very Basics

  • 2.1 The R user interface:
    • Basic arithmetic at the console
    • Comments #
    • Control-C to cancel a command
  • 2.2 Objects
    • Colon operator :
    • Objects and assignment with <-
    • RStudio environment pane
  • 2.3 Functions
    • function and argument
    • mean(), round(), factorial()
    • sample() with and without replacement
    • Simulate rolling a pair of dice with sample
  • 2.4 Writing Your Own Functions
    • sum()
    • The function constructor
    • A function to roll a fair die twice and return the sum
  • 2.5 Arguments
    • Function with arguments and/or default values
    • Function to roll a pair of dice where the faces on each die are an argument
    • Details of functions: name, body, arguments, default values, last line of code (return)
  • 2.6 Scripts
    • Creating a script in RStudio: editor pane
    • Control + Return / Command + Return to run lines from script
    • The “Run” and “Source” buttons in RStudio
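
A minimal sketch of the kind of function sections 2.4 and 2.5 build up to: a function with one argument that has a default value, whose last line is the return value. The name roll and the argument name faces are chosen here for illustration.

```r
# Roll a pair of dice and return the sum of the two faces.
# `faces` defaults to a standard six-sided die.
roll <- function(faces = 1:6) {
  dice <- sample(faces, size = 2, replace = TRUE)  # with replacement: two independent rolls
  sum(dice)                                        # value of the last line is returned
}

set.seed(1)         # arbitrary seed, only so that a run can be repeated
roll()              # a pair of six-sided dice
roll(faces = 1:4)   # a pair of four-sided dice
```

Sampling with replacement matters here: without it the two dice could never show the same face.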

3 - Packages and Help Files

  • 3.1 Packages
    • install.packages()
    • library()
    • Directs readers to an appendix with more details about package management
    • c() operator
    • Scatterplot with qplot() from ggplot2
    • Histogram with qplot() setting the binwidth parameter
    • replicate() to repeat the simulation of rolling two dice
    • Make a histogram of result from replicate() to check the dice rolling simulation code
    • Ask the reader to weight the dice by using an option in the sample() function. Doesn’t tell which, so you need to look in the help file!
  • 3.2 Getting Help
    • ? to get help
    • Can’t get help about package commands unless you’ve loaded the package with library()
    • Parts of a help file
    • Advice on how to read a help file
    • The argument for sample() used to weight the dice is prob
    • Other places to get help: R help list, Stack Overflow, and <community.rstudio.com>
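
A short sketch of the simulation check described in 3.1, using base R's table() and mean() in place of a qplot() histogram so that no package is needed. The weights passed to prob are an assumption for illustration: each of the faces 1 to 5 gets probability 1/8 and the face 6 gets 3/8.

```r
# Fair pair of dice.
roll <- function() {
  sum(sample(1:6, size = 2, replace = TRUE))
}

# Weighted pair of dice: prob makes a six more likely than the other faces.
roll_weighted <- function() {
  weights <- c(1/8, 1/8, 1/8, 1/8, 1/8, 3/8)
  sum(sample(1:6, size = 2, replace = TRUE, prob = weights))
}

set.seed(1)  # arbitrary seed
fair   <- replicate(10000, roll())
loaded <- replicate(10000, roll_weighted())

table(fair)    # counts for each total from the fair dice
table(loaded)  # counts shift towards the larger totals
mean(fair)     # close to 7
mean(loaded)   # larger than the fair mean
```

Passing fair or loaded to a histogram function gives the visual check the book asks for.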

Note that chapter 4 doesn’t have any content: it just outlines the next group of chapters.

5 - Objects

  • 5.1 Atomic Vectors
    • An atomic vector is one-dimensional and stores data of a single type.
    • Six basic types of atomic vector: double, integer, character, logical, complex, and raw.
    • Make an atomic vector with c()
    • Test whether something is an atomic vector with is.vector()
    • length() to find the length of an atomic vector
    • An atomic vector can be length one: R doesn’t actually have scalar types!
    • Use L to create an atomic vector of integer type
    • typeof() to find out what type of atomic vector you have
    • numeric is a synonym for double in R
    • Arithmetic with integers is always exact, but with doubles it may not be. Doubles are exact when used to store integers and store a wider range of integers so you rarely see integer types in practice.
    • Character data and the distinction between 1 and "1" as well as between x and "x"
    • Logical aka Boolean means TRUE or FALSE
    • Shorthand T and F but don’t use them because they can be re-defined by the user, e.g. T <- FALSE
    • We’re not going to work with raw or complex types so don’t worry about them.
  • 5.2 Attributes
    • An attribute is a piece of “extra information” that can be “attached” to any R object, including an atomic vector
    • Use attributes() to view an object’s attributes: it returns NULL if it has none
    • NULL is how R denotes the empty set aka the null set. You can create an empty set: x <- NULL
    • Common attributes for atomic vectors: names, dimensions (dim), and class
    • names() to give an object a names attribute; can set it to NULL or change existing names
    • Use dim() to convert an atomic vector to an n-dimensional array by giving it dimensions
    • (rows, columns, slices) with dim()
  • 5.3 Matrices; 5.4 Arrays
    • matrix() to create a 2-dimensional array; fills by column by default but can change with byrow
    • array() to create an n-dimensional array; first argument is an atomic vector of values, second is a vector of dimensions
    • Exercises: make a few matrices and fill them by row versus column
  • 5.5 Class
    • An object’s class isn’t the same thing as its type
    • class() to check or change an object’s class; but changing it is usually a bad idea!
    • When you change the dimensions of an atomic vector, its class changes but not its type; examples
    • Key point: matrices and arrays are simply atomic vectors with attributes! This means that, like an atomic vector, they can only hold data of a single type
    • Dates are an interesting example of a class. There’s no date type in R: instead R handles dates by setting a class attribute for a double
    • You can see the underlying representation by using unclass()
    • x <- Sys.time(); typeof(x); class(x); unclass(x)
    • R uses factors to represent categorical variables; there’s no factor type, so instead this is handled by setting attributes
    • Use factor() to create a factor; use attributes() to see the levels and class attributes; use unclass() to see the underlying representation
    • TLDR: factors basically represent character data with a limited number of possible values as integers. This is convenient for statistical models e.g. regression but can create problems in other settings.
    • Before R 4.0.0, R often read in character data as a factor by default. If you need to undo this, you can use as.character()
  • 5.6 Coercion
    • If you try to store different types in an atomic vector, R will convert them to a single type.
    • This is called coercion
    • E.g. x <- c(1, TRUE, '1')
    • Coercion follows this precedence relation: character > numeric > logical
    • The type with the highest precedence “wins” and everything else gets converted to this type.
    • When logical types are converted to numeric, TRUE becomes 1 and FALSE becomes 0.
    • This is handy for doing mathematics, e.g. mean(c(TRUE, FALSE, TRUE))
    • To convert manually use as.character(); as.numeric(); as.logical()
  • 5.7 Lists
    • The advantage of atomic vectors / matrices / arrays is that working with them is fast, memory-efficient, and mathematically convenient because they contain elements of a single type.
    • We use a list when we need an object that contains elements of arbitrary types.
    • The elements of a list can be any R object whatsoever: atomic vectors, other lists, you name it!
    • list() creates a list just like c() creates an atomic vector: use commas to separate the elements.
    • Lists may have multiple levels of indexing, since they can contain elements that themselves have indices
    • You can give names to list items when you create a list, e.g. list(dogs = c('Fido', 'Spot', 'Lassie'), currencies = c('Sterling', 'Dollars'), primes = c(2, 3, 5, 7, 11))
    • Use str() to get an overview of what’s contained in a list
  • 5.8 Data Frames
    • These are just a special case of a list: a 2-dimensional list in which each element is an atomic vector of the same length but potentially a different type.
    • You can think of a data frame as a “spreadsheet”
    • You can create a data frame using data.frame() and give names to the columns e.g. data.frame(name = c('Waldo', 'Wanda', 'Wilfred'), age = c(24, 32, 43))
    • Before R 4.0.0, R stored character data in a data frame as a factor by default. To disable this, use the stringsAsFactors = FALSE option with data.frame()
    • A data frame df is an object whose type is list, and whose class is data.frame. To see this, and look inside a data frame, use typeof(df); class(df); str(df)
  • 5.9 Reading Data
    • Shows how to use the RStudio GUI to read in a text file.
    • More details are provided in the appendix “Loading and Saving Data in R” including read.table(), setting a path, etc.
  • 5.10 Saving Data
    • Shows how to use write.csv() with row.names = FALSE
    • More details in the appendix “Loading and Saving Data in R”
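
The distinctions between type, class, and attributes in this chapter are easy to check at the console. The sketch below uses made-up values (the size factor and the small data frame are illustrative) and notes where the result depends on the R version.

```r
die <- c(1, 2, 3, 4, 5, 6)
typeof(die)                  # "double"
typeof(1L)                   # "integer"

# Giving a vector dimensions changes its class but not its type.
dim(die) <- c(2, 3)
typeof(die)                  # still "double"
class(die)                   # "matrix" "array" in R >= 4.0.0, "matrix" before

# A factor is an integer vector with levels and class attributes.
size <- factor(c("small", "large", "small"))
typeof(size)                 # "integer"
attributes(size)             # levels and class
unclass(size)                # the underlying integer codes

# Coercion: character > numeric > logical.
typeof(c(1, TRUE, "1"))      # "character"
mean(c(TRUE, FALSE, TRUE))   # TRUE counts as 1 and FALSE as 0

# A data frame is a list of equal-length vectors with class "data.frame".
df <- data.frame(name = c("Waldo", "Wanda", "Wilfred"),
                 age  = c(24, 32, 43),
                 stringsAsFactors = FALSE)
typeof(df)                   # "list"
class(df)                    # "data.frame"
str(df)
```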

6 - R Notation

This chapter is about how to access elements of R objects. It mainly focuses on data frames but much of the material also applies to matrices, arrays, and lists.

  • 6.1 Selecting Values
    • Use the syntax x[ row_indices, column_indices ] to extract elements of a data frame x. (All of this works for a matrix and generalises to higher-dimensional arrays)
    • There are six ways of specifying the indices. In each case you supply an atomic vector, possibly of length one:
      1. Positive integers: specify which row and column indices to select
      2. Negative integers: specify which row and column indices to not select
      3. Zero: if you put a zero for a dimension, R extracts nothing for that dimension. Not very useful
      4. Blank space: if blank, everything in that dimension is extracted e.g. all rows, or all columns, or all slices
      5. Logical values: TRUE if you want the element and FALSE if you don’t. The vector should match the length of the dimension; if it is shorter, R will recycle it
      6. Names: if you’ve given names to the rows and/or columns you can use them. Less error-prone than matching indices
    • R begins indexing with 1 rather than 0; its notation follows linear algebra books rather than languages like C or Python
    • If you ask for a single row or column e.g. x[1, 1:3] R will convert the result from a matrix to an atomic vector. To prevent this use drop = FALSE e.g. x[1, 1:3, drop = FALSE]
    • See Radford Neal’s blog post about how having to specify drop = FALSE can be considered an R design flaw
    • You can supply the same row or column index multiple times, in which case R will give you repeated elements! E.g. x[c(1, 1), 1:3] This is helpful for the bootstrap
    • You can’t mix negative integers and positive integers as subscripts! But you can mix zeros with either.
  • 6.2 Deal a Card
    • Write a function that uses brackets to select the first card (row) in a deck (data frame)
  • 6.3 Shuffle the Deck
    • Use head() and tail() to look at a data frame that represents a deck of cards
    • Use sample() to shuffle the deck by returning all of the rows in a random order (a random row permutation)
  • 6.4 Dollar Signs and Double Brackets
    • These allow us to extract elements from a list. Since a data frame is a list, we can use them on a data frame too.
    • $ to extract a column from a data frame by name: example of taking the median and mean of a column of numeric data
    • The “usual” way to access elements of a list is the same as atomic vector: x[1] etc. Here R will return the same type: list from a list and atomic vector from atomic vector
    • Example: trying to take the sum of an atomic vector stored in a list with sum(x[1]) doesn’t work, because x[1] is still a list.
    • If the elements of a list are named we can use $ to extract a list element by name and R will not necessarily return a list: it will return the object stored in the list that has that name.
    • Using $ the example with a sum works.
    • But what if we don’t have named list elements? Then we can use double brackets: x[[1]] doesn’t return a list unless the first element of x is itself a list. Fixes the problem in the sum example: sum(x[[1]]).
    • We can combine the single and double bracket notation for lists with any of the six ways of specifying indices from above: positive integers, negative, logical, names, etc.
    • Example: lst = list(numbers = c(1, 3, 5), boolean = TRUE, strings = c('a', 'b', 'c'))
    • lst['numbers'] returns a list while lst[['numbers']] returns an atomic vector
    • Think of a list as a train that has many cars. Each car contains an R object.
    • Single brackets return another train with the cars you have selected: the objects are still inside the train cars.
    • Double brackets unload the train cars you’ve specified, and return the contents without the car itself
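
The ways of specifying indices and the single versus double bracket distinction can be tried on a tiny example. The four-row deck below is a hypothetical stand-in for the book's 52-card deck.

```r
# A small stand-in for the deck of cards (hypothetical data).
deck <- data.frame(face  = c("king", "queen", "jack", "ace"),
                   suit  = c("spades", "spades", "hearts", "hearts"),
                   value = c(13, 12, 11, 1),
                   stringsAsFactors = FALSE)

deck[1, ]                      # positive integer: first row, all columns
deck[-1, ]                     # negative integer: everything except row 1
deck[deck$value > 11, ]        # logical vector: rows where the test is TRUE
deck[, "face"]                 # name: one column, dropped to an atomic vector
deck[, "face", drop = FALSE]   # same column, kept as a data frame

# Shuffle: select every row, in a random order.
shuffled <- deck[sample(nrow(deck)), ]

# Single versus double brackets on a list.
lst <- list(numbers = c(1, 3, 5), boolean = TRUE, strings = c("a", "b", "c"))
class(lst["numbers"])      # "list": the contents are still inside the train car
class(lst[["numbers"]])    # "numeric": the contents themselves
sum(lst[["numbers"]])      # 9
sum(lst$numbers)           # 9
```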

7 - Modifying Values

  • 7.1 - Changing Values in Place
    • Describe the values you want to modify on the left-hand side of <- and put the new values on the right-hand side; the assignment overwrites them
    • E.g. vec <- rep(0, 6); vec[1] <- 1000
    • Can replace multiple values at once as long as the number of new values matches the number of selected values: vec[c(1, 3, 5)] <- rep(1, 3)
    • Can expand a vector: set values of elements that don’t (yet) exist, e.g. vec[7] <- 0
    • A good use case is adding a column to an existing dataframe: deck2$new <- 1:52
    • A bad use case is adding elements to a vector within a loop. Don’t do this! It’s really slow! Much better to pre-allocate. (This is discussed below)
    • Can remove a column from a dataframe or an element from a list by assigning it to NULL, e.g. deck2$new <- NULL
    • Simple example of recycling: vec[c(1, 3, 5)] <- -999
    • When we change values in place in this way, note that we are overwriting the original object rather than creating a modified copy of that object.
  • 7.2 - Logical Subsetting
    • Some key logical operators in R: >, >=, <, <=, ==, !=
    • These return TRUE or FALSE.
    • If you apply any of these to vectors, R interprets your intention elementwise. This is exactly how R works with numeric vectors.
    • Another logical operator: %in%
    • %in% works differently from the other logical operators from above: x %in% y returns a logical vector of the same length as x. Each element indicates whether the corresponding element of x is among the elements of y.
    • Be careful about == versus =. The former is a logical operator while the latter is an assignment operator: a synonym for <-
    • Exercise: count how many of the cards in a deck are ace cards
    • Using logical subsetting to modify values: deck3$value[deck3$face == 'ace'] <- 14
    • Boolean Operators: &, |, xor, !, any, all
    • Common error: forgetting to “put a complete test on either side of a Boolean operator,” e.g. x > 2 & < 9 is WRONG; write x > 2 & x < 9 instead
    • Examples of selecting particular rows of a dataframe using compound statements that combine Boolean and logical operators
    • Selecting rows of a dataframe with %in%
  • 7.3 - Missing Information
    • NA is a reserved word in R. It means “not available” and serves as a “placeholder for missing information”
    • NA doesn’t behave as you might expect! But the best way to reason about it is to replace NA with the phrase “a value I don’t know.”
    • For example: NA + 1 means “what is the sum of 1 and a value I don’t know?” The answer is “I don’t know” i.e. NA
    • In other words: NA values tend to “propagate.”
    • Why does R behave this way? It wants to confront you with the fact that you have missing data, to prevent you from making silly mistakes.
    • Many functions, e.g. mean() have an na.rm option. Setting this equal to TRUE drops the missing values. Otherwise taking a mean that includes any missing values will be an NA.
    • NA == 1 returns NA and NA == NA also returns NA. This means we can’t test for NA values using ==
    • is.na() allows us to test for missing values
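
A compact sketch of the three ideas in this chapter: changing values in place, logical subsetting, and missing values. The four-row deck3 is hypothetical and stands in for the book's full deck.

```r
vec <- rep(0, 6)
vec[1] <- 1000             # change one value in place
vec[c(1, 3, 5)] <- -999    # recycling: one value fills three slots
vec[7] <- 0                # assigning past the end extends the vector

# Hypothetical mini deck.
deck3 <- data.frame(face  = c("ace", "king", "ace", "two"),
                    suit  = c("spades", "hearts", "hearts", "clubs"),
                    value = c(1, 13, 1, 2),
                    stringsAsFactors = FALSE)

sum(deck3$face == "ace")                  # counts the aces: 2
deck3$value[deck3$face == "ace"] <- 14    # make aces high
deck3[deck3$suit %in% c("hearts", "clubs") & deck3$value > 10, ]

# Missing values propagate, and == cannot detect them.
x <- c(1, NA, 3)
mean(x)                  # NA
mean(x, na.rm = TRUE)    # 2
NA == NA                 # NA
is.na(x)                 # FALSE  TRUE FALSE
```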

8 - Environments

This chapter isn’t so important for core ERM. There’s some helpful material on scope, but it may be better to revisit this after you already have a bit more familiarity with the “nuts and bolts” of R.

9 - Programs

Build a working slot machine using R functions. The point of the chapter is to think about how to create programs by combining functions.

  • 9.1 - Strategy
    • How to solve complicated problems by breaking them up into sub-tasks
    • sequential steps: some things need to be done in order, with the output of one step serving as the input to the next
    • parallel cases: in some settings our program needs to work in a different way depending on the kind of input that is supplied
    • Make a flowchart of the slot machine example to clarify which steps are sequential and which cases are parallel
  • 9.2 - if Statements
    • Basic syntax of if in R
    • Condition of if must evaluate to a single TRUE or FALSE. Before R 4.2.0, R threw a warning if you supplied a longer logical vector and used only its first element; from R 4.2.0 this is an error.
    • Some quiz questions to check basic understanding of if
  • 9.3 - else Statements
    • Basic syntax of if ... else in R.
    • trunc() function to extract the decimal part: x - trunc(x)
    • if ... else with trunc to round a numeric value
    • Syntax of else if
    • R starts at the top of an if block and continues until it finds a logical condition that evaluates to true. It then runs the corresponding code block and skips any remaining else if or else blocks. When none of the if or else if blocks has a true condition, R evaluates the code block associated with else, if one is present.
    • Use if ... else if ... else for the parallel cases in the slot machine example
    • unique() function to create a subvector with no repeat elements
    • && and || work like & and | except that they are lazy: they skip the second test when the first already decides the answer, and they always return a single logical TRUE or FALSE rather than a vector. This makes them handy for use in if ... else if ... else statements. (Before R 4.3.0 they silently used only the first element of a longer vector; from R 4.3.0 a longer vector is an error.)
    • Replacing multiple | operators with a single %in% to yield code that is more efficient and easier to read and maintain
  • 9.4 - Lookup Tables
    • How to avoid an if ... else if ... else statement with lots and lots of else if blocks?
    • Pro tip: subsetting often provides the simplest way to achieve a complicated task in R.
    • Create a named vector called payouts with names that correspond to the symbols that appear on the slot machine and values that give the associated payoffs, e.g. payouts <- c('DD' = 100, '7' = 80, 'BBB' = 40, ... )
    • Now it’s easy to find the prize for a given symbol: payouts['DD']
    • This is a lookup table. We create one in R by subsetting a named object in a clever way.
    • By feeding a variable into payouts[...] we can look up a payout that arises at runtime based on the randomly generated symbols on the slot machine.
    • Using R’s coercion rules to count occurrences of a particular value, e.g. sum(symbols == 'DD')
    • When to use if trees versus lookup tables? Trees are for running different code in each branch; lookup tables are for assigning different values in each branch.
    • How to convert from an if tree to a lookup table? Find the values that are being assigned and store them as a vector. Then find the conditions used to choose between them. If the conditions are based on character strings, use name-based subsetting; otherwise use integer-based subsetting.
    • Double the prize for every diamond present: nice vectorized solution prize * (2^diamonds)
  • 9.5 - Code comments
    • When to use comments
    • How to write functions: start with the “meat” inside and once it’s working, encapsulate into a function. You can either do this by hand or with the “Extract Function” option in RStudio (menu bar under “Code”)
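
The pieces of this chapter can be combined into a simplified scoring function. This is a sketch rather than the book's full score(): it leaves out cherries and wild diamonds, and the payout values beyond the three quoted above, along with the prize of 5 for mixed bars, are assumptions for illustration.

```r
# Lookup table: names are symbols, values are prizes for three of a kind.
payouts <- c("DD" = 100, "7" = 80, "BBB" = 40,
             "BB" = 25, "B" = 10, "C" = 10, "0" = 0)

score <- function(symbols) {
  same <- length(unique(symbols)) == 1           # three of a kind?
  bars <- all(symbols %in% c("B", "BB", "BBB"))  # any mix of bars?

  # Parallel cases: exactly one branch runs.
  if (same) {
    prize <- unname(payouts[symbols[1]])         # name-based lookup
  } else if (bars) {
    prize <- 5
  } else {
    prize <- 0
  }

  diamonds <- sum(symbols == "DD")               # TRUE is coerced to 1
  prize * 2^diamonds                             # double once per diamond
}

score(c("7", "7", "7"))       # 80
score(c("B", "BB", "BBB"))    # 5
score(c("DD", "DD", "DD"))    # 100 * 2^3 = 800
score(c("DD", "7", "C"))      # 0
```

The if tree decides which case applies, while the lookup table replaces what would otherwise be one else if branch per symbol.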

10 - S3

This is less important. After you have a solid foundation in R, you may find it helpful to come back and read this.

11 - Loops

  • 11.1 - Expected values
    • Explains how this is defined for a discrete RV
    • Illustrates using the weighted die example from earlier
  • 11.2 expand.grid
    • Use expand.grid() to create a dataframe whose rows contain the Cartesian product of two vectors. In other words: every element of the first vector is paired with every element of the second.
    • Example of expand.grid() to take the Cartesian product of a vector with itself: rolls <- expand.grid(die, die) to get all pairs when rolling two dice.
    • Elementwise operations to convert rolls into the totals when rolling two dice rolls$value <- rolls$Var1 + rolls$Var2
    • Steps to calculate expected value of sum when rolling two dice: Var1 in rolls is the first die and Var2 is the second. The sum is rolls$value
      1. Create a lookup table for probabilities with names that match the values in Var1 and use subsetting to get the desired probs: probs1
      2. Repeat for Var2 to get probs2
      3. Joint probabilities: rolls$prob <- probs1 * probs2
      4. sum(rolls$value * rolls$prob)
    • Apply the preceding steps to the slot machine example.
    • CAREFUL: set stringsAsFactors = FALSE with expand.grid() when using it with character data to avoid the result being stored as a factor.
    • Example: use expand.grid() to create all combinations of three elements: expand.grid(wheel, wheel, wheel, stringsAsFactors = FALSE)
  • 11.3 - for Loops
    • “Do this for every value of that” for (value in that) { this }
    • Creates an object called value and assigns it a new value for each run of the loop
    • for can loop over elements of any atomic vector and we can call the index variable whatever we like, e.g. for (value in c('my', 'second', 'for', 'loop')) { ... }
    • Careful about scope: R runs the loop in the environment that called it. Objects will be overwritten if you’re not careful: value is actually an object that R creates, so you could accidentally overwrite an existing object.
    • R is weird: its for loops iterate over elements of sets. In most other programming languages, for loops iterate over sequences of integers. In R we could choose to create a set of integers and loop over this, but we could also loop over some other kind of set.
    • What Happens in Vegas Stays in Vegas and the same is true of a for loop. To use the results of the calculations in a for loop, we need to store them somewhere.
    • Create an empty vector of a certain length: chars <- vector(length = 4) then fill it up within a for loop by indexing the elements of some other vector.
    • Create an empty column in a dataframe using R’s recycling rules: combos$prize <- NA and then loop over the rows with nrow() to fill in the missing values.
  • 11.4 - while Loops
    • Two other kinds of loops in R: while and repeat
    • while keeps looping as long as the condition is true; this is useful when you don’t know in advance how many times you need to do something, e.g. keep playing the slot machine until you run out of money.
  • 11.5 - repeat Loops
    • Repeat a chunk of code until you hit escape or a break statement is reached.
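
The expected-value steps from 11.2 and the loop patterns from 11.3 and 11.4 can be sketched as follows. The face probabilities (1/8 for faces 1 to 5, 3/8 for a six) and the toy game inside plays_till_broke() are assumptions for illustration; the book's while loop plays its slot machine instead.

```r
die <- 1:6
rolls <- expand.grid(die, die)            # 36 rows with columns Var1 and Var2
rolls$value <- rolls$Var1 + rolls$Var2    # elementwise sum of the two dice

# Lookup table of face probabilities, subset by name.
prob <- c("1" = 1/8, "2" = 1/8, "3" = 1/8, "4" = 1/8, "5" = 1/8, "6" = 3/8)
probs1 <- prob[as.character(rolls$Var1)]
probs2 <- prob[as.character(rolls$Var2)]
rolls$prob <- unname(probs1 * probs2)     # joint probability of each pair
expected <- sum(rolls$value * rolls$prob) # 8.25 for these weights

# for loop: pre-allocate storage, then fill it by index.
totals <- vector(mode = "numeric", length = nrow(rolls))
for (i in seq_len(nrow(rolls))) {
  totals[i] <- rolls$Var1[i] + rolls$Var2[i]
}

# while loop: keep playing until the money runs out.
plays_till_broke <- function(start_with) {
  cash <- start_with
  n <- 0
  while (cash > 0) {
    # each play costs 1 and pays 2 with probability 0.4 (toy game)
    cash <- cash - 1 + sample(c(0, 2), size = 1, prob = c(0.6, 0.4))
    n <- n + 1
  }
  n
}

set.seed(1)  # arbitrary seed
plays_till_broke(10)
```

The for loop is shown only to illustrate the pattern: the vectorized line that creates rolls$value computes the same totals.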

12 - Speed

  • 12.1 - Vectorized Code
    • Two examples of a function that returns the absolute value of all elements of a numeric vector: one using a for loop and another using logical subsetting
    • The for loop version is much slower.
    • system.time(...) to see how long it takes to run a command; don’t confuse with Sys.time() which instead returns the current system time!
    • Compare the timings of the vectorized absolute value function to R’s built-in function abs(). Guess who wins? The internal function!
  • 12.2 - How to Write Vectorized Code
    • A general pattern for creating vectorized code:
      1. Replace sequential steps with existing vectorized functions.
      2. Use logical subsetting for parallel cases; handle everything at once.
    • Example: vectorize a function that renames the slot machine symbols by combining a for loop with a giant if tree. Time the two versions of the code.
    • Best solution to the preceding: use a lookup table and unname() to clean up the result: unname(tb[vec]) where vec is the input vector whose names we want to change and tb is the lookup table. The result is 40 times faster!
    • R is not a compiled language. This is why for loops suffer a speed penalty relative to subsetting approaches: the subsetting operations effectively get “handed over” to extremely fast compiled C code “under the hood.”
  • 12.3 - How to Write Fast for Loops in R
    • Sometimes you can’t avoid a for loop, e.g. when the calculation in step k+1 of a problem depends on the value computed in step k.
    • How can we make our for loops faster? Two principles:
      1. Do as much as you can outside the loop: anything inside the loop is run many many times.
      2. Pre-allocate storage for any results you want to store. If you know you’re trying to calculate 1000 values, start by creating an empty 1000-vector outside the loop. Increasing the size of an object “on the fly” incurs a huge performance penalty because it requires R to make copies of an object in memory so it can erase and replace.
  • 12.4 - Vectorized Code in Practice
    • Apply some of the preceding ideas to the slot machine example.
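
A sketch of the timing comparison from 12.1 and the lookup-table renaming from 12.2. The function names abs_loop and abs_sets and the replacement names in tb are illustrative; timings vary by machine, so no specific numbers are shown.

```r
# Absolute value with a for loop: one element at a time.
abs_loop <- function(vec) {
  for (i in seq_along(vec)) {
    if (vec[i] < 0) {
      vec[i] <- -vec[i]
    }
  }
  vec
}

# Absolute value with logical subsetting: all negatives at once.
abs_sets <- function(vec) {
  negs <- vec < 0
  vec[negs] <- vec[negs] * -1
  vec
}

long <- rep(c(-1, 1), 500000)
system.time(abs_loop(long))
system.time(abs_sets(long))
system.time(abs(long))       # built-in version

# Renaming symbols with a lookup table instead of a loop and an if tree.
tb <- c("DD" = "joker", "C" = "ace", "7" = "king",
        "B" = "queen", "BB" = "jack", "BBB" = "ten", "0" = "nine")
vec <- c("DD", "C", "7", "B", "BB", "BBB", "0")
unname(tb[vec])
```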

A - Installing R and RStudio

Not crucial: we’ll use RStudio Cloud, and the TAs can help you with installation on your local machine if needed during their office hours.

B - R Packages

  • “base R” refers to the functions that are always available to you whenever you start R. You can use these “out of the box” as it were.
  • Other packages need to be installed and then loaded. You only need to install once; you need to load whenever you restart R.
  • install.packages('<package name>') to install a single package at the R console
  • install.packages(c('ggplot2', 'reshape2', 'dplyr')) to install multiple packages at once
  • You’ll be prompted to choose a mirror the first time you install packages. When in doubt, choose Austria: new things appear here first since it’s the main CRAN repo.
  • install_github() etc. to install development versions of packages or packages that aren’t (yet) on CRAN.
  • library(<package name>) to load a package; you need to do this before using it and repeat whenever you restart your R session.
  • library() to see the packages you currently have installed
  • Learn about new packages using task views: http://cran.r-project.org/web/views

C - Updating R and its Packages

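  • update.packages(oldPkgs = c('ggplot2', 'reshape2', 'dplyr')) to check whether you have the most recent versions of these packages and, if not, install them. (The names go in oldPkgs because the first positional argument of update.packages() is lib.loc, a library location.)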
  • Start a new R session after updating; close R session and open new one if you already have the old version loaded.

D - Loading and Saving Data in R

Less important since we’ll talk about this in Core ERM.

  • D.1 - Data Sets in Base R
  • D.2 - Working Directory
  • D.3 - Plain-text Files
  • D.4 - R Files
  • D.5 - Excel Spreadsheets
  • D.6 - Loading Files from Other Programs

E - Debugging R Code

Worth reading, but not essential for Core ERM. You can come back to it later.

  • E.1 - traceback
  • E.2 - browser
  • E.3 - Break Points
  • E.4 - debug
  • E.5 - trace
  • E.6 - recover