5 How much R do I need to know to pass?

Exam PA is largely a communication exam: 30-40% of the points are based on the quality of your writing and data storytelling.

You do not need to become an R expert for this exam. While you are expected to develop a basic familiarity with R, the .Rmd template will provide you with the vast majority of the R commands needed. You will need to practice taking code templates and adjusting them to the specific variables and formulas you need. The most difficult coding questions will be during data exploration. These will ask you to

Use basic mathematical operators and functions such as exp() and log()
Select, modify, and summarize data in a dataframe
Display data from a dataframe in common types of plots, using ggplot2
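As a rough illustration of the three skills listed above, here is a minimal sketch using the built-in mtcars data set. The data set and the new variable names (heavy, wt_kg, by_cyl) are assumptions chosen for illustration only and do not come from any exam template.

```r
# Built-in example data; mtcars ships with R, so there is nothing to download.
df <- mtcars

# 1. Basic mathematical operators and functions: exp() undoes log().
log_mpg <- log(df$mpg)
back <- exp(log_mpg)
all.equal(back, df$mpg)

# 2. Select, modify, and summarize data in a dataframe.
heavy <- df[df$wt > 3, c("mpg", "wt", "cyl")]         # select rows and columns
df$wt_kg <- df$wt * 1000 * 0.4536                     # wt is recorded in 1000s of lbs
by_cyl <- aggregate(mpg ~ cyl, data = df, FUN = mean) # mean mpg per cylinder count
by_cyl
summary(df$mpg)

# 3. Display data in a common plot type with ggplot2.
# The guard only skips the plot on machines where ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = wt, y = mpg)) +
    geom_point() +
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
  print(p)
}
```

Each step is a single-line command or a short chain of them, which is about the level of complexity described here.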

When fitting predictive models, you will also need to

Modify or add a formula or other parameters to model-fitting functions like glm() and rpart()
Extract and display results from a fitted predictive model
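The two modeling skills above might look something like the following sketch. It again uses the built-in mtcars data, and the particular formulas, family, and control settings are illustrative assumptions rather than anything taken from an exam.

```r
library(rpart)  # rpart is one of the recommended packages that ships with R

# Starting point, as a template might provide it: one predictor.
fit_glm <- glm(mpg ~ wt, data = mtcars, family = gaussian())

# Adjusted version: add a predictor to the formula and change the family and link.
fit_glm2 <- glm(mpg ~ wt + hp, data = mtcars, family = Gamma(link = "log"))

# Extract and display results from the fitted GLM.
coef(fit_glm2)                                      # fitted coefficients
summary(fit_glm2)                                   # standard errors, deviance, AIC
glm_pred <- predict(fit_glm2, type = "response")    # predictions on the mpg scale
head(glm_pred)

# The same idea for a regression tree: change the formula or the control parameters.
fit_tree <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova",
                  control = rpart.control(minbucket = 5, cp = 0.01))
fit_tree            # prints the splits
printcp(fit_tree)   # complexity parameter table
```

In both cases the edit is confined to the arguments of an existing function call; nothing new has to be written from scratch.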

You are not expected to construct loops, write functions, or use other programmatic techniques with R. The scope is limited to single-line commands.

You will have two cheat sheets, one for data visualization and one for base R. Use them while you study so that you can find the R code you need quickly. The cheat sheets that the SOA gives you were designed by the RStudio team for everyone who uses R, so we have gone through and removed the parts that you will not need to learn. For example, no one will be making new types of ggplot graphs under the “graphical primitives” section, which has been blocked out. Enroll in either of our online courses to get our simplified Base R cheat sheet and tutorial.

5.1 How to use the PA R cheat sheets?

Be observant, and expect to spend twice as long explaining what your code is doing as you spend writing the code itself. A few points are based on the organization of your .Rmd file, although it does not need to read like an essay. The vast majority of your time will be spent on your Word document.

The June 16, 2020 Project Statement has this under “General information for candidates”:

Each task will be graded on the quality of your thought process, added or modified code, and conclusions

At a minimum, you must submit your completed report template and an .Rmd file that supports your work. Graders expect that your .Rmd code can be run from beginning to end. The code snippets provided should either be commented out or adapted for execution. Ensure that it is clear where in the code each of the tasks is addressed.

In other words, the results of your report must be consistent with what the grading team finds when they run your .Rmd file.

5.2 Example: SOA PA 6/16/20, Task 8

This question is from the June 16, 2020 exam. You can see that only minor code changes need to be made. The remainder of this question consists of a short-answer response. This is very typical of Exam PA.

(4 points) Perform feature selection with lasso regression.

Run a lasso regression using the code chunk provided. The code will need to be modified to reflect your decision in Task 7 regarding the PCA variable.

You probably read this and asked “what is a lasso regression?” and with good reason: we haven’t yet covered this topic. All that you need to know is that you will have to change the code that they give you, which is below.

You need to choose one of two data sets:

DATA SET A
DATA SET B

Then ignore everything else!

# Format data as matrices (necessary for glmnet). 
# Uncomment two items that reflect your decision from Task 7.

#DATA SET A
lasso.mat.train <- model.matrix(days ~ . - PC1, data.train)
lasso.mat.test <- model.matrix(days ~ . - PC1, data.test)

#DATA SET B
# lasso.mat.train <- model.matrix(days ~ . - num_procs - num_meds - num_ip - num_diags, data.train)
# lasso.mat.test <- model.matrix(days ~ . - num_procs - num_meds - num_ip - num_diags, data.test)

set.seed(789)

lasso.cv <- cv.glmnet(
  x = lasso.mat.train,
  y = data.train$days,
  family = "poisson", # Do not change.
  alpha = 1 # alpha = 1 for lasso
)

If you wanted to use data set B instead, you would comment out the two data set A lines and uncomment the two data set B lines:

#DATA SET A
# lasso.mat.train <- model.matrix(days ~ . - PC1, data.train)
# lasso.mat.test <- model.matrix(days ~ . - PC1, data.test)

#DATA SET B
lasso.mat.train <- model.matrix(days ~ . - num_procs - num_meds - num_ip - num_diags, data.train)
lasso.mat.test <- model.matrix(days ~ . - num_procs - num_meds - num_ip - num_diags, data.test)
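If the formula syntax in those lines looks cryptic, the following sketch shows what the dot and the minus signs do. The toy data frame is hypothetical (its values are made up and it has fewer columns than the exam data); only model.matrix() and the formula notation are the same as in the template.

```r
# Toy data: a hypothetical stand-in for data.train with made-up values.
toy <- data.frame(
  days      = c(1, 3, 2, 5),
  num_procs = c(0, 2, 1, 4),
  num_meds  = c(5, 9, 7, 12),
  PC1       = c(-1.2, 0.3, -0.4, 1.3)
)

# "." means "every other column"; "- PC1" removes PC1 from the predictors.
mat_a <- model.matrix(days ~ . - PC1, toy)
colnames(mat_a)
# "(Intercept)" "num_procs" "num_meds"

# Removing the original count variables instead keeps PC1.
mat_b <- model.matrix(days ~ . - num_procs - num_meds, toy)
colnames(mat_b)
# "(Intercept)" "PC1"
```

So the choice between data set A and data set B is just a choice of which columns end up in the matrix that is passed to cv.glmnet().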

5.3 Example 2 - Data exploration

That last example was easy. A harder question might ask you to do something like the following:

Template code:

# This code takes a continuous variable and creates a binned factor variable. 
# The code applies it directly to the capital gain variable as an 
# example. right = FALSE means that the left number is included and 
# the right number excluded. So, in this case, the first bin runs from 0 to 
# 1000 and includes 0 and excludes 1000. Note that the code creates a new 
# variable, so the original variable is retained.
df$cap_gain_cut <- cut(df$cap_gain, breaks = c(0, 1000, 5000, Inf), right = FALSE, labels = c("lowcg", "mediumcg", "highcg"))

To answer this question correctly, you would need to

Understand that the code is taking the capital gains recorded on investments, cap_gain, and then creating bins so that the new variable is “lowcg” for values from 0 up to but not including 1000, “mediumcg” from 1000 up to but not including 5000, and “highcg” for all values of 5000 or more.
Then you would need to interpret a statistical model
Finally, use this result to change these cutoff values so that “lowcg” is all values less than 5095.5, “mediumcg” is all values from 5095.5 to 7055.5, and so forth. You would need to do this for two data sets, data.train and data.test.

Solution code:

# This code cuts a continuous variable into buckets. 
# The process is applied to both the training and test sets. 

data.train$cap_gain_cut <- cut(data.train$cap_gain, breaks = c(0, 5095.5, 7055.5, Inf), right = FALSE, labels = c("lowcg", "mediumcg", "highcg"))

data.test$cap_gain_cut <- cut(data.test$cap_gain, breaks = c(0, 5095.5, 7055.5, Inf), right = FALSE, labels = c("lowcg", "mediumcg", "highcg"))

Do not panic if all of this code is confusing. Just focus on reading the comments. As you can see, this is less of a programming question than it is a “logic and reasoning” question.
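One way to convince yourself that the new cutoffs behave as intended is to run cut() on a handful of values around the boundaries. This is only a sketch: the test values and the names x, bin_labels, and bins are made up for illustration, while the breaks and labels are the ones from the solution code.

```r
# Hypothetical capital gain values chosen to sit on and around the cutoffs.
x <- c(0, 5095, 5095.5, 7055, 7055.5, 99999)
bin_labels <- c("lowcg", "mediumcg", "highcg")

# Same breaks and options as the solution code.
bins <- cut(x, breaks = c(0, 5095.5, 7055.5, Inf), right = FALSE, labels = bin_labels)

# Show each value next to the bin it landed in.
data.frame(cap_gain = x, cap_gain_cut = bins)

# Count how many values fall in each bin.
table(bins)
```

Because right = FALSE, a value exactly equal to a cutoff (5095.5 or 7055.5) goes into the higher bin. A value below the first break, such as a negative number, would come back as NA.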