7 Data exploration


Exploration is about making discoveries. You need a curious mind and strong technical skills to make sense of data. This chapter gives you the tools to take dirty, unprocessed data, clean it up, discover interesting trends using graphs and summary statistics, and communicate your results to a business audience.

About 10 - 15% of your grade will be based on data exploration. Every sitting of Exam PA so far, including December 2018, June 2019, and December 2019, has had questions about data exploration. Extra practice in this area is one of the most reliable ways to improve your score because it frees up time that you can use elsewhere.

The syllabus divides this topic into three learning objectives: Data Visualization, Data Types and Exploration, and Data Issues and Resolutions. This can make the topic seem intimidating, but you will always be doing these things together, so this chapter covers all three at once.

Exploratory Data Analysis (EDA) is usually the first phase of an analytics project. It is as much an art as a science because everyone has their own style; if you ask two actuaries to perform EDA on the same data set, they will likely use different methods. Even so, the work can be broken down into phases.

Figure 7.1: Data Exploration Phases - From R for Data Science

In the language of the December 2020 syllabus, these phases are:

Import:

  • Evaluate the quality of appropriate data sources for a problem.
  • Identify the regulations, standards, and ethics surrounding predictive modeling and data

Tidy:

  • Identify structured, unstructured, and semi-structured data.
  • Identify the types of variables and terminology used in predictive modeling.
  • Understand basic methods of handling missing data.
  • Implement effective data design with respect to time frame, sampling, and granularity.

Transform:

  • Identify opportunities to create features from the primary data that may add value.
  • Identify outliers and other data issues.
  • Handle non-linear relationships via transformations.

Visualize:

  • Apply univariate and bivariate data exploration techniques.
  • Understand the fundamental principles of constructing graphs.
  • Create a variety of graphs using the ggplot2 package.

Model:

  • Fit and interpret models

Communicate:

  • Write the report

7.1 How to make graphs in R?

Let us create a histogram of the risk scores in the readmission data. The first step is to create a blank canvas that holds the columns that are needed. The library that does this is called ggplot2.

The aesthetic argument, aes, means that the variable shown will be the risk score, HCC.Riskscore.

The sample_frac means that only 5% of the data is used. This makes the visualization less cluttered.

library(ExamPAData)
library(tidyverse)

# Take a 5% sample and set up a blank canvas with the column to plot
df <- readmission %>% sample_frac(0.05)
p <- df %>% 
  ggplot(aes(HCC.Riskscore))

If we look at p, we see that it is nothing but white space with an axis for HCC.Riskscore.

p

7.1.1 Add a plot

We add a histogram

p + geom_histogram()

Different plots are called “geoms” for “geometric objects.” Geometry = geo (space) + meter (measure), and graphs measure data. For instance, instead of creating a histogram, we can draw a smoothed density curve with stat_density.

p + stat_density()

Create an xy plot by adding an x and a y argument to the aesthetic.

df %>% 
  ggplot(aes(x = HCC.Riskscore, y = Age)) + 
  geom_point()

7.2 The different graph types

Only four types of graphs are commonly used for data exploration, and you only need to understand how to interpret them. The SOA will provide you with the code needed to create them; there will not be enough time for you to build your own graphs from scratch.

7.2.1 Histogram

The histogram is used when you want to look at the probability distribution of a continuous variable.

The template code in your .Rmd file will look like this. Just change “variable” to the name of your variable (without quotes) and then copy and paste.
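As a sketch of what that looks like in practice, here is a histogram of age in the health_insurance data from the ExamPAData package (the data set explored later in this chapter); on the exam, the template will use a placeholder name such as variable.

```r
# A sketch of the histogram template, applied to the health_insurance
# data from the ExamPAData package
library(ExamPAData)
library(ggplot2)

ggplot(health_insurance, aes(x = age)) +
  geom_histogram(bins = 30) +
  labs(x = "age")
```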

7.2.2 Box plot

The boxplot compactly displays the distribution of a continuous variable. It visualizes five summary statistics (the median, two hinges, and two whiskers) and all “outlying” observations individually.
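As a minimal sketch using R's built-in iris data, here is a box plot of a continuous variable split by a factor:

```r
library(ggplot2)

# Box plot: the distribution of Sepal.Length at each level of Species.
# The box shows the median and hinges; points beyond the whiskers are
# drawn individually as potential outliers.
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()
```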

7.2.3 Scatterplot

The point geom is used to create scatterplots. The scatterplot is most useful for displaying the relationship between two continuous variables. It can also be used with one continuous and one categorical variable, although a box plot is usually a better choice in that case.

7.2.4 Bar charts

There are two types of bar charts: geom_bar() (univariate) and geom_col() (bivariate). geom_bar() makes the height of the bar proportional to the number of cases in each group (or, if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position.
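A minimal sketch of the difference, using a small made-up data frame:

```r
library(ggplot2)

# geom_bar() counts rows: three rows for "north" give a bar of height 3
raw <- data.frame(region = c("north", "north", "north", "south"))
ggplot(raw, aes(x = region)) +
  geom_bar()

# geom_col() uses a value already in the data for the bar height
summarized <- data.frame(region = c("north", "south"), n = c(3, 1))
ggplot(summarized, aes(x = region, y = n)) +
  geom_col()
```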

7.3 How to save time with dplyr?

You may have noticed that writing code for data manipulation can be slow. Fortunately, there is a faster, 100%-legal way of doing data manipulation that has worked for hundreds of exam candidates (the author included) who have taken Exam PA.

Up to this point, we have been using old R libraries. By making use of newer R libraries, we can save ourselves time. These will all be provided for you at Prometric within the tidyverse library.

7.3.1 Data manipulation chaining

Pipes allow for data manipulations to be chained with visualizations. The possibilities are nearly limitless.

library(tidyverse)
iris %>%
  select_if(is.numeric) %>%
  gather(feature,value) %>%
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(vars(feature))

Suggested reading of R for Data Science (https://r4ds.had.co.nz/index.html):

  • Chapter 9: Introduction
  • Chapter 10: Tibbles
  • Chapter 12: Tidy data
  • Chapter 15: Factors
  • Chapter 17: Introduction
  • Chapter 18: Pipes
  • Chapter 19: Functions
  • Chapter 20: Vectors

7.4 How to explore the data?

Let us look at the health insurance data set that contains information on patients along with their annual health care costs.

The descriptions of the columns are below.

  • age: Age of the individual
  • sex: Sex
  • bmi: Body Mass Index
  • children: Number of children
  • smoker: Is this person a smoker?
  • region: Region
  • charges: Annual health care costs.

head() shows the top n rows. head(20) shows the top 20 rows.

head(health_insurance)
## # A tibble: 6 × 7
##     age sex      bmi children smoker region    charges
##   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
## 6    31 female  25.7        0 no     southeast   3757.

Using a pipe is an alternative way of doing this.

health_insurance %>% head()


Shortcut: Use CTRL + SHIFT + M to create pipes %>%


The glimpse function is similar to str().

health_insurance %>% glimpse()
## Rows: 1,338
## Columns: 7
## $ age      <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 19, 52, 23, 56, 30, 60, 3…
## $ sex      <chr> "female", "male", "male", "male", "male", "female", "female", "female", "male", "fema…
## $ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.740, 29.830, 25.840, 26.22…
## $ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 2, 3, 0, 2, 1…
## $ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ region   <chr> "southwest", "southeast", "southeast", "northwest", "northwest", "southeast", "southe…
## $ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622, 8240.590, 7281.506, 640…

One of the most useful data science tools is counting things. The function count() gives the number of observations by a categorical feature.

health_insurance %>% dplyr::count(children)
## # A tibble: 6 × 2
##   children     n
##      <dbl> <int>
## 1        0   574
## 2        1   324
## 3        2   240
## 4        3   157
## 5        4    25
## 6        5    18

Two categories can be counted at once. This creates a table with all combinations of region and sex and shows the number of observations in each category.

health_insurance %>% count(region, sex)
## # A tibble: 8 × 3
##   region    sex        n
##   <chr>     <chr>  <int>
## 1 northeast female   161
## 2 northeast male     163
## 3 northwest female   164
## 4 northwest male     161
## 5 southeast female   175
## 6 southeast male     189
## 7 southwest female   162
## 8 southwest male     163

The summary() function shows a statistical summary. One caveat is that each column needs to have the appropriate type. For example, smoker, region, and sex are all read in as characters; if they were factors, summary() would give you counts by level.

With incorrect data types

health_insurance %>% summary()
##       age            sex                 bmi           children        smoker         
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000   Length:1338       
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000   Class :character  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000   Mode  :character  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095                     
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000                     
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000                     
##     region             charges     
##  Length:1338        Min.   : 1122  
##  Class :character   1st Qu.: 4740  
##  Mode  :character   Median : 9382  
##                     Mean   :13270  
##                     3rd Qu.:16640  
##                     Max.   :63770

With correct data types

This tells you that there are 324 patients in the northeast, 325 in the northwest, 364 in the southeast, etc.

health_insurance <- health_insurance %>% 
  mutate_if(is.character, as.factor)

health_insurance %>% 
  summary()
##       age            sex           bmi           children     smoker           region   
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064   northeast:324  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274   northwest:325  
##  Median :39.00                Median :30.40   Median :1.000              southeast:364  
##  Mean   :39.21                Mean   :30.66   Mean   :1.095              southwest:325  
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000                             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000                             
##     charges     
##  Min.   : 1122  
##  1st Qu.: 4740  
##  Median : 9382  
##  Mean   :13270  
##  3rd Qu.:16640  
##  Max.   :63770


Shortcut: This “mutate_if” trick saves you from having to copy and paste the SOA’s template code and convert each variable to a factor individually.


7.5 How to transform the data?

Transforming, manipulating, querying, and wrangling are synonyms in data terminology.

The dplyr syntax is designed to be similar to SQL. SQL queries begin with a SELECT, use GROUP BY to aggregate, and have a WHERE to remove observations. Unlike SQL, the ordering of the dplyr verbs does not matter: select() can come after filter().

R to SQL translation

select() -> SELECT
mutate() -> user-defined columns
summarize() -> aggregated columns
left_join() -> LEFT JOIN
filter() -> WHERE
group_by() -> GROUP BY
filter() -> HAVING
arrange() -> ORDER BY

health_insurance %>% 
  select(age, region) %>% 
  head()
## # A tibble: 6 × 2
##     age region   
##   <dbl> <fct>    
## 1    19 southwest
## 2    18 southeast
## 3    28 southeast
## 4    33 northwest
## 5    32 northwest
## 6    31 southeast

Let us look at only those in the southeast region. Instead of WHERE, use filter.

health_insurance %>% 
  filter(region == "southeast") %>% 
  select(age, region) %>% 
  head()
## # A tibble: 6 × 2
##     age region   
##   <dbl> <fct>    
## 1    18 southeast
## 2    28 southeast
## 3    31 southeast
## 4    46 southeast
## 5    62 southeast
## 6    56 southeast

The SQL translation is

SELECT age, region
FROM health_insurance
WHERE region = 'southeast'

Instead of ORDER BY, use arrange. Unlike in SQL, where ORDER BY typically comes at the end of a query, arrange can appear anywhere in the chain.

health_insurance %>% 
  arrange(age) %>% 
  select(age, region) %>% 
  head()
## # A tibble: 6 × 2
##     age region   
##   <dbl> <fct>    
## 1    18 southeast
## 2    18 southeast
## 3    18 northeast
## 4    18 northeast
## 5    18 northeast
## 6    18 southeast

In dplyr, the group_by comes before the aggregation, unlike in SQL, where the GROUP BY clause comes after the aggregated SELECT columns.

health_insurance %>% 
  group_by(region) %>% 
  summarise(avg_age = mean(age))
## # A tibble: 4 × 2
##   region    avg_age
##   <fct>       <dbl>
## 1 northeast    39.3
## 2 northwest    39.2
## 3 southeast    38.9
## 4 southwest    39.5

In SQL, this would be

SELECT region, 
       AVG(age) as avg_age
FROM health_insurance
GROUP BY region

Just like in SQL, many different aggregate functions can be used, such as sum, mean, min, max, and so forth.

health_insurance %>% 
  group_by(region) %>% 
  summarise(avg_age = mean(age),
            max_age = max(age),
            median_charges = median(charges),
            bmi_std_dev = sd(bmi))
## # A tibble: 4 × 5
##   region    avg_age max_age median_charges bmi_std_dev
##   <fct>       <dbl>   <dbl>          <dbl>       <dbl>
## 1 northeast    39.3      64         10058.        5.94
## 2 northwest    39.2      64          8966.        5.14
## 3 southeast    38.9      64          9294.        6.48
## 4 southwest    39.5      64          8799.        5.69

To create new columns, the mutate function is used. For example, if we wanted a column of the annual charges of a person divided by their age

health_insurance %>% 
  mutate(charges_over_age = charges/age) %>% 
  select(age, charges, charges_over_age) %>% 
  head(5)
## # A tibble: 5 × 3
##     age charges charges_over_age
##   <dbl>   <dbl>            <dbl>
## 1    19  16885.            889. 
## 2    18   1726.             95.9
## 3    28   4449.            159. 
## 4    33  21984.            666. 
## 5    32   3867.            121.

We can create as many new columns as we want.

health_insurance %>% 
  mutate(age_squared  = age^2,
         age_cubed = age^3,
         age_fourth = age^4) %>% 
  head(5)
## # A tibble: 5 × 10
##     age sex      bmi children smoker region    charges age_squared age_cubed age_fourth
##   <dbl> <fct>  <dbl>    <dbl> <fct>  <fct>       <dbl>       <dbl>     <dbl>      <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.         361      6859     130321
## 2    18 male    33.8        1 no     southeast   1726.         324      5832     104976
## 3    28 male    33          3 no     southeast   4449.         784     21952     614656
## 4    33 male    22.7        0 no     northwest  21984.        1089     35937    1185921
## 5    32 male    28.9        0 no     northwest   3867.        1024     32768    1048576

The case_when function is quite similar to SQL’s CASE WHEN. For example, we can create a column which is 0 when age < 50, 1 when 50 <= age <= 70, and 2 when age > 70.

health_insurance %>% 
  mutate(age_bucket = case_when(age < 50 ~ 0,
                                age <= 70 ~ 1,
                                age > 70 ~ 2)) %>% 
  select(age, age_bucket)
## # A tibble: 1,338 × 2
##      age age_bucket
##    <dbl>      <dbl>
##  1    19          0
##  2    18          0
##  3    28          0
##  4    33          0
##  5    32          0
##  6    31          0
##  7    46          0
##  8    37          0
##  9    37          0
## 10    60          1
## # … with 1,328 more rows

SQL translation:

SELECT CASE WHEN age < 50 THEN 0
            WHEN age <= 70 THEN 1
            ELSE 2
       END AS age_bucket
FROM health_insurance

7.6 Missing values

The most recent PA exams have had questions related to missing values. You need to perform checks for them before moving on to the model-building phases. In real life, data almost always has missing observations.

In R, there are two ways that data can be missing. The first is the NA value. If you read in a csv file that has blank values, R will translate them into NAs.

df <- tibble(x = c(1,2,NA), y = c(NA, NA, 2))

You can check for NA values with summary or sum(is.na(df)).

summary(df)
##        x              y    
##  Min.   :1.00   Min.   :2  
##  1st Qu.:1.25   1st Qu.:2  
##  Median :1.50   Median :2  
##  Mean   :1.50   Mean   :2  
##  3rd Qu.:1.75   3rd Qu.:2  
##  Max.   :2.00   Max.   :2  
##  NA's   :1      NA's   :2
sum(is.na(df))
## [1] 3

The second type is for “Not A Number” (NaN) or infinity Inf values. Dividing by zero produces Inf and taking the square root of a negative number produces NaN. Notice that NaN counts as NA but Inf does not. R says that there are only two missing values in this table even though there are two values which are infinity.

df <- tibble(x = c(1/0, 4/0, 4, 1), y = c(sqrt(-1), NA, 2, 4))
## Warning in sqrt(-1): NaNs produced
sum(is.na(df))
## [1] 2
summary(df)
##        x              y      
##  Min.   :1.00   Min.   :2.0  
##  1st Qu.:3.25   1st Qu.:2.5  
##  Median : Inf   Median :3.0  
##  Mean   : Inf   Mean   :3.0  
##  3rd Qu.: Inf   3rd Qu.:3.5  
##  Max.   : Inf   Max.   :4.0  
##                 NA's   :2

7.6.1 Types of Missing Values

Read the Data Dictionary from the Project Statement and check for these three types of missingness.

  • Missing at random (MAR): There is no pattern between the missingness and the value of the missing variable.
  • Missing not at random (MNAR): The value of the missing variable is related to the reason it is missing. Example: a survey concerning illegal drug use, where respondents are reluctant to admit that they have broken the law.
  • Hidden missingness: A value is coded incorrectly, such as a numeric variable using “999” or “0” as a placeholder, or a factor level that does not match the Data Dictionary.

7.6.2 Missing Value Resolutions:

You can use any of these methods but make sure that you describe your reasoning in the report.

  • Remove observations: Use when there are only a few missing values or when they are MAR.
  • Create a new level “missing”: Use when there are many missing values or when they are MNAR.
  • Impute with the mean: Use for numeric variables that are MAR.
  • Remove the variable altogether: Use when most observations are missing (≥ 50% of observations) and there is no way to repair the data.
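A base-R sketch of the first three resolutions, using a small made-up data frame (the column names bmi and region are hypothetical):

```r
# Made-up data with missing values in a numeric and a factor column
df <- data.frame(
  bmi    = c(27.9, NA, 33.0, 22.7),
  region = factor(c("north", NA, "south", "south"))
)

# 1. Remove observations that have any missing value
df_removed <- df[complete.cases(df), ]

# 2. Create a new factor level "missing" for the NA values
df$region <- addNA(df$region)                       # NA becomes a level
levels(df$region)[is.na(levels(df$region))] <- "missing"

# 3. Impute missing numeric values with the mean of the observed values
df$bmi[is.na(df$bmi)] <- mean(df$bmi, na.rm = TRUE)
```

Removing a variable altogether is simply df$bmi <- NULL.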

While you are reviewing the data for missingness, you may also decide to remove some variables altogether. This could be because of racial or ethical concerns, limitations of future availability, instability of the data over time, or inexplicability.

7.7 Example: SOA PA 12/12/19, Task 1

This chapter is based on the first task of the December 2019 Exam PA.


TASK 1 (12 points)

Examine each variable and make appropriate adjustments.

Examine each predictor variable other than cap_gain both on its own and for value_flag. Make appropriate adjustments. Do not make any adjustments to the cap_gain variable at this time.

There should be no further variable adjustments unless specifically requested.

As a reminder, all data for this book can be accessed from the package ExamPAData. In the actual exam, you will read the file from the Prometric computer.

Because the data is already loaded, use the below code to access the data.

library(ExamPAData)

To save keystrokes, give the data a short name such as df for “data frame.”

df <- customer_value %>% mutate_if(is.character, as.factor)

7.7.1 Garbage in; garbage out 🗑

This is a common saying when working with predictive models. No matter how complex the model, the actual result will be wrong if the data is terrible. For this exam, master the art of data manipulation, and everything becomes easier!

Begin by looking at a summary.

summary(df)
##       age        education_num                 marital_status     occupation       cap_gain    
##  Min.   :17.00   Min.   : 1.00   Divorced             : 6633   Group 1 : 7237   Min.   :    0  
##  1st Qu.:28.00   1st Qu.: 9.00   Married-AF-spouse    :   37   Group 2 :10123   1st Qu.:    0  
##  Median :37.00   Median :10.00   Married-civ-spouse   :22379   Group 3 :13971   Median :    0  
##  Mean   :38.64   Mean   :10.08   Married-spouse-absent:  628   Group 4 : 2444   Mean   : 1079  
##  3rd Qu.:48.00   3rd Qu.:12.00   Never-married        :16117   Group 5 :12258   3rd Qu.:    0  
##  Max.   :90.00   Max.   :16.00   Separated            : 1530   Group NA: 2809   Max.   :99999  
##                                  Widowed              : 1518                                   
##  hours_per_week      score       value_flag  
##  Min.   : 1.00   Min.   :43.94   High:11687  
##  1st Qu.:40.00   1st Qu.:57.50   Low :37155  
##  Median :40.00   Median :60.24               
##  Mean   :40.42   Mean   :60.23               
##  3rd Qu.:45.00   3rd Qu.:62.95               
##  Max.   :99.00   Max.   :76.53               
## 
str(df)
## spec_tbl_df [48,842 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age           : num [1:48842] 39 50 38 53 28 37 49 52 31 42 ...
##  $ education_num : num [1:48842] 13 13 9 7 13 14 5 9 14 13 ...
##  $ marital_status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation    : Factor w/ 6 levels "Group 1","Group 2",..: 2 5 1 1 5 5 1 5 5 5 ...
##  $ cap_gain      : num [1:48842] 2174 0 0 0 0 ...
##  $ hours_per_week: num [1:48842] 40 13 40 40 40 40 16 45 50 40 ...
##  $ score         : num [1:48842] 59 55.8 62.8 60.1 53.3 ...
##  $ value_flag    : Factor w/ 2 levels "High","Low": 2 2 2 2 2 2 2 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   education_num = col_double(),
##   ..   marital_status = col_character(),
##   ..   occupation = col_character(),
##   ..   cap_gain = col_double(),
##   ..   hours_per_week = col_double(),
##   ..   score = col_double(),
##   ..   value_flag = col_character()
##   .. )

7.7.2 Be a detective 🔎

Sherlock Holmes is famous for saying, “You see, Watson, but you do not observe!”

Just like detectives, actuaries need to collect data and make observations. Each exam has a few “anomalies” in the data which the SOA expects candidates to mention. These could be

  • Any value that does not match the Data Dictionary in the Project Statement
  • Variables that have NA values
  • Variables that have many factor levels
  • Incorrect data types
    • Factors that are read as characters or vice versa
    • Numeric variables that are factor/character
  • Extreme values (Numeric values are too high or low)

You will get very good at spotting these with practice. Just from looking at the above summary, we can observe the following:

Observations

  • The data consists of 48,842 obs. of 8 variables.
  • The lowest age is 17 but the Project Statement says to only include observations with age >= 25.
  • The cap_gain distribution is right-skewed because the median (0) is less than the mean (1079), but the Project Statement said not to transform this. Otherwise, I would apply a log transform.
  • education_num takes integer values between 1 and 16; some of the lower values have very few observations.
  • There are missing values when occupation is group NA, which means that the occupation of the person is unknown.
  • The amount that people work per week, hours_per_week, varies by a lot. The lowest is 1 hour, and the highest is 99. Most people work 40 hours per week.


Study Tip: Never apply a log transform to the target variable. Only the predictor variables get logs taken. At this stage, we note if it is right-skewed or not.


The SOA’s solution recommends leaving comments in your .Rmd file. This helps you get partial credit on questions that you may otherwise answer incorrectly.

Good comments

#I observe that there are no missing values other than those indicated by Group NA for occupation.

#I removed the code provided by my assistant. It is embedded in later chunks as needed.

#I excluded people under the age of 25
df <- df[df$age >= 25,]

#I convert the target to 0-1.
df$value_flag <- ifelse(df$value_flag == "High",1,0)

Useless comments

#this is a comment

#this loads a library called ggplot2
library(ggplot2)

7.7.3 A picture is worth a thousand words 📷

What is your favorite type of graph? Mine is a radar chart. This is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point.


Do not waste time trying to make the graphs perfect!

Only change the code that you need to change. The SOA helps you save time by giving you code templates.

This is the code template that they give you. You only need to change the “variable” names.

# This code makes a histogram for a continuous variable.
ggplot(df, aes(x = variable)) + 
  geom_histogram(bins = 30) +
  labs(x = "variable") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# This code makes a bar chart for a factor variable.
ggplot(df, aes(x = variable)) +
  geom_bar() +
  labs(x = "variable") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Okay, that was not 100% true. It is common for the code to require small changes. For instance, the histogram code that they give you has bins that are too narrow for age, which causes spikes to show up in the graph.

The best candidates altered the code to create more appropriate plots and summary tables.

# This code makes a histogram for a continuous variable.
ggplot(df, aes(x = age)) + 
  geom_histogram(bins = 30) + #not the right number of bins
  labs(x = "age") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  ggtitle("Bad: Default histogram with spikes")

#Histograms of continuous variables
ggplot(df, aes(x = age)) +
  geom_histogram(breaks = seq(24.5,99.5, by = 5)) +  #make bins wider and set range 
  labs(x = "Age") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  ggtitle("Good: After adjusting the breaks")

How do you know when a variable should be a factor and when it should be numeric?

7.7.4 Factor or numeric ❓

Which variables should be converted to factors and which should be numeric?

Questions of this sort have come up twice. On the Hospital Readmissions exam, a length-of-stay variable was numeric but had only a few distinct values, so some candidates treated it as a factor. The education_num variable here is also numeric but only has 16 unique values. So should this be numeric or a factor?

table(df$education_num)
## 
##     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16 
##    68   231   446   877   618   995  1042   382 13320  7812  1862  1411  7298  2621   834   593

Ask yourself this question: is there a way of comparing two values of the variable together?

  • If yes, then use numeric
  • If no, then use a factor

For example, we can say that education_num = 2 is less than education_num = 4, which means there is a natural order. This is also known as an ordinal variable.

If the factor is say, color, which can be red, blue, or green, then there is no way of comparing values together. Is red greater than blue? This question has no meaning.
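A minimal sketch of the distinction:

```r
# As numeric, values have a natural order and arithmetic is meaningful
education_num <- c(9, 13, 9, 16, 7)
mean(education_num)  # an "average education level" makes sense

# As a factor, each distinct value is just an unordered label
color <- factor(c("red", "blue", "green", "blue"))
levels(color)        # "is red greater than blue?" has no meaning
```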

ggplot(df, aes(x = education_num)) +
  geom_histogram(bins = 30) +
  labs(x = "Education") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  ggtitle("Default number of bins (30)")

ggplot(df, aes(x = education_num)) +
  geom_histogram(bins = 16) +
  labs(x = "Education") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  ggtitle("Set number of bins to number of factor levels (16) ")

# Set bins equal to number of levels, could have made bar chart.

We could also use a bar plot.

ggplot(df, aes(x = education_num)) + 
  geom_bar(stat = "count") + 
  ggtitle("Bar plot")

Lastly, read the Project Statement carefully and only do what it tells you to do.

Candidates were not required to make a plot for cap_gain. This solution has the plot made here rather than in Task 6.

7.7.5 73.6% of statistics are false 😱

Really? No, but statistics can help you see patterns that data visualization by itself can miss. Along with the ggplot code, there will be code to look at summary statistics. Here is a refresher on what these statistics mean (no pun intended).

  • Mean: The average. This gets skewed by outliers easily. If the mean is greater than the median, then the distribution is right-skewed.
  • Median: The “middle” value. This is an average that reduces the impact of outliers.
  • Variance: The average squared deviation of the observations from the mean.
  • Standard Deviation: The square root of the variance.
  • n(): The number of observations. Always take note of groups that do not have many observations.

# This code provides, for each level of a factor variable, the number for which value_flag is zero, the number for which it is one, the total number of observations, and the proportion of ones.
# Note that the variable name should not be enclosed in quotation marks.
df %>%
  group_by(variable) %>%
  dplyr::summarise(
    zeros = sum(value_flag == 0),
    ones = sum(value_flag == 1),
    n = n(),
    proportion = mean(value_flag)
  )

Factor levels should be simplified. If a level has only a few observations, then there will be problems with the model. In our data, take a look at the marital_status column. Do you observe anything unusual?

table(df$marital_status)
## 
##              Divorced     Married-AF-spouse    Married-civ-spouse Married-spouse-absent 
##                  6498                    31                 21661                   573 
##         Never-married             Separated               Widowed 
##                  8697                  1438                  1512

Only 31 observations have Married-AF-spouse. A sample size of n = 31 is too small: in modeling jargon, the estimate for this level will be statistically insignificant, and its p-value will be large. You can fix this in a few different ways:

  • Delete these observations (not recommended)
  • Group these observations together with Married-civ-spouse (simplest method)

Let us use the second method.

First, look at the levels of the factor variable.

levels(df$marital_status)
## [1] "Divorced"              "Married-AF-spouse"     "Married-civ-spouse"    "Married-spouse-absent"
## [5] "Never-married"         "Separated"             "Widowed"

Now look at the profitability across marital status. For Married-AF-spouse and Married-civ-spouse the proportion of high profit customers is high, but for Married-spouse-absent it is low. Even though these are all “married,” it would be a bad idea to combine them because the profitability is so different.

#Proportion of ones by category of factor variable
df %>%
  group_by(marital_status) %>%
  dplyr::summarise(
    zeros = sum(value_flag == 0),
    ones = sum(value_flag == 1),
    n = n(),
    proportion = mean(value_flag)
  )
## # A tibble: 7 × 5
##   marital_status        zeros  ones     n proportion
##   <fct>                 <int> <int> <int>      <dbl>
## 1 Divorced               5829   669  6498     0.103 
## 2 Married-AF-spouse        19    12    31     0.387 
## 3 Married-civ-spouse    11727  9934 21661     0.459 
## 4 Married-spouse-absent   515    58   573     0.101 
## 5 Never-married          8000   697  8697     0.0801
## 6 Separated              1342    96  1438     0.0668
## 7 Widowed                1384   128  1512     0.0847

Then create a vector that has the simpler levels that you want. The order must match the order of the original levels.

simple_levels <- c("Divorced", "Married-spouse", "Married-spouse", "Married-spouse-absent", "Never-married", "Separated", "Widowed")

The function mapvalues, from the plyr package, takes in three arguments. You can read about this by typing ?mapvalues into the console.

x : the factor or vector to modify

from : a vector of the items to replace

to : a vector of replacement values

Then map the old values to the simpler values.

#Combine the two marital status levels
library(plyr)
var.levels <- levels(df$marital_status)
df$marital_status <- mapvalues(x = df$marital_status,
                               from = var.levels, 
                               to = simple_levels) 

Now, when you look at the marital_status levels, you will see the simpler levels.

levels(df$marital_status)
## [1] "Divorced"              "Married-spouse"        "Married-spouse-absent" "Never-married"        
## [5] "Separated"             "Widowed"

You can also check that the number of observations is what you expect.

table(df$marital_status)
## 
##              Divorced        Married-spouse Married-spouse-absent         Never-married 
##                  6498                 21692                   573                  8697 
##             Separated               Widowed 
##                  1438                  1512

7.8 Exercises

7.8.1 Data Exploration Practice


7.8.2 Dplyr Practice


Run this code on your computer to answer these exercises.

The data actuary_salaries contains the salaries of actuaries collected from the DW Simpson survey. Use this data to answer the exercises below.

actuary_salaries %>% glimpse()
## Rows: 138
## Columns: 6
## $ industry    <chr> "Casualty", "Casualty", "Casualty", "Casualty", "Casualty", "Casualty", "Casualty"…
## $ exams       <chr> "1 Exam", "2 Exams", "3 Exams", "4 Exams", "1 Exam", "2 Exams", "3 Exams", "4 Exam…
## $ experience  <dbl> 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7…
## $ salary      <chr> "48 - 65", "50 - 71", "54 - 77", "58 - 82", "54 - 72", "57 - 81", "62 - 87", "63 -…
## $ salary_low  <dbl> 48, 50, 54, 58, 54, 57, 62, 63, 65, 70, 72, 85, 55, 58, 66, 71, 73, 76, 85, 90, 11…
## $ salary_high <chr> "65", "71", "77", "82", "72", "81", "87", "91", "95", "97", "99", "112", "80", "83…

  1. How many industries are represented?
  2. The salary_high column is a character type when it should be numeric. Change this column to numeric.
    1. What are the highest and lowest salaries for an actuary in Health with 5 exams passed?
  3. Create a new column called salary_mid which has the middle of the salary_low and salary_high columns.
  4. When grouping by industry, what is the highest salary_mid? What about salary_high? What is the lowest salary_low?
  5. There is a mistake when salary_low == 11. Find and fix this mistake, and then rerun the code from the previous task.
  6. Create a new column, called n_exams, which is an integer. Use 7 for ASA/ACAS and 10 for FSA/FCAS. Use the code below as a starting point and fill in the _ spaces
  7. Create a column called social_life, which is equal to n_exams/experience. What is the average (mean) social_life by industry? Bonus question: what is wrong with using this as a statistical measure?
actuary_salaries <- actuary_salaries %>% 
  mutate(n_exams = case_when(exams == "FSA" ~ _,
                             exams == "ASA" ~ _,
                             exams == "FCAS" ~ _,
                             exams == "ACAS" ~ _,
                             TRUE ~ as.numeric(substr(exams,_,_)))) 

7.9 Answers to exercises

Answers to these exercises, along with a video tutorial, are available at ExamPA.net.