12 Additional GLM topics

As you can tell, PA has a lot of small topics related to GLMs. This chapter wraps up the residual (no pun intended) topics.

12.1 Residuals

Learning from mistakes is the path to improvement. For GLMs, residual analysis looks for patterns in the errors to find ways of improving the model.

12.1.1 Raw residuals

In GLM language, the word “residual” by itself means the “raw residual”: the difference between the actual and predicted values.

\[\text{Raw Residual} = y_i - \hat{y_i}\]

12.1.2 Deviance residuals

The raw residual is not well suited to GLMs with non-Gaussian distributions, where the variance of the response changes with its mean. To adjust for other distributions, we need the concept of deviance residuals.

Deviance is a way of assessing the adequacy of a model by comparing it with a more general model that has the maximum number of parameters that can be estimated. This more general model is referred to as the saturated model, and it has one parameter per observation.

The deviance assesses the goodness of fit of the model by looking at the difference between the log-likelihood functions of the saturated model and the model under investigation, i.e. \(l(b_{sat},y) - l(b,y)\). Here \(b_{sat}\) denotes the maximum likelihood estimator of the parameter vector of the saturated model, \(\beta_{sat}\), and \(b\) is the maximum likelihood estimator of the parameters of the model under investigation, \(\beta\). (The maximum likelihood estimator is the one that maximizes the likelihood function.) The deviance is defined as

\[D = 2[l(b_{sat},y) - l(b,y)]\]

The deviance residual of the ith observation takes the square root of that observation’s contribution to the deviance, \(d_i\), and applies the sign (the + or − part) of the raw residual.

\[\text{Deviance Residual} = \text{sign}(y_i - \hat{y_i})\sqrt{d_i}\]
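In R, both types of residuals can be extracted from a fitted glm object with the residuals() function. A minimal sketch, assuming a fitted object named model such as the one built in the next section:

# raw residuals: actual minus predicted, on the response scale
residuals(model, type = "response")

# deviance residuals (the default type for glm objects)
residuals(model, type = "deviance")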

12.2 Example

Just as with OLS, glm() takes formula and data arguments. In addition, we need to specify the response distribution and link function through the family argument.

model = glm(formula = charges ~ age + sex + smoker, 
            family = Gamma(link = "log"),
            data = health_insurance)

Looking at the output below, we see that age and smoker are highly significant (p < 0.01), while sex is not (p = 0.21). Reading off the coefficient signs, we see that claims

  • Increase as age increases
  • Are higher for women (although this effect is not statistically significant)
  • Are higher for smokers
library(broom) # provides tidy()
model %>% tidy()
## # A tibble: 4 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   7.82     0.0600     130.   0        
## 2 age           0.0290   0.00134     21.6  3.40e- 89
## 3 sexmale      -0.0468   0.0377      -1.24 2.15e-  1
## 4 smokeryes     1.50     0.0467      32.1  3.25e-168
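Because the model uses a log link, exponentiating a coefficient gives a multiplicative effect on expected claims; for example, exp(1.50) ≈ 4.5, so smokers are predicted to have roughly 4.5 times the expected claims of non-smokers, all else equal. A quick way to see all of these multipliers:

# exponentiate coefficients to get multiplicative effects on the mean
exp(coef(model))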

Below you can see a graph of the deviance residuals vs. the predicted values.

If this were a perfect model, all of the assumptions below would be met:

  • Scattered around zero?
  • Constant variance?
  • No obvious pattern?
plot(model, which = 1) # deviance residuals vs. predicted values

The quantile-quantile (QQ) plot shows the quantiles of the deviance residuals (i.e., after adjusting for the Gamma distribution) against theoretical Gaussian quantiles.

In a perfect model, all of these assumptions would be met:

  • Points lie on a straight line?
  • Tails are not significantly above or below line? Some tail deviation is ok.
  • No sudden “jumps”? A jump indicates many \(Y\)’s which have the same value, such as insurance claims which all have the exact value of $100.00 or $0.00.
plot(model, which = 2) # QQ plot of the deviance residuals
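The same plot can also be built directly from the deviance residuals with base R; a minimal sketch:

# QQ plot of the deviance residuals against Gaussian quantiles
qqnorm(residuals(model, type = "deviance"))
qqline(residuals(model, type = "deviance"))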

12.3 Log transforms of predictors

When a log link is used, taking the natural logs of continuous variables allows for the scale of each predictor to match the scale of the thing that they are predicting, the log of the mean of the response. In addition, when the distribution of the continuous variable is skewed, taking the log helps to make it more symmetric.

For \(\mu\) the mean response,

\[log(\mu) = \beta_0 + \beta_1 log(X)\]

To solve for \(\mu\), exponentiate both sides:

\[\mu = e^{\beta_0} e^{\beta_1 log(X)} = e^{\beta_0} X^{\beta_1}\]

12.4 Example

In the Hospital Readmission sample project, one of the predictor variables, “Length of stay” (LOS), is the number of days that a patient stayed in the hospital. You can tell that it is right-skewed because the mean is higher than the median.

summary(readmission$LOS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   6.693   8.000  36.000

But it could also be thought of as a discrete variable because it only takes on 36 values. Should you still apply a log transform?

readmission %>% count(LOS)
## # A tibble: 36 × 2
##      LOS     n
##    <dbl> <int>
##  1     1   986
##  2     2  7646
##  3     3  9775
##  4     4 11325
##  5     5  8365
##  6     6  6020
##  7     7  4600
##  8     8  3534
##  9     9  2719
## 10    10  1997
## # … with 26 more rows

Here are the histograms of LOS before and after the log transform.
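A minimal sketch of code that would produce them, assuming ggplot2 is loaded (the bin count is arbitrary):

# LOS on the original scale: right-skewed
readmission %>% 
  ggplot(aes(LOS)) + 
  geom_histogram(bins = 30)

# LOS after the log transform: more symmetric
readmission %>% 
  ggplot(aes(log(LOS))) + 
  geom_histogram(bins = 30)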

Yes, the SOA’s solution applies the log transform.
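One way to apply it (whether to overwrite the column or create a new one is a matter of preference; this sketch overwrites it):

# LOS >= 1, so log(LOS) is well-defined
readmission <- readmission %>% 
  mutate(LOS = log(LOS))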

12.5 Reference levels

When a categorical variable is used in a GLM, the model actually uses indicator variables for each level. The default reference level is the first level of the R factor. For the sex variable, the levels are ordered female and then male, so the base level is female by default.

health_insurance$sex %>% as.factor() %>% levels()
## [1] "female" "male"

Why does this matter? Statistically, the coefficient estimates are most stable when the reference level has the largest number of observations.

health_insurance$sex %>% as.factor() %>% summary()
## female   male 
##    662    676

The tidyverse (specifically, the forcats package) already has a function for this called fct_infreq, which orders factor levels by frequency. Let’s quickly fix the sex column so that its factor levels are in order of frequency.

health_insurance <- health_insurance %>% 
  mutate(sex = fct_infreq(sex))

Now male is the base level.

health_insurance$sex %>% as.factor() %>% levels()
## [1] "male"   "female"

12.6 Interactions

An interaction occurs when the effect of a variable on the response is different depending on the level of other variables in the model.

Consider this model:

Let \(x_1\) be an indicator variable, which is 1 for some observations and 0 otherwise.

\[\hat{y_i} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2\]

There are now two different linear models depending on whether \(x_1\) is 0 or 1.

When \(x_1 = 0\),

\[\hat{y_i} = \beta_0 + \beta_2 x_2\]

and when \(x_1 = 1\)

\[\hat{y_i} = \beta_0 + \beta_1 + \beta_2 x_2 + \beta_3 x_2\]

By rewriting this we can see that the intercept changes from \(\beta_0\) to \(\beta_0^* = \beta_0 + \beta_1\) and the slope on \(x_2\) changes from \(\beta_2\) to \(\beta_1^* = \beta_2 + \beta_3\)

\[ (\beta_0 + \beta_1) + (\beta_2 + \beta_3 ) x_2 = \beta_0^* + \beta_1^* x_2 \]

Here is an example from the auto_claim data. The lines show the slope of a linear model, assuming that only BLUEBOOK and CAR_TYPE were predictors in the model. You can see that the slope for Sedans and Sports Cars is higher than for Vans and Panel Trucks.

auto_claim %>% 
  sample_frac(0.2) %>% # plot a 20% sample to reduce overplotting
  ggplot(aes(log(CLM_AMT), log(BLUEBOOK), color = CAR_TYPE)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", se = FALSE) + # one fitted line per car type
  labs(title = "Kelly Bluebook Value vs Claim Amount")
## `geom_smooth()` using formula 'y ~ x'

Figure 12.1: Example of strong interaction

Any time the effect of one variable on the response differs depending on the value of other variables, we say that there is an interaction. We can also use a hypothesis test with a GLM to check for this: simply include an interaction term and test whether its coefficient is significantly different from zero, as in the sketch below.
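A minimal sketch using the health_insurance data from earlier; the choice of age and smoker is illustrative. The p-value on the age:smokeryes row tests whether the interaction coefficient is zero.

# the * operator adds both main effects and their interaction
interaction_model <- glm(charges ~ age * smoker,
                         family = Gamma(link = "log"),
                         data = health_insurance)
interaction_model %>% tidy()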

12.7 Offsets

In certain situations, it is convenient to include a term in the linear predictor whose coefficient is fixed at 1 rather than estimated. We call this term an offset.

\[g(\mu) = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \text{offset}\]

On Exam PA, offsets will only be used in one special case:

  1. With Poisson regression
  2. With a log link function
  3. As a measure of exposure (usually length of policy period)

While it is technically possible to use offsets in other ways, this is not likely to appear on PA.

If modeling the spread of COVID, the exposure would be the number of people exposed to the virus and the response would be the number of people infected.

In auto insurance, the exposure might be the number of months of coverage, and the response would be the claims incurred. Consider a very simple model which only uses the year that the car was manufactured as a predictor. The expected value of the claims per month of coverage, the target variable, would be

\[log(E[\frac{\text{Claims}}{\text{Months}}]) = \beta_0 + \beta_1 \text{Year}\]

Then you can use the property of the log where \(log(\frac{A}{B}) = log(A) - log(B)\) to move things around. Because \(\text{Months}\) is known rather than random, it can be moved outside of the expected value. The resulting \(log(\text{Months})\) term is the offset.

\[log(E[\text{Claims}]) = \beta_0 + \beta_1 \text{Year} + log(\text{Months})\]
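A minimal sketch of how this could be fit, assuming a hypothetical data frame policies with columns claims, year, and months:

# offset(log(months)) enters the linear predictor with its coefficient fixed at 1
offset_model <- glm(claims ~ year + offset(log(months)),
                    family = poisson(link = "log"),
                    data = policies) # `policies` is hypothetical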

12.8 Tweedie regression

While this topic is briefly mentioned in the modules, the only R libraries which support Tweedie regression (statmod and tweedie) are not on the syllabus, and so there is no way that the SOA could ask you to build a Tweedie model. This means that you can safely skip this section.