12 Additional GLM topics
As you can tell, PA has a lot of small topics related to GLMs. This chapter completes some of the residual (no pun intended) topics.
12.1 Residuals
Learning from mistakes is the path to improvement. For GLMs, the residual analysis looks for patterns in the errors to find ways of improving the model.
12.1.1 Raw residuals
The word âresidualâ by itself means the âraw residualâ in GLM language. This is the difference in actual vs. predicted values.
\[\text{Raw Residual} = y_i - \hat{y_i}\]
12.1.2 Deviance residuals
This is not meant for GLMs with non-Gaussian distributions. To adjust for other distributions, we need the concept of deviance residuals.
Deviance is a way of assessing the adequacy of a model by comparing it with a more general model with the maximum number of parameters that can be estimated. It is referred to as the saturated model and it has one parameter per observation.
The deviance assesses the goodness of fit for the model by looking at the difference between the log-likelihood functions of the saturated model and the model under investigation, i.e. \(l(b_{sat},y) - l(b,y)\). Here sat \(b_{sat}\) denotes the maximum likelihood estimator of the parameter vector of the saturated model, \(\beta_{sat}\) , and \(b\) is the maximum likelihood estimator of the parameters of the model under investigation, \(\beta\). The maximum likelihood estimator is the estimator that maximizes the likelihood function. The deviance is defined as
\[D = 2[l(b_{sat},y) - l(b,y)]\]
The deviance residual uses the deviance of the ith observation \(d_i\) and then takes the square root and applies the same sign (aka, the + or - part) of the raw residual.
\[\text{Deviance Residual} = \text{sign}(y_i - \hat{y_i})\sqrt{d_i}\]
12.2 Example
Just as with OLS, there is a formula
and data argument
. In addition, we need to specify the target distribution and link function.
= glm(formula = charges ~ age + sex + smoker,
model family = Gamma(link = "log"),
data = health_insurance)
We see that age
, sex
, and smoker
are all significant (p <0.01). Reading off the coefficient signs, we see that claims
- Increase as age increases
- Are higher for women
- Are higher for smokers
%>% tidy() model
## # A tibble: 4 Ă 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.82 0.0600 130. 0
## 2 age 0.0290 0.00134 21.6 3.40e- 89
## 3 sexmale -0.0468 0.0377 -1.24 2.15e- 1
## 4 smokeryes 1.50 0.0467 32.1 3.25e-168
Below you can see graph of deviance residuals vs. the predicted values.
If this were a perfect model, all of these below assumptions would be met:
- Scattered around zero?
- Constant variance?
- No obvious pattern?
plot(model, which = 3)
The quantile-quantile (QQ) plot shows the quantiles of the deviance residuals (i.e., after adjusting for the Gamma distribution) against theoretical Gaussian quantiles.
In a perfect model, all of these assumptions would be met:
- Points lie on a straight line?
- Tails are not significantly above or below line? Some tail deviation is ok.
- No sudden âjumps?â This indicates many \(Y\)âs which have the same value, such as insurance claims which all have the exact value of $100.00 or $0.00.
plot(model, which = 2)
12.3 Log transforms of predictors
When a log link is used, taking the natural logs of continuous variables allows for the scale of each predictor to match the scale of the thing that they are predicting, the log of the mean of the response. In addition, when the distribution of the continuous variable is skewed, taking the log helps to make it more symmetric.
For \(\mu\) the mean response,
\[log(\mu) = \beta_0 + \beta_1 log(X)\] To solve for \(\mu\), take the exponent of both sides
\[\mu = e^{\beta_1} e^{\beta_1 log(X)} = e^{\beta_0} X^{\beta_1}\]
12.4 Example
In the Hospital Readmission sample project, one of the predictor variables, âLength of stay,â is the number of days since a person has been readmitted to the hospital. You can tell that it is right-skewd because the median is higher than the mean.
summary(readmission$LOS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 6.693 8.000 36.000
But it could also be thought of as a discrete variable because it only takes on 36 values. Should you still apply a log transform?
%>% count(LOS) readmission
## # A tibble: 36 Ă 2
## LOS n
## <dbl> <int>
## 1 1 986
## 2 2 7646
## 3 3 9775
## 4 4 11325
## 5 5 8365
## 6 6 6020
## 7 7 4600
## 8 8 3534
## 9 9 2719
## 10 10 1997
## # ⊠with 26 more rows
Here are the histograms
Yes, the SOAâs solution applys the log transform.
12.5 Reference levels
When a categorical variable is used in a GLM, the model actually uses indicator variables for each level. The default reference level is the order of the R factors. For the sex
variable, the order is female
and then male
. This means that the base level is female
by default.
$sex %>% as.factor() %>% levels() health_insurance
## [1] "female" "male"
Why does this matter? Statistically, the coefficients are most stable when there are more observations.
$sex %>% as.factor() %>% summary() health_insurance
## female male
## 662 676
There is already a function to do this in the tidyverse
called fct_infreq
. Letâs quickly fix the sex
column so that these factor levels are in order of frequency.
<- health_insurance %>%
health_insurance mutate(sex = fct_infreq(sex))
Now male
is the base level.
$sex %>% as.factor() %>% levels() health_insurance
## [1] "male" "female"
12.6 Interactions
An interaction occurs when the effect of a variable on the response is different depending on the level of other variables in the model.
Consider this model:
Let \(x_2\) be an indicator variable, which is 1 for some observations and 0 otherwise.
\[\hat{y_i} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2\]
There are now two different linear models depending on whether x_1
is 0 or 1.
When \(x_1 = 0\),
\[\hat{y_i} = \beta_0 + \beta_2 x_2\]
and when \(x_1 = 1\)
\[\hat{y_i} = \beta_0 + \beta_1 + \beta_2 x_2 + \beta_3 x_2\] By rewriting this we can see that the intercept changes from \(\beta_0\) to \(\beta_0^*\) and the slope changes from \(\beta_1\) to \(\beta_1^*\)
\[
(\beta_0 + \beta_1) + (\beta_2 + \beta_3 ) x_2 \\
= \beta_0^* + \beta_1^* x_2
\]
Here is an example from the auto_claim
data. The lines show the slope of a linear model, assuming that only BLUEBOOK
and CAR_TYPE
were predictors in the model. You can see that the slope for Sedans and Sports Cars is higher than for Vans and Panel Trucks.
%>%
auto_claim sample_frac(0.2) %>%
ggplot(aes(log(CLM_AMT), log(BLUEBOOK), color = CAR_TYPE)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = F) +
labs(title = "Kelly Bluebook Value vs Claim Amount")
## `geom_smooth()` using formula 'y ~ x'
Any time that effect of one variable on the response is different depending on the value of other variables, we say that there is an interaction. We can also use a hypothesis test with a GLM to check this. Simply include an interaction term and see if the coefficient is zero at the desired significance level.
12.7 Offsets
In certain situations, it is convenient to include a constant term in the linear predictor. This is the same as including a variable that has a coefficient equal to 1. We call this an offset.
\[g(\mu) = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \text{offset}\] On Exam PA, offsets will only be used for one special case:
- With Poisson regression
- With a log link function
- As a measure of exposure (usually length of policy period)
While it is technically possible to use offsets in other ways, this is not likely to appear on PA.
If modeling the spread of COVID, the exposure would be the number of people exposed to the virus and the response would be the number of people infected.
In auto insurance, the exposure might be the number of months of coverage, and the response would be the claims incurred. Consider a very simple model which only uses the year that the car was manufactured as a predictor. This expected value of the claims, the target variable, would be
\[log(E[\frac{\text{Claims}}{\text{Months}}]) = \beta_0 + \beta_1 \text{Year}\] Then you can use the property of the log where \(log(\frac{A}{B}) = log(A) - log(B)\) to move things around. Because \(\text{Months}\) is known, you can remove the expected value. This is the offset term.
\[log(E[\text{Claims}]) = \beta_0 + \beta_1 \text{Year} + \text{Months}\]
12.8 Tweedie regression
While this topic is briefly mentioned in the modules, the only R libraries which support Tweedie Regression (statmod
and tweedie
) are not on the syllabus, and so there is no way that the SOA could ask you to build a tweedie model. This means that you can safely skip this section.
12.9 Combinations of Link Functions and Target Distributions
What is an example of when to use a log link with a Gaussian response? What about a Gamma family with an inverse link? What about an inverse
Gaussian response and an inverse square link? As these questions illustrate, there are many combinations of link and response families. In the real world, a model never fits perfectly, and so often, these choices come down to the judgment of the modeler - which model is the best fit and meets the business objectives?
However, there is one way that we can know for certain which link and response family is the best, and that is if we generate the data ourselves.
Recall that a GLM has two parts:
A random component: \(Y|X \sim \text{some exponential family distribution}\)
A link function: between the random component and the covariates: \(g(\mu(X)) = X\beta\) where \(\mu = E[Y|X]\)
Following this recipe, we can simulate data from any combination of link function and response family. This helps us to understand the GLM framework very clearly.
12.9.1 Gaussian Response with Log Link
We create a function that takes in data \(x\) and returns a Gaussian random variable that has mean equal to the inverse link, which in the case of a log link is the exponent. We add 10 to \(x\) so that the values will always be positive, as will be described later on.
<- function(x) {
sim_norm rnorm(1, mean = exp(10 + x), sd = 1)
}
The values of \(X\) do not need to be normal. The above assumption is merely that the mean of the response \(Y\) is related to \(X\) through the link function, mean = exp(10 + x)
, and that the distribution is normal. This has been accomplished with rnorm
already. For illustration, here we use \(X\)âs from a uniform distribution.
<- tibble(x = runif(500)) %>%
data mutate(y = x %>% map_dbl(sim_norm))
We already know what the answer is: a Gaussian response with a log link. We fit a GLM and see a perfect fit.
<- glm(y ~ x, family = gaussian(link = "log"), data = data)
glm
summary(glm)
##
## Call:
## glm(formula = y ~ x, family = gaussian(link = "log"), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.05488 -0.73818 -0.01268 0.71014 2.93377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+01 2.981e-06 3354536 <2e-16 ***
## x 1.000e+00 4.383e-06 228152 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.061249)
##
## Null deviance: 5.817e+10 on 499 degrees of freedom
## Residual deviance: 5.285e+02 on 498 degrees of freedom
## AIC: 1452.7
##
## Number of Fisher Scoring iterations: 2
par(mfrow = c(2,2))
plot(glm, cex = 0.4)
12.9.2 Gaussian Response with Inverse Link
The same steps are repeated except the link function is now the inverse, mean = 1/x
. We see that some values of \(Y\) are negative, which is ok.
<- function(x) {
sim_norm rnorm(1, mean = 1/x, 1)
}
<- tibble(x = runif(500)) %>%
data mutate(y = x %>% map_dbl(sim_norm))
summary(data)
## x y
## Min. :0.002351 Min. : -1.392
## 1st Qu.:0.278258 1st Qu.: 1.149
## Median :0.528553 Median : 2.287
## Mean :0.509957 Mean : 6.875
## 3rd Qu.:0.760526 3rd Qu.: 4.014
## Max. :0.999992 Max. :425.760
<- glm(y ~ x, family = gaussian(link = "inverse"), data = data)
glm
summary(glm)
##
## Call:
## glm(formula = y ~ x, family = gaussian(link = "inverse"), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.11464 -0.77276 -0.02954 0.70110 2.50767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.605e-06 1.238e-05 0.13 0.897
## x 9.983e-01 4.032e-03 247.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.080454)
##
## Null deviance: 383534.20 on 499 degrees of freedom
## Residual deviance: 538.07 on 498 degrees of freedom
## AIC: 1461.6
##
## Number of Fisher Scoring iterations: 4
par(mfrow = c(2,2))
plot(glm, cex = 0.4)
12.9.3 Gaussian Response with Identity Link
And now the link is the identity, mean = x
.
<- function(x) {
sim_norm rnorm(1, mean = x, 1)
}
<- tibble(x = rnorm(500)) %>%
data mutate(y = x %>% map_dbl(sim_norm))
<- glm(y ~ x, family = gaussian(link = "identity"), data = data)
glm
summary(glm)
##
## Call:
## glm(formula = y ~ x, family = gaussian(link = "identity"), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8731 -0.6460 0.0225 0.6519 3.3242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02411 0.04503 0.535 0.593
## x 1.01347 0.04599 22.035 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.0125)
##
## Null deviance: 995.82 on 499 degrees of freedom
## Residual deviance: 504.23 on 498 degrees of freedom
## AIC: 1429.1
##
## Number of Fisher Scoring iterations: 2
par(mfrow = c(2,2))
plot(glm, cex = 0.4)
12.9.4 Gaussian Response with Log Link and Negative Values
By Gaussian response we say that the mean of the response is Gaussian. The range of a normal random variable is \((-\infty, +\infty)\), which means that negative values are always possible. If the mean is a large positive number, then negative values are much less likely but still possible: about 95% of the observations will be within 2 standard deviations of the mean.
We see below that there are some \(Y\) values which are negative.
<- function(x) {
sim_norm rnorm(1, mean = exp(x), sd = 1)
}
<- tibble(x = runif(500)) %>%
data mutate(y = x %>% map_dbl(sim_norm))
summary(data)
## x y
## Min. :0.0002768 Min. :-1.505
## 1st Qu.:0.2282455 1st Qu.: 0.947
## Median :0.5205783 Median : 1.660
## Mean :0.5081372 Mean : 1.680
## 3rd Qu.:0.7499811 3rd Qu.: 2.424
## Max. :0.9984118 Max. : 4.707
We can also see this from the histogram.
%>% ggplot(aes(y)) + geom_density( fill = 1, alpha = 0.3) data
If we try to fit a GLM with a log link, there is an error.
<- glm(y ~ x, family = gaussian(link = "log"), data = data) glm
Error in eval(family$initialize) : cannot find valid starting values: please specify some
This is because the domain of the natural logarithm only includes positive numbers, and we just tried to take the log of negative numbers.
Our initial reaction might be to add some constant to each \(Y\), say 10, for instance, so that they are all positive. This does produce a model which is a good fit.
<- glm(y + 10 ~ x, family = gaussian(link = "log"), data = data)
glm summary(glm)
##
## Call:
## glm(formula = y + 10 ~ x, family = gaussian(link = "log"), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.14633 -0.63601 -0.01264 0.70930 2.87307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.386801 0.007862 303.58 <2e-16 ***
## x 0.138190 0.012846 10.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.017812)
##
## Null deviance: 624.76 on 499 degrees of freedom
## Residual deviance: 506.87 on 498 degrees of freedom
## AIC: 1431.8
##
## Number of Fisher Scoring iterations: 4
par(mfrow = c(2,2))
plot(glm, cex = 0.4)
We see that on average, the predictions are 10 higher than the target. This is no surprise since \(E[Y + 10] = E[Y] + 10\).
<- data$y
y <- predict(glm, type = "response")
y_hat mean(y_hat) - mean(y)
## [1] 9.999967
However, we see that the actual predictions are bad. If we were to look at the R-squared, MAE, RMSE, or any other metric, it would tell us the same story. This is because our GLM assumption not that \(Y\) is related to the link function of \(X\), but that the mean of \(Y\) is.
tibble(y = y, y_hat = y_hat - 10) %>% ggplot(aes(y, y_hat)) + geom_point()
One solution is to adjust the \(X\) which the model is based on. Add a constant term to \(X\) so that the mean of \(Y\) is larger, and hence \(Y\) is non zero. While is a viable approach in the case of only one predictor variable, with more predictors this would not be easy to do.
<- tibble(x = runif(500) + 10) %>%
data mutate(y = x %>% map_dbl(sim_norm))
summary(data)
## x y
## Min. :10.00 Min. :22040
## 1st Qu.:10.27 1st Qu.:28835
## Median :10.50 Median :36402
## Mean :10.50 Mean :37838
## 3rd Qu.:10.75 3rd Qu.:46688
## Max. :11.00 Max. :59756
<- glm(y ~ x, family = gaussian(link = "log"), data = data)
glm par(mfrow = c(2,2))
plot(glm, cex = 0.4)
A better approach may be to use an inverse link even though the data was generated from a log link. This is a good illustration of the saying âall models are wrong, but some are usefulâ in that the statistical assumption of the model is not correct but the model still works.
<- tibble(x = runif(500)) %>%
data mutate(y = x %>% map_dbl(sim_norm))
<- glm(y ~ x, family = gaussian(link = "inverse"), data = data)
glm par(mfrow = c(2,2))
plot(glm, cex = 0.4)
summary(glm)
##
## Call:
## glm(formula = y ~ x, family = gaussian(link = "inverse"), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.63660 -0.71949 -0.02573 0.70123 2.68053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.87559 0.04503 19.44 <2e-16 ***
## x -0.53090 0.05703 -9.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.104314)
##
## Null deviance: 662.67 on 499 degrees of freedom
## Residual deviance: 549.95 on 498 degrees of freedom
## AIC: 1472.5
##
## Number of Fisher Scoring iterations: 6
12.9.5 Gamma Response with Log Link
The gamma distribution with rate parameter \(\alpha\) and scale parameter \(\theta\) is density.
\[f(y) = \frac{(y/\theta)^\alpha}{x \Gamma(\alpha)}e^{-x/\theta}\]
The mean is \(\alpha\theta\).
Letâs use a gamma with shape 2 and scale 0.5, which has mean 1.
<- rgamma(500, shape=2, scale = 0.5)
gammas mean(gammas)
## [1] 1.018104
We then generate random gamma values. Because the mean now depends on two parameters instead of one, which was just \(\mu\) in the Gaussian case, we need to use a slightly different approach to simulate the random values. The link function here is seen in exp(x)
.
#random component
<- runif(1000, min=0, max=100)
x
#relate Y to X with a log link function
<- gammas*exp(x)
y
<- tibble(x = x, y = y)
data summary(data)
## x y
## Min. : 0.1447 Min. :0.000e+00
## 1st Qu.:25.0189 1st Qu.:4.231e+10
## Median :49.2899 Median :1.476e+21
## Mean :50.0386 Mean :2.544e+41
## 3rd Qu.:76.4823 3rd Qu.:8.476e+32
## Max. :99.8365 Max. :2.595e+43
As expected, the residual plots are all perfect because the model is perfect.
<- glm(y ~ x, family = Gamma(link = "log"), data = data)
glm par(mfrow = c(2,2))
plot(glm, cex = 0.4)
If we had tried using an inverse instead of the log, the residual plots would look much worse.
<- glm(y ~ x, family = Gamma(link = "inverse"), data = data)
glm par(mfrow = c(2,2))
plot(glm, cex = 0.4)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
12.9.6 Gamma with Inverse Link
With the inverse link, the mean has a factor 1/(x + 1)
. Note that we need to add 1 to x to avoid dividing by zero.
#relate Y to X with a log link function
<- gammas*1/(x + 1)
y
<- tibble(x = x, y = y)
data summary(data)
## x y
## Min. : 0.1447 Min. :0.0005573
## 1st Qu.:25.0189 1st Qu.:0.0095364
## Median :49.2899 Median :0.0177203
## Mean :50.0386 Mean :0.0454184
## 3rd Qu.:76.4823 3rd Qu.:0.0383153
## Max. :99.8365 Max. :1.5248073
<- glm(y ~ x, family = Gamma(link = "inverse"), data = data)
glm par(mfrow = c(2,2))
plot(glm, cex = 0.4)