After this lecture, you should

- know how to fit a multiple regression model using lm,
- understand and be able to interpret adjusted $R^2$, and
- be able to use diagnostic plots to assess the validity of a linear fit.
We will work with the mariokart data set, which consists of auction data from Ebay for the game Mario Kart for the Nintendo Wii. This data was collected in early October 2009.

```r
head(mariokart)
```

```
## # A tibble: 6 x 12
##        id duration n_bids cond  start_pr ship_pr total_pr ship_sp  seller_rate
##     <dbl>    <int>  <int> <fct>    <dbl>   <dbl>    <dbl> <fct>          <int>
## 1 1.50e11        3     20 new       0.99    4        51.6 standard        1580
## 2 2.60e11        7     13 used      0.99    3.99     37.0 firstCl~         365
## 3 3.20e11        3     16 new       0.99    3.5      45.5 firstCl~         998
## 4 2.80e11        3     18 new       0.99    0        44   standard           7
## 5 1.70e11        1     20 new       0.01    0        71   media            820
## 6 3.60e11        3     19 new       0.99    4        45   standard      270144
## # ... with 3 more variables: stock_photo <fct>, wheels <int>, title <fct>
```

```r
glimpse(mariokart)
```

```
## Rows: 141
## Columns: 12
## $ id          <dbl> 150377422259, 260483376854, 320432342985, 280405224677, 17~
## $ duration    <int> 3, 7, 3, 3, 1, 3, 1, 1, 3, 7, 1, 1, 1, 1, 7, 7, 3, 3, 1, 1~
## $ n_bids      <int> 20, 13, 16, 18, 20, 19, 13, 15, 29, 8, 15, 15, 13, 16, 6, ~
## $ cond        <fct> new, used, new, new, new, new, used, new, used, used, new,~
## $ start_pr    <dbl> 0.99, 0.99, 0.99, 0.99, 0.01, 0.99, 0.01, 1.00, 0.99, 19.9~
## $ ship_pr     <dbl> 4.00, 3.99, 3.50, 0.00, 0.00, 4.00, 0.00, 2.99, 4.00, 4.00~
## $ total_pr    <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99, 47~
## $ ship_sp     <fct> standard, firstClass, firstClass, standard, media, standar~
## $ seller_rate <int> 1580, 365, 998, 7, 820, 270144, 7284, 4858, 27, 201, 4858,~
## $ stock_photo <fct> yes, yes, no, yes, yes, yes, yes, yes, yes, no, yes, yes, ~
## $ wheels      <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2, 0~
## $ title       <fct> "~~ Wii MARIO KART & WHEEL ~ NINTENDO Wii ~ BRAND NEW ~
```

What factors affect the total price (total_pr) at which a game is sold? Let's begin by fitting a simple linear model with condition (cond) as the only predictor:

```r
lm_fit <- lm(total_pr ~ cond, data = mariokart)
summary(lm_fit)
```

```
## 
## Call:
## lm(formula = total_pr ~ cond, data = mariokart)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.8911  -5.8311   0.1289   4.1289  22.1489 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.7707     0.9596  56.034  < 2e-16 ***
## condused    -10.8996     1.2583  -8.662 1.06e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.371 on 139 degrees of freedom
## Multiple R-squared:  0.3506, Adjusted R-squared:  0.3459 
## F-statistic: 75.03 on 1 and 139 DF,  p-value: 1.056e-14
```
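To see what this simple model predicts, we can plug in each level of cond. A minimal sketch; the values follow directly from the intercept and slope in the summary above:

```r
# Predicted total price for a new and a used copy of the game
predict(lm_fit, newdata = data.frame(cond = c("new", "used")))
# new:  53.77 (the intercept)
# used: 53.77 - 10.90 = 42.87
```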

As we will see, in R it is extremely easy to fit a model with many predictors. Why might we want to do this?
We would like to fit a model that includes all potentially important variables simultaneously.
Multiple regression can help us evaluate the relationship between a predictor variable and the outcome while controlling for the potential influence of other variables.
Let's fit a more complicated linear model.
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
when there are $k$ predictors. We always estimate the $\beta_i$ parameters using statistical software.
For example, we may want to use cond, stock_photo (whether the auction feature photo was a stock photo or not), duration (auction length, in days), and wheels (number of Wii wheels included in the auction) all as predictors of price for the mariokart data.
Let's obtain a linear fit with these predictors using lm.
```r
lm_fit2 <- lm(total_pr ~ cond + stock_photo + duration + wheels, data = mariokart)
summary(lm_fit2)
```
```
## 
## Call:
## lm(formula = total_pr ~ cond + stock_photo + duration + wheels, 
##     data = mariokart)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3788  -2.9854  -0.9654   2.6915  14.0346 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    41.34153    1.71167  24.153  < 2e-16 ***
## condused       -5.13056    1.05112  -4.881 2.91e-06 ***
## stock_photoyes  1.08031    1.05682   1.022    0.308    
## duration       -0.02681    0.19041  -0.141    0.888    
## wheels          7.28518    0.55469  13.134  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.901 on 136 degrees of freedom
## Multiple R-squared:  0.719,  Adjusted R-squared:  0.7108 
## F-statistic: 87.01 on 4 and 136 DF,  p-value: < 2.2e-16
```
Notice that when we have controlled for other features, the condition (new versus used) of the game has a smaller impact on the price of the game since the slope estimate has gone from -10.90 to -5.13.
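We can pull the two slope estimates directly from the fitted models to see this side by side:

```r
coef(lm_fit)["condused"]    # about -10.90 in the simple model
coef(lm_fit2)["condused"]   # about -5.13 once other features are controlled for
```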
For simple linear regression, we used $R^2$ to determine the amount of variability in the response that was explained by the model. Recall that
$$R^2 = 1 - \frac{\text{variability in residuals}}{\text{variability in the response}}$$
For multiple regression, we use the adjusted $R^2$:
$$R^2_{adj} = 1 - \frac{s^2_{\text{residuals}}}{s^2_{\text{response}}} \cdot \frac{n-1}{n-k-1}$$
where $n$ is the number of observations and $k$ is the number of predictor variables. Remember that a categorical predictor with $p$ levels will contribute $p-1$ to the number of variables in the model.
Notice that the adjusted $R^2$ will be smaller than the unadjusted $R^2$.
One of the main benefits of using adjusted $R^2$ for multiple regression is that it accounts for model complexity.
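As a quick check, we can compute the adjusted $R^2$ for lm_fit2 directly from this formula. A minimal sketch; here $k = 4$, since each of the four predictors contributes one variable:

```r
n <- nrow(mariokart)                 # 141 observations
k <- 4                               # cond, stock_photo, duration, wheels
s2_resid <- var(residuals(lm_fit2))  # variance of the residuals
s2_resp  <- var(mariokart$total_pr)  # variance of the response
1 - (s2_resid / s2_resp) * (n - 1) / (n - k - 1)
summary(lm_fit2)$adj.r.squared       # same value, about 0.7108
```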
The best model is not always the most complicated one. For one, more complex models are more likely to overfit.
Model selection seeks to identify variables in the model that may not be helpful.
The model that includes all available explanatory variables is referred to as the full model.
There are a variety of model selection strategies that are used in practice. We will discuss two of the more common approaches.
Backward Elimination. In this approach, we would identify the predictor corresponding to the largest p-value. If the p-value is above the significance level (usually α=0.05), then we drop that variable, refit the model, and repeat the process. If the largest p-value is less than the significance level, then we would not eliminate any predictors.
Forward Selection. This approach begins with no predictors, then we fit a model with each individual predictor one at a time and keep the predictor that has the smallest p-value. Forward selection proceeds by continuing to add at each step a predictor that results in the smallest p-value that is less than the significance level. When none of the remaining predictors can be added to the model and have a p-value less than the significance level, we stop.
It is important to note that backward elimination and forward selection may not produce the same final model.
As an exercise, apply backward elimination and forward selection to the mariokart data. Note that the full model is (in R formula notation) total_pr ~ cond + stock_photo + duration + wheels. Let's work out the details together in R.
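As a starting point, here is a minimal sketch of backward elimination that drops the largest-p-value predictor one step at a time, using the p-values from the summary output above:

```r
# Step 0: fit the full model
fit <- lm(total_pr ~ cond + stock_photo + duration + wheels, data = mariokart)
summary(fit)  # duration has the largest p-value (0.888), above 0.05 -> drop it

# Step 1: refit without duration and inspect the p-values again
fit <- update(fit, . ~ . - duration)
summary(fit)  # drop the predictor with the largest p-value if it exceeds 0.05

# Repeat until every remaining predictor has a p-value below 0.05.
# (Forward selection runs in the opposite direction: start from
# total_pr ~ 1 and, at each step, add the predictor with the smallest
# p-value, stopping when none falls below the significance level.)
```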
Backward elimination and forward selection use p-values in deciding which variables will make up the final model. However, there are other measures that are used in other approaches to model selection. For example, one could seek a model that has the largest adjusted R2 value. Information theoretic measures such as AIC and BIC are also often used. A discussion on these matters falls outside the scope of this course.
We note that there are packages associated with statistical software that implement various variable selection algorithms. For example, olsrr is an R package that implements a variety of variable selection methods.
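For instance, a sketch using olsrr, assuming the package is installed and that its ols_step_backward_p() function is available for p-value-based backward elimination:

```r
library(olsrr)

# p-value-based backward elimination on the full model
ols_step_backward_p(lm_fit2)
```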
Results based on the multiple regression model
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
generally depend on the following four conditions:
the residuals of the model are nearly normal,
the variability of the residuals is nearly constant,
the residuals are independent, and
each variable is linearly related to the response.
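Base R can produce standard diagnostic plots for checking these conditions; a minimal sketch applied to our multiple regression fit:

```r
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(lm_fit2)         # residuals vs fitted, normal Q-Q, scale-location, leverage
```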
For example, the residuals of our model for the mariokart data have the following histogram:

[Figure: histogram of the model residuals]

A histogram like this lets us check whether the residuals are nearly normal.
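Such a histogram takes one line to produce; a sketch, assuming we are examining the residuals of lm_fit2:

```r
hist(residuals(lm_fit2), xlab = "Residuals",
     main = "Histogram of residuals")
```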
It can also be useful to examine the following types of plots (see the sketch after this list):
Residuals in the order of their data collection. Such a plot is helpful in identifying any connection between cases that are close to one another.
Residuals against each predictor. We are looking for any notable change in variability between groups.
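A minimal sketch of both plot types, again using the residuals of lm_fit2 (the column names come from the glimpse output above):

```r
res <- residuals(lm_fit2)

# Residuals in the order of data collection
plot(res, type = "l", xlab = "Order of collection", ylab = "Residual")

# Residuals against each predictor (a factor on the x-axis gives boxplots)
plot(mariokart$cond, res, xlab = "Condition", ylab = "Residual")
plot(mariokart$wheels, res, xlab = "Wheels included", ylab = "Residual")
```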
These plots are shown for model results for the mariokart data on pages 369 and 370 of the textbook. Let's look at these together and discuss.
When it comes to regression, we have only scratched the surface. There is more we could discuss regarding multiple regression, and there are also other types of regression. The following video provides an introduction to logistic regression:

[Video: introduction to logistic regression]
For even more on regression, we recommend the text Linear Models with R by Faraway.