Linear regression is a statistical method for fitting a line to data.
Recall that a (non-vertical) line in the x,y-plane is determined by
y = slope × x + intercept
There are two aspects to fitting a line to data that we will study:
Estimating the slope and intercept values, and
Assessing the uncertainty of our estimates for the slope and intercept values.
In this lecture, we cover all of the concepts necessary to understand how to carry out and interpret linear regression.
We encourage you to watch the video on the next slide to help in getting introduced to linear regression.
After this lecture, you should
Understand the basic principles of simple linear regression: parameter estimates, residuals, and correlation. (8.1, 8.2)
Know the conditions for least squares regression: linearity, normality, constant variance, and independence. (8.2)
Know how to diagnose problems with a linear fit by least squares regression. (8.3)
Understand the methods of inference of least squares regression. (8.4)
Know how to obtain a linear fit using R with the lm function.
Be able to assess and interpret the results of a linear fit using R with the lm function.
$$y = \beta_0 + \beta_1 x + \epsilon$$
where
$\beta_0$ (intercept) and $\beta_1$ (slope) are the model parameters,
$\epsilon$ is the error.
The parameters are estimated using data and their point estimates are denoted by $b_0$ (intercept estimate) and $b_1$ (slope estimate).
In linear regression, x is called the explanatory or predictor variable while y is called the response variable.
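To build intuition for the roles of $\beta_0$, $\beta_1$, and $\epsilon$, here is a minimal simulation sketch (not from the lecture; the parameter values and sample size are arbitrary) that generates data from a simple linear model and plots it:
# simulate y = beta_0 + beta_1*x + epsilon with chosen parameter values
set.seed(1)                                # for reproducibility
beta_0 <- 40; beta_1 <- 0.6                # arbitrary "true" intercept and slope
sigma  <- 2.5                              # standard deviation of the error
x <- runif(100, min = 75, max = 100)       # arbitrary predictor values
epsilon <- rnorm(100, mean = 0, sd = sigma)
y <- beta_0 + beta_1*x + epsilon           # response generated by the model
plot(x, y)                                 # the points scatter around a line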
Let's look at an example data set for which linear regression is a potentially useful model.
The possum data set records measurements of 104 brushtail possums from Australia and New Guinea; the first few rows of the data are shown below.
head(possum, 4)
## # A tibble: 4 x 8
##    site pop   sex     age head_l skull_w total_l tail_l
##   <int> <fct> <fct> <int>  <dbl>   <dbl>   <dbl>  <dbl>
## 1     1 Vic   m         8   94.1    60.4    89     36
## 2     1 Vic   f         6   92.5    57.6    91.5   36.5
## 3     1 Vic   f         6   94      60      95.5   39
## 4     1 Vic   f         6   93.2    57.1    92     38
Suppose that we as researchers are interested in studying the relationship between the head length (head_l) and total length (total_l) measurements of the brushtail possum of Australia.
Note that head length (head_l) and total length (total_l) are both (continuous) numerical variables.
The next slide displays a scatterplot of head_l versus total_l.
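A scatterplot along those lines can be produced with ggplot2; this is a sketch assuming ggplot2 (and the pipe, as used elsewhere in these slides) is loaded, with geom_smooth overlaying a least squares line:
possum %>% ggplot(aes(x = total_l, y = head_l)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)   # overlay the least squares line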
The scatterplot of head_l versus total_l includes a line through the data; later we discuss how this line is obtained.
Correlation, which always takes values between -1 and 1, is a statistic that describes the strength of the linear relationship between two variables. Correlation is denoted by R.
In R, correlation is computed with the cor command. For example, the correlation between the head_l and total_l variables in the possum data set is computed as
cor(possum$head_l,possum$total_l)
## [1] 0.6910937
We now begin to discuss the details of how to fit a simple linear regression model to data.
The approach we take is called least squares regression.
The idea is to choose parameter estimates that minimize all of the residuals simultaneously. That is, for each observed data point $(x_i, y_i)$, we find $b_0$ and $b_1$ such that if $\hat{y}_i = b_0 + b_1 x_i$, then
$$\text{RSS} = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
is as small as possible.
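As an illustration (not part of the lecture code), the sketch below writes the RSS as a function of a candidate intercept and slope and lets a general-purpose optimizer search for the minimizing pair; for the possum data it should land on essentially the same estimates as the formulas given later.
# RSS for a candidate intercept b[1] and slope b[2], given data vectors x and y
rss <- function(b, x, y) {
  y_hat <- b[1] + b[2]*x      # fitted values for this candidate line
  sum((y_hat - y)^2)          # residual sum of squares
}
x <- possum$total_l; y <- possum$head_l
rss(c(40, 0.6), x, y)                                               # RSS for one arbitrary candidate line
optim(par = c(0, 1), fn = rss, x = x, y = y, method = "BFGS")$par   # numerically minimize RSS over (b0, b1)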
Linearity. The data should show a linear trend.
Normality. Generally, the distribution of the residuals should be close to normal.
Constant Variance. The variability of points around the least squares line remains roughly constant. Residual plots are a good way to check this condition.
Independence. We want to avoid fitting a line to data via least squares whenever there is dependence between consecutive data points.
The next slide shows plots of data where at least one of the conditions for least squares regression fails to hold.
In the first column, the linearity condition fails. In the second column, the normality condition fails.
In the third column, the constant variance condition fails. In the fourth column, the independence condition fails.
Notice how in each case, the residual plot can be used to diagnose problems with a least squares linear regression fit.
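As a sketch of how such a residual plot can be made in R for any fitted model (here using the possum model that appears later in the lecture):
fit <- lm(head_l ~ total_l, data = possum)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")   # residuals vs. fitted values
abline(h = 0, lty = 2)                             # reference line at zero
hist(resid(fit))                                   # check the normality condition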
The least squares estimates for the slope and intercept are
$$b_1 = \frac{s_y}{s_x} R, \qquad b_0 = \bar{y} - b_1 \bar{x},$$
where
R is the correlation between x and y,
$s_y$ and $s_x$ are the sample standard deviations for y and x, and
$\bar{y}$ and $\bar{x}$ are the sample means for y and x.
Let's compute these estimates for the possum data set with x the total_l variable and y the head_l variable.
x <- possum$total_l; y <- possum$head_l
x_bar <- mean(x); y_bar <- mean(y)
s_x <- sd(x); s_y <- sd(y)
R <- cor(x,y)
(b_1 <- (s_y/s_x)*R)
## [1] 0.5729013
(b_0 <- y_bar - b_1*x_bar)
## [1] 42.70979
R provides a function lm (linear model) that will compute these values and much more for us. The lm command is used as follows:
lm(head_l ~ total_l, data=possum)
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Coefficients:
## (Intercept)      total_l  
##     42.7098       0.5729
For a linear model,
The slope describes the estimated difference in the y variable if the explanatory variable x for a case happened to be one unit larger.
The intercept describes the average outcome of y if x=0, provided the linear model is valid all the way to x=0, which in many applications is not the case.
To evaluate the strength of a linear fit, we compute $R^2$ (R-squared). The value of $R^2$ tells us the percent of variation in the response that is explained by the explanatory variable (see the check after this list).
There are some pitfalls in interpreting the results of a linear model. In particular,
Applying a model to estimate values outside of the realm of the original data is called extrapolation. Generally, extrapolation is unreliable.
In many cases, even when there is a real association between variables, we cannot interpret a causal connection between the variables.
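For simple linear regression, $R^2$ is just the square of the correlation; a quick check along these lines (using the possum fit) illustrates this:
fit <- lm(head_l ~ total_l, data = possum)
summary(fit)$r.squared                     # R-squared reported by summary()
cor(possum$head_l, possum$total_l)^2       # squared correlation: same value, about 0.478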
Before we discuss further details of regression, let's look at a detailed example of fitting and interpreting a linear model.
We will look at a linear model for the cheddar data set from the faraway package.
Let's do this example together in RStudio.
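As a rough sketch of how that session might begin (assuming the faraway package is installed; the variable names taste and Lactic are taken from the cheddar documentation, and the choice of explanatory variable here is only for illustration):
library(faraway)                                    # provides the cheddar data set
data(cheddar)
head(cheddar)
cheddar_fit <- lm(taste ~ Lactic, data = cheddar)   # one possible simple model
summary(cheddar_fit)
plot(fitted(cheddar_fit), resid(cheddar_fit))       # residual plot to check conditions
abline(h = 0, lty = 2)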
Outliers can have a strong influence on the least squares line.
Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with high leverage.
A data point is called an influential point if, had we fitted the line without it, the influential point would have been unusually far from the least squares line.
The next slide shows data with outliers together with the corresponding regression line and residual plot.
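Base R provides diagnostics along these lines; the sketch below (using the possum fit) computes the leverage of each observation and Cook's distance, a common measure of influence:
fit  <- lm(head_l ~ total_l, data = possum)
lev  <- hatvalues(fit)               # leverage of each observation
cook <- cooks.distance(fit)          # Cook's distance measures influence
head(sort(lev,  decreasing = TRUE))  # observations with the highest leverage
head(sort(cook, decreasing = TRUE))  # observations with the most influence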
Recall the population model for simple linear regression:
$$y = \beta_0 + \beta_1 x + \epsilon$$
Least squares is a method for obtaining point estimates $b_0$ and $b_1$ for the parameters $\beta_0$ and $\beta_1$. Thus, $\beta_0$ and $\beta_1$ are unknowns that correspond to population values that we want to infer information about.
A somewhat subtle point is that we also do not know the population standard deviation $\sigma$ for the error $\epsilon$. This is an additional model parameter.
We would like answers to the following questions:
How do we obtain confidence intervals for $\beta_0$ and $\beta_1$? Specifically, how do we get the standard error?
How do we conduct hypothesis tests related to the parameters $\beta_0$ and $\beta_1$?
Confidence intervals for the model parameters take the familiar form
$$b_i \pm t^*_{df} \times SE_{b_i}$$
All the numerical information you need to obtain confidence intervals for $\beta_0$ and $\beta_1$ is provided in the output of the summary command for a linear model fit with lm.
Suppose we fit a linear model for the possum data again:
(lm_fit <- lm(head_l ~ total_l, data=possum))
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Coefficients:
## (Intercept)      total_l  
##     42.7098       0.5729
summary(lm_fit)
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1877 -1.5340 -0.3345  1.2788  7.3968 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.70979    5.17281   8.257 5.66e-13 ***
## total_l      0.57290    0.05933   9.657 4.68e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.595 on 102 degrees of freedom
## Multiple R-squared:  0.4776, Adjusted R-squared:  0.4725 
## F-statistic: 93.26 on 1 and 102 DF,  p-value: 4.681e-16
Note the Estimate column, the Std. Error column, and the reported degrees of freedom.
From the summary command output, we can construct confidence intervals for $\beta_0$ and $\beta_1$. First we observe that the degrees of freedom is 102; then the $t^*_{102}$ value for a 95% CI is
(t_ast <- -qt((1.0-0.95)/2,df=102))
## [1] 1.983495
(beta_0_CI <- 42.71 + 1.98*c(-1,1)*5.17)
## [1] 32.4734 52.9466
(beta_1_CI <- 0.57 + 1.98*c(-1,1)*0.06)
## [1] 0.4512 0.6888
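Base R can also produce these intervals directly with confint(); up to the rounding used in the hand computation above, the results should agree:
confint(lm_fit, level = 0.95)        # 95% CIs for the intercept and slope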
There are actually several types of hypothesis tests one can conduct relating to linear regression models.
The most common test is of the form
$H_0: \beta_1 = 0$. The true linear model has slope zero. Versus
$H_A: \beta_1 \neq 0$. The true linear model has a slope different from zero.
The summary command output includes a p-value for testing such a hypothesis. However, be aware that the lm command does not check whether the conditions for a linear model are met, and the results for inference on model parameters are only valid if those conditions hold.
The relevant output from summary(lm_fit) is
summary(lm_fit)
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1877 -1.5340 -0.3345  1.2788  7.3968 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.70979    5.17281   8.257 5.66e-13 ***
## total_l      0.57290    0.05933   9.657 4.68e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.595 on 102 degrees of freedom
## Multiple R-squared:  0.4776, Adjusted R-squared:  0.4725 
## F-statistic: 93.26 on 1 and 102 DF,  p-value: 4.681e-16
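The t value and p-value reported for total_l can be reproduced by hand from the estimate and standard error; a quick sketch using the numbers above:
t_stat <- 0.57290/0.05933            # estimate divided by its standard error
t_stat                               # about 9.66, matching the t value column
2*pt(-abs(t_stat), df = 102)         # two-sided p-value, matching Pr(>|t|)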
To fit a linear model in R we use the lm command. The necessary input is a formula of the form y ~ x and the data. The summary command outputs all of the information relevant for inferential purposes.
However, the output from summary is not necessarily formatted in the most convenient way.
Another approach to working with regression model output is provided by functions in the broom package. For example, the tidy function from broom displays the results of the model fit:
tidy(lm_fit)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   42.7      5.17        8.26 5.66e-13
## 2 total_l        0.573    0.0593      9.66 4.68e-16
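The broom functions can also do a bit more of the work; for example, recent versions of tidy accept a conf.int argument, and glance gives a one-row model summary (a sketch, assuming these features are available in the installed broom version):
tidy(lm_fit, conf.int = TRUE, conf.level = 0.95)   # estimates with 95% confidence intervals
glance(lm_fit)                                     # R-squared, F statistic, etc. in one row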
In previous lectures, we studied inference for a difference of means. There are also statistical methods for comparing more than two means. The primary method is called analysis of variance (ANOVA), see 7.5.
ANOVA uses a single hypothesis to check whether the means across many groups are equal:
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$. The mean outcome is the same across all groups. Versus
$H_A$: At least one mean is different.
We must check three conditions for ANOVA:
(1) Observations are independent across groups. (2) The data within each group are nearly normal. (3) The variability across each group is about equal.
As an example, consider the chickwts data set, which is built into R and records chick weights under different feed supplements:
head(chickwts)
##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
chickwts %>% ggplot(aes(x=feed,y=weight)) + geom_boxplot()
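To see the group centers and spreads numerically, one could summarize the data by feed group with dplyr (a sketch, assuming dplyr is loaded):
chickwts %>%
  group_by(feed) %>%
  summarize(n = n(), mean_weight = mean(weight), sd_weight = sd(weight))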
Analysis of variance (ANOVA) is used to test whether the mean outcome differs across 2 or more groups.
ANOVA uses a test statistic denoted by F, which represents a standardized ratio of variability in the sample means relative to the variability within groups.
ANOVA uses an F distribution to compute a p-value that corresponds to the probability of observing an F statistic value that is as or more extreme than the sample F statistic value under the assumption that the null hypothesis is true.
We will see how to conduct ANOVA and an F-test using R.
Before conducting ANOVA, we should discuss the necessary conditions for an ANOVA analysis.
There are three conditions we must check for an ANOVA:
Independence. If the data are a simple random sample, this condition is satisfied.
Normality. As with one- and two-sample testing for means, the normality assumption is especially important when the sample size is small. Grouped histograms are a good way to diagnose potential problems with the normality assumption for ANOVA.
Constant variance. The variance in the groups should be close to equal. This assumption can be checked with side-by-side box plots.
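As a sketch of how the ANOVA itself can be carried out for the chickwts data (using the base aov function; the lecture may use a different workflow), together with grouped histograms for the normality check:
chick_aov <- aov(weight ~ feed, data = chickwts)   # one-way ANOVA of weight by feed
summary(chick_aov)                                 # F statistic and p-value for H0: all means equal
chickwts %>% ggplot(aes(x = weight)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ feed)                               # one histogram per feed group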