Linear regression is a statistical method for fitting a line to data.
Recall that a (non-vertical) line in the x,y-plane is determined by
y = slope × x + intercept
There are two aspects to fitting a line to data that we will study:
Estimating the slope and intercept values, and
Assessing the uncertainty of our estimates for the slope and intercept values.
In this lecture, we cover all of the concepts necessary to understand how to carry out and interpret linear regression.
We encourage you to watch the video on the next slide to help in getting introduced to linear regression.
After this lecture, you should
Understand the basic principles of simple linear regression: parameter estimates, residuals, and correlation. (8.1, 8.2)
Know the conditions for least squares regression: linearity, normality, constant variance, and independence. (8.2)
Know how to diagnose problems with a linear fit by least squares regression. (8.3)
Understand the methods of inference of least squares regression. (8.4)
Know how to obtain a linear fit using R with the lm function.
Be able to assess and interpret the results of a linear fit using R with the lm function.
$$y = \beta_0 + \beta_1 x + \epsilon$$
where
$\beta_0$ (intercept) and $\beta_1$ (slope) are the model parameters,
$\epsilon$ is the error.
The parameters are estimated using data and their point estimates are denoted by $b_0$ (intercept estimate) and $b_1$ (slope estimate).
In linear regression, x is called the explanatory or predictor variable while y is called the response variable.
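To build intuition for the roles of $\beta_0$, $\beta_1$, and $\epsilon$, here is a minimal simulation sketch (not from the lecture; the parameter values and sample size are arbitrary) that generates data from a simple linear model and plots it:
# simulate y = beta_0 + beta_1*x + epsilon with chosen parameter values
set.seed(1)                                # for reproducibility
beta_0 <- 40; beta_1 <- 0.6                # arbitrary "true" intercept and slope
sigma  <- 2.5                              # standard deviation of the error
x <- runif(100, min = 75, max = 100)       # arbitrary predictor values
epsilon <- rnorm(100, mean = 0, sd = sigma)
y <- beta_0 + beta_1*x + epsilon           # response generated by the model
plot(x, y)                                 # the points scatter around a line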
Let's look at an example data set for which linear regression is a potentially useful model.
The possum data set records measurements of 104 brushtail possums from Australia and New Guinea; the first few rows of the data are shown below.
head(possum, 4)
## # A tibble: 4 x 8
##    site pop   sex     age head_l skull_w total_l tail_l
##   <int> <fct> <fct> <int>  <dbl>   <dbl>   <dbl>  <dbl>
## 1     1 Vic   m         8   94.1    60.4    89     36
## 2     1 Vic   f         6   92.5    57.6    91.5   36.5
## 3     1 Vic   f         6   94      60      95.5   39
## 4     1 Vic   f         6   93.2    57.1    92     38
Suppose that we as researchers are interested in studying the relationship between the head length (head_l) and total length (total_l) measurements of the brushtail possum of Australia.
Note that head length (head_l) and total length (total_l) are both (continuous) numerical variables.
The next slide displays a scatterplot of head_l versus total_l.
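A scatterplot along those lines can be produced with ggplot2; this is a sketch assuming ggplot2 (and the pipe, as used elsewhere in these slides) is loaded, with geom_smooth overlaying a least squares line:
possum %>% ggplot(aes(x = total_l, y = head_l)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)   # overlay the least squares line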
The scatterplot of head_l versus total_l includes a line through the data; later we discuss how this line is obtained.
Correlation, which always takes values between -1 and 1, is a statistic that describes the strength of the linear relationship between two variables. Correlation is denoted by R.
In R, correlation is computed with the cor command. For example, the correlation between the head_l and total_l variables in the possum data set is computed as
cor(possum$head_l,possum$total_l)
## [1] 0.6910937
We now begin to discuss the details of how to fit a simple linear regression model to data.
The approach we take is called least squares regression.
The idea is to choose parameter estimates that minimize all of the residuals simultaneously. That is, for each observed data point $(x_i, y_i)$, we find $b_0$ and $b_1$ such that if $\hat{y}_i = b_0 + b_1 x_i$, then
$$\text{RSS} = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
is as small as possible.
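As an illustration (not part of the lecture code), the sketch below writes the RSS as a function of a candidate intercept and slope and lets a general-purpose optimizer search for the minimizing pair; for the possum data it should land on essentially the same estimates as the formulas given later.
# RSS for a candidate intercept b[1] and slope b[2], given data vectors x and y
rss <- function(b, x, y) {
  y_hat <- b[1] + b[2]*x      # fitted values for this candidate line
  sum((y_hat - y)^2)          # residual sum of squares
}
x <- possum$total_l; y <- possum$head_l
rss(c(40, 0.6), x, y)                                               # RSS for one arbitrary candidate line
optim(par = c(0, 1), fn = rss, x = x, y = y, method = "BFGS")$par   # numerically minimize RSS over (b0, b1)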
Linearity. The data should show a linear trend.
Normality. Generally, the distribution of the residuals should be close to normal.
Constant Variance. The variability of points around the least squares line remains roughly constant. Residual plots are a good way to check this condition.
Independence. We want to avoid fitting a line to data via least squares whenever there is dependence between consecutive data points.
The next slide shows plots of data where at least one of the conditions for least squares regression fails to hold.
In the first column, the linearity condition fails. In the second column, the normality condition fails.
In the third column, the constant variance condition fails. In the fourth column, the independence condition fails.
Notice how in each case, the residual plot can be used to diagnose problems with a least squares linear regression fit.
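As a sketch of how such a residual plot can be made in R for any fitted model (here using the possum model that appears later in the lecture):
fit <- lm(head_l ~ total_l, data = possum)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")   # residuals vs. fitted values
abline(h = 0, lty = 2)                             # reference line at zero
hist(resid(fit))                                   # check the normality condition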
The least squares estimates for the slope and intercept are
$$b_1 = \frac{s_y}{s_x} R, \qquad b_0 = \bar{y} - b_1 \bar{x},$$
where
R is the correlation between x and y,
$s_y$ and $s_x$ are the sample standard deviations for y and x, and
$\bar{y}$ and $\bar{x}$ are the sample means for y and x.
Let's compute these estimates for the possum data set with x the total_l variable and y the head_l variable.
x <- possum$total_l; y <- possum$head_l
x_bar <- mean(x); y_bar <- mean(y)
s_x <- sd(x); s_y <- sd(y)
R <- cor(x,y)
(b_1 <- (s_y/s_x)*R)
## [1] 0.5729013
(b_0 <- y_bar - b_1*x_bar)
## [1] 42.70979
R provides a function lm (linear model) that will compute these values and much more for us. The lm command is used as follows:
lm(head_l ~ total_l, data=possum)
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Coefficients:
## (Intercept)      total_l  
##     42.7098       0.5729
For a linear model,
The slope describes the estimated difference in the y variable if the explanatory variable x for a case happened to be one unit larger.
The intercept describes the average outcome of y if x=0, provided the linear model is valid all the way to x=0, which in many applications is not the case.
To evaluate the strength of a linear fit, we compute $R^2$ (R-squared). The value of $R^2$ tells us the percent of variation in the response that is explained by the explanatory variable (see the check after this list).
There are some pitfalls in interpreting the results of a linear model. In particular,
Applying a model to estimate values outside of the realm of the original data is called extrapolation. Generally, extrapolation is unreliable.
In many cases, even when there is a real association between variables, we cannot interpret a causal connection between the variables.
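For simple linear regression, $R^2$ is just the square of the correlation; a quick check along these lines (using the possum fit) illustrates this:
fit <- lm(head_l ~ total_l, data = possum)
summary(fit)$r.squared                     # R-squared reported by summary()
cor(possum$head_l, possum$total_l)^2       # squared correlation: same value, about 0.478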
Before we discuss further details of regression, let's look at a detailed example of fitting and interpreting a linear model.
We will look at a linear model for the cheddar data set from the faraway package.
Let's do this example together in RStudio.
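As a rough sketch of how that session might begin (assuming the faraway package is installed; the variable names taste and Lactic are taken from the cheddar documentation, and the choice of explanatory variable here is only for illustration):
library(faraway)                                    # provides the cheddar data set
data(cheddar)
head(cheddar)
cheddar_fit <- lm(taste ~ Lactic, data = cheddar)   # one possible simple model
summary(cheddar_fit)
plot(fitted(cheddar_fit), resid(cheddar_fit))       # residual plot to check conditions
abline(h = 0, lty = 2)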
Outliers can have a strong influence on the least squares line.
Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with high leverage.
A data point is called an influential point if, had we fitted the line without it, the influential point would have been unusually far from the least squares line.
The next slide shows data with outliers together with the corresponding regression line and residual plot.
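Base R provides diagnostics along these lines; the sketch below (using the possum fit) computes the leverage of each observation and Cook's distance, a common measure of influence:
fit  <- lm(head_l ~ total_l, data = possum)
lev  <- hatvalues(fit)               # leverage of each observation
cook <- cooks.distance(fit)          # Cook's distance measures influence
head(sort(lev,  decreasing = TRUE))  # observations with the highest leverage
head(sort(cook, decreasing = TRUE))  # observations with the most influence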
Recall the population model for simple linear regression:
$$y = \beta_0 + \beta_1 x + \epsilon$$
Least squares is a method for obtaining point estimates $b_0$ and $b_1$ for the parameters $\beta_0$ and $\beta_1$. Thus, $\beta_0$ and $\beta_1$ are unknowns that correspond to population values that we want to infer information about.
A somewhat subtle point is that we also do not know the population standard deviation $\sigma$ for the error $\epsilon$. This is an additional model parameter.
We would like answers to the following questions:
How do we obtain confidence intervals for $\beta_0$ and $\beta_1$? Specifically, how do we get the standard error?
How do we conduct hypothesis tests related to the parameters $\beta_0$ and $\beta_1$?
Confidence intervals for the model parameters take the familiar form
$$b_i \pm t^*_{df} \times SE_{b_i}$$
All the numerical information you need to obtain confidence intervals for $\beta_0$ and $\beta_1$ is provided in the output of the summary command for a linear model fit with lm.
Suppose we fit a linear model for the possum data again:
(lm_fit <- lm(head_l ~ total_l, data=possum))
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Coefficients:
## (Intercept)      total_l  
##     42.7098       0.5729
summary(lm_fit)
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1877 -1.5340 -0.3345  1.2788  7.3968 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.70979    5.17281   8.257 5.66e-13 ***
## total_l      0.57290    0.05933   9.657 4.68e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.595 on 102 degrees of freedom
## Multiple R-squared:  0.4776, Adjusted R-squared:  0.4725 
## F-statistic: 93.26 on 1 and 102 DF,  p-value: 4.681e-16
Note the Estimate column, the Std. Error column, and the reported degrees of freedom.
From the summary command output, we can construct confidence intervals for $\beta_0$ and $\beta_1$. First we observe that the degrees of freedom is 102; then the $t^*_{102}$ value for a 95% CI is
(t_ast <- -qt((1.0-0.95)/2,df=102))
## [1] 1.983495
(beta_0_CI <- 42.71 + 1.98*c(-1,1)*5.17)
## [1] 32.4734 52.9466
(beta_1_CI <- 0.57 + 1.98*c(-1,1)*0.06)
## [1] 0.4512 0.6888
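Base R can also produce these intervals directly with confint(); up to the rounding used in the hand computation above, the results should agree:
confint(lm_fit, level = 0.95)        # 95% CIs for the intercept and slope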
There are actually several types of hypothesis tests one can conduct relating to linear regression models.
The most common test is of the form
$H_0: \beta_1 = 0$. The true linear model has slope zero. Versus
$H_A: \beta_1 \neq 0$. The true linear model has a slope different from zero.
The summary command output includes a p-value for testing such a hypothesis. However, be aware that the lm command does not check whether the conditions for a linear model are met, and the results for inference on model parameters are only valid if those conditions hold.
The relevant output from summary(lm_fit) is
summary(lm_fit)
## 
## Call:
## lm(formula = head_l ~ total_l, data = possum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1877 -1.5340 -0.3345  1.2788  7.3968 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.70979    5.17281   8.257 5.66e-13 ***
## total_l      0.57290    0.05933   9.657 4.68e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.595 on 102 degrees of freedom
## Multiple R-squared:  0.4776, Adjusted R-squared:  0.4725 
## F-statistic: 93.26 on 1 and 102 DF,  p-value: 4.681e-16
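The t value and p-value reported for total_l can be reproduced by hand from the estimate and standard error; a quick sketch using the numbers above:
t_stat <- 0.57290/0.05933            # estimate divided by its standard error
t_stat                               # about 9.66, matching the t value column
2*pt(-abs(t_stat), df = 102)         # two-sided p-value, matching Pr(>|t|)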
To fit a linear model in R we use the lm command. The necessary input is a formula of the form y ~ x and the data. The summary command outputs all of the information relevant for inferential purposes.
However, the output from summary is not necessarily formatted in the most convenient way.
Another approach to working with regression model output is provided by functions in the broom package. For example, the tidy function from broom displays the results of the model fit:
tidy(lm_fit)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   42.7      5.17        8.26 5.66e-13
## 2 total_l        0.573    0.0593      9.66 4.68e-16
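The broom functions can also do a bit more of the work; for example, recent versions of tidy accept a conf.int argument, and glance gives a one-row model summary (a sketch, assuming these features are available in the installed broom version):
tidy(lm_fit, conf.int = TRUE, conf.level = 0.95)   # estimates with 95% confidence intervals
glance(lm_fit)                                     # R-squared, F statistic, etc. in one row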
In previous lectures, we studied inference for a difference of means. There are also statistical methods for comparing more than two means. The primary method is called analysis of variance (ANOVA), see 7.5.
ANOVA uses a single hypothesis to check whether the means across many groups are equal:
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$. The mean outcome is the same across all groups. Versus
$H_A$: At least one mean is different.
We must check three conditions for ANOVA:
(1) Observations are independent across groups. (2) The data within each group are nearly normal. (3) The variability across each group is about equal.
As an example, consider the chickwts data set, which is built into R and records chick weights under different feed supplements:
head(chickwts)
##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
chickwts %>% ggplot(aes(x=feed,y=weight)) + geom_boxplot()
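To see the group centers and spreads numerically, one could summarize the data by feed group with dplyr (a sketch, assuming dplyr is loaded):
chickwts %>%
  group_by(feed) %>%
  summarize(n = n(), mean_weight = mean(weight), sd_weight = sd(weight))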
Analysis of variance (ANOVA) is used to test whether the mean outcome differs across 2 or more groups.
ANOVA uses a test statistic denoted by F, which represents a standardized ratio of variability in the sample means relative to the variability within groups.
ANOVA uses an F distribution to compute a p-value that corresponds to the probability of observing an F statistic value that is as or more extreme than the sample F statistic value under the assumption that the null hypothesis is true.
We will see how to conduct ANOVA and an F-test using R.
Before conducting ANOVA, we should discuss the necessary conditions for an ANOVA analysis.
There are three conditions we must check for an ANOVA:
Independence. If the data are a simple random sample, this condition is satisfied.
Normality. As with one- and two-sample testing for means, the normality assumption is especially important when the sample size is small. Grouped histograms are a good way to diagnose potential problems with the normality assumption for ANOVA.
Constant variance. The variance in the groups should be close to equal. This assumption can be checked with side-by-side box plots.
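As a sketch of how the ANOVA itself can be carried out for the chickwts data (using the base aov function; the lecture may use a different workflow), together with grouped histograms for the normality check:
chick_aov <- aov(weight ~ feed, data = chickwts)   # one-way ANOVA of weight by feed
summary(chick_aov)                                 # F statistic and p-value for H0: all means equal
chickwts %>% ggplot(aes(x = weight)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ feed)                               # one histogram per feed group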