Lecture 1
Logistics and Introduction to Data
JMG
MATH 204

Welcome!

Welcome to MATH 204 Introduction to Statistics!

The course description, learning outcomes, grade scheme, etc. may be found in the course syllabus posted on the course learning management system.
Please make sure you have read the syllabus carefully before the next class meeting.
If you have any questions regarding the syllabus, feel free to ask the instructor in person (LSC 319A) or via email (jason.graham@scranton.edu).
Our first quiz will contain questions about the syllabus.
The number one rule for this course is: ask a lot of questions.
The number two rule for this course is: bring your computer to class each day.

The required textbook for this course is the 4th edition of OpenIntro Statistics, which is available for free and can be downloaded as a PDF file. You may also purchase a print copy of the book for a modest price.

There are a number of very useful resources associated with this book such as lecture videos, lecture slides, data sets, etc. We will make extensive use of many of these additional resources.

R and RStudio

In this course, we will exploit the power of the R statistical computing environment and the interface to R provided by RStudio. These can both be accessed via a web browser by using RStudio Cloud.
R, RStudio, and RStudio Cloud are all free.

You must sign up for a free RStudio Cloud account (you can use an existing Google account if you have one). My plan this semester is to add everyone to an RStudio Cloud workspace where you will be able to access homework and lab assignments.

What can you do with R?

Great question!

What can you do with R?

R can be used as a calculator:

2 + 2

## [1] 4

R can compute summary statistics of data:

mean(c(5,7,2,3,2,5,4,7))

## [1] 4.375

R can simulate random sampling:

rnorm(10)

##  [1]  0.6341227  0.1434006 -0.9974754  2.4204931 -2.0205946 -3.2013405
##  [7] -0.4428438  1.8298432 -1.7529270  1.3838205

R can do much, much more ...
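
Because rnorm(10) draws random numbers, its output changes every time it is run. The following is a minimal sketch, assuming you want results that can be reproduced, of fixing the random seed with set.seed() and of computing two more summary statistics. The seed value 204 and the object names are arbitrary choices made for illustration, not part of the course materials.

```r
# Fix the random number generator's seed so the draws can be reproduced.
# The seed value 204 is arbitrary (an assumption for illustration).
set.seed(204)
first_draw <- rnorm(10)    # 10 draws from a standard normal distribution

set.seed(204)              # resetting the seed replays the same draws
second_draw <- rnorm(10)

identical(first_draw, second_draw)   # TRUE: same seed, same numbers

# Other summary statistics work just like mean()
x <- c(5, 7, 2, 3, 2, 5, 4, 7)   # the values passed to mean() above
median(x)   # middle value of the sorted data
sd(x)       # sample standard deviation
```

Without the call to set.seed(), two runs of rnorm(10) will almost surely produce different numbers, which is exactly what "simulating random sampling" means.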

Learning R

R is a programming language, and it takes a little time to learn. We will soon work through an introduction to R and RStudio.
Once we get over the starting hurdle for learning R, I think you will really enjoy it. Plus, R will make learning statistics a much more enjoyable and useful experience.

If you want to start learning some R, or if you find that you want to learn more R than what we cover in this course, a great resource is the swirl package which allows you to learn R interactively within R. Now that's meta 😄.

I'm also happy to provide you with other resources upon request.

Ziggy

Our dog Ziggy says, "hello!"

What is MATH 204 About?

Obviously MATH 204 is about statistics, but what does this mean?
Statistics is fundamentally about data: how to collect it, how to analyze it, and how to use it to make inferences and draw conclusions about the real world.

Section 1.1 of the textbook presents a case study to motivate the study of statistics. There is also a corresponding lecture video which you are asked to watch on your own time. For your convenience, the video is included in the next slide.

Some things to think about when watching the video are:
- What is the research question?
- What does the video say about random fluctuation?

Case Study: using stents to prevent strokes

After watching the case study video, explain the difference between a treatment group and a control group. 

On Data and Its Structure

The term "data" can be interpreted very broadly. However, the data that one typically analyzes using statistics or statistical methods have some common features that we will take a moment to point out:

The data is "structured" in the sense that it has an underlying order to it. This will be explained in greater detail soon.

The data is typically a "sample" in that it is only a small part of all the data one could possibly collect.

An Example

Consider the following question:

How much sleep do University of Scranton students get during the first week of classes?

Can you think of a way to answer this question?

One way, at least in principle, would be to ask every single student at the University of Scranton to tell us how much they sleep during the first week of classes.

Do you think this is a good way to answer our question or not?

On Populations

Observe that our question¹ is about a population², in this case, all University of Scranton students. If we (hypothetically) record how much every single UofS student sleeps during the first week of classes, it might look something like this:

##    r_number  Sun Mon Tues Wed Thurs Fri Sat   living  year college
## 1 R01920433  7.4 6.0  4.8 7.7   5.3 9.6 9.0   Martin First    PCPS
## 2 R01780024 10.0 7.2  3.9 7.3   4.9 9.3 9.9   Lynett First    KSOM
## 3 R01816495  8.6 6.7  7.2 5.6   5.3 8.5 8.7   Lynett First     CAS
## 4 R01647948  6.8 6.5  8.3 5.6   4.8 9.4 9.0 Driscoll First    PCPS
## 5 R01782171  8.2 6.6  6.1 5.6   4.8 8.3 9.4    Fitch First     CAS

(Note: There would be around 4,000 rows, so only some are shown here.)

A population is often so large that it is infeasible to observe every individual in it.

In practice, we take a sample and try to use the sample to infer something about the population.

[1] Recall that our question is "How much sleep do University of Scranton students get during the first week of classes?"

[2] Read section 1.3.1 for a discussion of populations and samples.
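
The idea of using a sample to learn about a population can be sketched in R. The population below is simulated, not real University of Scranton data: the size of 4,000 comes from the note above, while the mean of 7 hours, the standard deviation of 1.2 hours, and all object names are assumptions made only for illustration.

```r
# Hypothetical population: average nightly sleep (hours) for 4,000 students.
# These values are simulated; the mean and sd are made-up assumptions.
set.seed(1)
population <- rnorm(4000, mean = 7, sd = 1.2)

# Draw a simple random sample of n = 25 students, without replacement
n <- 25
sleep_sample <- sample(population, size = n)

mean(population)     # population mean (usually unknown in practice)
mean(sleep_sample)   # sample mean, used to estimate the population mean
```

The two means will typically be close but not equal. How close we can expect them to be is a question that statistical inference addresses.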

On Samples

Here is a sample of size $n = 25$ from the population:

##     r_number  Sun Mon Tues Wed Thurs  Fri Sat        living   year college
## 1  R01718887  7.9 6.3  5.3 5.0   4.6  9.7 9.2 Junior/Senior Fourth     CAS
## 2  R01943517  7.4 5.8  5.9 6.2   4.9  7.5 8.9       McCourt  First     CAS
## 3  R01866257  9.0 6.5  6.5 6.4   4.0  8.5 9.7 Junior/Senior Fourth    KSOM
## 4  R01942243  5.8 5.5  5.4 6.0   4.5  9.1 8.6    Off Campus  Third     CAS
## 5  R01934011  6.4 5.1  5.5 7.3   5.2  9.6 8.9    Off Campus  Third    PCPS
## 6  R01691560  9.4 7.2  6.0 7.0   5.2  9.5 9.2       Gavigan Second    PCPS
## 7  R01718539  7.7 5.2  5.8 5.7   5.4 10.2 8.5 Junior/Senior Fourth    KSOM
## 8  R01615516  6.4 5.5  5.2 7.4   4.8 11.7 9.1      Driscoll  First     CAS
## 9  R01769847  6.6 5.7  7.6 5.1   5.1  8.4 9.3    Off Campus Fourth     CAS
## 10 R01850111  9.3 6.4  7.0 6.6   5.2  7.5 9.5    Off Campus  Third    KSOM
## 11 R01574251  9.5 5.3  6.1 2.8   5.0  9.3 9.3            DE  First     CAS
## 12 R01599827  8.6 7.2  6.0 7.4   3.2  9.5 8.8       Gavigan Second    KSOM
## 13 R01590698  9.0 6.3  6.0 6.7   5.6 12.3 9.5         Casey  First     CAS
## 14 R01950549  6.6 6.3  7.3 6.1   4.6  9.7 8.9    Off Campus Second     CAS
## 15 R01804370 10.4 7.2  4.8 4.2   4.8 10.4 9.1 Junior/Senior Fourth    KSOM
## 16 R01839042  6.9 6.2  5.7 6.6   5.3  9.5 9.2       Gavigan Second     CAS
## 17 R01937793  7.8 4.4  8.0 4.3   5.0  8.1 8.8    Off Campus  Third    KSOM
## 18 R01716376  8.9 6.2  5.2 5.1   5.3  7.0 8.1    Off Campus  Third     CAS
## 19 R01937156  8.2 7.1  4.6 5.5   4.6  8.1 9.0       Gavigan Second    KSOM
## 20 R01589019  6.7 7.2  5.5 6.3   4.6  7.5 8.9 Junior/Senior  Third     CAS
## 21 R01632235  7.9 5.5  6.3 5.1   4.7  7.9 8.3  Giblin-Kelly  First     CAS
## 22 R01624959  8.7 6.9  6.3 6.6   5.5  8.6 8.5       McCourt  First     CAS
## 23 R01913524  6.0 4.4  6.6 7.0   5.7 10.1 9.6 Junior/Senior Fourth     CAS
## 24 R01578674  5.1 6.4  5.6 7.8   4.5  7.2 9.2 Junior/Senior  Third     CAS
## 25 R01752767  7.4 5.4  5.1 5.2   5.3  9.5 8.4 Junior/Senior Fourth     CAS

Sampling

How did we obtain the sample data? We selected 25 UofS students at random. The "at random" part is important and we will return to this point shortly.
Section 1.3 covers sampling principles in detail.

Before we get into sampling principles, let's take a moment to reflect on how our (sample) data is represented.

Data Organization

Let's look at the first five rows of our sample data:

##    r_number Sun Mon Tues Wed Thurs Fri Sat        living   year college
## 1 R01718887 7.9 6.3  5.3 5.0   4.6 9.7 9.2 Junior/Senior Fourth     CAS
## 2 R01943517 7.4 5.8  5.9 6.2   4.9 7.5 8.9       McCourt  First     CAS
## 3 R01866257 9.0 6.5  6.5 6.4   4.0 8.5 9.7 Junior/Senior Fourth    KSOM
## 4 R01942243 5.8 5.5  5.4 6.0   4.5 9.1 8.6    Off Campus  Third     CAS
## 5 R01934011 6.4 5.1  5.5 7.3   5.2 9.6 8.9    Off Campus  Third    PCPS

Our data is organized into rows and columns, a so-called data matrix or data frame. Each row corresponds to a single observation, which in this example is a single student.
The columns of our data correspond to variables, that is, the information or characteristics we observe and record about our observations.

Variables can usually be classified according to a type system.
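
The row-and-column layout described above can be sketched in R by typing part of the sample into a data frame. This is only an illustration: three of the eleven columns are entered by hand from the first five rows shown above, and the object name sleep is an assumption.

```r
# Rebuild part of the first five rows of the sample as a data frame.
# Only three of the eleven columns are typed in, to keep the sketch short.
sleep <- data.frame(
  r_number = c("R01718887", "R01943517", "R01866257", "R01942243", "R01934011"),
  Sun      = c(7.9, 7.4, 9.0, 5.8, 6.4),
  college  = c("CAS", "CAS", "KSOM", "CAS", "PCPS")
)

nrow(sleep)    # number of observations (rows)
ncol(sleep)    # number of columns
names(sleep)   # column names
sleep[2, ]     # the second observation (one student)
sleep$Sun      # one variable: hours of sleep on Sunday
```

Selecting a row returns everything recorded about one student, while selecting a column returns one variable across all students.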

Variables and Their Types

r_number - not a variable
Sun through Sat - continuous numerical
living & college - nominal categorical
year - ordinal categorical

We repeat the first five rows of our sample data:

##    r_number Sun Mon Tues Wed Thurs Fri Sat        living   year college
## 1 R01718887 7.9 6.3  5.3 5.0   4.6 9.7 9.2 Junior/Senior Fourth     CAS
## 2 R01943517 7.4 5.8  5.9 6.2   4.9 7.5 8.9       McCourt  First     CAS
## 3 R01866257 9.0 6.5  6.5 6.4   4.0 8.5 9.7 Junior/Senior Fourth    KSOM
## 4 R01942243 5.8 5.5  5.4 6.0   4.5 9.1 8.6    Off Campus  Third     CAS
## 5 R01934011 6.4 5.1  5.5 7.3   5.2 9.6 8.9    Off Campus  Third    PCPS

r_number is not a variable because it is a unique identifier for each observation.

Note that just because the value of a variable is a number does not necessarily make it numerical. For example, we could have recorded year as 1, 2, 3, or 4 instead of "First", "Second", "Third", or "Fourth". A useful rule of thumb: if it doesn't make sense to compute the average of a variable, then it is not numerical.
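
The point about year being coded as 1 through 4 can be sketched in R. The values below are the year and college entries of the first five students in the sample; the object names are assumptions, and factor() is one common way, not the only way, to tell R that a variable is categorical.

```r
# year recorded as the numbers 1-4: R treats it as numeric by default
year_code <- c(4, 1, 4, 3, 3)   # first five students in the sample above
mean(year_code)                 # R computes this, but it is not meaningful

# Store it as an ordered factor so R treats it as ordinal categorical
year <- factor(year_code,
               levels  = 1:4,
               labels  = c("First", "Second", "Third", "Fourth"),
               ordered = TRUE)
table(year)          # counts per category
year[1] > year[2]    # order comparisons make sense for ordinal data

# A nominal categorical variable: categories with no natural order
college <- factor(c("CAS", "CAS", "KSOM", "CAS", "PCPS"))
levels(college)
```

R will happily average the numeric codes, so it is up to the analyst to decide whether a variable is numerical or categorical.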

Categorical Variable Types

Data Basics Lecture Video

You should watch the following video on your own time.

Question: Which variable in the county data set discussed in the video is discrete and why?

Another Example

Consider the possum data set from the openintro R package that accompanies the course textbook. The first few rows are shown here:

library(openintro) # load the package that provides the possum data
head(possum) # R command used to print the first few rows of a data frame

## # A tibble: 6 × 8
##    site pop   sex     age head_l skull_w total_l tail_l
##   <int> <fct> <fct> <int>  <dbl>   <dbl>   <dbl>  <dbl>
## 1     1 Vic   m         8   94.1    60.4    89     36  
## 2     1 Vic   f         6   92.5    57.6    91.5   36.5
## 3     1 Vic   f         6   94      60      95.5   39  
## 4     1 Vic   f         6   93.2    57.1    92     38  
## 5     1 Vic   f         2   91.5    56.3    85.5   36  
## 6     1 Vic   f         1   93.1    54.8    90.5   35.5

State the type of each variable in the data matrix.

- site, pop, and sex are nominal categorical
- age is discrete numerical
- columns head_l to tail_l are continuous numerical

Sampling Principles

The first step in conducting research is to identify topics or questions that are to be investigated. For example,

We need to consider how data are collected and how samples are obtained.

Importantly, we need to avoid picking a biased sample as much as possible.

Data Collection

Watch this video on your own time to reinforce the concepts of populations, samples, and bias.

Biased Samples

Consider the question:

One approach to data collection for answering this question is to select some subset of students at the U and ask them about their favorite Starbucks drink.

For example, out of convenience we could ask everyone in this class. In this case, our sample would be the students in this class.

However, restricting our sample to students in a single class might not produce a sample that is sufficiently representative of the entire population. In other words, we may introduce bias by taking as our sample only students in this class.

It is preferable, indeed essential, to obtain a sample by choosing individuals at random from the population. Random samples reduce bias!
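
The difference between a convenience sample and a random sample can be explored with a small simulation. Everything below is hypothetical: the population is simulated, and the assumption that first-year students prefer coffee-based drinks less often (30%) than other students (60%) is invented purely to create a group that differs from the rest of the population.

```r
# Hypothetical population of 4,000 students; all values are simulated.
set.seed(42)
N <- 4000
year <- sample(c("First", "Second", "Third", "Fourth"), N, replace = TRUE)

# Made-up assumption: first-years prefer coffee drinks with probability 0.3,
# all other students with probability 0.6 (1 = prefers coffee, 0 = does not).
prefers_coffee <- rbinom(N, size = 1, prob = ifelse(year == "First", 0.3, 0.6))

# Repeat each sampling scheme 1,000 times and record the sample proportion.
# Convenience sample: 25 students drawn only from first-years.
conv_props <- replicate(1000, mean(sample(prefers_coffee[year == "First"], 25)))
# Simple random sample: 25 students drawn from the whole population.
srs_props <- replicate(1000, mean(sample(prefers_coffee, 25)))

mean(prefers_coffee)   # true population proportion
mean(conv_props)       # systematically too low: a biased sampling scheme
mean(srs_props)        # close to the true proportion on average
```

Individual random samples still vary from one to the next, but on average they land near the true population value, whereas the convenience sample is off in the same direction every time.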

Sampling Strategies

Let's watch this video together and discuss.

Question: What are the differences between experimental and observational studies?

Common Sampling Strategies