Lecture 1

Logistics and Introduction to Data

JMG

MATH 204

1 / 29

Welcome!

Welcome to MATH 204 Introduction to Statistics!

  • The course description, learning outcomes, grade scheme, etc. may be found in the course syllabus posted on the course learning management system.

  • Please make sure you have read the syllabus carefully before the next class meeting.

  • If you have any questions about the syllabus, feel free to ask the instructor in person (LSC 319A) or via email (jason.graham@scranton.edu).

  • Our first quiz will contain questions about the syllabus.

  • The number one rule for this course: ask a lot of questions.

  • The number two rule for this course: bring your computer to class each day.

2 / 29
  • The required textbook for this course is the 4th edition of OpenIntro Statistics which is available for free and can be downloaded as a pdf file if you wish. You may also, for a modest price, purchase a print copy of the book.

  • There are a number of very useful resources associated with this book such as lecture videos, lecture slides, data sets, etc. We will make extensive use of many of these additional resources.

3 / 29

R and RStudio

  • In this course, we will exploit the power of the R statistical computing environment and the interface to R provided by RStudio. Both can be accessed through a web browser using RStudio Cloud.

  • R, RStudio, and RStudio Cloud are all free.

  • You must sign up for a free RStudio Cloud account (you can use an existing Google account if you have one). My plan this semester is to add everyone to an RStudio Cloud workspace where you will be able to access homework and lab assignments.

  • What can you do with R?

  • Great question!

4 / 29

What can you do with R?

  • R can be used as a calculator:
2 + 2
## [1] 4
  • R can compute summary statistics of data:
mean(c(5,7,2,3,2,5,4,7))
## [1] 4.375
  • R can simulate random sampling:
rnorm(10)
## [1] 0.6341227 0.1434006 -0.9974754 2.4204931 -2.0205946 -3.2013405
## [7] -0.4428438 1.8298432 -1.7529270 1.3838205
  • R can do much, much more ...
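
The three examples above can be collected into one short script. This is only a sketch: the variable names, the call to set.seed(), and the seed value 204 are additions for illustration. Because rnorm() draws new random numbers on every run, fixing the seed is one way to make a simulation repeatable.

```r
# R as a calculator
2 + 2

# Summary statistics for a small data vector (same values as in the slide)
x <- c(5, 7, 2, 3, 2, 5, 4, 7)
mean(x)    # arithmetic mean: 4.375
median(x)  # middle of the sorted values: 4.5

# Simulated random sampling; the seed is an arbitrary choice
set.seed(204)
draws_a <- rnorm(10)
set.seed(204)
draws_b <- rnorm(10)
identical(draws_a, draws_b)  # TRUE: the same seed reproduces the same draws
```

Without the calls to set.seed(), the two sets of draws would almost surely differ, which is why the numbers printed in the slide will not match what you see on your own machine.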
5 / 29

Learning R

  • R is a programming language, and it takes a little time to learn. We will soon work through an introduction to R and RStudio.

  • Once we get over the starting hurdle for learning R, I think you will really enjoy it. Plus, R will make learning statistics a much more enjoyable and useful experience.

  • If you want to start learning some R, or if you find that you want to learn more R than what we cover in this course, a great resource is the swirl package which allows you to learn R interactively within R. Now that's meta 😄.

  • I'm also happy to provide you with other resources upon request.

6 / 29

Ziggy

Our dog Ziggy says, "hello!"

7 / 29

What is MATH 204 About?

  • Obviously MATH 204 is about statistics, but what does this mean?

  • Statistics is fundamentally about data: how to collect it, how to analyze it, and how to use it to make inferences and draw conclusions about the real world.

  • Section 1.1 of the textbook presents a case study to motivate the study of statistics. There is also a corresponding lecture video which you are asked to watch on your own time. For your convenience, the video is included in the next slide.

  • Some things to think about when watching the video are:

    • What is the research question?
    • What does the video say about random fluctuation?
8 / 29

Case Study: Using Stents to Prevent Strokes

  • After watching the case study video, explain the difference between a treatment group and a control group.
9 / 29

On Data and Its Structure

The term "data" can be interpreted very broadly. However, the data that one typically analyzes using statistics or statistical methods have some common features that we will take a moment to point out:

  1. The data is "structured" in the sense that it has an underlying order to it. This will be explained in greater detail soon.

  2. The data is typically a "sample" in that it is only a small part of all the data one could possibly collect.

10 / 29

An Example

Consider the following question:

"How much sleep do University of Scranton students get during the first week of classes?"

  • Can you think of a way to answer this question?

  • One way, at least in principle, would be to ask every single student at the University of Scranton to tell us how much they sleep during the first week of classes.

  • Do you think this is a good way to answer our question or not?

11 / 29

On Populations

Observe that our question[1] is about a population[2], in this case, all University of Scranton students. If we (hypothetically) record how much every single UofS student sleeps during the first week of classes, it might look something like this:

##    r_number  Sun Mon Tues Wed Thurs Fri Sat   living  year college
## 1 R01920433  7.4 6.0  4.8 7.7   5.3 9.6 9.0   Martin First    PCPS
## 2 R01780024 10.0 7.2  3.9 7.3   4.9 9.3 9.9   Lynett First    KSOM
## 3 R01816495  8.6 6.7  7.2 5.6   5.3 8.5 8.7   Lynett First     CAS
## 4 R01647948  6.8 6.5  8.3 5.6   4.8 9.4 9.0 Driscoll First    PCPS
## 5 R01782171  8.2 6.6  6.1 5.6   4.8 8.3 9.4    Fitch First     CAS

(Note: There would be around 4,000 rows, so only some are shown here.)

  • A population is often large, and it is often infeasible to observe every individual in it.

  • In practice, we take a sample and try to use the sample to infer something about the population.

[1] Recall that our question is "How much sleep do University of Scranton students get during the first week of classes?"

[2] Read section 1.3.1 for a discussion of populations and samples.

12 / 29

On Samples

Here is a sample of size n = 25 drawn from the population:

##     r_number  Sun Mon Tues Wed Thurs  Fri Sat        living   year college
##  1 R01718887  7.9 6.3  5.3 5.0   4.6  9.7 9.2 Junior/Senior Fourth     CAS
##  2 R01943517  7.4 5.8  5.9 6.2   4.9  7.5 8.9       McCourt  First     CAS
##  3 R01866257  9.0 6.5  6.5 6.4   4.0  8.5 9.7 Junior/Senior Fourth    KSOM
##  4 R01942243  5.8 5.5  5.4 6.0   4.5  9.1 8.6    Off Campus  Third     CAS
##  5 R01934011  6.4 5.1  5.5 7.3   5.2  9.6 8.9    Off Campus  Third    PCPS
##  6 R01691560  9.4 7.2  6.0 7.0   5.2  9.5 9.2       Gavigan Second    PCPS
##  7 R01718539  7.7 5.2  5.8 5.7   5.4 10.2 8.5 Junior/Senior Fourth    KSOM
##  8 R01615516  6.4 5.5  5.2 7.4   4.8 11.7 9.1      Driscoll  First     CAS
##  9 R01769847  6.6 5.7  7.6 5.1   5.1  8.4 9.3    Off Campus Fourth     CAS
## 10 R01850111  9.3 6.4  7.0 6.6   5.2  7.5 9.5    Off Campus  Third    KSOM
## 11 R01574251  9.5 5.3  6.1 2.8   5.0  9.3 9.3            DE  First     CAS
## 12 R01599827  8.6 7.2  6.0 7.4   3.2  9.5 8.8       Gavigan Second    KSOM
## 13 R01590698  9.0 6.3  6.0 6.7   5.6 12.3 9.5         Casey  First     CAS
## 14 R01950549  6.6 6.3  7.3 6.1   4.6  9.7 8.9    Off Campus Second     CAS
## 15 R01804370 10.4 7.2  4.8 4.2   4.8 10.4 9.1 Junior/Senior Fourth    KSOM
## 16 R01839042  6.9 6.2  5.7 6.6   5.3  9.5 9.2       Gavigan Second     CAS
## 17 R01937793  7.8 4.4  8.0 4.3   5.0  8.1 8.8    Off Campus  Third    KSOM
## 18 R01716376  8.9 6.2  5.2 5.1   5.3  7.0 8.1    Off Campus  Third     CAS
## 19 R01937156  8.2 7.1  4.6 5.5   4.6  8.1 9.0       Gavigan Second    KSOM
## 20 R01589019  6.7 7.2  5.5 6.3   4.6  7.5 8.9 Junior/Senior  Third     CAS
## 21 R01632235  7.9 5.5  6.3 5.1   4.7  7.9 8.3  Giblin-Kelly  First     CAS
## 22 R01624959  8.7 6.9  6.3 6.6   5.5  8.6 8.5       McCourt  First     CAS
## 23 R01913524  6.0 4.4  6.6 7.0   5.7 10.1 9.6 Junior/Senior Fourth     CAS
## 24 R01578674  5.1 6.4  5.6 7.8   4.5  7.2 9.2 Junior/Senior  Third     CAS
## 25 R01752767  7.4 5.4  5.1 5.2   5.3  9.5 8.4 Junior/Senior Fourth     CAS
13 / 29

Sampling

  • How did we obtain the sample data? We selected 25 UofS students at random. The "at random" part is important and we will return to this point shortly.

  • Section 1.3 covers sampling principles in detail.

  • Before we get into sampling principles, let's take a moment to reflect on how our (sample) data is represented.
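
Selecting 25 students "at random" can be sketched in R with the base function sample(). This is an illustration only, not necessarily how the sample in these slides was produced: the student IDs are made up, the population size of 4,000 is the approximate figure mentioned earlier, and the seed is arbitrary.

```r
# Hypothetical sampling frame: one made-up ID per student
population_ids <- sprintf("S%04d", 1:4000)

set.seed(1)  # arbitrary seed so the selection can be repeated
sample_ids <- sample(population_ids, size = 25)  # without replacement by default

length(sample_ids)         # 25
anyDuplicated(sample_ids)  # 0, since no student is chosen twice
```

Every ID in the frame has the same chance of being selected, which is what makes this a simple random sample.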

14 / 29

Data Organization

Let's look at the first five rows of our sample data:

##    r_number Sun Mon Tues Wed Thurs Fri Sat        living   year college
## 1 R01718887 7.9 6.3  5.3 5.0   4.6 9.7 9.2 Junior/Senior Fourth     CAS
## 2 R01943517 7.4 5.8  5.9 6.2   4.9 7.5 8.9       McCourt  First     CAS
## 3 R01866257 9.0 6.5  6.5 6.4   4.0 8.5 9.7 Junior/Senior Fourth    KSOM
## 4 R01942243 5.8 5.5  5.4 6.0   4.5 9.1 8.6    Off Campus  Third     CAS
## 5 R01934011 6.4 5.1  5.5 7.3   5.2 9.6 8.9    Off Campus  Third    PCPS
  • Our data is organized into rows and columns, a so-called data matrix or data frame. Each row corresponds to a single observation, which in this example is a single student.

  • The columns of our data correspond to variables, that is, the information or characteristics we observe and record about our observations.

  • Variables can usually be classified according to a type system.
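
To make the row and column structure concrete, here is a small sketch that enters the five rows above by hand as an R data frame. Only a subset of the columns is included to keep it short, and the object name sleep is an arbitrary choice.

```r
# First five rows of the sample, with a subset of the columns
sleep <- data.frame(
  r_number = c("R01718887", "R01943517", "R01866257", "R01942243", "R01934011"),
  Sun      = c(7.9, 7.4, 9.0, 5.8, 6.4),
  Mon      = c(6.3, 5.8, 6.5, 5.5, 5.1),
  living   = c("Junior/Senior", "McCourt", "Junior/Senior", "Off Campus", "Off Campus"),
  year     = c("Fourth", "First", "Fourth", "Third", "Third"),
  college  = c("CAS", "CAS", "KSOM", "CAS", "PCPS")
)

nrow(sleep)   # number of observations (students)
ncol(sleep)   # number of columns
names(sleep)  # column names
sleep[2, ]    # everything recorded about the second student
sleep$Sun     # one column: Sunday sleep for every student
```

Indexing with sleep[i, ] picks out an observation, while sleep$name picks out a column.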

15 / 29

Variables and Their Types

  • r_number: not a variable
  • Sun through Sat: continuous numerical
  • living and college: nominal categorical
  • year: ordinal categorical

We repeat the first five rows of our sample data:

##    r_number Sun Mon Tues Wed Thurs Fri Sat        living   year college
## 1 R01718887 7.9 6.3  5.3 5.0   4.6 9.7 9.2 Junior/Senior Fourth     CAS
## 2 R01943517 7.4 5.8  5.9 6.2   4.9 7.5 8.9       McCourt  First     CAS
## 3 R01866257 9.0 6.5  6.5 6.4   4.0 8.5 9.7 Junior/Senior Fourth    KSOM
## 4 R01942243 5.8 5.5  5.4 6.0   4.5 9.1 8.6    Off Campus  Third     CAS
## 5 R01934011 6.4 5.1  5.5 7.3   5.2 9.6 8.9    Off Campus  Third    PCPS
  • r_number is not a variable because it is a unique identifier for each observation.

  • Note that a variable is not necessarily numerical just because its values are numbers. For example, we could have recorded year as 1, 2, 3, or 4 instead of "First", "Second", "Third", or "Fourth". A useful rule of thumb: if it doesn't make sense to compute the average of a variable, then it's not numerical.
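
The point about year can be sketched in R. The numeric codes below are the years of the five students shown above; storing them as an ordered factor is one common way to tell R that the variable is ordinal categorical. The object names are arbitrary.

```r
# Year for the first five students in the sample, coded as numbers
year_code <- c(4, 1, 4, 3, 3)
mean(year_code)  # R will compute this, but an "average year" means little

# An ordered factor records that year is ordinal categorical
year <- factor(year_code, levels = 1:4,
               labels = c("First", "Second", "Third", "Fourth"),
               ordered = TRUE)

table(year)        # counts per category are the natural summary
year[1] > year[2]  # TRUE: order comparisons are meaningful (Fourth > First)
is.numeric(year)   # FALSE: R no longer treats year as a number
```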

16 / 29

Categorical Variable Types

17 / 29

Data Basics Lecture Video

You should watch the following video on your own time.

  • Question: Which variable in the county data set discussed in the video is discrete and why?
18 / 29

Another Example

Consider the possum data set from the openintro R package, which accompanies the course textbook. The first few rows are shown here:

head(possum) # R command used to print first few rows of a data frame
## # A tibble: 6 × 8
##    site pop   sex     age head_l skull_w total_l tail_l
##   <int> <fct> <fct> <int>  <dbl>   <dbl>   <dbl>  <dbl>
## 1     1 Vic   m         8   94.1    60.4    89     36
## 2     1 Vic   f         6   92.5    57.6    91.5   36.5
## 3     1 Vic   f         6   94      60      95.5   39
## 4     1 Vic   f         6   93.2    57.1    92     38
## 5     1 Vic   f         2   91.5    56.3    85.5   36
## 6     1 Vic   f         1   93.1    54.8    90.5   35.5
  • State the type of each variable in the data matrix.

    • site, pop, and sex are nominal categorical
    • age is discrete numerical
    • columns head_l to tail_l are continuous numerical
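
How R stores a column does not always match its statistical type. In the sketch below, the six rows shown above are entered by hand (a subset of the columns, under the made-up name possum_head) so that the example runs without the openintro package. Note that site is stored as an integer even though it only labels a location.

```r
# The six rows shown above, entered by hand (subset of columns)
possum_head <- data.frame(
  site   = c(1L, 1L, 1L, 1L, 1L, 1L),
  sex    = factor(c("m", "f", "f", "f", "f", "f")),
  age    = c(8L, 6L, 6L, 6L, 2L, 1L),
  head_l = c(94.1, 92.5, 94.0, 93.2, 91.5, 93.1)
)

sapply(possum_head, class)  # how R stores each column

# site only labels a location, so treat it as categorical
possum_head$site <- factor(possum_head$site)

table(possum_head$sex)      # counts suit categorical variables
mean(possum_head$age)       # averages suit numerical variables
mean(possum_head$head_l)
```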
19 / 29

Sampling Principles

  • The first step in conducting research is to identify topics or questions that are to be investigated. For example,

"How much sleep do University of Scranton students get during the first week of classes?"

  • We need to consider how data are collected and how samples are obtained.

  • Importantly, we need to avoid as much as possible picking a biased sample.

20 / 29

Data Collection

  • Watch this video on your own time in order to reinforce the concepts of:

    • populations, samples, and bias
21 / 29

Biased Samples

Consider the question:

"What is the most popular Starbucks drink for current students at the University?"

  • One approach to data collection for answering this question is to select some subset of students at the University and ask them about their favorite Starbucks drink.

  • For example, out of convenience we could ask everyone in this class. In this case, our sample would be the students in this class.

  • However, restricting our sample to students in a single class might not produce a sample that is sufficiently representative of the entire population. In other words, we may introduce bias by taking as our sample only students in this class.

  • It is preferable, and often essential, to obtain a sample by choosing individuals at random from the population. Random samples reduce bias!
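
A toy simulation can illustrate why a convenience sample may mislead. Everything in this sketch is invented for illustration: the population size, the drink labels "A" and "B", the assumption that preference depends only on class year, and the assumption that the class consists entirely of first-year students.

```r
# Hypothetical population of 4,000 students, 1,000 per class year
year  <- rep(c("First", "Second", "Third", "Fourth"), each = 1000)
# Invented pattern: first-year students favor drink "A", everyone else "B"
drink <- ifelse(year == "First", "A", "B")

mean(drink == "A")          # true population proportion: 0.25

# Convenience sample: one class made up entirely of first-year students
class_sample <- drink[year == "First"][1:30]
mean(class_sample == "A")   # 1, far from the population value

# Simple random sample of the same size from the whole population
set.seed(42)                # arbitrary seed
random_sample <- sample(drink, size = 30)
mean(random_sample == "A")  # varies from sample to sample, centered on 0.25
```

The random sample is not guaranteed to hit 0.25 exactly, but it is not systematically pulled toward one group the way the single-class sample is.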

22 / 29

Sampling Strategies

Let's watch this video together and discuss.

  • Question: What are the differences between experimental and observational studies?
23 / 29

Common Sampling Strategies

Here we list and describe some of the most common sampling strategies:

  1. Simple random sampling. In a simple random sample, each case in the population has an equal chance of being included in the sample.

  2. Stratified sampling. The population is divided into groups called strata that are chosen so that similar cases are grouped together. Then, some other sampling method such as simple random sampling is used to select a sample from within each group. (In our Starbucks example, we could first divide students by cohort, such as first year, second year, and so on, and then take a sample from each cohort.)

  3. Cluster sampling. The population is broken up into many groups called clusters. We then sample a fixed number of clusters and include all observations from each of those clusters.

  4. Multistage sampling. This is like cluster sampling but rather than keeping all observations in each cluster, we collect a random sample within each cluster.
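
The four strategies can be sketched in base R on a made-up population. The data frame students, its columns, the 40 clusters of 100 students each, and the sample sizes are all hypothetical choices for illustration.

```r
set.seed(2024)  # arbitrary seed

# Hypothetical population: 4,000 students, 1,000 per cohort, 40 clusters of 100
students <- data.frame(
  id      = 1:4000,
  cohort  = rep(c("First", "Second", "Third", "Fourth"), each = 1000),
  cluster = rep(1:40, times = 100)
)

# 1. Simple random sample of 100 students
srs <- students[sample(nrow(students), 100), ]

# 2. Stratified sample: 25 students drawn at random within each cohort
strat_rows <- unlist(lapply(split(seq_len(nrow(students)), students$cohort),
                            function(rows) sample(rows, 25)))
stratified <- students[strat_rows, ]

# 3. Cluster sample: choose 4 clusters at random, keep every student in them
chosen <- sample(unique(students$cluster), 4)
cluster_sample <- students[students$cluster %in% chosen, ]

# 4. Multistage sample: within each chosen cluster, draw 25 students at random
multi_rows <- unlist(lapply(chosen, function(k) {
  sample(which(students$cluster == k), 25)
}))
multistage <- students[multi_rows, ]

table(stratified$cohort)  # exactly 25 per cohort by construction
```

The stratified sample guarantees that every cohort is represented, while the cluster and multistage samples only ever contain students from the four chosen clusters.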

24 / 29

Relationship Between Variables

  • Many analyses are motivated by a researcher looking for a relationship between two or more variables.

    • For example, one could ask: is there a relationship between a person's level of education and their salary at age 30?
  • When two variables show some connection with one another, they are called associated variables.

"A pair of variables are either related in some way (associated) or not (independent). No pair of variables is both associated and independent."

  • It is important to point out that association does not imply causation.

  • In the next class meeting, we will go through an introduction to R where we will work with some data and see explicit examples where two (or more) variables in a data set might be related.

25 / 29

Explanatory and Response Variables

  • When we ask questions about the relationship between two variables, we sometimes also want to determine if the change in one variable causes a change in the other.

  • For example, consider the county data set from the openintro R package. The first few rows of select columns of this data set are shown below:

## # A tibble: 6 × 4
##   name           state   pop_change median_hh_income
##   <chr>          <fct>        <dbl>            <int>
## 1 Autauga County Alabama       1.48            55317
## 2 Baldwin County Alabama       9.19            52562
## 3 Barbour County Alabama      -6.22            33368
## 4 Bibb County    Alabama       0.73            43404
## 5 Blount County  Alabama       0.68            47412
## 6 Bullock County Alabama      -2.28            29655
  • We could consider the following question:

"If there is an increase in the median household income in a county, does this drive an increase in its population?"

  • Here we are asking whether median household income is an explanatory variable for the response variable population change.
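
In R, the roles of the two variables show up in the model formula, where the response goes on the left of ~ and the explanatory variable on the right. The sketch below enters the six rows shown above by hand under the made-up name county_head. Six counties are far too few to draw conclusions from, and even a strong association would not by itself show that income changes cause population changes.

```r
# The six county rows shown above, entered by hand
county_head <- data.frame(
  name             = c("Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bullock"),
  pop_change       = c(1.48, 9.19, -6.22, 0.73, 0.68, -2.28),
  median_hh_income = c(55317, 52562, 33368, 43404, 47412, 29655)
)

# Strength and direction of the linear association between the two variables
cor(county_head$median_hh_income, county_head$pop_change)

# Response on the left of ~, explanatory variable on the right
fit <- lm(pop_change ~ median_hh_income, data = county_head)
coef(fit)  # intercept and slope of the fitted line
```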
26 / 29

Reflection

In this lecture, we have covered the following topics and concepts:

  • Course logistics, syllabus, textbook, use of R, etc.

  • Data and its structure, e.g., data matrices and variable types.

  • Sampling principles and strategies.

  • Relationship between variables.

27 / 29

Before Next Class Meeting

Before the next class meeting, please complete the following tasks:

  • Read the syllabus carefully.

  • Respond via email to my "getting to know you" prompt.

  • Review Chapter 1 of the textbook.

  • Register for RStudio Cloud. If you already have a Gmail account, you can just log in using it.

  • Accept my invitation to the MATH204 RStudio Cloud Workspace.

  • Make sure to bring your computer with you to class.

  • Have a great rest of the day.

28 / 29

Looking Ahead

If you want to get a head start for the next class or two, watch this video:

29 / 29
