Lecture 4
Data Summaries for Categorical Data
JMG
MATH 204
Thursday, September 9

Learning Objectives

In this lecture, we will

Continue our discussion of numerical and visual summaries of data
- We discuss contingency tables and bar plots for summarizing categorical data (textbook section 2.2.1).
- We see how to use R to compute numerical summaries and visualizations for categorical data.

Summaries of Categorical Data Video

Please watch this video on your own time. 

A Note on Categorical Variables in R

In R, a categorical variable is often called a factor, and each category is called a level.
For example, consider letter grades such as those we assign at the University. This is a factor with 11 levels: A, A-, B+, B, B-, C+, C, C-, D+, D, and F.
Note that here we know all of the possible outcomes (levels) a priori.

Creating a Factor in R

Here is an example of creating a factor in R:

my_grades <- factor(
  c(rep("A", 4), rep("B", 6), rep("B-", 4), rep("C", 8), rep("C-", 2),
    rep("D", 2), rep("F", 3)),
  levels = c("A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "F")
)
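
To see what the `levels` argument contributes, the following sketch builds the same grades twice, once letting R infer the levels and once declaring all 11 up front. The names `grades_chr`, `inferred`, `declared`, and `all_levels` are introduced here for illustration only.

```r
# Grades recorded as plain character strings (same data as my_grades above)
grades_chr <- c(rep("A", 4), rep("B", 6), rep("B-", 4), rep("C", 8),
                rep("C-", 2), rep("D", 2), rep("F", 3))

# Without `levels`, R infers the levels from the observed values only
inferred <- factor(grades_chr)

# With `levels`, every possible grade is kept, in the order given
all_levels <- c("A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "F")
declared <- factor(grades_chr, levels = all_levels)

nlevels(inferred)  # number of levels found in the data
nlevels(declared)  # number of levels declared up front
levels(declared)   # the levels, in the declared order
```

Declaring the levels is what lets grades that nobody received, such as A- or B+, still show up later with a count of zero.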

Recall the `table` Function

table(my_grades)

## my_grades
##  A A- B+  B B- C+  C C- D+  D  F 
##  4  0  0  6  4  0  8  2  0  2  3

The `table` function returns the number of times each level of a factor appears in the data. The result is sometimes called a frequency table.
A table such as the one we just obtained is a typical way to summarize a single categorical variable.
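
A frequency table can also be turned into relative frequencies, and unused levels can be dropped when they are not of interest. This is a minimal sketch in base R; the names `counts` and `rel_freq` are introduced here for illustration.

```r
# Same factor as in the lecture
my_grades <- factor(
  c(rep("A", 4), rep("B", 6), rep("B-", 4), rep("C", 8), rep("C-", 2),
    rep("D", 2), rep("F", 3)),
  levels = c("A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "F")
)

counts <- table(my_grades)     # frequency table, including zero counts
sum(counts)                    # total number of observations

rel_freq <- prop.table(counts) # divide each count by the total
round(rel_freq, 3)

# Drop the levels that never occur before tabulating
table(droplevels(my_grades))
```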

Bar Plots

A barplot is the visual analog of a frequency table.

library(tibble)     # provides tibble()
library(ggformula)  # provides gf_bar()
grades_df <- tibble(my_grades = my_grades)
gf_bar(~my_grades, data = grades_df)

Summarizing Data for Two Categorical Variables

Scatterplots provide a way to summarize two numerical variables together. Methods for summarizing two categorical variables together include:

Contingency tables
Proportion tables
Stacked or side-by-side bar plots
Mosaic plots

We will explain each of these tools and illustrate how to obtain them in R. We will work with the loans_dat data set, which represents thousands of loans made through the Lending Club platform, a platform that allows individuals to lend to other individuals.

Contingency Tables

Contingency tables display the number of times a particular combination of variable outcomes occurs.

For example, we construct a contingency table for the variables homeownership (ownership status of the applicant's residence) and application_type (type of application: either individual or joint):

with(loans_dat,addmargins(table(application_type,homeownership)))

##                 homeownership
## application_type MORTGAGE   OWN  RENT   Sum
##       individual     3839  1170  3496  8505
##       joint           950   183   362  1495
##       Sum            4789  1353  3858 10000

If we create a barplot for homeownership, we will see bars corresponding to the first three values in the last row of our table. Similarly, a barplot for application_type will show bars corresponding to the first two values in the last column of our table.

Bar Plots for `loans_dat`

##                 homeownership
## application_type MORTGAGE   OWN  RENT   Sum
##       individual     3839  1170  3496  8505
##       joint           950   183   362  1495
##       Sum            4789  1353  3858 10000
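
One way to draw the two bar plots that correspond to the margins of this table is sketched below in base R. It rebuilds the table from the counts printed above rather than from `loans_dat` itself, and the names `loan_counts`, `home_totals`, and `app_totals` are introduced here for illustration.

```r
# Counts copied from the contingency table above (loans_dat itself is not used)
loan_counts <- as.table(matrix(
  c(3839, 1170, 3496,
     950,  183,  362),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    application_type = c("individual", "joint"),
    homeownership = c("MORTGAGE", "OWN", "RENT")
  )
))

home_totals <- margin.table(loan_counts, margin = 2)  # column totals (last row of the table)
app_totals  <- margin.table(loan_counts, margin = 1)  # row totals (last column of the table)

# One bar per level, with height equal to the marginal count
barplot(home_totals, main = "homeownership", ylab = "count")
barplot(app_totals, main = "application_type", ylab = "count")
```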

Proportion Tables

At times, it is useful to compute proportions instead of counts.

A proportion table displays the same essential information as a contingency table, except that we divide the entries by either the row sums (row proportion table) or the column sums (column proportion table).
For example, if we divide each entry in the first row of the previous table by 8505, and divide each entry in the second row by 1495, we obtain

(c(3839,1170,3496) / 8505)

## [1] 0.4513815 0.1375661 0.4110523

(c(950, 183, 362) / 1495)

## [1] 0.6354515 0.1224080 0.2421405

This gives us our values for a row proportion table.
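
The same division can be done for the whole table at once. The sketch below works from the counts printed earlier rather than from `loans_dat`, and checks the hand computation against `prop.table`; the names `loan_counts`, `row_prop_manual`, and `col_prop_manual` are introduced here for illustration.

```r
# Counts copied from the contingency table (loans_dat itself is not used)
loan_counts <- as.table(matrix(
  c(3839, 1170, 3496,
     950,  183,  362),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    application_type = c("individual", "joint"),
    homeownership = c("MORTGAGE", "OWN", "RENT")
  )
))

# Row proportions: R recycles rowSums() down each column,
# so entry [i, j] is divided by the total of row i
row_prop_manual <- loan_counts / rowSums(loan_counts)

# Column proportions: sweep() divides each column by its own total
col_prop_manual <- sweep(loan_counts, 2, colSums(loan_counts), "/")

# Both agree with prop.table()
all.equal(as.vector(row_prop_manual),
          as.vector(prop.table(loan_counts, margin = 1)))
all.equal(as.vector(col_prop_manual),
          as.vector(prop.table(loan_counts, margin = 2)))

round(row_prop_manual, 4)
```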

Proportion Tables Example

Row proportion table

with(loans_dat,addmargins(prop.table(table(application_type,homeownership),
                                     margin=1)))

##                 homeownership
## application_type  MORTGAGE       OWN      RENT       Sum
##       individual 0.4513815 0.1375661 0.4110523 1.0000000
##       joint      0.6354515 0.1224080 0.2421405 1.0000000
##       Sum        1.0868330 0.2599742 0.6531928 2.0000000

Column proportion table

with(loans_dat,addmargins(prop.table(table(application_type,homeownership),
                                     margin=2)))

##                 homeownership
## application_type  MORTGAGE       OWN      RENT       Sum
##       individual 0.8016287 0.8647450 0.9061690 2.5725427
##       joint      0.1983713 0.1352550 0.0938310 0.4274573
##       Sum        1.0000000 1.0000000 1.0000000 3.0000000
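
In the output above, `addmargins` sums over both dimensions, so the Sum row of the row proportion table and the Sum column of the column proportion table add proportions that have different denominators and have no direct interpretation. A minimal sketch of adding only the meaningful margin follows; it uses the counts printed earlier rather than `loans_dat`, and the object names are introduced here for illustration.

```r
# Counts copied from the contingency table (loans_dat itself is not used)
loan_counts <- as.table(matrix(
  c(3839, 1170, 3496,
     950,  183,  362),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    application_type = c("individual", "joint"),
    homeownership = c("MORTGAGE", "OWN", "RENT")
  )
))

row_prop <- prop.table(loan_counts, margin = 1)
col_prop <- prop.table(loan_counts, margin = 2)

# margin = 2 appends only a Sum column: each row of proportions sums to 1
row_with_sum <- addmargins(row_prop, margin = 2)
row_with_sum

# margin = 1 appends only a Sum row: each column of proportions sums to 1
col_with_sum <- addmargins(col_prop, margin = 1)
col_with_sum
```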

Stacked and Side-By-Side Barplots

p1 <- gf_bar(~homeownership, data = loans_dat, fill = ~application_type) +
  ggtitle("Stacked")
p2 <- gf_bar(~homeownership, data = loans_dat, fill = ~application_type,
             position = position_dodge()) + ggtitle("Side-By-Side")
# Combining plots with `+` assumes a plot-composition package such as patchwork is loaded
p1 + p2

Standardized Stacked Bar Plot

Stacked bar plots can be used to construct a visualization of a proportion table.

For example, the following stacked bar plot displays our column proportion table as a plot:

##                 homeownership
## application_type  MORTGAGE       OWN      RENT       Sum
##       individual 0.8016287 0.8647450 0.9061690 2.5725427
##       joint      0.1983713 0.1352550 0.0938310 0.4274573
##       Sum        1.0000000 1.0000000 1.0000000 3.0000000
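
A standardized stacked bar plot of these column proportions can be sketched in base R as follows. This is one possible approach, not necessarily the code behind the lecture's figure, and it uses the counts printed earlier rather than `loans_dat`; `loan_counts` and `col_prop` are names introduced here for illustration.

```r
# Counts copied from the contingency table (loans_dat itself is not used)
loan_counts <- as.table(matrix(
  c(3839, 1170, 3496,
     950,  183,  362),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    application_type = c("individual", "joint"),
    homeownership = c("MORTGAGE", "OWN", "RENT")
  )
))

# Column proportions: within each homeownership level the values sum to 1
col_prop <- prop.table(loan_counts, margin = 2)

# For a matrix, barplot() stacks the rows within each column,
# so every bar has total height 1
barplot(col_prop, legend.text = TRUE,
        xlab = "homeownership", ylab = "proportion",
        main = "Standardized stacked bar plot")
```

In the ggformula style used in this lecture, passing `position = position_fill()` to `gf_bar` should give a comparable plot, though that call is an assumption and is not taken from the slides.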

Mosaic Plots

A mosaic plot is a visualization of a contingency table in which the area of each tile is proportional to the corresponding count. Mosaic plots can be one-variable or multi-variable.
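
As a minimal sketch, base R's `mosaicplot` draws a two-variable mosaic plot directly from a contingency table. The table here is rebuilt from the counts shown earlier rather than from `loans_dat`, and the name `loan_counts` is introduced for illustration.

```r
# Counts copied from the contingency table (loans_dat itself is not used)
loan_counts <- as.table(matrix(
  c(3839, 1170, 3496,
     950,  183,  362),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    application_type = c("individual", "joint"),
    homeownership = c("MORTGAGE", "OWN", "RENT")
  )
))

# Each tile corresponds to one cell of the table; shaded tiles via color = TRUE
mosaicplot(loan_counts, main = "application_type by homeownership",
           color = TRUE)
```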

Grouped Numerical Data

Grouped numerical data arises when we want to study the distribution of a numerical variable across two or more distinguishing groups.
In other words, we are looking for association between two variables where one variable is numerical (typically viewed as the response variable) and the other is categorical (typically viewed as the explanatory variable).
For example, consider our possum data set again.

head(possum,5)

## # A tibble: 5 x 8
##    site pop   sex     age head_l skull_w total_l tail_l
##   <int> <fct> <fct> <int>  <dbl>   <dbl>   <dbl>  <dbl>
## 1     1 Vic   m         8   94.1    60.4    89     36  
## 2     1 Vic   f         6   92.5    57.6    91.5   36.5
## 3     1 Vic   f         6   94      60      95.5   39  
## 4     1 Vic   f         6   93.2    57.1    92     38  
## 5     1 Vic   f         2   91.5    56.3    85.5   36

We could ask, is there a difference in the distribution of possum tail length between female and male possums?

Grouped Summaries

We can compute grouped numerical summaries:

possum %>%
  group_by(sex) %>%
  summarise(mean_tail_l = mean(tail_l),
            median_tail_l = median(tail_l),
            tail_l_var = var(tail_l),
            sd_tail_l = sd(tail_l))

## # A tibble: 2 x 5
##   sex   mean_tail_l median_tail_l tail_l_var sd_tail_l
##   <fct>       <dbl>         <dbl>      <dbl>     <dbl>
## 1 f            37.1          37.5       3.35      1.83
## 2 m            36.9          36.5       4.23      2.06
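
The `group_by` and `summarise` steps split the rows by `sex` and then compute each summary within a group. The same split-then-summarize idea can be sketched in base R with `tapply`. The data frame below holds only the five rows printed by `head(possum, 5)` earlier, so its results will not match the full-data summaries above; `possum_head` is a name introduced here for illustration.

```r
# Only the five rows shown by head(possum, 5); not the full data set
possum_head <- data.frame(
  sex = factor(c("m", "f", "f", "f", "f")),
  tail_l = c(36, 36.5, 39, 38, 36)
)

# Split tail_l by sex, then apply a summary function to each group
group_means   <- tapply(possum_head$tail_l, possum_head$sex, mean)
group_medians <- tapply(possum_head$tail_l, possum_head$sex, median)
group_sizes   <- table(possum_head$sex)

group_means
group_medians
group_sizes
```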

Grouped plots

We can also create grouped plots:

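# Box plots and histograms of tail length by sex (binwidth applies to the histogram only)
p1 <- gf_boxplot(tail_l ~ sex, data = possum, color = ~sex)
p2 <- gf_histogram(~tail_l | sex, data = possum, fill = ~sex, binwidth = 2)
# Combining plots with `+` assumes a plot-composition package such as patchwork is loaded
p1 + p2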

R Tips: the Tidyverse

To work with data, compute summaries, and obtain visualizations, we are employing the tidyverse family of R packages. This includes, among others:

- dplyr for working with and summarizing data
- ggplot2 for graphics and visualizations
- readr for reading data into R

The tidyverse utilizes the principle of tidy data for facilitating analyses.

Tidy Data

Visualizations in R

Reflection

In this lecture, we covered the topics of

Graphical and numerical summaries for categorical data
We discussed contingency tables and bar plots
We introduced the notion of grouped data and grouped summaries

For Next Time

In the next lecture, we will begin our discussion of probability, which forms the foundation of statistics. In preparation, you are encouraged to watch the included video.
