class: center, middle, inverse, title-slide # Lecture 3 ## Intro to Data Summaries ### JMG ### MATH 204 ### Tuesday, September 7 --- # Learning Objectives In this lecture, we will - Introduce Numerical and Visual Summaries of Data - We will introduce mean, median, mode, variance, and standard deviation. Textbook sections 2.1.2, 2.1.4, 2.1.5. - We discuss scatterplots, histograms, and boxplots for visualizing numerical data. Textbook sections 2.1.1, 2.1.3, 2.1.5. - We discuss contingency tables and bar plots for summarizing categorical data. Textbook section 2.2.1. - We see how to use R to compute numerical summaries and visualizations for data. --- # Recollections Recall from a previous lecture, we introduced data, data collection and sampling strategies, and the structure of data and variable types. - Research begins with a question -- - Then, data is collected to answer the question -- - In this lecture, we describe the next step in the research process --- # The Research Workflow .center[ <img src="https://www.dropbox.com/s/ycrjss9s0mn9o8e/3-s2.0-B9780128207888000109-f01-01-9780128207888.gif?raw=1" width="50%" /> ] --- # Ziggy .center[ <img src="https://www.dropbox.com/s/qqzywowynekav0z/ZiggyHairPins.jpg?raw=1" width="50%" /> ] --- # Exploratory Data Analysis (EDA) - Once we have collected data, the next step in the statistical process is beginning to explore the data. -- - Typically, it is not helpful (possible) to look at an entire data set and gain meaningful insight. Instead, we work with summaries of our data, both graphical summaries and numeric summaries. -- - Numeric summaries of data are often called descriptive statistics. -- - It is important to note that the type of a variable, *i.e.*, numerical or categorical, will determine the kind of graphical or numeric summary that is used. --- # Exploring Numerical Variables We begin our introduction to EDA by discussing summaries for numerical data. Specifically, we introduce - Scatterplots for paired numerical variables -- - Dot plots and the mean -- - Histograms and shape -- - Variance and standard deviation -- - Box plots, the median, and robust statistics These topics are discussed in section 2.1 of the textbook and the video included on the next slide provides an overview. --- # Summarizing and Graphing Numerical Data Video
- Please watch this video on your own time. --- # Example Data Sets - In 2.1, the text discusses aspects of the `loan50` data set. It is helpful to look at what is contained in this data. ```r dim(loan50) ``` ``` ## [1] 50 18 ``` -- - We see that the `loan50` data set contains 50 observations (*i.e.*, rows) and a total of 18 variables (*i.e.*, columns). -- - The `glimpse` command may be used to examine the types for each of the variables that is contained in the data. --- # The `loan50` Variables ```r glimpse(loan50) ``` ``` ## Rows: 50 ## Columns: 18 ## $ state <fct> NJ, CA, SC, CA, OH, IN, NY, MO, FL, FL, MD, HI~ ## $ emp_length <dbl> 3, 10, NA, 0, 4, 6, 2, 10, 6, 3, 8, 10, 10, 2,~ ## $ term <dbl> 60, 36, 36, 36, 60, 36, 36, 36, 60, 60, 36, 36~ ## $ homeownership <fct> rent, rent, mortgage, rent, mortgage, mortgage~ ## $ annual_income <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288~ ## $ verified_income <fct> Not Verified, Not Verified, Verified, Not Veri~ ## $ debt_to_income <dbl> 0.55752542, 1.30568333, 1.05628000, 0.57434667~ ## $ total_credit_limit <int> 95131, 51929, 301373, 59890, 422619, 349825, 1~ ## $ total_credit_utilized <int> 32894, 78341, 79221, 43076, 60490, 72162, 2872~ ## $ num_cc_carrying_balance <int> 8, 2, 14, 10, 2, 4, 1, 3, 10, 4, 3, 4, 3, 2, 3~ ## $ loan_purpose <fct> debt_consolidation, credit_card, debt_consolid~ ## $ loan_amount <int> 22000, 6000, 25000, 6000, 25000, 6400, 3000, 1~ ## $ grade <fct> B, B, E, B, B, B, D, A, A, C, D, A, A, A, A, E~ ## $ interest_rate <dbl> 10.90, 9.92, 26.30, 9.92, 9.43, 9.92, 17.09, 6~ ## $ public_record_bankrupt <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~ ## $ loan_status <fct> Current, Current, Current, Current, Current, C~ ## $ has_second_income <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~ ## $ total_income <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288~ ``` --- # Help on Data in R For a data set contained in a R package, it is often possible to obtain additional information using the `help` command. For example, the `loan50` data set is contained in the `openintro` package. -- - Go the console and run the following command: ```r ?openintro::loan50 ``` - **Question:** For the numerical variables in `loan50`, are there any that are continuous? Are there any numerical variables that are discrete? --- # Some R Practice Another data set associated with the text is the `epa2021` data set. In the R console, run the following commands to learn about this data. - Size of the data ```r dim(openintro::epa2021) ``` -- - Details on the data ```r ?openintro::epa2021 ``` -- - Variable types ```r dplyr::glimpse(openintro::epa2021) ``` --- # Scatterplots In statistics, we are often interested to explore the relationship between variables in a data set, one way to do this is with a scatter plot. - For example, in the `epa2021` data set we can try to assess any association between engine size and gas mileage: ```r gf_point(hwy_mpg~engine_displacement,data=epa2021) ``` <!-- --> --- # Thinking About Scatterplots A scatterplot provides a case-by-case view of data for two numerical variables. -- - **Question:** What does the scatterplot on the previous slide reveal about the data? How is the plot useful? How are scatter plots useful in general? --- # Models Statistical models help us assess variable associations. For example, the curve shown in the following plot is obtained by fitting a type of statistical model to the data: ```r gf_point(hwy_mpg~engine_displacement,data=epa2021) + geom_smooth() ``` <!-- --> This model suggests that the relationship between engine size and gas mileage is **nonlinear** since the curve deviates significantly from being a straight line. --- # Single Numerical Variable Data Summaries Later, we will return to consider statistical relationships between variables. For now, let's shift our focus to understanding what can happen with just one variable in a data set. -- - Consider, for example, just the `interest_rate` variable in the `loan50` data. -- - We could visualize `interest_rate` as a sort of one-dimensional scatterplot by simply plotting the values of the variable on the number line: <!-- --> There is overlap in the data, so darker dots correspond to more points laid on top of one another (*i.e.*, and higher density). --- # Dot Plots We could also stack points that take on the same values in order to create a **dot plot**: <!-- --> -- - **Question:** What information is provided by this plot? -- - The previous dot plot provides a starting point for understanding the [**distribution**](https://en.wikipedia.org/wiki/Frequency_distribution) of our sample data. The distribution of a sample tells us about what values we see in the data, as well as how common (frequent) or not certain values may be. --- # Descriptive Statistics - Plots of data are very helpful in many ways. However, it is also nice to have simple ways to summarize and/or characterize the distribution of numerical data quantitatively. -- - In fact, statistics provides us with methods for summarizing and characterizing numerical data. The most common ones are: -- - 1) The sample **mean** and **median** which measure the center of a distribution of data, and -- - 2) the sample **variance** and **standard deviation** which measure the "spread" of a distribution of data. The variance is the average squared distance from the mean, and the standard deviation tells us how far the data are distributed from the mean. -- We will now give mathematical definitions for mean, median, variance, and standard deviation and show how to compute all of these quantities both by hand and using R. --- # Definition of Mean and Examples In words, we define the mean of sample data by `\(\text{sample mean (average) of data} = \frac{\text{sum of all sample data values}}{\text{sample size}}.\)` -- Mathematically, we define the mean of sample data by the formula `\(\bar{x} = \frac{x_{1} + x_{2} + \cdots + x_{n}}{n}.\)` -- Note that we read `\(\bar{x}\)` as "x bar". --- # Example Computing Mean Consider the following list of numbers: ```r x <- c(2,5,10,4,8,2,6,3,9,10,4,6,5,5,7) ``` We can calculate the mean (or average) of these 15 values in a few different ways: -- - By hand `\(\frac{2 + 5 + 10 + 4 + 8 + 2 + 6 + 3 + 9 + 10 + 4 + 6 + 5 + 5 + 7}{15} = \frac{86}{15} \approx 5.73\)` -- - Manually in R ```r sum(x)/length(x) ``` ``` ## [1] 5.733333 ``` -- - Using the `mean` function in R ```r mean(x) ``` ``` ## [1] 5.733333 ``` --- # Mean Example with Data Let's compute the mean of the `interest_rate` variable: ```r mean(loan50$interest_rate) ``` ``` ## [1] 11.5672 ``` -- - **Question:** What is the value of `\(n\)` (*i.e.*, the sample size) for the `interest_rate` data? --- # Thinking About the Mean The next plot shows the data for the `interest_rate` variable again but now with the mean value added to the figure: <!-- --> **Question:** In what sense does the sample mean of the `interest_rate` data measure the center of the distribution of the sample data? How much to "large" values contribute to the mean? --- # Definition of Median If the data are ordered from smallest to largest, the sample **median** is the observation right in the middle. -- - If there are an even number of observations, there will be two values in the middle, and the median is taken as their average value. -- - The median is also known as the 50th percentile or 50th quantile. --- # Median Example Consider again the list of numbers contained in the vector `x`: ```r x ``` ``` ## [1] 2 5 10 4 8 2 6 3 9 10 4 6 5 5 7 ``` -- It is simple to order these from smallest to largest: ```r sort(x) ``` ``` ## [1] 2 2 3 4 4 5 5 5 6 6 7 8 9 10 10 ``` The value in the middle is obviously 5. -- Let's confirm this using the R command `median`: ```r median(x) ``` ``` ## [1] 5 ``` --- # Median with Even Number of Values To see what happens when we have an even number of sample values consider the following list of values: ```r y <- c(2,5,3,7,8,1,4,2,6,7) ``` -- Again, we can order them as ```r sort(y) ``` ``` ## [1] 1 2 2 3 4 5 6 7 7 8 ``` -- Then the median should be `\(\frac{4+5}{2} = \frac{9}{2}=4.5\)` -- Let's confirm this with R: ```r median(y) ``` ``` ## [1] 4.5 ``` --- # Median Example with Data The median of the `interest_rate` data is computed as ```r median(loan50$interest_rate) ``` ``` ## [1] 9.93 ``` --- # Thinking About the Median The following plot shows the interest rate data plus both the sample mean and sample median. <!-- --> -- - **Question:** In what sense does the sample median of the `interest_rate` data measure the center of the distribution of the sample data? How much to "large" values contribute do the mean? --- # Robust Statistics Consider our data `x` again, ```r x ``` ``` ## [1] 2 5 10 4 8 2 6 3 9 10 4 6 5 5 7 ``` ```r (mean(x)) ``` ``` ## [1] 5.733333 ``` ```r (median(x)) ``` ``` ## [1] 5 ``` -- Let's add an **outlier** ```r (xl <- c(x,20)) ``` ``` ## [1] 2 5 10 4 8 2 6 3 9 10 4 6 5 5 7 20 ``` --- # Mean and Median with Outliers Observe what happens if we compute the mean and median with an outlier in the data: ```r (xl <- c(x,20)) ``` ``` ## [1] 2 5 10 4 8 2 6 3 9 10 4 6 5 5 7 20 ``` ```r (mean(xl)) ``` ``` ## [1] 6.625 ``` ```r (median(xl)) ``` ``` ## [1] 5.5 ``` -- - Notice that the median is much less sensitive to the outlier than the mean is. Because of this, we call the median a **robust statistic**. --- # IQR The interquartile range (IQR) is another example of a robust statistic. ```r (IQR(x)) ``` ``` ## [1] 3.5 ``` ```r (IQR(xl)) ``` ``` ## [1] 4.25 ``` -- - What exactly is the interquartile range? It will take a couple of slides to answer this quest. --- # Boxplots .center[ <img src="https://www.dropbox.com/s/qpgw6veozn2z4jx/boxPlotLayoutNumVar.jpg?raw=1" width="75%" /> ] --- # Definition of IQR `\(IQR = Q_{3} - Q_{1}\)` where - `\(Q_{1} =\)` the 25th percentile - `\(Q_{3} =\)` the 75th percentile --- # Boxplots with R Suppose we want to create a boxplot for our `interest_rate` variable in the `loan50` data, then we would do as follows: ```r gf_boxplot(~interest_rate,data=loan50) + coord_flip() ``` <!-- --> --- # Some R Practice Go to the R console and run the following commands: -- - Compute the mean for the highway gas mileage `hwy_mpg` variable from the `epa2021` data set. ```r mean(epa2021$hwy_mpg) ``` -- - Compute the median for the highway gas mileage `hwy_mpg` variable from the `epa2021` data set. ```r median(epa2021$hwy_mpg) ``` --- # Definition of Variance and Standard Deviation Mathematically, we define the sample **variance** (denoted by `\(s^2\)`) by the formula `\(s^2 = \frac{(x_1 - \bar{x})^2+(x_{2} - \bar{x})^2+\cdots +(x_{n}-\bar{x})^2}{n-1}.\)` Take care to note that the denominator is `\(n-1\)`, that is, the sample size minus 1. -- The sample **standard deviation** (denoted by `\(s\)`) is simply the square root of the sample variance. That is `\(s = \sqrt{\text{sample variance}} = \sqrt{\frac{(x_1 - \bar{x})^2+(x_{2} - \bar{x})^2+\cdots +(x_{n}-\bar{x})^2}{n-1}}.\)` --- # Computing Variance and Standard Deviation In R, we compute the sample variance using the command `var` and the sample standard deviation using the command `sd`. For example ```r (var(x)) ``` ``` ## [1] 6.92381 ``` ```r (sd(x)) ``` ``` ## [1] 2.631313 ``` -- Notice that ```r sqrt(var(x)) ``` ``` ## [1] 2.631313 ``` give the same as `sd(x)`. --- # Variance and Standard Deviation for Data Let's compute the sample variance and standard deviation for the `interest_rate` data: ```r (var(loan50$interest_rate)) ``` ``` ## [1] 25.52387 ``` ```r (sd(loan50$interest_rate)) ``` ``` ## [1] 5.052115 ``` --- # Thinking About Variance and Standard Deviation Consider the following plot: <!-- --> Now notice that of 50 data points, 34 of them lie within one standard deviation of the mean. That is, 34 of the data values lie in the interval `\((\bar{x}-s,\bar{x}+s)\)`. That is 68 percent of the data. --- # Two Standard Deviations What percent of the data lie within two standard deviations of the mean, that is, within the interval `\((\bar{x}-2s,\bar{x}+2s)\)`? Let's see: <!-- --> Of 50 data points, 48 lie within two standard deviations of the mean. That is, 48 of the data values lie in the interval `\((\bar{x}-2s,\bar{x}+2s)\)`. That is 96 percent of the data. Not always, but very often about 70% of data lie within one standard deviation of the mean and about 96% of data lie within two standard deviations of the mean. --- # Histograms and Shape - Dot plots show the exact value for each observation in a sample. -- - A histogram "bins" the sample data into distinct intervals and counts the frequency of data points occurring within each bin interval. --- # Histogram Example with Data The following plot shows a histogram for the `interest_rate` variable in the `loan50` data set: ```r gf_histogram(~interest_rate,data=loan50,boundary=5,binwidth=2.5,color="black") + scale_x_continuous(breaks = c(5,10,15,20,25)) ``` <!-- --> -- - Let's examine how this histogram is contructed. --- # Constructing a histogram Look at the `interest_rate` variable data after sorting it: ``` ## [1] 5.31 5.31 5.32 6.08 6.08 6.08 6.71 6.71 7.34 7.35 7.35 7.96 ## [13] 7.96 7.96 7.97 9.43 9.43 9.44 9.44 9.44 9.92 9.92 9.92 9.92 ## [25] 9.93 9.93 10.42 10.42 10.90 10.90 10.91 10.91 10.91 11.98 12.62 12.62 ## [37] 12.62 14.08 15.04 16.02 17.09 17.09 17.09 18.06 18.45 19.42 20.00 21.45 ## [49] 24.85 26.30 ``` -- - If our bins are the intervals `\([5,7.5], [7.5,10.0], \ldots\)`, then we see that there are 11 data points in bin 1, 15 data points in bin 2, etc. --- # The Use of Histograms - Histograms provide a visualization of the density of sample data. -- - That is, higher bars represent higher frequency of a range of values. -- - Histograms also indicate the shape of the distribution of data. --- # Shape of a Distribution A histogram might suggest if our data is **skewed**: <!-- --> --- # Multimodal Distributions A distriubtion with one peak is called unimodal, a distribution with two peaks is called bimodal, and a distribution with more than two peaks is called multimodal. -- <!-- --> --- # Reflection In this lecture, we covered the topics of - Descriptive statistics: mean, median, variance, and standard deviation; and - we introduced scatterplots, histograms, and boxplots for visualizing numerical data -- - R commands for computing mean, median, variance, and standard deviation were introduced; and - we one way to obtain a boxplot or a histogram using R. -- - We also introduced the concept of outliers and robust statistics. --- # For Next Time In the next lecture, - We explore methods for summarizing categorical data such as contingency tables and bar plots. Textbook section 2.2.1. -- - We will also look at some more involved examples of exploratory data analysis (EDA) using R.