3  The Sampling Distribution

3.1 Goals

  • Understand the sampling distribution of an estimate
  • Investigate sampling error
  • Calculate standard error of the mean
  • Calculate confidence intervals

3.2 Learning the Tools

3.2.1 Simulating your own sampling distributions

For this lab we will simulate our own sampling distribution of the mean. Here we will use the penguins dataset from the palmerpenguins package. To simulate the sampling distribution you’ll treat the data in the penguins dataset as the entire population (it isn’t but we can pretend) and take many random samples from it, calculate the sample mean \(\bar{Y}\) for each, and plot the distribution.

3.2.1.1 Randomly sampling rows

The dplyr package offers tools that allow us to easily produce a random sample of rows from a dataset using the function slice_sample(). You may need to install the dplyr package, refer back to Lab 02 for how to do that. With the slice_sample() function we can, for example, sample 5 rows at random from the penguins data set by doing the following:

library(dplyr)

samp <- slice_sample(penguins, n = 5)
samp
    species    island bill_len bill_dep flipper_len body_mass    sex year
1    Gentoo    Biscoe     55.1     16.0         230      5850   male 2009
2    Adelie Torgersen     35.7     17.0         189      3350 female 2009
3    Adelie    Biscoe     41.0     20.0         203      4725   male 2009
4    Adelie Torgersen     39.3     20.6         190      3650   male 2007
5 Chinstrap     Dream     55.8     19.8         207      4000   male 2009

Now try on your to increase the sample size 10, how would you do this?

3.2.1.2 Repeated sampling

To simulate a sampling distribution, we need to repeatedly randomly sample the “population” (again, in this case, we’re pretending the penguins data set is the entire population). The infer package (again you may need to install this pacakge following direction sin Lab 02) has a convenient function rep_slice_sample() that will repeat slice_sample() many times. It creates a new column called replicate to index each replicate sample. To randomly sample 5 rows 4 times from the species Adelie, we do:

# first load needed packages
library(palmerpenguins)
library(infer)

# let's look at just the species setosa, so we'll need to subset our data
just_adelie <- subset(penguins, species == "Adelie")

many_adelie_samp <- rep_slice_sample(just_adelie, n = 5, reps = 4)
head(many_adelie_samp)
# A tibble: 6 × 9
# Groups:   replicate [2]
  replicate species island    bill_length_mm bill_depth_mm flipper_length_mm
      <int> <fct>   <fct>              <dbl>         <dbl>             <int>
1         1 Adelie  Biscoe              35.9          19.2               189
2         1 Adelie  Biscoe              38.1          16.5               198
3         1 Adelie  Biscoe              35            17.9               192
4         1 Adelie  Dream               41.1          18.1               205
5         1 Adelie  Dream               36.6          18.4               184
6         2 Adelie  Torgersen           37.8          17.3               180
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

Note: the infer package converted the penguins data.frame to something called a tibble, for our purposes, think of a data.frame and a tibble as equivalent.

Now, let’s take 1000 samples of size 10 from each species. I’ll show you the code for Adelie, then fill in the blank (___) sections below to do it for the other species. After, you’ll need to combine the results before summarizing and plotting.

# let's use `set.seed` so we can compare answers
set.seed(123)

just_adelie <- subset(penguins, species == "Adelie")
many_adelie_samp10 <- rep_slice_sample(just_adelie, n = 10, reps = 1000)

just_chinstrap <- subset(penguins, species == ___)
many_chinstrap_samp10 <- rep_slice_sample(___, n = 10, reps = 1000)

just_gentoo <- subset(penguins, species == ___)
many_gentoo_samp10 <- rep_slice_sample(___, n = 10, reps = 1000)

If you used the same seed (123) in the above code then you should get these same answers:

many_adelie_samp10
# A tibble: 10,000 × 9
# Groups:   replicate [1,000]
   replicate species island    bill_length_mm bill_depth_mm flipper_length_mm
       <int> <fct>   <fct>              <dbl>         <dbl>             <int>
 1         1 Adelie  Torgersen           38.6          21.2               191
 2         1 Adelie  Dream               42.3          21.2               191
 3         1 Adelie  Torgersen           37.3          20.5               199
 4         1 Adelie  Dream               36            18.5               186
 5         1 Adelie  Dream               41.5          18.5               201
 6         1 Adelie  Dream               37.8          18.1               193
 7         1 Adelie  Dream               38.9          18.8               190
 8         1 Adelie  Dream               35.7          18                 202
 9         1 Adelie  Dream               37.3          16.8               192
10         1 Adelie  Dream               41.1          18.1               205
# ℹ 9,990 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
many_chinstrap_samp10
# A tibble: 10,000 × 9
# Groups:   replicate [1,000]
   replicate species   island bill_length_mm bill_depth_mm flipper_length_mm
       <int> <fct>     <fct>           <dbl>         <dbl>             <int>
 1         1 Chinstrap Dream            47.6          18.3               195
 2         1 Chinstrap Dream            45.7          17                 195
 3         1 Chinstrap Dream            50.9          19.1               196
 4         1 Chinstrap Dream            50.6          19.4               193
 5         1 Chinstrap Dream            47.5          16.8               199
 6         1 Chinstrap Dream            50.5          19.6               201
 7         1 Chinstrap Dream            42.4          17.3               181
 8         1 Chinstrap Dream            45.6          19.4               194
 9         1 Chinstrap Dream            52.7          19.8               197
10         1 Chinstrap Dream            46.2          17.5               187
# ℹ 9,990 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
many_gentoo_samp10
# A tibble: 10,000 × 9
# Groups:   replicate [1,000]
   replicate species island bill_length_mm bill_depth_mm flipper_length_mm
       <int> <fct>   <fct>           <dbl>         <dbl>             <int>
 1         1 Gentoo  Biscoe           43.5          15.2               213
 2         1 Gentoo  Biscoe           45.2          14.8               212
 3         1 Gentoo  Biscoe           50            15.3               220
 4         1 Gentoo  Biscoe           49.6          16                 225
 5         1 Gentoo  Biscoe           43.6          13.9               217
 6         1 Gentoo  Biscoe           47.5          15                 218
 7         1 Gentoo  Biscoe           49.5          16.2               229
 8         1 Gentoo  Biscoe           45.3          13.8               208
 9         1 Gentoo  Biscoe           49.8          16.8               230
10         1 Gentoo  Biscoe           49.1          14.5               212
# ℹ 9,990 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

Remember, we only need to use set.seed in situations where we’re trying to compare output from random sampling. You can delete set.seed after you compared your output to mine.

You can use the rbind function to combine all three sets of sampling distributions into a single tibble.

penguins_sample_dists <- rbind(many_adelie_samp10, 
                               many_chinstrap_samp10,
                               many_gentoo_samp10)

penguins_sample_dists
# A tibble: 30,000 × 9
# Groups:   replicate [1,000]
   replicate species island    bill_length_mm bill_depth_mm flipper_length_mm
       <int> <fct>   <fct>              <dbl>         <dbl>             <int>
 1         1 Adelie  Torgersen           38.6          21.2               191
 2         1 Adelie  Dream               42.3          21.2               191
 3         1 Adelie  Torgersen           37.3          20.5               199
 4         1 Adelie  Dream               36            18.5               186
 5         1 Adelie  Dream               41.5          18.5               201
 6         1 Adelie  Dream               37.8          18.1               193
 7         1 Adelie  Dream               38.9          18.8               190
 8         1 Adelie  Dream               35.7          18                 202
 9         1 Adelie  Dream               37.3          16.8               192
10         1 Adelie  Dream               41.1          18.1               205
# ℹ 29,990 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

Now we have a very large set of samples to examine.

3.2.2 Sample mean \(\bar{Y}\)

To look at the distribution of the sample mean, we first need to calculate the sample mean for all 1000 replicates per species. We will use two helpful functions from dplyr to first group the penguins_sample_dists data.frame by species and replicate, and then calculate the mean for each species with the summarize function. Let’s first look at the sampling distribution of the mean of bill length:

bill_length_sample_dists <-  group_by(penguins_sample_dists, species, replicate)
bill_length_sample_dists <- summarize(bill_length_sample_dists, 
                                      Y_bar = mean(bill_length_mm, na.rm = TRUE))

bill_length_sample_dists
# A tibble: 3,000 × 3
# Groups:   species [3]
   species replicate Y_bar
   <fct>       <int> <dbl>
 1 Adelie          1  38.6
 2 Adelie          2  37.0
 3 Adelie          3  38  
 4 Adelie          4  38.4
 5 Adelie          5  39.2
 6 Adelie          6  39.5
 7 Adelie          7  39.8
 8 Adelie          8  39.4
 9 Adelie          9  40.2
10 Adelie         10  38.2
# ℹ 2,990 more rows

Modify the code above to calculate sample means for flipper_length_mm.

3.2.3 Plot the sampling distribution \(\bar{Y}\)

We can apply the ggplot() tools we’ve already learned to plot a multiple histogram to compare the sampling distributions in each species.

library(ggplot2)

ggplot(bill_length_sample_dists, aes(Y_bar, fill = species)) +
    geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
    scale_fill_viridis_d()

See if you remember how to use facet_grid() to put each specie in it’s own panel like this:

3.2.4 Standard error of the mean

The standard error of the mean helps us quantify our uncertainty about our estimate of the population mean given our sample size. We can calculate a hypothetical standard error for the perfect random sample of size \(n\) by dividing the population standard deviation by \(\sqrt{n}\): \(\sigma / \sqrt{n}\). Let’s pretend that the penguins data set is the entire “population”, the population standard deviation for bill_length_mm is:

species \(\sigma\)
Adelie 2.65
Chinstrap 3.29
Gentoo 3.06

Now calculate the hypothetical standard error of the mean for a sample size of 10. You should get:

species \(\sigma\) \(\sigma/\sqrt{n}\)
Adelie 2.65 0.8380036
Chinstrap 3.29 1.0403894
Gentoo 3.06 0.9676570

Let’s compare this hypothetical standard error of the mean to what we obtain from our simulations. Remember that the standard error of the mean is simply the standard deviation of the sampling distribution. That means we can get the answer by using the sd() function on our simulated sampling distribution.

bill_length_se <- group_by(bill_length_sample_dists, species)
bill_length_se <- summarize(bill_length_se, SE_Ybar = sd(Y_bar))
bill_length_se
# A tibble: 3 × 2
  species   SE_Ybar
  <fct>       <dbl>
1 Adelie      0.841
2 Chinstrap   1.01 
3 Gentoo      0.929

Notice that we had to first group by Species, then summarize by taking the standard deviation of all of our sample means.

Are the population standard errors of the mean close to what you calculated from the simulations? Are the standard errors what you expected given the multiple histogram figure above?

3.2.5 Sample standard error of the mean

The sample standard error (\(\mathrm{SE}_{\bar{Y}}\)) quantifies our uncertainty in our estimate of the population mean, \(\bar{Y}\). Specifically, \(\mathrm{SE}_{\bar{Y}}\) is the standard deviation of sampling distribution for \(\bar{Y}\). The equation for the \(\mathrm{SE}_{\bar{Y}}\) is the sample standard deviation divided by the square-root of the sample size:

\[ \mathrm{SE}_{\bar{Y}} = \frac{s}{\sqrt{n}} \]

There’s no function in R to calculate \(\mathrm{SE}_{\bar{Y}}\), but you know the functions for sample standard deviation and square-root. Use R to calculate the sample standard error of the mean for the following numbers:

2.16 -0.79 -0.18 1.62 -0.98 -1.15 -0.15 1.34 1.96 1.74

You should get \(\mathrm{SE}_{\bar{Y}}\) = 0.42.

3.2.6 95% confidence intervals

Confidence intervals are a way to show the plausible range of parameter values given the data. 95% confidence intervals will include the true population parameter 95% of the time. We’ll learn ways to calculate confidence intervals for different parameters throughout the class. Today, we’ll use the “2 SE” rule to approximate 95% confidence intervals for the sample mean \(\bar{Y}\). The lower bound and upper bounds of the approximate 95% confidence interval using the 2 SE rule are:

\[ \text{lower CI}: \bar{Y} - 2 \times \mathrm{SE}_{\bar{Y}} \] \[ \text{upper CI}: \bar{Y} + 2 \times \mathrm{SE}_{\bar{Y}} \] Use the mean() function and standard error of the mean to calculate the confidence interval for the data you used in the last section. You should get:

\(\bar{Y}\) Lower CI Upper CI
0.557 -0.2813638 1.395364

3.3 Questions

All questions are about the sampling distribution of the sample mean, \(\bar{Y}\)

  1. Import data

    We’ll use a dataset about leaf sizes from Wright et al. (2017). We’ll pretend that this is population of all leaf sizes in the world and look at the properties of random samples from the population.

    Use the read.csv(), $, [, and/or dplyr functions to

    • read-in the dataset
    • make a data.frame with only the latitude and leafsize_cm2 columns
    • remove all rows with missing values from leafsize_cm2
    • subset the data to only tropical latitudes between -23.43655° and 23.43655°
    • assign this data.frame to the name leafsize

    Hints:

    • to get ONE column you can use ...$column_name, to get multiple columns, you can use ...[, c("column_name1", "column_name2")]
    • you can figure out if a value is missing with the is.na function
    • latitudes between -23.43655° and 23.43655° is the same as abs(latitude) < 23.43655

    If you’ve done everything correctly, you should get the same values for the population mean seen below:

    mean(leafsize$leafsize_cm2)

    [1] 65.97642

  2. Create 1000 replicates each of sample sizes of 64, 256, and 1024 from the leafize data.frame you generated in a. I’ll show you the code for \(n = 64\), then you should copy and modify it to make similar objects called sample_dist256 and sample_dist1024. Then use rbind() to combine them into an object called sample_dists.

    # create replicate samples
    sample_dist64 <- rep_slice_sample(leafsize, n = 64, reps = 1e3)
    
    # add a column recording the sample size
    sample_dist64$sample_size <- 64
    
    # create replicate samples
    sample_dist256 <- ___
    
    # add a column recording the sample size
    ___ <- 256
    
    # create replicate samples
    sample_dist1024 <- ___
    
    # add a column recording the sample size
    ___ <- 1024
    
    # combine using `rbind`
    sample_dists <- ___
  3. Use the group_by() and summarize() functions to calculate the sample mean for each level of sample size (64, 256, or 1024) and replicate. Make sure you assign the output a name so you can use it to make a plot in the next part.

  4. Make a multi-panel histogram with separate panels for each sample size. It should look something like this, but will not be exactly the same because the simulations are random.

  5. How does the location and width of the sampling distribution for \(\bar{Y}\) change as \(n\) increases?