Understand the sampling distribution of an estimate
Investigate sampling error
Calculate standard error of the mean
Calculate confidence intervals
4.2 Learning the Tools
4.2.1 Simulating your own sampling distributions
Just like we did in lecture, we will simulate our own sampling distribution of the mean. For this lab, we will use the iris dataset to demonstrate how. You’ll take many random samples from the iris dataset, calculate the sample mean \(\bar{Y}\) for each, and plot the distribution.
4.2.1.1 Randomly sampling rows
We used the replicate function in lecture, but the dplyr package offers tools that are perhaps even more intuitive. You can take a random sample of rows in your data using the dplyr function slice_sample(). For example, to sample 5 rows at random from the iris data set, you would do the following:
Now try on your to increase the sample size 10, how would you do this?
4.2.1.2 Repeated sampling
To simulate a sampling distribution, we need to repeatedly randomly sample the “population” (in this case, we’re pretending the iris data set is the entire population). The infer package has a convenient function rep_slice_sample() that will repeat slice_sample() many times. It creates a new column called replicate to index each replicate sample. To randomly sample 5 rows 4 times from Iris setosa, we do:
# first load the infer pacakgelibrary(infer)# let's look at just the species setosa, so we'll need to subset our datajust_setosa <-subset(iris, Species =="setosa")many_setosa_samp <-rep_slice_sample(just_setosa, n =5, reps =4)head(many_setosa_samp)
Note: the infer package converted the iris data.frame to something called a tibble, for our purposes, think of a data.frame and a tibble as equivalent.
Now, let’s take 1000 samples of 10 from each species. I’ll show you the code for Iris setosa, then fill in the ___ sections below to do it for it the other species. After, you’ll need to combine the results before summarizing and plotting.
# let's use `set.seed` so we can compare answersset.seed(123)just_setosa <-subset(iris, Species =="setosa")many_setosa_samp10 <-rep_slice_sample(just_setosa, n =10, reps =1000)just_versicolor <-subset(iris, Species == ___)many_versicolor_samp10 <-rep_slice_sample(___, n =10, reps =1000)just_virginica <-subset(iris, Species == ___)many_virginica_samp10 <-rep_slice_sample(___, n =10, reps =1000)
If you used the same seed (123) in the above code then you should get these same answers:
Remember, we only need to use set.seed in situations where we’re trying to compare output from random sampling. You can delete set.seed after you compared your output to mine.
You can use the rbind function to combine all three sets of sampling distributions into a single tibble.
Now we have a very large set of samples to examine.
4.2.2 Sample mean \(\bar{Y}\)
To look at the distribution of the sample mean, we first need to calculate the sample mean for all 1000 replicates per species. We will use two helpful functions from dplyr to first group the iris_sample_dists data.frame by species and replicate, and then calculate the mean for each species with the summarize function. Let’s first look at the sampling distribution of the mean of sepal length:
Modify the code above to calculate sample means for Petal.Length.
4.2.3 Plot the sampling distribution \(\bar{Y}\)
We can apply the ggplot() tools we’ve already learned to plot a multiple histogram to compare the sampling distributions in each species.
ggplot(sepal_length_sample_dists, aes(Y_bar, fill = Species)) +geom_histogram(alpha =0.5, position ="identity", bins =30) +scale_fill_viridis_d()
See if you remember how to use facet_grid() to put each Specie in it’s own panel like this:
4.2.4 Standard error of the mean
The standard error of the mean helps us quantify our uncertainty about our estimate of the population mean given our sample size. We can calculate a hypothetical standard error for the perfect random sample of size \(n\) by dividing the population standard deviation by \(\sqrt{n}\): \(\sigma / \sqrt{n}\). Let’s pretend that the iris data set is the entire “population”, the population standard deviation for Sepal.Length is:
Species
\(\sigma\)
setosa
0.35
versicolor
0.51
virginica
0.62
Now calculate the hypothetical standard error of the mean for a sample size of 10. You should get:
Species
\(\sigma\)
\(\sigma/\sqrt{n}\)
setosa
0.35
0.1106797
versicolor
0.51
0.1612762
virginica
0.62
0.1960612
Let’s compare this hypothetical standard error of the mean to what we obtain from our simulations. Remember that the standard error of the mean is simply the standard deviation of the sampling distribution. That means we can get the answer by using the sd() function on our simulated sampling distribution.
# A tibble: 3 × 2
Species SE_Ybar
<fct> <dbl>
1 setosa 0.101
2 versicolor 0.147
3 virginica 0.182
Notice that we had to first group by Species, then summarize by taking the standard deviation of all of our sample means.
Are the population standard errors of the mean close to what you calculated from the simulations? Are the standard errors what you expected given the multiple histogram figure above?
4.2.5 Sample standard error of the mean
The sample standard error (\(\mathrm{SE}_{\bar{Y}}\)) quantifies our uncertainty in our estimate of the population mean, \(\bar{Y}\). Specifically, \(\mathrm{SE}_{\bar{Y}}\) is the standard deviation of sampling distribution for \(\bar{Y}\). The equation for the \(\mathrm{SE}_{\bar{Y}}\) is the sample standard deviation divided by the square-root of the sample size:
\[ \mathrm{SE}_{\bar{Y}} = \frac{s}{\sqrt{n}} \]
There’s no function in R to calculate \(\mathrm{SE}_{\bar{Y}}\), but you know the functions for sample standard deviation and square-root. Use R to calculate the sample standard error of the mean for the following numbers:
Confidence intervals are a way to show the plausible range of parameter values given the data. 95% confidence intervals will include the true population parameter 95% of the time. We’ll learn ways to calculate confidence intervals for different parameters throughout the class. Today, we’ll use the “2 SE” rule to approximate 95% confidence intervals for the sample mean \(\bar{Y}\). The lower bound and upper bounds of the approximate 95% confidence interval using the 2 SE rule are:
\[ \text{lower CI}: \bar{Y} - 2 \times \mathrm{SE}_{\bar{Y}} \]\[ \text{upper CI}: \bar{Y} + 2 \times \mathrm{SE}_{\bar{Y}} \] Use the mean() functions and standard error of the mean to calculate the confidence interval for the data you used in the last section. You should get:
\(\bar{Y}\)
Lower CI
Upper CI
0.557
-0.2813638
1.395364
4.3 Questions
All questions are about the sampling distribution of the sample mean, \(\bar{Y}\)
Import data
We’ll use a dataset about leaf sizes from Wright et al. (2017). We’ll pretend that this is population of all leaf sizes in the world and look at the properties of random samples from the population.
Use the read.csv(), $, [, and/or dplyr functions to
read-in the dataset
make a data.frame with only the latitude and leafsize_cm2 columns
remove all rows with missing values from leafsize_cm2
subset the data to only tropical latitudes between -23.43655° and 23.43655°
assign this data.frame to the name leafsize
Hints:
to get ONE column you can use ...$column_name, to get multiple columns, you can use ...[, c("column_name1", "column_name2")]
remember the select function
you can figure out if a value is missing with the is.na function
latitudes between -23.43655° and 23.43655° is the same as abs(latitude) < 23.43655
If you’ve done everything correctly, you should get the same values for the population mean seen below:
mean(leafsize$leafsize_cm2)
[1] 65.97642
Create 1000 replicates each of sample sizes of 64, 256, and 1024 from the leafize data.frame you generated in a. I’ll show you the code for \(n = 64\), then you should copy and modify it to make similar objects called sample_dist256 and sample_dist1024. Then use rbind() to combine them into an object called sample_dists.
# create replicate samplessample_dist64 <-rep_slice_sample(leafsize, n =64, reps =1e3)# add a column recording the sample sizesample_dist64$sample_size <-64# create replicate samplessample_dist256 <- ___# add a column recording the sample size___ <-256# create replicate samplessample_dist1024 <- ___# add a column recording the sample size___ <-1024# combine using `rbind`sample_dists <- ___
Use the group_by() and summarize() functions to calculate the sample mean for each level of sample size (64, 256, or 1024) and replicate. Make sure you assign the output a name so you can use it to make a plot in the next part.
Make a multi-panel histogram with separate panels for each sample size. It should look something like this, but will not be exactly the same because the simulations are random.
How does the location and width of the sampling distribution for \(\bar{Y}\) change as \(n\) increases?