7  The Binomial Distribution and Analysis of Proportion Data

7.1 Goals

  • Gain familiarity with the Binomial Distribution
  • Test null hypotheses with the Binomial Test
  • Generate and interpret 95% confidence intervals of proportions

7.2 Learning the Tools

7.2.1 Binomial distribution

Let’s get familiar with the binomial distribution using a classic example: flipping a coin. We will simulate the coin flipping process using sample:

sample(c("H", "T"), size = 1, replace = TRUE, prob = c(0.5, 0.5))
[1] "H"

The above code simulates one coin toss that will come up either heads ("H") or tails ("T"). We are simulating a fair coin, meaning equal probabilities of heads and tails, thus we specify prob = c(0.5, 0.5). We could simulate the outcome of tossing a coin 4 times like this:

sample(c("H", "T"), size = 4, replace = TRUE, prob = c(0.5, 0.5))
[1] "H" "H" "T" "H"

Now let’s modify the code for 4 coin tosses to represent a biased coin, say one that comes up tails 70% of the time

sample(c("H", "T"), size = 4, replace = TRUE, prob = c(0.3, 0.7))
[1] "T" "T" "T" "H"

Let’s arbitrarily say that a “success” is if the coin lands tails up. We can count the number of successes like this:

# get some coin tosses (we're again using a fair coin)
toss <- sample(c("H", "T"), size = 4, replace = TRUE, prob = c(0.5, 0.5))
toss
[1] "H" "H" "H" "T"
# count successes
nsuccess <- sum(toss == "T")
nsuccess
[1] 1

The number of successes in 4 tosses is exactly the kind of situation where a binomial distribution is relevant. In this example, the probability of success is \(p = 0.5\) and the number of trials is \(n = 4\).

Let’s run the above code 20 times to estimate the frequencies of each potential outcome that will approximate the binomial distribution with \(p = 0.5\) and \(n = 4\).

To do that, simply run the above code 20 times and record the number of successes from each iteration. You can write down the success on paper, or the computer, or team up with a partner. Once you’re done, enter those data into a simple vector using the c function. You’re aiming to get something like this (you’re numbers will be different though):

# vector of the number of tails across 20 trials
ntails <- c(1, 2, 2, 1, 0, 
            1, 0, 2, 2, 4, 
            1, 3, 2, 2, 2, 
            1, 3, 3, 2, 2)

# use the `table` function to get frequencies
freqs <- table(ntails)
freqs
ntails
0 1 2 3 4 
2 5 9 3 1 

Let’s compare our estimated frequencies to the prediction from the binomial distribution. We will use dbinom to get the probability of each outcome directly from the binomial distribution. Remember, that for \(n = 4\) trials, all the possible outcomes are 0, 1, 2, 3, 4, which we can write in R code as 0:4:

dbinom(0:4, size = 4, prob = 0.5)
[1] 0.0625 0.2500 0.3750 0.2500 0.0625

Compare that to the frequencies divided by 20 (dividing by the number of simulations turns frequencies into probabilities)

freqs / 20
ntails
   0    1    2    3    4 
0.10 0.25 0.45 0.15 0.05 

Remember your numbers will be different, you might in fact not even have all the outcomes represented in your simulation. What we observe is that there is a general agreement between our simulation and the true probabilities from dbinom: 2 successes is the most probable outcome, 0 and 4 are the least likely.

Let’s also compare the probability of multiple events as estimated from our simulation and pbinom. Let’s look at the probability of 3 or fewer tails

# calculate the estimated probability
sum(ntails <= 3) / 20
[1] 0.95
# compare to the probability given by pbinom
pbinom(3, size = 4, prob = 0.5)
[1] 0.9375

Pretty close!

7.2.2 Binomial tests and confidence intervals

Doing a binomial test in R is much easier than doing one by hand.

The function binom.test() will do an exact binomial test. It requires three pieces of information in the input.

  • x: the number of successes
  • n: the number of trials
  • p: the proportion stated by the null hypothesis.

For example, let’s consider the hypothetical feeding trial discussed in lecture with pulelehua Kamehameha caterpillars. We imagined we did a feeding trial with 30 replicates between māmaki and Urtica. Out of those 30 trials we imagined the caterpillars choose māmaki 21 times. Let’s test the null hypothesis of no preference using binom.test. We need to tell binom.test the number of successes (x = 21), the number trials (n = 30) and the null hypothesis probability (p = 0.5).

binom.test(x = 21, n = 30, p = 0.5)

    Exact binomial test

data:  21 and 30
number of successes = 21, number of trials = 30, p-value = 0.04277
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5060410 0.8526548
sample estimates:
probability of success 
                   0.7 

In this case, the output of the function gives quite a bit of information. One key element that we will be looking for is the P-value; in this case R tells us that the P-value is 0.04277. This is the P-value that corresponds to a two-tailed test.

The binom.test() function also gives an estimate of the proportion of successes (in this case 0.7). It also gives an approximate 95% confidence interval. The method used to calculate this confidence interval is different from the method we discussed in lecture. When calculating confidence intervals for proportion data, you are welcome to use either the output of binom.test or the method we covered in lecture involving qbinom.

7.3 Questions

Let’s imagine a new feeding trial of pulelehua Kamehameha caterpillars. This time we will imagine a comparison inspired by Bongar et al. (2024). They looked at māmaki versus a close relative that happens to be invasive in Hawaiʻi: Cecropia obtusifolia. If pulelehua Kamehameha caterpillars could thrive on Cecropia obtusifolia that would be great.

Inspired by the results of Bongar et al. (2024), let’s suppose that out of 50 feeding trials where pulelehua Kamehameha caterpillars are given a choice between māmaki and Cecropia obtusifolia, they choose māmaki 38 times.

  1. Use binom.test() to calculate the estimated proportion of choosing māmaki. What is the 95% confidence interval for this proportion? What is the two-tailed \(P\)-value of the null hypothesis?

  2. Based on the results form binom.test you just produced does it seem promising that pulelehua Kamehameha caterpillars will be able to use Cecropia obtusifolia as a food plant?

  3. Based on the 95% confidence intervals, do you think the proportion of times the caterpillars choose māmaki over Urtica versus the proportion of times they choose māmaki over Cecropia obtusifolia are different from one another? Explain your answer. The hypothetical experiment comparing choice between māmaki and Urtica comes from our lectures, please refer to lecture slides for details on this hypothetical experiment.

  4. Find your own proportion data on the Internet or at home if you are doing this after lab. Develop a null hypothesis, state it in your report, and use binom.test() to test it. Did you reject your null hypothesis? Explain why or why not.