21  The Poisson Generalized Linear Model

In the previous module we introduced GLMs using the binomial distribution as our example. Recall that the key insight is that different kinds of response variables call for different probability distributions. When data are binary (0/1) or represent proportions, the binomial distribution is appropriate. But what about data that are counts such as the number of trees in a plot or the number of species on an island? For these, the Poisson distribution is a natural choice.

21.1 The Poisson distribution

The Poisson distribution describes the probability of observing a non-negative, integer count. We write

\[ y \sim \text{Pois}(\lambda) \]

The parameter \(\lambda\) represents the mean of the distribution, But with only one parameter, we do not have very much flexibility to modulate the variance of the distribution. It turns out the Poisson imposes the strong constraint that the variance is equal to the mean, are both equal to \(\lambda\) (this will be a central point in an upcoming lab).

Here is an example probability mass function for a Poisson distribution with \(\lambda = 4\)

# range of count values
nn <- 0:12

ggplot(data.frame(x = nn, y = dpois(nn, 4)), aes(x, y)) +
    geom_bar(stat = "identity") +
    xlab(expression(italic(y))) +
    ylab("Probability") +
    theme_cowplot() +
    geom_vline(xintercept = 4, color = "orange", linewidth = 1.5)

Because \(\lambda\) is both the mean and variance of the distribution, and because counts can never be negative nor can a variance, a critical constraint is that \(\lambda \geq 0\). This constraint is what drives the need for a special link function in a Poisson GLM.

21.3 The species–area relationship

You’ll recall that in our R refresher we used the classic species area relationship (SAR) as an example for making data visualizations. To make a model of the SAR, a Poisson GLM might be an appropriate choice, after all number of species (the response variable in the SAR) is a count. One classic form for the (SAR) is expressed as

\[ \begin{aligned} S &= cA^z \\ \log(S) &= \log(c) + z\log(A) \end{aligned} \]

where \(S\) is the number of species, \(A\) is area, \(c\) is a constant, and \(z\) is the scaling exponent. Log-transforming both sides of the SAR reveals a linear equation…this should perhaps be sounding like a log link function situation. Indeed, if we recognize that we are modeling \(\bar{S}\), the average species richness, then we make these substitutions in the log link funciton equation

\[ \begin{aligned} \bar{S} &= \lambda \\ \log(A) &= x \\ b_0 &= \log(c) \\ b_1 &= z \end{aligned} \]

and we arrive at a log link function for \(\bar{S}\):

\[ \log(\bar{S}) = b_0 + b_1 \log(A) \]

Let’s visualize the data (note: we are excluding the two very large survey plots because they are extreme outliers on the x-axis)

On the log–log scale we see a roughly linear relationship, consistent with the SAR and log link assumptions of Poisson GLM.

21.4 Fitting a Poisson GLM in R

Fitting a Poisson GLM in R again uses the glm function, just as with the binomial GLM. The key difference is specifying family = poisson:

# note: i've done some data wrangling alread to make the
# object `dat_plt`
head(dat_plt)
  PlotID Plot_Area avg_temp_annual_c hii nspp
1      1   1000.00          19.90869  38    9
2      2   1017.88          22.25223  22    9
3      3   1017.88          22.25223  22    2
4      4   1017.88          22.25223  22    9
5      5   1017.88          22.25223  22    4
6      6   1017.88          22.25223  22    9
# here is the model
sar <- glm(nspp ~ log(Plot_Area),
           data = dat_plt,
           family = poisson)

Note that we only need to log-transform the explanatory variable (Plot_Area). The log link function automatically handles the log-transformation of the mean of the response. We do not need to log-transform nspp. We can inspect the coefficients as usual:

coefficients(sar)
   (Intercept) log(Plot_Area) 
    -1.8086075      0.4893528 

We can plug these into the SAR equation:

\[ S = e^{-1.809} A^{0.489} \]

21.5 Adding more explanatory variables

Area is a natural starting point for modeling species richness, but we are likely interested in additional predictors. In fact, area might even be a nuisance variable—one we need to account for statistically, but that is not our main scientific interest. We might be more interested in, say, human impact (hii) or temperature (avg_temp_annual_c).

Adding more predictors to a Poisson GLM is no different from adding them to a linear model—just extend the formula:

mod <- glm(nspp ~ log(Plot_Area) + hii + avg_temp_annual_c,
           data = dat_plt,
           family = poisson)

coefficients(mod)
      (Intercept)    log(Plot_Area)               hii avg_temp_annual_c 
     -2.320992411       0.516567381       0.011011301       0.007295895 

After accounting for plot area, species richness appears to increase (weakly) with both human impact index and mean annual temperature. Given our conservation concern about human impact, this might not make sense at first. However, there is at least some scientific thought that non-native plants augment species richness (Sax & Gaines 2003) and thus heavily impacted sites have many introduced species that add to the total count. However, the other part of the story is temperature—hotter areas have more species. This could be because lowland areas are hotter and also more likely to be impacted by humans. But it is also worth noting that, historically, hot and dry lowland forests were once incredibly diverse with unique endemic species (Rock 1913) that are indeed now being lost by human impact. It is also worth taking a moment to contemplate the Rock (1913) publication. Rock was a botanist paid by the American government to survey plants in Hawaiʻi. 1913 is only 30 years after the 1893 illegal overthrow of the Hawaiian Kingdom. While Rock (1913) provides a convenient citation, his work also represents science as colonialism and colonialism as science.

21.6 Collinearity

At a much more trivial level—in terms of ethics and biology—adding more explanatory variables requires care. When two or more predictors are correlated with each other, we run into collinearity, and that causes problems for estimation.

To see why, we can visualize the likelihood surface for the slopes of human impact and temperature from the model above:

Instead of a well-defined peak, the likelihood surface has a long, flat ridge, meaning many—infinite—combinations of the two slope coefficients have nearly the same likelihood. That makes it difficult and unreliable to identify the unique best-fitting parameter values.

When we visualize human impact and temperature we see they are indeed correlated with each other:

ggplot(dat_plt, aes(hii, avg_temp_annual_c)) +
    geom_point() +
    xlab("Human impact index") +
    ylab("Mean annual temperature (C)") +
    theme_cowplot()

The practical takeaway: before including multiple predictors in a model, check for pairwise correlations among them. A commonly used guideline is to only include predictors with pairwise correlation coefficients \(-0.6 < r < 0.6\). You can compute correlations with cor() in R:

cor(dat_plt$hii, dat_plt$avg_temp_annual_c, use = "complete.obs")
[1] 0.6397327

If two predictors are too highly correlated, you will typically need to choose one or the other, or find a way to combine them, such as through principal component analysis or other dimension reduction techniques, rather than including both in the same model.

21.7 Slides

21.8 References

Rock JF. 1913. The indigenous trees of the hawaiian islands. Patronage, Honoulu: Hawaii.
Sax DF, Gaines SD. 2003. Species diversity: From global decreases to local increases. Trends in Ecology & Evolution 18:561–566. Elsevier.