20 Building Binomial Generalized Linear Models

```r
tree <- read.csv("https://raw.githubusercontent.com/uhm-biostats/stat-mod/refs/heads/main/data/OpenNahele_Tree_Data.csv")
clim <- read.csv("https://raw.githubusercontent.com/uhm-biostats/stat-mod/refs/heads/main/data/plot_climate.csv")
hii_geo <- read.csv("https://raw.githubusercontent.com/uhm-biostats/stat-mod/refs/heads/main/data/hii_geo.csv")
```
This lab will guide you through building, visualizing, and interpreting binomial GLMs and the kinds of data they model. You can download the template .qmd file for this lab here:
The template will prompt you to work through the following sections.
20.1 Building binomial GLMs
In the lecture on binomial GLMs we saw that elevation seemed to be a good predictor of the presence or absence of non-native species, with a lower probability of encountering non-native trees at higher elevations. Our job now is to figure out what it is about elevation that might be responsible for this pattern. Many variables change across elevation: temperature, rainfall, and associated climatic measures like evapotranspiration. Human impact, too, might vary across elevation. In this lab we will select from these variables to build three different binomial GLMs, then visualize and interpret the data and models to understand what ecological processes might be operating across elevation to drive changes in the probability of encountering non-native trees.
Our models can take many different forms with hypothetical formulas looking like
```r
cbind(non_nat, num_tre - non_nat) ~ x_1
cbind(non_nat, num_tre - non_nat) ~ x_1 + x_2
cbind(non_nat, num_tre - non_nat) ~ x_3 * x_4
```
In all cases we will use the matrix of “successes” and “failures” as our response variable. Keep the explanatory variables to three or fewer. At least one model should have only a single explanatory variable.
First we need to load and prepare the data.
20.1.1 Data preparations
We will need one data object with rows as survey plots and columns for all the potential explanatory variables. Additionally we will need a column for non_nat (the number of non-native trees) and num_tre (the total number of trees in each plot). Back in our R refresher we made a very similar data object, so let’s refer back to those exercises to see how to make the current data object we need.
First, load the data (the read.csv() calls at the top of this page pull in the tree, climate, and human-impact tables).
Go through the steps from the R refresher for joining the data and using the split-apply-combine workflow to produce a data object resolved at the level of one survey plot per row. Note that you do not need to calculate all the variables like basal and basal_native, just the number of non-native trees and the total number of trees. In the R refresher code we calculated total_abund, which is the same computation as the num_tre column we want now. You can also look at how we calculated total_abund_native to see how to calculate non_nat (hint: for natives we did Native_Status == "native"; now we need Native_Status == "alien").
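The aggregation step might be sketched like this. A toy data frame stands in for the real tree data here, and the plot identifier column name PlotID is an assumption (check names(tree) for the real name); the dplyr workflow mirrors the split-apply-combine approach from the R refresher:

```r
library(dplyr)

# Toy stand-in for the tree data; real column names may differ
# (PlotID is a hypothetical plot identifier)
toy_tree <- data.frame(
  PlotID = c("A", "A", "A", "B", "B"),
  Native_Status = c("native", "alien", "alien", "native", "native")
)

plot_dat <- toy_tree |>
  group_by(PlotID) |>
  summarize(
    num_tre = n(),                           # total trees per plot
    non_nat = sum(Native_Status == "alien")  # non-native trees per plot
  )
plot_dat

# With the real data you would then join the explanatory variables, e.g.
# plot_dat <- left_join(plot_dat, clim, by = "PlotID")
```

After the join, each row is one survey plot carrying both the counts and the candidate explanatory variables.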
20.1.2 Working with glm
Before actually using glm, for each of your chosen models write down your hypothesis for whether the explanatory variables will have a positive or negative effect on the response variable. Then use glm to build each model. Once built, have a look at the model parameters using coefficients and compare the sign of the parameters to your hypotheses.
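As a rough sketch of this step, here simulated toy data stand in for your plot-level data frame, with elev as a placeholder explanatory variable; swap in your own data and chosen variables:

```r
# Hypothetical plot-level data: 50 plots, 30 trees each, with the
# probability of a tree being non-native declining with elevation
set.seed(1)
toy <- data.frame(elev = seq(0, 2000, length.out = 50), num_tre = 30)
toy$non_nat <- rbinom(50, size = toy$num_tre,
                      prob = plogis(2 - 0.003 * toy$elev))

# A binomial GLM with the successes/failures matrix as the response
mod <- glm(cbind(non_nat, num_tre - non_nat) ~ elev,
           family = binomial, data = toy)

coefficients(mod)  # is the sign of the slope what you hypothesized?
```

Compare the sign of each fitted coefficient against the direction you wrote down before fitting.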
20.2 Visualizing data and models
Follow the lecture examples to visualize the data and model predictions (on the link function scale). When visualizing the link function predictions, use the first approach from the lecture where we extracted the coefficients and made a function to calculate the logit(p) response.
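A minimal sketch of that coefficient-extraction approach, continuing the hypothetical single-predictor model from above (elev and the simulated data are placeholders for your own model):

```r
library(ggplot2)

# Refit the toy model so this chunk stands alone
set.seed(1)
toy <- data.frame(elev = seq(0, 2000, length.out = 50), num_tre = 30)
toy$non_nat <- rbinom(50, 30, plogis(2 - 0.003 * toy$elev))
mod <- glm(cbind(non_nat, num_tre - non_nat) ~ elev,
           family = binomial, data = toy)

# Extract the coefficients and build a function for logit(p)
b <- coefficients(mod)
logit_p <- function(x) b[1] + b[2] * x

# Plot the prediction on the link-function scale
p <- ggplot(toy, aes(elev, logit_p(elev))) +
  geom_line() +
  labs(x = "Elevation", y = "logit(p)")
p
```

On the link scale the prediction is a straight line, which makes the sign and magnitude of the slope easy to read off the plot.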
When your models have more than one explanatory variable, visualizations can get tricky. For this lab we will take the approach of visualizing the response versus one explanatory variable at a time. So, for example, if your model formula is
```r
cbind(non_nat, num_tre - non_nat) ~ x_1 + x_2
```
you would need to plot two logit(p) responses:
\[
\begin{aligned}
\mathrm{logit}(p) &= b_0 + b_1 x_1 \quad \text{and} \\
\mathrm{logit}(p) &= b_0 + b_2 x_2
\end{aligned}
\]
(note the intercepts are the same, which may not always be the case depending on interaction terms).
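Those two link-scale lines might be built like this. The data are simulated here, and x_1, x_2, and my_dat are placeholders for your own variables and plot-level data frame:

```r
# Hypothetical two-predictor data: 40 plots, 25 trees each
set.seed(2)
my_dat <- data.frame(x_1 = runif(40, 0, 10), x_2 = runif(40, 0, 5))
my_dat$num_tre <- 25
my_dat$non_nat <- rbinom(40, 25,
                         plogis(-1 + 0.3 * my_dat$x_1 - 0.4 * my_dat$x_2))

mod <- glm(cbind(non_nat, num_tre - non_nat) ~ x_1 + x_2,
           family = binomial, data = my_dat)
b <- coefficients(mod)

# One logit(p) function per explanatory variable, sharing intercept b[1]
logit_p_x1 <- function(x) b[1] + b[2] * x
logit_p_x2 <- function(x) b[1] + b[3] * x
```

Each function can then be plotted against its own explanatory variable, as in the single-predictor case.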
For this model you would also have two data plots, for which the code might look something like
```r
library(ggplot2)
library(ggpointdensity)  # provides geom_pointdensity()

# x_1 plot
ggplot(my_dat, aes(x_1, non_nat / num_tre)) +
  geom_pointdensity() +
  scale_color_viridis_c()

# x_2 plot
ggplot(my_dat, aes(x_2, non_nat / num_tre)) +
  geom_pointdensity() +
  scale_color_viridis_c()
```

For this lab, there is no need to add the model predictions to the plots of the data.