Simple regression models of Galapagos Islands biodiversity

This regression modelling exercise draws directly on the example in Chapter 7 of Statistical Methods for Geography by Peter A. Rogerson (2001, Sage: London). The data in this document are largely the same as those in that work, although a couple of missing elevation values estimated by Rogerson (for Baltra and Seymour islands) have been included here from OSM or other sources.

Load some needed libraries

library(sf)
library(tmap)
library(dplyr)
library(GGally) # for scatter plot matrices

# set the tmap mode to web maps
tmap_mode("view")

Read and inspect the data

galapagos.data <- read.csv('galapagos.csv', sep=',')

Now inspect the data.

as_tibble(galapagos.data)

## # A tibble: 30 × 8
##    ISLAND            NUM_SPEC NATIVE  AREA  ELEV NEAREST S_CRUZ AREA_ADJ
##    <chr>                <int>  <int> <dbl> <int>   <dbl>  <dbl>    <dbl>
##  1 Isla Baltra             58     23 25.1    173     0.6    0.6     1.84
##  2 Isla Bartolomé          31     21  1.24   109     0.6   26.3   572.  
##  3 Isla Caldwell            3      3  0.21   114     2.8   58.7     0.78
##  4 Isla Champion           25      9  0.1     46     1.9   47.4     0.18
##  5 Isla Coamano             2      1  0.05    20     1.9    1.9   904.  
##  6 Isla Daphne Major       18     11  0.34   119     8      8       1.84
##  7 Isla Daphne Minor       24      6  0.08    93     6     12       0.34
##  8 Isla Darwin             10      7  2.33   168    34.1  290.      2.85
##  9 Isla Eden                8      4  0.03    37     0.4    0.4    18.0 
## 10 Isla Enderby             2      2  0.18   112     2.6   50.2     0.1 
## # … with 20 more rows

Two of these variables NUM_SPEC and NATIVE are potential dependent variables to be ‘explained’, while the others are potential independent explanatory variables, drawing on ideas from island biogeography. The variables are as followw:

ISLAND the island name
NUM_SPEC and NATIVE the total number of species (including native) and native (only) species
AREA island area in square km
ELEV maximum elevation of the island in m
NEAREST distance in km to the nearest other island in the archipelago
S_CRUZ distance in km to Isla Santa Cruz
AREA_ADJ area in square km of the nearest other island

We use GGally::ggpairs to examine all the variables in a single display. Note that we don’t examine the island names in this display.

ggpairs(galapagos.data, columns = 2:8)

Of note here is that as we might expect the strongest correlation is between NUM_SPEC and NATIVE. Since we will build models to explain each of these separately, and not include either as explanatory variables, this is OK. AREA and ELEV are correlated which is a potential problem from a technical perspective (note that each might be expected to account for some biodiversity). No other variables are strongly correlated with either of the biodiversity indicators. Based on this plot we might have a hunch that a model for NUM_SPEC based on AREA and ELEV will work well.

Model building

A kitchen sink model

Rogerson starts with a ‘kitchen sink’ model including all variables, although he recommends against this approach. But, hey! — let’s give it a go.

m1 <- lm(NUM_SPEC ~ AREA + ELEV + NEAREST + S_CRUZ + AREA_ADJ, data = galapagos.data)
summary(m1)

## 
## Call:
## lm(formula = NUM_SPEC ~ AREA + ELEV + NEAREST + S_CRUZ + AREA_ADJ, 
##     data = galapagos.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.56  -31.92  -10.87   31.30  178.83 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.35562   18.37764   0.672 0.507803    
## AREA        -0.02494    0.02214  -1.127 0.270997    
## ELEV         0.31877    0.05239   6.085 2.77e-06 ***
## NEAREST     -0.03526    1.04158  -0.034 0.973273    
## S_CRUZ      -0.27007    0.21215  -1.273 0.215198    
## AREA_ADJ    -0.07461    0.01739  -4.291 0.000252 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.18 on 24 degrees of freedom
## Multiple R-squared:  0.7719, Adjusted R-squared:  0.7244 
## F-statistic: 16.25 on 5 and 24 DF,  p-value: 5.04e-07

The results provided by the summary function point to ELEV and AREA_ADJ as the two most significant explanatory variables. Although our results are different in detail from the Rogerson book (because of minor differences in the data), this result matches well qualitatively with Rogerson’s, which is encouraging. The regression equation corresponding with this model summary can be assembled using the estimated coefficients. These suggest that each additional metre of maximum elevation on an island accounts for 0.319 additional species, while each additional square km of the nearest other island reduces the number of species by 0.075. The first of these seems reasonable, the second might require more complicated explanation. Perhaps if the nearest other island is large, fewer species are present on this island because they end up ‘over there’.

More to the point, the standard error of the model is 60.18, which given that the median number of species on all islands is 42 is rather high! The other surprising feature of the model is that the coefficient associated with the AREA variable is negative, meaning that the larger the island, the fewer species we expect. This is a consequence of the fact that the model also includes elevation ELEV and this ‘masks out’ the effect of larger areas.

Another model

The unexpected negative coefficient associated with the AREA variable suggests we should remove it from consideration, so let’s try that, while also removing the other non-signficant variables from the kitchen sink model.

m2 <- lm(NUM_SPEC ~ ELEV + AREA_ADJ, data = galapagos.data)
summary(m2)

## 
## Call:
## lm(formula = NUM_SPEC ~ ELEV + AREA_ADJ, data = galapagos.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.452  -33.175   -7.787   19.485  202.930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.65550   14.77915   0.315 0.755180    
## ELEV         0.27367    0.03139   8.717 2.48e-09 ***
## AREA_ADJ    -0.06830    0.01544  -4.424 0.000143 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.81 on 27 degrees of freedom
## Multiple R-squared:  0.738,  Adjusted R-squared:  0.7186 
## F-statistic: 38.03 on 2 and 27 DF,  p-value: 1.402e-08

The notable feature of this model is that its overall characteristics in terms of overall variance accounted for (the R-squared statistics) and the standard error are very similar to those of the kitchen sink model, but based on only two independent variables not 5. This means the model is more parsimonious in that it does a lot with a little, and we might stop here if we are satisfied that we can provide an account of this model that is in accordance with the theoretical ideas about island biodiversity.

We might also examine very stripped down models, perhaps NUM_SPEC ~ ELEV or NUM_SPEC ~ AREA, to see how well these work (it turns out surprisingly badly…). For example:

m3 <- lm(NUM_SPEC ~ AREA, data = galapagos.data)
summary(m3)

## 
## Call:
## lm(formula = NUM_SPEC ~ AREA, data = galapagos.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -99.495 -53.431 -29.045   3.423 306.137 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 63.78286   17.52442   3.640 0.001094 ** 
## AREA         0.08196    0.01971   4.158 0.000275 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.73 on 28 degrees of freedom
## Multiple R-squared:  0.3817, Adjusted R-squared:  0.3596 
## F-statistic: 17.29 on 1 and 28 DF,  p-value: 0.0002748

The technical problem of ELEV and AREA being strongly correlated, which makes it problematic to include both in a model, may in this case also be a real effect. Many of these small islands are effectively single peaks so that elevation and area are direct proxies for one another. Thus, it may be that ELEV is more important than AREA as an indicator of the variety of habitat niches provided by an island.

A significant issue with the variables in this dataset is that they are highly skewed in most cases, mainly because the largest island is so much larger than all the others. This would suggest that it may make sense to perform various transformations on the variables to make their distributions more ‘normal’. A more contemporary approach would be to use machine-learning methods to fit non-linear models to these data. Those methods are beyond our present scope, but if you are interested you might explore them (and we can help out!) in the mini-project exercises.

Spatial characteristics of the models

Rather than spend too much time worrying about the exact details of the model, here we focus on mapping the models we have built. To do that we need to read in a spatial dataset.

galapagos.s <- st_read('galapagos.gpkg')

## Reading layer `galapagos' from data source 
##   `/home/osullid3/Documents/teaching/Geog315/slides/regression/example/galapagos.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 20 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -92.00919 ymin: -1.411568 xmax: -89.24089 ymax: 1.681402
## Geodetic CRS:  WGS 84

This dataset is missing a few of the islands (they weren’t present in OSM data…) but enough are present to get an idea of the distribution of the model variables.

First make some additional columns in the data table based on the models’ fitted and residual (error) values.

galapagos.results <- galapagos.data %>%
  mutate(m1fit = m1$fitted.values, m1error = -m1$residuals,
         m2fit = m2$fitted.values, m2error = -m2$residuals)

Then join these to the spatial data for mapping (both datasets have an ISLAND variable which the join will be based on).

galapagos.s <- galapagos.s %>%
  inner_join(galapagos.results)

## Joining, by = "ISLAND"

Map the predicted values

Now we can map the actual and predicted number of species. We will use tmap_arrange to do this for side-by-side comparisons.

m1map <- tm_shape(galapagos.s) + tm_polygons(col = "m1fit", style = "cont", palette = "Greens", title="model 1")
m2map <- tm_shape(galapagos.s) + tm_polygons(col = "m2fit", style = "cont", palette = "Greens", title="model 2")
actual <- tm_shape(galapagos.s) + tm_polygons(col = "NUM_SPEC", style = "cont", palette = "Greens", title="actual")
tmap_arrange(m1map, m2map, actual, sync = TRUE)

It is clear from this that both models substantiall overpredict the number of species present on the largest island Isla Isabela, and underpredicts diversity on Isla Santa Cruz. Reading up on the complex histories (both human and natural) of the islands, suggests many reasons why our simplistic models might not do a particularly great job.

Perhaps an easier way to visualise the model performance is mapping model errors (the residuals).

m1err <- tm_shape(galapagos.s) + tm_polygons(col = "m1error", style = "cont", palette = "RdBu", title="m1 error")
m2err <- tm_shape(galapagos.s) + tm_polygons(col = "m2error", style = "cont", palette = "RdBu", title="m2 error")
actual <- tm_shape(galapagos.s) + tm_polygons(col = "NUM_SPEC", style = "cont", palette = "Greens", title="actual")
tmap_arrange(m1err, m2err, actual, sync = TRUE)

It is worth noticing that I have mapped the negative of the model residuals, as this is maybe a bit easier to think about as an ‘error’. The residual is the difference actual - model, so a positive residual is when the model predicts a lower value than the actual, meaning that the model error is negative. This can get a little bit confusing, so it is important to be clear in your head about the signs of model errors, and I find it helpful to take this approach.

Overall, it is difficult to draw many firm conclusions from this exercise, although the interpretive nature of regression modelling as a way to explore the complexities of data is hopefully apparent.

From a geographical perspective, of particular note is that we are especially interested in where a model performs well and where it performs badly as this may provide clues to more nuanced explanations of what is happening. In this case, Isabela is the most human-affected of the islands, and conservation activity is most focused on Santa Cruz, and it seems clear that these would have to be accounted for in any more substantial modelling exercise.

Automating searching all the possible models

It’s possible to make many many different models for these data. There are 5 variables (AREA, ELEV, NEAREST, S_CRUZ, AREA_ADJ) that might be included or not. That allows for up to 31 different models! (Each variable can be included or not, so 2 x 2 x 2 x 2 x 2 = 32 possibilities, one of which is no variables, so we exclude that from our count). If we had more variables, there would be even more possibilities.

In this situation it can make sense to get R to search the possibilities. There are many ways that this can be done. The most basic is called stepwise regression where we allow the computer to iteratively include or drop variables one at a time, only continuing on if the model improves based on some measure.

The easiest way to understand this is to try it. The function required is called step() and it takes as its argument a starting model. Since we made a kitchen-sink model (m1) let’s use that and see what we get.

step(m1, direction = "both")

## Start:  AIC=251.14
## NUM_SPEC ~ AREA + ELEV + NEAREST + S_CRUZ + AREA_ADJ
## 
##            Df Sum of Sq    RSS    AIC
## - NEAREST   1         4  86918 249.15
## - AREA      1      4597  91511 250.69
## - S_CRUZ    1      5869  92783 251.10
## <none>                   86914 251.14
## - AREA_ADJ  1     66692 153606 266.23
## - ELEV      1    134084 220998 277.14
## 
## Step:  AIC=249.15
## NUM_SPEC ~ AREA + ELEV + S_CRUZ + AREA_ADJ
## 
##            Df Sum of Sq    RSS    AIC
## - AREA      1      4744  91662 248.74
## <none>                   86918 249.15
## - S_CRUZ    1      9873  96791 250.37
## + NEAREST   1         4  86914 251.14
## - AREA_ADJ  1     72322 159240 265.31
## - ELEV      1    141759 228677 276.17
## 
## Step:  AIC=248.74
## NUM_SPEC ~ ELEV + S_CRUZ + AREA_ADJ
## 
##            Df Sum of Sq    RSS    AIC
## <none>                   91662 248.74
## + AREA      1      4744  86918 249.15
## - S_CRUZ    1      8176  99839 249.30
## + NEAREST   1       151  91511 250.69
## - AREA_ADJ  1     69216 160879 263.62
## - ELEV      1    277789 369451 288.56

## 
## Call:
## lm(formula = NUM_SPEC ~ ELEV + S_CRUZ + AREA_ADJ, data = galapagos.data)
## 
## Coefficients:
## (Intercept)         ELEV       S_CRUZ     AREA_ADJ  
##    18.89774      0.27224     -0.24726     -0.06692

This rather complicated output shows us starting from m1 what happens if we try dropping each of the possible variables. In doing so it ranks the models that result based on a measure AIC (Akaike Information Criterion). The lower this value, the better the model, relative to others under consideration. So in ‘round 1’ m1 we start with AIC = 251.14 and the step that makes the best new model is to remove the variable NEAREST. In the next round the same process causes us to drop AREA.

This process continues until we end up with a model that can’t be improved (based on the AIC measure), given by the function NUM_SPEC ~ ELEV + S_CRUZ + AREA_ADJ. We can then build that model (provided we find it plausible and can explain it in a reasonable way!)

m4 <- lm(NUM_SPEC ~ ELEV + S_CRUZ + AREA_ADJ, data = galapagos.data)
summary(m4)

## 
## Call:
## lm(formula = NUM_SPEC ~ ELEV + S_CRUZ + AREA_ADJ, data = galapagos.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -88.19 -32.52 -12.15  24.75 189.92 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.89774   17.19617   1.099 0.281862    
## ELEV         0.27224    0.03067   8.877 2.38e-09 ***
## S_CRUZ      -0.24726    0.16236  -1.523 0.139854    
## AREA_ADJ    -0.06692    0.01510  -4.431 0.000151 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.38 on 26 degrees of freedom
## Multiple R-squared:  0.7595, Adjusted R-squared:  0.7317 
## F-statistic: 27.36 on 3 and 26 DF,  p-value: 3.326e-08

What may be questionable about this model is the inclusion of a variable that appears to not be signficant (S_CRUZ), and we would need to weigh up the value of retaining this variable or not.

Note that it is also possible to start with a simple model and build to a more complicated one:

step(m3, NUM_SPEC ~ AREA + ELEV + NEAREST + S_CRUZ + AREA_ADJ, direction = "both")

## Start:  AIC=273.06
## NUM_SPEC ~ AREA
## 
##            Df Sum of Sq    RSS    AIC
## + ELEV      1     66602 169009 265.10
## <none>                  235611 273.06
## + S_CRUZ    1      4563 231048 274.48
## + AREA_ADJ  1      2850 232761 274.70
## + NEAREST   1      1148 234463 274.92
## - AREA      1    145470 381081 285.49
## 
## Step:  AIC=265.1
## NUM_SPEC ~ AREA + ELEV
## 
##            Df Sum of Sq    RSS    AIC
## + AREA_ADJ  1     72219  96791 250.37
## - AREA      1      3192 172201 263.66
## <none>                  169009 265.10
## + S_CRUZ    1      9769 159240 265.31
## + NEAREST   1         0 169009 267.10
## - ELEV      1     66602 235611 273.06
## 
## Step:  AIC=250.37
## NUM_SPEC ~ AREA + ELEV + AREA_ADJ
## 
##            Df Sum of Sq    RSS    AIC
## + S_CRUZ    1      9873  86918 249.15
## - AREA      1      3048  99839 249.30
## <none>                   96791 250.37
## + NEAREST   1      4008  92783 251.10
## - AREA_ADJ  1     72219 169009 265.10
## - ELEV      1    135970 232761 274.70
## 
## Step:  AIC=249.15
## NUM_SPEC ~ AREA + ELEV + AREA_ADJ + S_CRUZ
## 
##            Df Sum of Sq    RSS    AIC
## - AREA      1      4744  91662 248.74
## <none>                   86918 249.15
## - S_CRUZ    1      9873  96791 250.37
## + NEAREST   1         4  86914 251.14
## - AREA_ADJ  1     72322 159240 265.31
## - ELEV      1    141759 228677 276.17
## 
## Step:  AIC=248.74
## NUM_SPEC ~ ELEV + AREA_ADJ + S_CRUZ
## 
##            Df Sum of Sq    RSS    AIC
## <none>                   91662 248.74
## + AREA      1      4744  86918 249.15
## - S_CRUZ    1      8176  99839 249.30
## + NEAREST   1       151  91511 250.69
## - AREA_ADJ  1     69216 160879 263.62
## - ELEV      1    277789 369451 288.56

## 
## Call:
## lm(formula = NUM_SPEC ~ ELEV + AREA_ADJ + S_CRUZ, data = galapagos.data)
## 
## Coefficients:
## (Intercept)         ELEV     AREA_ADJ       S_CRUZ  
##    18.89774      0.27224     -0.06692     -0.24726

We actually end up with the same model, but this is not true in general and will depend on the details of a particular dataset.

The most important thing is that the model you end up with is based not just on allowing the computer to do its thing, but an assessment of the scientific relevance of the results: in short, can you explain your model in a way that makes sense to you and others?

A further exercise

Further exploration of this material perhaps modelling NATIVE as the dependent variable can be explored by downloading the materials in this zip file.