This regression modelling exercise draws directly on the example in Chapter 7 of Statistical Methods for Geography by Peter A. Rogerson (2001, Sage: London). The data in this document are largely the same as those in that work, although a couple of missing elevation values estimated by Rogerson (for Baltra and Seymour islands) have been included here from OSM or other sources.
library(sf)
library(tmap)
library(dplyr)
library(GGally) # for scatter plot matrices
# set the tmap mode to web maps
tmap_mode("view")
galapagos.data <- read.csv('galapagos.csv', sep=',')
Now inspect the data.
as_tibble(galapagos.data)
## # A tibble: 30 × 8
## ISLAND NUM_SPEC NATIVE AREA ELEV NEAREST S_CRUZ AREA_ADJ
## <chr> <int> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Isla Baltra 58 23 25.1 173 0.6 0.6 1.84
## 2 Isla Bartolomé 31 21 1.24 109 0.6 26.3 572.
## 3 Isla Caldwell 3 3 0.21 114 2.8 58.7 0.78
## 4 Isla Champion 25 9 0.1 46 1.9 47.4 0.18
## 5 Isla Coamano 2 1 0.05 20 1.9 1.9 904.
## 6 Isla Daphne Major 18 11 0.34 119 8 8 1.84
## 7 Isla Daphne Minor 24 6 0.08 93 6 12 0.34
## 8 Isla Darwin 10 7 2.33 168 34.1 290. 2.85
## 9 Isla Eden 8 4 0.03 37 0.4 0.4 18.0
## 10 Isla Enderby 2 2 0.18 112 2.6 50.2 0.1
## # … with 20 more rows
Two of these variables NUM_SPEC
and NATIVE
are potential dependent variables to be ‘explained’, while the others are potential independent explanatory variables, drawing on ideas from island biogeography. The variables are as followw:
We use GGally::ggpairs
to examine all the variables in a single display. Note that we don’t examine the island names in this display.
ggpairs(galapagos.data, columns = 2:8)
Of note here is that as we might expect the strongest correlation is between NUM_SPEC
and NATIVE
. Since we will build models to explain each of these separately, and not include either as explanatory variables, this is OK. AREA
and ELEV
are correlated which is a potential problem from a technical perspective (note that each might be expected to account for some biodiversity). No other variables are strongly correlated with either of the biodiversity indicators. Based on this plot we might have a hunch that a model for NUM_SPEC
based on AREA
and ELEV
will work well.
Rogerson starts with a ‘kitchen sink’ model including all variables, although he recommends against this approach. But, hey! — let’s give it a go.
m1 <- lm(NUM_SPEC ~ AREA + ELEV + NEAREST + S_CRUZ + AREA_ADJ, data = galapagos.data)
summary(m1)
##
## Call:
## lm(formula = NUM_SPEC ~ AREA + ELEV + NEAREST + S_CRUZ + AREA_ADJ,
## data = galapagos.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.56 -31.92 -10.87 31.30 178.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.35562 18.37764 0.672 0.507803
## AREA -0.02494 0.02214 -1.127 0.270997
## ELEV 0.31877 0.05239 6.085 2.77e-06 ***
## NEAREST -0.03526 1.04158 -0.034 0.973273
## S_CRUZ -0.27007 0.21215 -1.273 0.215198
## AREA_ADJ -0.07461 0.01739 -4.291 0.000252 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.18 on 24 degrees of freedom
## Multiple R-squared: 0.7719, Adjusted R-squared: 0.7244
## F-statistic: 16.25 on 5 and 24 DF, p-value: 5.04e-07
The results provided by the summary
function point to ELEV
and AREA_ADJ
as the two most significant explanatory variables. Although our results are different in detail from the Rogerson book (because of minor differences in the data), this result matches well qualitatively with Rogerson’s, which is encouraging. The regression equation corresponding with this model summary can be assembled using the estimated coefficients. These suggest that each additional metre of maximum elevation on an island accounts for 0.319 additional species, while each additional square km of the nearest other island reduces the number of species by 0.075. The first of these seems reasonable, the second might require more complicated explanation. Perhaps if the nearest other island is large, fewer species are present on this island because they end up ‘over there’.
More to the point, the standard error of the model is 60.18, which given that the median number of species on all islands is 42 is rather high! The other surprising feature of the model is that the coefficient associated with the AREA
variable is negative, meaning that the larger the island, the fewer species we expect. This is a consequence of the fact that the model also includes elevation ELEV
and this ‘masks out’ the effect of larger areas.
The unexpected negative coefficient associated with the AREA
variable suggests we should remove it from consideration, so let’s try that, while also removing the other non-signficant variables from the kitchen sink model.
m2 <- lm(NUM_SPEC ~ ELEV + AREA_ADJ, data = galapagos.data)
summary(m2)
##
## Call:
## lm(formula = NUM_SPEC ~ ELEV + AREA_ADJ, data = galapagos.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.452 -33.175 -7.787 19.485 202.930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.65550 14.77915 0.315 0.755180
## ELEV 0.27367 0.03139 8.717 2.48e-09 ***
## AREA_ADJ -0.06830 0.01544 -4.424 0.000143 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.81 on 27 degrees of freedom
## Multiple R-squared: 0.738, Adjusted R-squared: 0.7186
## F-statistic: 38.03 on 2 and 27 DF, p-value: 1.402e-08
The notable feature of this model is that its overall characteristics in terms of overall variance accounted for (the R-squared statistics) and the standard error are very similar to those of the kitchen sink model, but based on only two independent variables not 5. This means the model is more parsimonious in that it does a lot with a little, and we might stop here if we are satisfied that we can provide an account of this model that is in accordance with the theoretical ideas about island biodiversity.
We might also examine very stripped down models, perhaps NUM_SPEC ~ ELEV
or NUM_SPEC ~ AREA
, to see how well these work (it turns out surprisingly badly…). For example:
m3 <- lm(NUM_SPEC ~ AREA, data = galapagos.data)
summary(m3)
##
## Call:
## lm(formula = NUM_SPEC ~ AREA, data = galapagos.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.495 -53.431 -29.045 3.423 306.137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.78286 17.52442 3.640 0.001094 **
## AREA 0.08196 0.01971 4.158 0.000275 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91.73 on 28 degrees of freedom
## Multiple R-squared: 0.3817, Adjusted R-squared: 0.3596
## F-statistic: 17.29 on 1 and 28 DF, p-value: 0.0002748
The technical problem of ELEV
and AREA
being strongly correlated, which makes it problematic to include both in a model, may in this case also be a real effect. Many of these small islands are effectively single peaks so that elevation and area are direct proxies for one another. Thus, it may be that ELEV
is more important than AREA
as an indicator of the variety of habitat niches provided by an island.
A significant issue with the variables in this dataset is that they are highly skewed in most cases, mainly because the largest island is so much larger than all the others. This would suggest that it may make sense to perform various transformations on the variables to make their distributions more ‘normal’. A more contemporary approach would be to use machine-learning methods to fit non-linear models to these data. Those methods are beyond our present scope, but if you are interested you might explore them (and we can help out!) in the mini-project exercises.
Rather than spend too much time worrying about the exact details of the model, here we focus on mapping the models we have built. To do that we need to read in a spatial dataset.
galapagos.s <- st_read('galapagos.gpkg')
## Reading layer `galapagos' from data source
## `/home/osullid3/Documents/teaching/Geog315/slides/regression/example/galapagos.gpkg'
## using driver `GPKG'
## Simple feature collection with 20 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -92.00919 ymin: -1.411568 xmax: -89.24089 ymax: 1.681402
## Geodetic CRS: WGS 84
This dataset is missing a few of the islands (they weren’t present in OSM data…) but enough are present to get an idea of the distribution of the model variables.
First make some additional columns in the data table based on the models’ fitted and residual (error) values.
galapagos.results <- galapagos.data %>%
mutate(m1fit = m1$fitted.values, m1error = -m1$residuals,
m2fit = m2$fitted.values, m2error = -m2$residuals)
Then join these to the spatial data for mapping (both datasets have an ISLAND
variable which the join will be based on).
galapagos.s <- galapagos.s %>%
inner_join(galapagos.results)
## Joining, by = "ISLAND"
Now we can map the actual and predicted number of species. We will use tmap_arrange
to do this for side-by-side comparisons.
m1map <- tm_shape(galapagos.s) + tm_polygons(col = "m1fit", style = "cont", palette = "Greens", title="model 1")
m2map <- tm_shape(galapagos.s) + tm_polygons(col = "m2fit", style = "cont", palette = "Greens", title="model 2")
actual <- tm_shape(galapagos.s) + tm_polygons(col = "NUM_SPEC", style = "cont", palette = "Greens", title="actual")
tmap_arrange(m1map, m2map, actual, sync = TRUE)
It is clear from this that both models substantiall overpredict the number of species present on the largest island Isla Isabela, and underpredicts diversity on Isla Santa Cruz. Reading up on the complex histories (both human and natural) of the islands, suggests many reasons why our simplistic models might not do a particularly great job.
Perhaps an easier way to visualise the model performance is mapping model errors (the residuals).
m1err <- tm_shape(galapagos.s) + tm_polygons(col = "m1error", style = "cont", palette = "RdBu", title="m1 error")
m2err <- tm_shape(galapagos.s) + tm_polygons(col = "m2error", style = "cont", palette = "RdBu", title="m2 error")
actual <- tm_shape(galapagos.s) + tm_polygons(col = "NUM_SPEC", style = "cont", palette = "Greens", title="actual")
tmap_arrange(m1err, m2err, actual, sync = TRUE)
It is worth noticing that I have mapped the negative of the model residuals, as this is maybe a bit easier to think about as an ‘error’. The residual is the difference actual - model, so a positive residual is when the model predicts a lower value than the actual, meaning that the model error is negative. This can get a little bit confusing, so it is important to be clear in your head about the signs of model errors, and I find it helpful to take this approach.
Overall, it is difficult to draw many firm conclusions from this exercise, although the interpretive nature of regression modelling as a way to explore the complexities of data is hopefully apparent.
From a geographical perspective, of particular note is that we are especially interested in where a model performs well and where it performs badly as this may provide clues to more nuanced explanations of what is happening. In this case, Isabela is the most human-affected of the islands, and conservation activity is most focused on Santa Cruz, and it seems clear that these would have to be accounted for in any more substantial modelling exercise.
It’s possible to make many many different models for these data. There are 5 variables (AREA
, ELEV
, NEAREST
, S_CRUZ
, AREA_ADJ
) that might be included or not. That allows for up to 31 different models! (Each variable can be included or not, so 2 x 2 x 2 x 2 x 2 = 32 possibilities, one of which is no variables, so we exclude that from our count). If we had more variables, there would be even more possibilities.
In this situation it can make sense to get R to search the possibilities. There are many ways that this can be done. The most basic is called stepwise regression where we allow the computer to iteratively include or drop variables one at a time, only continuing on if the model improves based on some measure.
The easiest way to understand this is to try it. The function required is called step()
and it takes as its argument a starting model. Since we made a kitchen-sink model (m1
) let’s use that and see what we get.
step(m1, direction = "both")
## Start: AIC=251.14
## NUM_SPEC ~ AREA + ELEV + NEAREST + S_CRUZ + AREA_ADJ
##
## Df Sum of Sq RSS AIC
## - NEAREST 1 4 86918 249.15
## - AREA 1 4597 91511 250.69
## - S_CRUZ 1 5869 92783 251.10
## <none> 86914 251.14
## - AREA_ADJ 1 66692 153606 266.23
## - ELEV 1 134084 220998 277.14
##
## Step: AIC=249.15
## NUM_SPEC ~ AREA + ELEV + S_CRUZ + AREA_ADJ
##
## Df Sum of Sq RSS AIC
## - AREA 1 4744 91662 248.74
## <none> 86918 249.15
## - S_CRUZ 1 9873 96791 250.37
## + NEAREST 1 4 86914 251.14
## - AREA_ADJ 1 72322 159240 265.31
## - ELEV 1 141759 228677 276.17
##
## Step: AIC=248.74
## NUM_SPEC ~ ELEV + S_CRUZ + AREA_ADJ
##
## Df Sum of Sq RSS AIC
## <none> 91662 248.74
## + AREA 1 4744 86918 249.15
## - S_CRUZ 1 8176 99839 249.30
## + NEAREST 1 151 91511 250.69
## - AREA_ADJ 1 69216 160879 263.62
## - ELEV 1 277789 369451 288.56
##
## Call:
## lm(formula = NUM_SPEC ~ ELEV + S_CRUZ + AREA_ADJ, data = galapagos.data)
##
## Coefficients:
## (Intercept) ELEV S_CRUZ AREA_ADJ
## 18.89774 0.27224 -0.24726 -0.06692
This rather complicated output shows us starting from m1
what happens if we try dropping each of the possible variables. In doing so it ranks the models that result based on a measure AIC (Akaike Information Criterion). The lower this value, the better the model, relative to others under consideration. So in ‘round 1’ m1
we start with AIC = 251.14 and the step that makes the best new model is to remove the variable NEAREST
. In the next round the same process causes us to drop AREA
.
This process continues until we end up with a model that can’t be improved (based on the AIC measure), given by the function NUM_SPEC ~ ELEV + S_CRUZ + AREA_ADJ
. We can then build that model (provided we find it plausible and can explain it in a reasonable way!)
m4 <- lm(NUM_SPEC ~ ELEV + S_CRUZ + AREA_ADJ, data = galapagos.data)
summary(m4)
##
## Call:
## lm(formula = NUM_SPEC ~ ELEV + S_CRUZ + AREA_ADJ, data = galapagos.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.19 -32.52 -12.15 24.75 189.92
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.89774 17.19617 1.099 0.281862
## ELEV 0.27224 0.03067 8.877 2.38e-09 ***
## S_CRUZ -0.24726 0.16236 -1.523 0.139854
## AREA_ADJ -0.06692 0.01510 -4.431 0.000151 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.38 on 26 degrees of freedom
## Multiple R-squared: 0.7595, Adjusted R-squared: 0.7317
## F-statistic: 27.36 on 3 and 26 DF, p-value: 3.326e-08
What may be questionable about this model is the inclusion of a variable that appears to not be signficant (S_CRUZ
), and we would need to weigh up the value of retaining this variable or not.
Note that it is also possible to start with a simple model and build to a more complicated one:
step(m3, NUM_SPEC ~ AREA + ELEV + NEAREST + S_CRUZ + AREA_ADJ, direction = "both")
## Start: AIC=273.06
## NUM_SPEC ~ AREA
##
## Df Sum of Sq RSS AIC
## + ELEV 1 66602 169009 265.10
## <none> 235611 273.06
## + S_CRUZ 1 4563 231048 274.48
## + AREA_ADJ 1 2850 232761 274.70
## + NEAREST 1 1148 234463 274.92
## - AREA 1 145470 381081 285.49
##
## Step: AIC=265.1
## NUM_SPEC ~ AREA + ELEV
##
## Df Sum of Sq RSS AIC
## + AREA_ADJ 1 72219 96791 250.37
## - AREA 1 3192 172201 263.66
## <none> 169009 265.10
## + S_CRUZ 1 9769 159240 265.31
## + NEAREST 1 0 169009 267.10
## - ELEV 1 66602 235611 273.06
##
## Step: AIC=250.37
## NUM_SPEC ~ AREA + ELEV + AREA_ADJ
##
## Df Sum of Sq RSS AIC
## + S_CRUZ 1 9873 86918 249.15
## - AREA 1 3048 99839 249.30
## <none> 96791 250.37
## + NEAREST 1 4008 92783 251.10
## - AREA_ADJ 1 72219 169009 265.10
## - ELEV 1 135970 232761 274.70
##
## Step: AIC=249.15
## NUM_SPEC ~ AREA + ELEV + AREA_ADJ + S_CRUZ
##
## Df Sum of Sq RSS AIC
## - AREA 1 4744 91662 248.74
## <none> 86918 249.15
## - S_CRUZ 1 9873 96791 250.37
## + NEAREST 1 4 86914 251.14
## - AREA_ADJ 1 72322 159240 265.31
## - ELEV 1 141759 228677 276.17
##
## Step: AIC=248.74
## NUM_SPEC ~ ELEV + AREA_ADJ + S_CRUZ
##
## Df Sum of Sq RSS AIC
## <none> 91662 248.74
## + AREA 1 4744 86918 249.15
## - S_CRUZ 1 8176 99839 249.30
## + NEAREST 1 151 91511 250.69
## - AREA_ADJ 1 69216 160879 263.62
## - ELEV 1 277789 369451 288.56
##
## Call:
## lm(formula = NUM_SPEC ~ ELEV + AREA_ADJ + S_CRUZ, data = galapagos.data)
##
## Coefficients:
## (Intercept) ELEV AREA_ADJ S_CRUZ
## 18.89774 0.27224 -0.06692 -0.24726
We actually end up with the same model, but this is not true in general and will depend on the details of a particular dataset.
The most important thing is that the model you end up with is based not just on allowing the computer to do its thing, but an assessment of the scientific relevance of the results: in short, can you explain your model in a way that makes sense to you and others?
Further exploration of this material perhaps modelling NATIVE
as the dependent variable can be explored by downloading the materials in this zip file.