
Francis Galton’s illustration of correlation, 1875 (image source: commons.wikimedia.org)

Regression & friends

Regression is a standard statistical approach to relating variables to one another

Based on a statistical model

We can use regression to determine \(f(\mathbf{X})\) in overlay

Structure of regression

Dependent variable (the thing we are modelling)

Independent variables (the explanatory factors)

Regression works best when the variables meet certain distributional preferences (approximately normal, no outliers, etc.), so exploring the data first is really important, especially for outlier detection

The result

\[ y=b_0+b_1x_1+\ldots+b_ix_i+\ldots+b_px_p+\epsilon \]

where \(y\) is the dependent variable, \(x_i\) are the independent variables, \(b_i\) are the regression coefficients, and \(\epsilon\) is the model error

This mirrors overlay’s
\[ y=w_0+w_1x_1+w_2x_2+\ldots+w_nx_n \]

How it works

A simple example

Data from raster package
getData("worldclim") and getData("SRTM")

Extract the raster data values at the point locations

Lapse rate

How quickly does temperature fall with altitude?

Building a regression model

Use the linear model function lm() in R


            > model <- lm(t_min_july ~ elevation, data = df)
            > model

            Call:
            lm(formula = t_min_july ~ elevation, data = df)

            Coefficients:
            (Intercept)    elevation
               4.449734    -0.004898
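
The elevation coefficient is the lapse rate itself: the modelled change in July minimum temperature per unit of elevation. A minimal sketch of reading it off, using only the two coefficients printed above. The elevations of 0 and 1000 are illustrative values, and the units are whatever `df` uses, which these slides do not state.

```r
# Coefficients copied from the printed model above
b0 <- 4.449734   # intercept: modelled t_min_july at elevation 0
b1 <- -0.004898  # slope: change in t_min_july per unit of elevation

# Modelled temperature at an illustrative elevation of 1000 units
t_1000 <- b0 + b1 * 1000

# Modelled change in temperature over a 1000-unit climb
lapse_per_1000 <- b1 * 1000

print(c(t_1000 = t_1000, lapse_per_1000 = lapse_per_1000))
```

With the fitted model object available, the same prediction should come from `predict(model, newdata = data.frame(elevation = 1000))`.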
            

More information


            > summary(model)

            Call:
            lm(formula = t_min_july ~ elevation, data = df)

            Residuals:
                 Min       1Q   Median       3Q      Max
            -0.89072 -0.13481 -0.01928  0.12767  0.78203

            Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
            (Intercept)  4.450e+00  2.928e-02  151.98   <2e-16 ***
            elevation   -4.898e-03  5.627e-05  -87.05   <2e-16 ***
            ---
            Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            Residual standard error: 0.2139 on 439 degrees of freedom
            Multiple R-squared:  0.9452,	Adjusted R-squared:  0.9451
            F-statistic:  7578 on 1 and 439 DF,  p-value: < 2.2e-16
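
The quantities in this printout can also be pulled out of the summary object for further use. The sketch below assumes a synthetic stand-in for `df` (the names `df_demo` and `model_demo` and all simulated values are made up for illustration, not the lecture data), so the numbers will not match those above.

```r
set.seed(1)

# Synthetic stand-in for df: these values are simulated, not the lecture data
df_demo <- data.frame(elevation = runif(100, 0, 1500))
df_demo$t_min_july <- 4.5 - 0.005 * df_demo$elevation + rnorm(100, sd = 0.2)

model_demo <- lm(t_min_july ~ elevation, data = df_demo)
s <- summary(model_demo)

coef_table <- coef(s)              # Estimate, Std. Error, t value, Pr(>|t|)
r_squared  <- s$r.squared          # "Multiple R-squared" in the printout
resid_se   <- s$sigma              # "Residual standard error" in the printout
ci         <- confint(model_demo)  # 95% confidence intervals for coefficients

print(coef_table)
print(r_squared)
print(ci)
```

Confidence intervals are often easier to interpret than the p-values alone, since they are expressed in the units of the coefficient.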
            

More variables

We might consider including lat as an additional explanatory variable

It is easy to add variables to the model specification

            > model2 <- lm(t_min_july ~ elevation + lat, data = df)
            > summary(model2)

            Call:
            lm(formula = t_min_july ~ elevation + lat, data = df)

            Residuals:
                 Min       1Q   Median       3Q      Max
            -0.81647 -0.11906 -0.00257  0.09144  0.72292

            Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
            (Intercept)  2.002e+01  1.432e+00   13.98   <2e-16 ***
            elevation   -4.466e-03  6.295e-05  -70.95   <2e-16 ***
            lat          4.098e-01  3.763e-02   10.89   <2e-16 ***
            ---
            Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            Residual standard error: 0.1966 on 438 degrees of freedom
            Multiple R-squared:  0.9533,	Adjusted R-squared:  0.9531
            F-statistic:  4467 on 2 and 438 DF,  p-value: < 2.2e-16
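
One way to judge whether an added variable earns its place is to compare the two nested models directly. The sketch below is an assumption about how that might be done, not something shown in the slides: it uses `anova()` and adjusted R-squared on synthetic data (`df_demo`, `m1`, `m2` and the simulated coefficients and latitude range are all made up for illustration).

```r
set.seed(2)
n <- 200

# Synthetic stand-in for df: simulated values with an arbitrary latitude range
df_demo <- data.frame(
  elevation = runif(n, 0, 1500),
  lat       = runif(n, -47, -34)
)
df_demo$t_min_july <- 20 - 0.0045 * df_demo$elevation +
  0.4 * df_demo$lat + rnorm(n, sd = 0.2)

m1 <- lm(t_min_july ~ elevation, data = df_demo)
m2 <- update(m1, . ~ . + lat)  # equivalent to lm(t_min_july ~ elevation + lat, ...)

# F-test on the nested models: does adding lat reduce the residual variance?
cmp <- anova(m1, m2)
print(cmp)

print(c(adj_r2_m1 = summary(m1)$adj.r.squared,
        adj_r2_m2 = summary(m2)$adj.r.squared))
```

Adjusted R-squared is used here rather than plain R-squared because the latter never decreases when a variable is added.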
            

Reflection

The basics of building a model are simple

There is much devil in the details

The key is to focus on understanding what’s going on, not on the model as an end in itself

Another example

Drawing on data from Chapter 7 in Peter Rogerson’s *Statistical Methods for Geography*, 2001 (1st edn)

Biodiversity in the Galapagos


Wrinkles

Multicollinearity, outliers, distributions

Dummy variables

Variants depending on nature of the dependent variable

Approach as an exploration

Preferred data characteristics

Independent variables: should be independent of one another (avoid multicollinearity)

Outliers: be careful of these as they may strongly influence results

Distributions: normally distributed variables are preferred, although strictly the normality assumption applies to the model errors
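
Each of these characteristics can be checked with base R. The sketch below is one possible set of checks on synthetic data, not a procedure taken from the slides: the variable names, the variance inflation factor calculation, the 4/n Cook's distance threshold, and the Shapiro-Wilk test are all illustrative choices.

```r
set.seed(3)
n <- 100

# Synthetic data: x2 is nearly a copy of x1, so the two are collinear by design
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)
d  <- data.frame(y, x1, x2, x3)

# Multicollinearity: pairwise correlations among the explanatory variables
print(round(cor(d[, c("x1", "x2", "x3")]), 2))

# Variance inflation factor for x1: 1 / (1 - R^2) from regressing x1 on the rest
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2 + x3, data = d))$r.squared)
print(vif_x1)

# Outliers: Cook's distance flags observations with strong influence on the fit
fit <- lm(y ~ x1 + x2 + x3, data = d)
influential <- which(cooks.distance(fit) > 4 / n)  # 4/n is a rule of thumb only
print(influential)

# Distributions: Shapiro-Wilk test on the residuals as one check of normality
sw <- shapiro.test(residuals(fit))
print(sw$p.value)
```

None of these checks gives a yes or no answer; they point to variables or observations that deserve a closer look.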

Dummy variables

Where independent variables are non-numeric use dummy variables

A categorical variable with \(k\) levels becomes...

                   id landuse
                 
              1     1 urban
              2     2 urban
              3     3 rural
              4     4 commercial
              5     5 industrial
            


\(k-1\) dummy variables

                   id landuse    urban rural commercial
                              
              1     1 urban          1     0          0
              2     2 urban          1     0          0
              3     3 rural          0     1          0
              4     4 commercial     0     0          1
              5     5 industrial     0     0          0
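
The table above can be reproduced with `model.matrix()`. This is a minimal sketch, assuming the land use values shown; the object names `landuse`, `f` and `dummies` are illustrative. R uses the first factor level as the reference (all-zero) category, so the levels are ordered to put "industrial" first, matching the table.

```r
landuse <- c("urban", "urban", "rural", "commercial", "industrial")

# Put "industrial" first so it becomes the reference (all-zero) level;
# by default R would use the first level alphabetically ("commercial")
f <- factor(landuse, levels = c("industrial", "urban", "rural", "commercial"))

# model.matrix() builds the design matrix; drop its intercept column
dummies <- model.matrix(~ f)[, -1]

# Strip the "f" prefix that model.matrix() adds to the column names
colnames(dummies) <- sub("^f", "", colnames(dummies))
print(dummies)
```

In practice `lm()` does this conversion automatically when an explanatory variable is a factor, so the dummy columns rarely need to be built by hand.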
            

Non-numeric dependent variables

Binary (yes/no, present/absent) data: calls for _logistic regression_

Count data: may call for _Poisson regression_

As needed we can explore these options depending on projects
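
In R both variants are fitted with `glm()` by choosing a different `family`. The sketch below uses synthetic data (the variable names and simulated coefficients are made up for illustration) and is only meant to show the shape of the calls.

```r
set.seed(4)
n <- 300
x <- runif(n, 0, 10)

# Binary outcome: probability of presence rises with x (synthetic data)
present   <- rbinom(n, size = 1, prob = plogis(-3 + 0.6 * x))
logit_fit <- glm(present ~ x, family = binomial)

# Count outcome: expected count rises with x (synthetic data)
count    <- rpois(n, lambda = exp(0.2 + 0.15 * x))
pois_fit <- glm(count ~ x, family = poisson)

print(coef(logit_fit))  # coefficients are on the log-odds scale
print(coef(pois_fit))   # coefficients are on the log-count scale

# Predictions on the response scale: a probability and an expected count
p_at_5 <- predict(logit_fit, newdata = data.frame(x = 5), type = "response")
n_at_5 <- predict(pois_fit,  newdata = data.frame(x = 5), type = "response")
print(c(p_at_5 = unname(p_at_5), n_at_5 = unname(n_at_5)))
```

Note that the coefficients are no longer changes in the dependent variable itself, so `type = "response"` is used to get predictions back on the original scale.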

Variable selection

Which variables are included or not is critical

Automated methods are available, but should be approached with caution!

Newer automated methods focus on the journey not the destination
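
As one example of an automated method, base R provides `step()`, which adds or drops variables according to AIC. The slides do not name a particular method, so this choice is an assumption, and the data below are synthetic: two real explanatory variables and two pure-noise columns (all names are illustrative).

```r
set.seed(5)
n <- 150

# Synthetic data: y depends on x1 and x2 only; noise1 and noise2 are irrelevant
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                noise1 = rnorm(n), noise2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + noise1 + noise2, data = d)

# Backward selection by AIC; trace = 0 suppresses the step-by-step log
selected <- step(full, direction = "backward", trace = 0)

print(formula(selected))
print(c(AIC_full = AIC(full), AIC_selected = AIC(selected)))
```

The caution in the slide applies here: a noise variable can survive selection by chance, and the p-values of the selected model do not account for the search that produced it.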

image source geograph.org.uk by Matt Harrop CC2.0 license

Summary

Statistical modelling is a huge toolbox of options

All the methods are worlds unto themselves, including regression and linear models

Probably most useful when regarded as exploratory, not as an end in itself

From a geographical perspective we are often interested in local, not global, effects