image source commons.wikimedia.org by Magnus Manske

Wrinkles

Multicollinearity, outliers, distributions

Dummy variables

Variants depending on nature of the dependent variable

Approach as an exploration

# Preferred data characteristics

## Independent variables: should be independent (avoid multicollinearity)

## Outliers: be careful of these as they may strongly influence results

## Distributions: strictly, normally distributed variables are preferred

### It’s usually impossible to satisfy all of these; the important thing is to _pay attention to your data and their characteristics_
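One rough way to check for multicollinearity is to look at pairwise correlations and variance inflation factors (VIFs) among the candidate predictors. The sketch below is illustrative only: it uses R's built-in `mtcars` data and an arbitrary choice of four predictors, neither of which comes from this lecture, and computes the VIF by hand as 1 / (1 - R²).

```r
# Illustrative only: mtcars and these four predictors are assumptions,
# chosen because they ship with R and are known to be correlated.
predictors <- mtcars[, c("wt", "hp", "disp", "qsec")]

# Pairwise correlations: values near +1 or -1 are a warning sign
round(cor(predictors), 2)

# Variance inflation factor for each predictor:
# regress it on all the others and compute 1 / (1 - R^2)
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values mean more of its variation is already explained by the other predictors.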

Dummy variables

Where independent variables are non-numeric, use dummy variables

A categorical variable with \(k\) levels becomes...

    id  landuse
     1  urban
     2  urban
     3  rural
     4  commercial
     5  industrial
            


\(k-1\) dummy variables

    id  landuse     urban  rural  commercial
     1  urban           1      0           0
     2  urban           1      0           0
     3  rural           0      1           0
     4  commercial      0      0           1
     5  industrial      0      0           0
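The table above can be reproduced in R with `model.matrix()`. This is a minimal sketch; the object names `landuse` and `dummies` are made up for illustration, and `industrial` is set as the first factor level so that it becomes the reference category (all zeros), as in the table.

```r
# The five land-use values from the table; 'industrial' is listed first
# so that it is the reference level with no dummy column of its own
landuse <- factor(
  c("urban", "urban", "rural", "commercial", "industrial"),
  levels = c("industrial", "urban", "rural", "commercial")
)

# model.matrix() expands the factor into an intercept plus k - 1 dummies;
# drop the intercept column to leave only the dummy variables
dummies <- model.matrix(~ landuse)[, -1]
dummies

# Note: lm() and glm() do this expansion automatically when a predictor
# is a factor, so building dummies by hand is rarely necessary in R.
```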
            

# Non-continuous dependent variables

## Binary (yes/no, present/absent) → _logistic regression_

## Counts → _Poisson regression_

### We can explore these options as needed, depending on your projects

e.g., Logistic regression

Use this to model true/false outcomes or probabilities

model <- glm(y ~ x, family = "binomial")
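To make the call above concrete, here is a small runnable sketch on simulated data. Everything in it (the data frame `dat`, the coefficient values, the sample size) is invented for illustration. It also shows that the Poisson variant for counts differs only in the `family` argument.

```r
set.seed(1)  # simulated data, not from any real study
n <- 200
dat <- data.frame(x = rnorm(n))

# Binary outcome: probability of "yes" rises with x
dat$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * dat$x))

# Count outcome: expected count rises with x
dat$count <- rpois(n, lambda = exp(0.3 + 0.5 * dat$x))

# Logistic regression for the binary outcome
logit_model <- glm(y ~ x, family = binomial, data = dat)
summary(logit_model)

# Predicted probabilities (not log-odds) at chosen values of x
p_hat <- predict(logit_model, newdata = data.frame(x = c(-1, 0, 1)),
                 type = "response")
p_hat

# Poisson regression for the count outcome: same call, different family
pois_model <- glm(count ~ x, family = poisson, data = dat)
coef(pois_model)
```

Coefficients from the logistic model are on the log-odds scale, so `type = "response"` is needed to get probabilities back.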

# Variable selection

## Which variables are included or not is critical

## Automated methods such as `step()` are available, but should be approached with caution!

## Newer automated methods focus on the journey, not the destination
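A minimal sketch of what `step()` does, using R's built-in `mtcars` data and an arbitrary starting model (both are assumptions for illustration, not part of this lecture). Backward selection drops one variable at a time for as long as doing so lowers the AIC.

```r
# Illustrative starting model: the choice of predictors is arbitrary
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

# Backward stepwise selection by AIC; trace = 0 suppresses the step log
reduced <- step(full, direction = "backward", trace = 0)

# Which variables survived?
formula(reduced)

# Compare the two models
AIC(full)
AIC(reduced)
```

The selected model depends on the starting set and on the criterion used, which is one reason to treat the result as a suggestion rather than an answer.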

Spatial aspects

Including spatial measures as variables (e.g., distance to a park)

Trend surface analysis (spatial coordinates as variables)

Mapping regression model errors (the residuals)

... because we may be more interested in local than global effects

Trend surface analysis

Source: Abler, Adams & Gould (1971) Spatial Organization, Prentice-Hall, drawing on an unpublished Master's thesis by John Florin
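In code, a trend surface is just a regression with the spatial coordinates as the independent variables. The sketch below uses simulated points; the names `pts`, `x`, `y` and `z` and all the numbers are invented for illustration. It fits a first-order (planar) and a second-order (quadratic) surface.

```r
set.seed(42)  # simulated data for illustration only
n <- 100
pts <- data.frame(x = runif(n, 0, 10), y = runif(n, 0, 10))

# Simulated attribute: rises to the east, falls to the north, plus noise
pts$z <- 5 + 2 * pts$x - 1 * pts$y + rnorm(n, sd = 1)

# First-order trend surface: a tilted plane
linear <- lm(z ~ x + y, data = pts)

# Second-order trend surface: adds curvature and an interaction term
quadratic <- lm(z ~ x + y + I(x^2) + I(y^2) + I(x * y), data = pts)

coef(linear)

# Does the extra curvature improve the fit enough to justify it?
anova(linear, quadratic)
```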

Lake Hillier, Australia apparently it really is pink...
image source culturauniversale.blogspot.com

Residual mapping

Often outliers are more interesting than the model!

Point to where interesting things may be happening

Perhaps a clue to missing factors
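A minimal sketch of residual mapping in base R, on simulated data (the data frame `pts`, its columns, and the cut-off of 2 standard deviations are all assumptions made for illustration). The idea is to attach each observation's standardised residual to its location and plot it.

```r
set.seed(7)  # simulated data for illustration only
n <- 50
pts <- data.frame(
  x = runif(n),                # easting (arbitrary units)
  y = runif(n),                # northing (arbitrary units)
  area = runif(n, 50, 200)     # hypothetical predictor
)
pts$price <- 100 + 2 * pts$area + rnorm(n, sd = 20)

fit <- lm(price ~ area, data = pts)

# Standardised residuals, stored alongside the coordinates
pts$resid <- rstandard(fit)

# Flag unusually large residuals (the threshold of 2 is a rule of thumb)
pts$flag <- abs(pts$resid) > 2

# Crude map: red = model under-predicts, blue = model over-predicts,
# symbol size proportional to the size of the residual
plot(pts$x, pts$y, pch = 19,
     col = ifelse(pts$resid > 0, "red", "blue"),
     cex = abs(pts$resid) + 0.5,
     xlab = "x", ylab = "y", main = "Standardised residuals")
```

Spatial clusters of same-coloured symbols would suggest that something geographical is missing from the model.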

Geographically weighted regression (GWR)

An alternative approach that recognises geographical variation

houses in London Fields
image source flickr.com by Matthew Rutledge

Example

Regression model based on floor area, parcel area, number of bedrooms, etc.

From Chapter 2 of Fotheringham AS, C Brunsdon, and M Charlton. 2002. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley: Chichester, UK.
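The core idea of GWR can be sketched in a few lines of base R: fit a separate weighted regression at each location, giving nearby observations more weight than distant ones. This is a hand-rolled illustration on simulated data, not the book's example and not the interface of any GWR package. The data frame `houses`, the Gaussian kernel, and the fixed bandwidth of 2 are all assumptions.

```r
set.seed(3)  # simulated data for illustration only
n <- 150
houses <- data.frame(
  east = runif(n, 0, 10),
  north = runif(n, 0, 10),
  floor_area = runif(n, 50, 200)
)
# Simulated so that the effect of floor area on price grows from west to east
houses$price <- 50 + (1 + 0.2 * houses$east) * houses$floor_area +
  rnorm(n, sd = 10)

bandwidth <- 2  # assumed; in practice this is chosen by calibration

# One local regression per house, weighted by distance to that house
local_slope <- sapply(seq_len(n), function(i) {
  d <- sqrt((houses$east - houses$east[i])^2 +
            (houses$north - houses$north[i])^2)
  w <- exp(-0.5 * (d / bandwidth)^2)   # Gaussian kernel weights
  coef(lm(price ~ floor_area, data = houses, weights = w))["floor_area"]
})

# The ordinary (global) model gives a single slope for the whole area
global_slope <- coef(lm(price ~ floor_area, data = houses))["floor_area"]

global_slope
summary(local_slope)  # the local slopes vary across the study area
```

Mapping `local_slope` against the coordinates would show the west-to-east gradient that the single global coefficient averages away.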

image source geograph.org.uk by Matt Harrop CC2.0 license

Conclusions

For serious regression work, you need a statistics platform, and should look beyond desktop GIS

R is preferred, but SPSS or SAS might also be used

There are standalone tools for GWR, but also an R package

Some of the Arc toolboxes can get you started but may be limited