Spatial autocorrelation: what’s the problem?

It depends what you mean by problem

An exploration of the impact of spatial autocorrelation on the results produced by different sampling strategies.
geospatial
R
tutorial
geographic information analysis
Author: David O’Sullivan

Published: November 14, 2025

This post is the first exploring topics raised in Chapter 2 of Geographic Information Analysis¹. Specifically, it is about one of the ‘pitfalls’ of spatial data: spatial autocorrelation.

As we’ll see, and as the book suggests, it’s not clear that spatial autocorrelation really is a problem. It is better understood as a characteristic feature of spatial data, one that it is pretty much the whole purpose of spatial analysis to describe, and to leverage in various ways to improve our analyses.

Anyway, returning to the text, on page 35 we state that:

The nonrandom distribution of phenomena in space has various consequences for conventional statistical analysis. For example, the usual parameter estimates based on samples that are not randomly distributed in space will be biased toward values prevalent in the regions favored in the sampling scheme.

This post zeros in on that aspect of spatial autocorrelation.

Code
library(terra)
library(spatstat)
library(sf)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(patchwork)

set.seed(1)

A random and an autocorrelated surface

To get started we make a random and an autocorrelated surface. We do this using terra, first making an empty 100×100 raster, then populating it with uniform random numbers.

# a 10,000 x 10,000 unit extent centred on the origin
extent <- ext(c(xmin = -1, xmax = 1,
                ymin = -1, ymax = 1) * 5000)

# an empty raster at 100 unit resolution (i.e. 100x100 cells), filled
# with uniform random values
r_random <- rast(extent, res = 100, crs = "+proj=eqc")
names(r_random) <- "value"
values(r_random) <- runif(10000)

Next, smooth the surface. The easiest way to do this is to apply a focal mean a few times.

r_smoothed <- r_random
for (i in 1:20) {
  r_smoothed <- r_smoothed |> focal(3, mean, expand = TRUE)
}
names(r_smoothed) <- "value"
Note that expand = TRUE uses duplicate values around the perimeter of the raster in the calculation of the mean, so that the raster doesn’t shrink by one cell on all sides at each iteration.
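As a quick aside of my own (not in the original workflow), we can see what expand = TRUE is saving us from by counting the NAs after a single unexpanded pass:

# without expand = TRUE, cells whose 3x3 neighbourhood extends beyond the
# raster get an NA mean, so a one-cell-deep NA border appears at each pass
r_shrunk <- focal(r_random, 3, mean)
sum(is.na(values(r_shrunk)))  # 396, the number of perimeter cells in a 100x100 grid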

Finally, remake the random grid by shuffling the values in the smoothed surface. This is so that our random and smoothed surfaces are (aspatially) statistically identical.

values(r_random) <- sample(values(r_smoothed), 10000, replace = FALSE)
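As a sanity check (mine, not in the original post) that the shuffle really does leave the two surfaces aspatially identical:

# both rasters now hold exactly the same 10,000 values; only their
# spatial arrangement differs
all(sort(values(r_random)) == sort(values(r_smoothed)))  # TRUE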

And here’s a plot comparing the two.

Code
xy_random <- r_random |> as.data.frame(xy = TRUE)
xy_smoothed <- r_smoothed |> as.data.frame(xy = TRUE)

g1 <- ggplot() +
  geom_raster(data = xy_random, aes(x = x, y = y, fill = value)) +
  scale_fill_distiller(palette = "Reds") +
  coord_equal() +
  guides(fill = "none") +
  ggtitle("Random") +
  theme_void()

g2 <- ggplot() +
  geom_raster(data = xy_smoothed, aes(x = x, y = y, fill = value)) +
  scale_fill_distiller(palette = "Reds") +
  coord_equal() +
  guides(fill = "none") +
  ggtitle("Smoothed") +
  theme_void()

g1 | g2
Figure 1: The random and smoothed surfaces

Sampling values from the surfaces

We can sample values at random from a raster using terra::spatSample, but here we want to control the configuration of the sampling pattern. For that we’ll use sampling points generated by spatstat’s point pattern functions.
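For reference, the built-in terra approach would be something like this (a hedged aside, not part of the pipeline below):

# draw 100 cell values at random; fine for a one-off sample, but it gives
# no control over the spatial configuration of the sample locations
spatSample(r_random, 100, method = "random")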

We need a couple of helper functions: one to convert a spatstat point pattern to a plain terra SpatVector object, and one to do the sampling.

ppp_as_spatvector <- function(ppp) {
  ppp |>
    st_as_sf() |>
    select() |>
    st_set_crs("+proj=eqc") |>
    # slice(-1) removes the first row of the sf dataset, which is the
    # polygon representing the window of the point pattern
    slice(-1) |>
    as("SpatVector")
}

# fn defaults to the mean, but we could supply an arbitrary function if we want
sample_raster_statistic <- function(r, pts, fn = mean) {
  terra::extract(r, pts |> ppp_as_spatvector(), ID = FALSE) |>
    pull(1) |>
    fn()
}
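A quick usage sketch (hypothetical, using a throwaway pattern of about ten points, since the real sampling patterns are only generated below):

# a small Poisson pattern in the same window extent, just to exercise the
# helpers; any spatstat ppp should work
demo_pts <- rpoispp(10 / 1e8, win = owin(c(-5000, 5000), c(-5000, 5000)))
sample_raster_statistic(r_smoothed, demo_pts)           # the mean, by default
sample_raster_statistic(r_smoothed, demo_pts, fn = sd)  # or any other summary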

Now make up some point patterns: Poisson random, evenly spaced, clustered, and grid-based.

window <- owin(extent[1:2], extent[3:4])

poisson_pps <- list()
ssi_pps <- list()
clustered_pps <- list()
area <- diff(extent[1:2]) * diff(extent[3:4])
for (i in 1:100) {
  set.seed(i)
  # Poisson process with an expected 100 points in the window
  poisson_pps[[length(poisson_pps) + 1]] <- rpoispp(100 / area, win = window)
  set.seed(i)
  # simple sequential inhibition: 100 points, none closer than 600 apart
  ssi_pps[[length(ssi_pps) + 1]] <- rSSI(600, n = 100, win = window)
  set.seed(i)
  # Thomas cluster process: ~10 parents with ~10 offspring each, scale 250
  clustered_pps[[length(clustered_pps) + 1]] <- rThomas(10 / area, 250, 10, win = window)
}

# pick every 10th cell centre in each direction, shifted by the offsets
get_grid_pp <- function(offset_x = 0, offset_y = 0) {
  as.ppp(xy_random |>
           select(x, y) |>
           filter((x %/% 100) %% 10 == offset_x,
                  (y %/% 100) %% 10 == offset_y),
         W = window)
}
# all 100 offset combinations give 100 distinct 100-point grid patterns
grid_pps <- mapply(get_grid_pp, rep(0:9, 10), rep(0:9, each = 10), SIMPLIFY = FALSE)
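It’s worth a quick check that the patterns come out at the intended size (my own checks, not in the original post):

# Poisson and Thomas patterns should average about 100 points each; the
# grid patterns have exactly 100, and rSSI aims for 100 (it can fall
# short if it fails to place all the points)
sapply(poisson_pps, npoints) |> mean()
sapply(clustered_pps, npoints) |> mean()
sapply(grid_pps, npoints) |> unique()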

And here’s what those look like on top of our random and smoothed rasters.

Code
g3 <- ggplot() +
  geom_raster(data = xy_random, aes(x = x, y = y, fill = value), alpha = 0.5) +
  scale_fill_distiller(palette = "Greys") +
  geom_sf(data = grid_pps[[1]] |> st_as_sf() |> slice(-1),
          shape = 1) +
  geom_sf(data = poisson_pps[[1]] |> st_as_sf() |> slice(-1)) +
  geom_sf(data = ssi_pps[[1]] |> st_as_sf() |> slice(-1), 
          colour = "dodgerblue") +
  geom_sf(data = clustered_pps[[1]] |> st_as_sf() |> slice(-1), 
          colour = "red") +
  coord_sf(xlim = extent[1:2], ylim = extent[3:4]) +
  guides(fill = "none") +
  theme_void()

g4 <- ggplot() +
  geom_raster(data = xy_smoothed, aes(x = x, y = y, fill = value), alpha = 0.5) +
  scale_fill_distiller(palette = "Greys") +
  geom_sf(data = grid_pps[[1]] |> st_as_sf() |> slice(-1),
          shape = 1) +
  geom_sf(data = poisson_pps[[1]] |> st_as_sf() |> slice(-1)) +
  geom_sf(data = ssi_pps[[1]] |> st_as_sf() |> slice(-1), 
          colour = "dodgerblue") +
  geom_sf(data = clustered_pps[[1]] |> st_as_sf() |> slice(-1), 
          colour = "red") +
  coord_sf(xlim = extent[1:2], ylim = extent[3:4]) +
  guides(fill = "none") +
  theme_void()
  
g3 | g4
Figure 2: Typical grid-based (open circles), Poisson random (black), evenly-spaced (blue), and clustered (red) sampling patterns on the random and smoothed raster surfaces

Before reading further, what difference do you think the surface structure might make to estimates of the mean under the different sampling schemes?

Does the surface autocorrelation make a difference?

We can sample the surfaces 100 times each, using the four different sampling schemes, and see what difference it makes to our estimates of some statistic, here the mean value of our surface.

df <- data.frame(
    Random_Grid = grid_pps |> 
      lapply(sample_raster_statistic, r = r_random) |> 
      unlist(),
    Smoothed_Grid = grid_pps |> 
      lapply(sample_raster_statistic, r = r_smoothed) |> 
      unlist(),
    Random_Poisson = poisson_pps |> 
      lapply(sample_raster_statistic, r = r_random) |> 
      unlist(),
    Smoothed_Poisson = poisson_pps |> 
      lapply(sample_raster_statistic, r = r_smoothed) |> 
      unlist(),
    Random_Even = ssi_pps |> 
      lapply(sample_raster_statistic, r = r_random) |>
      unlist(),
    Smoothed_Even = ssi_pps |> 
      lapply(sample_raster_statistic, r = r_smoothed) |> 
      unlist(),
    Random_Clustered = clustered_pps |> 
      lapply(sample_raster_statistic, r = r_random) |>
      unlist(),
    Smoothed_Clustered = clustered_pps |>
      lapply(sample_raster_statistic, r = r_smoothed) |> 
      unlist()) |>
  pivot_longer(cols = 1:8) |>
  rename(Mean = value) |>
  separate_wider_delim(
    cols = name, delim = "_", 
    names = c("Raster", "Sampling")) |>
  mutate(
    Raster = ordered(
      Raster, levels = c("Random", "Smoothed")),
    Sampling = ordered(
      Sampling, levels = c("Poisson", "Grid",
                           "Even", "Clustered")))
df
Here pivot_longer makes the variable names into a name column and the values into a value column, and separate_wider_delim then splits the name column into Raster and Sampling columns.
# A tibble: 800 × 3
   Raster   Sampling   Mean
   <ord>    <ord>     <dbl>
 1 Random   Grid      0.495
 2 Smoothed Grid      0.499
 3 Random   Poisson   0.501
 4 Smoothed Poisson   0.501
 5 Random   Even      0.505
 6 Smoothed Even      0.500
 7 Random   Clustered 0.501
 8 Smoothed Clustered 0.487
 9 Random   Grid      0.502
10 Smoothed Grid      0.499
# ℹ 790 more rows

And now we can plot the distribution of estimates of the mean for the various combinations of random and smoothed surface and sampling scheme.

Code
ggplot(df) +
  geom_density(aes(x = Mean, fill = Raster), colour = NA, alpha = 0.5) +
  scale_fill_brewer(palette = "Set1") +
  facet_wrap( ~ Sampling, ncol = 4, labeller = label_both) +
  ylab("Density") +
  theme_minimal() +
  theme(panel.background = element_rect(fill = NA),
        legend.position = "bottom")
Figure 3: Density-smoothed sampling distributions of the mean value of the random (red) and smoothed (blue) surfaces under the various sampling schemes

And in answer to the question: yes, autocorrelation of the surface values makes the sampling scheme matter a lot! Keep in mind that the data values in our two surfaces are exactly the same set of values. It is the difference in their spatial arrangement that causes our different sampling schemes to see different things.

Regardless of sampling scheme, the random raster yields a similar sampling distribution of the mean. It doesn’t matter how we go about picking 100 sample locations: because the data themselves are spatially random, any 100 locations form an unbiased sample, and we end up with much the same picture of the data.

The Poisson random sampling scheme yields similar sampling distributions of the mean for both surfaces, as we might hope. By definition a Poisson point process is unbiased, with no first- or second-order effects, so in both cases we get an unbiased sample of the data.

Things go off the rails pretty badly after that on the smoothed surface. Both the grid-based and evenly-spaced sampling schemes yield a very narrow range of estimates of the mean, because at this low (1%) sampling rate they may miss peaks and troughs in the data entirely. If a single sample location happens to hit a peak, the even spacing guarantees that no other samples are taken nearby, so the rest are bound to miss it.

Meanwhile the clustered sampling scheme produces a much wider range of estimates of the mean. Sometimes a large number of samples close together lands near a peak or trough, and measuring a disproportionate number of points close to an extreme value biases the estimate of the mean.
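To put rough numbers on these differences in spread (a sketch of my own; its output isn’t shown here), we can summarise the standard deviation of the 100 estimates for each combination:

df |>
  group_by(Raster, Sampling) |>
  summarise(sd_of_estimates = sd(Mean), .groups = "drop")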

But nobody would be that dumb, surely?

Of course this is an artificial setup. In practice people try to sample sensibly, right? Well yes, but…

In practice, our sampling schemes are likely to be biased, for all kinds of reasons. Areas of particular interest may not be typical, but we focus data collection in those places (because they are interesting). Samples are likely to be preferentially collected in more accessible locations, for practical and economic reasons. Or perhaps we are being ‘smart’ and plan an evenly spaced sampling strategy, which, unless we have prior knowledge of the data, risks missing peaks and troughs completely. Where the statistic of interest is the mean, as here, that might not matter (we get an accurate estimate), but if we really need to know the variance then the data we collect will be misleading.
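The same machinery makes that last point easy to check (a hypothetical extension, not run in the post): rerun the evenly-spaced experiment with fn = var and compare against the true variance of the surface.

# estimated vs. actual variance of the smoothed surface under even sampling
var_even <- sapply(ssi_pps, sample_raster_statistic,
                   r = r_smoothed, fn = var)
c(estimated = mean(var_even),
  actual = var(as.vector(values(r_smoothed))))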

And very often the data already exist and we don’t get a say in how they were collected.

Any statistician would of course say that the best sampling scheme is a random, unbiased one.² In practice, in spatial terms, it rarely is, and if the underlying spatial data have spatial structure (i.e. they are autocorrelated), which since we are interested in geography we can presume they do, then how we sample them can matter greatly.

Pitfall or potential?

When it comes to data collection, definitely a pitfall. Spatial data should be collected with care and attention to these kinds of concerns, and prior knowledge about the phenomenon of interest should inform decisions about how to go about collecting useful data. It’s important to note that while regular gridded samples perform abominably in this toy example, if the sampling interval of your grid is fine-grained enough relative to the variation in the phenomenon then it’s probably OK. It’s still worth being cognizant of these issues though: it’s easy to be beguiled by ‘detailed’ data and assume away this kind of problem.
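For instance, we could rerun the grid sampling at a finer interval, say every 5th cell rather than every 10th, for 400 points instead of 100 (my own variation on the code above):

# a denser grid: every 5th cell centre in each direction (20 x 20 = 400 points)
dense_grid <- as.ppp(xy_random |>
                       select(x, y) |>
                       filter((x %/% 100) %% 5 == 0,
                              (y %/% 100) %% 5 == 0),
                     W = window)
sample_raster_statistic(r_smoothed, dense_grid)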

On the other hand, dealing with these issues and understanding the extent of autocorrelation in spatial data have led to a wide range of methods for characterising spatial variation, and those are of great interest to geographers. As we comment on page 36 of Geographic Information Analysis:

Although autocorrelation presents a considerable challenge to conventional statistical methods and remains problematic, quantitative geographers have made a virtue of it by developing a number of autocorrelation measures into powerful descriptive tools. It would be wrong to claim that the problem of spatial autocorrelation has been solved, but considerable progress has been made in developing techniques that account for its effects and in taking advantage of the opportunity it provides for useful geographic descriptions.

And on that cheery note, I’ll end.


Footnotes

  1. https://onlinelibrary.wiley.com/doi/book/10.1002/9780470549094

  2. Stats 101.