Spatial data wrangling in sf

library(sf)
library(tmap)
library(dplyr)
library(magrittr)
library(tidyr)
map_mode <- "plot"
tmap_mode(map_mode)

The sf package supports a wide range of spatial operations both on individual layers, and between layers. These are discussed in this document, for your reference, and may be particularly useful when it comes to the mini-project later in the semester.

I have split operations into those which are performed between two datasets, and those which transform an existing dataset by some geometric method.

I will use three small datasets to demonstrate the various operations. These are loaded as follows:

crashes <- st_read("points.gpkg")
## Reading layer `points' from data source 
##   `/Users/osullid3/Documents/teaching/Geog315/slides/spatial-data-wrangling/points.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 575 features and 7 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1746724 ymin: 5426053 xmax: 1749660 ymax: 5429090
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
streets <- st_read("lines.gpkg")
## Reading layer `lines' from data source 
##   `/Users/osullid3/Documents/teaching/Geog315/slides/spatial-data-wrangling/lines.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 1664 features and 2 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 1746331 ymin: 5425842 xmax: 1749730 ymax: 5429519
## Projected CRS: NZGD2000_New_Zealand_Transverse_Mercator_2000
mblocks <- st_read("polys.gpkg")
## Reading layer `polys' from data source 
##   `/Users/osullid3/Documents/teaching/Geog315/slides/spatial-data-wrangling/polys.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 406 features and 2 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 1746199 ymin: 5425271 xmax: 1749916 ymax: 5430532
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000

And make a map to see what we are dealing with:

tm_shape(mblocks) +
  tm_polygons(col = "pop2013", alpha = 0.5) +
  tm_shape(streets) +
  tm_lines(col = 'purple') +
  tm_shape(crashes) +
  tm_dots()

These data are:

  • mblocks: 2013 Census meshblocks, with the population count
  • streets: street and path network segments
  • crashes: road crash locations in 2017

All are for an area within about 1.6km of Te Herenga Waka - Victoria University.
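
Before combining layers it is worth checking that they all use the same coordinate reference system. A minimal check follows (note that the printed CRS name for streets differs slightly from the other two; if the comparison comes back FALSE you can re-project one layer to match using st_transform()):

st_crs(crashes) == st_crs(mblocks)
st_crs(streets) == st_crs(mblocks)

# if needed, re-project so that all layers match exactly
streets <- streets %>%
  st_transform(st_crs(mblocks))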

Operations between layers

The spatial operation most closely related to data table operations is a spatial join. This operation depends on the spatial relations between two layers, which are called binary (spatial) predicates. So we need to look at the predicates before considering join operations.

Binary predicates

The binary predicates determine whether or not various spatial conditions hold between two datasets. Those available in sf are listed here. They vary in their exact interpretation depending on whether the layers involved are points, lines, or polygons, but an illustration of them for polygons is shown below, to give you the idea.

These are as discussed in the paper

Egenhofer MJ & Franzosa RD, 1991. Point-set topological spatial relations. International Journal of Geographical Information Systems 5(2), 161-174.

This wikipedia article provides a good overview of the underlying concepts.
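
For a flavour of what is available, some of the most commonly used predicate functions are shown below, applied to the layers loaded earlier. This listing is mine and is not exhaustive, but all of these functions exist in sf and work in the same way as st_within() and st_contains() demonstrated next:

st_intersects(mblocks, crashes) # any overlap or contact at all
st_disjoint(mblocks, crashes)   # no contact at all (the complement of intersects)
st_touches(mblocks, mblocks)    # boundaries meet but interiors do not overlap
st_crosses(streets, mblocks)    # interiors cross (e.g. a street passing through a meshblock)
st_overlaps(mblocks, mblocks)   # interiors overlap but neither contains the other
st_covers(mblocks, crashes)     # like contains, but matches on the boundary also count
st_equals(mblocks, mblocks)     # geometrically identical features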

Given two datasets, the binary spatial predicate functions in sf return, for each geometry in the first dataset, the geometries in the second dataset to which it is related by the specified relation. For example, if we want to know which points in the crashes dataset are in which meshblocks, we would do

crashes %>% 
  st_within(mblocks)
## Sparse geometry binary predicate list of length 575, where the
## predicate was `within'
## first 10 elements:
##  1: 123
##  2: 73
##  3: 267
##  4: 264
##  5: 91
##  6: 292
##  7: 68
##  8: 49
##  9: 213
##  10: 74

The list output shows, for the first 10 points in the crashes data layer, which polygon in the mblocks dataset each one is within. If we do the equivalent test in reverse

mblocks %>% 
  st_contains(crashes)
## Sparse geometry binary predicate list of length 406, where the
## predicate was `contains'
## first 10 elements:
##  1: (empty)
##  2: (empty)
##  3: (empty)
##  4: (empty)
##  5: 326
##  6: 118, 298, 478, 505
##  7: (empty)
##  8: 108, 134, 221, 295, 348, 451, 535
##  9: 143, 216, 439
##  10: (empty)

we get which points are contained by each of the polygons. In this case the first four polygons in mblocks contain no points from crashes, the next one contains crash 326, the next one contains several crashes, and so on.

The results we see here are only the first few rows, and they are in a ‘sparse’ format showing only the relations which are TRUE. If we request the complete results by setting the option sparse to FALSE, then we get a matrix which shows, for every combination of an object in the first dataset (the rows) and an object in the second dataset (the columns), whether the required spatial relationship holds.

st_contains(mblocks, crashes, sparse = FALSE)[1:10, 1:10]
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
##  [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Usually the sparse form (the default) is less of a pain to deal with! In that form, the result is a list and extracting the information it contains requires some additional processing.

As an example, we can use the st_contains() operation to count points in polygons. If we want to count the number of crashes in each meshblock, then the lengths() function applied to the result of st_contains() gives us exactly that information.

num_crashes <- mblocks %>% 
  st_contains(crashes) %>% 
  lengths()

mblocks.n <- mblocks %>%
  mutate(n = num_crashes)

tm_shape(mblocks.n) +
  tm_polygons(col = "n")

If we want to know which particular objects appear in the result list from applying a spatial predicate, then the unlist() function gets at that. So for example, a list of the polygons that contain at least one crash is provided by

crashes %>% 
  st_within(mblocks) %>%
  unlist() %>%
  unique() # this removes any duplicate entries
##   [1] 123  73 267 264  91 292  68  49 213  74 195  76  11 153 271 194 186 144
##  [19]  81  64 103 167 118 197  45  87 381  85 113  72 111 188  93 312 201 156
##  [37]  94 184 172  13 146 245 229 405 193 333 226 334 154 147 102 282 216 145
##  [55] 171 136 347  86 183 162 203  21 138  51 190 173 289  39 307 379 199 165
##  [73]   8  99 101 310 255   6 191 364 248 254 204 284 151  97 225   9 137 205
##  [91] 273  25 131 269 202 108 305 122 116 209 373 170 198 239 250 125 276  82
## [109] 132  65 398 100 187 371  90 345 189  71 325 401 326 363 327 182 200 150
## [127] 180  88 168 281 110 238  83 303 389  37  77 119 221  24 215  78 291 120
## [145] 275 141  96 158 357  84 224   5 404 374  67 356 128  35 149 126  69 117
## [163] 252 376  92 306 338 115 290 349  28 181  34 247 311 228  63 265 366 139
## [181]  40 142 155 328 105  62 382  75 251 324 208 152

Binary operations

The two key spatial operations between two datasets are spatial filter (st_filter) and spatial join (st_join).

Spatial filter

A spatial filter operation is like the non-spatial filter function in dplyr: it picks out only those rows in the data that meet some criterion, except that here the criterion is a spatial one, based on the relationship to some other dataset. So for example, we can select the polygons that contain one or more points like this

mblocks %>%
  st_filter(crashes) %>%
  tm_shape() +
    tm_polygons() +
    tm_shape(crashes) +
    tm_dots(size = 0.01)

st_filter applies some spatial predicate to the relationship between two layers to perform the filtering. By default the predicate is st_intersects, which in most cases picks out where two spatial layers coincide with one another.
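
So the filter above is equivalent to spelling the default out explicitly. A sketch (note that the argument st_filter uses to set the predicate is named .predicate, with a leading dot):

mblocks %>%
  st_filter(crashes, .predicate = st_intersects)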

As a different example, consider a situation where we have selected only one polygon in the dataset

mb100 <- mblocks %>%
  slice(100) # this selects just row number 100 (we could pick any)
tm_shape(mblocks) +
  tm_polygons() +
  tm_shape(mb100) +
  tm_polygons(col = 'red')

Now we can get the neighbouring polygons like this, by specifying an appropriate predicate:

nmblocks <- mblocks %>%
  st_filter(mb100, .predicate = st_touches)
tm_shape(mblocks) +
  tm_polygons() +
  tm_shape(nmblocks) +
  tm_polygons(col = 'orange') + 
  tm_shape(mb100) + 
  tm_polygons(col = "red")

Note that the initially selected polygon is not included in the filtered result, because a polygon is not considered to ‘touch’ itself (its interior overlaps itself, which rules the touches relation out). That detail doesn’t matter much in this simple example, but being able to choose the predicate in this way can be used to perform complicated multistep spatial manipulations of datasets.
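
For example (a small sketch building on the objects just created), we could chain a further filter to pull out only the crashes that occurred in those neighbouring meshblocks:

crashes %>%
  st_filter(nmblocks) # keep only crashes intersecting the neighbouring polygons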

Spatial join

A spatial join not only applies some spatial predicate to the relation between two layers, but also appends the data from the second layer to the data table of the first.

This is an effective way of spatially merging datasets for further analysis. An example of st_join is in the main lab materials. The trickiest aspect of spatial joins arises when more than one object meets the join criterion. For example, if we are joining data from points to polygons then many points may intersect each polygon:

mblocks %>%
  st_join(crashes)
## Simple feature collection with 789 features and 9 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 1746199 ymin: 5425271 xmax: 1749916 ymax: 5430532
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## First 10 features:
##      MB2013 pop2013 CRASH_SEV     MULTI_VEH         LOCN1        LOCN2
## 1   2142900       9      <NA>          <NA>          <NA>         <NA>
## 2   2124200       3      <NA>          <NA>          <NA>         <NA>
## 3   2128800     177      <NA>          <NA>          <NA>         <NA>
## 4   2119900       6      <NA>          <NA>          <NA>         <NA>
## 5   2122600       0         N Multi vehicle MOLESWORTH ST LAMBTON QUAY
## 6   2131400     150         N Multi vehicle      DIXON ST  VICTORIA ST
## 6.1 2131400     150         N Multi vehicle   VICTORIA ST  FELTEX LANE
## 6.2 2131400     150         N Multi vehicle   VICTORIA ST  FELTEX LANE
## 6.3 2131400     150         N Multi vehicle      DIXON ST    WILLIS ST
## 7   2133407      36      <NA>          <NA>          <NA>         <NA>
##     fatalities serious minor                           geom
## 1           NA      NA    NA POLYGON ((1749174 5426376, ...
## 2           NA      NA    NA POLYGON ((1748753 5428642, ...
## 3           NA      NA    NA POLYGON ((1748281 5427565, ...
## 4           NA      NA    NA POLYGON ((1748645 5429096, ...
## 5            0       0     0 POLYGON ((1749061 5428876, ...
## 6            0       0     0 POLYGON ((1748521 5427439, ...
## 6.1          0       0     0 POLYGON ((1748521 5427439, ...
## 6.2          0       0     0 POLYGON ((1748521 5427439, ...
## 6.3          0       0     0 POLYGON ((1748521 5427439, ...
## 7           NA      NA    NA POLYGON ((1748813 5427143, ...

There are only 406 polygons in the mblocks dataset, but the joined dataset has 789!

This is because a new row is created each time a crash point occurs inside a meshblock. In effect the table we get as output is a table of the ‘spatial joins’ between the two original datasets, where a join is a point inside a polygon. Notice also that where a polygon contains no points we get <NA> entries in the table, because no data were found in the crashes layer that intersected with the mblocks layer, but the polygon is retained anyway. The numbers add up: every one of the 575 crashes falls inside a meshblock (see the inner join below), and the other 789 - 575 = 214 rows are the meshblocks that contain no crashes.

If we simply don’t want the <NA> values we can specify an ‘inner join’ so that only records where a match was found are retained. We do this by overriding the default ‘left join’ behaviour, which prioritises retaining all rows in the left hand (i.e. first) table.

mblocks %>%
  st_join(crashes, left = FALSE)
## Simple feature collection with 575 features and 9 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 1746575 ymin: 5425845 xmax: 1749846 ymax: 5429182
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## First 10 features:
##      MB2013 pop2013 CRASH_SEV     MULTI_VEH         LOCN1         LOCN2
## 5   2122600       0         N Multi vehicle MOLESWORTH ST  LAMBTON QUAY
## 6   2131400     150         N Multi vehicle      DIXON ST   VICTORIA ST
## 6.1 2131400     150         N Multi vehicle   VICTORIA ST   FELTEX LANE
## 6.2 2131400     150         N Multi vehicle   VICTORIA ST   FELTEX LANE
## 6.3 2131400     150         N Multi vehicle      DIXON ST     WILLIS ST
## 8   2134800     177         N Multi vehicle         SH 1N   TARANAKI ST
## 8.1 2134800     177         N Multi vehicle   TARANAKI ST ABEL SMITH ST
## 8.2 2134800     177         N Multi vehicle   TARANAKI ST     ARTHUR ST
## 8.3 2134800     177         N Multi vehicle         SH 1N   TARANAKI ST
## 8.4 2134800     177         N Multi vehicle         SH 1N   TARANAKI ST
##     fatalities serious minor                           geom
## 5            0       0     0 POLYGON ((1749061 5428876, ...
## 6            0       0     0 POLYGON ((1748521 5427439, ...
## 6.1          0       0     0 POLYGON ((1748521 5427439, ...
## 6.2          0       0     0 POLYGON ((1748521 5427439, ...
## 6.3          0       0     0 POLYGON ((1748521 5427439, ...
## 8            0       0     0 POLYGON ((1748721 5426755, ...
## 8.1          0       0     0 POLYGON ((1748721 5426755, ...
## 8.2          0       0     0 POLYGON ((1748721 5426755, ...
## 8.3          0       0     0 POLYGON ((1748721 5426755, ...
## 8.4          0       0     0 POLYGON ((1748721 5426755, ...

This still leaves the question of how to combine the multiple pieces of information now associated with each polygon that matched more than one crash. There is no single solution here: what makes sense depends on the data and how we interpret them. Usually we group the joined data by a unique identifier from the original (left hand) data table, and then use summarise to calculate appropriate new variables. For example

crash_counts <- mblocks %>%
  st_join(crashes) %>%
  group_by(MB2013) %>% # this variable uniquely identifies each row
  summarise(num_crashes = n(), # this counts the number of matching cases
            fatalities = sum(fatalities), 
            serious = sum(serious),
            minor = sum(minor),
            pop2013 = first(pop2013)) %>%
  select(MB2013, pop2013, num_crashes, fatalities, serious, minor)

as_tibble(crash_counts)
## # A tibble: 406 × 7
##    MB2013 pop2013 num_crashes fatalities serious minor                      geom
##    <chr>    <int>       <int>      <int>   <int> <int>             <POLYGON [m]>
##  1 21019…       0           1         NA      NA    NA ((1748508 5430325, 17485…
##  2 21024…       3           1         NA      NA    NA ((1747656 5429203, 17476…
##  3 21025…     198           1         NA      NA    NA ((1747332 5429467, 17473…
##  4 21028…     174           1         NA      NA    NA ((1747086 5429464, 17470…
##  5 21030…      21           1         NA      NA    NA ((1747462 5429071, 17474…
##  6 21030…      51           1         NA      NA    NA ((1747558 5429004, 17475…
##  7 21055…       0           1         NA      NA    NA ((1746865 5428490, 17468…
##  8 21061…      39           2          0       0     2 ((1746734 5428512, 17467…
##  9 21111…     132           3          0       0     1 ((1746722 5428375, 17467…
## 10 21112…     243           1         NA      NA    NA ((1746568 5428275, 17465…
## # ℹ 396 more rows

The biggest problem this leaves us with is a lot of NA values. The best way to deal with these is to use the tidyr::replace_na() function in the summarise() calculation to replace NA values with an appropriate alternative. For example, we can do

crash_counts <- mblocks %>%
  st_join(crashes) %>%
  group_by(MB2013) %>%
  summarise(num_crashes = n(),
            fatalities = replace_na(sum(fatalities), 0),
            serious = replace_na(sum(serious), 0),
            minor = replace_na(sum(minor), 0),
            pop2013 = first(pop2013)) %>%
  select(MB2013, pop2013, num_crashes, fatalities, serious, minor)

as_tibble(crash_counts)
## # A tibble: 406 × 7
##    MB2013 pop2013 num_crashes fatalities serious minor                      geom
##    <chr>    <int>       <int>      <int>   <int> <int>             <POLYGON [m]>
##  1 21019…       0           1          0       0     0 ((1748508 5430325, 17485…
##  2 21024…       3           1          0       0     0 ((1747656 5429203, 17476…
##  3 21025…     198           1          0       0     0 ((1747332 5429467, 17473…
##  4 21028…     174           1          0       0     0 ((1747086 5429464, 17470…
##  5 21030…      21           1          0       0     0 ((1747462 5429071, 17474…
##  6 21030…      51           1          0       0     0 ((1747558 5429004, 17475…
##  7 21055…       0           1          0       0     0 ((1746865 5428490, 17468…
##  8 21061…      39           2          0       0     2 ((1746734 5428512, 17467…
##  9 21111…     132           3          0       0     1 ((1746722 5428375, 17467…
## 10 21112…     243           1          0       0     0 ((1746568 5428275, 17465…
## # ℹ 396 more rows

Spatial intersection or overlay

Finally, you may want to clip one layer with another. This is done using the st_intersection function. Here’s how it works, using the convex hull of a set of points (we’ll see what the convex hull is in a moment).

# Clipping the polygons by the convex hull of the points
hull <- crashes %>%
  st_union() %>%
  st_convex_hull()

mblocks %>% 
  st_intersection(hull) %>%
  plot()
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
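
The warning is telling you that any attributes (such as pop2013) are simply copied onto the clipped pieces, even though those pieces may be only fractions of the original polygons. That is fine for categories or rates, but can be misleading for counts. A rough sketch of one way to handle counts, assuming (unrealistically) that population is spread evenly within each meshblock; pop_est is just an illustrative name:

mblocks %>%
  mutate(area_orig = st_area(.)) %>% # area of each meshblock before clipping
  st_intersection(hull) %>%
  mutate(area_clip = st_area(.), # area retained after clipping
         pop_est = pop2013 * as.numeric(area_clip / area_orig))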

Spatial operations on single objects

Buffering with st_buffer

Buffering expands spatial objects outwards by some specified distance, which you provide via the dist parameter.

crashes.buf <- crashes %>%
  st_buffer(dist = 50)
mblocks.buf <- mblocks %>%
  st_buffer(dist = -20)
streets.buf <- streets %>%
  st_buffer(dist = 20)

This function preserves any variables associated with the dataset in the buffered dataset. A set of points will become a set of circles when buffered, whereas lines become ‘lozenge’ shapes, and polygons become larger, smoothed versions of themselves.

Buffering is frequently used to explore questions of proximity, such as the spatial relationship between liquor outlets and schools, or accessibility to public transport.
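
As a small illustration with the data at hand (the object name is mine), we can use the buffered streets created above to pick out only the crashes recorded within 20m of a street:

crashes_near_streets <- crashes %>%
  st_filter(streets.buf) # keep crashes that intersect the 20m street buffers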

tmap_mode("plot")
## tmap mode set to plotting
m1 <- tm_shape(mblocks.buf) +
  tm_polygons(alpha = 0.75)
m2 <- tm_shape(streets.buf) +
  tm_polygons(alpha = 0.75)
m3 <- tm_shape(crashes.buf) +
  tm_polygons(alpha = 0.75)
tmap_arrange(m1, m2, m3, nrow = 1)
## Warning: The shape mblocks.buf contains empty units.

tmap_mode(map_mode)
## tmap mode set to plotting

If dist is negative (as it is here for the meshblocks) it will shrink the geometry in on itself. This only makes sense for polygons, and if the shrinking is too large it may cause errors or warnings (as small polygons collapse to nothing or turn inside-out…).
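
Relatedly, the ‘empty units’ warning from tmap above arises because some small meshblocks collapse to empty geometries under the negative buffer. A minimal way to tidy these up before mapping is to drop the empty geometries:

mblocks.buf <- mblocks.buf %>%
  filter(!st_is_empty(geom)) # drop polygons that have shrunk away to nothing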

Centroid with st_centroid

The centroid of a geometry is the single point at its geometric centre (its centre of mass).

tm_shape(mblocks) +
  tm_polygons() +
  tm_shape(mblocks %>% st_centroid()) +
  tm_dots()
## Warning: st_centroid assumes attributes are constant over geometries

The centroid is not guaranteed to be inside the geometry. There is at least one example of this in the example data - and most likely more.
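
If you are curious, one way to check is to test each centroid against its own polygon; the diagonal of the dense predicate matrix tells us whether each meshblock contains its own centroid (a sketch, output not shown):

centroids <- mblocks %>% st_centroid()
inside <- st_within(centroids, mblocks, sparse = FALSE) %>% diag()
sum(!inside) # how many meshblocks do not contain their own centroid?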

If you need a representative point guaranteed to be inside a polygon, then use st_point_on_surface() instead, although be aware that the location of this point is not guaranteed to be the same every time you run the function!

Convex hull with st_convex_hull

The convex hull of a set of points is the shape you would get if you stretched an elastic band around the outermost of the points that define the object. The easiest way to illustrate this is to extract a single polygon from the mblocks dataset.

mb <- mblocks %>%
  filter(MB2013 == "2124300") # MB2013 is stored as character, so quote the value

Now plot this and its convex hull

tm_shape(mb) +
  tm_polygons() +
  tm_shape(mb %>% st_convex_hull()) +
  tm_polygons(col = 'blue', alpha = 0.35, border.col = 'red')

The convex hull of a geometry is a nice simplified summary of its location. It is particularly useful for representing a collection of points. However, if you apply st_convex_hull() directly to a point dataset, you will just get the same points back, because the function is applied to each point individually (and the convex hull of a single point is the point itself).

To create the convex hull of the whole set of points, you have to first merge them into a multipoint with the st_union operation:

crashes %>%
  st_union() %>%
  st_convex_hull() %>%
  tm_shape() +
    tm_polygons(alpha = 0.5, border.col = "red") +
    tm_shape(crashes) + 
    tm_dots()

Operations applied to a collection of objects

Some operations only make sense when applied to a collection of geometries, not one object at a time. Two in particular that are widely used are st_voronoi() and st_triangulate().

Voronoi with st_voronoi

The Voronoi region associated with a geometry is the region of space closer to that geometry than to any other geometry in a set of geometries. This is easiest to understand for points.

Unfortunately, the st_voronoi operation applied to a point dataset behaves rather unexpectedly, so a series of transformations is required to convert the result into a useful Voronoi polygon dataset:

voronoi <- crashes %>%
  st_union() %>% # combine points into a multipoint
  st_voronoi() %>%
  st_cast() %>% # split the GEOMETRYCOLLECTION that st_voronoi returns into separate polygons
  st_as_sf() # convert to an sf dataset

And now let’s plot them along with the originating points.

tm_shape(voronoi) +
  tm_polygons(alpha = 0.5) +
  tm_shape(crashes) +
  tm_dots()

By default the function computes the Voronoi polygons over an envelope that extends well beyond the extent of the points themselves. You can trim this excess away by intersecting the result with a bounding box or convex hull:

crashes.hull <- crashes %>%
  st_union() %>%
  st_convex_hull() %>%
  st_buffer(dist = 100)

voronoi %>% 
  st_intersection(crashes.hull) %>% 
  tm_shape() +
  tm_polygons() +
  tm_shape(crashes) + 
  tm_dots()

The Voronoi polygons of a set of points can be considered ‘service areas’ of the points. Imagine the points are (say) bus stops: each polygon is then the area closer to a given bus stop than to any other bus stop, so, other things being equal, it is the ‘service area’ of that bus stop.
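
One practical detail: the Voronoi polygons come back without any of the attributes of the points that generated them. If you need those attributes attached (to label each ‘service area’, say), a spatial join will recover them. A sketch, noting that the duplicated crash locations mentioned later mean a few polygons may match more than one point:

voronoi %>%
  st_intersection(crashes.hull) %>% # clip to the buffered hull as above
  st_join(crashes) # attach the attributes of the point(s) each polygon contains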

Of course… the areas are determined without reference to movement on street networks or hills or anything else, so they are too simplistic to apply in unprocessed form. Alternative methods involve calculating distances over networks rather than just straight point-to-point distances.

An interesting example of the simple form of Voronoi, but applied to true great circle distances on the Earth’s surface, is this one by Jason Davies for world airports.

Triangulation with st_triangulate

The triangulation of a set of points connects the points together into a mesh of triangles. As with the Voronoi operation, some annoying additional conversions are required.

tri <- crashes %>%
  st_union() %>% # combine points into a multipoint
  st_triangulate() %>%
  st_cast() %>% # again, split the GEOMETRYCOLLECTION into separate triangles
  st_as_sf() # convert to an sf dataset

tm_shape(tri) +
  tm_polygons(alpha = 0.5) +
  tm_shape(crashes) +
  tm_dots()

The triangulation, the Voronoi polygons, and the convex hull are all closely related: the triangulation produced here is the Delaunay triangulation, which is the geometric dual of the Voronoi diagram, and the convex hull of the points is also the outer perimeter of the triangulation.
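
You can check that last claim directly: dissolving the triangles back into a single polygon should recover the convex hull computed earlier (a sketch; st_equals tests geometric equality, and in principle tiny numerical differences could upset it):

st_equals(tri %>% st_union(), hull, sparse = FALSE)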

Alphashapes

An alternative way to summarise the locations of a collection of points is the alphashape, which is based on removing edges from the triangulation until the remaining triangles more closely fit the shape of the set of points. However, alphashapes are not supported by any functions in the sf package. An R package that does support them is alphahull, but it is not directly compatible with sf. If you want to see what an alphashape looks like, try running the following code:

library(alphahull)

xy <- crashes %>% 
  st_coordinates()
xy <- unique(xy) # there are some duplicate points in the data, which we need to remove
plot(ashape(xy, alpha = 350), asp = 1) # larger alpha gives a smoother, more convex shape

Finally

library(RXKCD)
getXKCD(627)$img
## Warning in readPNG(paste(tempdir(), "xkcd.png", sep = "/")): libpng warning:
## iCCP: profile 'Photoshop ICC profile': 'RGB ': RGB color space not permitted on
## grayscale PNG

## [1] "https://imgs.xkcd.com/comics/tech_support_cheat_sheet.png"