Geographic Cluster Analysis

Geog 315 T2 2023

Overview

The lecture material related to this lab is available here:

Meanwhile, here is a quick summary…

Clustering methods segment the observations in a dataset into clusters or classes based on the similarities and differences between them. The aim is to break the dataset up so that observations in the same class are similar to one another, and different from observations in other classes.

Unfortunately, there is no easy way to define clusters beyond recognising that they are the groups of observations identified by a clustering method! Clustering analysis therefore depends a great deal on the interpretation of an analyst to give it meaning.

What do we mean by ‘similar’ and ‘different’? We extend the basic idea of distance in (two-dimensional) space, where the distance between observation i and observation j is given by

\[d_{ij}=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}\]

that is, the square root of the sum of the squared differences in each coordinate. But now we apply this measure to many data dimensions. So if we have a dataset with (say) 15 attributes, we are in a 15-dimensional data space, and for every pair of observations we take the squared difference in each of the 15 dimensions (i.e. on each variable), add these together, and take the square root.
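In general, if each observation is described by m attributes, and x_{ik} denotes the value of attribute k for observation i, the distance becomes

\[d_{ij}=\sqrt{\sum_{k=1}^{m}\left(x_{ik}-x_{jk}\right)^2}\]

which reduces to the two-dimensional formula above when m = 2.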

Other versions of the basic idea of ‘total difference’ in attribute values are possible.

An important consideration is that all the attributes should be rescaled, so that differences in one attribute that happens to have large values associated with it don’t ‘drown out’ differences in the other variables. For example, if one variable is mean income in dollars and has values like 25000 or 50000, while another is the proportion of children under 5 and has values like 0.04 or 0.05, then without rescaling the differences in income will swamp any differences in the demographic mix.

A similar concern is to avoid including many strongly correlated variables in the analysis, since these effectively count the same underlying differences more than once.
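To make these ideas concrete, here is a minimal sketch of min-max rescaling in R using dplyr, along with a quick way to scan for strongly correlated pairs of variables. Here mydata.d is a hypothetical data-only table standing in for whatever data you are working with.

library(dplyr)

# rescale every attribute to the range 0 to 1 (min-max rescaling)
# mydata.d is a hypothetical data-only table of numeric attributes
mydata.scaled <- mydata.d %>%
  mutate(across(everything(), ~ (.x - min(.x)) / (max(.x) - min(.x))))

# scan pairwise correlations for strongly related variables
round(cor(mydata.scaled), 2)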

Libraries first…

As usual, we need some libraries.

library(sf)
library(tmap)
library(dplyr)

Example data

I have prepared an example dataset so we can focus on the clustering itself. For the assignment proper, you are also asked to consider doing some data preparation to select variables and perhaps rescale data values before applying cluster analysis.

The example data are a collection of demographic variables for San Francisco, California from the 2010 census. You can explore the dataset interactively at this website to get a feel for things.

Download the data from this link, and open them in R

sanfran <- st_read("sf_demo.gpkg")
## Reading layer `sf_demo' from data source 
##   `/Users/osullid3/Documents/teaching/Geog315/_labs/week-05/sf_demo.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 189 features and 25 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.5145 ymin: 37.70813 xmax: -122.3679 ymax: 37.81144
## Geodetic CRS:  WGS 84

If you run summary(sanfran) you’ll see that all these variables have been scaled so that the values range from 0 to 1. We won’t worry for the sake of the example exactly how this has been done for this dataset, but it’s something we need to pay attention to for the assignment dataset.

Make a data only copy

For some of what follows the existence of the geometry attribute causes errors, so it is helpful to make a data-only copy. It is convenient to designate this with a .d suffix on the name. Because it is a direct copy, the rows of the data table stay in the same order, which is important later when we do the cluster analysis.

sanfran.d <- sanfran %>%
  st_drop_geometry() %>%
  select(-id) # also remove ID variable since it is just an identifier

Note that we remove the id variable because it is a meaningless identifier and should not be included in the clustering analysis.

Getting a feel for the data

This is a complicated dataset, with 24 attributes. We’re likely to be interested in lots of possible relationships and patterns in the data, but it’s difficult to get a handle on things with so much going on. In this section I’ll show you a few possibilities, which you may also wish to try with the assignment data later. We don’t really have the time to cover what is going on with all these plotting methods in detail in this course (but I am happy to answer any questions you may have… post them in the slack workspace so everyone can benefit from the explanations, or if they are very specific, just ask).

Univariate maps

The plot() function will make small maps of the first 9 or 10 variables. If you’d like it to map others, use select() to pick them out first (we use select() this way in the scatterplot section below).

sanfran %>%
  plot()
## Warning: plotting the first 9 out of 25 attributes; use max.plot = 25 to plot
## all

Note that the colours are a rainbow palette with purple/blue for low values through to orange/yellow for high. You can change the colour palette if you like, but it’s a bit messy, unfortunately… the plot() function is just for a quick look-see, not really for final maps.

sanfran %>%
  plot(pal = RColorBrewer::brewer.pal(9, "Reds"))
## Warning: plotting the first 9 out of 25 attributes; use max.plot = 25 to plot
## all

Keep in mind that the plot() function is only showing you the first few variables; there are 15 more! If any variable is of particular interest, map it with tmap in the usual way.

In the short sections that follow, I show some ways to examine all the data, using packages we haven’t worked with before. There isn’t really scope in this course to fully explain these operations; the focus is simply on getting an idea of the data.

Boxplots of all the variables

To run the code in this section you need to install and load the ggplot2 package.

require(ggplot2)
## Loading required package: ggplot2
require(tidyr)
## Loading required package: tidyr

We can use the scatter plot and boxplot functionality in this package to get an overview of all the variables in a single graphic. It’s unfortunately quite a complicated procedure because it involves transforming the data to a ‘long’ format, where each row is the observation id, a variable name, and the corresponding value of that variable. That’s what pivot_longer() is doing.

The code below will show each variable as a row of dots. To change to boxplots, change geom_point to geom_boxplot and run it again (the boxplot version is shown after the dot plot code below). It’s best just to copy and paste this code and run it to see the result. If you’d like to understand better how it works, just ask me!

sanfran %>%
  st_drop_geometry() %>%
  pivot_longer(-id) %>%
  ggplot() +
    geom_point(aes(x = value, y = name)) +
    xlab("Value") +
    ylab("Variable")

Scatterplots of all the variables

With 24 variables a full scatterplot matrix is impractical, but we can easily do subsets. Note that we again use plot(), but this time on the data-only .d version of the dataset that we made (otherwise it will make maps!)

sanfran.d %>% 
  select(1:6) %>%  
  plot()

Here, since we are only using this dataset to illustrate how to perform clustering analysis and aren’t interested in its structure as such, we will not spend any more time on this data exploration. In a more realistic situation you would probably want to do so.

Making k-means clusters

Our chosen approach is k-means clustering. The method (or algorithm) is simple:

  1. Decide on the number of clusters you want, call this k
  2. Choose k cluster centres
  3. Assign each observation to its nearest cluster centre
  4. Calculate the mean centre of each cluster and move the cluster centre accordingly
  5. Go back to 3 and repeat, until the cluster assignments stop changing

Here’s an illustration of this working to help with following the description above.
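If it helps to see the steps written out, here is a minimal sketch of the algorithm in R, purely for intuition. It makes no attempt to be efficient or to handle edge cases (such as a cluster ending up empty), and in practice we always use the built-in kmeans() function shown below.

simple_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  # step 2: choose k initial centres (k randomly selected observations)
  centres <- data[sample(nrow(data), k), , drop = FALSE]
  assignment <- rep(0, nrow(data))
  for (iter in 1:max_iter) {
    # step 3: assign each observation to its nearest centre
    d <- as.matrix(dist(rbind(centres, data)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    # step 5: stop when the assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # step 4: move each centre to the mean of its cluster
    for (i in 1:k) {
      centres[i, ] <- colMeans(data[assignment == i, , drop = FALSE])
    }
  }
  assignment
}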

It’s important to realise that k-means clustering is non-deterministic, because the choice of initial cluster centres is usually random, and may affect the final assignment arrived at.

So here is how we accomplish this in R.

km <- kmeans(sanfran.d, 5)

That’s it! All the results for the chosen number of clusters (specified as 5 in the above example) are now contained in the object km. Take a look at it by selecting it from the Environment tab, or just type km at the console:

km
## K-means clustering with 5 clusters of sizes 48, 22, 62, 12, 45
## 
## Cluster means:
##      density medianYearBuilt  PConeUnit PCownerOccUnits PCcommutingNotCar
## 1 0.15830106      0.04468599 0.18707571      0.32748458         0.4790187
## 2 0.09007924      0.10737813 0.72935782      0.68007831         0.3336848
## 3 0.19528343      0.17554932 0.14235860      0.21976965         0.5600920
## 4 0.48104157      0.05434783 0.02950681      0.05283106         0.7973429
## 5 0.13300482      0.13075684 0.77188985      0.60691491         0.3344236
##   PCmovedWithinCounty PClessHighSchool PCsomeCollegeOrMore PCdoctorate
## 1          0.08551634       0.03061757           0.9138524 0.040493135
## 2          0.06816078       0.06704645           0.8384086 0.036165759
## 3          0.09556696       0.12284678           0.7514256 0.019881618
## 4          0.10131849       0.36386356           0.4415528 0.005213407
## 5          0.05662245       0.23142766           0.5470680 0.010613916
##   PCmarriedCouple PCunmarriedSSCouple PCwithKids PCsexMale medianAge
## 1       0.2767255          0.03720100  0.1374825 0.5278012 0.3055436
## 2       0.4970033          0.03172039  0.2626231 0.5057769 0.4019715
## 3       0.2675947          0.01567749  0.1564369 0.5122189 0.2928211
## 4       0.2498533          0.01015470  0.1275014 0.5464857 0.4840792
## 5       0.4805850          0.01150318  0.3322158 0.4850445 0.3544846
##   PCraceBlackAlone PCraceAsianAlone PCraceHispanic PCraceWhiteAlone
## 1       0.02549538        0.1397737     0.08916732        0.7634400
## 2       0.03497096        0.2815608     0.13394869        0.5748927
## 3       0.07499380        0.2881595     0.16565969        0.5103899
## 4       0.07050091        0.5468529     0.13516418        0.2803504
## 5       0.08158363        0.5122083     0.20453925        0.2646261
##   PCforeignBorn perCapitaIncome PCunemployed PCpoorStruggling PCwithInterests
## 1     0.1935600       0.5226991   0.05419155        0.1490416       0.3770700
## 2     0.2730182       0.3586822   0.07080560        0.1623070       0.4451737
## 3     0.3436578       0.2588566   0.07624802        0.3423988       0.2475574
## 4     0.5844461       0.1055708   0.09731570        0.6554962       0.1101683
## 5     0.4886468       0.1426807   0.10696367        0.3196807       0.2290423
##   PCveryWealthyHHolds
## 1          0.24636330
## 2          0.23189893
## 3          0.10770211
## 4          0.02656405
## 5          0.07736595
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   3   1   3   1   1   4   4   3   1   3   3   1   4   3   4   3   3   3   3   4 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   4   4   3   4   3   4   4   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   3   1   1   1   3   3   3   3   1   3   3   3   3   3   3   3   1   1   3   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   1   1   1   1   3   3   4   3   3   3   3   1   1   1   1   1   1   3   3   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   1   1   1   1   2   2   2   1   1   1   3   3   3   3   3   3   5   5   3 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   3   5   5   5   2   2   1   2   2   5   5   5   5   5   5   5   5   5   5   5 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   5   5   5   5   5   5   5   5   5   1   2   3   3   2   2   2   2   2   2   2 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   2   2   2   5   5   5   5   5   3   5   5   5   5   5   5   5   2   3   3   3 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   5   5   3   5   5   3   3   3   3   3   2   3   3   3   3   3   5   3   2   2 
## 181 182 183 184 185 186 187 188 189 
##   1   3   5   3   5   4   5   1   3 
## 
## Within cluster sum of squares by cluster:
## [1]  6.506995  4.074232 18.284928  3.586401  8.389699
##  (between_SS / total_SS =  60.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Most of this is ‘too much information’. The vital information—which observation is in which cluster—is contained in the cluster component, which we can extract and add as a new variable to our dataset like this:

sanfran <- sanfran %>%
  mutate(k5 = as.factor(km$cluster))

Note how I have given this new variable a name that reflects the number of clusters, and that I have added it to the spatial dataset. I have also added it as a factor, which means R will know it is a categorical variable and not a meaningful number.

That means we can map it (your map will look different!)

tm_shape(sanfran) +
  tm_polygons(col = 'k5') +
  tm_legend(legend.outside = TRUE)

Note how tmap knows to use a categorical palette because k5 is a factor.
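To see how many observations ended up in each cluster, tabulate the new variable:

table(sanfran$k5)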

Notes

You can run the above code again and you will probably end up with a different (if similar) map. You can also run it specifying a different number of clusters in the kmeans() function.

If you do this, make sure to give the new variable in the mutate operation a different name reflecting the number of clusters, as in the example below.
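For example (k7 is just a suggested name):

km7 <- kmeans(sanfran.d, 7)
sanfran <- sanfran %>%
  mutate(k7 = as.factor(km7$cluster))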

The fact that the results may not always be the same means that interpretation of clusters can be quite challenging.
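Two things can help. Setting the random seed before calling kmeans() makes a run repeatable, and the nstart argument runs the algorithm from several random starting configurations and keeps the best result. The particular numbers below are arbitrary choices, not recommendations.

set.seed(42)  # any fixed number makes the run repeatable
km <- kmeans(sanfran.d, 5, nstart = 25)  # keep the best of 25 random starts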

Interpretation of clusters

The ‘quality’ of a particular clustering solution is dependent on how well we think we can interpret it. In this case, for an unfamiliar setting, you probably can’t do much. Measures of the variance within and between clusters can be used to assess ‘quality’ in a more technical sense and are available from the kmeans object produced by the function.
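For example, the sums of squares reported when km is printed can be accessed directly (these names appear in the ‘Available components’ list above):

km$withinss      # within-cluster sum of squares, one value per cluster
km$tot.withinss  # total within-cluster sum of squares (lower = tighter clusters)
km$betweenss     # between-cluster sum of squares (higher = better separated)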

Here is some code you can use to compare a specific variable across clusters (again, your plot will look different)

boxplot(PCwithKids ~ k5, data = sanfran)

You have to use this code one variable at a time. You can get an overview of which variables are particularly high or low in each cluster from the centers component of the kmeans() results:

km$centers
##      density medianYearBuilt  PConeUnit PCownerOccUnits PCcommutingNotCar
## 1 0.15830106      0.04468599 0.18707571      0.32748458         0.4790187
## 2 0.09007924      0.10737813 0.72935782      0.68007831         0.3336848
## 3 0.19528343      0.17554932 0.14235860      0.21976965         0.5600920
## 4 0.48104157      0.05434783 0.02950681      0.05283106         0.7973429
## 5 0.13300482      0.13075684 0.77188985      0.60691491         0.3344236
##   PCmovedWithinCounty PClessHighSchool PCsomeCollegeOrMore PCdoctorate
## 1          0.08551634       0.03061757           0.9138524 0.040493135
## 2          0.06816078       0.06704645           0.8384086 0.036165759
## 3          0.09556696       0.12284678           0.7514256 0.019881618
## 4          0.10131849       0.36386356           0.4415528 0.005213407
## 5          0.05662245       0.23142766           0.5470680 0.010613916
##   PCmarriedCouple PCunmarriedSSCouple PCwithKids PCsexMale medianAge
## 1       0.2767255          0.03720100  0.1374825 0.5278012 0.3055436
## 2       0.4970033          0.03172039  0.2626231 0.5057769 0.4019715
## 3       0.2675947          0.01567749  0.1564369 0.5122189 0.2928211
## 4       0.2498533          0.01015470  0.1275014 0.5464857 0.4840792
## 5       0.4805850          0.01150318  0.3322158 0.4850445 0.3544846
##   PCraceBlackAlone PCraceAsianAlone PCraceHispanic PCraceWhiteAlone
## 1       0.02549538        0.1397737     0.08916732        0.7634400
## 2       0.03497096        0.2815608     0.13394869        0.5748927
## 3       0.07499380        0.2881595     0.16565969        0.5103899
## 4       0.07050091        0.5468529     0.13516418        0.2803504
## 5       0.08158363        0.5122083     0.20453925        0.2646261
##   PCforeignBorn perCapitaIncome PCunemployed PCpoorStruggling PCwithInterests
## 1     0.1935600       0.5226991   0.05419155        0.1490416       0.3770700
## 2     0.2730182       0.3586822   0.07080560        0.1623070       0.4451737
## 3     0.3436578       0.2588566   0.07624802        0.3423988       0.2475574
## 4     0.5844461       0.1055708   0.09731570        0.6554962       0.1101683
## 5     0.4886468       0.1426807   0.10696367        0.3196807       0.2290423
##   PCveryWealthyHHolds
## 1          0.24636330
## 2          0.23189893
## 3          0.10770211
## 4          0.02656405
## 5          0.07736595

This is quite difficult to interpret (all those numbers), but scanning the information should enable you to spot which clusters have particularly high or low values of particular variables.

For a more visual presentation of all the information we can use some more complicated plotting code, as shown below.

sanfran %>%
  select(-id) %>%
  st_drop_geometry() %>%
  pivot_longer(-k5) %>%
  ggplot() +
    geom_point(aes(x = value, y = name, colour = k5)) +
    facet_wrap(~ k5, nrow = 1)

Here, each column of the plot shows all the values for each variable in a single cluster. Scanning across the data you might be able to identify differences among the clusters, or variables that are particularly high or low in one cluster relative to their values in others.

The assignment data: Auckland demographic data, 2018

OK… you will be relieved to know that the assignment is based on a dataset you probably know a bit more about, namely data from the 2018 census concerning general population characteristics across Auckland. The definition of Auckland used is non-standard, focused on the ‘old’ Auckland cities of Waitakere, North Shore City, Manukau City and Auckland City ‘proper’ rather than on the extensive surrounding rural areas included in the ‘super city’.

Download the data from this link and save it to your project folder.

Load the data, and make a data-only copy

ak_demo <- st_read("ak-demographics-2018.gpkg")
## Reading layer `ak-demographics-2018' from data source 
##   `/Users/osullid3/Documents/teaching/Geog315/_labs/week-05/ak-demographics-2018.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 472 features and 39 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1728669 ymin: 5887247 xmax: 1803822 ymax: 5971022
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
ak_demo.d <- ak_demo %>%
  st_drop_geometry() %>%
  select(-(sa2_id:pop)) # remove the general variables

These data were compiled from various Census data tables, using the NZ.Stat tool at this link. You can get an overview of the distribution of the different attributes included in the data from this basic webmap.

Specifically, the variables in this dataset are as follows; unless otherwise stated they are expressed as percentages (%) of the population responding to that question:

Group       Variable name     Meaning
General     sa2_id            Statistical area 2 ID
            sa2_name          Statistical area 2 name
            ta_name           Territorial authority name (these are all ‘Auckland’)
            pop               Total usually resident population
Age         under15           Aged under 15
            a_15to29years     Aged 15-29
            a_30to64years     Aged 30-64
            over65            Aged 65 or over
Ethnicity   pakeha            Pākehā
            maori             Māori
            pacific           Pacific peoples
            asian             Asian (includes East and South Asian)
            melaa             Middle Eastern, Latin American and African
            other_ethnicity   Other ethnicities
Migration   ur5_same          Usually resident 5 years ago at the same location
            ur5_in NZ         Usually resident 5 years ago elsewhere in New Zealand
            ur5_overseas      Usually resident 5 years ago overseas
Income      median_income     Median household income (1000s of $)
            under20k          Household income under $20,000
            i_20to30k         Household income between $20,001 and $30,000
            i_30to50k         Household income between $30,001 and $50,000
            i_50to70k         Household income between $50,001 and $70,000
            i_70to100k        Household income between $70,001 and $100,000
            i_100to150k       Household income between $100,001 and $150,000
            over150k          Household income over $150,000
Tenure      renting           % of households renting
Education   bachelors         Education to bachelor’s degree
            high_school       Education level between high school and degree
            postgrad          Education beyond bachelor’s degree
Religion    christian         Christian
            maori_religions   Māori religions, beliefs and philosophies
            buddhism          Buddhism
            hinduism          Hinduism
            islam             Islam
            judaism           Judaism
            new_age           Spiritualism and new age religions
            sikhism           Sikhism
            jedi              Jedi
            no_religion       No religion

If you are unsure about any of this then ask.

Note that the attributes are provided as percentages. You may decide that you need to change variables to estimated counts (based on the pop attribute) or to further standardise the numbers so that each attribute is expressed relative to its overall distribution.
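For example, here is a hedged sketch of both options, assuming you have already made the data-only ak_demo.d table as above; ak_demo.z and under15_n are just suggested names. scale() expresses each variable in standard deviations from its own mean (z-scores), and the second snippet converts one percentage variable to an estimated count.

# z-score standardise every variable relative to its own distribution
ak_demo.z <- ak_demo.d %>%
  mutate(across(everything(), ~ as.numeric(scale(.x))))

# or estimate a count from a percentage using the pop attribute
ak_demo <- ak_demo %>%
  mutate(under15_n = round(under15 / 100 * pop))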

I advise you to make some maps to get a feel for the data before attempting to do cluster analysis.

Assignment 3: Cluster analysis of demographic data

This assignment is quite open-ended. Here is the description:

Run a cluster analysis of the provided Auckland demographic data and map the resulting clusters. Provide a short write-up explaining how you did the analysis (not the commands used; focus instead on the number of clusters and the variables included or excluded). Explain what you think the results tell us about the overall geographical community structure of the Auckland Region. The overall write-up should be no more than 2-3 pages, including any maps and diagrams.

Things to consider

Here are a few things to be thinking about while working on this assignment.

  • How many clusters make sense? How many different kinds of ‘place’ are there in a region like Auckland? Think about how you expect the region to break down into different kinds of place based on the kinds of communities that live there.
  • Try to provide verbal descriptions of the clusters you identify so that a reader can make sense of them without studying tables of data or complex charts.
  • Keep in mind that clusters are a categorical data type which has implications for the style to use when mapping them.
  • Consider if you want to retain all the supplied variables in the analysis. Maybe some are redundant, or you can get a clearer result by removing a few (but not so many that it no longer makes sense to cluster the remaining variables).
  • Consider if it is appropriate to rescale variables so they are expressed relative to their own distribution (e.g. 50 might be a high number for some variables, but not at all high for others). You don’t have to do this to get a good mark, but you should still think about it and explain your choices.

Assessment

Assessment will be based on the maps and any diagrams included, and an overall evaluation of your explanation of the analysis carried out and the discussion of what it shows. Pay equal attention to the quality of any maps or figures you include, and to the written discussion.

Submission

Prepare a PDF document that meets the requirements set out above. The whole document should only need to be 2-3 pages (maximum). Submit the PDF file to the dropbox provided on Nuku. The due date is the end of the day indicated on the course schedule page.