Overview
The lecture material related to this lab is available here:
Meanwhile, here is a quick summary…
Clustering methods segment the observations in a dataset into clusters or classes based on the differences and similarities between them. The idea is to break the dataset into classes such that observations in the same class are similar to each other, and different from observations in other classes.
Unfortunately, there is no easy way to define clusters beyond recognising that they are the groups of observations identified by a clustering method! Clustering analysis therefore depends a great deal on the interpretation of an analyst to give it meaning.
What do we mean by ‘similar’ and ‘different’? We extend the basic idea of distance in (two-dimensional) space, where the distance between observation i and observation j is given by
\[d_{ij}=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}\]
that is, the square root of the sum of the squared differences in each coordinate. But now we apply this measure to many data dimensions. So if we have a dataset with (say) 15 attributes, we are in a 15-dimensional data space, and for every pair of observations we take the squared difference in each of the 15 dimensions (i.e. on each variable), add them together, and take the square root.
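Written out in general, for a dataset with m attributes this becomes

\[d_{ij}=\sqrt{\sum_{k=1}^{m}(x_{ik}-x_{jk})^2}\]

where \(x_{ik}\) is the value of the kth attribute for observation i.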
Other versions of the basic idea of ‘total difference’ in attribute values are possible (for example, summing the absolute differences rather than the squared differences).
An important consideration is that all the attributes should be rescaled, so that differences in one particular attribute which happens to have large values associated with it don’t ‘drown out’ differences in other variables. For example, if one variable is mean income in dollars and has values like 25000 or 50000, while another variable is proportion of children under 5 and has values like 0.04 or 0.05, then if we do not rescale things, the differences in income will swamp any differences in the demographic mix.
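For example, a simple ‘min-max’ rescaling maps every variable onto the range 0 to 1, so that incomes in dollars and demographic proportions contribute on an equal footing. Here is a minimal sketch in R (the `rescale01` helper is just for illustration, not a function we use later in the lab):

# rescale a numeric vector to the range 0 to 1 (illustrative only)
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

rescale01(c(25000, 50000, 37500))   # 0.0 1.0 0.5
rescale01(c(0.04, 0.05, 0.045))     # 0.0 1.0 0.5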
A similar concern is that we should try not to include lots of strongly correlated variables in the analysis, since these effectively count the same underlying differences more than once.
Libraries first…
As usual, we need some libraries.
library(sf)
library(tmap)
library(dplyr)
Example data
I have prepared an example dataset so we can focus on the clustering itself. For the assignment proper, you are also asked to consider doing some data preparation to select variables and perhaps rescale data values before applying cluster analysis.
The example data are a collection of demographic variables for San Francisco, California from the 2010 census. You can explore the dataset interactively at this website to get a feel for things.
Download the data from this link, and open them in R:
sanfran <- st_read("sf_demo.gpkg")
## Reading layer `sf_demo' from data source
## `/Users/osullid3/Documents/teaching/Geog315/_labs/week-05/sf_demo.gpkg'
## using driver `GPKG'
## Simple feature collection with 189 features and 25 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.5145 ymin: 37.70813 xmax: -122.3679 ymax: 37.81144
## Geodetic CRS: WGS 84
If you run `summary(sanfran)` you’ll see that all these variables have been scaled so that the values range from 0 to 1. We won’t worry for the sake of the example exactly how this has been done for this dataset, but it’s something we need to pay attention to for the assignment dataset.
Make a data-only copy
For some of what follows the existence of the geometry attribute causes errors, so it is helpful to make a data-only copy. It is convenient to designate this with `.d` on the name. Because it is a direct copy, the rows of the data table stay in the same order, which is important later when we do the cluster analysis.
sanfran.d <- sanfran %>%
  st_drop_geometry() %>%
  select(-id)   # also remove the id variable since it is just an identifier
Note that we remove the `id` variable because it is a meaningless identifier and should not be included in the clustering analysis.
Getting a feel for the data
This is a complicated dataset, with 24 attributes. We’re likely to be interested in lots of possible relationships and patterns in the data, but it’s difficult to get a handle on things with so much going on. In this section I’ll show you a few possibilities, which you may also wish to try with the assignment data later. We don’t really have the time to cover what is going on with all these plotting methods in detail in this course (but I am happy to answer any questions you may have… post them in the slack workspace so everyone can benefit from the explanations, or if they are very specific, just ask).
Univariate maps
The `plot()` function will make small maps of 9 or 10 of the variables. If you’d like it to map variables other than the first 9 or 10, then use `select`, as shown below.
sanfran %>%
plot()
## Warning: plotting the first 9 out of 25 attributes; use max.plot = 25 to plot
## all
Note that the colours are a rainbow palette with purple/blue for low values through to orange/yellow for high. You can change the colour palette if you like, but it’s a bit messy, unfortunately… the `plot()` function is just for a quick look-see, not really for final maps.
sanfran %>%
plot(pal = RColorBrewer::brewer.pal(9, "Reds"))
## Warning: plotting the first 9 out of 25 attributes; use max.plot = 25 to plot
## all
Keep in mind, the `plot()` function is only showing you the first 9 variables; there are 16 more! If any variable is of particular interest, map it with `tmap` in the usual way.
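For example (using one of the variables in this dataset; any other will work the same way):

tm_shape(sanfran) +
  tm_polygons(col = "PCownerOccUnits", palette = "Blues") +
  tm_legend(legend.outside = TRUE)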
In the short sections that follow, I show some ways to examine all the data, using packages we haven’t worked with before. There isn’t really scope in this course to fully explain these operations, the focus is simply on getting an idea of the data.
Boxplots of all the variables
To run the code in this section you need to install and load the `ggplot2` and `tidyr` packages.
require(ggplot2)
## Loading required package: ggplot2
require(tidyr)
## Loading required package: tidyr
We can use the scatter plot and boxplot functionality in these packages to get an overview of all the variables in a single graphic. It’s unfortunately quite a complicated procedure, because it involves transforming the data to a ‘long’ format, where each row is the observation id, a variable name, and the corresponding value of that variable. That’s what `pivot_longer()` is doing.
The code below will show each variable as a row of dots. To change to boxplots, change the `geom_point` to `geom_boxplot` and run it again. It’s best just to copy and paste this code and run it to see the result. If you’d like to understand better how it works, just ask me!
sanfran %>%
  st_drop_geometry() %>%
  pivot_longer(-id) %>%
  ggplot() +
    geom_point(aes(x = value, y = name)) +
    xlab("Value") +
    ylab("Variable")
Scatterplots of all the variables
With 24 variables a full scatterplot matrix is impractical, but we can easily do subsets. Note that we just use `plot()`, but this time on the data-only `.d` version of the dataset that we made (otherwise it will make maps!).
sanfran.d %>%
  select(1:6) %>%
  plot()
Here, since we are only using this dataset to illustrate how to perform clustering analysis and aren’t interested in its structure as such, we will not spend any more time on this data exploration. In a more realistic situation you would probably want to do so.
Making k-means clusters
Our chosen approach is k-means clustering. The method (or algorithm) is simple:
1. Decide on the number of clusters you want, call this k
2. Choose k cluster centres
3. Assign each observation to its nearest cluster centre
4. Calculate the mean centre of each cluster and move the cluster centre accordingly
5. Go back to 3 and repeat, until the cluster assignments stop changing
Here’s an illustration of this working to help with following the description above.
It’s important to realise that k-means clustering is non-deterministic, because the choice of initial cluster centres is usually random, and may affect the final assignment arrived at.
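To make the numbered steps concrete, here is a bare-bones sketch of the algorithm in R. This is purely for illustration (it ignores awkward cases such as a cluster ending up empty, and the function name `my_kmeans` is made up); in practice we use the built-in `kmeans()` function:

# a minimal k-means, following the numbered steps above (illustration only)
my_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  centres <- data[sample(nrow(data), k), , drop = FALSE]   # step 2: random centres
  assignment <- rep(0, nrow(data))
  for (i in 1:max_iter) {
    # step 3: distances from every observation (rows) to every centre (columns)
    d <- as.matrix(dist(rbind(centres, data)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    if (all(new_assignment == assignment)) break   # step 5: stop when stable
    assignment <- new_assignment
    for (j in 1:k) {   # step 4: move each centre to the mean of its cluster
      centres[j, ] <- colMeans(data[assignment == j, , drop = FALSE])
    }
  }
  assignment
}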
So here is how we accomplish this in R.
km <- kmeans(sanfran.d, 5)
That’s it! All the results for the chosen number of clusters (specified as 5 in the above example) are now contained in the object `km`. Take a look at it by selecting it from the Environment tab, or just type `km` at the console:
km
## K-means clustering with 5 clusters of sizes 48, 22, 62, 12, 45
##
## Cluster means:
## density medianYearBuilt PConeUnit PCownerOccUnits PCcommutingNotCar
## 1 0.15830106 0.04468599 0.18707571 0.32748458 0.4790187
## 2 0.09007924 0.10737813 0.72935782 0.68007831 0.3336848
## 3 0.19528343 0.17554932 0.14235860 0.21976965 0.5600920
## 4 0.48104157 0.05434783 0.02950681 0.05283106 0.7973429
## 5 0.13300482 0.13075684 0.77188985 0.60691491 0.3344236
## PCmovedWithinCounty PClessHighSchool PCsomeCollegeOrMore PCdoctorate
## 1 0.08551634 0.03061757 0.9138524 0.040493135
## 2 0.06816078 0.06704645 0.8384086 0.036165759
## 3 0.09556696 0.12284678 0.7514256 0.019881618
## 4 0.10131849 0.36386356 0.4415528 0.005213407
## 5 0.05662245 0.23142766 0.5470680 0.010613916
## PCmarriedCouple PCunmarriedSSCouple PCwithKids PCsexMale medianAge
## 1 0.2767255 0.03720100 0.1374825 0.5278012 0.3055436
## 2 0.4970033 0.03172039 0.2626231 0.5057769 0.4019715
## 3 0.2675947 0.01567749 0.1564369 0.5122189 0.2928211
## 4 0.2498533 0.01015470 0.1275014 0.5464857 0.4840792
## 5 0.4805850 0.01150318 0.3322158 0.4850445 0.3544846
## PCraceBlackAlone PCraceAsianAlone PCraceHispanic PCraceWhiteAlone
## 1 0.02549538 0.1397737 0.08916732 0.7634400
## 2 0.03497096 0.2815608 0.13394869 0.5748927
## 3 0.07499380 0.2881595 0.16565969 0.5103899
## 4 0.07050091 0.5468529 0.13516418 0.2803504
## 5 0.08158363 0.5122083 0.20453925 0.2646261
## PCforeignBorn perCapitaIncome PCunemployed PCpoorStruggling PCwithInterests
## 1 0.1935600 0.5226991 0.05419155 0.1490416 0.3770700
## 2 0.2730182 0.3586822 0.07080560 0.1623070 0.4451737
## 3 0.3436578 0.2588566 0.07624802 0.3423988 0.2475574
## 4 0.5844461 0.1055708 0.09731570 0.6554962 0.1101683
## 5 0.4886468 0.1426807 0.10696367 0.3196807 0.2290423
## PCveryWealthyHHolds
## 1 0.24636330
## 2 0.23189893
## 3 0.10770211
## 4 0.02656405
## 5 0.07736595
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 3 1 3 1 1 4 4 3 1 3 3 1 4 3 4 3 3 3 3 4
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 4 4 3 4 3 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 3 1 1 1 3 3 3 3 1 3 3 3 3 3 3 3 1 1 3 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 1 1 1 3 3 4 3 3 3 3 1 1 1 1 1 1 3 3 1
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 1 1 1 1 1 2 2 2 1 1 1 3 3 3 3 3 3 5 5 3
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 3 5 5 5 2 2 1 2 2 5 5 5 5 5 5 5 5 5 5 5
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 5 5 5 5 5 5 5 5 5 1 2 3 3 2 2 2 2 2 2 2
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 2 2 5 5 5 5 5 3 5 5 5 5 5 5 5 2 3 3 3
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 5 5 3 5 5 3 3 3 3 3 2 3 3 3 3 3 5 3 2 2
## 181 182 183 184 185 186 187 188 189
## 1 3 5 3 5 4 5 1 3
##
## Within cluster sum of squares by cluster:
## [1] 6.506995 4.074232 18.284928 3.586401 8.389699
## (between_SS / total_SS = 60.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Most of this is ‘too much information’. The vital information, which observation is in which cluster, is contained in the `cluster` component, which we can extract and add as a new variable to our dataset like this:
sanfran <- sanfran %>%
  mutate(k5 = as.factor(km$cluster))
Note how I have given this new variable a name that reflects the number of clusters, and also that I have added it to the spatial dataset. Also, I have added it as a factor, which means R will know it is a categorical variable and not a meaningful number.
That means we can map it (your map will look different!)
tm_shape(sanfran) +
  tm_polygons(col = 'k5') +
  tm_legend(legend.outside = TRUE)
Note how `tmap` knows to use a categorical palette because `k5` is a factor.
Notes
You can run the above code again and you will probably end up with a different (if similar) map. You can also run it specifying a different number of clusters in the `kmeans()` function. If you do this you should make sure you give the new variable in the `mutate` operation a different name, reflecting the number of clusters.
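A small aside: because the initial cluster centres are chosen at random, you can make a particular run reproducible by fixing R’s random number seed before calling `kmeans()` (any fixed value will do):

set.seed(42)   # any fixed number makes repeated runs give the same result
km <- kmeans(sanfran.d, 5)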
The fact that the results may not always be the same means that interpretation of clusters can be quite challenging.
Interpretation of clusters
The ‘quality’ of a particular clustering solution depends on how well we think we can interpret it. In this case, for an unfamiliar setting, you probably can’t do much. Measures of the variance within and between clusters can be used to assess ‘quality’ in a more technical sense, and are available from the `kmeans` object produced by the function.
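For example, several of these measures can be pulled directly out of the `km` object we made earlier:

km$withinss               # within-cluster sum of squares, one value per cluster
km$tot.withinss           # total within-cluster sum of squares
km$betweenss              # between-cluster sum of squares
km$betweenss / km$totss   # proportion of total variance 'explained' by the clustering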
Here is some code you can use to compare a specific variable across clusters (again, your plot will look different)
boxplot(PCwithKids ~ k5, data = sanfran)
You have to use this code one variable at a time. You can get an overview of which variables are particularly high or low in each cluster from the `centers` component of the `kmeans()` results:
km$centers
## density medianYearBuilt PConeUnit PCownerOccUnits PCcommutingNotCar
## 1 0.15830106 0.04468599 0.18707571 0.32748458 0.4790187
## 2 0.09007924 0.10737813 0.72935782 0.68007831 0.3336848
## 3 0.19528343 0.17554932 0.14235860 0.21976965 0.5600920
## 4 0.48104157 0.05434783 0.02950681 0.05283106 0.7973429
## 5 0.13300482 0.13075684 0.77188985 0.60691491 0.3344236
## PCmovedWithinCounty PClessHighSchool PCsomeCollegeOrMore PCdoctorate
## 1 0.08551634 0.03061757 0.9138524 0.040493135
## 2 0.06816078 0.06704645 0.8384086 0.036165759
## 3 0.09556696 0.12284678 0.7514256 0.019881618
## 4 0.10131849 0.36386356 0.4415528 0.005213407
## 5 0.05662245 0.23142766 0.5470680 0.010613916
## PCmarriedCouple PCunmarriedSSCouple PCwithKids PCsexMale medianAge
## 1 0.2767255 0.03720100 0.1374825 0.5278012 0.3055436
## 2 0.4970033 0.03172039 0.2626231 0.5057769 0.4019715
## 3 0.2675947 0.01567749 0.1564369 0.5122189 0.2928211
## 4 0.2498533 0.01015470 0.1275014 0.5464857 0.4840792
## 5 0.4805850 0.01150318 0.3322158 0.4850445 0.3544846
## PCraceBlackAlone PCraceAsianAlone PCraceHispanic PCraceWhiteAlone
## 1 0.02549538 0.1397737 0.08916732 0.7634400
## 2 0.03497096 0.2815608 0.13394869 0.5748927
## 3 0.07499380 0.2881595 0.16565969 0.5103899
## 4 0.07050091 0.5468529 0.13516418 0.2803504
## 5 0.08158363 0.5122083 0.20453925 0.2646261
## PCforeignBorn perCapitaIncome PCunemployed PCpoorStruggling PCwithInterests
## 1 0.1935600 0.5226991 0.05419155 0.1490416 0.3770700
## 2 0.2730182 0.3586822 0.07080560 0.1623070 0.4451737
## 3 0.3436578 0.2588566 0.07624802 0.3423988 0.2475574
## 4 0.5844461 0.1055708 0.09731570 0.6554962 0.1101683
## 5 0.4886468 0.1426807 0.10696367 0.3196807 0.2290423
## PCveryWealthyHHolds
## 1 0.24636330
## 2 0.23189893
## 3 0.10770211
## 4 0.02656405
## 5 0.07736595
This is quite difficult to interpret (all those numbers) but scanning the information should enable you to spot which cluster has particularly high or low values of particular variables.
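One small trick that makes the table easier to scan is to round the values:

round(km$centers, 2)   # cluster means to 2 decimal places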
For a more visual presentation of all the information we can use some more complicated plotting code, as shown below.
sanfran %>%
  select(-id) %>%
  st_drop_geometry() %>%
  pivot_longer(-k5) %>%
  ggplot() +
    geom_point(aes(x = value, y = name, colour = k5)) +
    facet_wrap(~ k5, nrow = 1)
Here, each column of the plot shows all the values for each variable in a single cluster. Scanning across the data you might be able to identify differences among the clusters, or variables that are particularly high or low in one cluster relative to their values in others.
The assignment data: Auckland demographic data, 2018
OK… you will be relieved to know that the assignment is based on a dataset you probably know a bit more about, namely data from the 2018 census concerning general population characteristics across Auckland. The definition of Auckland used is non-standard, focused on the ‘old’ Auckland cities of Waitakere, North Shore City, Manukau City and Auckland City ‘proper’ rather than on the extensive surrounding rural areas included in the ‘super city’.
Download the data from this link and save it to your project folder.
Load the data, and make a data-only copy
ak_demo <- st_read("ak-demographics-2018.gpkg")
## Reading layer `ak-demographics-2018' from data source
## `/Users/osullid3/Documents/teaching/Geog315/_labs/week-05/ak-demographics-2018.gpkg'
## using driver `GPKG'
## Simple feature collection with 472 features and 39 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1728669 ymin: 5887247 xmax: 1803822 ymax: 5971022
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
ak_demo.d <- ak_demo %>%
  st_drop_geometry() %>%
  select(-(sa2_id:pop))   # remove the general variables
These data were compiled from various Census data tables, using the NZ.Stat tool at this link. You can get an overview of the distribution of the different attributes included in the data from this basic webmap.
Specifically the variables in this dataset are as follows, and unless otherwise stated are expressed as percentages (%) of the population responding to that question:
Group | Variable name | Meaning
---|---|---
General | `sa2_id` | Statistical area 2 ID
 | `sa2_name` | Statistical area 2 name
 | `ta_name` | Territorial authority name (these are all ‘Auckland’)
 | `pop` | Total usually resident population
Age | `under15` | Aged under 15
 | `a_15to29years` | Aged 15-29
 | `a_30to64years` | Aged 30-64
 | `over65` | Aged 65 or over
Ethnicity | `pakeha` | Pākeha
 | `maori` | Māori
 | `pacific` | Pacific peoples
 | `asian` | Asian (includes East and South Asian)
 | `melaa` | Middle Eastern, Latin American and African
 | `other_ethnicity` | Other ethnicities
Migration | `ur5_same` | Usually resident 5 years ago at the same location
 | `ur5_in NZ` | Usually resident 5 years ago elsewhere in New Zealand
 | `ur5_overseas` | Usually resident 5 years ago overseas
Income | `median_income` | Median household income (1000s of $)
 | `under20k` | Household income under $20,000
 | `i_20to30k` | Household income between $20,001 and $30,000
 | `i_30to50k` | Household income between $30,001 and $50,000
 | `i_50to70k` | Household income between $50,001 and $70,000
 | `i_70to100k` | Household income between $70,001 and $100,000
 | `i_100to150k` | Household income between $100,001 and $150,000
 | `over150k` | Household income over $150,000
Tenure | `renting` | % of households renting
Education | `bachelors` | Education to bachelors degree
 | `high_school` | Education level between high school and degree
 | `postgrad` | Education beyond bachelors degree
Religion | `christian` | Christian
 | `maori_religions` | Māori religions, beliefs and philosophies
 | `buddhism` | Buddhism
 | `hinduism` | Hinduism
 | `islam` | Islam
 | `judaism` | Judaism
 | `new_age` | Spiritualism and new age religions
 | `sikhism` | Sikhism
 | `jedi` | Jedi
 | `no_religion` | No religion
If you are unsure about any of this then ask.
Note that the attributes are provided as percentages. You may decide that you need to change variables to estimated counts (based on the `pop` attribute), or to further standardise the numbers so that each attribute is expressed relative to its overall distribution. A couple of possibilities are sketched below.
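For example, assuming the rows of `ak_demo.d` remain aligned with `ak_demo` (they do, since it is a direct copy), transformations along these lines are possible. Note that `median_income` is not a percentage, so it is left out of the count conversion, and the result names `ak_demo.counts` and `ak_demo.z` are just for illustration:

# convert the percentage variables to estimated counts using the pop attribute
ak_demo.counts <- ak_demo.d %>%
  mutate(across(-median_income, ~ round(.x / 100 * ak_demo$pop)))

# or standardise every variable relative to its own distribution (z-scores)
ak_demo.z <- ak_demo.d %>%
  mutate(across(everything(), ~ as.numeric(scale(.x))))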
I advise you to make some maps to get a feel for the data before attempting to do cluster analysis.
Assignment 3: Cluster analysis of demographic data
This assignment is quite open-ended. Here is the description:
Run a cluster analysis of the provided Auckland demographic data and map the resulting clusters. Provide a short write up explaining how you did the analysis (not the commands used, focus instead on the number of clusters, variables included or excluded). Explain what you think the results tell us about the overall geographical community structure of the Auckland Region. The overall write up should not be more than 2-3 pages (with this length including maps and diagrams if any).
Things to consider
Here are a few things to be thinking about while working on this assignment.
- How many clusters make sense? How many different kinds of ‘place’ are there in a region like Auckland? Think about how you expect the region to break down into different kinds of place based on the kinds of communities that live there. (One common heuristic for exploring the number of clusters is sketched after this list.)
- Try to provide verbal descriptions of the clusters you identify so that a reader can make sense of them without studying tables of data or complex charts.
- Keep in mind that clusters are a categorical data type, which has implications for the style to use when mapping them.
- Consider if you want to retain all the supplied variables in the analysis. Maybe some are redundant, or you can get a clearer result by removing a few (but not so many that it no longer makes sense to cluster the remaining variables).
- Consider if it is appropriate to rescale variables so they are expressed relative to their own distribution (e.g. 50 might be a high number for some variables, but not at all high for others). You don’t have to do this to get a good mark, but you should still think about it and explain your choices.
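On the number of clusters, one common heuristic (not required for this assignment) is an ‘elbow’ plot: run `kmeans()` for a range of k and plot the total within-cluster sum of squares, looking for a value of k beyond which adding clusters yields little improvement:

# total within-cluster sum of squares for k = 1 to 10
# nstart = 25 runs each k-means from 25 random starts and keeps the best
twss <- sapply(1:10, function(k) kmeans(ak_demo.d, k, nstart = 25)$tot.withinss)
plot(1:10, twss, type = "b",
     xlab = "Number of clusters, k",
     ylab = "Total within-cluster sum of squares")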
Assessment
Assessment will be based on the maps and any diagrams included, and an overall evaluation of your explanation of the analysis carried out and the discussion of what it shows. Pay equal attention to the quality of any maps or figures you include, and to the written discussion.
Submission
Prepare a PDF document that meets the requirements set out above. The whole document should only need to be 2-3 pages (maximum). Submit the PDF file to the dropbox provided on Nuku by the end of the day on the due date indicated on the course schedule page.