Overview
The main focus of this week’s assignment is building a simple statistical (regression) model to help us understand a spatial pattern.
We will also look at the RMarkdown file format, which you can use to present results from analysis in this or other assignments (this is not required, but can be a very effective and convenient way of presenting analysis results).
Because the lab material is presented in RMarkdown format in `.Rmd` files, we will consider that first, before going on to the assignment proper.
This means that the procedure this week is slightly different. You should download all the materials for this week’s lab from this link. You should then uncompress them to a folder on your computer, and set up a project in that folder as usual. Then, inside the project, open up the file `rmarkdown.Rmd`.
That file explains itself, and you should spend a little bit of time with it to get used to the idea of RMarkdown.
When you are done, you can come back to these lab instructions in the usual way, or you can follow the instructions in the `07-lab-instructions.Rmd` file instead (the instructions are the same in that document as in this one).
Building a simple statistical model
In this assignment you will build a simple regression model of the Airbnb listings in and around Wellington that we assembled a few weeks ago. The model will aim to account for variation in the numbers of listings with respect to the age structure of the population (from census) and relative to the numbers of various ‘amenities’ such as cafés, retail, and so on.
Libraries
Before you start, as usual we need some libraries.
library(sf)
library(tmap)
library(dplyr)
tmap_mode("view")
The data
Provided you have unpacked this week’s materials to an accessible folder and opened a `.Rproj` file in the usual way, you should find the datasets by simply running the commands shown below. If that doesn’t seem to work, then download the data from the links provided in the section below. You should find all the data in a subfolder called `data`, if you unpacked the zip file correctly.
Base data
The base data are in this file. Open them with `st_read`:
welly <- st_read("data/wellington-base-data.gpkg")
and take a look with a `plot` command:
plot(welly, lwd = 0.5, pal = RColorBrewer::brewer.pal(9, "Reds"))
I’ve used the `pal` option here to get a nicer colour palette than the `plot` command default.
If you really want to get a feel for the distribution of different variables, then you should make some `tmap` maps of individual attributes. If you make web maps with `tmap_mode("view")` you will also be able to get a closer look at things.
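For example, something like this maps a single attribute (any attribute name from the table below will work in place of `dist_cbd`):

tm_shape(welly) +
  tm_polygons(col = "dist_cbd")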
The attributes in this base data set are:

| Attribute | Description |
|---|---|
| `sa2_id` | Statistical Area 2 (SA2) ID |
| `sa2_name` | Statistical Area 2 name |
| `ta_name` | Territorial Authority name, limited to just Wellington City and Lower Hutt City (this is a smaller area than we originally looked at, to allow easier mapping) |
| `pop` | Total population of the SA2 per the 2018 Census |
| `u15` | % of population under 15 |
| `a15_29` | % of population aged from 15 to 29 |
| `a30_64` | % of population aged from 30 to 64 |
| `o65` | % of population aged 65 and over |
| `dist_cbd` | Approximate distance in km to the CBD, measured in a straight line, not over the road network |
Airbnb locations
We already saw this dataset recording all the Airbnb locations across the wider region.
abb <- st_read("data/abb.gpkg")
tm_shape(abb) +
tm_dots()
Amenities
This dataset has the locations of a range of services, as recorded in OpenStreetMap (OSM).
amenities <- st_read("data/amenities.gpkg")
tm_shape(amenities) +
tm_dots(col = "amenity")
The type of amenity is stored in the `amenity` attribute. This is a reduced set of all the available amenities in OSM, because the raw data include things like bike stands and a wide array of things that only show up once or maybe twice across the region.
Retail
Finally we have data on shops.
shops <- st_read("data/shops.gpkg")
tm_shape(shops) +
tm_dots(col = "shop")
## Warning: Number of levels of the variable "shop" is 115, which is larger than
## max.categories (which is 30), so levels are combined. Set
## tmap_options(max.categories = 115) in the layer function to show all levels.
In this dataset the type of shop is recorded in the `shop` attribute. You’ll see there is a wide array of shop types, including some surprising classifications such as “military_surplus”.
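A quick way to see what shop types are present, and how common each one is (a sketch using base R’s `table()` function):

# tally the shop types, most common first
sort(table(shops$shop), decreasing = TRUE)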
Note
The Airbnb data, amenities, and retail data were current in 2021. Things change pretty quickly, so it’s possible the situation on the ground has moved on from these data. The census data are from 2018. In a more realistic study we would probably work a bit harder to make sure the timings of all the data lined up more accurately!
Question 1
Based on making and exploring some maps, which variables in the `welly` base dataset, and which types of point locations in the `amenities` and `shops` datasets, do you think are most likely to help account for the number of Airbnb listings across the region? There is no strictly correct answer to this question. The idea is to think about what you are doing before just blindly making a statistical model without thinking! (20%)
You will probably find it helpful to make other maps using `tmap` to answer this question. Also keep in mind the causal explanation we are thinking about here with respect to the age group information. Some of the people offering spaces on Airbnb will live in the area, but many will not. It may be that the age profile of a neighbourhood gives a general idea of the kind of place it is (sleepy suburban, hipster, retirement village, or whatever), and it is that general neighbourhood character that is more relevant than who the Airbnb landlords are!
Organising the data
The idea is to account for the numbers of Airbnb listings per Statistical Area 2 (SA2) using attributes in the base data `welly` and in the two point datasets `amenities` and `shops`.
To do this we need to count the points in the SA2 polygons. We saw how to do this a few weeks ago but, as a reminder (and so you know how to count shops and amenities too), here’s how to count the Airbnb listings:
n_abb_listings <- welly %>%
st_contains(abb) %>%
lengths()
n_abb_listings
## [1] 10 13 11 4 9 12 11 2 5 13 3 2 1 1 4 2 16 5 3 24 6 5 7 3 7
## [26] 7 2 1 3 8 5 1 1 5 0 2 22 3 3 5 0 18 0 2 12 2 6 7 6 10
## [51] 14 9 14 7 2 10 3 7 8 2 9 4 8 22 9 19 16 8 10 15 15 17 27 18 7
## [76] 14 17 20 44 10 10 31 6 47 15 46 46 60 20 21 29 12 17 9 63 25 16 14 8 15
## [101] 23 7 18 8 7 20 23 23 7 29 6 9 12 5 7 15 27 11 12 1 19 27
And this list of counts can then be added to the base data using a `mutate` operation:
welly <- welly %>%
mutate(n_abb = n_abb_listings)
And then we can map this:
tm_shape(welly) +
tm_polygons(col = "n_abb")
Similar operations can be applied to count the amenity and/or shop points.
When you make the model requested for this assignment, an important step is to decide which amenities and which shops in the respective datasets matter. You could include some, none, or all of them, each represented as a count of how many there are of the respective type in each SA2. For example, it seems likely that cafés and restaurants might be of interest to potential Airbnb renters, but hardware shops may be less relevant.
Adding the kinds of amenities and shops you think are relevant to the base dataset will involve filtering the point data down to only the types you want to keep, then counting them using an operation like the one above, then adding the counts into the base data. Remember that when you filter data, you need to assign the result of the filter to a new variable to retain the effect of applying the filter, before doing the point-in-polygon counting as above.
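A sketch of those steps is shown below. Note that `cafe` is assumed here to be one of the values in the `amenity` attribute, and `n_cafes` is just a made-up name for the new attribute; check the actual values in your data first.

# see what amenity types are present, and how often each occurs
table(amenities$amenity)

# keep only the cafes, assigning the result to a new variable
cafes <- amenities %>%
  filter(amenity == "cafe")

# count the cafes in each SA2 polygon, just as for the listings
n_cafes <- welly %>%
  st_contains(cafes) %>%
  lengths()

# and add the counts to the base data
welly <- welly %>%
  mutate(n_cafes = n_cafes)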
When you add new attributes to the dataset, I recommend that you save it to a new file so you don’t lose the work, using the `st_write()` function.
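Something like this sketch would do it (the filename here is just an example; `delete_dsn = TRUE` lets `st_write()` overwrite a file that already exists):

# save the extended dataset to a new file
st_write(welly, "data/welly-with-counts.gpkg", delete_dsn = TRUE)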
Building a regression model
Because R is a platform for statistical computing, making regression models (or, more generally, linear models) is central to what it does. To demonstrate, I am going to walk you through making a model using all the census variables as independent variables to fit the `n_abb` variable. This is actually a bad idea, for reasons we get to a bit later.
The `lm` function does all the work; all it needs is the equation we want to fit, which we specify as shown below:
m.ages <- lm(n_abb ~ u15 + a15_29 + a30_64 + o65, data = welly)
Nothing seems to have happened, but a model has been made and assigned to the `m.ages` variable, and we can see what it looks like with the `summary` function:
summary(m.ages)
##
## Call:
## lm(formula = n_abb ~ u15 + a15_29 + a30_64 + o65, data = welly)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.149 -5.069 -1.149 4.806 36.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.7616 9.4396 -0.293 0.770382
## u15 -0.8040 0.2133 -3.769 0.000258 ***
## a15_29 0.3687 0.1159 3.182 0.001873 **
## a30_64 0.4775 0.1443 3.309 0.001244 **
## o65 -0.1602 0.1855 -0.863 0.389634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.499 on 117 degrees of freedom
## Multiple R-squared: 0.3558, Adjusted R-squared: 0.3337
## F-statistic: 16.15 on 4 and 117 DF, p-value: 1.471e-10
Recall that the regression model is effectively an equation predicting the dependent variable. In this case the equation is
\[ \mathrm{n\_abb}=\beta_0+\beta_1\mathrm{u15}+\beta_2\mathrm{a15\_29}+\beta_3\mathrm{a30\_64}+\beta_4\mathrm{o65}+\epsilon \]
The \(\beta\) values are the model coefficients, and \(\epsilon\) is an error term (the model is approximate, so includes errors). In the summary view, the model coefficients are shown in the `Coefficients` table in the `Estimate` column. The `***` designations in this table tell us which of the independent variables are most statistically significant. In this case it seems like `u15` has that honour, but two of the other age group variables are also significant, albeit less so. Information about the model error is included below the table.
The sign (positive or negative) of the coefficients in the `Estimate` column tells us the sense of the relationship:
- a positive sign means that when that attribute increases, the dependent variable also increases
- a negative sign means the opposite: where that attribute is higher, the dependent variable tends to be lower
The values of the coefficients also tell us how much change to expect in the dependent variable for each unit change in the independent variable. For example, for every one point increase in the percentage of population under 15, the model says we expect about 0.8 fewer listings in an SA2. This means that on average an SA2 with (say) 10% aged under 15 would have 4 fewer listings than an SA2 with 5% aged under 15, because it is 5 points higher and \(5\times-0.8040=-4.02\).
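You can check this kind of arithmetic directly from the fitted model, since the `coef()` function returns the estimated coefficients by name:

# expected change in listings for a 5 point increase in u15
coef(m.ages)["u15"] * 5  # approximately -4.02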
Question 2
Write a brief interpretation of this model describing in words what it seems to tell us about the effect of different age group percentages on the numbers of Airbnb listings in neighbourhoods. (20%)
Residual mapping
Residuals are the model errors—the variation in the
dependent variable that the model does not account for. Mapping
residuals can be informative. Model residuals are available from the
model we made, m.ages
, and can be added to the spatial data
as a new attribute to be mapped:
welly <- welly %>%
mutate(residual = m.ages$residuals)
tm_shape(welly) +
tm_polygons(col = 'residual', palette = 'RdBu', alpha = 0.65)
## Variable(s) "residual" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Where the residuals are high (i.e., positive), the model is underestimating the number of listings: there are more listings than we might reasonably expect. Where the residuals are low (i.e., negative), the model is overestimating the number of listings: there are fewer listings than might be expected.
Across most of the area the model is not doing terribly, but there are a couple of places where it is badly out. The next question asks you to examine these places a bit more closely, using the web map view, for contextual information that might explain the weakness of the model in those places. The best basemap for that contextual information is probably the OpenStreetMap layer.
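In view mode you can pick a basemap from the layers control, or request one explicitly with `tm_basemap()`, as in this sketch:

# residuals over an OpenStreetMap basemap (in view mode)
tm_basemap("OpenStreetMap") +
  tm_shape(welly) +
  tm_polygons(col = "residual", palette = "RdBu", alpha = 0.65)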
Question 3
Where does the model do particularly badly? Briefly speculate on what other factors missing from this particular model (which uses only age data) might explain this. (25%)
Think also about what kinds of places the SA2 areas are where the model does poorly.
Challenges and interpretations
One of the many difficulties with this model (which wasn’t made with too much thought) is the problem of multicollinearity, which means that related variables can mask each other out. Here, because the different age group percentages must sum to 100, they are not independent of one another. This can lead to instability in the model coefficients: for example, if we remove one of the age groups, unexpected changes happen in the model. To see this, here’s a model that leaves out the percentage aged under 15 variable:
m.no_children <- lm(n_abb ~ a15_29 + a30_64 + o65, data = welly)
summary(m.no_children)
##
## Call:
## lm(formula = n_abb ~ a15_29 + a30_64 + o65, data = welly)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.106 -6.758 -2.404 4.994 40.271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.19805 9.09773 -1.890 0.0612 .
## a15_29 0.61657 0.10057 6.131 1.19e-08 ***
## a30_64 0.31892 0.14556 2.191 0.0304 *
## o65 0.02764 0.18843 0.147 0.8836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.02 on 118 degrees of freedom
## Multiple R-squared: 0.2775, Adjusted R-squared: 0.2592
## F-statistic: 15.11 on 3 and 118 DF, p-value: 2.195e-08
This model now implies that every percentage point increase in any of the three included age groups will increase the number of Airbnb listings, and also that the 15 to 29 year old age group has the largest effect, whereas before the effect of the 30 to 64 year old age group was greater.
This demonstrates how complicated the interrelationships among many attributes in a dataset can be, and how if we control for one factor, it can change the apparent effect of other factors.
It also strongly suggests that the original model including all age groups is not very reliable! Keep this in mind as you build your own model to complete this assignment.
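If you want to see how interrelated the age variables are, one quick check (a sketch, not a required part of the assignment) is their correlation matrix:

# correlations among the age structure variables
welly %>%
  st_drop_geometry() %>%
  select(u15, a15_29, a30_64, o65) %>%
  cor()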
Finally: build your own model
With the data available, and based on all you have seen so far, make your own model.
You should consider including any of the other variables provided, such as total population (`pop`), the distance to the centre (`dist_cbd`), and (after filtering and counting them) any of the various amenity types or shop types.
You can also experiment with the `step()` function discussed in lecture here.
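A minimal sketch of how `step()` can be used, assuming you have already added some count attributes to `welly` (the names `n_cafes` and `n_restaurants` here are hypothetical, so substitute whatever counts you actually created):

# start from a model with several candidate variables...
m.full <- lm(n_abb ~ pop + dist_cbd + u15 + a15_29 + n_cafes + n_restaurants,
             data = welly)

# ...and let step() drop terms stepwise to minimise AIC
m.step <- step(m.full)
summary(m.step)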
When you have made a model you are happy with, answer the following question.
Question 4
For the model you made, include the code used to generate it and the output from the `summary` function. Also make a residuals map of your model. Briefly explain why you chose to build the model you did. What influenced your choice of variables to include? Explain what your model seems to show, based on the results provided by the `summary` function. (35%)
Assignment summary
Here are all the questions from the assignment in one location for convenience. Click on the headings to see them in their original context.
Question 1
Based on making and exploring some maps, which variables in the `welly` base dataset, and which types of point locations in the `amenities` and `shops` datasets, do you think are most likely to help account for the number of Airbnb listings across the region? There is no strictly correct answer to this question. The idea is to think about what you are doing before just blindly making a statistical model without thinking! (20%)
Question 2
Write a brief interpretation of this model describing in words what it seems to tell us about the effect of different age group percentages on the numbers of Airbnb listings in neighbourhoods. (20%)
Question 3
Where does the model do particularly badly? Briefly speculate on what other factors missing from this particular model (which uses only age data) might explain this. (25%)
Question 4
For the model you made, include the code used to generate it and the output from the `summary` function. Also make a residuals map of your model. Briefly explain why you chose to build the model you did. What influenced your choice of variables to include? Explain what your model seems to show, based on the results provided by the `summary` function. (35%)
Submission instructions
This week’s assignment requirement is set out in the questions above.
You should prepare a document completing the analysis and answering the questions, and submit a short report to the dropbox provided on Nuku by the end of the day indicated on the course schedule page.
You can prepare your report using RMarkdown and knitting the report (this is explained in the first part of this week’s materials), or you can proceed as usual, preparing your report in Word, including figures and tables as you usually would, and saving out to a PDF. Either is acceptable, but I would encourage you to experiment with knitting an RMarkdown file to understand the possibilities!