Overview
The main focus of this week’s assignment is building a simple statistical (regression) model to help us understand a spatial pattern.
We will also look at the RMarkdown file format, which you can use to present results from analysis in this or other assignments (this is not required, but can be a very effective and convenient way of presenting analysis results).
Because the lab material is presented in RMarkdown format in `.Rmd` files, we will consider that first, before going on to the assignment proper.
This means that the procedure this week is slightly different. You should download all the materials for this week’s lab from this link. You should then uncompress them to a folder on your computer, and set up a project in that folder as usual. Then, inside the project, open up the file `rmarkdown.Rmd`.
That file explains itself, and you should spend a little bit of time with it to get used to the idea of RMarkdown.
When you are done, you can come back to these lab instructions in the usual way, or you can follow the instructions in the `07-lab-instructions.Rmd` file instead (the instructions are the same in that document as in this one).
Building a simple statistical model
In this assignment you will build a simple regression model of the Airbnb listings in and around Wellington that we assembled a few weeks ago. The model will aim to account for variation in the numbers of listings with respect to the age structure of the population (from census) and relative to the numbers of various ‘amenities’ such as cafés, retail, and so on.
Libraries
Before you start, as usual we need some libraries.
library(sf)
library(tmap)
library(dplyr)
tmap_mode("view")
The data
Provided you have unpacked this week’s materials to an accessible folder and opened a `.Rproj` file in the usual way, you should find the datasets by simply running the commands shown below. If that doesn’t seem to work, then download the data from the links provided in the section below. You should find all the data in a subfolder called `data`, if you unpacked the zip file correctly.
Base data
The base data are in this file. Open them with `st_read`:
welly <- st_read("data/wellington-base-data.gpkg")
and take a look with a `plot` command:
plot(welly, lwd = 0.5, pal = RColorBrewer::brewer.pal(9, "Reds"))
I’ve used the `pal` option here to get a nicer colour palette than the `plot` command default.
If you really want to get a feel for the distribution of different variables, then you should make some `tmap` maps of individual attributes. If you make web maps with `tmap_mode("view")` you will also be able to get a closer look at things.
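For example, something like this maps a single attribute (any attribute name from the table below will work in place of `dist_cbd`):

tm_shape(welly) +
  tm_polygons(col = "dist_cbd")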
The attributes in this base data set are:

| Attribute | Description |
|---|---|
| `sa2_id` | Statistical Area 2 (SA2) ID |
| `sa2_name` | Statistical Area 2 name |
| `ta_name` | Territorial Authority name, limited to just Wellington City and Lower Hutt City (this is a smaller area than we originally looked at, to allow easier mapping) |
| `pop` | Total population of the SA2 per the 2018 Census |
| `u15` | % of population under 15 |
| `a15_29` | % of population aged from 15 to 29 |
| `a30_64` | % of population aged from 30 to 64 |
| `o65` | % of population aged 65 and over |
| `dist_cbd` | Approximate distance in km to the CBD, measured in a straight line, not over the road network |
Airbnb locations
We already saw this dataset recording all the Airbnb locations across the wider region.
abb <- st_read("data/abb.gpkg")
tm_shape(abb) +
tm_dots()
Amenities
This dataset has the locations of a range of services, as recorded in OpenStreetMap (OSM).
amenities <- st_read("data/amenities.gpkg")
tm_shape(amenities) +
tm_dots(col = "amenity")
The type of amenity is stored in the `amenity` attribute. This is a reduced set of all the available amenities in OSM, because the raw data include things like bike stands and a wide array of things that only show up once or maybe twice across the region.
Retail
Finally we have data on shops.
shops <- st_read("data/shops.gpkg")
tm_shape(shops) +
tm_dots(col = "shop")
## Warning: Number of levels of the variable "shop" is 115, which is larger than
## max.categories (which is 30), so levels are combined. Set
## tmap_options(max.categories = 115) in the layer function to show all levels.
In this dataset the type of shop is recorded in the `shop` attribute. You’ll see there is a wide array of shop types, including some surprising classifications such as “military_surplus”.
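A quick way to see what shop types are present, and how common each one is (a sketch using base R’s `table()` function):

# tally the shop types, most common first
sort(table(shops$shop), decreasing = TRUE)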
Note
The Airbnb data, amenities, and retail data were current in 2021. Things change pretty quickly, so it’s possible the situation on the ground has moved on from these data. The census data are from 2018. In a more realistic study we would probably work a bit harder to make sure the timings of all the data lined up more accurately!
Question 1
Based on making and exploring some maps, which variables in the `welly` base dataset, and which types of point locations in the `amenities` and `shops` datasets, do you think are most likely to help account for the number of Airbnb listings across the region? There is no strictly correct answer to this question. The idea is to think about what you are doing before just blindly making a statistical model without thinking! (20%)
You will probably find it helpful to make other maps using `tmap` to answer this question. Also keep in mind the causal explanation we are thinking about here with respect to the age group information. Some of the people offering spaces on Airbnb will live in the area, but many will not. It may be that the age profile of a neighbourhood gives a general idea of the kind of place it is (sleepy suburban, hipster, retirement village, or whatever), and it is that general neighbourhood character that is more relevant than who the Airbnb landlords are!
Organising the data
The idea is to account for the numbers of Airbnb listings per Statistical Area 2 (SA2) using attributes in the base data `welly` and in the two point datasets `amenities` and `shops`.
To do this we need to count the points in the SA2 polygons. We saw how to do this a few weeks ago but, as a reminder (and so you know how to count shops and amenities too), here’s how to count the Airbnb listings:
n_abb_listings <- welly %>%
st_contains(abb) %>%
lengths()
n_abb_listings
## [1] 10 13 11 4 9 12 11 2 5 13 3 2 1 1 4 2 16 5 3 24 6 5 7 3 7
## [26] 7 2 1 3 8 5 1 1 5 0 2 22 3 3 5 0 18 0 2 12 2 6 7 6 10
## [51] 14 9 14 7 2 10 3 7 8 2 9 4 8 22 9 19 16 8 10 15 15 17 27 18 7
## [76] 14 17 20 44 10 10 31 6 47 15 46 46 60 20 21 29 12 17 9 63 25 16 14 8 15
## [101] 23 7 18 8 7 20 23 23 7 29 6 9 12 5 7 15 27 11 12 1 19 27
And this list of counts can then be added to the base data using a `mutate` operation:
welly <- welly %>%
mutate(n_abb = n_abb_listings)
And then we can map this:
tm_shape(welly) +
tm_polygons(col = "n_abb")
Similar operations can be applied to count the amenity and/or shop points.
When you make the model requested for this assignment, an important step is to decide which amenities and which shops in the respective datasets matter. You could include some, none, or all of them, each represented as a count of how many there are of the respective type in each SA2. For example, it seems likely that cafés and restaurants might be of interest to potential Airbnb renters, but hardware shops may be less relevant.
Adding the kinds of amenities and shops you think are relevant to the base dataset will involve filtering the point data down to only the types you want to keep, then counting them using an operation like the one above, then adding the counts into the base data. Remember that when you filter data, you need to assign the result of the filter to a new variable to retain the effect of applying the filter, before doing the point-in-polygon counting as above.
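A sketch of those steps is shown below. Note that `cafe` is assumed here to be one of the values in the `amenity` attribute, and `n_cafes` is just a made-up name for the new attribute; check the actual values in your data first.

# see what amenity types are present, and how often each occurs
table(amenities$amenity)

# keep only the cafes, assigning the result to a new variable
cafes <- amenities %>%
  filter(amenity == "cafe")

# count the cafes in each SA2 polygon, just as for the listings
n_cafes <- welly %>%
  st_contains(cafes) %>%
  lengths()

# and add the counts to the base data
welly <- welly %>%
  mutate(n_cafes = n_cafes)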
When you add new attributes to the dataset, I recommend that you save it to a new file so you don’t lose the work, using the `st_write()` function.
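Something like this sketch would do it (the filename here is just an example; `delete_dsn = TRUE` lets `st_write()` overwrite a file that already exists):

# save the extended dataset to a new file
st_write(welly, "data/welly-with-counts.gpkg", delete_dsn = TRUE)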
Building a regression model
Because R is a platform for statistical computing, making regression models (or, more generally, linear models) is central to what it does. To demonstrate, I am going to walk you through making a model using all the census variables as independent variables to fit the `n_abb` variable. This is actually a bad idea, for reasons we get to a bit later.
The `lm` function does all the work; all it needs is the equation we want to fit, which we specify as shown below:
m.ages <- lm(n_abb ~ u15 + a15_29 + a30_64 + o65, data = welly)
Nothing seems to have happened, but a model has been made and assigned to the `m.ages` variable, and we can see what it looks like with the `summary` function:
summary(m.ages)
##
## Call:
## lm(formula = n_abb ~ u15 + a15_29 + a30_64 + o65, data = welly)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.149 -5.069 -1.149 4.806 36.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.7616 9.4396 -0.293 0.770382
## u15 -0.8040 0.2133 -3.769 0.000258 ***
## a15_29 0.3687 0.1159 3.182 0.001873 **
## a30_64 0.4775 0.1443 3.309 0.001244 **
## o65 -0.1602 0.1855 -0.863 0.389634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.499 on 117 degrees of freedom
## Multiple R-squared: 0.3558, Adjusted R-squared: 0.3337
## F-statistic: 16.15 on 4 and 117 DF, p-value: 1.471e-10
Recall that the regression model is effectively an equation predicting the dependent variable. In this case the equation is
\[ \mathrm{n\_abb}=\beta_0+\beta_1\mathrm{u15}+\beta_2\mathrm{a15\_29}+\beta_3\mathrm{a30\_64}+\beta_4\mathrm{o65}+\epsilon \]
The \(\beta\) values are the model coefficients, and \(\epsilon\) is an error term (the model is approximate, so includes errors). In the summary view, the model coefficients are shown in the `Coefficients` table in the `Estimate` column. The `***` designations in this table tell us which of the independent variables are most statistically significant. In this case it seems like `u15` has that honour, but two of the other age group variables are also significant, albeit less so. Information about the model error is included below the table.
The sign (positive or negative) of the coefficients in the `Estimate` column tells us the sense of the relationship:
- a positive sign means that when that attribute increases, the dependent variable also increases
- a negative sign means the opposite: where that attribute is higher, the dependent variable tends to be lower
The values of the coefficients also tell us how much change to expect in the dependent variable for each unit change in the independent variable. For example, for every one point increase in the percentage of population under 15, the model says we expect about 0.8 fewer listings in an SA2. This means that on average an SA2 with (say) 10% aged under 15 would have 4 fewer listings than an SA2 with 5% aged under 15, because it is 5 points higher and \(5\times-0.8040=-4.02\).
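You can check this kind of arithmetic directly from the fitted model, since the `coef()` function returns the estimated coefficients by name:

# expected change in listings for a 5 point increase in u15
coef(m.ages)["u15"] * 5  # approximately -4.02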
Question 2
Write a brief interpretation of this model describing in words what it seems to tell us about the effect of different age group percentages on the numbers of Airbnb listings in neighbourhoods. (20%)
Residual mapping
Residuals are the model errors—the variation in the
dependent variable that the model does not account for. Mapping
residuals can be informative. Model residuals are available from the
model we made, m.ages
, and can be added to the spatial data
as a new attribute to be mapped:
welly <- welly %>%
mutate(residual = m.ages$residuals)
tm_shape(welly) +
tm_polygons(col = 'residual', palette = 'RdBu', alpha = 0.65)
## Variable(s) "residual" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Where the residuals are high (i.e., positive), the model is underestimating the number of listings: there are more listings than we might reasonably expect. Where the residuals are low (i.e., negative), the model is overestimating the number of listings: there are fewer listings than might be expected.
Across most of the area the model is not doing terribly, but there are a couple of places where it is badly out. The next question asks you to examine these places a bit more closely, using the web map view, for contextual information that might explain the weakness of the model in those places. The best basemap for that contextual information is probably the OpenStreetMap layer.
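In view mode you can pick a basemap from the layers control, or request one explicitly with `tm_basemap()`, as in this sketch:

# residuals over an OpenStreetMap basemap (in view mode)
tm_basemap("OpenStreetMap") +
  tm_shape(welly) +
  tm_polygons(col = "residual", palette = "RdBu", alpha = 0.65)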
Question 3
Where does the model do particularly badly? Briefly speculate on what other factors missing from this particular model (which uses only age data) might explain this. (25%)
Think also about what kinds of places the SA2 areas are where the model does poorly.
Challenges and interpretations
One of the many difficulties with this model (which wasn’t made with too much thought) is the problem of multicollinearity, which means that related variables can mask each other out. Here, because the different age group percentages must sum to 100, they are not independent of one another. This can lead to instability in the model coefficients: for example, if we remove one of the age groups, unexpected changes happen in the model. To see this, here’s a model that leaves out the percentage aged under 15 variable:
m.no_children <- lm(n_abb ~ a15_29 + a30_64 + o65, data = welly)
summary(m.no_children)
##
## Call:
## lm(formula = n_abb ~ a15_29 + a30_64 + o65, data = welly)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.106 -6.758 -2.404 4.994 40.271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.19805 9.09773 -1.890 0.0612 .
## a15_29 0.61657 0.10057 6.131 1.19e-08 ***
## a30_64 0.31892 0.14556 2.191 0.0304 *
## o65 0.02764 0.18843 0.147 0.8836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.02 on 118 degrees of freedom
## Multiple R-squared: 0.2775, Adjusted R-squared: 0.2592
## F-statistic: 15.11 on 3 and 118 DF, p-value: 2.195e-08
This model now implies that every percentage point increase in any of the three included age groups will increase the number of Airbnb listings, and also that the 15 to 29 year old age group has the largest effect, whereas before the effect of the 30 to 64 year old age group was greater.
This demonstrates how complicated the interrelationships among many attributes in a dataset can be, and how if we control for one factor, it can change the apparent effect of other factors.
It also strongly suggests that the original model including all age groups is not very reliable! Keep this in mind as you build your own model to complete this assignment.
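If you want to see how interrelated the age variables are, one quick check (a sketch, not a required part of the assignment) is their correlation matrix:

# correlations among the age structure variables
welly %>%
  st_drop_geometry() %>%
  select(u15, a15_29, a30_64, o65) %>%
  cor()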
Finally: build your own model
With the data available, and based on all you have seen so far, make your own model.
You should consider including any of the other variables provided, such as total population (`pop`), the distance to the centre (`dist_cbd`), and (after filtering and counting them) any of the various amenity types or shop types.
You can also experiment with the `step()` function discussed in lecture here.
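A minimal sketch of how `step()` can be used, assuming you have already added some count attributes to `welly` (the names `n_cafes` and `n_restaurants` here are hypothetical, so substitute whatever counts you actually created):

# start from a model with several candidate variables...
m.full <- lm(n_abb ~ pop + dist_cbd + u15 + a15_29 + n_cafes + n_restaurants,
             data = welly)

# ...and let step() drop terms stepwise to minimise AIC
m.step <- step(m.full)
summary(m.step)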
When you have made a model you are happy with, answer the following question.
Question 4
For the model you made, include the code used to generate it and the output from the `summary` function. Also make a residuals map of your model. Briefly explain why you chose to build the model you did. What influenced your choice of variables to include? Explain what your model seems to show, based on the results provided by the `summary` function. (35%)
Assignment summary
Here are all the questions from the assignment in one location for convenience. Click on the headings to see them in their original context.
Question 1
Based on making and exploring some maps, which variables in the `welly` base dataset, and which types of point locations in the `amenities` and `shops` datasets, do you think are most likely to help account for the number of Airbnb listings across the region? There is no strictly correct answer to this question. The idea is to think about what you are doing before just blindly making a statistical model without thinking! (20%)
Question 2
Write a brief interpretation of this model describing in words what it seems to tell us about the effect of different age group percentages on the numbers of Airbnb listings in neighbourhoods. (20%)
Question 3
Where does the model do particularly badly? Briefly speculate on what other factors missing from this particular model (which uses only age data) might explain this. (25%)
Question 4
For the model you made, include the code used to generate it and the output from the `summary` function. Also make a residuals map of your model. Briefly explain why you chose to build the model you did. What influenced your choice of variables to include? Explain what your model seems to show, based on the results provided by the `summary` function. (35%)
Submission instructions
This week’s assignment requirement is set out in the questions above.
You should prepare a document completing the analysis and answering the questions, and submit a short report to the dropbox provided on Nuku by the end of the day indicated on the course schedule page.
You can prepare your report using RMarkdown and knitting the report (this is explained in the first part of this week’s materials), or you can proceed as usual, preparing your report in Word, including figures and tables as you usually would, and saving out to a PDF. Either is acceptable, but I would encourage you to experiment with knitting an RMarkdown file to understand the possibilities!