Data
Before we get into it, download the data for this lab below:
- earthquakes.csv a table of data from the Ōtautahi Christchurch earthquakes
- nz.gpkg a basemap for Aotearoa New Zealand
These files should be saved to a folder you are working in. If you
are working on a lab machine, it needs to be a folder on the H:
drive where it won’t be lost after you log out. Wherever you
are working, I recommend you work in an appropriately named folder (something like, say, geog315-week02).
Introduction
This lab will introduce you to the statistical analysis and programming environment R, running in RStudio (which makes R a bit easier to deal with for most people). R has become one of the standard tools for statistical analysis, particularly in the academic research community, but increasingly also in commercial and other work settings. R is well suited to these settings for a number of reasons, particularly
- it is free [as in beer];
- it is easily extensible; and
- because of 1 and 2, many new methods of analysis first become available in packages contributed to the R ecosystem by researchers in the field.
The last point is why we are using R in this course. All required software is installed on the lab machines.
Like any good software, versions of R are available for MacOS, Windows and Linux so you can install a copy on your own computer and work on this lab in your own time—you don’t need to be at the timetabled lab sections to complete the assignment, although you will find it helpful to attend to get assistance from the instructors, and also from one another.
To get up and running on your own computer, you will need to download and install R itself, and also, optionally, (but HIGHLY recommended) install RStudio. Details of the installation were provided last week. If you are still having difficulties, just ask, and I’ll do what I can to help. Installation is pretty straightforward on all platforms.
When you are running R you will want a web connection to install any additional packages called for in lab instructions. You will also find it useful to have a reasonably high resolution display (an old 1024×768 display will not be much fun to work on, but high pixel density modern displays, such as 4K, can also be a bit painful, without tweaking the display settings).
DON’T PANIC!
This lab introduces R by just asking you to get on with it, without stopping to explain too much, at least not at first. This is because it’s probably better to just do things with R to get a feel for what it’s about, without thinking too hard about what is going on; kind of like learning to swim by jumping in at the deep end. You may sometimes feel like you are drowning. Try not to worry too much and stick with it, and bear in mind that the assignments do not assume you are some kind of R guru.
Ask questions, confer with your fellow students, consult Google (this cartoon is good advice).
An overview of RStudio
We’re using R inside a slightly friendlier ‘front-end’ called RStudio, so start that program up in whatever platform you are running on. You should see something like the display below (without all the text, which is from an open session on my machine).
I have labeled four major areas of the interface; these are:
- Console this is where you type commands and interact directly with the program
- Scripts and other files is where you can write scripts which are short programs consisting of a series of commands that can all be run one after another. This is more useful as you become proficient with the environment. Even as an inexperienced user, it is a good place to record the commands that worked in accomplishing whatever tasks you are working on. You can also get tabular views of open datasets in this panel. Note that this window may not appear at initial startup, in which case the console will extend the whole height of the window on the left.
- Environment/History here you can examine the data currently in your session (environment) more closely, or if you switch to the history tab, see all the commands you have issued in the session. The latter is very useful for retrieving versions of commands that worked, and copying them to the console or to the script window.
- Various outputs are displayed in this area. These will be plot outputs or help information.
Before going any further we need to make a new RStudio project, which is explained below.
Starting a new RStudio project
I encourage you to organise your work for each lab in this course as an RStudio project. To do this, first clear out any ‘junk’ that might be in the session from previous use by selecting Session - Clear Workspace… from the menu options.
Now you have cleaned house, we can create a new project:
- Select File - New Project… from the menu options
- Assuming you already downloaded the data, select the Existing Directory… option
- Navigate to the folder where you put the data by clicking the Browse… button and then
- Click the Create Project button
RStudio should relaunch itself and in the Files tab at the lower right of the interface you should see something like this:
The file week2.Rproj
(you might have given it a
different name) is an RStudio project file which will keep
track of your work over time. When you finish a session you should use
the File - Close Project menu option, and you will be
prompted to save your work.
If you want to take your work to another machine then the whole folder is what you should save to a backup memory device or cloud storage.
When you want to work on an old session, you can just double click
the week2.Rproj
file and RStudio will start up
more or less where you left off (although you might have to reload
packages if you are using them… we’ll talk about this later).
What if I don’t see a .Rproj
file?
First double check that you followed the instructions correctly.
If you did, the explanation is probably that your operating system is
hiding file extensions. A file extension is the part of the filename after the ., in this case Rproj.
Computer operating systems use the file extension to determine what programs can open a particular file. It’s useful to see filename extensions to make the most effective use of a computer, but modern operating systems often hide this information, I assume to avoid scaring users with how technical they are! For the purposes of this class, computers are technical, so…
To show file extensions follow the instructions below, depending on your operating system:
If you get stuck, ask for help! If it’s not lab time, ask in the Slack workspace!
Now you have a new project open, we can continue…
Meet the command line…
The key thing to understand about R is that it is mostly a command line interface (CLI) driven tool. That means you issue typed commands to tell R to do things (this was a normal way to interact with a computer and get work done before the early 1990s, and one that has made a comeback, like vinyl, but way cooler and a lot less boomer).
There are some menus and dialog boxes in RStudio to help you
manage things, but mostly you interact with R by typing
commands at the >
prompt in the console window. To
begin, we’ll load up a dataset, just so you can see how things work and
to help you get comfortable. As with most computer related stuff, you
should experiment: you will learn a lot more that way, and since this
session is not assessed, you won’t break anything important.
About these instructions
Throughout these instructions commands you should type at the console prompt will appear like this:
dir()
Sometimes, the response you see from R will be shown in the
same format but preceded by #
signs, like this:
## This is a response
When the expected result is a plot, you’ll see a thumbnail of the plot. You can click on the thumbnail to see a larger image.
Opening a data file
Before doing this, make sure you are in the right working directory. This should already be the case if you set up a project as explained above. If not then use the Session - Set Working Directory - Choose Directory… menu option to select the folder where you put the data.
Now we are ready to open the file. This can be done using the File - Import Dataset menu option, but to get into the spirit of working from the command line, we will do it by typing a command instead.
Type exactly the following in the console at the >
prompt:
quakes <- read.csv("earthquakes.csv")
It may appear that nothing has happened. But if you look in the
Environment tab (top-right panel) you should see that a
quakes
dataframe has appeared in your environment. You can
examine it with
View(quakes)
You should see a data table appear in the upper left panel. The data appear very similar to a spreadsheet. In R, data tables are known as dataframes and each column is an attribute or variable. The various variables that appear in the table are
- CUSP_ID a unique identifier for each earthquake or aftershock event
- NZMGE and NZMGN are New Zealand Map Grid Easting and Northing coordinates
- ELAPSED_DAYS is the number of days after September 3, 2010, when the big earthquake was recorded
- MAG is the earthquake or aftershock magnitude
- DEPTH is the estimated depth at which the earthquake or aftershock occurred
- YEAR, MONTH, DAY, HOUR, MINUTE, SECOND provide detailed time information
Look back at the command we used to read the data. You will see that
we assigned the result of reading the specified file to a
variable which we called quakes
. This means we now
have a dataframe called quakes
in memory, which we
can examine more closely.
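If you want a quick look without opening the table view, base R also provides functions like head(), str() and nrow(), for example:

head(quakes)   # show the first six rows of the dataframe
str(quakes)    # show the structure: variable names, types and a preview of values
nrow(quakes)   # count the rows, i.e. the number of recorded events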
Exploring the data
R provides many different ways to get a feel for the data
you are working with. One option is to plot
it.
WARNING: this might take some time. If you are
concerned your computer might not be up to the task, then don’t worry
about it and skip to the next step!
plot(quakes)
This might take a while… It’s also a bad case of waaaay too much information. R is trying to plot every possible pair of variables in the dataframe, and there is just not enough room to do it. Instead, we can plot a subset.
We need the dplyr
package to perform tidy selections, so
let’s load that (if you haven’t already done so, install the
dplyr
package first using the Tools - Install
Packages… menu option).
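(Packages can equally be installed from the console using base R’s install.packages() function; like the menu option this needs a web connection, and it only has to be done once per machine.)

install.packages("dplyr")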
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
and use the select()
function
quakes %>%
select(NZMGE, NZMGN, MAG, DEPTH) %>%
plot()
This time we have just picked out some columns in the dataset, which
gives the plot
function a better chance. The relatively
simple way in which R can be used to manipulate datasets seen
here is a major strength of the platform.
Looking at individual variables
It’s probably more useful to examine some individual variables in
isolation. We can refer to individual columns in the dataframe by
calling them by dataframe and variable name, such as
quakes$MAG
(note the $ sign). So for example, if I want a
summary of the magnitude of the aftershocks in this dataset I type
summary(quakes$MAG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.530 2.578 2.791 2.921 3.175 7.100
or the mean northing
mean(quakes$NZMGN)
## [1] 5737977
and R will return the values in response. Many other simple
results like these are available, such as min
,
max
, median
and also
quantile
.
Perhaps more informative is a boxplot or histogram. Try:
boxplot(quakes$MAG)
or
hist(quakes$MAG)
A handy shortcut
It gets tedious typing quakes
all the time, so you can
attach
the dataframe so that the variable names are
directly accessible without the quakes$
prefix by
typing
attach(quakes)
and then you can access the attributes of the quakes
dataset using their names alone, for example
hist(MAG)
will plot the specified variable.
Be careful using attach
as it can lead to ambiguity
about what you are plotting if you are working with different datasets
that include variables with the same names.
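If the ambiguity becomes a problem, or you are simply done with the shortcut, you can reverse it with the detach function:

detach(quakes)   # after this, names like MAG must again be prefixed with quakes$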
Try the above commands just to get a feel for things.
Making a map
To mentally prepare you for what’s coming, the next few paragraphs walk you through making a map of these data, using some packages that we will look at more closely in the coming weeks. I think it is helpful to do this just to get a feeling for what is going on before we dive into details in the coming weeks, and also to give you a feel for the capabilities of this platform.
First, we need to load some relevant libraries. sf
for
spatial data
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
tmap
for making maps
library(tmap)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
## (status 2 uses the sf package in place of rgdal)
and dplyr
for data wrangling.
library(dplyr)
When you load tmap
you might get a warning about the
imminent retirement of some dependencies. This should not affect us in
this class.
We use the sf
(simple features) package to read data in
spatial formats like geopackages, or shapefiles, with the
st_read
function:
nz <- st_read('nz.gpkg')
## Reading layer `nz' from data source
## `/Users/osullid3/Documents/teaching/Geog315/_labs/week-02/nz.gpkg'
## using driver `GPKG'
## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 2000301 ymin: 5310421 xmax: 2998959 ymax: 6756285
## Projected CRS: NZGD49_New_Zealand_Map_Grid
To make a map with this, we use the tmap
package. We’ll
learn more about this package in the next couple of weeks. Basically it
lets you make a map by progressively adding layers of data. To start a
map you tell it the dataset to use
map <- tm_shape(nz)
and then add information to tell R how to display the map.
In this case, we are mapping polygons, so the tm_polygons
function provides the needed information (to find out more about the
available options, type ?tm_polygons
at the command
prompt).
map + tm_polygons(col = 'green', border.col = 'black')
If we want to add a few more cartographic frills like a compass rose and scale bar, we can do that too:
map +
tm_polygons(col = 'darkseagreen3', border.col = 'skyblue', lwd = 0.5) +
tm_layout(main.title = 'Aotearoa New Zealand',
main.title.position = 'center',
main.title.size = 1,
bg.color = 'powderblue') +
tm_compass() +
tm_scale_bar()
The options I’ve used above include:
- col is the fill colour for the polygons
- border.col is… yes… the border colour
- lwd is the line width (or thickness)
- the function tm_layout provides some layout options like the main title and a background colour (bg.color)
- the functions tm_compass and tm_scale_bar do exactly what it sounds like they might do (note that I don’t think you really need these on a thematic map like this one, but the options exist)
For a list of named colours in R see this document. Try experimenting by changing a few things in the above map.
Consult the help on tm_layout
using
?tm_layout
to see what options are available.
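For example, here is one possible variation that changes only the colours (wheat, grey40 and lightcyan are all named R colours):

map +
tm_polygons(col = 'wheat', border.col = 'grey40', lwd = 0.5) +
tm_layout(main.title = 'Aotearoa New Zealand',
main.title.position = 'center',
bg.color = 'lightcyan')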
Adding another layer
The quakes
dataset is not in a spatial format, although
it includes spatial information (the easting and northing coordinates).
The sf
package provides the required functions to convert
the dataframe to a simple features dataset, which is a
spatial data format. The following command will do the necessary
conversion (you need to be careful to type it exactly as shown, or copy
and paste if you really need to).
quakes_sf <- quakes %>%
st_as_sf(coords = c('NZMGE', 'NZMGN'), crs = st_crs(nz))
What’s happening here? st_as_sf
is the function that
does the conversion. The parameters in parentheses tell the
function what to work on. First is the input dataframe
quakes
which is piped into the function with the
%>%
or pipe operator. Next the
coords
parameter tells the function which variables in the dataframe are the x and y coordinates. The c() structure concatenates the two variable
structure concatenates the two variable
names into a single vector which is required by
st_as_sf
. Finally, we also specify the coordinate
reference system or map projection of the data. These data are in
New Zealand Map Grid, which I made sure the nz
data layer
is also in. We use st_crs(nz)
to retrieve this information
from the nz
dataset and apply it to the new spatial
quakes_sf
dataset we are making.
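You can check that the conversion worked by printing the new dataset, or by querying its coordinate reference system with st_crs():

quakes_sf          # prints a summary including geometry type, bounding box and CRS
st_crs(quakes_sf)  # should report the same CRS as the nz layer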
Now that we have two datasets, we can make a layered map including both of them.
tm_shape(nz) +
tm_polygons(col = 'darkseagreen3') +
tm_shape(quakes_sf) +
tm_dots()
That’s OK, although not very useful; we really need to zoom in on the extent or bounding box of the earthquake data:
tm_shape(nz, bbox = st_bbox(quakes_sf)) +
tm_polygons(col = 'darkseagreen3') +
tm_shape(quakes_sf) +
tm_dots() +
tm_scale_bar()
An alternative to tm_dots
is tm_bubbles
which allows us to scale the symbols by some variable:
tm_shape(nz, bbox = st_bbox(quakes_sf)) +
tm_polygons(col = 'white', lwd = 0) +
tm_layout(bg.color = 'powderblue') +
tm_shape(quakes_sf) +
tm_bubbles(size = 'MAG', perceptual = TRUE, alpha = 0.5) +
tm_scale_bar()
This isn’t a great map. It might be easier to see if we only showed the larger aftershocks.
bigquakes <- quakes_sf %>%
filter(MAG >= 4)
Try again, this time also making the bubbles transparent:
tm_shape(nz, bbox = st_bbox(quakes_sf)) +
tm_polygons(col = 'white', lwd = 0) +
tm_layout(bg.color = 'powderblue') +
tm_shape(bigquakes) +
tm_bubbles(size = 'MAG', perceptual = TRUE, alpha = 0) +
tm_scale_bar()
Alternatively, we might use colour to show the different magnitudes:
tm_shape(nz, bbox = st_bbox(quakes_sf)) +
tm_polygons(col = 'white', lwd = 0) +
tm_layout(bg.color = 'powderblue') +
tm_shape(bigquakes) +
tm_bubbles(col = 'MAG', palette = 'Reds', alpha = 0.5) +
tm_scale_bar()
That’s probably enough experimenting to give you the general idea.
A web basemap
One other thing we can do with the tmap
package is make
a web map instead. We no longer need the nz
layer, we just
have to switch modes
tmap_mode('view')
## tmap mode set to interactive viewing
[To switch back use tmap_mode('plot')]
Then make a map as before, but this time there is no need for the
nz
layer
tm_shape(quakes_sf) +
tm_dots(col = 'MAG', palette = 'Reds')
Making an interactive web map with just a couple of lines of code is pretty nice!
Review
The aim of this week has been just to get a feel for things. Don’t panic if you don’t completely understand what is happening. The important thing is to realize
- You make things happen by typing commands in the console.
- Commands either cause things to happen (like plots) or they create new variables (data with a name attached), which we can further manipulate using other commands.
- Variables and the data they contain remain in memory (you can see them in the Environment tab) and can be manipulated as required.
- RStudio remembers everything you have typed (check the History tab if you don’t believe this!)
- All the plots you make are also remembered (mess around with the back and forward arrows in the plot display).
Working more efficiently
The session history
The History tab is particularly useful. If you want
to run a command again, find it in the list, select it and then select
the To Console option (at the top). The command will
appear in the console at the current prompt, where you can edit it to
make any desired changes and hit <RETURN>
to run it
again.
You can also get the history functionality using the up arrow key in the console, which will bring previous commands back to the console line for you to reuse. But this gets kind of annoying once you have run many commands.
Scripts
Another way to rerun things you have done earlier is to save them to a script. Open a new script with File - New File - R Script. You can type commands here in the usual way, but they won’t run immediately like they do when you type them in the console. Instead you can select lines and run them using the Run button at the top of the scripts area of the user interface (the top left area).
When you have worked a whole workflow out, you can record it in a
script file to run the whole thing later. The easiest way to do this is
to go to the history, select the commands you want, and then select
To Source to drop them into an open script file. This
will add the commands to the current file in the upper left panel, and
then you can save them to a .R
script file to run all at
once.
For example, in the history, find the command used to open the data
file, then the one used to attach the data, then one that makes a
complicated plot. Add each one in turn to the source file (in the proper
order). Then from the scripts area, select File - Save
As… and save the file to some name (say
test.R
).
What you have done is to write a short program!
To run it select all the code, and hit the Run button.
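Assembled from commands earlier in this lab, a minimal test.R might look something like this (just a sketch; yours will reflect whichever commands you chose):

# test.R: open the data, attach it, and draw a histogram of magnitudes
quakes <- read.csv("earthquakes.csv")
attach(quakes)
hist(MAG)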
Here’s one I prepared earlier…
To see all this in a simple example, try downloading, opening and
running the file example-script.R
.
As we work through this class, you’ll learn more about these ways of automating work in R.
Additional resources
R is really a programming language as much as it is a piece of software, and there is a lot more to learn about it than is covered here, or will be covered in this course. If you want to know more about R as a general statistics environment there is a good online guide here which provides a more detailed introduction.
For the purposes of this course, the commands you really need to get a handle on are explored in the corresponding weekly labs.