Introducing R and RStudio

Geog 315 T2 2023

Data

Before we get into it, download the data for this lab below:

  • earthquakes.csv a table of data from the Ōtautahi Christchurch earthquakes
  • nz.gpkg a basemap for Aotearoa New Zealand

These files should be saved to a folder you are working in. If you are working on a lab machine, it needs to be a folder on the H: drive where it won’t be lost after you log out. Wherever you are working, I recommend working in an appropriately named folder (something like geog315-week02).

Introduction

This lab will introduce you to the statistical analysis and programming environment R, running in RStudio (which makes R a bit easier to deal with for most people). R has become one of the standard tools for statistical analysis, particularly in the academic research community, but increasingly also in commercial and other work settings. R is well suited to these settings for a number of reasons, in particular:

  1. it is free [as in beer];
  2. it is easily extensible; and
  3. because of 1 and 2, many new methods of analysis first become available in packages contributed to the R ecosystem by researchers in the field.

The last point is why we are using R in this course. All required software is installed on the lab machines.

Like any good software, versions of R are available for macOS, Windows, and Linux, so you can install a copy on your own computer and work on this lab in your own time. You don’t need to be at the timetabled lab sessions to complete the assignment, although you will find it helpful to attend to get assistance from the instructors, and also from one another.

To get up and running on your own computer, you will need to download and install R itself, and also, optionally (but HIGHLY recommended), RStudio. Details of the installation were provided last week. If you are still having difficulties, just ask, and I’ll do what I can to help. Installation is pretty straightforward on all platforms.

When you are running R you will want a web connection to install any additional packages called for in lab instructions. You will also find it useful to have a reasonably high resolution display (an old 1024×768 display will not be much fun to work on, but modern high pixel density displays, such as 4K, can also be a bit painful without tweaking the display settings).

DON’T PANIC!

This lab introduces R by just asking you to get on with it, without stopping to explain too much, at least not at first. This is because it’s probably better to just do things with R to get a feel for what it’s about, without thinking too hard about what is going on; kind of like learning to swim by jumping in at the deep end. You may sometimes feel like you are drowning. Try not to worry too much and stick with it, and bear in mind that the assignments do not assume you are some kind of R guru.

Ask questions, confer with your fellow students, consult Google (this cartoon is good advice).

An overview of RStudio

We’re using R inside a slightly friendlier ‘front-end’ called RStudio, so start that program up on whatever platform you are using. You should see something like the display below (without all the text, which is from an open session on my machine).

I have labelled four major areas of the interface. These are:

  • Console this is where you type commands and interact directly with the program
  • Scripts and other files is where you can write scripts which are short programs consisting of a series of commands that can all be run one after another. This is more useful as you become proficient with the environment. Even as an inexperienced user, it is a good place to record the commands that worked in accomplishing whatever tasks you are working on. You can also get tabular views of open datasets in this panel. Note that this window may not appear at initial startup, in which case the console will extend the whole height of the window on the left.
  • Environment/History here you can examine the data currently in your session (environment) more closely, or if you switch to the history tab, see all the commands you have issued in the session. The latter is very useful for retrieving versions of commands that worked, and copying them to the console or to the script window.
  • Plots and other output this is where various outputs are displayed, such as plots and help information.

Before going any further we need to make a new RStudio project, which is explained below.

Starting a new RStudio project

I encourage you to organise your work for each lab in this course as an RStudio project. To do this, first clear out any ‘junk’ that might be in the session from previous use by selecting Session - Clear Workspace… from the menu options.

Now you have cleaned house, we can create a new project:

  • Select File - New Project… from the menu options
  • Assuming you already downloaded the data, select the Existing Directory… option
  • Navigate to the folder where you put the data by clicking the Browse… button and then
  • Click the Create Project button

RStudio should relaunch itself and in the Files tab at the lower right of the interface you should see something like this:

The file week2.Rproj (you might have given it a different name) is an RStudio project file which will keep track of your work over time. When you finish a session you should use the File - Close Project menu option, and you will be prompted to save your work.

If you want to take your work to another machine then the whole folder is what you should save to a backup memory device or cloud storage.

When you want to return to an old session, you can just double click the week2.Rproj file and RStudio will start up more or less where you left off (although you might have to reload packages if you are using them… we’ll talk about this later).

What if I don’t see a .Rproj file?

First double check that you followed the instructions correctly.

If you did, the explanation is probably that your operating system is hiding file extensions. A file extension is the part of the filename after the ., in this case Rproj.

Computer operating systems use the file extension to determine what programs can open a particular file. It’s useful to see filename extensions to make the most effective use of a computer, but modern operating systems often hide this information, I assume to avoid scaring users with how technical computers are! For the purposes of this class, computers are technical, so…

To show file extensions follow the instructions below, depending on your operating system:

If you get stuck, ask for help! If it’s not lab time, ask in the Slack workspace!

Now you have a new project open, we can continue…

Meet the command line…

The key thing to understand about R is that it is mostly a command line interface (CLI) driven tool. That means you issue typed commands to tell R to do things (this was the normal way to interact with a computer and get work done before the early 1990s, and, like vinyl, it has made a comeback, but way cooler, and a lot less boomer).

There are some menus and dialog boxes in RStudio to help you manage things, but mostly you interact with R by typing commands at the > prompt in the console window. To begin, we’ll load up a dataset, just so you can see how things work and to help you get comfortable. As with most computer related stuff, you should experiment: you will learn a lot more that way, and since this session is not assessed, you won’t break anything important.

About these instructions

Throughout these instructions commands you should type at the console prompt will appear like this:

dir()

Sometimes, the response you see from R will be shown in the same format but preceded by # signs, like this:

## This is a response

When the expected result is a plot, you’ll see a thumbnail of the plot. You can click on the thumbnail to see a larger image.

Opening a data file

Before doing this, make sure you are in the right working directory. This should already be the case if you set up a project as explained above. If not, use the Session - Set Working Directory - Choose Directory… menu option to select the folder where you put the data.
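
If you prefer to stay at the command line, the base R functions getwd() and setwd() report and change the working directory. For example (the path below is just a placeholder based on the folder suggested earlier; substitute your own):

getwd()                      # print the current working directory
setwd("H:/geog315-week02")   # change it; placeholder path, use your own folder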

Now we are ready to open the file. This can be done using the File - Import Dataset menu option, but to get into the spirit of working from the command line, we will do it by typing a command instead.

Type exactly the following in the console at the > prompt:

quakes <- read.csv("earthquakes.csv")

It may appear that nothing has happened. But if you look in the Environment tab (top-right panel) you should see that a quakes dataframe has appeared in your environment. You can examine it with

View(quakes)

You should see a data table appear in the upper left panel. The data appear very similar to a spreadsheet. In R, data tables are known as dataframes and each column is an attribute or variable. The various variables that appear in the table are

  • CUSP_ID a unique identifier for each earthquake or aftershock event
  • NZMGE and NZMGN are New Zealand Map Grid Easting and Northing coordinates
  • ELAPSED_DAYS is the number of days after September 3, 2010, when the big earthquake was recorded
  • MAG is the earthquake or aftershock magnitude
  • DEPTH is the estimated depth at which the earthquake or aftershock occurred
  • YEAR, MONTH, DAY, HOUR, MINUTE, SECOND provide detailed time information

Look back at the command we used to read the data. You will see that we assigned the result of reading the specified file to a variable which we called quakes. This means we now have a dataframe called quakes in memory, which we can examine more closely.
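
A couple of base R functions give a quick overview of a dataframe without opening the table view:

head(quakes)   # show the first six rows
str(quakes)    # show the structure: variable names, types, and example values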

Exploring the data

R provides many different ways to get a feel for the data you are working with. One option is to plot it. WARNING: this might take some time. If you are concerned your computer might not be up to the task, then don’t worry about it and skip to the next step!

plot(quakes)

This might take a while… It’s also a bad case of waaaay too much information. R is trying to plot every possible pair of variables in the dataframe, and there is just not enough room to do it. Instead, we can plot a subset.

We need the dplyr package to perform tidy selections, so let’s load that (if you haven’t already done so, install the dplyr package first using the Tools - Install Packages… menu option).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

and use the select() function

quakes %>%
  select(NZMGE, NZMGN, MAG, DEPTH) %>%
  plot()

This time we have picked out just a few columns of the dataset, which gives the plot function a better chance. The relative ease with which datasets can be manipulated in this way is a major strength of the platform.

Looking at individual variables

It’s probably more useful to examine some individual variables in isolation. We can refer to individual columns in the dataframe by dataframe and variable name, such as quakes$MAG (note the $ sign). So, for example, if I want a summary of the magnitudes of the aftershocks in this dataset I type

summary(quakes$MAG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.530   2.578   2.791   2.921   3.175   7.100

or the mean northing

mean(quakes$NZMGN)
## [1] 5737977

and R will return the values in response. Many other simple results like these are available, such as min, max, median and also quantile.
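
For example, try these at the console (all standard base R functions):

min(quakes$MAG)        # smallest magnitude in the dataset
max(quakes$MAG)        # largest magnitude
median(quakes$MAG)     # middle value
quantile(quakes$MAG)   # quartiles by default; quantile(quakes$MAG, 0.9) gives the 90th percentile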

Perhaps more informative is a boxplot or histogram. Try:

boxplot(quakes$MAG)

or

hist(quakes$MAG)

A handy shortcut

It gets tedious typing quakes$ all the time. You can attach the dataframe so that the variable names are directly accessible without the quakes$ prefix by typing

attach(quakes)

and then you can access the attributes of the quakes dataset using their names alone, for example

hist(MAG)

will plot the specified variable.

Be careful using attach as it can lead to ambiguity about what you are plotting if you are working with different datasets that include variables with the same names.
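
When you are finished working this way, you can undo the shortcut with detach(), which removes quakes from the search path so the bare variable names no longer work:

detach(quakes)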

Try the above commands just to get a feel for things.

Making a map

To mentally prepare you for what’s coming, the next few paragraphs walk you through making a map of these data, using some packages that we will look at more closely in the coming weeks. I think it is helpful to do this just to get a feeling for what is going on before we dive into the details, and also to give you a feel for the capabilities of this platform.

First, we need to load some relevant libraries: sf for spatial data

library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

tmap for making maps

library(tmap)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)

and dplyr for data wrangling.

library(dplyr)

When you load tmap you might get a warning about the imminent retirement of some dependencies. This should not affect us in this class.

We use the sf (simple features) package to read data in spatial formats, like geopackages or shapefiles, with the st_read function:

nz <- st_read('nz.gpkg')
## Reading layer `nz' from data source 
##   `/Users/osullid3/Documents/teaching/Geog315/_labs/week-02/nz.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 2000301 ymin: 5310421 xmax: 2998959 ymax: 6756285
## Projected CRS: NZGD49_New_Zealand_Map_Grid

To make a map with this, we use the tmap package. We’ll learn more about this package in the next couple of weeks. Basically it lets you make a map by progressively adding layers of data. To start a map you tell it the dataset to use

map <- tm_shape(nz)

and then add information to tell R how to display the map. In this case, we are mapping polygons, so the tm_polygons function provides the needed information (to find out more about the available options, type ?tm_polygons at the command prompt).

map + tm_polygons(col = 'green', border.col = 'black')

If we want to add a few more cartographic frills like a compass rose and scale bar, we can do that too:

map +
  tm_polygons(col = 'darkseagreen3', border.col = 'skyblue', lwd = 0.5) +
  tm_layout(main.title = 'Aotearoa New Zealand',
            main.title.position = 'center',
            main.title.size = 1,
            bg.color = 'powderblue') +
  tm_compass() +
  tm_scale_bar()

The options I’ve used above include:

  • col is the fill colour for the polygons
  • border.col is… yes… the border colour
  • lwd is the line width (or thickness)
  • the function tm_layout provides some layout options like the main title and a background colour (bg.color)
  • the functions tm_compass and tm_scale_bar do exactly what they sound like they should (note that I don’t think you really need these on a thematic map like this one, but the options exist)

For a list of named colours in R see this document. Try experimenting by changing a few things in the above map.

Consult the help on tm_layout using ?tm_layout to see what options are available.

Adding another layer

The quakes dataset is not in a spatial format, although it includes spatial information (the easting and northing coordinates). The sf package provides the required functions to convert the dataframe to a simple features dataset, which is a spatial data format. The following command will do the necessary conversion (you need to be careful to type it exactly as shown, or copy and paste if you really need to).

quakes_sf <- quakes %>%
  st_as_sf(coords = c('NZMGE', 'NZMGN'), crs = st_crs(nz))

What’s happening here? st_as_sf is the function that does the conversion. The parameters in parentheses tell the function what to work on. First is the input dataframe quakes, which is piped into the function with the %>% or pipe operator. Next, the coords parameter tells the function which variables in the dataframe are the x and y coordinates. The c() structure concatenates the two variable names into a single vector, which is what st_as_sf requires. Finally, we also specify the coordinate reference system, or map projection, of the data. These data are in New Zealand Map Grid, which I made sure the nz data layer is also in. We use st_crs(nz) to retrieve this information from the nz dataset and apply it to the new spatial quakes_sf dataset we are making.
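
If you want to reassure yourself that the conversion worked, you can compare the coordinate reference systems of the two datasets. Since we assigned the CRS from nz, this comparison should return TRUE:

st_crs(quakes_sf) == st_crs(nz)
## [1] TRUE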

Now that we have two datasets, we can make a layered map including both of them.

tm_shape(nz) +
  tm_polygons(col = 'darkseagreen3') +
  tm_shape(quakes_sf) +
  tm_dots()

That’s OK, although not very useful; we really need to zoom in on the extent or bounding box of the earthquake data:

tm_shape(nz, bbox = st_bbox(quakes_sf)) +
  tm_polygons(col = 'darkseagreen3') +
  tm_shape(quakes_sf) +
  tm_dots() +
  tm_scale_bar()

An alternative to tm_dots is tm_bubbles which allows us to scale the symbols by some variable:

tm_shape(nz, bbox = st_bbox(quakes_sf)) +
  tm_polygons(col = 'white', lwd = 0) +
  tm_layout(bg.color = 'powderblue') +
  tm_shape(quakes_sf) +
  tm_bubbles(size = 'MAG', perceptual = TRUE, alpha = 0.5) +
  tm_scale_bar()

This isn’t a great map. It might be easier to see if we only showed the larger aftershocks.

bigquakes <- quakes_sf %>%
  filter(MAG >= 4)
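
It is worth checking how many events pass the filter; nrow() counts the rows (here, earthquake events) in a dataframe or sf dataset:

nrow(bigquakes)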

Try again, this time also making the bubbles transparent:

tm_shape(nz, bbox = st_bbox(quakes_sf)) +
  tm_polygons(col = 'white', lwd = 0) +
  tm_layout(bg.color = 'powderblue') +
  tm_shape(bigquakes) +
  tm_bubbles(size = 'MAG', perceptual = TRUE, alpha = 0) +
  tm_scale_bar()

Alternatively, we might use colour to show the different magnitudes:

tm_shape(nz, bbox = st_bbox(quakes_sf)) +
  tm_polygons(col = 'white', lwd = 0) +
  tm_layout(bg.color = 'powderblue') +
  tm_shape(bigquakes) +
  tm_bubbles(col = 'MAG', palette = 'Reds', alpha = 0.5) +
  tm_scale_bar()

That’s probably enough experimenting to give you the general idea.

A web basemap

One other thing we can do with the tmap package is make a web map instead. We no longer need the nz layer; we just have to switch modes

tmap_mode('view')
## tmap mode set to interactive viewing

[To switch back use tmap_mode('plot')]

Then make a map as before, but this time there is no need for the nz layer, because a web basemap appears automatically

tm_shape(quakes_sf) +
  tm_dots(col = 'MAG', palette = 'Reds')

Making an interactive web map with just a couple of lines of code is pretty nice!
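
As an aside, if you want to keep an interactive map to share, the tmap_save() function can write a map object to a standalone HTML file (the filename below is just an example):

webmap <- tm_shape(quakes_sf) +
  tm_dots(col = 'MAG', palette = 'Reds')
tmap_save(webmap, 'quake-map.html')   # example filename; creates a self-contained web page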

Review

The aim of this week has been just to get a feel for things. Don’t panic if you don’t completely understand what is happening. The important thing is to realise that

  • You make things happen by typing commands in the console.
  • Commands either cause things to happen (like plots) or they create new variables (data with a name attached), which we can further manipulate using other commands.
  • Variables and the data they contain remain in memory (you can see them in the Environment tab) and can be manipulated as required.
  • RStudio remembers everything you have typed (check the History tab if you don’t believe this!).
  • All the plots you make are also remembered (mess around with the back and forward arrows in the plot display).

Working more efficiently

The session history

The History tab is particularly useful. If you want to run a command again, find it in the list, select it and then select the To Console option (at the top). The command will appear in the console at the current prompt, where you can edit it to make any desired changes and hit <RETURN> to run it again.

You can also get the history functionality using the up arrow key in the console, which will bring previous commands back to the console line for you to reuse. But this gets kind of annoying once you have run many commands.

Scripts

Another way to rerun things you have done earlier is to save them to a script. Open a new script with File - New File - R Script. You can type commands here in the usual way, but they won’t run immediately like they do when you type them in the console. Instead, you can run selected lines using the Run button at the top of the scripts area of the user interface (the top left area).

When you have worked out a whole workflow, you can record it in a script file to run the whole thing later. The easiest way to do this is to go to the history, select the commands you want, and then select To Source to drop them into an open script file. This will add the commands to the current file in the upper left panel, and then you can save them to a .R script file to run all at once.

For example, in the history, find the command used to open the data file, then the one used to attach the data, then one that makes a complicated plot. Add each one in turn to the source file (in the proper order). Then from the scripts area, select File - Save As… and save the file under some name (say test.R).
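
A test.R assembled this way might look something like the sketch below (yours will contain whichever commands you picked from the history):

# test.R: open the data, attach it, and draw a plot
quakes <- read.csv("earthquakes.csv")   # read the earthquake data
attach(quakes)                          # make variable names directly accessible
hist(MAG)                               # histogram of aftershock magnitudes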

What you have done is to write a short program!

To run it, select all the code and hit the Run button.

Here’s one I prepared earlier…

To see all this in a simple example, try downloading, opening and running the file example-script.R.

As we work through this class, you’ll learn more about these ways of automating work in R.

Additional resources

R is really a programming language as much as it is a piece of software, and there is a lot more to learn about it than is covered here, or will be covered in this course. If you want to know more about R as a general statistics environment, there is a good online guide here which provides a more detailed introduction.

For the purposes of this course, the commands you really need to get a handle on are explored in the corresponding weekly labs.