library(dplyr)
library(tidyr)
library(janitor)
library(stringr)
library(here)
library(lubridate)
library(sf)
library(terra)
library(tmap)
library(ggplot2)
library(zoo) # useful for serial data
library(R.utils)
source(str_glue("{here()}/scripts/gps-data-utils.R"))
Exploring GPS data
Checking over GPS data for accuracy of speed, etc.
Date | Changes |
---|---|
2024-09-26 | Added an executive summary. |
2024-07-19 | Updated to add turn angle calculation and source gps cleaning function from gps-data-util.R. |
2024-07-16 | Initial post. |
TL;DR Executive summary
The key findings are:
- GPS reported speed and height data are reasonable choices to use in the estimation of hiking functions.
- However, some cleanup is required: speeds of up to 200km/h (helicopter flights), very long elapsed distances, and some negative elevations appear in the raw data and should be removed.
- Not all the issues with the GPS data are uncovered in this notebook (it is an initial exploration), so some additional filtering is applied in subsequent notebooks as deemed necessary. In particular, see this notebook
- The bulk of the GPS data cleaning operations are carried out on the basis of the list compiled at the end of this notebook, detailed in this section and implemented in this script
These notes explore the usefulness, accuracy etc. of the provided GPS data.
Of particular interest is the asymmetry or not of movement speeds with respect to slope, given the salience of this for ‘hiking functions’. It is possible in a largely ‘unpathed’ environment that the asymmetry is limited (see e.g. Rees 20041)
We use just one of the datasets to explore general characteristics. Remove a meaningless column x
and do a little bit of cleanup on the rcr
and date and time columns.
<- get_gps_data_as_sf(str_glue(
gps_data "{here()}/_data/GPS-1516Season-MiersValley/Fraser-FINAL.csv")) |>
select(-x) |>
mutate(rcr = as.logical(rcr),
date_time = ymd_hms(paste(date, time)))
Exploration of the data
A summary of the data suggests that there are some data rows that should be ignored.
summary(gps_data)
index rcr date time
Min. : 1 Mode:logical Length:7323 Length:7323
1st Qu.:1832 TRUE:7304 Class :character Class :character
Median :3662 NA's:19 Mode :character Mode :character
Mean :3662
3rd Qu.:5492
Max. :7323
valid latitude n_s longitude
Length:7323 Min. :-78.12 Length:7323 Min. :163.7
Class :character 1st Qu.:-78.11 Class :character 1st Qu.:163.8
Mode :character Median :-78.10 Mode :character Median :163.8
Mean :-78.10 Mean :163.9
3rd Qu.:-78.10 3rd Qu.:163.9
Max. :-78.09 Max. :164.2
e_w height_m speed_km_h pdop
Length:7323 Min. :-23.1 Min. : 0.000 Min. : 0.98
Class :character 1st Qu.:102.1 1st Qu.: 0.255 1st Qu.: 1.14
Mode :character Median :110.8 Median : 0.496 Median : 1.19
Mean :180.1 Mean : 1.604 Mean : 1.27
3rd Qu.:165.7 3rd Qu.: 2.241 3rd Qu.: 1.24
Max. :636.9 Max. :209.555 Max. :13.76
hdop vdop nsat_used_view distance_m
Min. : 0.6100 Min. :0.7600 Length:7323 Min. : 0.00
1st Qu.: 0.7400 1st Qu.:0.8600 Class :character 1st Qu.: 1.38
Median : 0.7900 Median :0.8900 Mode :character Median : 2.78
Mean : 0.7977 Mean :0.9756 Mean : 21.73
3rd Qu.: 0.8400 3rd Qu.:0.9200 3rd Qu.: 14.99
Max. :13.7200 Max. :4.7100 Max. :74486.45
geometry date_time
POINT :7323 Min. :2016-01-13 16:28:31.07
epsg:3031 : 0 1st Qu.:2016-01-15 16:43:27.00
+proj=ster...: 0 Median :2016-01-19 12:46:32.00
Mean :2016-01-19 01:37:06.72
3rd Qu.:2016-01-22 08:39:45.00
Max. :2016-01-23 15:16:47.00
A speed of > 200km/h was clearly not achieved on foot! Similarly an elapsed distance between 1/30Hz fixes of 74km was not on foot either.
The GPS dilution of precision measures for horizontal and vertical position (hdop and vdop) are mostly within the ‘good’ range (<1) with some outliers. The 3D positions (pdop) are not as good, but are also generally considered reliable (<5). Inspection of a quick map of the data suggests helicopter movement or similar for some ‘singleton’ isolated observations:
ggplot(gps_data) +
geom_sf(size = 0.25) +
theme_minimal()
This being the case it seems reasonable to remove observations with clearly unreasonable speed estimates, assuming that these result from use of other forms of transport. However, before doing so, we augment the data with additional estimates of speed, distance, turn angle, etc. calculated from consecutive fixes (it is not advisable to remove any fixes before making these estimates).
Sanity checking the GPS estimates of speed, distance, and height
Before applying that potentially simplistic cleaning of the data, it is also reasonable to sanity check the speed, distance, and height estimates in the GPS data. First we obtain height data from a 10m DEM.
<- gps_data |>
xy st_coordinates() |>
as_tibble()
<- rast(str_glue("{here()}/_data/input/dry-valleys-10m.tif"))
dem # and useful for later will be a hillshade
<- dem |> terra::terrain("slope", unit = "radians")
slope <- dem |> terra::terrain("aspect", unit = "radians")
aspect <- shade(slope, aspect)
hillshade <- dem |> terra::terrain("TRI")
TRI
<- dem |>
heights_from_dem ::extract(xy, method = "simple") |>
terraselect(2)
<- slope |>
slope_from_dem ::extract(xy, method = "simple") |>
terraselect(2)
<- TRI |>
TRI_from_dem ::extract(xy, method = "simple") |>
terraselect(2)
<- xy |>
xy bind_cols(heights_from_dem) |>
rename(Z = `dry-valleys-10m`) |>
bind_cols(slope_from_dem) |>
mutate(dem_slope = tan(slope)) |>
select(-slope) |>
bind_cols(TRI_from_dem)
<- gps_data |>
gps_data_plus bind_cols(xy)
Now calculate estimates of distance, change in elevation, and slope between observations, from these numbers and from the GPS data.
<- gps_data_plus |>
gps_data_plus mutate(dx1 = as.numeric(diff(as.zoo(X), na.pad = TRUE)),
dy1 = as.numeric(diff(as.zoo(Y), na.pad = TRUE)),
dx2 = as.numeric(diff(as.zoo(X), lag = -1, na.pad = TRUE)),
dy2 = as.numeric(diff(as.zoo(Y), lag = -1, na.pad = TRUE)),
turn_angle = get_turn_angle(dx1, dy1, dx2, dy2),
dz = as.numeric(diff(as.zoo(Z), na.pad = TRUE)),
dh = as.numeric(diff(as.zoo(height_m), na.pad = TRUE)),
dt = as.numeric(diff(as.zoo(date_time), na.pad = TRUE)),
dxy = sqrt(dx1^ 2 + dy1 ^ 2),
dxyz = sqrt(dx1 ^ 2 + dy1 ^ 2 + dz ^ 2),
dxyh = sqrt(distance_m ^ 2 + dh ^ 2),
slope_h = dh / distance_m,
slope_z = dz / dxy,
speedxy = dxy / dt * 3600 / 1000, speedxyz = dxyz / dt * 3600 / 1000)
Now some comparisons.
Distance estimates
The distance_m
estimates in the provided data are a little higher than the manually calculated point to point straight line distances between fixes. However, examination of sequences of data show the two time series are well aligned (below). Differences may be due to filtering applied by the GPS unit, although this is not well documented.
In any case, the GPS distance_m
variable seems safe for use in further analysis.
|>
gps_data_plus filter(dt == 30) |>
slice(2:151) |>
pivot_longer(cols = c(distance_m, dxy, dxyz)) |>
ggplot() +
geom_line(aes(x = date_time, y = value, colour = name, group = name), alpha = 0.5) +
xlab("Date/time") + ylab("Estimated distance") +
theme_minimal()
Speed estimates
Differences in the speed estimates are more pronounced:
|>
gps_data_plus filter(dt == 30) |>
slice(2:151) |>
pivot_longer(cols = c(speed_km_h, speedxy, speedxyz)) |>
ggplot() +
geom_line(aes(x = date_time, y = value, colour = name, group = name), alpha = 0.5) +
xlab("Date/time") + ylab("Estimated speed") +
theme_minimal()
The provided speed estimates are generally higher although there is some alignment between the time series, insofar as extended periods of low speeds tend to line up.
A likely reason for the differences is that in a 30 second period (the usual interval between fixes) an individual may move in many directions so that the direct point to point distance between fixes is less than might be expected based on the recorded speed. So, for example, in the above chart between 16:30 and 16:40 the person was moving not in straight lines, where in the following 10 minutes they were moving in relatively straight lines. We can test this theory, at least qualitatively, by calculating a sinuosity estimate between fixes.
<- gps_data_plus |>
sinuosity filter(dt == 30) |>
slice(2:61) |>
# select(date_time, speed_km_h, dxy, speedxy) |>
mutate(sinuosity = speed_km_h / 120 / distance_m,
date_time_start = lag(date_time))
<- sinuosity |>
speeds pivot_longer(cols = c(speed_km_h, speedxy))
ggplot(speeds) +
geom_step(aes(x = date_time, y = value, group = name, linetype = name),
direction = "vh", lwd = 0.5) +
geom_segment(data = sinuosity,
aes(x = date_time_start, xend = date_time,
y = speed_km_h, colour = rank(sinuosity)), lwd = 2) +
scale_colour_distiller(palette = "Reds", direction = 1) +
xlab("Date/time") + ylab("Estimated speed") +
theme_minimal()
Since the estimated sinuosity here is based only on the GPS estimated speed and point to point distances, that it is generally highest when the difference between the GPS reported speed speed_km_h
and the manually estimated speed speedxy
is largest supports the assumption that the GPS is reporting speed using satellite carrier signal doppler effects.
Based on this check, the GPS speed_km_h
variable seems reasonable to use for hiking function estimation.
GPS height and elevation from the DEM
Again, there are differences between heights recorded by the GPS units, and heights extracted from the 10m DEM. Again, it is difficult to know why these might occur, although there is a clear correlation between the two. There is no obvious correlation with surface roughness (TRI), for example, or with the vertical dilution of precision (vdop
).
ggplot(gps_data_plus |>
filter(dt == 30, distance_m < 60, height_m > 0, Z > 0)) +
geom_point(aes(x = TRI, y = height_m / Z, colour = vdop), alpha = 0.25) +
scale_colour_viridis_c(option = "A", direction = -1) +
theme_minimal()
Pragmatically, in the next section, I find that the relationship between slope and speed based on the GPS readings is much better behaved than that derived from the speed and elevation extracted from the DEM. The reason it can be used to estimate slope in that context is divergences between height_m
(from the GPS) and Z
(from the DEM) are serially correlated, so that estimates of slope derived from consecutive data points are useable for that purpose.
Some reassurance that the GPS height estimates are suitably aligned is obtained by plotting each days set of observations of the two as time series. On days where much of the activity occurred at the same elevation or across a very limited elevation range there are obvious deviations, but these are relatively small in magnitude (a few metres); on days when more elevation chance is observed, the overall profile of the day’s activity matches well.
|>
gps_data_plus filter(dt == 30) |>
pivot_longer(cols = c(height_m, Z)) |>
ggplot() +
geom_line(aes(x = date_time, y = value, colour = name, group = name), alpha = 0.5) +
xlab("Date/time") + ylab("Estimated elevation") +
facet_wrap(~ date, scales = "free", ncol = 2) +
theme_minimal()
Again, it seems reasonable to use the GPS unit reported heights for estimation of hiking functions. The main impact on the analysis in the next section is to introduce noise into the slope at low values.
A hiking function
A hiking function relates slope of the land to speed at which it is traversed. Using the GPS unit reported speed and slope derived from GPS unit reported distance and height, locally estimated scatter plot smoothing yields an approximately Gaussian curve with the peak speed at or close to 0 slope. This gives some confidence that we can use these data to estimate hiking functions. The approach implied in this plot is insufficient, as the data contain many fixes that don’t give much information about ‘hiking’ as such. The preponderance of lower turn angle fixes in the more ‘mobile’ parts of the data (i.e. higher speeds) suggests that it may be possible to filter out those data points.
Better estimation of a hiking function is left to a later notebook (see Estimating a hiking function).
|>
gps_data_plus # here's a possible filter to apply to the data to improve the estimation
filter(dt == 30, speed_km_h < 10, !is.na(slope_h), distance_m > 2.5, turn_angle < 150) |>
ggplot(aes(x = slope_h, y = speed_km_h, colour = turn_angle)) +
geom_point(size = 0.25) +
geom_smooth(aes(x = slope_h, y = speed_km_h))
<- loess(speed_km_h ~ slope_h, data = gps_data_plus) model
Many data in the above plot are probably from ‘milling about’ in breaks, at camp, or around experiments, rather than while on the move. The best way to remove these sections is probably based on turn angles.
Conclusion: a reasonable GPS file cleaning function
Based on this exploration, GPS data need to be processed as follows (note that no initial conversion to a spatial format is required at this stage, since none of the added spatial data introduced above is required).
- Read raw data from CSV.
- Clean names with
janitor::clean_names
. - Remove meaningless variable
x
. - Add a
name
reflecting the id of the person whose data is. - Augment data with
dt
(time interval between fixes),dh
(height change between fixes),slope_h
(estimated slope between fixes), andturn_angle
(change in direction at this fix). - Remove data outside expected latitudinal range (> -78 latitude).
- Remove fixes with high estimated speeds (>~ 10km/h).
- Remove fixes with long estimated distances (>~ 60m).
- Identify breaks in the series where
dt != 30
and tag subsets of the data accordingly. - Count number of fixes in each subset – consider removing small ones…
- Drop NAs and any rows where
dt != 30
.
This is the basis for a function get_cleaned_gps_data()
in the script gps-data-utils.R
used for cleaning up / augmenting the raw CSV files.
Footnotes
Rees WG. 2004. Least-cost paths in mountainous terrain. Computers & Geosciences 30(3) 203–209. doi: 10.1016/j.cageo.2003.11.001.↩︎