Exploring GPS data

Checking over GPS data for accuracy of speed, etc.

Author

David O’Sullivan

Published

July 16, 2024

Date Changes
2024-09-26 Added an executive summary.
2024-07-19 Updated to add turn angle calculation and source gps cleaning function from gps-data-util.R.
2024-07-16 Initial post.

TL;DR Executive summary

The key findings are:

  • GPS reported speed and height data are reasonable choices to use in the estimation of hiking functions.
  • However, some cleanup is required: speeds of up to 200km/h (helicopter flights), very long elapsed distances, and some negative elevations appear in the raw data and should be removed.
  • Not all the issues with the GPS data are uncovered in this notebook (it is an initial exploration), so some additional filtering is applied in subsequent notebooks as deemed necessary. In particular, see this notebook
  • The bulk of the GPS data cleaning operations are carried out on the basis of the list compiled at the end of this notebook, detailed in this section and implemented in this script
library(dplyr)
library(tidyr)
library(janitor)
library(stringr)
library(here)
library(lubridate)
library(sf)
library(terra)
library(tmap)
library(ggplot2)
library(zoo) # useful for serial data
library(R.utils)

source(str_glue("{here()}/scripts/gps-data-utils.R"))

These notes explore the usefulness, accuracy etc. of the provided GPS data.

Of particular interest is the asymmetry or not of movement speeds with respect to slope, given the salience of this for ‘hiking functions’. It is possible in a largely ‘unpathed’ environment that the asymmetry is limited (see e.g. Rees 20041)

We use just one of the datasets to explore general characteristics. Remove a meaningless column x and do a little bit of cleanup on the rcr and date and time columns.

gps_data <- get_gps_data_as_sf(str_glue(
  "{here()}/_data/GPS-1516Season-MiersValley/Fraser-FINAL.csv")) |>
    select(-x) |>
    mutate(rcr = as.logical(rcr),
           date_time = ymd_hms(paste(date, time)))

Exploration of the data

A summary of the data suggests that there are some data rows that should be ignored.

summary(gps_data)
     index        rcr              date               time          
 Min.   :   1   Mode:logical   Length:7323        Length:7323       
 1st Qu.:1832   TRUE:7304      Class :character   Class :character  
 Median :3662   NA's:19        Mode  :character   Mode  :character  
 Mean   :3662                                                       
 3rd Qu.:5492                                                       
 Max.   :7323                                                       
    valid              latitude          n_s              longitude    
 Length:7323        Min.   :-78.12   Length:7323        Min.   :163.7  
 Class :character   1st Qu.:-78.11   Class :character   1st Qu.:163.8  
 Mode  :character   Median :-78.10   Mode  :character   Median :163.8  
                    Mean   :-78.10                      Mean   :163.9  
                    3rd Qu.:-78.10                      3rd Qu.:163.9  
                    Max.   :-78.09                      Max.   :164.2  
     e_w               height_m       speed_km_h           pdop      
 Length:7323        Min.   :-23.1   Min.   :  0.000   Min.   : 0.98  
 Class :character   1st Qu.:102.1   1st Qu.:  0.255   1st Qu.: 1.14  
 Mode  :character   Median :110.8   Median :  0.496   Median : 1.19  
                    Mean   :180.1   Mean   :  1.604   Mean   : 1.27  
                    3rd Qu.:165.7   3rd Qu.:  2.241   3rd Qu.: 1.24  
                    Max.   :636.9   Max.   :209.555   Max.   :13.76  
      hdop              vdop        nsat_used_view       distance_m      
 Min.   : 0.6100   Min.   :0.7600   Length:7323        Min.   :    0.00  
 1st Qu.: 0.7400   1st Qu.:0.8600   Class :character   1st Qu.:    1.38  
 Median : 0.7900   Median :0.8900   Mode  :character   Median :    2.78  
 Mean   : 0.7977   Mean   :0.9756                      Mean   :   21.73  
 3rd Qu.: 0.8400   3rd Qu.:0.9200                      3rd Qu.:   14.99  
 Max.   :13.7200   Max.   :4.7100                      Max.   :74486.45  
          geometry      date_time                     
 POINT        :7323   Min.   :2016-01-13 16:28:31.07  
 epsg:3031    :   0   1st Qu.:2016-01-15 16:43:27.00  
 +proj=ster...:   0   Median :2016-01-19 12:46:32.00  
                      Mean   :2016-01-19 01:37:06.72  
                      3rd Qu.:2016-01-22 08:39:45.00  
                      Max.   :2016-01-23 15:16:47.00  

A speed of > 200km/h was clearly not achieved on foot! Similarly an elapsed distance between 1/30Hz fixes of 74km was not on foot either.

The GPS dilution of precision measures for horizontal and vertical position (hdop and vdop) are mostly within the ‘good’ range (<1) with some outliers. The 3D positions (pdop) are not as good, but are also generally considered reliable (<5). Inspection of a quick map of the data suggests helicopter movement or similar for some ‘singleton’ isolated observations:

ggplot(gps_data) +
  geom_sf(size = 0.25) +
  theme_minimal()

This being the case it seems reasonable to remove observations with clearly unreasonable speed estimates, assuming that these result from use of other forms of transport. However, before doing so, we augment the data with additional estimates of speed, distance, turn angle, etc. calculated from consecutive fixes (it is not advisable to remove any fixes before making these estimates).

Sanity checking the GPS estimates of speed, distance, and height

Before applying that potentially simplistic cleaning of the data, it is also reasonable to sanity check the speed, distance, and height estimates in the GPS data. First we obtain height data from a 10m DEM.

xy <- gps_data |>
  st_coordinates() |>
  as_tibble()

dem <- rast(str_glue("{here()}/_data/input/dry-valleys-10m.tif"))
# and useful for later will be a hillshade
slope <- dem |> terra::terrain("slope", unit = "radians")
aspect <- dem |> terra::terrain("aspect", unit = "radians")
hillshade <- shade(slope, aspect)
TRI <- dem |> terra::terrain("TRI")

heights_from_dem <- dem |>
  terra::extract(xy, method = "simple") |>
  select(2)

slope_from_dem <- slope |>
  terra::extract(xy, method = "simple") |>
  select(2)

TRI_from_dem <- TRI |>
  terra::extract(xy, method = "simple") |>
  select(2)

xy <- xy |>
  bind_cols(heights_from_dem) |>
  rename(Z = `dry-valleys-10m`) |>
  bind_cols(slope_from_dem) |>
  mutate(dem_slope = tan(slope)) |>
  select(-slope) |>
  bind_cols(TRI_from_dem)

gps_data_plus <- gps_data |>
  bind_cols(xy)

Now calculate estimates of distance, change in elevation, and slope between observations, from these numbers and from the GPS data.

gps_data_plus <- gps_data_plus |>
  mutate(dx1 = as.numeric(diff(as.zoo(X), na.pad = TRUE)), 
         dy1 = as.numeric(diff(as.zoo(Y), na.pad = TRUE)),
         dx2 = as.numeric(diff(as.zoo(X), lag = -1, na.pad = TRUE)),
         dy2 = as.numeric(diff(as.zoo(Y), lag = -1, na.pad = TRUE)),
         turn_angle = get_turn_angle(dx1, dy1, dx2, dy2),
         dz = as.numeric(diff(as.zoo(Z), na.pad = TRUE)),
         dh = as.numeric(diff(as.zoo(height_m), na.pad = TRUE)),
         dt = as.numeric(diff(as.zoo(date_time), na.pad = TRUE)),
         dxy = sqrt(dx1^ 2 + dy1 ^ 2),
         dxyz = sqrt(dx1 ^ 2 + dy1 ^ 2 + dz ^ 2),
         dxyh = sqrt(distance_m ^ 2 + dh ^ 2),
         slope_h = dh / distance_m,
         slope_z = dz / dxy,
         speedxy = dxy / dt * 3600 / 1000, speedxyz = dxyz / dt * 3600 / 1000) 

Now some comparisons.

Distance estimates

The distance_m estimates in the provided data are a little higher than the manually calculated point to point straight line distances between fixes. However, examination of sequences of data show the two time series are well aligned (below). Differences may be due to filtering applied by the GPS unit, although this is not well documented.

In any case, the GPS distance_m variable seems safe for use in further analysis.

gps_data_plus |> 
  filter(dt == 30) |>
  slice(2:151) |>
  pivot_longer(cols = c(distance_m, dxy, dxyz)) |>
  ggplot() + 
  geom_line(aes(x = date_time, y = value, colour = name, group = name), alpha = 0.5) +
  xlab("Date/time") + ylab("Estimated distance") +
  theme_minimal()

Speed estimates

Differences in the speed estimates are more pronounced:

gps_data_plus |> 
  filter(dt == 30) |>
  slice(2:151) |>
  pivot_longer(cols = c(speed_km_h, speedxy, speedxyz)) |>
  ggplot() + 
  geom_line(aes(x = date_time, y = value, colour = name, group = name), alpha = 0.5) +
  xlab("Date/time") + ylab("Estimated speed") +
  theme_minimal()

The provided speed estimates are generally higher although there is some alignment between the time series, insofar as extended periods of low speeds tend to line up.

A likely reason for the differences is that in a 30 second period (the usual interval between fixes) an individual may move in many directions so that the direct point to point distance between fixes is less than might be expected based on the recorded speed. So, for example, in the above chart between 16:30 and 16:40 the person was moving not in straight lines, where in the following 10 minutes they were moving in relatively straight lines. We can test this theory, at least qualitatively, by calculating a sinuosity estimate between fixes.

sinuosity <- gps_data_plus |>
  filter(dt == 30) |>
  slice(2:61) |>
  # select(date_time, speed_km_h, dxy, speedxy) |>
  mutate(sinuosity = speed_km_h / 120 / distance_m,
         date_time_start = lag(date_time)) 

speeds <- sinuosity |>
  pivot_longer(cols = c(speed_km_h, speedxy))

ggplot(speeds) +
  geom_step(aes(x = date_time, y = value, group = name, linetype = name), 
            direction = "vh", lwd = 0.5) +
  geom_segment(data = sinuosity, 
               aes(x = date_time_start, xend = date_time, 
                   y = speed_km_h, colour = rank(sinuosity)), lwd = 2) +
  scale_colour_distiller(palette = "Reds", direction = 1) +
  xlab("Date/time") + ylab("Estimated speed") +
  theme_minimal()

Since the estimated sinuosity here is based only on the GPS estimated speed and point to point distances, that it is generally highest when the difference between the GPS reported speed speed_km_h and the manually estimated speed speedxy is largest supports the assumption that the GPS is reporting speed using satellite carrier signal doppler effects.

Based on this check, the GPS speed_km_h variable seems reasonable to use for hiking function estimation.

GPS height and elevation from the DEM

Again, there are differences between heights recorded by the GPS units, and heights extracted from the 10m DEM. Again, it is difficult to know why these might occur, although there is a clear correlation between the two. There is no obvious correlation with surface roughness (TRI), for example, or with the vertical dilution of precision (vdop).

ggplot(gps_data_plus |> 
         filter(dt == 30, distance_m < 60, height_m > 0, Z > 0)) +
  geom_point(aes(x = TRI, y = height_m / Z, colour = vdop), alpha = 0.25) +
  scale_colour_viridis_c(option = "A", direction = -1) +
  theme_minimal()

Pragmatically, in the next section, I find that the relationship between slope and speed based on the GPS readings is much better behaved than that derived from the speed and elevation extracted from the DEM. The reason it can be used to estimate slope in that context is divergences between height_m (from the GPS) and Z (from the DEM) are serially correlated, so that estimates of slope derived from consecutive data points are useable for that purpose.

Some reassurance that the GPS height estimates are suitably aligned is obtained by plotting each days set of observations of the two as time series. On days where much of the activity occurred at the same elevation or across a very limited elevation range there are obvious deviations, but these are relatively small in magnitude (a few metres); on days when more elevation chance is observed, the overall profile of the day’s activity matches well.

gps_data_plus |> 
  filter(dt == 30) |>
  pivot_longer(cols = c(height_m, Z)) |>
  ggplot() + 
  geom_line(aes(x = date_time, y = value, colour = name, group = name), alpha = 0.5) +
  xlab("Date/time") + ylab("Estimated elevation") + 
  facet_wrap(~ date, scales = "free", ncol = 2) +
  theme_minimal()

Again, it seems reasonable to use the GPS unit reported heights for estimation of hiking functions. The main impact on the analysis in the next section is to introduce noise into the slope at low values.

A hiking function

A hiking function relates slope of the land to speed at which it is traversed. Using the GPS unit reported speed and slope derived from GPS unit reported distance and height, locally estimated scatter plot smoothing yields an approximately Gaussian curve with the peak speed at or close to 0 slope. This gives some confidence that we can use these data to estimate hiking functions. The approach implied in this plot is insufficient, as the data contain many fixes that don’t give much information about ‘hiking’ as such. The preponderance of lower turn angle fixes in the more ‘mobile’ parts of the data (i.e. higher speeds) suggests that it may be possible to filter out those data points.

Better estimation of a hiking function is left to a later notebook (see Estimating a hiking function).

gps_data_plus |>
  # here's a possible filter to apply to the data to improve the estimation
  filter(dt == 30, speed_km_h < 10, !is.na(slope_h), distance_m > 2.5, turn_angle < 150) |>
  ggplot(aes(x = slope_h, y = speed_km_h, colour = turn_angle)) +
  geom_point(size = 0.25) +
  geom_smooth(aes(x = slope_h, y = speed_km_h))

model <- loess(speed_km_h ~ slope_h, data = gps_data_plus)

Many data in the above plot are probably from ‘milling about’ in breaks, at camp, or around experiments, rather than while on the move. The best way to remove these sections is probably based on turn angles.

Conclusion: a reasonable GPS file cleaning function

Based on this exploration, GPS data need to be processed as follows (note that no initial conversion to a spatial format is required at this stage, since none of the added spatial data introduced above is required).

  • Read raw data from CSV.
  • Clean names with janitor::clean_names.
  • Remove meaningless variable x.
  • Add a name reflecting the id of the person whose data is.
  • Augment data with dt (time interval between fixes), dh (height change between fixes), slope_h (estimated slope between fixes), and turn_angle (change in direction at this fix).
  • Remove data outside expected latitudinal range (> -78 latitude).
  • Remove fixes with high estimated speeds (>~ 10km/h).
  • Remove fixes with long estimated distances (>~ 60m).
  • Identify breaks in the series where dt != 30 and tag subsets of the data accordingly.
  • Count number of fixes in each subset – consider removing small ones…
  • Drop NAs and any rows where dt != 30.

This is the basis for a function get_cleaned_gps_data() in the script gps-data-utils.R used for cleaning up / augmenting the raw CSV files.

Footnotes

  1. Rees WG. 2004. Least-cost paths in mountainous terrain. Computers & Geosciences 30(3) 203–209. doi: 10.1016/j.cageo.2003.11.001.↩︎