image source xkcd.com

Data wrangling

Processing pipelines using %>%

Tidy data operations in dplyr

filter

select

mutate

from xkcd.com

Are you ready for the tidyverse data munging operations?

Filtering with filter

Selection with select

Mutating data with mutate

image source commons.wikimedia.org by W.carter

Actually, no... you’re not

Before we go any further...

Meet the pipe

%>%

What the pipe is for

Imagine a series of operations we want to apply to some dataset

We’ll call the dataset ds

And the operations op_a, op_b and op_c


            # apply op_a
            ds1 <- op_a(ds)
            # then apply op_b
            ds2 <- op_b(ds1)
            # then apply op_c
            ds3 <- op_c(ds2)
            

Chaining them together

Gets messy very quickly


            # do it all in one line
            ds3 <- op_c(op_b(op_a(ds)))
            

And because function calls nest inside one another,
the operations are written in the reverse of the order in which they run

And if there are additional parameters...

It’s even worse:


            ds3 <- op_c(op_b(op_a(ds, op_a_param=...), op_b_param=...), op_c_param=...)
            

Using pipes instead


            ds3 <- ds %>%
              op_a() %>%
              op_b() %>%
              op_c()
            

This reads something like

“Take ds, then apply op_a, then apply op_b, then apply op_c”
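To make this concrete, here is a minimal sketch that swaps the placeholder operations for real base R functions. The vector `values` and the choice of `sqrt()`, `sum()` and `round()` are illustrative assumptions, not part of the original example.

```r
library(magrittr)  # provides %>% (dplyr re-exports the same operator)

# illustrative input, standing in for ds
values <- c(2, 3)

# nested version: written inside out, with the extra parameter at the far end
nested <- round(sum(sqrt(values)), digits = 1)

# piped version: written in the order the steps run,
# with each parameter next to the function it belongs to
piped <- values %>%
  sqrt() %>%
  sum() %>%
  round(digits = 1)

# both forms compute the same result
same_result <- identical(nested, piped)
```

The pipe passes the value on its left as the first argument of the function on its right, which is why `round(digits = 1)` needs no explicit data argument.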

# For workflow type situations

## The pipeline idea is a natural fit

## It is easier to read

## Easier to switch steps in or out

## (Apparently) it’s also more obvious to beginners... :-)

Now you’re ready for the tidyverse data wrangling operations

Filtering rows with filter

Selecting columns with select

Mutating variables with mutate

filter

Filters a dataset down to keep only rows of interest


            result <- input %>%
              filter(*condition*)
            
For example

            welly_land <- welly %>%
              filter(LandAreaSQ > 0)
            

Conditions

Specify that a variable must lie in some range,
or be equal to, or not equal to, some value


            # equal is '=='
            welly_water <- welly %>%
              filter(LandAreaSQ == 0)

            # not equal is '!='
            welly_land <- welly %>%
              filter(LandAreaSQ != 0)

            # here's another example
            welly_populated <- welly %>%
              filter(pop > 0)
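The `welly` dataset is not included here, so the sketch below uses a small made-up table, `toy`, with the same column names to show what each kind of condition keeps. The values are invented for illustration, and the range example uses dplyr's `between()` helper.

```r
library(dplyr)

# hypothetical stand-in for welly (invented values)
toy <- tibble(
  MeshblockN = c("A", "B", "C", "D"),
  LandAreaSQ = c(0, 0, 1.5, 2.0),
  pop        = c(0, 12, 0, 340)
)

# equal to a value: rows with no land area
toy_water <- toy %>%
  filter(LandAreaSQ == 0)

# not equal to a value: rows with some land area
toy_land <- toy %>%
  filter(LandAreaSQ != 0)

# within a range (inclusive at both ends)
toy_mid_pop <- toy %>%
  filter(between(pop, 10, 100))
```

With this table, `toy_water` keeps rows A and B, `toy_land` keeps C and D, and `toy_mid_pop` keeps only B.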
            

Combining conditions

Combine more than one condition using & (and) and | (or),
by separating conditions with commas, or by piping one filter into the next


            # using &
            welly_aquanauts <- welly %>%
                filter(LandAreaSQ == 0 & pop > 0)

            # conditions separated by commas are combined with AND
            welly_aquanauts <- welly %>%
                filter(LandAreaSQ == 0, pop > 0)

            # or you can pipe things
            welly_aquanauts <- welly %>%
                filter(LandAreaSQ == 0) %>%
                filter(pop > 0)
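The three forms above are all "and" combinations. As a sketch of how `|` differs, the example below again uses a made-up table, `toy`, in place of `welly` (the values are invented for illustration).

```r
library(dplyr)

# hypothetical stand-in for welly (invented values)
toy <- tibble(
  MeshblockN = c("A", "B", "C", "D"),
  LandAreaSQ = c(0, 0, 1.5, 2.0),
  pop        = c(0, 12, 0, 340)
)

# AND: both conditions must hold
both_and <- toy %>%
  filter(LandAreaSQ == 0 & pop > 0)

# the comma form and the piped form give the same rows
both_comma <- toy %>%
  filter(LandAreaSQ == 0, pop > 0)
both_piped <- toy %>%
  filter(LandAreaSQ == 0) %>%
  filter(pop > 0)

# OR: at least one condition must hold
either <- toy %>%
  filter(LandAreaSQ == 0 | pop == 0)
```

Here the "and" versions keep only row B, while the "or" version keeps rows A, B and C. Note that "or" can only be written with `|`; commas and chained filters always mean "and".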
            

select

Selects columns (i.e. variables) to retain or remove


            welly_reduced <- welly %>%
                select(MeshblockN)

            welly_reduced <- welly %>%
                select(MeshblockN:UrbanAreaN)

            welly_reduced <- welly %>%
                select(1:3, UrbanAreaN, 7:11)
            

You can also do clever things like starts_with(), ends_with(), contains()


            welly_reduced <- welly %>%
                select(starts_with("urban"))
                # also ends_with(), contains(), matches()
            

You can also prefix any selector with a - sign
so that it removes that selection instead


            welly_reduced <- welly %>%
                select(-MeshblockN)

            welly_reduced <- welly %>%
                select(-(MeshblockN:UrbanAreaN))

            welly_reduced <- welly %>%
                select(-(1:3), -UrbanAreaN, -(7:11))

            welly_reduced <- welly %>%
                select(-starts_with("urban"))
                # also ends_with(), contains(), matches()
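As a runnable sketch of these selectors, the example below uses a small made-up table, `toy`, rather than `welly`. The column `UrbanAreaT` and all values are invented for illustration. It relies on `starts_with()` ignoring case by default, which is why `"urban"` matches `UrbanAreaN`.

```r
library(dplyr)

# hypothetical stand-in for welly (invented columns and values)
toy <- tibble(
  MeshblockN = c("A", "B"),
  UrbanAreaN = c("Wellington", "Porirua"),
  UrbanAreaT = c("Main", "Main"),
  LandAreaSQ = c(1.5, 2.0),
  pop        = c(12, 340)
)

# a range of adjacent columns, by name
keep_range <- toy %>%
  select(MeshblockN:UrbanAreaN)

# columns by position, mixed with a column name
by_position <- toy %>%
  select(1:2, pop)

# columns whose names start with "urban" (case is ignored by default)
keep_urban <- toy %>%
  select(starts_with("urban"))

# the same selector with a - sign drops those columns instead
drop_urban <- toy %>%
  select(-starts_with("urban"))
```

With this table, `keep_urban` holds `UrbanAreaN` and `UrbanAreaT`, and `drop_urban` holds the remaining three columns.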
            

mutate

Adds new variables based on results of various calculations


            result <- input %>%
                mutate(sum_xy = x + y,
                       diff_xy = x - y,
                       pc_diff = diff_xy / sum_xy * 100)
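Here is the same pattern as a worked sketch with concrete numbers. The table `input` and its values are invented for illustration. Note that `pc_diff` can use `sum_xy` and `diff_xy` because `mutate()` makes each new variable available to the expressions that follow it.

```r
library(dplyr)

# illustrative input (invented values)
input <- tibble(
  x = c(3, 6),
  y = c(1, 2)
)

result <- input %>%
  mutate(sum_xy  = x + y,                    # 4, 8
         diff_xy = x - y,                    # 2, 4
         pc_diff = diff_xy / sum_xy * 100)   # 50, 50
```

The original columns `x` and `y` are kept, and the three new columns are added after them.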
            

You can also use the across() function to apply a calculation only to selected columns


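            # change the type of every integer column
            result <- input %>%
                mutate(across(where(is.integer), as.character))

            # convert columns whose names match a pattern to factors
            result <- input %>%
                mutate(across(matches("MeshblockN"), as.factor))

            # express numeric columns as a percentage of total
            # (total is assumed to be a numeric column of input, so it is selected too)
            result <- input %>%
                mutate(across(where(is.numeric), ~ . / total * 100))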
            

This gets complicated; ask about it if you think you need it later...
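For reference, a small worked sketch of `across()` on a made-up table (all column names and values here are invented for illustration, and `across()` requires dplyr 1.0.0 or later). It names the columns to rescale explicitly, which is one way to leave the `total` column itself untouched.

```r
library(dplyr)

# illustrative input (invented values)
input <- tibble(
  id    = 1:2,          # integer column
  a     = c(1, 3),
  b     = c(3, 1),
  total = c(4, 4)
)

result <- input %>%
  # turn every integer column into character
  mutate(across(where(is.integer), as.character)) %>%
  # express a and b as percentages of total, leaving total as it is
  mutate(across(c(a, b), ~ .x / total * 100))
```

After this, `id` is a character column, `a` is 25 and 75, `b` is 75 and 25, and `total` is unchanged.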

# A key takeaway

## Use the help!

## `?mutate` (or whatever)

## Almost anything is possible

## Use Google, Stack Overflow, etc.

## Once you have a feel for things, if you get stuck, post questions!

from xkcd.com