R - Overview of v1.0.0 of dtplyr and dplyr

Date Posted:

In this post, we discuss v1.0.0 of the dtplyr package and some changes made in the v1.0.0 update of dplyr. This post essentially summarizes these two posts:

As these are two short, well-written posts, I highly recommend simply reading them directly. If you don't, hopefully this post will inspire you to read them afterwards.

dtplyr

Summary

If you prefer dplyr syntax over data.table, use dtplyr when your data sets start getting larger (>10,000 rows). Using dtplyr requires only two additional lines of code.

Implementation

Version 1.0.0 of the dtplyr package was introduced in late 2019 by Hadley Wickham. As the implementation is simple, I highly recommend reading his original blog post; it is a short read and discusses how the package works.

To summarize the post, dtplyr converts your dplyr syntax to its data.table equivalent, likely speeding up your processing. In a blog post by Sebastian Wolf, he found that in some testing scenarios there are clear benefits to using dtplyr over dplyr once a data set exceeds 10,000 rows. Not bad!

Implementation is much simpler than one might imagine: wrap your data set in lazy_dt(), proceed with your normal dplyr wrangling, and end your pipe with one of the following: as.data.table(), as.data.frame(), as_tibble(), collect(), or pull().

Here’s a quick example:

# Load the required packages
library(dplyr)
library(dtplyr)

# Set as lazy_dt
iris <- lazy_dt(iris)

# Conduct wrangling
iris %>%
  filter(Species == 'setosa') %>%
  as.data.frame()

As you can see, the only two changes required are the lazy_dt() call and the as.data.frame() call at the end. For such simple work, there is little to no reason not to incorporate this when data sets get larger.
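If you are curious about what dtplyr actually generates, show_query() will print the data.table translation of a pipeline before it is evaluated. A minimal sketch using the built-in iris data set:

```r
library(dplyr)
library(dtplyr)

# Wrap the data, pipe as usual, then inspect the generated data.table call
lazy_dt(iris) %>%
  filter(Species == 'setosa') %>%
  show_query()
```

Nothing is computed until you end the pipe with one of the collectors listed above; show_query() just reveals the translation.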

Possible Concerns

A natural follow-up is to ask what the cost of using dtplyr is. We will quickly summarize the concerns described in the original blog post.

While this package does have the extra cost of translating the dplyr code to its data.table equivalent, that cost is often negligible (under 1 ms per dplyr call). Translation cost should not be a concern.

There is some additional copying overhead when using mutate(), so if issues arise, explore the immutable argument of the lazy_dt() function.
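A minimal sketch of that argument: with immutable = FALSE, dtplyr allows data.table to update the underlying table by reference, avoiding a defensive copy on each mutate() (note this means the input table itself may be modified).

```r
library(dplyr)
library(dtplyr)
library(data.table)

dt <- as.data.table(data.frame(x = 1:5))

# immutable = FALSE permits modification by reference,
# skipping the copy that mutate() would otherwise trigger
dt %>%
  lazy_dt(immutable = FALSE) %>%
  mutate(y = x * 2) %>%
  as_tibble()
```

Only reach for this if profiling shows mutate() copies are a genuine bottleneck.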

Additionally, not all of data.table's functionality is ported over to dtplyr, so one should still keep an eye on the original package if needed.

Lastly, we note that writing data.table directly will generally be faster still, so if speed is the top priority, one may consider using data.table syntax itself.
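For reference, the earlier filter written directly in data.table's DT[i, j, by] form looks like this (a sketch on the built-in iris data set):

```r
library(data.table)

# Convert once, then filter rows where Species == 'setosa'
dt <- as.data.table(iris)
setosa <- dt[Species == 'setosa']
```

This is the code dtplyr would generate for you; writing it by hand simply removes the translation layer.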

dplyr 1.0.0 Additions to Watch

Full post by Hadley Wickham.

  • relocate() - easily change the position of columns
  • rowwise() - enables you to compute by rows instead of by columns (also see c_across()). Additional information is provided in this blog post by Wickham.
  • nest_by() - similar to group_by(), but visibly changes the data: each group becomes one row, and its data is collapsed into a list-column.
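A quick sketch of these verbs on the built-in iris data set (exact printed output varies by dplyr version):

```r
library(dplyr)

# relocate(): move Species to the front of the column order
iris %>% relocate(Species)

# rowwise(): per-row mean across the two sepal columns
iris %>%
  rowwise() %>%
  mutate(sepal_mean = mean(c(Sepal.Length, Sepal.Width))) %>%
  ungroup()

# nest_by(): one row per species, remaining columns in a list-column
iris %>% nest_by(Species)
```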

Additionally, there is greater emphasis on the across() function, especially for use within the rowwise() and summarise() functions. The across() function enables you to apply a summary function to multiple columns, such as summarise(across(where(is.numeric), mean)), which calculates the mean of every numeric variable. Alternatively, one can name the variables explicitly, as in summarise(across(c('Var1', 'Var2'), mean)), which calculates the mean of the two specified variables.

By using across(), we avoid needless repetition when trying to calculate the summary statistics of multiple variables.
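Putting across() to work on the built-in iris data set, a minimal sketch of both forms described above:

```r
library(dplyr)

# Mean of every numeric column, by species
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))

# Mean of two named columns only
iris %>%
  summarise(across(c('Sepal.Length', 'Sepal.Width'), mean))
```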