R - Overview of v1.0.0 of dtplyr and dplyr
In this post, we discuss v1.0.0 of the dtplyr package and some changes made in the v1.0.0 update of dplyr. This post essentially summarizes two posts: Hadley Wickham's dtplyr 1.0.0 announcement and a benchmarking post by Sebastian Wolf. As these are two short, well-written posts, I highly recommend simply reading them. If not, hopefully this post will inspire you to read them afterwards.
Dtplyr
Summary
If you prefer dplyr syntax over data.table, use dtplyr when your data sets start getting larger (more than roughly 10,000 rows). Using dtplyr only requires two additional lines of code.
Implementation
Version 1.0.0 of the dtplyr package was introduced in late 2019 by Hadley Wickham. As the implementation is simple, I highly recommend reading his original blog post; it is a short read and discusses how the package works.
To summarize the post, dtplyr converts your dplyr syntax to data.table equivalents, likely speeding up your processing. In a blog post by Sebastian Wolf, he found that in some testing scenarios there are clear benefits to using dtplyr over dplyr when there are more than 10,000 rows. Not bad!
Implementation is much simpler than one might imagine. Simply specify your dataset as a lazy_dt(), proceed with your normal dplyr wrangling, but end your piping with one of the following: as.data.table(), as.data.frame(), as_tibble(), collect(), or pull().
Here’s a quick example:
library(dplyr)
library(dtplyr)

# Set as lazy_dt (stored under a new name to avoid masking the built-in iris)
iris_dt <- lazy_dt(iris)

# Conduct wrangling, then collect the result as a data frame
iris_dt %>%
  filter(Species == 'setosa') %>%
  as.data.frame()
As you can see, the only two changes required are the lazy_dt() call at the start and the as.data.frame() call at the end. For such simple work, there is little to no reason not to incorporate this when datasets get larger.
Possible Concerns
A natural follow-up is: what is the cost of using dtplyr? We will quickly summarize the concerns described in the original blog post.

While this package does have the extra cost of translating the dplyr code to a data.table equivalent, that cost is often negligible (under 1 ms per dplyr call). Translation cost should not be a concern.
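If you are curious what the translation looks like, you can inspect it with show_query(), which prints the data.table code dtplyr generated from the pipeline without executing it. A small sketch using the built-in iris data:

```r
library(dplyr)
library(dtplyr)

lt <- lazy_dt(iris)

# show_query() prints the generated data.table call, so you can
# sanity-check the translation before collecting results
lt %>%
  filter(Species == "setosa") %>%
  summarise(mean_sl = mean(Sepal.Length)) %>%
  show_query()
```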
There are some additional overhead costs when using mutate(), and thus if issues arise, explore the immutable argument of the lazy_dt() function.
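As a sketch of that argument: copying before the first mutate() is the default, and setting immutable = FALSE trades that safety for speed.

```r
library(dplyr)
library(dtplyr)

df <- data.frame(x = 1:5)

# Default: dtplyr copies the data before the first mutate(),
# so the original df is never modified
lt_safe <- lazy_dt(df)

# immutable = FALSE skips that copy; mutate() may then modify
# the underlying data.table by reference
lt_fast <- lazy_dt(df, immutable = FALSE)

lt_safe %>%
  mutate(y = x * 2) %>%
  as.data.frame()
```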
Additionally, not all functions within data.table are ported over to dtplyr, and thus one should still keep an eye on the original package if desired.
Lastly, we do note that directly using data.table will provide a more optimal result, and so if speed is a priority, one may consider directly using data.table syntax.
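For comparison, the earlier filter written directly in data.table syntax looks like this (a sketch; real pipelines would chain more steps inside the brackets):

```r
library(data.table)

dt <- as.data.table(iris)

# data.table does row filtering in the first argument of [ ]
setosa <- dt[Species == "setosa"]
nrow(setosa)
```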
Dplyr 1.0.0 Additions To Watch
relocate() - easily change the position of columns
rowwise() - enables you to mutate by rows instead of columns (also see colwise()). Additional information is provided in this blog post by Wickham.
nest_by() - similar to group_by(), but visibly makes a change to the data: each group is now one row, and the data is collapsed into a list-column.
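Minimal sketches of the three additions above, using built-in data:

```r
library(dplyr)

# relocate(): move Species to be the first column
iris %>% relocate(Species)

# rowwise(): sum is computed within each row, not down columns
df <- data.frame(x = 1:3, y = 4:6)
df %>% rowwise() %>% mutate(total = sum(c(x, y)))

# nest_by(): one row per Species, data collapsed into a list-column
iris %>% nest_by(Species)
```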
Additionally, there is greater emphasis on the across() function, especially for use within the rowwise() and summarise() functions. The across() function enables you to apply a summary function to multiple columns, such as summarise(across(where(is.numeric), mean)), which calculates the mean of every numeric variable. Additionally, one can simply state all the variables, such as summarise(across(c('Var1', 'Var2'), mean)), which would calculate the mean of the two specified variables.
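Both forms side by side, as a sketch on the built-in iris data:

```r
library(dplyr)

# Mean of every numeric column in one call
iris %>% summarise(across(where(is.numeric), mean))

# Mean of only the named columns
iris %>% summarise(across(c(Sepal.Length, Sepal.Width), mean))
```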
By using across(), we avoid needless repetition when trying to calculate the summary statistics of multiple variables.