[r] data.table vs dplyr: can one do something well the other can't or does poorly?

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
  2. dplyr has more accessible syntax
  3. dplyr abstracts (or will) potential DB interactions
  4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with it data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested about here).

Question

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
  2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples were the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage
  • dplyr does not allow grouped operations that return arbitrary number of rows (from eddi's question, note: this looks like it will be implemented in dplyr 0.5, also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins
  • data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
  • dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note programmatic use of data.table is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)
Benchmarks

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))

This question is related to r data.table dplyr

The answer is


In direct response to the Question Title...

dplyr definitely does things that data.table can not.

Your point #3

dplyr abstracts (or will) potential DB interactions

is a direct answer to your own question but isn't elevated to a high enough level. dplyr is truly an extendable front-end to multiple data storage mechanisms where as data.table is an extension to a single one.

Look at dplyr as a back-end agnostic interface, with all of the targets using the same grammer, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.

You will never (I hope) see a day that data.table attempts to translate your queries to create SQL statements that operate with on-disk or networked data stores.

dplyr can possibly do things data.table will not or might not do as well.

Based on the design of working in-memory, data.table could have a much more difficult time extending itself into parallel processing of queries than dplyr.


In response to the in-body questions...

Usage

Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).

This may seem like a punt but the real answer is no. People familiar with tools seem to use the either the one most familiar to them or the one that is actually the right one for the job at hand. With that being said, sometimes you want to present a particular readability, sometimes a level of performance, and when you have need for a high enough level of both you may just need another tool to go along with what you already have to make clearer abstractions.

Performance

Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

Again, no. data.table excels at being efficient in everything it does where dplyr gets the burden of being limited in some respects to the underlying data store and registered handlers.

This means when you run into a performance issue with data.table you can be pretty sure it is in your query function and if it is actually a bottleneck with data.table then you've won yourself the joy of filing a report. This is also true when dplyr is using data.table as the back-end; you may see some overhead from dplyr but odds are it is your query.

When dplyr has performance issues with back-ends you can get around them by registering a function for hybrid evaluation or (in the case of databases) manipulating the generated query prior to execution.

Also see the accepted answer to when is plyr better than data.table?


Here's my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).

Syntax

There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!

One really important thing that dplyr does for you is that it constrains your options. I claim that most single table problems can be solved with just five key verbs filter, select, mutate, arrange and summarise, along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs is mapped to a single function. Each function does one job, and is easy to understand in isolation.

You create complexity by piping these simple operations together with %>%. Here's an example from one of the posts Arun linked to:

diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))

Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can be largely mitigated by better autocomplete.

Here's the equivalent data.table code:

diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]

It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated [ in a way that looks good to my eye). Personally, when I look at code I wrote 6 months ago, it's like looking at a code written by a stranger, so I've come to prefer straightforward, if verbose, code.

Two other minor factors that I think slightly decrease readability:

  • Since almost every data table operation uses [ you need additional context to figure out what's happening. For example, is x[y] joining two data tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening.

  • I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation so I think should be obvious when skimming the code, and it's easier to spot group_by() than the by argument to [.data.table.

I also like that the the pipe isn't just limited to just one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with %>%:

diamonds %>% 
  data.table() %>% 
  .[cut != "Fair", 
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ), 
    by = cut
  ] %>% 
  .[order(-Count)]

And the idea of piping with %>% is not limited to just data frames and is easily generalised to other contexts: interactive web graphics, web scraping, gists, run-time contracts, ...)

Memory and performance

I've lumped these together, because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is sufficiently fast enough for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data.

The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend.

All that said, dplyr performance will get better in the long-term. We'll definitely implement some of the great ideas of data.table like radix ordering and using the same index for joins & filters. We're also working on parallelisation so we can take advantage of multiple cores.

Features

A few things that we're planning to work on in 2015:

  • the readr package, to make it easy to get files off disk and in to memory, analogous to fread().

  • More flexible joins, including support for non-equi-joins.

  • More flexible grouping like bootstrap samples, rollups and more

I'm also investing time into improving R's database connectors, the ability to talk to web apis, and making it easier to scrape html pages.


Reading Hadley and Arun's answers one gets the impression that those who prefer dplyr's syntax would have in some cases to switch over to data.table or compromise for long running times.

But as some have already mentioned, dplyr can use data.table as a backend. This is accomplished using the dtplyr package which recently had it's version 1.0.0 release. Learning dtplyr incurs practically zero additional effort.

When using dtplyr one uses the function lazy_dt() to declare a lazy data.table, after which standard dplyr syntax is used to specify operations on it. This would look something like the following:

new_table <- mtcars2 %>% 
  lazy_dt() %>%
  filter(wt < 5) %>% 
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>% 
  summarise(l100k = mean(l100k))

  new_table

#> Source: local data table [?? x 2]
#> Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)), 
#>     keyby = .(cyl)]
#> 
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0 
#> 3     8 14.9 
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

The new_table object is not evaluated until calling on it as.data.table()/as.data.frame()/as_tibble() at which point the underlying data.table operation is executed.

I've recreated a benchmark analysis done by data.table author Matt Dowle back at December 2018 which covers the case of operations over large numbers of groups. I've found that dtplyr indeed enables for the most part those who prefer the dplyr syntax to keep using it while enjoying the speed offered by data.table.


Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to data.table

Error: package or namespace load failed for ggplot2 and for data.table Select subset of columns in data.table R How to get week numbers from dates? data.table vs dplyr: can one do something well the other can't or does poorly? How to replace NA values in a table for selected columns Select multiple columns in data.table by their numeric indices Sort rows in data.table in decreasing order on string key `order(-x,v)` gives error on data.table 1.9.4 or earlier Cleaning `Inf` values from an R dataframe Aggregate / summarize multiple variables per group (e.g. sum, mean) How do you delete a column by name in data.table?

Examples related to dplyr

R dplyr: Drop multiple columns How to specify "does not contain" in dplyr filter Select first and last row from grouped data Error: could not find function "%>%" Sum across multiple columns with dplyr Removing NA observations with dplyr::filter() Changing factor levels with dplyr mutate Change value of variable with dplyr dplyr change many data types What does %>% function mean in R?