I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
2. dplyr has more accessible syntax
3. dplyr abstracts (or will) potential DB interactions

In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).
What I want to know is:

1. Are there analytical tasks that are a lot easier to code with one package or the other for people familiar with the packages?
2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?
One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):
library(dplyr)

dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))
Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert and Arun; note here I favored a single statement over the strictly most optimal solution):
library(data.table)

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
  by = list(id, job)
]
The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. it doesn't use some of the more esoteric tricks).
Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.
Summarising what has emerged from the answers and comments below:

Features:

- dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5, and @beginneR shows a potential work-around using do in the answer to @eddi's question).
- data.table supports rolling joins (thanks @dholstius) as well as overlap joins (see the rolling-join sketch after the data below).
- data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
- dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note: programmatic use of data.table is definitely possible, it just requires some careful thought and substitution/quoting, etc., at least to my knowledge).

Benchmarks:

- With very large numbers of groups (>100K), data.table becomes substantially faster.
- data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark when trying to get unique values has data.table ~6x faster.
- data.table was 75% faster on larger versions of a group/apply/sort, while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).
- Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
- An older benchmark has data.table ~8x faster.

Data: This is for the first example I showed in the question section.
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane",
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob",
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L,
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L,
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager",
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager",
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L,
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id",
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA,
-16L))
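Since rolling joins are mentioned above but not shown, here is a minimal sketch (the table names and values are made up for illustration): a rolling join matches each row of the query table to the most recent prior observation in the keyed table.

library(data.table)

# quotes: observed prices; trades: times at which we want the prevailing price
quotes <- data.table(time = c(1, 2, 3), price = c(100, 101, 102))
trades <- data.table(time = c(1.5, 2.5, 4.0), trade_id = 1:3)

setkey(quotes, time)
setkey(trades, time)

# roll = TRUE: for each trade time, take the most recent quote at or before it
quotes[trades, roll = TRUE]
#>    time price trade_id
#> 1:  1.5   100        1
#> 2:  2.5   101        2
#> 3:  4.0   102        3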
dplyr definitely does things that data.table can not. Your point #3,

dplyr abstracts (or will) potential DB interactions

is a direct answer to your own question but isn't elevated to a high enough level. dplyr is truly an extensible front-end to multiple data storage mechanisms, whereas data.table is an extension to a single one.
Look at dplyr as a back-end agnostic interface, with all of the targets using the same grammar, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.
You will never (I hope) see a day when data.table attempts to translate your queries into SQL statements that operate on on-disk or networked data stores.
dplyr can possibly do things data.table will not or might not do as well. Based on the design of working in-memory, data.table could have a much more difficult time extending itself into parallel processing of queries than dplyr.
Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?
This may seem like a punt, but the real answer is no. People familiar with tools seem to use either the one most familiar to them or the one that is actually the right one for the job at hand. With that being said, sometimes you want to present a particular readability, sometimes a level of performance, and when you need a high enough level of both you may just need another tool to go along with what you already have to make clearer abstractions.
Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?
Again, no. data.table excels at being efficient in everything it does, whereas dplyr bears the burden of being limited in some respects to the underlying data store and registered handlers.

This means that when you run into a performance issue with data.table you can be pretty sure it is in your query function, and if it is actually a bottleneck with data.table then you've won yourself the joy of filing a report. This is also true when dplyr is using data.table as the back-end; you may see some overhead from dplyr, but odds are it is your query.
When dplyr has performance issues with back-ends you can get around them by registering a function for hybrid evaluation or (in the case of databases) by manipulating the generated query prior to execution.
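For instance, with a database back-end you can inspect (and, if necessary, hand-tune) the SQL that dplyr generates before it runs. A minimal sketch, assuming the DBI, RSQLite and dbplyr packages are available (the table name and data are made up):

library(dplyr)

# Illustrative in-memory database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "delays",
                  data.frame(origin = c("JFK", "LGA", "JFK"),
                             dep_delay = c(10, 25, 40)))

q <- tbl(con, "delays") %>%
  group_by(origin) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE))

show_query(q)  # inspect the generated SQL before execution
collect(q)     # only now does the query actually run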
Also see the accepted answer to when is plyr better than data.table?
Here's my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).
There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!
One really important thing that dplyr does for you is that it constrains your options. I claim that most single table problems can be solved with just five key verbs: filter, select, mutate, arrange and summarise, along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs is mapped to a single function. Each function does one job, and is easy to understand in isolation.
You create complexity by piping these simple operations together with %>%. Here's an example from one of the posts Arun linked to:
library(ggplot2)  # for the diamonds dataset
library(dplyr)

diamonds %>%
filter(cut != "Fair") %>%
group_by(cut) %>%
summarize(
AvgPrice = mean(price),
MedianPrice = as.numeric(median(price)),
Count = n()
) %>%
arrange(desc(Count))
Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can be largely mitigated by better autocomplete.
Here's the equivalent data.table code:
library(data.table)

diamondsDT <- data.table(diamonds)
diamondsDT[
cut != "Fair",
.(AvgPrice = mean(price),
MedianPrice = as.numeric(median(price)),
Count = .N
),
by = cut
][
order(-Count)
]
It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated [ in a way that looks good to my eye.) Personally, when I look at code I wrote 6 months ago, it's like looking at code written by a stranger, so I've come to prefer straightforward, if verbose, code.
Two other minor factors that I think slightly decrease readability:
Since almost every data.table operation uses [, you need additional context to figure out what's happening. For example, is x[y] joining two data tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening (see the small sketch after this list).
I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation, so I think it should be obvious when skimming the code, and it's easier to spot group_by() than the by argument to [.data.table.
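To make the first point concrete, here is a small sketch (with made-up tables) of two subsets that look alike but mean very different things:

library(data.table)

x <- data.table(id = c(10, 20, 30), v = c("a", "b", "c"), key = "id")
y <- data.table(id = c(10, 30))

x[y]    # a keyed join: rows of x whose id appears in y
x[1:2]  # positional subsetting: the first two rows of x
# Both are `x[...]`; only the type of the index tells you which you get.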
I also like that the pipe isn't limited to just one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with %>%:
diamonds %>%
data.table() %>%
.[cut != "Fair",
.(AvgPrice = mean(price),
MedianPrice = as.numeric(median(price)),
Count = .N
),
by = cut
] %>%
.[order(-Count)]
And the idea of piping with %>% is not limited to just data frames; it is easily generalised to other contexts: interactive web graphics, web scraping, gists, run-time contracts, ...
I've lumped these together, because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is sufficiently fast for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data.
The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend.
All that said, dplyr performance will get better in the long-term. We'll definitely implement some of the great ideas of data.table like radix ordering and using the same index for joins & filters. We're also working on parallelisation so we can take advantage of multiple cores.
A few things that we're planning to work on in 2015:

- The readr package, to make it easy to get files off disk and into memory, analogous to fread() (a tiny comparison follows this list).
- More flexible joins, including support for non-equi-joins.
- More flexible grouping, like bootstrap samples, rollups and more.
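As a point of comparison for the readr item, the two interfaces are expected to look roughly like this (a hypothetical "data.csv" path, shown only for illustration):

library(readr)
library(data.table)

df <- read_csv("data.csv")  # readr: parses the file into a tibble
dt <- fread("data.csv")     # data.table: parses the file into a data.table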
I'm also investing time into improving R's database connectors, the ability to talk to web APIs, and making it easier to scrape HTML pages.
Reading Hadley's and Arun's answers, one gets the impression that those who prefer dplyr's syntax would in some cases have to switch over to data.table or compromise on long running times.

But as some have already mentioned, dplyr can use data.table as a backend. This is accomplished using the dtplyr package, which recently had its version 1.0.0 release. Learning dtplyr incurs practically zero additional effort.
When using dtplyr, one uses the function lazy_dt() to declare a lazy data.table, after which standard dplyr syntax is used to specify operations on it. This would look something like the following:
library(data.table)
library(dplyr)
library(dtplyr)

new_table <- mtcars %>%
lazy_dt() %>%
filter(wt < 5) %>%
mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
group_by(cyl) %>%
summarise(l100k = mean(l100k))
new_table
#> Source: local data table [?? x 2]
#> Call: `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)),
#> keyby = .(cyl)]
#>
#> cyl l100k
#> <dbl> <dbl>
#> 1 4 9.05
#> 2 6 12.0
#> 3 8 14.9
#>
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
The new_table object is not evaluated until you call as.data.table()/as.data.frame()/as_tibble() on it, at which point the underlying data.table operation is executed.
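For example (values taken from the lazy preview above), collecting the result into a tibble triggers the computation:

as_tibble(new_table)
#> # A tibble: 3 x 2
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0
#> 3     8 14.9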
I've recreated a benchmark analysis done by data.table author Matt Dowle back in December 2018, which covers the case of operations over large numbers of groups. I found that dtplyr indeed enables, for the most part, those who prefer the dplyr syntax to keep using it while enjoying the speed offered by data.table.
Source: Stackoverflow.com