Here's my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).
There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!
One really important thing that dplyr does for you is that it constrains your options. I claim that most single-table problems can be solved with just five key verbs: `filter()`, `select()`, `mutate()`, `arrange()` and `summarise()`, along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs is mapped to a single function. Each function does one job, and is easy to understand in isolation.
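To make that concrete, here's a minimal sketch of each verb on its own, using the built-in mtcars data (my choice purely for illustration; any data frame works the same way):

library(dplyr)

filter(mtcars, cyl == 6)                # pick rows matching a condition
select(mtcars, mpg, cyl, wt)            # pick a subset of columns
mutate(mtcars, kpl = mpg * 0.425)       # add a column derived from others
arrange(mtcars, desc(mpg))              # reorder rows
summarise(mtcars, avg_mpg = mean(mpg))  # collapse many rows into a summary

# and the "by group" adverb:
summarise(group_by(mtcars, cyl), avg_mpg = mean(mpg))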
You create complexity by piping these simple operations together with `%>%`. Here's an example from one of the posts Arun linked to:
diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))
Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than `[`, but I think that can be largely mitigated by better autocomplete.
Here's the equivalent data.table code:
diamondsDT <- data.table(diamonds)

diamondsDT[
  cut != "Fair",
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ),
  by = cut
][
  order(-Count)
]
It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated `[` in a way that looks good to my eye.) Personally, when I look at code I wrote 6 months ago, it's like looking at code written by a stranger, so I've come to prefer straightforward, if verbose, code.
Two other minor factors that I think slightly decrease readability:
- Since almost every data.table operation uses `[`, you need additional context to figure out what's happening. For example, is `x[y]` joining two data tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening. (There's a sketch of the ambiguity just after this list.)
- I like that `group_by()` is a separate operation in dplyr. It fundamentally changes the computation, so I think it should be obvious when skimming the code, and it's easier to spot `group_by()` than the `by` argument to `[.data.table`.
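To illustrate the first point, here's a small sketch (the tables and key are made up purely for illustration) where the same `x[y]` syntax means two different things depending on what goes inside the brackets:

library(data.table)

x <- data.table(id = 1:3, val = c("a", "b", "c"), key = "id")
y <- data.table(id = 2:3, score = c(10, 20), key = "id")

x[y]        # y is a keyed data.table, so this is a join
x[c(1, 3)]  # a numeric vector, so this is plain row subsetting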
I also like that the pipe isn't limited to just one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write: anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with `%>%`:
diamonds %>%
  data.table() %>%
  .[cut != "Fair",
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ),
    by = cut
  ] %>%
  .[order(-Count)]
And the idea of piping with `%>%` is not limited to just data frames and is easily generalised to other contexts: interactive web graphics, web scraping, gists, run-time contracts, and so on.
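Here's a small sketch of what I mean by anyone being able to extend the pipe: `top_n_by_count()` below is a made-up helper, not part of any package, but because its first argument is a data frame it slots straight into a pipeline:

library(dplyr)
library(ggplot2)  # for the diamonds data

# A hypothetical user-written verb: an ordinary function whose first
# argument is a data frame and whose result is a data frame.
top_n_by_count <- function(df, n) {
  head(arrange(df, desc(Count)), n)
}

diamonds %>%
  group_by(cut) %>%
  summarise(Count = n()) %>%
  top_n_by_count(3)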
I've lumped these together because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is fast enough for data of that size that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data.
The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend.
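As a rough sketch of the database idea (the exact functions for this have moved between packages over time, so treat the details as illustrative rather than canonical), the same verbs can run against a table in SQLite, with nothing pulled into R until you ask for it:

library(dplyr)
library(ggplot2)  # for the diamonds data
library(DBI)      # also needs the RSQLite and dbplyr packages installed

# Copy the data into a throwaway in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, diamonds, "diamonds")

# The same dplyr verbs, translated to SQL and run inside the database.
tbl(con, "diamonds") %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarise(AvgPrice = mean(price, na.rm = TRUE), Count = n()) %>%
  collect()  # bring the result back into R as an ordinary data frame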
All that said, dplyr performance will get better in the long term. We'll definitely implement some of the great ideas of data.table, like radix ordering and using the same index for joins and filters. We're also working on parallelisation so we can take advantage of multiple cores.
A few things that we're planning to work on in 2015:
- The readr package, to make it easy to get files off disk and into memory, analogous to `fread()`. (A quick sketch of the intended usage follows this list.)
- More flexible joins, including support for non-equi-joins.
- More flexible grouping, like bootstrap samples, rollups and more.
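To give a flavour of the readr item (the file name below is hypothetical, and this sketches the intended interface rather than a finished API):

library(readr)
library(data.table)

# Both read a CSV from disk into memory; "flights.csv" is a hypothetical file.
flights_tbl <- read_csv("flights.csv")  # returns a data frame (tbl_df)
flights_dt  <- fread("flights.csv")     # returns a data.table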
I'm also investing time in improving R's database connectors, the ability to talk to web APIs, and making it easier to scrape HTML pages.