[r] Counting unique / distinct values by group in a data frame

Here is a benchmark of @David Arenburg's solution there as well as a recap of some solutions posted here (@mnel, @Sven Hohenstein, @Henrik):

library(dplyr)
library(data.table)
library(microbenchmark)
library(tidyr)
library(ggplot2)

df <- mtcars
DT <- as.data.table(df)
DT_32k <- rbindlist(replicate(1e3, mtcars, simplify = FALSE))
df_32k <- as.data.frame(DT_32k)
DT_32M <- rbindlist(replicate(1e6, mtcars, simplify = FALSE))
df_32M <- as.data.frame(DT_32M)
bench <- microbenchmark(
  base_32 = aggregate(hp ~ cyl, df, function(x) length(unique(x))),
  base_32k = aggregate(hp ~ cyl, df_32k, function(x) length(unique(x))),
  base_32M = aggregate(hp ~ cyl, df_32M, function(x) length(unique(x))),
  dplyr_32 = summarise(group_by(df, cyl), count = n_distinct(hp)),
  dplyr_32k = summarise(group_by(df_32k, cyl), count = n_distinct(hp)),
  dplyr_32M = summarise(group_by(df_32M, cyl), count = n_distinct(hp)),
  data.table_32 = DT[, .(count = uniqueN(hp)), by = cyl],
  data.table_32k = DT_32k[, .(count = uniqueN(hp)), by = cyl],
  data.table_32M = DT_32M[, .(count = uniqueN(hp)), by = cyl],
  times = 10
)

Results:

print(bench)

# Unit: microseconds
#            expr          min           lq         mean       median           uq          max neval  cld
#         base_32      816.153     1064.817 1.231248e+03 1.134542e+03     1263.152     2430.191    10 a   
#        base_32k    38045.080    38618.383 3.976884e+04 3.962228e+04    40399.740    42825.633    10 a   
#        base_32M 35065417.492 35143502.958 3.565601e+07 3.534793e+07 35802258.435 37015121.086    10    d
#        dplyr_32     2211.131     2292.499 1.211404e+04 2.370046e+03     2656.419    99510.280    10 a   
#       dplyr_32k     3796.442     4033.207 4.434725e+03 4.159054e+03     4857.402     5514.646    10 a   
#       dplyr_32M  1536183.034  1541187.073 1.580769e+06 1.565711e+06  1600732.034  1733709.195    10  b  
#   data.table_32      403.163      413.253 5.156662e+02 5.197515e+02      619.093      628.430    10 a   
#  data.table_32k     2208.477     2374.454 2.494886e+03 2.448170e+03     2557.604     3085.508    10 a   
#  data.table_32M  2011155.330  2033037.689 2.074020e+06 2.052079e+06  2078231.776  2189809.835    10   c 

Plot:

as_tibble(bench) %>% 
  group_by(expr) %>% 
  summarise(time = median(time)) %>% 
  separate(expr, c("framework", "nrow"), "_", remove = FALSE) %>% 
  mutate(nrow = recode(nrow, "32" = 32, "32k" = 32e3, "32M" = 32e6),
         time = time / 1e3) %>% 
  ggplot(aes(nrow, time, col = framework)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10() + ylab("microseconds")

aggregate-VS-dplyr-VS-datatable

Session info:

sessionInfo()
# R version 3.4.1 (2017-06-30)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Linux Mint 18
# 
# Matrix products: default
# BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
# LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
# 
# locale:
# [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8       
# [4] LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
# [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
# [10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] ggplot2_2.2.1          tidyr_0.6.3            bindrcpp_0.2           stringr_1.2.0         
# [5] microbenchmark_1.4-2.1 data.table_1.10.4      dplyr_0.7.1           
# 
# loaded via a namespace (and not attached):
# [1] Rcpp_0.12.11     compiler_3.4.1   plyr_1.8.4       bindr_0.1        tools_3.4.1      digest_0.6.12   
# [7] tibble_1.3.3     gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1  rlang_0.1.1      Matrix_1.2-10   
# [13] mvtnorm_1.0-6    grid_3.4.1       glue_1.1.1       R6_2.2.2         survival_2.41-3  multcomp_1.4-6  
# [19] TH.data_1.0-8    magrittr_1.5     scales_0.4.1     codetools_0.2-15 splines_3.4.1    MASS_7.3-47     
# [25] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3     sandwich_2.3-4   stringi_1.1.5    lazyeval_0.2.0  
# [31] munsell_0.4.3    zoo_1.8-0 

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to distinct-values

Java 8 Distinct by property How to maintain a Unique List in Java? Counting unique / distinct values by group in a data frame List distinct values in a vector in R

Examples related to r-faq

What does "The following object is masked from 'package:xxx'" mean? What does "Error: object '<myvariable>' not found" mean? How do I deal with special characters like \^$.?*|+()[{ in my regex? What does %>% function mean in R? How to plot a function curve in R Use dynamic variable names in `dplyr` Error: unexpected symbol/input/string constant/numeric constant/SPECIAL in my code How should I deal with "package 'xxx' is not available (for R version x.y.z)" warning? How to select the row with the maximum value in each group R data formats: RData, Rda, Rds etc