Counting unique / distinct values by group in a data frame

Question

Let's say I have the following data frame:

    > myvec
        name order_no
    1    Amy       12
    2   Jack       14
    3   Jack       16
    4   Dave       11
    5    Amy       12
    6   Jack       16
    7    Tom       19
    8  Larry       22
    9    Tom       19
    10  Dave       11
    11  Jack       17
    12   Tom       20
    13   Amy       23
    14  Jack       16

I want to count the number of distinct order_no values for each name. It should produce the following result:

    name    number_of_distinct_orders
    Amy     2
    Jack    3
    Dave    1
    Tom     2
    Larry   1

How can I do that?

User · Answer

Here is a solution with sqldf:

    library(sqldf)

    myvec <- read.table(header = TRUE, text = "
        name order_no
    1    Amy       12
    2   Jack       14
    3   Jack       16
    4   Dave       11
    5    Amy       12
    6   Jack       16
    7    Tom       19
    8  Larry       22
    9    Tom       19
    10  Dave       11
    11  Jack       17
    12   Tom       20
    13   Amy       23
    14  Jack       16
    ")

    sqldf("SELECT name, COUNT(DISTINCT order_no) AS number_of_distinct_orders FROM myvec GROUP BY name")

Output:

    > sqldf("SELECT name, COUNT(DISTINCT order_no) AS number_of_distinct_orders FROM myvec GROUP BY name")
       name number_of_distinct_orders
    1   Amy                         2
    2  Dave                         1
    3  Jack                         3
    4 Larry                         1
    5   Tom                         2

User · Answer

This is a simple solution with the function aggregate:

    aggregate(order_no ~ name, myvec, function(x) length(unique(x)))
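For a self-contained run, the data frame from the question can be rebuilt inline. The sketch below assumes the column names `name` and `order_no` from the question; the `setNames` step is an addition, used only to give the count column the name shown in the desired output.

```r
# Rebuild the question's data frame (column names assumed from the question).
myvec <- data.frame(
  name = c("Amy", "Jack", "Jack", "Dave", "Amy", "Jack", "Tom",
           "Larry", "Tom", "Dave", "Jack", "Tom", "Amy", "Jack"),
  order_no = c(12, 14, 16, 11, 12, 16, 19, 22, 19, 11, 17, 20, 23, 16)
)

# Count the distinct order_no values within each name.
res <- aggregate(order_no ~ name, myvec, function(x) length(unique(x)))

# Rename the count column to match the requested output.
res <- setNames(res, c("name", "number_of_distinct_orders"))
print(res)
```

Note that aggregate returns the groups sorted by name, not in order of first appearance as in the question's expected output.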

User · Answer

Using table:

    library(magrittr)
    myvec %>% unique %>% '['(1) %>% table %>% as.data.frame %>%
      setNames(c("name", "number_of_distinct_orders"))

       name number_of_distinct_orders
    1   Amy                         2
    2  Dave                         1
    3  Jack                         3
    4 Larry                         1
    5   Tom                         2

User · Answer

A data.table approach:

    library(data.table)
    DT <- data.table(myvec)

    DT[, .(number_of_distinct_orders = length(unique(order_no))), by = name]

data.table v >= 1.9.5 has a built-in uniqueN function now:

    DT[, .(number_of_distinct_orders = uniqueN(order_no)), by = name]

User · Answer

This would also work, but is less elegant than the plyr solution:

    x <- sapply(split(myvec, myvec$name), function(x) length(unique(x[, 2])))
    data.frame(names = names(x), number_of_distinct_orders = x, row.names = NULL)

User · Answer

Here is a benchmark of @David Arenburg's solution there, as well as a recap of some solutions posted here (@mnel, @Sven Hohenstein, @Henrik):

    library(dplyr)
    library(data.table)
    library(microbenchmark)
    library(tidyr)
    library(ggplot2)

    df <- mtcars
    DT <- as.data.table(df)
    DT_32k <- rbindlist(replicate(1e3, mtcars, simplify = FALSE))
    df_32k <- as.data.frame(DT_32k)
    DT_32M <- rbindlist(replicate(1e6, mtcars, simplify = FALSE))
    df_32M <- as.data.frame(DT_32M)
    bench <- microbenchmark(
      base_32 = aggregate(hp ~ cyl, df, function(x) length(unique(x))),
      base_32k = aggregate(hp ~ cyl, df_32k, function(x) length(unique(x))),
      base_32M = aggregate(hp ~ cyl, df_32M, function(x) length(unique(x))),
      dplyr_32 = summarise(group_by(df, cyl), count = n_distinct(hp)),
      dplyr_32k = summarise(group_by(df_32k, cyl), count = n_distinct(hp)),
      dplyr_32M = summarise(group_by(df_32M, cyl), count = n_distinct(hp)),
      data.table_32 = DT[, .(count = uniqueN(hp)), by = cyl],
      data.table_32k = DT_32k[, .(count = uniqueN(hp)), by = cyl],
      data.table_32M = DT_32M[, .(count = uniqueN(hp)), by = cyl],
      times = 10
    )

Results:

    print(bench)

    # Unit: microseconds
    #            expr          min           lq         mean       median           uq          max neval  cld
    #         base_32      816.153     1064.817 1.231248e+03 1.134542e+03     1263.152     2430.191    10 a
    #        base_32k    38045.080    38618.383 3.976884e+04 3.962228e+04    40399.740    42825.633    10 a
    #        base_32M 35065417.492 35143502.958 3.565601e+07 3.534793e+07 35802258.435 37015121.086    10    d
    #        dplyr_32     2211.131     2292.499 1.211404e+04 2.370046e+03     2656.419    99510.280    10 a
    #       dplyr_32k     3796.442     4033.207 4.434725e+03 4.159054e+03     4857.402     5514.646    10 a
    #       dplyr_32M  1536183.034  1541187.073 1.580769e+06 1.565711e+06  1600732.034  1733709.195    10  b
    #   data.table_32      403.163      413.253 5.156662e+02 5.197515e+02      619.093      628.430    10 a
    #  data.table_32k     2208.477     2374.454 2.494886e+03 2.448170e+03     2557.604     3085.508    10 a
    #  data.table_32M  2011155.330  2033037.689 2.074020e+06 2.052079e+06  2078231.776  2189809.835    10   c

Plot:

    as_tibble(bench) %>%
      group_by(expr) %>%
      summarise(time = median(time)) %>%
      separate(expr, c("framework", "nrow"), "_", remove = FALSE) %>%
      mutate(nrow = recode(nrow, "32" = 32, "32k" = 32e3, "32M" = 32e6),
             time = time / 1e3) %>%
      ggplot(aes(nrow, time, col = framework)) +
      geom_line() +
      scale_x_log10() +
      scale_y_log10() + ylab("microseconds")

Session info:

    sessionInfo()

    # R version 3.4.1 (2017-06-30)
    # Platform: x86_64-pc-linux-gnu (64-bit)
    # Running under: Linux Mint 18
    #
    # Matrix products: default
    # BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
    # LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
    #
    # locale:
    #  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8
    #  [4] LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8
    #  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C
    # [10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
    #
    # attached base packages:
    # [1] stats     graphics  grDevices utils     datasets  methods   base
    #
    # other attached packages:
    # [1] ggplot2_2.2.1          tidyr_0.6.3            bindrcpp_0.2           stringr_1.2.0
    # [5] microbenchmark_1.4-2.1 data.table_1.10.4      dplyr_0.7.1
    #
    # loaded via a namespace (and not attached):
    #  [1] Rcpp_0.12.11     compiler_3.4.1   plyr_1.8.4       bindr_0.1        tools_3.4.1      digest_0.6.12
    #  [7] tibble_1.3.3     gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1  rlang_0.1.1      Matrix_1.2-10
    # [13] mvtnorm_1.0-6    grid_3.4.1       glue_1.1.1       R6_2.2.2         survival_2.41-3  multcomp_1.4-6
    # [19] TH.data_1.0-8    magrittr_1.5     scales_0.4.1     codetools_0.2-15 splines_3.4.1    MASS_7.3-47
    # [25] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3     sandwich_2.3-4   stringi_1.1.5    lazyeval_0.2.0
    # [31] munsell_0.4.3    zoo_1.8-0

User · Answer

In dplyr you may use n_distinct to "count the number of unique values":

    library(dplyr)
    myvec %>%
      group_by(name) %>%
      summarise(n_distinct(order_no))

User · Answer

This should do the trick:

    library(plyr)
    ddply(myvec, ~name, summarise, number_of_distinct_orders = length(unique(order_no)))

This requires the package plyr.

User · Answer

    my.1 <- table(myvec)
    my.1[my.1 != 0] <- 1
    rowSums(my.1)
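The answer above gives no explanation, so here is a minimal sketch of the same idea, step by step. It assumes the question's data frame with columns `name` and `order_no`; the variable names `counts` and `distinct_per_name` are illustrative, and comparing against zero replaces the in-place overwrite with an equivalent logical test.

```r
# Rebuild the question's data frame (column names assumed from the question).
myvec <- data.frame(
  name = c("Amy", "Jack", "Jack", "Dave", "Amy", "Jack", "Tom",
           "Larry", "Tom", "Dave", "Jack", "Tom", "Amy", "Jack"),
  order_no = c(12, 14, 16, 11, 12, 16, 19, 22, 19, 11, 17, 20, 23, 16)
)

# Cross-tabulate name (rows) against order_no (columns); each cell holds
# how many times that (name, order_no) pair occurs.
counts <- table(myvec$name, myvec$order_no)

# A cell greater than zero means the pair occurs at least once, so summing
# the logical matrix by row gives the number of distinct order_no per name.
distinct_per_name <- rowSums(counts > 0)
print(distinct_per_name)
```

Naming the two columns explicitly in `table()` keeps the result two-dimensional even if the data frame has additional columns, whereas `table(myvec)` cross-tabulates every column.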

User · Answer

You can just use the built-in R function tapply with length and unique:

    tapply(myvec$order_no, myvec$name, FUN = function(x) length(unique(x)))
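tapply returns a named array, not a data frame. If the two-column layout from the question is wanted, one possible conversion is sketched below; the names `n_orders` and `res` are illustrative, and the data frame is rebuilt from the question so the snippet runs on its own.

```r
# Rebuild the question's data frame (column names assumed from the question).
myvec <- data.frame(
  name = c("Amy", "Jack", "Jack", "Dave", "Amy", "Jack", "Tom",
           "Larry", "Tom", "Dave", "Jack", "Tom", "Amy", "Jack"),
  order_no = c(12, 14, 16, 11, 12, 16, 19, 22, 19, 11, 17, 20, 23, 16)
)

# Named array: one entry per name, holding the count of distinct order_no.
n_orders <- tapply(myvec$order_no, myvec$name, function(x) length(unique(x)))

# Move the names into a column and drop the array attributes from the counts.
res <- data.frame(
  name = names(n_orders),
  number_of_distinct_orders = as.vector(n_orders),
  row.names = NULL
)
print(res)
```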

User · Answer

This question is a few years old, but I had a similar requirement and ended up writing my own solution. Applying it here:

    x <- data.frame(
      "Name" = c("Amy", "Jack", "Jack", "Dave", "Amy", "Jack", "Tom",
                 "Larry", "Tom", "Dave", "Jack", "Tom", "Amy", "Jack"),
      "OrderNo" = c(12, 14, 16, 11, 12, 16, 19, 22, 19, 11, 17, 20, 23, 16)
    )

    table(sub("~.*", "", unique(paste(x$Name, x$OrderNo, sep = "~", collapse = NULL))))

    #  Amy  Dave  Jack Larry   Tom
    #    2     1     3     1     2
