Remove duplicated rows using dplyr

Question

I have a data frame like this:

    set.seed(123)
    df = data.frame(x = sample(0:1, 10, replace = TRUE),
                    y = sample(0:1, 10, replace = TRUE),
                    z = 1:10)

    > df
       x y  z
    1  0 1  1
    2  1 0  2
    3  0 1  3
    4  1 1  4
    5  1 0  5
    6  0 1  6
    7  1 0  7
    8  1 0  8
    9  1 0  9
    10 0 1 10

I would like to remove duplicate rows based on the first two columns. Expected output:

    df[!duplicated(df[, 1:2]), ]
      x y z
    1 0 1 1
    2 1 0 2
    4 1 1 4

I am specifically looking for a solution using the dplyr package.

User · Answer

Most of the time, the best solution is using distinct() from dplyr, as has already been suggested.

However, here's another approach that uses the slice() function from dplyr.

    # Generate fake data for the example
    library(dplyr)
    set.seed(123)
    df <- data.frame(
      x = sample(0:1, 10, replace = TRUE),
      y = sample(0:1, 10, replace = TRUE),
      z = 1:10
    )

    # In each group of rows formed by combinations of x and y,
    # retain only the first row
    df %>%
      group_by(x, y) %>%
      slice(1)

Difference from using the distinct() function

The advantage of this solution is that it makes it explicit which rows are retained from the original data frame, and it pairs nicely with the arrange() function.

Let's say you had customer sales data and you wanted to retain one record per customer, and you wanted that record to be the one from their latest purchase. Then you could write:

    customer_purchase_data %>%
      arrange(desc(Purchase_Date)) %>%
      group_by(Customer_ID) %>%
      slice(1)
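To make the latest-purchase idea concrete, here is a minimal runnable sketch. The data frame, its column names (Customer_ID, Purchase_Date, Amount) and its values are all made up for illustration; they are not from the original answer.

```r
library(dplyr)

# Hypothetical purchase data; names and values are illustrative only
customer_purchase_data <- data.frame(
  Customer_ID   = c("A", "A", "B", "B", "B"),
  Purchase_Date = as.Date(c("2020-01-05", "2020-03-10",
                            "2020-02-01", "2020-02-15", "2020-01-20")),
  Amount        = c(10, 25, 5, 40, 15)
)

# Sort newest first, then keep the first row within each customer
latest_purchase <- customer_purchase_data %>%
  arrange(desc(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  slice(1) %>%
  ungroup()

latest_purchase
```

Because the rows are sorted before grouping, slice(1) picks the most recent purchase for each customer rather than whichever row happened to come first in the raw data.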

User · Answer

When selecting columns in R for a reduced data set, you can often end up with duplicates.

These two lines give the same set of rows (the row order and the grouping of the result can differ). Each outputs a data set of unique rows containing only the two selected columns:

    distinct(mtcars, cyl, hp)
    summarise(group_by(mtcars, cyl, hp))
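One way to check this claim is to sort both results and compare them column by column. This is only a sketch: the variable names are made up, and the .groups = "drop" argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Unique cyl/hp combinations via distinct()
by_distinct <- distinct(mtcars, cyl, hp)

# Same combinations via group_by() + summarise() with no summary columns;
# .groups = "drop" returns an ungrouped result (assumes dplyr >= 1.0.0)
by_summarise <- summarise(group_by(mtcars, cyl, hp), .groups = "drop")

# distinct() keeps first-occurrence order, summarise() sorts by the
# grouping columns, so sort before comparing
by_distinct_sorted <- arrange(by_distinct, cyl, hp)

same_rows <- identical(by_distinct_sorted$cyl, by_summarise$cyl) &&
  identical(by_distinct_sorted$hp, by_summarise$hp)
same_rows
```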

User · Answer

For completeness' sake, the following also works:

    df %>% group_by(x) %>% filter(!duplicated(y))

However, I prefer the solution using distinct(), and I suspect it's faster, too.

User · Answer

Here is a solution using dplyr >= 0.5.

    library(dplyr)
    set.seed(123)
    df <- data.frame(
      x = sample(0:1, 10, replace = TRUE),
      y = sample(0:1, 10, replace = TRUE),
      z = 1:10
    )

    > df %>% distinct(x, y, .keep_all = TRUE)
      x y z
    1 0 1 1
    2 1 0 2
    3 1 1 4

User · Answer

If you want to find the rows that are duplicated, you can use find_duplicates() from hablar:

    library(dplyr)
    library(hablar)

    df <- tibble(a = c(1, 2, 2, 4),
                 b = c(5, 2, 2, 8))

    df %>% find_duplicates()

User · Answer

Note: dplyr now contains the distinct() function for this purpose.

Original answer below:

    library(dplyr)
    set.seed(123)
    df <- data.frame(
      x = sample(0:1, 10, replace = TRUE),
      y = sample(0:1, 10, replace = TRUE),
      z = 1:10
    )

One approach would be to group, and then only keep the first row:

    df %>% group_by(x, y) %>% filter(row_number(z) == 1)

    ## Source: local data frame [3 x 3]
    ## Groups: x, y
    ##
    ##   x y z
    ## 1 0 1 1
    ## 2 1 0 2
    ## 3 1 1 4

(In dplyr 0.2 you won't need the dummy z variable and will just be able to write row_number() == 1.)

I've also been thinking about adding a slice() function that would work like:

    df %>% group_by(x, y) %>% slice(from = 1, to = 1)

Or maybe a variation of unique() that would let you select which variables to use:

    df %>% unique(x, y)
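For reference, a small sketch of the row_number() == 1 form next to distinct(). It uses explicit, made-up values instead of the seeded sample, so the result does not depend on the R version's random number generator.

```r
library(dplyr)

# Small explicit data set (illustrative values, not the seeded sample)
df <- data.frame(
  x = c(0, 1, 0, 1, 1),
  y = c(1, 0, 1, 1, 0),
  z = 1:5
)

# Keep the first row of each (x, y) group; no dummy column needed
first_rows <- df %>%
  group_by(x, y) %>%
  filter(row_number() == 1) %>%
  ungroup()

# The distinct() equivalent, keeping the remaining columns
first_rows_distinct <- distinct(df, x, y, .keep_all = TRUE)
```

On this data both approaches keep the rows with z equal to 1, 2 and 4, which are the first occurrences of each (x, y) pair.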
