Remove duplicated rows

Question

I have read a CSV file into an R data frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in that column. For example:

    platform_external_dbus          202           16                     google        1
    platform_external_dbus          202           16         space-ghost.verbum        1
    platform_external_dbus          202           16                  localhost        1
    platform_external_dbus          202           16          users.sourceforge        8
    platform_external_dbus          202           16                    hughsie        1

I would like only one of these rows, since the others have the same data in the first column.

User · Answer

You can also use dplyr's distinct() function! It tends to be more efficient than alternative options, especially if you have loads of observations.

    distinct_data <- dplyr::distinct(yourdata)

User · Answer

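The general answer can be, for example:

    df <- data.frame(rbind(c(2, 9, 6), c(4, 6, 7), c(4, 6, 7), c(4, 6, 7), c(2, 9, 6)))

    # Keep the first occurrence of each row. Logical negation is used because
    # df[-which(duplicated(df)), ] returns zero rows when nothing is duplicated.
    new_df <- df[!duplicated(df), ]

Output:

      X1 X2 X3
    1  2  9  6
    2  4  6  7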
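The same idea can be restricted to a single column, which is what the question asks for. Below is a minimal sketch; the column names V1 to V5 are assumed for illustration, since the question does not name its columns.

```r
# Hypothetical reconstruction of the question's data; column names are assumed.
df <- data.frame(
  V1 = rep("platform_external_dbus", 5),
  V2 = rep(202L, 5),
  V3 = rep(16L, 5),
  V4 = c("google", "space-ghost.verbum", "localhost",
         "users.sourceforge", "hughsie"),
  V5 = c(1L, 1L, 1L, 8L, 1L)
)

# duplicated() on one column flags every repeat of a value in that column;
# negating it keeps the first row seen for each distinct V1.
dedup <- df[!duplicated(df$V1), ]
dedup
```

Which row survives depends on row order, so sort the data frame first if a particular row should be the one that is kept.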

User · Answer

The data.table package also has unique and duplicated methods of its own, with some additional features.

Both the unique.data.table and the duplicated.data.table methods have an additional by argument, which allows you to pass a character or integer vector of column names or their locations respectively.

    library(data.table)
    DT <- data.table(id = c(1, 1, 1, 2, 2, 2),
                     val = c(10, 20, 30, 10, 20, 30))

    unique(DT, by = "id")
    #    id val
    # 1:  1  10
    # 2:  2  10

    duplicated(DT, by = "id")
    # [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE

Another important feature of these methods is a huge performance gain for larger data sets.

    library(microbenchmark)
    library(data.table)
    set.seed(123)
    DF <- as.data.frame(matrix(sample(1e8, 1e5, replace = TRUE), ncol = 10))
    DT <- copy(DF)
    setDT(DT)

    microbenchmark(unique(DF), unique(DT))
    # Unit: microseconds
    #        expr       min         lq      mean    median        uq       max neval cld
    #  unique(DF) 44708.230 48981.8445 53062.536 51573.276 52844.591 107032.18   100   b
    #  unique(DT)   746.855   776.6145  2201.657   864.932   919.489  55986.88   100  a

    microbenchmark(duplicated(DF), duplicated(DT))
    # Unit: microseconds
    #            expr       min         lq       mean     median        uq        max neval cld
    #  duplicated(DF) 43786.662 44418.8005 46684.0602 44925.0230 46802.398 109550.170   100   b
    #  duplicated(DT)   551.982   558.2215   851.0246   639.9795   663.658   5805.243   100  a

User · Answer

This problem can also be solved by selecting the first row from each group, where the groups are defined by the columns on which we want unique values (in the example shared it is just the 1st column).

Using base R:

    subset(df, ave(V2, V1, FUN = seq_along) == 1)

    #                       V1  V2 V3     V4 V5
    # 1 platform_external_dbus 202 16 google  1

In dplyr:

    library(dplyr)
    df %>% group_by(V1) %>% slice(1L)

Or using data.table:

    library(data.table)
    setDT(df)[, .SD[1L], by = V1]

If we need to find unique rows based on multiple columns, just add those column names to the grouping part of each of the above answers.

data

    df <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L),
        .Label = "platform_external_dbus", class = "factor"),
        V2 = c(202L, 202L, 202L, 202L, 202L),
        V3 = c(16L, 16L, 16L, 16L, 16L),
        V4 = structure(c(1L, 4L, 3L, 5L, 2L), .Label = c("google",
            "hughsie", "localhost", "space-ghost.verbum", "users.sourceforge"),
            class = "factor"),
        V5 = c(1L, 1L, 1L, 8L, 1L)),
        class = "data.frame", row.names = c(NA, -5L))

User · Answer

Just isolate your data frame to the columns you need, then use the unique function :D

    # in the above example, you only need the first three columns
    deduped.data <- unique(yourdata[, 1:3])
    # the fourth column no longer 'distinguishes' them,
    # so they're duplicates and thrown out.
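Subsetting before calling unique() also drops the remaining columns from the result. If they should be kept, one option is to compute duplicated() on the key columns only and use that to index the full data frame. This is a sketch on made-up data; the contents of yourdata and its column names are invented for illustration.

```r
# Made-up data for illustration; only the first three columns define a duplicate.
yourdata <- data.frame(
  x = c("a", "a", "b"),
  y = c(1, 1, 2),
  z = c(10, 10, 20),
  w = c("first", "second", "third")
)

# Flag repeats using columns 1 to 3, but keep every column of the surviving rows.
deduped_full <- yourdata[!duplicated(yourdata[, 1:3]), ]
deduped_full
```

As with unique(), the first occurrence of each combination is the row that is retained.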

User · Answer

The function distinct() in the dplyr package performs arbitrary duplicate removal, either from specific columns/variables (as in this question) or considering all columns/variables. dplyr is part of the tidyverse.

Data and package

    library(dplyr)
    dat <- data.frame(a = rep(c(1, 2), 4), b = rep(LETTERS[1:4], 2))

Remove rows duplicated in a specific column (e.g., column a)

Note that .keep_all = TRUE retains all columns, otherwise only column a would be retained.

    distinct(dat, a, .keep_all = TRUE)

      a b
    1 1 A
    2 2 B

Remove rows that are complete duplicates of other rows

    distinct(dat)

      a b
    1 1 A
    2 2 B
    3 1 C
    4 2 D

User · Answer

Or you could nest the data in cols 4 and 5 into a single row with tidyr:

    library(tidyr)
    df %>% nest(V4:V5)

    # A tibble: 1 × 4
    #                       V1    V2    V3             data
    #                   <fctr> <int> <int>           <list>
    # 1 platform_external_dbus   202    16 <tibble [5 × 2]>

The col 2 and 3 duplicates are now removed for statistical analysis, but you have kept the col 4 and 5 data in a tibble and can go back to the original data frame at any point with unnest().

User · Answer

Here's a very simple, fast dplyr/tidy solution:

Remove rows that are entirely the same:

    library(dplyr)
    iris %>%
      distinct(.keep_all = TRUE)

Remove rows that are the same only in certain columns:

    iris %>%
      distinct(Sepal.Length, Sepal.Width, .keep_all = TRUE)

User · Answer

With sqldf:

    # Example by Mehdi Nellen
    a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
    b <- c(1, 1, 2, 4, 1, 1, 2, 2)
    df <- data.frame(a, b)

Solution:

    library(sqldf)
    sqldf("SELECT DISTINCT * FROM df")

Output:

      a b
    1 A 1
    2 A 2
    3 B 4
    4 B 1
    5 C 2

User · Answer

For people who have come here to look for a general answer for duplicate row removal, use !duplicated():

    a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
    b <- c(1, 1, 2, 4, 1, 1, 2, 2)
    df <- data.frame(a, b)

    duplicated(df)
    [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE

    > df[duplicated(df), ]
      a b
    2 A 1
    6 B 1
    8 C 2

    > df[!duplicated(df), ]
      a b
    1 A 1
    3 A 2
    4 B 4
    5 B 1
    7 C 2

Answer from: Removing duplicated rows from R data frame

User · Answer

Remove duplicate rows of a dataframe

    library(dplyr)
    mydata <- mtcars

    # Remove duplicate rows of the dataframe
    distinct(mydata)

In this dataset there is not a single duplicate row, so it returns the same number of rows as in mydata.

Remove duplicate rows based on one variable

    library(dplyr)
    mydata <- mtcars

    # Remove duplicate rows of the dataframe using carb variable
    distinct(mydata, carb, .keep_all = TRUE)

The .keep_all argument is used to retain all other variables in the output data frame.

Remove duplicate rows based on multiple variables

    library(dplyr)
    mydata <- mtcars

    # Remove duplicate rows of the dataframe using cyl and vs variables
    distinct(mydata, cyl, vs, .keep_all = TRUE)

As above, .keep_all = TRUE retains all other variables in the output data frame.

(from: http://www.datasciencemadesimple.com/remove-duplicate-rows-r-using-dplyr-distinct-function)
