[r] Remove duplicated rows

I have read a CSV file into an R data.frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in that column. For example:

platform_external_dbus          202           16                     google        1
platform_external_dbus          202           16         space-ghost.verbum        1
platform_external_dbus          202           16                  localhost        1
platform_external_dbus          202           16          users.sourceforge        8
platform_external_dbus          202           16                    hughsie        1

I would like only one of these rows since the others have the same data in the first column.

This question is related to r duplicates r-faq

The answer is


Remove duplicate rows of a dataframe

library(dplyr)
mydata <- mtcars

# Remove duplicate rows of the dataframe
distinct(mydata)

In this dataset, there is not a single duplicate row so it returned same number of rows as in mydata.



Remove Duplicate Rows based on a one variable

library(dplyr)
mydata <- mtcars

# Remove duplicate rows of the dataframe using carb variable
distinct(mydata,carb, .keep_all= TRUE)

The .keep_all function is used to retain all other variables in the output data frame.



Remove Duplicate Rows based on multiple variables

library(dplyr)
mydata <- mtcars

# Remove duplicate rows of the dataframe using cyl and vs variables
distinct(mydata, cyl,vs, .keep_all= TRUE)

The .keep_all function is used to retain all other variables in the output data frame.

(from: http://www.datasciencemadesimple.com/remove-duplicate-rows-r-using-dplyr-distinct-function/ )


The function distinct() in the dplyr package performs arbitrary duplicate removal, either from specific columns/variables (as in this question) or considering all columns/variables. dplyr is part of the tidyverse.

Data and package

library(dplyr)
dat <- data.frame(a = rep(c(1,2),4), b = rep(LETTERS[1:4],2))

Remove rows duplicated in a specific column (e.g., columna)

Note that .keep_all = TRUE retains all columns, otherwise only column a would be retained.

distinct(dat, a, .keep_all = TRUE)

  a b
1 1 A
2 2 B

Remove rows that are complete duplicates of other rows:

distinct(dat)

  a b
1 1 A
2 2 B
3 1 C
4 2 D

For people who have come here to look for a general answer for duplicate row removal, use !duplicated():

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)

duplicated(df)
[1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE

> df[duplicated(df), ]
  a b
2 A 1
6 B 1
8 C 2

> df[!duplicated(df), ]
  a b
1 A 1
3 A 2
4 B 4
5 B 1
7 C 2

Answer from: Removing duplicated rows from R data frame


Here's a very simple, fast dplyr/tidy solution:

Remove rows that are entirely the same:

library(dplyr)
iris %>% 
  distinct(.keep_all = TRUE)

Remove rows that are the same only in certain columns:

iris %>% 
  distinct(Sepal.Length, Sepal.Width, .keep_all = TRUE)


You can also use dplyr's distinct() function! It tends to be more efficient than alternative options, especially if you have loads of observations.

distinct_data <- dplyr::distinct(yourdata)

This problem can also be solved by selecting first row from each group where the group are the columns based on which we want to select unique values (in the example shared it is just 1st column).

Using base R :

subset(df, ave(V2, V1, FUN = seq_along) == 1)

#                      V1  V2 V3     V4 V5
#1 platform_external_dbus 202 16 google  1

In dplyr

library(dplyr)
df %>% group_by(V1) %>% slice(1L)

Or using data.table

library(data.table)
setDT(df)[, .SD[1L], by = V1]

If we need to find out unique rows based on multiple columns just add those column names in grouping part for each of the above answer.

data

df <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L), 
.Label = "platform_external_dbus", class = "factor"), 
V2 = c(202L, 202L, 202L, 202L, 202L), V3 = c(16L, 16L, 16L, 
16L, 16L), V4 = structure(c(1L, 4L, 3L, 5L, 2L), .Label = c("google", 
"hughsie", "localhost", "space-ghost.verbum", "users.sourceforge"
), class = "factor"), V5 = c(1L, 1L, 1L, 8L, 1L)), class = "data.frame", 
row.names = c(NA, -5L))

The data.table package also has unique and duplicated methods of it's own with some additional features.

Both the unique.data.table and the duplicated.data.table methods have an additional by argument which allows you to pass a character or integer vector of column names or their locations respectively

library(data.table)
DT <- data.table(id = c(1,1,1,2,2,2),
                 val = c(10,20,30,10,20,30))

unique(DT, by = "id")
#    id val
# 1:  1  10
# 2:  2  10

duplicated(DT, by = "id")
# [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE

Another important feature of these methods is a huge performance gain for larger data sets

library(microbenchmark)
library(data.table)
set.seed(123)
DF <- as.data.frame(matrix(sample(1e8, 1e5, replace = TRUE), ncol = 10))
DT <- copy(DF)
setDT(DT)

microbenchmark(unique(DF), unique(DT))
# Unit: microseconds
#       expr       min         lq      mean    median        uq       max neval cld
# unique(DF) 44708.230 48981.8445 53062.536 51573.276 52844.591 107032.18   100   b
# unique(DT)   746.855   776.6145  2201.657   864.932   919.489  55986.88   100  a 


microbenchmark(duplicated(DF), duplicated(DT))
# Unit: microseconds
#           expr       min         lq       mean     median        uq        max neval cld
# duplicated(DF) 43786.662 44418.8005 46684.0602 44925.0230 46802.398 109550.170   100   b
# duplicated(DT)   551.982   558.2215   851.0246   639.9795   663.658   5805.243   100  a 

the general answer can be for example:

df <-  data.frame(rbind(c(2,9,6),c(4,6,7),c(4,6,7),c(4,6,7),c(2,9,6))))



new_df <- df[-which(duplicated(df)), ]

output:

      X1 X2 X3
    1  2  9  6
    2  4  6  7

With sqldf:

# Example by Mehdi Nellen
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)

Solution:

 library(sqldf)
    sqldf('SELECT DISTINCT * FROM df')

Output:

  a b
1 A 1
2 A 2
3 B 4
4 B 1
5 C 2

Or you could nest the data in cols 4 and 5 into a single row with tidyr:

library(tidyr)
df %>% nest(V4:V5)

# A tibble: 1 × 4
#                      V1    V2    V3             data
#                  <fctr> <int> <int>           <list>
#1 platform_external_dbus   202    16 <tibble [5 × 2]>

The col 2 and 3 duplicates are now removed for statistical analysis, but you have kept the col 4 and 5 data in a tibble and can go back to the original data frame at any point with unnest().


Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to duplicates

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C Remove duplicates from a dataframe in PySpark How to "select distinct" across multiple data frame columns in pandas? How to find duplicate records in PostgreSQL Drop all duplicate rows across multiple columns in Python Pandas Left Join without duplicate rows from left table Finding duplicate integers in an array and display how many times they occurred How do I use SELECT GROUP BY in DataTable.Select(Expression)? How to delete duplicate rows in SQL Server? Python copy files to a new directory and rename if file name already exists

Examples related to r-faq

What does "The following object is masked from 'package:xxx'" mean? What does "Error: object '<myvariable>' not found" mean? How do I deal with special characters like \^$.?*|+()[{ in my regex? What does %>% function mean in R? How to plot a function curve in R Use dynamic variable names in `dplyr` Error: unexpected symbol/input/string constant/numeric constant/SPECIAL in my code How should I deal with "package 'xxx' is not available (for R version x.y.z)" warning? How to select the row with the maximum value in each group R data formats: RData, Rda, Rds etc