I have read a CSV file into an R data.frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in that column. For example:
platform_external_dbus 202 16 google 1
platform_external_dbus 202 16 space-ghost.verbum 1
platform_external_dbus 202 16 localhost 1
platform_external_dbus 202 16 users.sourceforge 8
platform_external_dbus 202 16 hughsie 1
I would like only one of these rows since the others have the same data in the first column.
Remove duplicate rows of a dataframe
library(dplyr)
mydata <- mtcars
# Remove duplicate rows of the dataframe
distinct(mydata)
This dataset contains no fully duplicated rows, so the result has the same number of rows as mydata.
Remove duplicate rows based on one variable
library(dplyr)
mydata <- mtcars
# Remove duplicate rows of the dataframe using carb variable
distinct(mydata,carb, .keep_all= TRUE)
The .keep_all argument retains all other variables in the output data frame.
Remove Duplicate Rows based on multiple variables
library(dplyr)
mydata <- mtcars
# Remove duplicate rows of the dataframe using cyl and vs variables
distinct(mydata, cyl,vs, .keep_all= TRUE)
The .keep_all argument retains all other variables in the output data frame.
(from: http://www.datasciencemadesimple.com/remove-duplicate-rows-r-using-dplyr-distinct-function/ )
The function distinct() in the dplyr package performs arbitrary duplicate removal, either from specific columns/variables (as in this question) or considering all columns/variables. dplyr is part of the tidyverse.
Data and package
library(dplyr)
dat <- data.frame(a = rep(c(1,2),4), b = rep(LETTERS[1:4],2))
Remove rows duplicated in a specific column (e.g., column a)
Note that .keep_all = TRUE retains all columns; otherwise only column a would be retained.
distinct(dat, a, .keep_all = TRUE)
a b
1 1 A
2 2 B
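To make the effect of .keep_all concrete, here is a small sketch that compares the two calls on the same dat object. The names only_a and with_all are made up for this illustration.

```r
library(dplyr)

dat <- data.frame(a = rep(c(1, 2), 4), b = rep(LETTERS[1:4], 2))

# Without .keep_all, only the column(s) named in distinct() are returned
only_a <- distinct(dat, a)
names(only_a)

# With .keep_all = TRUE, the first row for each value of a is kept whole
with_all <- distinct(dat, a, .keep_all = TRUE)
names(with_all)
```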
Remove rows that are complete duplicates of other rows:
distinct(dat)
a b
1 1 A
2 2 B
3 1 C
4 2 D
For people who have come here to look for a general answer for duplicate row removal, use !duplicated():
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)
duplicated(df)
[1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
> df[duplicated(df), ]
a b
2 A 1
6 B 1
8 C 2
> df[!duplicated(df), ]
a b
1 A 1
3 A 2
4 B 4
5 B 1
7 C 2
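The same idea can be narrowed to a single column, which is what the question asks for. The following is a minimal base R sketch on the toy data above; first_per_a and last_per_a are names chosen only for this example.

```r
# Same toy data as in the example above
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
df <- data.frame(a, b)

# Keep the first row for each distinct value of column a
first_per_a <- df[!duplicated(df$a), ]
first_per_a
#   a b
# 1 A 1
# 4 B 4
# 7 C 2

# Keep the last row for each value instead
last_per_a <- df[!duplicated(df$a, fromLast = TRUE), ]
```

To treat several columns as the key, pass a sub-data-frame, for example !duplicated(df[, c("a", "b")]).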
Answer from: Removing duplicated rows from R data frame
Here's a very simple, fast dplyr/tidy solution:
Remove rows that are entirely the same:
library(dplyr)
iris %>%
distinct(.keep_all = TRUE)
Remove rows that are the same only in certain columns:
iris %>%
distinct(Sepal.Length, Sepal.Width, .keep_all = TRUE)
You can also use dplyr's distinct() function! It tends to be more efficient than alternative options, especially if you have loads of observations.
distinct_data <- dplyr::distinct(yourdata)
This problem can also be solved by selecting the first row from each group, where the groups are defined by the columns in which we want unique values (in the example shared, that is just the first column).
Using base R:
subset(df, ave(V2, V1, FUN = seq_along) == 1)
# V1 V2 V3 V4 V5
#1 platform_external_dbus 202 16 google 1
In dplyr:
library(dplyr)
df %>% group_by(V1) %>% slice(1L)
Or using data.table:
library(data.table)
setDT(df)[, .SD[1L], by = V1]
If we need unique rows based on multiple columns, just add those column names to the grouping part of each of the solutions above.
data
df <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L),
.Label = "platform_external_dbus", class = "factor"),
V2 = c(202L, 202L, 202L, 202L, 202L), V3 = c(16L, 16L, 16L,
16L, 16L), V4 = structure(c(1L, 4L, 3L, 5L, 2L), .Label = c("google",
"hughsie", "localhost", "space-ghost.verbum", "users.sourceforge"
), class = "factor"), V5 = c(1L, 1L, 1L, 8L, 1L)), class = "data.frame",
row.names = c(NA, -5L))
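As a sketch of the multi-column case mentioned above, the three approaches might look as follows. The data frame df2 and the result names are hypothetical, invented here so that two columns (V1 and V3) together define a group.

```r
library(dplyr)
library(data.table)

# Hypothetical data: V1 and V3 together define a group
df2 <- data.frame(
  V1 = c("x", "x", "x", "y", "y"),
  V3 = c(1L, 1L, 2L, 1L, 1L),
  V4 = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Base R: ave() accepts several grouping vectors
base_res <- subset(df2, ave(seq_along(V1), V1, V3, FUN = seq_along) == 1)

# dplyr: list every key column in group_by()
dplyr_res <- df2 %>% group_by(V1, V3) %>% slice(1L) %>% ungroup()

# data.table: as.data.table() makes a copy, so df2 itself is left unchanged
dt_res <- as.data.table(df2)[, .SD[1L], by = .(V1, V3)]
```

All three keep the rows whose V4 values are "a", "c" and "d", that is, the first row of each (V1, V3) combination.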
The data.table package also has unique and duplicated methods of its own, with some additional features.
Both the unique.data.table and the duplicated.data.table methods have an additional by argument, which allows you to pass a character or integer vector of column names or their positions, respectively:
library(data.table)
DT <- data.table(id = c(1,1,1,2,2,2),
val = c(10,20,30,10,20,30))
unique(DT, by = "id")
# id val
# 1: 1 10
# 2: 2 10
duplicated(DT, by = "id")
# [1] FALSE TRUE TRUE FALSE TRUE TRUE
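For completeness, here is a short sketch of the integer form of by, together with the fromLast argument that these methods also accept. The names by_pos and last_rows are illustrative only.

```r
library(data.table)

DT <- data.table(id  = c(1, 1, 1, 2, 2, 2),
                 val = c(10, 20, 30, 10, 20, 30))

# Column position instead of name: 1L refers to the id column
by_pos <- unique(DT, by = 1L)

# Keep the last row per id rather than the first
last_rows <- unique(DT, by = "id", fromLast = TRUE)
```

Here by_pos matches the result of unique(DT, by = "id") shown above, while last_rows keeps the rows with val equal to 30.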
Another important feature of these methods is a huge performance gain for larger data sets:
library(microbenchmark)
library(data.table)
set.seed(123)
DF <- as.data.frame(matrix(sample(1e8, 1e5, replace = TRUE), ncol = 10))
DT <- copy(DF)
setDT(DT)
microbenchmark(unique(DF), unique(DT))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# unique(DF) 44708.230 48981.8445 53062.536 51573.276 52844.591 107032.18 100 b
# unique(DT) 746.855 776.6145 2201.657 864.932 919.489 55986.88 100 a
microbenchmark(duplicated(DF), duplicated(DT))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# duplicated(DF) 43786.662 44418.8005 46684.0602 44925.0230 46802.398 109550.170 100 b
# duplicated(DT) 551.982 558.2215 851.0246 639.9795 663.658 5805.243 100 a
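The general answer can be, for example:
df <- data.frame(rbind(c(2,9,6),c(4,6,7),c(4,6,7),c(4,6,7),c(2,9,6)))
# Logical indexing also works when there are no duplicates, whereas
# df[-which(duplicated(df)), ] would return zero rows in that case
new_df <- df[!duplicated(df), ]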
X1 X2 X3
1 2 9 6
2 4 6 7
With sqldf:
# Example by Mehdi Nellen
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)
Solution:
library(sqldf)
sqldf('SELECT DISTINCT * FROM df')
Output:
a b
1 A 1
2 A 2
3 B 4
4 B 1
5 C 2
Or you could nest the data in cols 4 and 5 into a single row with tidyr:
library(tidyr)
# tidyr 1.0.0 and later expect the new list-column to be named
df %>% nest(data = c(V4, V5))
# A tibble: 1 × 4
# V1 V2 V3 data
# <fctr> <int> <int> <list>
#1 platform_external_dbus 202 16 <tibble [5 × 2]>
The duplicates in cols 2 and 3 are now removed for statistical analysis, but the data from cols 4 and 5 is kept in a tibble, so you can go back to the original data frame at any point with unnest().
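A minimal sketch of that round trip, assuming tidyr 1.0.0 or later (where the nested column is named explicitly). The data frame is rebuilt inline from the question's rows, and nested and restored are names chosen for this example.

```r
library(tidyr)

# The five rows from the question, rebuilt inline for a self-contained example
df <- data.frame(
  V1 = "platform_external_dbus",
  V2 = 202L,
  V3 = 16L,
  V4 = c("google", "space-ghost.verbum", "localhost",
         "users.sourceforge", "hughsie"),
  V5 = c(1L, 1L, 1L, 8L, 1L)
)

# Collapse V4 and V5 into a list-column: one row per (V1, V2, V3) combination
nested <- nest(df, data = c(V4, V5))

# Expand the list-column again to recover the original rows
restored <- unnest(nested, data)
```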
Source: Stackoverflow.com