I have a data.frame like this -
set.seed(123)
df <- data.frame(x = sample(0:1, 10, replace = TRUE), y = sample(0:1, 10, replace = TRUE), z = 1:10)
> df
x y z
1 0 1 1
2 1 0 2
3 0 1 3
4 1 1 4
5 1 0 5
6 0 1 6
7 1 0 7
8 1 0 8
9 1 0 9
10 0 1 10
I would like to remove duplicate rows based on the first two columns. Expected output -
df[!duplicated(df[,1:2]),]
x y z
1 0 1 1
2 1 0 2
4 1 1 4
I am specifically looking for a solution using the dplyr package.
If you want to find the rows that are duplicated, you can use find_duplicates from hablar:
library(dplyr)
library(hablar)
df <- tibble(a = c(1, 2, 2, 4),
b = c(5, 2, 2, 8))
df %>% find_duplicates()
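If you would rather not add another package, a rough dplyr-only sketch of the same idea is to group by every column and keep the groups that have more than one row. This is my own equivalent, not how hablar implements it, and the name dupes is just for illustration:

```r
library(dplyr)

df <- tibble(a = c(1, 2, 2, 4),
             b = c(5, 2, 2, 8))

# Keep only the rows whose (a, b) combination occurs more than once
dupes <- df %>%
  group_by(a, b) %>%
  filter(n() > 1) %>%
  ungroup()

dupes
```

Here both copies of the duplicated row are returned, so you can inspect them before deciding which one to drop.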
When selecting columns in R for a reduced data-set you can often end up with duplicates.
These two lines return the same unique combinations of the two selected columns (row order may differ, and the second returns a grouped tibble):
distinct(mtcars, cyl, hp)
summarise(group_by(mtcars, cyl, hp))
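A quick way to check that claim is sketched below. It assumes dplyr >= 1.0.0 for the .groups argument, and the names a, b and a_sorted are only for illustration. The distinct() result is sorted first because summarise() returns its groups in sorted order:

```r
library(dplyr)

a <- distinct(mtcars, cyl, hp)
b <- summarise(group_by(mtcars, cyl, hp), .groups = "drop")

# Sort the distinct() result so both are in the same row order
a_sorted <- arrange(a, cyl, hp)

nrow(a) == nrow(b)
all(a_sorted$cyl == b$cyl & a_sorted$hp == b$hp)
```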
Here is a solution using dplyr >= 0.5.
library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(0:1, 10, replace = TRUE),
y = sample(0:1, 10, replace = TRUE),
z = 1:10
)
> df %>% distinct(x, y, .keep_all = TRUE)
x y z
1 0 1 1
2 1 0 2
3 1 1 4
For completeness’ sake, the following also works:
df %>% group_by(x) %>% filter(!duplicated(y))
However, I prefer the solution using distinct, and I suspect it’s faster, too.
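As a sanity check, here is a small sketch comparing both dplyr approaches with the base R expression from the question. The names via_distinct, via_filter and via_base are mine. The comparison uses the z column because the three results differ in class and row names:

```r
library(dplyr)

set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)

via_distinct <- df %>% distinct(x, y, .keep_all = TRUE)
via_filter   <- df %>% group_by(x) %>% filter(!duplicated(y)) %>% ungroup()
via_base     <- df[!duplicated(df[, 1:2]), ]

# All three keep the first row of each (x, y) combination
identical(via_distinct$z, via_base$z)
identical(via_filter$z, via_base$z)
```

Note that the group_by() version returns a grouped tibble unless you call ungroup() afterwards.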
Most of the time, the best solution is using distinct() from dplyr, as has already been suggested.
However, here's another approach that uses the slice() function from dplyr.
# Generate fake data for the example
library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(0:1, 10, replace = TRUE),
y = sample(0:1, 10, replace = TRUE),
z = 1:10
)
# In each group of rows formed by combinations of x and y
# retain only the first row
df %>%
group_by(x, y) %>%
slice(1)
Compared with the distinct() function, the advantage of this solution is that it makes it explicit which rows are retained from the original data frame, and it pairs nicely with the arrange() function.
Say you have customer sales data and want to retain one record per customer, namely the one from their latest purchase. Then you could write:
customer_purchase_data %>%
arrange(desc(Purchase_Date)) %>%
group_by(Customer_ID) %>%
slice(1)
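To make that runnable, here is a sketch with made-up data. The column names follow the snippet above, but the values and the Amount column are purely hypothetical. On dplyr >= 1.0.0, slice_max() should give the same result without the separate arrange() step:

```r
library(dplyr)

# Hypothetical purchase data, for illustration only
customer_purchase_data <- data.frame(
  Customer_ID   = c(1, 1, 2, 2, 3),
  Purchase_Date = as.Date(c("2020-01-05", "2020-03-10", "2020-02-01",
                            "2020-01-15", "2020-04-20")),
  Amount        = c(10, 25, 40, 5, 60)
)

# Sort newest first, then keep the first row per customer
latest <- customer_purchase_data %>%
  arrange(desc(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  slice(1) %>%
  ungroup()

# Alternative for dplyr >= 1.0.0: pick the row with the latest date directly
latest_alt <- customer_purchase_data %>%
  group_by(Customer_ID) %>%
  slice_max(Purchase_Date, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Setting with_ties = FALSE guarantees exactly one row per customer even if two purchases share the same date.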
Source: Stackoverflow.com