[r] Find duplicate values in R

I have a table with 21638 unique* rows:

vocabulary <- read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Vocabulary.txt", header=T)

This table has five columns, the first of which holds the respondent ID numbers. I want to check if any respondents appear twice, or if all respondents are unique.

To count unique IDs I can use

length(unique(vocabulary$id))

and to check if there are any duplicates I might do

length(unique(vocabulary$id)) == nrow(vocabulary)

which returns TRUE if there are no duplicates (which there aren't).

My question:

Is there a direct way to return the values or line numbers of duplicates?

Some further explanation:

There is an interpretation problem with using the function duplicated(), because it only returns the duplicates in the strict sense, excluding the "originals". For example, sum(duplicated(vocabulary$id)) or dim(vocabulary[duplicated(vocabulary$id),])[1] might return "5" as the number of duplicate rows. The problem is that if you only know the number of duplicates, you won't know how many rows they duplicate. Does "5" mean that there are five rows with one duplicate each, or that there is one row with five duplicates? And since you won't have the IDs or line numbers of the duplicates, you won't have any means of finding the "originals".
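To make the ambiguity concrete, here is a pair of made-up ID vectors (hypothetical, not from the survey) that both yield "5" from sum(duplicated()) even though their duplication patterns are completely different:

# Hypothetical IDs, for illustration only
x1 <- c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1)  # five IDs, each appearing twice
x2 <- c(1, 2, 3, 4, 5, 1, 1, 1, 1, 1)  # one ID (1) appearing six times

sum(duplicated(x1))
## [1] 5
sum(duplicated(x2))
## [1] 5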


*I know there are no duplicate IDs in this survey, but it is a good example, because using any of the answers given elsewhere to this question, such as duplicated(vocabulary$id) or table(vocabulary$id), will output a haystack to your screen in which you'll be quite unable to find any rare duplicate needles.


The answer is


A terser way, either with rev:

x[!(!duplicated(x) & rev(!duplicated(rev(x))))]

... rather than fromLast:

x[!(!duplicated(x) & !duplicated(x, fromLast = TRUE))]

... and as a helper function that returns either the logical vector or the matching elements from the original vector:

duplicates <- function(x, as.bool = FALSE) {
    # TRUE wherever the value of x occurs more than once, originals included
    is.dup <- !(!duplicated(x) & rev(!duplicated(rev(x))))
    if (as.bool) { is.dup } else { x[is.dup] }
}
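For example, on a small made-up vector (hypothetical IDs, not the survey data):

x <- c("a", "b", "b", "c", "c", "c")

duplicates(x)
## [1] "b" "b" "c" "c" "c"
duplicates(x, as.bool = TRUE)
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
which(duplicates(x, as.bool = TRUE))  # the "line numbers" of all duplicates
## [1] 2 3 4 5 6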

Treating vectors as data frames just to pass them to table is handy but can get difficult to read, and the data.table solution is fine, but I'd prefer base R solutions when dealing with simple vectors like IDs.


This will give you duplicate rows:

vocabulary[duplicated(vocabulary$id),]

This will give you the number of duplicates:

dim(vocabulary[duplicated(vocabulary$id),])[1]

Example:

vocabulary2 <- rbind(vocabulary, vocabulary[1,]) # creates a duplicate at the end
vocabulary2[duplicated(vocabulary2$id),]
#            id year    sex education vocabulary
#21639 20040001 2004 Female         9          3
dim(vocabulary2[duplicated(vocabulary2$id),])[1]
#[1] 1 #=1 duplicate

EDIT

OK, with the additional information, here's what you should do: duplicated() has a fromLast option which lets it flag duplicates counting from the end. If you combine this with the normal duplicated(), you get all duplicates, originals included. The following example adds duplicates to the original vocabulary object (line 1 is duplicated twice and line 5 is duplicated once). I then use table() to count the total number of occurrences of each duplicated ID.

#Create vocabulary object with duplicates
voc.dups <- rbind(vocabulary, vocabulary[1,], vocabulary[1,], vocabulary[5,])

#List duplicates
dups <- voc.dups[duplicated(voc.dups$id) | duplicated(voc.dups$id, fromLast=TRUE),]
dups
#            id year    sex education vocabulary
#1     20040001 2004 Female         9          3
#5     20040008 2004   Male        14          1
#21639 20040001 2004 Female         9          3
#21640 20040001 2004 Female         9          3
#51000 20040008 2004   Male        14          1

#Count duplicates by id
table(dups$id)
#20040001 20040008 
#       3        2 
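If you also want the line numbers of the duplicates rather than the rows themselves (as the question asks), wrapping the same logical test in which() should do it; a small sketch using the voc.dups object from above:

#Positions (row indices) of all duplicated ids, originals included
which(duplicated(voc.dups$id) | duplicated(voc.dups$id, fromLast=TRUE))
#[1]     1     5 21639 21640 21641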

Here's a data.table solution that will list the duplicates along with the number of duplications (will be 1 if there are 2 copies, and so on - you can adjust that to suit your needs):

library(data.table)
dt = data.table(vocabulary)

dt[duplicated(id), cbind(.SD[1], number = .N), by = id]
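To sketch what that returns, here is a tiny made-up table (the ex object and its val column are hypothetical, not the survey data); id "b" has two copies and "c" has three, so number comes out as 1 and 2:

library(data.table)
ex <- data.table(id = c("a","b","b","c","c","c"), val = 1:6)
ex[duplicated(id), cbind(.SD[1], number = .N), by = id]
# id "b": val 3, number 1
# id "c": val 5, number 2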

Here I summarize a few ways which may return different results, so be careful which one you use:

# First assign your "id"s to an R object.
# Here's a hypothetical example:
id <- c("a","b","b","c","c","c","d","d","d","d")

#To return ALL MINUS ONE duplicated values:
id[duplicated(id)]
## [1] "b" "c" "c" "d" "d" "d"

#To return ALL duplicated values by specifying fromLast argument:
id[duplicated(id) | duplicated(id, fromLast=TRUE)]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"

#Yet another way to return ALL duplicated values, using %in% operator:
id[ id %in% id[duplicated(id)] ]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"

Hope these help. Good luck.