Drop unused factor levels in a subsetted data frame

Question

I have a data frame containing a factor. When I create a subset of this data frame using subset() or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels, even if they do not exist in the new data frame. This causes problems when doing faceted plotting or using functions that rely on factor levels.

What is the most succinct way to remove levels from a factor in the new data frame?

Here's an example:

    df <- data.frame(letters = letters[1:5],
                     numbers = seq(1:5))

    levels(df$letters)
    ## [1] "a" "b" "c" "d" "e"

    subdf <- subset(df, numbers <= 3)
    ##   letters numbers
    ## 1       a       1
    ## 2       b       2
    ## 3       c       3

    # all levels are still there!
    levels(subdf$letters)
    ## [1] "a" "b" "c" "d" "e"
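To make the symptom concrete, here is a small sketch of how the leftover levels surface in a function that relies on them, using table() as the example. It rebuilds the question's df with stringsAsFactors = TRUE, since data.frame() no longer converts strings to factors by default from R 4.0.0 onward; the variable name counts is introduced only for this illustration.

```r
# Rebuild the question's data with an explicit factor column
df <- data.frame(letters = letters[1:5], numbers = 1:5,
                 stringsAsFactors = TRUE)
subdf <- subset(df, numbers <= 3)

# The unused levels "d" and "e" still get (empty) cells in the tabulation
counts <- table(subdf$letters)
counts

# nlevels() still reports all original levels,
# although fewer distinct values remain in the subset
nlevels(subdf$letters)
length(unique(subdf$letters))
```

The same empty categories are what show up as empty facets or empty bars in plots.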

User · Answer

Here is a way of doing that:

    varFactor <- factor(letters[1:15])
    varFactor <- varFactor[1:5]
    varFactor <- varFactor[drop=T]

User · Answer

For the sake of completeness, now there is also fct_drop in the forcats package: http://forcats.tidyverse.org/reference/fct_drop.html.

It differs from droplevels in the way it deals with NA:

    f <- factor(c("a", "b", NA), exclude = NULL)

    droplevels(f)
    # [1] a    b    <NA>
    # Levels: a b <NA>

    forcats::fct_drop(f)
    # [1] a    b    <NA>
    # Levels: a b

User · Answer

This is obnoxious. This is how I usually do it, to avoid loading other packages:

    levels(subdf$letters) <- c("a", "b", "c", NA, NA)

which gets you:

    > subdf$letters
    [1] a b c
    Levels: a b c

Note that the new levels will replace whatever occupies their index in the old levels(subdf$letters), so something like:

    levels(subdf$letters) <- c(NA, "a", "c", NA, "b")

won't work.

This is obviously not ideal when you have lots of levels, but for a few, it's quick and easy.
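The positional behaviour described in this answer can be checked with a short sketch. The factor f below is a hypothetical stand-in for subdf$letters (three values, five levels), and the names ok and bad are introduced only for the illustration.

```r
# Hypothetical stand-in for subdf$letters: three values, five levels
f <- factor(c("a", "b", "c"), levels = c("a", "b", "c", "d", "e"))

# Positional replacement: NA in slots 4 and 5 removes "d" and "e"
ok <- f
levels(ok) <- c("a", "b", "c", NA, NA)
levels(ok)
as.character(ok)

# Reordered vector: old level 1 ("a") is mapped to NA,
# old level 2 ("b") is renamed to "a", old level 3 ("c") stays "c"
bad <- f
levels(bad) <- c(NA, "a", "c", NA, "b")
as.character(bad)
```

In other words, levels<- renames levels by position rather than matching them by name, so the replacement vector has to follow the order of the existing levels.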

User · Answer

If you don't want this behaviour, don't use factors, use character vectors instead. I think this makes more sense than patching things up afterwards. Try the following before loading your data with read.table or read.csv:

    options(stringsAsFactors = FALSE)

The disadvantage is that you're restricted to alphabetical ordering. (reorder is your friend for plots.)
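As a rough sketch of the character-vector route, the column can stay as plain strings while subsetting and only be converted to a factor at the point of use. Passing stringsAsFactors = FALSE directly to data.frame() is used here instead of the global option (it is also the default from R 4.0.0 onward), and plot_order is a hypothetical ordering chosen purely for illustration.

```r
# Keep the column as character while subsetting
df <- data.frame(letters = letters[1:5], numbers = 1:5,
                 stringsAsFactors = FALSE)
subdf <- subset(df, numbers <= 3)

# A character column carries no levels, so nothing is left over
unique(subdf$letters)

# Convert to a factor only when an explicit, non-alphabetical order is needed
plot_order <- c("c", "a", "b")  # hypothetical ordering, for illustration only
subdf$letters <- factor(subdf$letters, levels = plot_order)
levels(subdf$letters)
```

Converting late with explicit levels is one way around the alphabetical-ordering limitation mentioned above; reorder() is another when the order should follow a numeric variable.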

User · Answer

It is a known issue, and one possible remedy is provided by drop.levels() in the gdata package, where your example becomes

    > drop.levels(subdf)
      letters numbers
    1       a       1
    2       b       2
    3       c       3
    > levels(drop.levels(subdf)$letters)
    [1] "a" "b" "c"

There is also the dropUnusedLevels function in the Hmisc package. However, it only works by altering the subset operator [ and is not applicable here.

As a corollary, a direct approach on a per-column basis is a simple as.factor(as.character(data)):

    > levels(subdf$letters)
    [1] "a" "b" "c" "d" "e"
    > subdf$letters <- as.factor(as.character(subdf$letters))
    > levels(subdf$letters)
    [1] "a" "b" "c"

User · Answer

Looking at the droplevels methods code in the R source, you can see it wraps the factor function. That means you can basically recreate the column with the factor function.
Below is the data.table way to drop levels from all the factor columns.

    library(data.table)
    dt = data.table(letters = factor(letters[1:5]), numbers = seq(1:5))
    levels(dt$letters)
    #[1] "a" "b" "c" "d" "e"
    subdt = dt[numbers <= 3]
    levels(subdt$letters)
    #[1] "a" "b" "c" "d" "e"

    upd.cols = sapply(subdt, is.factor)
    subdt[, names(subdt)[upd.cols] := lapply(.SD, factor), .SDcols = upd.cols]
    levels(subdt$letters)
    #[1] "a" "b" "c"

User · Answer

Very interesting thread. I especially liked the idea of simply applying factor() to the subselection again. I had a similar problem before, and I just converted to character and then back to factor.

    df <- data.frame(letters = letters[1:5], numbers = seq(1:5))
    levels(df$letters)
    ## [1] "a" "b" "c" "d" "e"
    # note the trailing comma: the condition selects rows, not columns
    subdf <- df[df$numbers <= 3, ]
    subdf$letters <- factor(as.character(subdf$letters))

User · Answer

Another way of doing the same, but with dplyr:

    library(dplyr)
    subdf <- df %>% filter(numbers <= 3) %>% droplevels()
    str(subdf)

Edit:

Also works! Thanks to agenis

    subdf <- df %>% filter(numbers <= 3) %>% droplevels
    levels(subdf$letters)

User · Answer

I wrote utility functions to do this. Now that I know about gdata's drop.levels, it looks pretty similar. Here they are:

    present_levels <- function(x) intersect(levels(x), x)

    trim_levels <- function(...) UseMethod("trim_levels")

    trim_levels.factor <- function(x) factor(x, levels = present_levels(x))

    trim_levels.data.frame <- function(x) {
      for (n in names(x))
        if (is.factor(x[, n]))
          x[, n] = trim_levels(x[, n])
      x
    }

User · Answer

I have tried most, if not all, of the examples here, but none seem to work in my case. After struggling for quite some time, I tried using as.character() on the factor column to turn it into a column of strings, which seems to be working just fine.

I'm not sure about the performance implications.
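A minimal sketch of what this answer describes, assuming the question's df (rebuilt here with stringsAsFactors = TRUE so that letters is a factor under R 4.0.0 and later):

```r
df <- data.frame(letters = letters[1:5], numbers = 1:5,
                 stringsAsFactors = TRUE)
subdf <- df[df$numbers <= 3, ]

# Replace the factor column with plain strings; the level set is discarded
subdf$letters <- as.character(subdf$letters)

is.factor(subdf$letters)   # no longer a factor
levels(subdf$letters)      # a character vector has no levels attribute
```

Note that this leaves the column as character; code that needs a factor afterwards would have to wrap it in factor() again.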

User · Answer

A genuine droplevels function that is much faster than droplevels and does not perform any kind of unnecessary matching or tabulation of values is collapse::fdroplevels. Example:

    library(collapse)
    library(microbenchmark)

    # wlddev data supplied in collapse, iso3c is a factor
    data <- fsubset(wlddev, iso3c %!in% "USA")

    microbenchmark(fdroplevels(data), droplevels(data), unit = "relative")
    ## Unit: relative
    ##               expr  min       lq     mean   median       uq      max neval cld
    ##  fdroplevels(data)  1.0  1.00000  1.00000  1.00000  1.00000  1.00000   100  a
    ##   droplevels(data) 30.2 29.15873 24.54175 24.86147 22.11553 14.23274   100   b

User · Answer

Unfortunately factor() doesn't seem to work when using rxDataStep of RevoScaleR. I do it in two steps:

1) Convert to character and store in a temporary external data frame (.xdf).
2) Convert back to factor and store in the definitive external data frame.

This eliminates any unused factor levels, without loading all the data into memory.

    # Step 1) Converts to character, in temporary xdf file:
    rxDataStep(inData = "input.xdf", outFile = "temp.xdf",
               transforms = list(VAR_X = as.character(VAR_X)), overwrite = T)
    # Step 2) Converts back to factor:
    rxDataStep(inData = "temp.xdf", outFile = "output.xdf",
               transforms = list(VAR_X = as.factor(VAR_X)), overwrite = T)

User · Answer

Here's another way, which I believe is equivalent to the factor(..) approach:

    > df <- data.frame(let = letters[1:5], num = 1:5)
    > subdf <- df[df$num <= 3, ]

    > subdf$let <- subdf$let[, drop = TRUE]

    > levels(subdf$let)
    [1] "a" "b" "c"

User · Answer

Since R version 2.12, there's a droplevels() function.

    levels(droplevels(subdf$letters))
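droplevels() also has a data frame method, so it can be applied to the whole subset at once. The sketch below assumes the question's df with one extra factor column, group, which is hypothetical and added only to show the except argument (the indices of columns whose levels should be left alone).

```r
# Question's data plus a hypothetical second factor column, `group`
df <- data.frame(letters = letters[1:5],
                 group   = c("x", "x", "x", "y", "y"),
                 numbers = 1:5,
                 stringsAsFactors = TRUE)
subdf <- subset(df, numbers <= 3)

# Drop unused levels from every factor column in one call
cleaned <- droplevels(subdf)
levels(cleaned$letters)
levels(cleaned$group)

# except = 2 leaves the levels of the second column (group) untouched
partial <- droplevels(subdf, except = 2)
levels(partial$letters)
levels(partial$group)
```

Non-factor columns such as numbers pass through unchanged.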

User · Answer

All you should have to do is to apply factor() to your variable again after subsetting:

    > subdf$letters
    [1] a b c
    Levels: a b c d e
    subdf$letters <- factor(subdf$letters)
    > subdf$letters
    [1] a b c
    Levels: a b c

EDIT

From the factor page example:

    factor(ff)      # drops the levels that do not occur

For dropping levels from all factor columns in a data frame, you can use:

    subdf <- subset(df, numbers <= 3)
    subdf[] <- lapply(subdf, function(x) if (is.factor(x)) factor(x) else x)
