Remove columns from dataframe where ALL values are NA

Question

I'm having trouble with a data frame and couldn't really resolve the issue myself. The data frame has arbitrary properties as columns, and each row represents one data set. The question is: how do I get rid of columns where the value is NA for ALL rows?

User · Answer

The two approaches offered thus far fail with large data sets as (amongst other memory issues) they create is.na(df), which will be an object the same size as df.

Here are two approaches that are more memory- and time-efficient.

An approach using Filter

Filter(function(x)!all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency)

library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]

Examples using large data (30 columns, 1e6 rows):

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8,NA),1e6,T), sample(250,1e6,T)),simplify=F)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)

system.time({df1 <- bd[,colSums(is.na(bd)) < nrow(bd)]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user  system elapsed 
## 0.26    0.03    0.29 
system.time({DT1 <- DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]})
## user  system elapsed 
## 0.14    0.03    0.18
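
To check the Filter approach on something small enough to inspect by eye, here is a minimal sketch. The data frame `toy` and its column names are made up for illustration and do not come from the benchmark above.

```r
# Hypothetical toy data: 'empty' is entirely NA, 'partial' has a single NA
toy <- data.frame(
  id      = 1:4,
  empty   = NA,
  partial = c(2.5, NA, 1.0, 3.2)
)

# Keep a column unless every one of its values is NA.
# Filter() tests one column at a time, so no full-size is.na(toy) matrix is built.
kept <- Filter(function(x) !all(is.na(x)), toy)

names(kept)  # "id" "partial"
```

The partially missing column survives; only the column with no observed values at all is dropped.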

User · Answer

Another option, with the purrr package:

library(dplyr)
df <- data.frame(a = NA,
                 b = seq(1:5),
                 c = c(rep(1, 4), NA))

df %>% purrr::discard(~all(is.na(.)))
df %>% purrr::keep(~!all(is.na(.)))

User · Answer

Another way would be to use the apply() function.

If you have the data frame

df <- data.frame(var1 = c(1:7, NA),
                 var2 = c(1, 2, 1, 3, 4, NA, NA, 9),
                 var3 = c(NA))

then you can use apply() to see which columns fulfill your condition, and so you can simply do the same subsetting as in the answer by Musa, only with an apply approach.

> !apply(is.na(df), 2, all)
 var1  var2  var3
 TRUE  TRUE FALSE

> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9

User · Answer

df[sapply(df, function(x) all(is.na(x)))] <- NULL
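
One edge case worth knowing about with any all(is.na(x)) test: on a data frame with zero rows, every column counts as "all NA", because all() of an empty logical vector is TRUE. The sketch below illustrates this with a made-up data frame `no_rows`; it uses vapply() instead of sapply() purely so that the result is guaranteed to be a logical vector, which is a choice made here and not part of the answer above.

```r
# Hypothetical zero-row data frame with two columns
no_rows <- data.frame(a = integer(0), b = character(0))

# vapply() enforces one logical value per column
all_na <- vapply(no_rows, function(x) all(is.na(x)), logical(1))

all_na           # both TRUE, since all(logical(0)) is TRUE

# Non-destructive variant: keep the columns that are not all NA
result <- no_rows[!all_na]
ncol(result)     # 0
```

If empty inputs are possible, it may be worth guarding with a check on nrow() before removing columns.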

User · Answer

Update

You can now use select with the where selection helper. select_if is superseded, but still functional as of dplyr 1.0.2. (Thanks to @mcstrother for bringing this to attention.)

library(dplyr)
temp <- data.frame(x = 1:5, y = c(1, 2, NA, 4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select(where(not_all_na))
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select(where(not_any_na))
  x
1 1
2 2
3 3
4 4
5 5

Old Answer

dplyr now has a select_if verb that may be helpful here:

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5

User · Answer

Try this:

df <- df[,colSums(is.na(df))<nrow(df)]
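
One caveat with matrix-style subsetting by a colSums condition: if only a single column survives, base R simplifies the result to a plain vector. A minimal sketch, using a made-up data frame `d`, of how drop = FALSE preserves the data frame:

```r
# Hypothetical data: only column 'a' has any observed values
d <- data.frame(a = c(1, NA, 3), b = NA, c = NA)

# TRUE for columns that have at least one non-NA value
keep <- colSums(is.na(d)) < nrow(d)

as_vector <- d[, keep]                 # collapses to a numeric vector
as_frame  <- d[, keep, drop = FALSE]   # stays a one-column data frame

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE
```

Adding drop = FALSE costs nothing when several columns remain, so it is a reasonable default inside functions.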

User · Answer

Late to the game, but you can also use the janitor package. This function will remove columns which are all NA, and can be changed to remove rows that are all NA as well:

df <- janitor::remove_empty(df, which = "cols")

User · Answer

A handy base R option could be colMeans():

df[, colMeans(is.na(df)) != 1]
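
Because colMeans(is.na(df)) is the share of missing values per column, the same idea extends to other cutoffs. The sketch below is illustrative only: the data frame `sparse`, its column names, and the 0.5 cutoff are assumptions, not part of the answer.

```r
# Hypothetical data: 0%, 50% and 100% missing values
sparse <- data.frame(
  full  = 1:4,
  half  = c(1, NA, NA, 4),
  empty = NA_real_
)

# Share of NA values per column
na_share <- colMeans(is.na(sparse))

# Drop only the columns that are entirely NA
only_all_na_removed <- sparse[, na_share != 1, drop = FALSE]

# Stricter: keep only columns with less than half of their values missing
mostly_complete <- sparse[, na_share < 0.5, drop = FALSE]

names(only_all_na_removed)  # "full" "half"
names(mostly_complete)      # "full"
```

Like the colSums variant, this still builds the full is.na() matrix, so the memory concern raised for large data applies here too.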

User · Answer

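You can use remove_empty from the janitor package:

library(janitor)

df %>%
  remove_empty(c("rows", "cols")) # select either rows or cols or both

Also, another dplyr approach:

library(dplyr)
df %>% select_if(~all(!is.na(.)))

OR

df %>% select_if(colSums(!is.na(.)) == nrow(df))

Note that, as written, both of these select_if calls keep only the columns that contain no NA at all; to drop only the columns that are entirely NA, use select_if(~!all(is.na(.))).

This is also useful if you want to exclude or keep only columns with a certain number of missing values, e.g.

df %>% select_if(colSums(!is.na(.)) > 500)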

User · Answer

I hope this may also help. It could be made into a single command, but I found it easier to read when divided into two commands. I made a function with the following instructions, and it worked lightning fast.

naColsRemoval = function(DataTable) {
  # positions of the columns in which every value is NA
  na.cols = DataTable[, .(which(apply(is.na(.SD), 2, all)))]
  # remove those columns by reference; the parentheses replace the older with = F form
  DataTable[, (unlist(na.cols)) := NULL]
}

.SD will allow to limit the verification to part of the table, if you wish, but it will take the whole table as …

User · Answer

From my experience of having trouble applying previous answers, I have found that I needed to modify their approach in order to achieve what the question here asks: how to get rid of columns where for ALL rows the value is NA?

First, note that my solution will only work if you do not have duplicate columns (that issue is dealt with elsewhere on Stack Overflow).

Second, it uses dplyr.

Instead of

df <- df %>% select_if(~all(!is.na(.)))

I find that what works is

df <- df %>% select_if(~!all(is.na(.)))

The point is that the "not" symbol "!" needs to be on the outside of the universal quantifier. That is, the select_if operator acts on columns, and in this case it selects only those that do not satisfy the criterion "every element is equal to NA".

User · Answer

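janitor::remove_constant() does this very nicely. Note that it also drops columns whose values are all identical, not only columns that are all NA.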
