[r] Remove columns from dataframe where ALL values are NA

I'm having trouble with a data frame and couldn't really resolve that issue myself:
The dataframe has arbitrary properties as columns and each row represents one data set.

The question is:
How to get rid of columns where for ALL rows the value is NA?

The answer is


Try this:

df <- df[,colSums(is.na(df))<nrow(df)]
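One caveat with the one-liner above: if only a single column survives, `[` with a comma simplifies the result to a plain vector. A minimal sketch that guards against this with drop = FALSE, assuming a small made-up data frame (the names df_demo, keep and res_demo are illustrative, not from the question):

```r
# Hypothetical example data: only column `b` has any non-NA value.
df_demo <- data.frame(a = NA, b = 1:3, c = NA_real_)

# TRUE for every column that contains at least one non-NA value.
keep <- colSums(is.na(df_demo)) < nrow(df_demo)

# drop = FALSE keeps the result a data frame even when one column remains.
res_demo <- df_demo[, keep, drop = FALSE]
print(res_demo)
```

Without drop = FALSE, the same subsetting on this data would return the integer vector 1:3 rather than a one-column data frame.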

Another option, using the purrr package:

library(dplyr)

df <- data.frame(a = NA,
                 b = seq(1:5), 
                 c = c(rep(1, 4), NA))

df %>% purrr::discard(~all(is.na(.)))
df %>% purrr::keep(~!all(is.na(.)))

Another way would be to use the apply() function.

If you have the data.frame

df <- data.frame (var1 = c(1:7,NA),
                  var2 = c(1,2,1,3,4,NA,NA,9),
                  var3 = c(NA)
                  )

then you can use apply() to see which columns fulfill your condition, and then do the same subsetting as in the answer by Musa, only with an apply approach.

> !apply (is.na(df), 2, all)
 var1  var2  var3 
 TRUE  TRUE FALSE 

> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9

I hope this may also help. It could be made into a single command, but I found it easier to read when split into two commands. I wrapped the following instructions in a function, and it ran lightning fast.

naColsRemoval = function (DataTable) {
  # Positions of the columns whose values are all NA
  na.cols = DataTable[, .(which(apply(is.na(.SD), 2, all)))]
  # Remove those columns by reference; the parenthesised left-hand side
  # replaces the deprecated combination of `:=` and `with = F`
  DataTable[, (unlist(na.cols, use.names = FALSE)) := NULL]
}

.SD will allow to limit the verification to part of the table, if you wish, but it will take the whole table as


You can use remove_empty() from the janitor package:

library(janitor)

df %>%
  remove_empty(c("rows", "cols")) # choose "rows", "cols", or both

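Another dplyr approach. Note that this predicate keeps only the columns that contain no NA at all, so it also drops columns that are only partly NA:

library(dplyr)
df %>% select_if(~all(!is.na(.)))

or

df %>% select_if(colSums(!is.na(.)) == nrow(df))

The colSums() form is also useful if you want to keep only the columns with more than a certain number of non-missing values, e.g.

df %>% select_if(colSums(!is.na(.)) > 500)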

A handy base R option could be colMeans():

df[, colMeans(is.na(df)) != 1]
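Since colMeans(is.na(df)) is the fraction of missing values in each column, the same idea extends to a cutoff other than 1. A rough sketch, where the data frame df_demo and the cutoff max_na_frac are assumptions chosen for illustration only:

```r
# Hypothetical example data with 0%, 75% and 100% missing values.
df_demo <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(NA, NA, NA, 4),
  z = c(NA, NA, NA, NA)
)

# Fraction of NA values per column.
na_frac <- colMeans(is.na(df_demo))

# Assumed cutoff: keep columns with at most half of their values missing.
max_na_frac <- 0.5
kept <- df_demo[, na_frac <= max_na_frac, drop = FALSE]
print(names(kept))
```

Setting max_na_frac just below 1 reproduces the all-NA filter from the answer above.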

janitor::remove_constant() also does this nicely, but note that it removes every constant column, not only the ones that are all NA.


df[sapply(df, function(x) all(is.na(x)))] <- NULL

Late to the game but you can also use the janitor package. This function will remove columns which are all NA, and can be changed to remove rows that are all NA as well.

df <- janitor::remove_empty(df, which = "cols")


The two approaches offered thus far fail with large data sets as (amongst other memory issues) they create is.na(df), which will be an object the same size as df.

Here are two approaches that are more memory and time efficient:

An approach using Filter

Filter(function(x)!all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency)

library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]

Examples using large data (30 columns, 1e6 rows):

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8,NA),1e6,T), sample(250,1e6,T)),simplify=F)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)

system.time({df1 <- bd[,colSums(is.na(bd)) < nrow(bd)]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user  system elapsed 
## 0.26    0.03    0.29 
system.time({DT1 <- DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]})
## user  system elapsed 
## 0.14    0.03    0.18 
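
The column-wise versions scale better because they test one column at a time instead of first building is.na() for the whole table. A sketch of the same idea with vapply, wrapped in a helper whose name drop_all_na_cols is made up here and does not come from any package:

```r
# Hypothetical helper: drop the columns in which every value is NA.
drop_all_na_cols <- function(d) {
  # One logical per column; is.na() is evaluated on a single column at a time.
  keep <- vapply(d, function(x) !all(is.na(x)), logical(1))
  # drop = FALSE keeps a data frame even if only one column is left.
  d[, keep, drop = FALSE]
}

# Small made-up data frame, only to show the call.
small <- data.frame(X1 = rep(NA, 4), X2 = c(1, NA, 3, 4), X3 = 1:4)
out <- drop_all_na_cols(small)
print(names(out))  # "X2" "X3"
```

vapply is used rather than sapply so that the result is always a logical vector, including for a data frame with zero columns.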

After having trouble applying the previous answers, I found that I needed to modify their approach to achieve what the question asks:

How to get rid of columns where for ALL rows the value is NA?

First, note that my solution will only work if you do not have duplicate columns (that issue is dealt with elsewhere on Stack Overflow).

Second, it uses dplyr.

Instead of

df <- df %>% select_if(~all(!is.na(.)))

I find that what works is

df <- df %>% select_if(~!all(is.na(.)))

The point is that the "not" operator "!" needs to be outside the universal quantifier all(). select_if acts on columns, so in this case it selects only those that do not satisfy the criterion

every element is equal to "NA"
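
The difference between the two predicates can be checked on plain vectors. A small illustration in base R, using made-up vectors whose names are purely illustrative:

```r
# Three hypothetical columns: partly NA, entirely NA, and complete.
cols <- list(
  some_na = c(1, NA, 3),
  all_na  = c(NA, NA, NA),
  no_na   = c(1, 2, 3)
)

# all(!is.na(x)): TRUE only when the column has no NA at all.
strict <- vapply(cols, function(x) all(!is.na(x)), logical(1))

# !all(is.na(x)): TRUE when at least one value is not NA.
lenient <- vapply(cols, function(x) !all(is.na(x)), logical(1))

print(strict)   # some_na FALSE, all_na FALSE, no_na TRUE
print(lenient)  # some_na TRUE,  all_na FALSE, no_na TRUE
```

Only the second predicate keeps the partly-NA column, which is what the question asks for.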


Update

You can now use select with the where selection helper. select_if is superseded, but still functional as of dplyr 1.0.2. (thanks to @mcstrother for bringing this to attention).

library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select(where(not_all_na))
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select(where(not_any_na))
  x
1 1
2 2
3 3
4 4
5 5

Old Answer

dplyr now has a select_if verb that may be helpful here:

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5
