[r] Elegant way to report missing values in a data.frame

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:

for (Var in names(airquality)) {
    missing <- sum(is.na(airquality[,Var]))
    if (missing > 0) {
        print(c(Var,missing))
    }
}

Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.

This question is related to r dataframe missing-data

The answer is


Another graphical and interactive way is to use is.na10 function from heatmaply library:

library(heatmaply)

heatmaply(is.na10(airquality), grid_gap = 1, 
          showticklabels = c(T,F),
            k_col =3, k_row = 3,
            margins = c(55, 30), 
            colors = c("grey80", "grey20"))

enter image description here

Probably won't work well with large datasets..


More succinct-: sum(is.na(x[1]))

That is

  1. x[1] Look at the first column

  2. is.na() true if it's NA

  3. sum() TRUE is 1, FALSE is 0


We can use map_df with purrr.

library(mice)
library(purrr)

# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R  Wind  Temp Month   Day
# <int>   <int> <int> <int> <int> <int>
# 1    37       7     0     0     0     0

For one more graphical solution, visdat package offers vis_miss.

library(visdat)
vis_miss(airquality)

enter image description here

Very similar to Amelia output with a small difference of giving %s on missings out of the box.


My new favourite for (not too wide) data are methods from excellent naniar package. Not only you get frequencies but also patterns of missingness:

library(naniar)
library(UpSetR)

riskfactors %>%
  as_shadow_upset() %>%
  upset()

enter image description here

It's often useful to see where the missings are in relation to non missing which can be achieved by plotting scatter plot with missings:

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
 geom_miss_point()

enter image description here

Or for categorical variables:

gg_miss_fct(x = riskfactors, fct = marital)

enter image description here

These examples are from package vignette that lists other interesting visualizations.


I think the Amelia library does a nice job in handling missing data also includes a map for visualizing the missing rows.

install.packages("Amelia")
library(Amelia)
missmap(airquality)

enter image description here

You can also run the following code will return the logic values of na

row.has.na <- apply(training, 1, function(x){any(is.na(x))})

summary(airquality)

already gives you this information

The VIM packages also offers some nice missing data plot for data.frame

library("VIM")
aggr(airquality)

enter image description here


ExPanDaR’s package function prepare_missing_values_graph can be used to explore panel data:

enter image description here


Another graphical alternative - plot_missing function from excellent DataExplorer package:

enter image description here

Docs also points out to the fact that you can save this results for additional analysis with missing_data <- plot_missing(data).


If you want to do it for particular column, then you can also use this

length(which(is.na(airquality[1])==T))

A dplyr solution to get the count could be:

summarise_all(df, ~sum(is.na(.)))

Or to get a percentage:

summarise_all(df, ~(sum(is_missing(.) / nrow(df))))

Maybe also worth noting that missing data can be ugly, inconsistent, and not always coded as NA depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:

is_missing <- function(x){
  missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
  ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
         ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
}

# sample ugly data
df <- data.frame(a = c(NA, '1', '  ', 'missing'),
                 b = c(0, 2, NaN, 4),
                 c = c('NA', 'b', '-9', 'null'),
                 d = 1:4,
                 e = c(1, Inf, -Inf, 0))

# counts:
> summarise_all(df, ~sum(is_missing(.)))
  a b c d e
1 3 1 3 0 2

# percentage:
> summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
     a    b    c d   e
1 0.75 0.25 0.75 0 0.5

Another function that would help you look at missing data would be df_status from funModeling library

library(funModeling)

iris.2 is the iris dataset with some added NAs.You can replace this with your dataset.

df_status(iris.2)

This will give you the number and percentage of NAs in each column.


Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to missing-data

Replace missing values with column mean How to lowercase a pandas dataframe string column if it has missing values? Delete rows with blank values in one particular column Elegant way to report missing values in a data.frame How do I replace NA values with zeros in an R dataframe? Remove NA values from a vector Remove rows with all or some NAs (missing values) in data.frame