Elegant way to report missing values in a data.frame

Question

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:

    for (Var in names(airquality)) {
        missing <- sum(is.na(airquality[, Var]))
        if (missing > 0) {
            print(c(Var, missing))
        }
    }

Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.

User · Answer

If you want to do it for a particular column, then you can also use this:

    length(which(is.na(airquality[1]) == T))

User · Answer

summary(airquality) already gives you this information. The VIM package also offers some nice missing data plots for data frames:

    library("VIM")
    aggr(airquality)

User · Answer

The ExPanDaR package's function prepare_missing_values_graph() can be used to explore panel data.

User · Answer

Another function that would help you look at missing data is df_status from the funModeling library:

    library(funModeling)

    # iris.2 is the iris dataset with some added NAs.
    # You can replace this with your dataset.
    df_status(iris.2)

This will give you the number and percentage of NAs in each column.

User · Answer

Another graphical and interactive way is to use the is.na10 function from the heatmaply library:

    library(heatmaply)

    heatmaply(is.na10(airquality), grid_gap = 1,
              showticklabels = c(T, F),
              k_col = 3, k_row = 3,
              margins = c(55, 30),
              colors = c("grey80", "grey20"))

This probably won't work well with large datasets.

User · Answer

We can use map_df with purrr:

    library(mice)
    library(purrr)

    # map_df with purrr
    map_df(airquality, function(x) sum(is.na(x)))
    # A tibble: 1 x 6
    #   Ozone Solar.R  Wind  Temp Month   Day
    #   <int>   <int> <int> <int> <int> <int>
    # 1    37       7     0     0     0     0

User · Answer

Another graphical alternative is the plot_missing function from the excellent DataExplorer package. The docs also point out that you can save the results for additional analysis:

    missing_data <- plot_missing(data)

User · Answer

Just use sapply:

    > sapply(airquality, function(x) sum(is.na(x)))
      Ozone Solar.R    Wind    Temp   Month     Day
         37       7       0       0       0       0

You could also use apply or colSums on the matrix created by is.na():

    > apply(is.na(airquality), 2, sum)
      Ozone Solar.R    Wind    Temp   Month     Day
         37       7       0       0       0       0
    > colSums(is.na(airquality))
      Ozone Solar.R    Wind    Temp   Month     Day
         37       7       0       0       0       0
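
Since the question asks for only the variables that actually have missing values, returned as a data frame, the colSums result above can be filtered and reshaped. This is a minimal base R sketch, not a definitive implementation; the name na_report and its column names are assumptions chosen for illustration.

```r
# Count NAs per column, then keep only the columns that have at least one.
na_counts <- colSums(is.na(airquality))
na_counts <- na_counts[na_counts > 0]

# Hypothetical report layout: one row per variable with missing values.
na_report <- data.frame(
  variable  = names(na_counts),
  missing   = unname(na_counts),
  row.names = NULL
)

print(na_report)
```

For airquality this leaves only Ozone and Solar.R, matching the non-zero counts shown above. With hundreds of columns the same two filtering lines apply unchanged.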

User · Answer

I think the Amelia library does a nice job of handling missing data. It also includes a map for visualizing the missing rows:

    install.packages("Amelia")
    library(Amelia)
    missmap(airquality)

You can also run the following code, which returns a logical value for each row indicating whether it contains an NA (training is my data frame here):

    row.has.na <- apply(training, 1, function(x) { any(is.na(x)) })
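
The row-wise check above can also be written without apply. The sketch below uses airquality in place of the answer's training data frame (an assumption, so that it runs as-is) and compares the apply version with two vectorised base R equivalents.

```r
# Row-wise flag using apply(), as in the answer, but on airquality.
row_has_na <- apply(airquality, 1, function(x) any(is.na(x)))

# Vectorised alternatives that give the same flags.
row_has_na_fast <- rowSums(is.na(airquality)) > 0
row_has_na_cc   <- !complete.cases(airquality)

# Rows that contain at least one missing value.
incomplete_rows <- airquality[row_has_na_fast, ]
head(incomplete_rows)
```

On wide data frames rowSums(is.na(...)) is usually the faster choice, because apply first converts the data frame to a matrix and then loops over rows in R.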

User · Answer

For one more graphical solution, the visdat package offers vis_miss:

    library(visdat)
    vis_miss(airquality)

The output is very similar to Amelia's, with the small difference that it gives the percentage of missing values out of the box.

User · Answer

My new favourite for (not too wide) data are the methods from the excellent naniar package. Not only do you get frequencies but also patterns of missingness:

    library(naniar)
    library(UpSetR)

    riskfactors %>%
      as_shadow_upset() %>%
      upset()

It's often useful to see where the missings are in relation to the non-missing values, which can be achieved by plotting a scatter plot with missings:

    ggplot(airquality,
           aes(x = Ozone,
               y = Solar.R)) +
      geom_miss_point()

Or for categorical variables:

    gg_miss_fct(x = riskfactors, fct = marital)

These examples are from the package vignette, which lists other interesting visualizations.

User · Answer

More succinct:

    sum(is.na(x[1]))

That is:

    x[1]     Look at the first column
    is.na()  TRUE if it's NA
    sum()    TRUE is 1, FALSE is 0
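
The reason sum() works here is that logical values are coerced to numbers when summed. A short sketch, assuming x is the airquality data frame (the answer does not say what x is):

```r
x <- airquality  # assumption: any data frame would do

# is.na() returns TRUE/FALSE flags; x[[1]] extracts the first column as a vector.
flags <- is.na(x[[1]])

n_missing    <- sum(flags)   # each TRUE counts as 1, each FALSE as 0
prop_missing <- mean(flags)  # the same coercion gives the share of NAs

# x[1] keeps the data frame wrapper, but the count is the same.
n_missing_df <- sum(is.na(x[1]))
```

Using mean() on the same flags is a convenient way to get the proportion of missing values without dividing by the row count by hand.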

User · Answer

A dplyr solution to get the count could be:

    summarise_all(df, ~sum(is.na(.)))

Or to get a percentage:

    summarise_all(df, ~(sum(is_missing(.) / nrow(df))))

Maybe also worth noting that missing data can be ugly, inconsistent, and not always coded as NA, depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:

    is_missing <- function(x){
      missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
      ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
             ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
    }

    # sample ugly data
    df <- data.frame(a = c(NA, '1', '  ', 'missing'),
                     b = c(0, 2, NaN, 4),
                     c = c('NA', 'b', '-9', 'null'),
                     d = 1:4,
                     e = c(1, Inf, -Inf, 0))

    # counts:
    > summarise_all(df, ~sum(is_missing(.)))
      a b c d e
    1 3 1 3 0 2

    # percentage:
    > summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
         a    b    c d   e
    1 0.75 0.25 0.75 0 0.5
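
To report only the variables that have missing values, as the question's edit requires, the counting step can be preceded by a column filter. The sketch below assumes dplyr 1.0.0 or later, where across() and where() are available and summarise_all() is marked as superseded; it uses plain is.na() on airquality rather than the custom is_missing() helper, and the name na_summary is hypothetical.

```r
library(dplyr)

# Keep only columns with at least one NA, then count the NAs in each.
na_summary <- airquality %>%
  select(where(~ any(is.na(.x)))) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_summary
```

Swapping is.na for a helper such as is_missing in both lambdas would extend this to the "ugly" codings discussed above.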
