Remove rows with all or some NAs (missing values) in data.frame

Question

I'd like to remove the lines in this data frame that:

a) contain NAs across all columns. Below is my example data frame.

                 gene hsap mmul mmus rnor cfam
    1 ENSG00000208234    0   NA   NA   NA   NA
    2 ENSG00000199674    0    2    2    2    2
    3 ENSG00000221622    0   NA   NA   NA   NA
    4 ENSG00000207604    0   NA   NA    1    2
    5 ENSG00000207431    0   NA   NA   NA   NA
    6 ENSG00000221312    0    1    2    3    2

Basically, I'd like to get a data frame such as the following.

                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2

b) contain NAs in only some columns, so I can also get this result:

                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2
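For anyone who wants to run the answers, the example data can be rebuilt as a data frame. This is only a sketch: the name `final` follows the answers in this thread, and storing `hsap` as numeric is an assumption, since the question shows the printed output only.

```r
# Rebuild the example from the question as a data frame named `final`.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Number of missing values per row, useful for checking any filter.
na_per_row <- rowSums(is.na(final))
```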

User · Answer

For your first question, I have code that I am comfortable with to get rid of all NAs. Thanks to Gregor for making it simpler.

    final[!(rowSums(is.na(final))), ]

For the second question, the code is just an alteration of the previous solution.

    final[as.logical((rowSums(is.na(final)) - 5)), ]

Note that the 5 is meant to be the number of columns that can hold NA in your data. This eliminates rows in which all of those columns are NA, since their rowSums adds up to 5 and becomes zero after the subtraction. This time, as.logical is necessary.
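The subtraction trick only works when the constant matches the number of columns that are actually checked. In the question's data, gene and hsap are never NA, so no row reaches five NAs. The sketch below counts NAs over a named set of columns instead; the vector `species` is an assumption introduced here for illustration.

```r
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Columns to check (hypothetical choice: the ones that can be NA here).
species <- c("mmul", "mmus", "rnor", "cfam")
na_count <- rowSums(is.na(final[, species]))

# Keep rows with no NA in the checked columns.
no_na <- final[na_count == 0, ]

# Keep rows unless every checked column is NA.
not_all_na <- final[na_count < length(species), ]

# With the literal 5, no row is dropped for this data set.
minus_five <- final[as.logical(rowSums(is.na(final)) - 5), ]
```

Comparing against `length(species)` avoids hard-coding the column count.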

User · Answer

Another option, if you want greater control over how rows are deemed to be invalid, is:

    final <- final[!(is.na(final$rnor)) | !(is.na(final$cfam)), ]

Using the above, this:

                 gene hsap mmul mmus rnor cfam
    1 ENSG00000208234    0   NA   NA   NA    2
    2 ENSG00000199674    0    2    2    2    2
    3 ENSG00000221622    0   NA   NA    2   NA
    4 ENSG00000207604    0   NA   NA    1    2
    5 ENSG00000207431    0   NA   NA   NA   NA
    6 ENSG00000221312    0    1    2    3    2

Becomes:

                 gene hsap mmul mmus rnor cfam
    1 ENSG00000208234    0   NA   NA   NA    2
    2 ENSG00000199674    0    2    2    2    2
    3 ENSG00000221622    0   NA   NA    2   NA
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2

...where only row 5 is removed, since it is the only row containing NAs for both rnor AND cfam. The boolean logic can then be changed to fit specific requirements.
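As a rough illustration of changing the boolean logic, the sketch below contrasts dropping rows where both columns are NA with dropping rows where either one is NA. The intermediate names `both_missing` and `either_missing` are introduced here for readability and are not part of the answer.

```r
# The modified example data used in this answer.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, 2, 1, NA, 3),
  cfam = c(2, 2, NA, 2, NA, 2)
)

both_missing   <- is.na(final$rnor) & is.na(final$cfam)
either_missing <- is.na(final$rnor) | is.na(final$cfam)

# Drop a row only when rnor AND cfam are both NA (same rule as the answer).
drop_if_both <- final[!both_missing, ]

# Stricter rule: drop a row when rnor OR cfam is NA.
drop_if_either <- final[!either_missing, ]
```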

User · Answer

I am a synthesizer :). Here I combined the answers into one function:

    #' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
    #' @param df a data frame
    #' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
    #' \cr default is NULL, search for all columns
    #' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
    #' \cr If a number, the exact number of NAs kept
    #' \cr Range includes both ends 3<=n<=5
    #' \cr Range could be -Inf, Inf
    #' @return returns a new df with rows that have NA(s) removed
    #' @export
    ez.na.keep = function(df, col=NULL, n=0){
        if (!is.null(col)) {
            # R converts a single row/col to a vector if the parameter col has only one col
            # see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
            df.temp = df[,col,drop=FALSE]
        } else {
            df.temp = df
        }

        if (length(n)==1){
            if (n==0) {
                # simply call complete.cases which might be faster
                result = df[complete.cases(df.temp),]
            } else {
                # credit: http://stackoverflow.com/a/30461945/2292993
                log <- apply(df.temp, 2, is.na)
                logindex <- apply(log, 1, function(x) sum(x) == n)
                result = df[logindex, ]
            }
        }

        if (length(n)==2){
            min = n[1]; max = n[2]
            log <- apply(df.temp, 2, is.na)
            logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
            result = df[logindex, ]
        }

        return(result)
    }

User · Answer

Also check complete.cases:

    > final[complete.cases(final), ]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2

na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:

    > final[complete.cases(final[ , 5:6]), ]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2

Your solution can't work. If you insist on using is.na, then you have to do something like:

    > final[rowSums(is.na(final[ , 5:6])) == 0, ]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2

but using complete.cases is quite a lot more clear, and faster.
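Positions such as 5:6 break silently if columns are reordered, so it may be safer to select by name. The following is a small sketch of the same idea with named columns; the vector `cols` is an assumption introduced here.

```r
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Part (a): drop rows with an NA in any column.
full <- na.omit(final)

# Part (b): drop rows with an NA in the named columns only.
cols <- c("rnor", "cfam")
partial <- final[complete.cases(final[, cols]), ]
```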

User · Answer

If you want control over how many NAs are valid for each row, try this function. For many survey data sets, too many blank question responses can ruin the results, so rows are deleted once they exceed a certain threshold. This function lets you choose how many NAs a row can have before it is deleted:

    delete.na <- function(DF, n=0) {
      DF[rowSums(is.na(DF)) <= n,]
    }

By default, it eliminates every row that contains an NA:

    delete.na(final)
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2

Or specify the maximum number of NAs allowed:

    delete.na(final, 2)
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2
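For survey-style data with many columns, a threshold expressed as a share of the row can be easier to reason about than an absolute count. The sketch below is a variation on the function above, not part of the answer; the name `delete_na_frac` and its `max_frac` argument are hypothetical.

```r
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Hypothetical helper: keep rows whose share of NAs is at most max_frac.
delete_na_frac <- function(DF, max_frac = 0) {
  DF[rowMeans(is.na(DF)) <= max_frac, , drop = FALSE]
}

strict  <- delete_na_frac(final)        # no NA allowed
lenient <- delete_na_frac(final, 0.5)   # up to half of the columns may be NA
```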

User · Answer

    delete.dirt <- function(DF, dart=c('NA')) {
      dirty_rows <- apply(DF, 1, function(r) !any(r %in% dart))
      DF <- DF[dirty_rows, ]
    }

    mydata <- delete.dirt(mydata)

The above function deletes all the rows of the data frame that have 'NA' in any column and returns the resulting data. Note that 'NA' in quotes matches the text "NA"; to match genuinely missing values, pass dart=NA (unquoted). If you want to check for several values at once, add them to the dart vector in the function parameters.
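The difference between the string "NA" and a real missing value is easy to overlook, so here is a hedged sketch that treats real NAs and a set of marker strings separately. The function `drop_marked_rows`, the `markers` argument and the `survey` data are all hypothetical names used only for illustration.

```r
# A quoted "NA" is ordinary text and does not match a real missing value.
string_matches_na <- NA %in% "NA"

# Hypothetical helper: drop rows that contain a real NA or any marker string.
drop_marked_rows <- function(DF, markers = character(0)) {
  is_bad <- lapply(DF, function(col) is.na(col) | col %in% markers)
  DF[!Reduce(`|`, is_bad), , drop = FALSE]
}

# Made-up data: "?" stands for an unanswered question.
survey <- data.frame(
  id = 1:4,
  q1 = c("yes", "?", "no", NA),
  q2 = c(3, 5, NA, 2),
  stringsAsFactors = FALSE
)

only_na   <- drop_marked_rows(survey)        # drops rows 3 and 4
na_and_qm <- drop_marked_rows(survey, "?")   # also drops row 2
```

Working column by column avoids the character conversion that apply() performs on mixed-type data frames.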

User · Answer

Try na.omit(your.data.frame). As for the second question, try posting it as another question (for clarity).

User · Answer

Using the dplyr package, we can filter out NAs as follows:

    dplyr::filter(df, !is.na(columnname))

User · Answer

This will return the rows that have at least ONE non-NA value:

    final[rowSums(is.na(final)) < length(final), ]

This will return the rows that have at least TWO non-NA values:

    final[rowSums(is.na(final)) < (length(final) - 1), ]
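The same rule can be written by counting non-NA values directly, which may read more naturally when the minimum is a parameter. This is a sketch; the variable `min_non_na` is an assumption introduced here.

```r
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Count the non-missing values in each row.
non_na <- rowSums(!is.na(final))

# Keep rows with at least min_non_na non-missing values.
min_non_na <- 3
kept <- final[non_na >= min_non_na, ]

# "At least TWO non-NA values", written both ways, selects the same rows.
same_rows <- identical(rowSums(is.na(final)) < (length(final) - 1),
                       non_na >= 2)
```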

User · Answer

Assuming dat is your dataframe, the expected output can be achieved using:

1. rowSums

    > dat[!rowSums((is.na(dat))), ]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2

2. lapply

    > dat[!Reduce('|', lapply(dat, is.na)), ]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2

User · Answer

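My guess is that this could be solved more elegantly in this way:

    m <- matrix(1:25, ncol = 5)
    m[c(1, 6, 13, 25)] <- NA
    df <- data.frame(m)
    library(dplyr)
    df %>%
      filter_all(any_vars(is.na(.)))
    #>   X1 X2 X3 X4 X5
    #> 1 NA NA 11 16 21
    #> 2  3  8 NA 18 23
    #> 3  5 10 15 20 NA

Note that this returns the rows that contain at least one NA, that is, the rows that would be removed. To keep only the complete rows, invert the condition with all_vars(!is.na(.)).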

User · Answer

We can also use the subset function for this:

    finalData <- subset(data, !(is.na(data["mmul"]) | is.na(data["rnor"])))

This gives only those rows that have no NA in either mmul or rnor.
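Inside subset(), columns can be referred to by bare name, which shortens the condition. A small sketch on the question's data, assuming the data frame is called `final`:

```r
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Keep rows where neither mmul nor rnor is missing.
finalData <- subset(final, !(is.na(mmul) | is.na(rnor)))
```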

User · Answer

If performance is a priority, use data.table and na.omit() with the optional param cols=.

na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).

If you don't want to use data.table, use complete.cases().

On a vanilla data.frame, complete.cases is faster than na.omit() or tidyr::drop_na(). Notice that na.omit.data.frame does not support cols=.

Benchmark result

Here is a comparison of base (blue), dplyr (pink), and data.table (yellow) methods for dropping either all or select missing observations, on a notional dataset of 1 million observations of 20 numeric variables with an independent 5% likelihood of being missing, and a subset of 4 variables for part 2.

Your results may vary based on the length, width, and sparsity of your particular dataset.

Note the log scale on the y axis.

Benchmark script

    #-------  Adjust these assumptions for your own use case  ------------
    row_size   <- 1e6L
    col_size   <- 20    # not including ID column
    p_missing  <- 0.05  # likelihood of missing observation (except ID col)
    col_subset <- 18:21 # second part of question: filter on select columns

    #-------  System info for benchmark  ----------------------------------
    R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
    library(data.table); packageVersion('data.table') # 1.10.4.3
    library(dplyr);      packageVersion('dplyr')      # 0.7.4
    library(tidyr);      packageVersion('tidyr')      # 0.8.0
    library(microbenchmark)

    #-------  Example dataset using above assumptions  --------------------
    fakeData <- function(m, n, p){
      set.seed(123)
      m <- matrix(runif(m*n), nrow=m, ncol=n)
      m[m<p] <- NA
      return(m)
    }
    df <- cbind( data.frame(id = paste0('ID', seq(row_size)),
                            stringsAsFactors = FALSE),
                 data.frame(fakeData(row_size, col_size, p_missing) )
                 )
    dt <- data.table(df)

    par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
    boxplot(
      microbenchmark(
        df[complete.cases(df), ],
        na.omit(df),
        df %>% drop_na,
        dt[complete.cases(dt), ],
        na.omit(dt)
      ), xlab='',
      main = 'Performance: Drop any NA observation',
      col=c(rep('lightblue',2),'salmon',rep('beige',2))
    )
    boxplot(
      microbenchmark(
        df[complete.cases(df[,col_subset]), ],
        #na.omit(df), # col subset not supported in na.omit.data.frame
        df %>% drop_na(col_subset),
        dt[complete.cases(dt[,col_subset,with=FALSE]), ],
        na.omit(dt, cols=col_subset) # see ?na.omit.data.table
      ), xlab='',
      main = 'Performance: Drop NA obs. in select cols',
      col=c('lightblue','salmon',rep('beige',2))
    )
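For readers who only need the data.table calls and not the benchmark, here is a minimal sketch on the question's six-row example. It assumes the data.table package is installed; the object names are chosen here for illustration.

```r
library(data.table)

dt <- data.table(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Part (a): drop rows with an NA in any column.
all_cols <- na.omit(dt)

# Part (b): drop rows with an NA in the selected columns only.
some_cols <- na.omit(dt, cols = c("rnor", "cfam"))
```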

User · Answer

tidyr has a new function, drop_na:

    library(tidyr)
    df %>% drop_na()
    #              gene hsap mmul mmus rnor cfam
    # 2 ENSG00000199674    0    2    2    2    2
    # 6 ENSG00000221312    0    1    2    3    2
    df %>% drop_na(rnor, cfam)
    #              gene hsap mmul mmus rnor cfam
    # 2 ENSG00000199674    0    2    2    2    2
    # 4 ENSG00000207604    0   NA   NA    1    2
    # 6 ENSG00000221312    0    1    2    3    2

User · Answer

I prefer the following way to check whether rows contain any NAs:

    row.has.na <- apply(final, 1, function(x){any(is.na(x))})

This returns a logical vector whose values denote whether there is any NA in a row. You can use it to see how many rows you'll have to drop:

    sum(row.has.na)

and eventually drop them:

    final.filtered <- final[!row.has.na, ]

Filtering rows with NAs in only some of the columns is a little trickier (for example, you can feed 'final[,5:6]' to 'apply'). Generally, Joris Meys' solution seems to be more elegant.

User · Answer

One approach that's both general and yields fairly readable code is to use the filter() function and the across() helper function from the {dplyr} package.

    library(dplyr)

    vars_to_check <- c("rnor", "cfam")

    # Filter a specific list of columns to keep only non-missing entries
    df %>%
      filter(across(one_of(vars_to_check),
                    ~ !is.na(.x)))

    # Filter all the columns to exclude NA
    df %>%
      filter(across(everything(),
                    ~ !is.na(.)))

    # Filter only numeric columns
    df %>%
      filter(across(where(is.numeric),
                    ~ !is.na(.)))

Similarly, there are also the variant functions in the dplyr package (filter_all, filter_at, filter_if) which accomplish the same thing:

    library(dplyr)

    vars_to_check <- c("rnor", "cfam")

    # Filter a specific list of columns to keep only non-missing entries
    df %>%
      filter_at(.vars = vars(one_of(vars_to_check)),
                ~ !is.na(.))

    # Filter all the columns to exclude NA
    df %>%
      filter_all(~ !is.na(.))

    # Filter only numeric columns
    df %>%
      filter_if(is.numeric,
                ~ !is.na(.))
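More recent dplyr releases (1.0.4 and later) also provide if_all() and if_any(), which state directly whether every selected column or at least one of them must pass the test, and newer versions warn when across() is used inside filter(). The sketch below assumes such a dplyr version is installed; the column choice in `cols_to_check` is made here only so that the two helpers give different results.

```r
library(dplyr)

final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

cols_to_check <- c("mmul", "rnor")

# Keep rows where ALL of the checked columns are non-missing.
all_present <- final %>%
  filter(if_all(all_of(cols_to_check), ~ !is.na(.x)))

# Keep rows where AT LEAST ONE of the checked columns is non-missing.
any_present <- final %>%
  filter(if_any(all_of(cols_to_check), ~ !is.na(.x)))
```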
