How to delete rows from a pandas DataFrame based on a conditional expression

Question

I have a pandas DataFrame and I want to delete rows from it where the length of the string in a particular column is greater than 2.

I expect to be able to do this (per this answer):

    df[(len(df['column name']) < 2)]

but I just get the error:

    KeyError: u'no item named False'

What am I doing wrong?

(Note: I know I can use df.dropna() to get rid of rows that contain any NaN, but I didn't see how to remove rows based on a conditional expression.)

User · Answer

To directly answer this question's original title, "How to delete rows from a pandas DataFrame based on a conditional expression" (which I understand is not necessarily the OP's problem but could help other users coming across this question), one way to do this is to use the drop method:

    df = df.drop(some labels)
    df = df.drop(df[<some boolean condition>].index)

Example

To remove all rows where column 'score' is < 50:

    df = df.drop(df[df.score < 50].index)

In-place version (as pointed out in comments):

    df.drop(df[df.score < 50].index, inplace=True)

Multiple conditions (see Boolean Indexing)

The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

To remove all rows where column 'score' is < 50 and > 20:

    df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
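As a minimal sketch of the drop approach with a combined condition, the snippet below uses a small made-up DataFrame (the 'name' column and the sample values are assumptions for illustration, not part of the answer):

```python
import pandas as pd

# Hypothetical sample data; only the 'score' column mirrors the answer above.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [10, 30, 50, 70]})

# Index labels of rows whose score is strictly between 20 and 50.
# Each comparison is wrapped in parentheses because & binds tighter than < and >.
to_drop = df[(df["score"] > 20) & (df["score"] < 50)].index

# drop returns a new DataFrame; the original df is left unchanged.
result = df.drop(to_drop)

print(result)
```

Because drop works on index labels rather than positions, a non-unique index would remove every row that shares a dropped label.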

User · Answer

When you do len(df['column name']) you are just getting one number, namely the number of rows in the DataFrame (i.e., the length of the column itself). If you want to apply len to each element in the column, use df['column name'].map(len). So try:

    df[df['column name'].map(len) < 2]
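The difference between the two calls can be seen on a tiny example. The sample strings here are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical sample data using the column name from the question.
df = pd.DataFrame({"column name": ["a", "bb", "ccc", "d"]})

# len() on the column gives a single number: how many rows it has.
n_rows = len(df["column name"])

# Comparing that number yields one plain bool, not a per-row mask.
# Indexing a DataFrame with a single bool is treated as a column lookup,
# which is where the KeyError in the question comes from.
scalar_result = n_rows < 2

# map(len) applies len to every element and returns a Series of lengths.
per_row = df["column name"].map(len)

# Comparing the Series gives a boolean mask with one entry per row.
short = df[per_row < 2]

print(short)
```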

User · Answer

In pandas you can use str.len with your boundary and use the Boolean result to filter:

    df[df['column name'].str.len() < 2]
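One practical difference from calling len on each element is how missing values behave. The sketch below assumes a column that contains a NaN, which the question does not mention:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({"column name": ["a", "bb", np.nan, "d"]})

# str.len() propagates the missing value instead of raising,
# so the result holds a missing entry for that row.
lengths = df["column name"].str.len()

# A comparison against a missing value is False, so that row is filtered out.
kept = df[lengths < 2]

print(kept)
```

Calling the built-in len on a float NaN raises a TypeError, so str.len() is the more forgiving choice when the column may contain missing values.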

User · Answer

I will expand on User's generic solution to provide a drop-free alternative. This is for folks directed here based on the question's title (not the OP's problem).

Say you want to delete all rows with negative values. A one-liner solution is:

    df = df[(df > 0).all(axis=1)]

Step-by-step explanation:

Let's generate a 5x5 data frame drawn from a random normal distribution:

    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'))

              A         B         C         D         E
    0  1.764052  0.400157  0.978738  2.240893  1.867558
    1 -0.977278  0.950088 -0.151357 -0.103219  0.410599
    2  0.144044  1.454274  0.761038  0.121675  0.443863
    3  0.333674  1.494079 -0.205158  0.313068 -0.854096
    4 -2.552990  0.653619  0.864436 -0.742165  2.269755

Let the condition be deleting negatives. A boolean df satisfying the condition:

    df > 0

           A     B      C      D      E
    0   True  True   True   True   True
    1  False  True  False  False   True
    2   True  True   True   True   True
    3   True  True  False   True  False
    4  False  True   True  False   True

A boolean series for all rows satisfying the condition. Note that if any element in the row fails the condition, the row is marked False:

    (df > 0).all(axis=1)

    0     True
    1    False
    2     True
    3    False
    4    False
    dtype: bool

Finally, filter out rows from the data frame based on the condition:

    df[(df > 0).all(axis=1)]

              A         B         C         D         E
    0  1.764052  0.400157  0.978738  2.240893  1.867558
    2  0.144044  1.454274  0.761038  0.121675  0.443863

You can assign it back to df to actually delete the rows, as opposed to the filtering done above:

    df = df[(df > 0).all(axis=1)]

This can easily be extended to filter out rows containing NaNs (non-numeric entries):

    df = df[(~df.isnull()).all(axis=1)]

This can also be simplified for cases like "delete all rows where column E is negative":

    df = df[(df.E>0)]

I would like to end with some profiling stats on why User's drop solution is slower than raw column-based filtration:

    %timeit df_new = df[(df.E>0)]
    345 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    %timeit dft.drop(dft[dft.E < 0].index, inplace=True)
    890 µs ± 94.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A column is basically a Series, i.e. a NumPy array, so it can be indexed without any cost. For folks interested in how the underlying memory organization plays into execution speed, here is a great link on speeding up pandas.
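The row-wise mask above can be run end to end as follows. This is a sketch: the df_nan copy and the injected NaN are assumptions added only to compare the isnull mask with df.dropna(), which the question mentions:

```python
import numpy as np
import pandas as pd

# Same seeded 5x5 frame as in the answer above.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 5), columns=list("ABCDE"))

# True only for rows in which every column is positive.
row_mask = (df > 0).all(axis=1)
positives = df[row_mask]

# Hypothetical copy with one missing value, to exercise the NaN variant.
df_nan = df.copy()
df_nan.loc[1, "B"] = np.nan

# Keep rows where no column is null ...
no_nan_mask = df_nan[(~df_nan.isnull()).all(axis=1)]
# ... which selects the same rows as dropna() with its default arguments.
no_nan_dropna = df_nan.dropna()

print(positives)
```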

User · Answer

You can assign the DataFrame to a filtered version of itself:

    df = df[df.score > 50]

This is faster than drop:

    %%timeit
    test = pd.DataFrame({'x': np.random.randn(int(1e6))})
    test = test[test.x < 0]
    # 54.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit
    test = pd.DataFrame({'x': np.random.randn(int(1e6))})
    test.drop(test[test.x > 0].index, inplace=True)
    # 201 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit
    test = pd.DataFrame({'x': np.random.randn(int(1e6))})
    test = test.drop(test[test.x > 0].index)
    # 194 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
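For a frame with a unique index, filtering by a condition and dropping by the inverted condition select the same rows. The sketch below checks that on made-up scores (the sample values and the reset_index step are assumptions, not part of the answer):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"score": [10, 30, 50, 70, 90]})

# Keep rows that satisfy the condition.
filtered = df[df["score"] > 50]

# Equivalent drop: remove rows that do NOT satisfy it (~ inverts the mask).
dropped = df.drop(df[~(df["score"] > 50)].index)

# Both approaches keep the original index labels (3 and 4 here).
same = filtered.equals(dropped)

# Optional: renumber the surviving rows from 0.
tidy = filtered.reset_index(drop=True)

print(tidy)
```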

User · Answer

If you want to drop rows of a data frame on the basis of some complicated condition on the column value, then writing that in the way shown above can be complicated. I have the following simpler, more general solution. Let us assume that you want to drop rows based on the column with a given header (here 'name'), so get that column in a list first:

    text_data = df['name'].tolist()

Now apply some function to every element of the list and put the result in a pandas Series:

    text_length = pd.Series([func(t) for t in text_data])

In my case I was just trying to get the number of tokens:

    text_length = pd.Series([len(t.split()) for t in text_data])

Now add one extra column with the above series to the data frame:

    df = df.assign(text_length = text_length.values)

Now we can apply a condition on the new column, such as:

    df = df[df.text_length > 10]

Put together as a function:

    def pass_filter(df, label, length, pass_type):
        text_data = df[label].tolist()
        text_length = pd.Series([len(t.split()) for t in text_data])
        df = df.assign(text_length = text_length.values)
        if pass_type == 'high':
            df = df[df.text_length > length]
        if pass_type == 'low':
            df = df[df.text_length < length]
        df = df.drop(columns=['text_length'])
        return df
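The same idea can be sketched without the temporary column by building the boolean mask directly from the function's result. The helper name token_count and the sample rows are hypothetical, chosen only for illustration:

```python
import pandas as pd


def token_count(text):
    """Hypothetical helper: number of whitespace-separated tokens."""
    return len(text.split())


# Hypothetical sample data.
df = pd.DataFrame({"name": ["one", "two words", "three short words", "a b c d"]})

# map applies the function to each element; the comparison yields a mask
# that shares df's index, so no helper column has to be added or dropped.
mask = df["name"].map(token_count) > 2
long_rows = df[mask]

print(long_rows)
```

Any function that takes one cell value and returns something comparable can replace token_count here.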
