How to iterate over rows in a DataFrame in Pandas

Question

I have a pandas DataFrame, df:

    import pandas as pd

    inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
    df = pd.DataFrame(inp)
    print(df)

Output:

       c1   c2
    0  10  100
    1  11  110
    2  12  120

Now I want to iterate over the rows of this frame. For every row, I want to be able to access its elements (values in cells) by the name of the columns. For example:

    for row in df.rows:
        print(row['c1'], row['c2'])

Is it possible to do that in pandas?

I found this similar question, but it does not give me the answer I need. For example, it is suggested there to use:

    for date, row in df.T.iteritems():

or

    for row in df.iterrows():

But I do not understand what the row object is and how I can work with it.

User · Answer

You should use df.iterrows(), though iterating row by row is not especially efficient, since a Series object has to be created for every row.

User · Answer

You can also do NumPy indexing for even greater speed-ups. It's not really iterating, but it works much better than iteration for certain applications.

    subset = row['c1'][0:5]
    all = row['c1'][:]

You may also want to cast it to an array. These index selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast.

    np.asarray(all)
    imgs[:] = cv2.resize(imgs[:], (224, 224))  # Resize every image in an hdf5 file

User · Answer

cs95 shows that pandas vectorization far outperforms other pandas methods for computing stuff with dataframes.

I wanted to add that if you first convert the dataframe to a NumPy array and then use vectorization, it's even faster than pandas dataframe vectorization (and that includes the time to turn it back into a dataframe series).

If you add the following functions to cs95's benchmark code, this becomes pretty evident:

    def np_vectorization(df):
        np_arr = df.to_numpy()
        return pd.Series(np_arr[:, 0] + np_arr[:, 1], index=df.index)

    def just_np_vectorization(df):
        np_arr = df.to_numpy()
        return np_arr[:, 0] + np_arr[:, 1]
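To make the comparison above concrete, here is a minimal sketch on a tiny frame. The column names "A" and "B" and the sample values are placeholders, not part of cs95's benchmark, and the sketch only checks that both routes give the same result; it does not measure speed.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; "A" and "B" are placeholder names.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# pandas vectorization: add the two Series directly.
pandas_sum = df["A"] + df["B"]

# NumPy vectorization: drop to an ndarray, add the columns,
# then wrap the result back into a Series with the original index.
arr = df.to_numpy()
numpy_sum = pd.Series(arr[:, 0] + arr[:, 1], index=df.index)

print(pandas_sum.tolist())
print(numpy_sum.tolist())
```

Whether the NumPy route is actually faster depends on the data size and dtypes, so it is worth timing both on your own frame.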

User · Answer

In short:

- Use vectorization if possible.
- If an operation can't be vectorized, use list comprehensions.
- If you need a single object representing the entire row, use itertuples.
- If the above is too slow, try swifter.apply.
- If it's still too slow, try a Cython routine.

Benchmark: [figure not included]
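The first three options in the list above can be sketched side by side on the frame from the question. This is only an illustration of the three styles producing the same values; the variable names are made up for this sketch, and swifter and Cython are left out because they need extra dependencies.

```python
import pandas as pd

df = pd.DataFrame({"c1": [10, 11, 12], "c2": [100, 110, 120]})

# 1. Vectorized: operate on whole columns at once.
vectorized = (df["c1"] + df["c2"]).tolist()

# 2. List comprehension over zipped columns.
comprehension = [a + b for a, b in zip(df["c1"], df["c2"])]

# 3. itertuples: one namedtuple per row, fields accessed by column name.
from_tuples = [row.c1 + row.c2 for row in df.itertuples(index=False)]

print(vectorized, comprehension, from_tuples)
```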

User · Answer

The easiest way: use the apply function.

    def print_row(row):
        print(row['c1'], row['c2'])

    df.apply(lambda row: print_row(row), axis=1)

User · Answer

You can write your own iterator that implements namedtuple:

    from collections import namedtuple

    def myiter(d, cols=None):
        if cols is None:
            v = d.values.tolist()
            cols = d.columns.values.tolist()
        else:
            j = [d.columns.get_loc(c) for c in cols]
            v = d.values[:, j].tolist()

        n = namedtuple('MyTuple', cols)

        for line in iter(v):
            yield n(*line)

This is directly comparable to pd.DataFrame.itertuples. I'm aiming at performing the same task with more efficiency.

For the given dataframe with my function:

    list(myiter(df))

    [MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]

Or with pd.DataFrame.itertuples:

    list(df.itertuples(index=False))

    [Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]

A comprehensive test

We test making all columns available and subsetting the columns.

    from timeit import timeit

    import numpy as np
    import pandas as pd

    def iterfullA(d):
        return list(myiter(d))

    def iterfullB(d):
        return list(d.itertuples(index=False))

    def itersubA(d):
        return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))

    def itersubB(d):
        return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))

    res = pd.DataFrame(
        index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
        columns='iterfullA iterfullB itersubA itersubB'.split(),
        dtype=float
    )

    for i in res.index:
        d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
        for j in res.columns:
            stmt = '{}(d)'.format(j)
            setp = 'from __main__ import d, {}'.format(j)
            res.at[i, j] = timeit(stmt, setp, number=100)

    res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);

User · Answer

    for ind in df.index:
        print(df['c1'][ind], df['c2'][ind])

User · Answer

I was looking for how to iterate over rows and columns and ended up here, so:

    for i, row in df.iterrows():
        # Series.items() was called iteritems() in pandas versions before 2.0
        for j, column in row.items():
            print(column)

User · Answer

While iterrows() is a good option, sometimes itertuples() can be much faster:

    from numpy.random import randn, randint

    df = pd.DataFrame({'a': randn(1000), 'b': randn(1000), 'N': randint(100, 1000, (1000)), 'x': 'x'})

    %timeit [row.a * 2 for idx, row in df.iterrows()]
    # => 10 loops, best of 3: 50.3 ms per loop

    %timeit [row[1] * 2 for row in df.itertuples()]
    # => 1000 loops, best of 3: 541 µs per loop

User · Answer

Along with the great answers in this post, I am going to propose a Divide and Conquer approach. I am not writing this answer to replace the other great answers but to complement them with another approach that worked efficiently for me. It has two steps: splitting and merging the pandas dataframe.

PROS of Divide and Conquer:

- You don't need to use vectorization or any other method to cast your dataframe into another type.
- You don't need to Cythonize your code, which normally takes extra time.
- In my case, iterrows() and itertuples() had the same performance over the entire dataframe.
- Depending on your choice of slicing index, you can speed up the iteration considerably. In my benchmark below, smaller slices gave a much higher iteration rate.

CONS of Divide and Conquer:

- The iteration over one slice should not depend on other slices of the same dataframe. If you need to read from or write to another slice, that may be difficult to do.

Divide and Conquer Approach

Step 1: Splitting/Slicing

In this step, we divide the iteration over the entire dataframe. Suppose you read a CSV file into a pandas df and then iterate over it. In my case I have 5,000,000 records and I am going to split them into slices of 100,000 records.

NOTE: As other runtime analyses on this page explain, the runtime of a search on the df does not grow in simple proportion to the number of records; in my experience it grows much faster. Based on the benchmark on my data, here are the results:

    Number of records | Iterations per second
    ------------------+----------------------
    100,000           | 500 it/s
    500,000           | 200 it/s
    1,000,000         | 50 it/s
    5,000,000         | 20 it/s

Step 2: Merging

This is going to be an easy step: just merge all the written CSV files into one dataframe and write it into a bigger CSV file.

Here is the sample code:

    # Step 1 (Splitting/Slicing)
    import pandas as pd

    df_all = pd.read_csv('C:/KtV.csv')
    df_index = 100000
    df_len = len(df_all)
    for i in range(df_len // df_index + 1):
        lower_bound = i * df_index
        higher_bound = min(lower_bound + df_index, df_len)
        # Splitting/slicing df (make sure to copy(), otherwise it will be a view)
        df = df_all[lower_bound:higher_bound].copy()
        '''
        Write your iteration over the sliced df here
        using iterrows() or itertuples() or ...
        '''
        # Writing into CSV files
        df.to_csv('C:/KtV_prep_' + str(i) + '.csv')

    # Step 2 (Merging)
    filename = 'C:/KtV_prep_'
    df = (pd.read_csv(f) for f in [filename + str(i) + '.csv' for i in range(df_len // df_index + 1)])
    df_prep_all = pd.concat(df)
    df_prep_all.to_csv('C:/KtV_prep_all.csv')

Reference:

- Efficient way of iteration over dataframe
- Concatenate csv files into one Pandas Dataframe
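The split-then-merge idea can also be tried entirely in memory, without CSV files. The following is only a sketch under assumptions: the frame contents, the chunk_size value, and the "doubled" column are invented for illustration, and the per-chunk work is a stand-in for whatever row-wise processing you actually need.

```python
import pandas as pd

# Hypothetical data; in the answer above this would come from pd.read_csv.
df_all = pd.DataFrame({"value": range(10)})
chunk_size = 4  # placeholder slice size

processed_chunks = []
for lower_bound in range(0, len(df_all), chunk_size):
    # Copy the slice so that writes do not touch df_all.
    chunk = df_all.iloc[lower_bound:lower_bound + chunk_size].copy()
    # Stand-in for the per-slice row iteration.
    chunk["doubled"] = [row.value * 2 for row in chunk.itertuples(index=False)]
    processed_chunks.append(chunk)

# Merge the processed slices back into one frame.
df_prep_all = pd.concat(processed_chunks)
print(df_prep_all)
```

Stepping the range by chunk_size avoids producing an empty final slice when the row count is an exact multiple of the slice size.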

User · Answer

There is a way to iterate through rows while getting a DataFrame in return, and not a Series. I don't see anyone mentioning that you can pass the index as a list so that the row is returned as a DataFrame:

    for i in range(len(df)):
        row = df.iloc[[i]]

Note the usage of double brackets. This returns a DataFrame with a single row.
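A small check of the difference between single and double brackets, using the frame from the question. The variable names as_series and as_frame are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"c1": [10, 11, 12], "c2": [100, 110, 120]})

as_series = df.iloc[0]    # scalar position: the row comes back as a Series
as_frame = df.iloc[[0]]   # list of positions: a one-row DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__, as_frame.shape)
```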

User · Answer

This example uses iloc to isolate each value in the data frame.

    import pandas as pd

    a = [1, 2, 3, 4]
    b = [5, 6, 7, 8]

    mjr = pd.DataFrame({'a': a, 'b': b})

    size = mjr.shape

    for i in range(size[0]):
        for j in range(size[1]):
            print(mjr.iloc[i, j])

User · Answer

Some libraries (e.g. a Java interop library that I use) require values to be passed in a row at a time, for example, if streaming data. To replicate the streaming nature, I 'stream' my dataframe values one by one. I wrote the below, which comes in handy from time to time.

    from typing import Dict, List

    class DataFrameReader:
      def __init__(self, df):
        self._df = df
        self._row = None
        self._columns = df.columns.tolist()
        self.reset()
        self.row_index = 0

      def __getattr__(self, key):
        return self.__getitem__(key)

      def read(self) -> bool:
        self._row = next(self._iterator, None)
        self.row_index += 1
        return self._row is not None

      def columns(self):
        return self._columns

      def reset(self) -> None:
        self._iterator = self._df.itertuples()

      def get_index(self):
        return self._row[0]

      def index(self):
        return self._row[0]

      def to_dict(self, columns: List[str] = None):
        return self.row(columns=columns)

      def tolist(self, cols) -> List[object]:
        return [self.__getitem__(c) for c in cols]

      def row(self, columns: List[str] = None) -> Dict[str, object]:
        cols = set(self._columns if columns is None else columns)
        return {c: self.__getitem__(c) for c in self._columns if c in cols}

      def __getitem__(self, key) -> object:
        # the df index of the row is at index 0
        try:
            if type(key) is list:
                # look up each requested column (fixed: index by k, not key)
                return [self._row[self._columns.index(k) + 1] for k in key]
            ix = self._columns.index(key) + 1
            return self._row[ix]
        except BaseException as e:
            return None

      def __next__(self) -> 'DataFrameReader':
        if self.read():
            return self
        else:
            raise StopIteration

      def __iter__(self) -> 'DataFrameReader':
        return self

Which can be used:

    for row in DataFrameReader(df):
      print(row.my_column_name)
      print(row.to_dict())
      print(row['my_column_name'])
      print(row.tolist(['my_column_name']))  # tolist() needs the list of columns

And it preserves the values/name mapping for the rows being iterated. Obviously, it is a lot slower than using apply and Cython as indicated above, but it is necessary in some circumstances.

User · Answer

How to iterate over rows in a DataFrame in Pandas

Answer: DON'T*!

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().

Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List comprehensions (vanilla for loop)
4. DataFrame.apply(): i) reductions that can be performed in Cython, ii) iteration in Python space
5. DataFrame.itertuples() and items() (called iteritems() in pandas before 2.0)
6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

    "Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...]."

* It's actually a little more complicated than "don't". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.

Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.

Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple:

    # Iterating over one column - `f` is some function that processes your data
    result = [f(x) for x in df['col']]
    # Iterating over two columns, use `zip`
    result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
    # Iterating over multiple columns - same data type
    result = [f(row[0], ..., row[n]) for row in df[['col1', ..., 'coln']].to_numpy()]
    # Iterating over multiple columns - differing data type
    result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with. That means your data types are consistent and you don't have NaNs, but this cannot always be guaranteed.

1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
2. When dealing with mixed data types you should iterate over zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy(), as the latter implicitly upcasts data to the most common type. As an example, if A is numeric and B is string, to_numpy() will upcast the entire array to one common dtype (typically object), which may not be what you want. Fortunately, zipping your columns together is the most straightforward workaround to this.

* Your mileage may vary for the reasons outlined in the Caveats section above.

An Obvious Example

Let's demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.

[Benchmark figure not included]

Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).

I should mention, however, that it isn't always this cut and dried. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.

Further Reading

- 10 Minutes to pandas, and Essential Basic Functionality - Useful links that introduce you to Pandas and its library of vectorized*/cythonized functions.
- Enhancing Performance - A primer from the documentation on enhancing standard Pandas operations.
- Are for-loops in pandas really bad? When should I care? - A detailed writeup by me on list comprehensions and their suitability for various operations (mainly ones involving non-numeric data).
- When should I (not) want to use pandas apply() in my code? - apply is slow (but not as slow as the iter* family). There are, however, situations where one can (or should) consider apply as a serious alternative, especially in some GroupBy operations.

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.

Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.
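The mixed-dtype caveat from this answer can be seen on a two-row frame. This is a minimal sketch, not the answer's benchmark: the function f and the sample data are hypothetical, chosen only so that one column is numeric and the other holds strings.

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column.
df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

# to_numpy() on mixed columns has to pick a single dtype for the whole array.
arr = df[["A", "B"]].to_numpy()
print(arr.dtype)

# zip iterates each column separately, so every value keeps its column's type.
def f(a, b):
    # placeholder for your business logic
    return f"{b}{a * 2}"

result = [f(a, b) for a, b in zip(df["A"], df["B"])]
print(result)
```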

User · Answer

There are so many ways to iterate over the rows in a pandas dataframe. One very simple and intuitive way is:

    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
    print(df)
    for i in range(df.shape[0]):
        # For printing the second column
        print(df.iloc[i, 1])

        # For printing more than one column
        print(df.iloc[i, [0, 2]])

User · Answer

Sometimes a useful pattern is:

    # Borrowing @KutalmisB df example
    df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
    # The to_dict call results in a list of dicts
    # where each row_dict is a dictionary with k:v pairs of columns:value for that row
    for row_dict in df.to_dict(orient='records'):
        print(row_dict)

Which results in (depending on the pandas version, the integers may print as 1 and 2 rather than 1.0 and 2.0):

    {'col1': 1.0, 'col2': 0.1}
    {'col1': 2.0, 'col2': 0.2}

User · Answer

As many answers here correctly and clearly point out, you should not generally attempt to loop in pandas, but rather should write vectorized code. But the question remains whether you should EVER write loops in pandas, and if so, what the best way to loop in those situations is.

I believe there is at least one general situation where loops are appropriate: when you need to calculate some function that depends on values in other rows in a somewhat complex manner. In this case, the looping code is often simpler, more readable, and less error prone than vectorized code. The looping code might even be faster, too.

I will attempt to show this with an example. Suppose you want to take a cumulative sum of a column, but reset it whenever some other column equals zero:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [1, 1, 1, 0, 1, 1]})

    #    x  y  desired_result
    # 0  1  1               1
    # 1  2  1               3
    # 2  3  1               6
    # 3  4  0               4
    # 4  5  1               9
    # 5  6  1              15

This is a good example where you could certainly write one line of pandas to achieve this, although it's not especially readable, especially if you aren't fairly experienced with pandas already:

    df.groupby((df.y == 0).cumsum())['x'].cumsum()

That's going to be fast enough for most situations, although you could also write faster code by avoiding the groupby, but it will likely be even less readable.

Alternatively, what if we write this as a loop? You could do something like the following with NumPy:

    import numba as nb

    @nb.jit(nopython=True)  # Optional
    def custom_sum(x, y):
        x_sum = x.copy()
        # len(x) rather than len(df): the function should not rely on a global
        for i in range(1, len(x)):
            if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]
        return x_sum

    df['desired_result'] = custom_sum(df.x.to_numpy(), df.y.to_numpy())

Admittedly, there's a bit of overhead there required to convert DataFrame columns to NumPy arrays, but the core piece of code is just one line of code that you could read even if you didn't know anything about pandas or NumPy:

    if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]

And this code is actually faster than the vectorized code. In some quick tests with 100,000 rows, the above is about 10x faster than the groupby approach. Note that one key to the speed there is numba, which is optional. Without the "@nb.jit" line, the looping code is actually about 10x slower than the groupby approach.

Clearly this example is simple enough that you would likely prefer the one line of pandas to writing a loop with its associated overhead. However, there are more complex versions of this problem for which the readability or speed of the NumPy/numba loop approach likely makes sense.
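To check that the loop and the groupby one-liner from this answer agree, here is a plain-Python version of the loop on the same six-row example. It deliberately leaves out numba (so it makes no claim about speed) and uses len(x) inside the function so it does not depend on a global frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [1, 1, 1, 0, 1, 1]})

def custom_sum(x, y):
    """Cumulative sum of x that restarts wherever y is 0."""
    x_sum = x.copy()
    for i in range(1, len(x)):
        if y[i] > 0:
            x_sum[i] = x_sum[i - 1] + x[i]
    return x_sum

loop_result = custom_sum(df["x"].to_numpy(), df["y"].to_numpy())

# Vectorized equivalent: a new group starts at every row where y == 0.
groupby_result = df.groupby((df["y"] == 0).cumsum())["x"].cumsum()

print(loop_result.tolist())
print(groupby_result.tolist())
```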

User · Answer

To loop over all rows in a dataframe you can use:

    for x in range(len(date_example.index)):
        print(date_example['Date'].iloc[x])

User · Answer

You can use the df.iloc function as follows:

    for i in range(0, len(df)):
        print(df.iloc[i]['c1'], df.iloc[i]['c2'])

User · Answer

For both viewing and modifying values, I would use iterrows(). In a for loop and by using tuple unpacking (see the example: i, row), I use the row only for viewing the value and use i with the loc method when I want to modify values. As stated in previous answers, you should not modify something you are iterating over.

    for i, row in df.iterrows():
        df_column_A = df.loc[i, 'A']
        if df_column_A == 'Old_Value':
            # write through the DataFrame, not through the row copy
            df.loc[i, 'A'] = 'New_value'

Here the row in the loop is a copy of that row, and not a view of it. Therefore, you should NOT write something like row['A'] = 'New_Value'; it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.
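A small runnable sketch of the pattern described above: read from the row, write through df.loc with the index label. The frame and its values are invented for illustration, and a boolean-mask assignment is shown alongside as the non-iterating alternative.

```python
import pandas as pd

# Hypothetical data; column "B" is only there to make the frame mixed-dtype.
df = pd.DataFrame({"A": ["Old_Value", "keep", "Old_Value"], "B": [1, 2, 3]})
df_masked = df.copy()

# Row-wise: inspect the row, but write via the DataFrame and the index label.
for i, row in df.iterrows():
    if row["A"] == "Old_Value":
        df.loc[i, "A"] = "New_value"

# Same update without a loop, using a boolean mask.
df_masked.loc[df_masked["A"] == "Old_Value", "A"] = "New_value"

print(df)
```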

User · Answer

How to iterate efficiently

If you really have to iterate a pandas dataframe, you will probably want to avoid using iterrows(). There are different methods, and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.

In short:

- As a general rule, use df.itertuples(name=None). In particular, when you have a fixed number of columns and fewer than 255 columns. See point (3).
- Otherwise, use df.itertuples(), except if your columns have special characters such as spaces or '-'. See point (2).
- It is possible to use itertuples() even if your dataframe has strange columns by using the last example. See point (4).
- Only use iterrows() if you cannot use the previous solutions. See point (1).

Different methods to iterate over rows in a pandas dataframe:

Generate a random dataframe with a million rows and 4 columns:

    import time

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
    print(df)

(The timings below were originally taken with time.clock(), which was removed in Python 3.8; time.perf_counter() is used here instead.)

1) The usual iterrows() is convenient, but damn slow:

    start_time = time.perf_counter()
    result = 0
    for _, row in df.iterrows():
        result += max(row['B'], row['C'])

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))

2) The default itertuples() is already much faster, but it doesn't work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name):

    start_time = time.perf_counter()
    result = 0
    for row in df.itertuples(index=False):
        result += max(row.B, row.C)

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

3) itertuples() with name=None is even faster, but not really convenient, as you have to define a variable per column:

    start_time = time.perf_counter()
    result = 0
    for (_, col1, col2, col3, col4) in df.itertuples(name=None):
        result += max(col2, col3)

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

4) Finally, the named itertuples() with positional lookup is slower than the previous point, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange:

    start_time = time.perf_counter()
    result = 0
    for row in df.itertuples(index=False):
        result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))

Output:

             A   B   C   D
    0       41  63  42  23
    1       54   9  24  65
    2       15  34  10   9
    3       39  94  82  97
    4        4  88  79  54
    ...     ..  ..  ..  ..
    999995  48  27   4  25
    999996  16  51  34  28
    999997   1  39  61  14
    999998  66  51  27  70
    999999  51  53  47  99

    [1000000 rows x 4 columns]

    1. Iterrows done in 104.96 seconds, result = 66151519
    2. Named Itertuples done in 1.26 seconds, result = 66151519
    3. Itertuples done in 0.94 seconds, result = 66151519
    4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519

This article is a very interesting comparison between iterrows and itertuples.
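Points (3) and (4) above can be tried on a frame whose column name is not a valid Python identifier. This is a small sketch with made-up data and names ("My Col-Name", total_plain, values); it shows the two access styles only and does not reproduce the timings.

```python
import pandas as pd

# Hypothetical frame; "My Col-Name" cannot be used as an attribute name.
df = pd.DataFrame({"My Col-Name": [1, 2], "B": [3, 4]})

# Point (3): name=None yields plain tuples, unpacked one variable per column.
total_plain = 0
for first, b in df.itertuples(index=False, name=None):
    total_plain += max(first, b)

# Point (4): look up the column position once, then index each tuple by position.
pos = df.columns.get_loc("My Col-Name")
values = [row[pos] for row in df.itertuples(index=False)]

print(total_plain, values)
```

Calling get_loc once outside the loop, as here, avoids repeating the lookup for every row.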

User · Answer

First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives.

If you still need to iterate over rows, you can use the methods below. Note some important caveats which are not mentioned in any of the other answers.

DataFrame.iterrows()

    for index, row in df.iterrows():
        print(row["c1"], row["c2"])

DataFrame.itertuples()

    for row in df.itertuples(index=True, name='Pandas'):
        print(row.c1, row.c2)

itertuples() is supposed to be faster than iterrows().

But be aware, according to the docs (pandas 0.24.2 at the moment):

iterrows: dtype might not match from row to row

    "Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()."

iterrows: Do not modify rows

    "You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."

Use DataFrame.apply() instead:

    new_df = df.apply(lambda x: x * 2)

itertuples:

    "The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned."

See the pandas docs on iteration for more details.
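The dtype caveat quoted above is easy to see on a frame with one integer and one float column. A minimal sketch, with invented column names int_col and float_col:

```python
import pandas as pd

df = pd.DataFrame({"int_col": [1, 2], "float_col": [1.5, 2.5]})

# iterrows packs each row into a single Series, so the integer is upcast to float.
_, first_series = next(df.iterrows())

# itertuples keeps each value in its own column's type.
first_tuple = next(df.itertuples(index=False))

print(first_series.dtype, type(first_series["int_col"]).__name__)
print(type(first_tuple.int_col).__name__)
```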

User · Answer

DataFrame.iterrows is a generator which yields both the index and the row (as a Series):

    import pandas as pd

    df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

    for index, row in df.iterrows():
        print(row['c1'], row['c2'])

Output:

    10 100
    11 110
    12 120

User · Answer

You can also use df.apply() to iterate over rows and access multiple columns for a function.

docs: DataFrame.apply()

    def valuation_formula(x, y):
        return x * y * 0.5

    df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
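Because valuation_formula only uses arithmetic, it can also be called on whole columns. The sketch below, with made-up x and y values, shows that the row-wise apply and the column-wise call give the same numbers; it is an illustration, not a claim about which form the answer's author intended.

```python
import pandas as pd

# Hypothetical input values.
df = pd.DataFrame({"x": [2.0, 4.0], "y": [3.0, 5.0]})

def valuation_formula(x, y):
    return x * y * 0.5

# Row-wise: the lambda is called once per row.
applied = df.apply(lambda row: valuation_formula(row["x"], row["y"]), axis=1)

# Column-wise: the same function receives two Series and runs vectorized.
vectorized = valuation_formula(df["x"], df["y"])

print(applied.tolist(), vectorized.tolist())
```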

User · Answer

To loop over all rows in a dataframe and use the values of each row conveniently, namedtuples can be converted to ndarrays. For example:

    df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

Iterating over the rows:

    for row in df.itertuples(index=False, name='Pandas'):
        print(np.asarray(row))

results in:

    [ 1.   0.1]
    [ 2.   0.2]

Please note that if index=True, the index is added as the first element of the tuple, which may be undesirable for some applications.
