Creating an empty Pandas DataFrame then filling it

Question

I m starting from the pandas DataFrame docs here  http   pandas pydata org pandas-docs stable dsintro html  I d like to iteratively fill the DataFrame with values in a time series kind of calculation  So basically  I d like to initialize the DataFrame with columns A  B and timestamp rows  all 0 or all NaN   I d then add initial values and go over this data calculating the new row from the row before  say row A  t    row A  t-1  1 or so   I m currently using the code as below  but I feel it s kind of ugly and there must be a  way to do this with a DataFrame directly  or just a better way in general  Note  I m using Python 2 7   import datetime as dt import pandas as pd import scipy as s  if   name         main         base   dt datetime today   date       dates     base - dt timedelta days x  for x in range 0 10        dates sort        valdict          symbols     A   B    C       for symb in symbols          valdict symb    pd Series  s zeros  len dates    dates        for thedate in dates          if thedate  gt  dates 0               for symb in valdict                  valdict symb  thedate    1 valdict symb  thedate - dt timedelta days 1        print valdict

User · Answer

Initialize empty frame with column names  import pandas as pd  col names      A    B    C   my df    pd DataFrame columns   col names  my df   Add a new record to a frame  my df loc len my df      2  4  5    You also might want to pass a dictionary   my dic     A  2   B  4   C  5  my df loc len my df     my dic    Append another frame to your existing frame  col names      A    B    C   my df2    pd DataFrame columns   col names  my df   my df append my df2      Performance considerations  If you are adding rows inside a loop consider performance issues  For around the first 1000 records  my df loc  performance is better  but it gradually becomes slower by increasing the number of records in the loop   If you plan to do thins inside a big loop  say 10M  records or so   you are better off using a mixture of these two  fill a dataframe with iloc until the size gets around 1000  then append it to the original dataframe  and empty the temp dataframe  This would boost your performance by around 10 times

User · Answer

Assume a dataframe with 19 rows  index range 0 19  index  columns   A   test   pd DataFrame index index  columns columns    Keeping Column A as a constant  test  A   10   Keeping column b as a variable given by a loop  for x in range 0 19       test loc  x    b     pd Series  x   index    x     You can replace the first x in pd Series  x   index    x   with any value

User · Answer

Here s a couple of suggestions   Use date range for the index   import datetime import pandas as pd import numpy as np  todays date   datetime datetime now   date   index   pd date range todays date-datetime timedelta 10   periods 10  freq  D    columns     A   B    C     Note  we could create an empty DataFrame  with NaNs  simply by writing   df    pd DataFrame index index  columns columns  df    df  fillna 0    with 0s rather than NaNs   To do these type of calculations for the data  use a numpy array   data   np array  np arange 10   3  T   Hence we can create the DataFrame   In  10   df   pd DataFrame data  index index  columns columns   In  11   df Out 11                A  B  C 2012-11-29  0  0  0 2012-11-30  1  1  1 2012-12-01  2  2  2 2012-12-02  3  3  3 2012-12-03  4  4  4 2012-12-04  5  5  5 2012-12-05  6  6  6 2012-12-06  7  7  7 2012-12-07  8  8  8 2012-12-08  9  9  9

User · Answer

If you simply want to create an empty data frame and fill it with some incoming data frames later  try this   newDF   pd DataFrame    creates a new dataframe that s empty newDF   newDF append oldDF  ignore index   True    ignoring index is optional   try printing some data from newDF print newDF head    again optional    In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF    If I have to keep appending new data into this newDF from more than    one oldDFs  I just use a for loop to iterate over    pandas DataFrame append

User · Answer

NEVER grow a DataFrame   TLDR   just read the bold text   Most answers here will tell you how to create an empty DataFrame and fill it out  but no one will tell you that it is a bad thing to do  Here is my advice  Accumulate data in a list  not a DataFrame  Use a list to collect your data  then initialise a DataFrame when you are ready  Either a list-of-lists or list-of-dicts format will work  pd DataFrame accepts both  data      for a  b  c in some function that yields data        data append  a  b  c    df   pd DataFrame data  columns   A    B    C     Pros of this approach   It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame  or one of NaNs  and append to it over and over again   Lists also take up less memory and are a much lighter data structure to work with  append  and remove  if needed    dtypes are automatically inferred  rather than assigning object to all of them    A RangeIndex is automatically created for your data  instead of you having to take care to assign the correct index to the row you are appending at each iteration    If you aren t convinced yet  this is also mentioned in the documentation   Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate  A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once   But what if my function returns smaller DataFrames that I need to combine into one large DataFrame  That s fine  you can still do this in linear time by growing or creating a python list of smaller DataFrames  then calling pd concat  small dfs      for small df in some function that yields dataframes        small dfs append small df   large df   pd concat small dfs  ignore index True   or  more concisely  large df   pd concat      list some function that yields dataframes     ignore index True     These options are horrible append or concat inside a loop Here is the biggest mistake I ve seen from beginners  df   pd DataFrame columns   A    B    C    for a  b  c in some function that yields data        df   df append   A   i   B   b   C   c   ignore index True    yuck       or similarly        df   pd concat  df  pd Series   A   i   B   b   C   c     ignore index True   Memory is re-allocated for every append or concat operation you have  Couple this with a loop and you have a quadratic complexity operation   The other mistake associated with df append is that users tend to forget append is not an in-place function  so the result must be assigned back  You also have to worry about the dtypes  df   pd DataFrame columns   A    B    C    df   df append   A   1   B   12 3   C    xyz    ignore index True   df dtypes A     object     yuck  B    float64 C     object dtype  object  Dealing with object columns is never a good thing  because pandas cannot vectorize operations on those columns  You will need to do this to fix it  df infer objects   dtypes A      int64 B    float64 C     object dtype  object  loc inside a loop I have also seen loc used to append to a DataFrame that was created empty  df   pd DataFrame columns   A    B    C    for a  b  c in some function that yields data        df loc len df      a  b  c   As before  you have not pre-allocated the amount of memory you need each time  so the memory is re-grown each time you create a new row  It s just as bad as append  and even more ugly  Empty DataFrame of NaNs And then  there s creating a DataFrame of NaNs  and all the caveats associated therewith  df   pd DataFrame columns   A    B    C    index range 5   df      A    B    C 0  NaN  NaN  NaN 1  NaN  NaN  NaN 2  NaN  NaN  NaN 3  NaN  NaN  NaN 4  NaN  NaN  NaN  It creates a DataFrame of object columns  like the others  df dtypes A    object    you DON T want this B    object C    object dtype  object  Appending still has all the issues as the methods above  for i   a  b  c  in enumerate some function that yields data         df iloc i     a  b  c     The Proof is in the Pudding Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility   Benchmarking code for reference

[python] Creating an empty Pandas DataFrame, then filling it?

Examples related to python

Examples related to dataframe

Examples related to pandas