[python] Pandas DataFrame concat vs append

I have a list of 4 pandas dataframes containing a day of tick data that I want to merge into a single data frame. I cannot understand the behavior of concat on my timestamps. See details below:

data

[<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35228 entries, 2013-03-28 00:00:07.089000+02:00 to 2013-03-28 18:59:20.357000+02:00
Data columns:
Price       4040  non-null values
Volume      4040  non-null values
BidQty      35228  non-null values
BidPrice    35228  non-null values
AskPrice    35228  non-null values
AskQty      35228  non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 33088 entries, 2013-04-01 00:03:17.047000+02:00 to 2013-04-01 18:59:58.175000+02:00
Data columns:
Price       3969  non-null values
Volume      3969  non-null values
BidQty      33088  non-null values
BidPrice    33088  non-null values
AskPrice    33088  non-null values
AskQty      33088  non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 50740 entries, 2013-04-02 00:03:27.470000+02:00 to 2013-04-02 18:59:58.172000+02:00
Data columns:
Price       7326  non-null values
Volume      7326  non-null values
BidQty      50740  non-null values
BidPrice    50740  non-null values
AskPrice    50740  non-null values
AskQty      50740  non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 60799 entries, 2013-04-03 00:03:06.994000+02:00 to 2013-04-03 18:59:58.180000+02:00
Data columns:
Price       8258  non-null values
Volume      8258  non-null values
BidQty      60799  non-null values
BidPrice    60799  non-null values
AskPrice    60799  non-null values
AskQty      60799  non-null values
dtypes: float64(6)]

Using append I get:

pd.DataFrame().append(data)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 179855 entries, 2013-03-28 00:00:07.089000+02:00 to 2013-04-03 18:59:58.180000+02:00
Data columns:
AskPrice    179855  non-null values
AskQty      179855  non-null values
BidPrice    179855  non-null values
BidQty      179855  non-null values
Price       23593  non-null values
Volume      23593  non-null values
dtypes: float64(6)

Using concat I get:

pd.concat(data)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 179855 entries, 2013-03-27 22:00:07.089000+02:00 to 2013-04-03 16:59:58.180000+02:00
Data columns:
Price       23593  non-null values
Volume      23593  non-null values
BidQty      179855  non-null values
BidPrice    179855  non-null values
AskPrice    179855  non-null values
AskQty      179855  non-null values
dtypes: float64(6)

Notice how the index changes when using concat. Why is that happening and how would I go about using concat to reproduce the results obtained using append? (Since concat seems so much faster; 24.6 ms per loop vs 3.02 s per loop)

This question is related to python pandas

The answer is


Pandas concat vs append vs join vs merge

  • Concat gives the flexibility to join based on the axis( all rows or all columns)

  • Append is the specific case(axis=0, join='outer') of concat

  • Join is based on the indexes (set by set_index) on how variable =['left','right','inner','couter']

  • Merge is based on any particular column each of the two dataframes, this columns are variables on like 'left_on', 'right_on', 'on'


I have implemented a tiny benchmark (please find the code on Gist) to evaluate the pandas' concat and append. I updated the code snippet and the results after the comment by ssk08 - thanks alot!

The benchmark ran on a Mac OS X 10.13 system with Python 3.6.2 and pandas 0.20.3.

+--------+---------------------------------+---------------------------------+
|        | ignore_index=False              | ignore_index=True               |
+--------+---------------------------------+---------------------------------+
| size   | append | concat | append/concat | append | concat | append/concat |
+--------+--------+--------+---------------+--------+--------+---------------+
| small  | 0.4635 | 0.4891 | 94.77 %       | 0.4056 | 0.3314 | 122.39 %      |
+--------+--------+--------+---------------+--------+--------+---------------+
| medium | 0.5532 | 0.6617 | 83.60 %       | 0.3605 | 0.3521 | 102.37 %      |
+--------+--------+--------+---------------+--------+--------+---------------+
| large  | 0.9558 | 0.9442 | 101.22 %      | 0.6670 | 0.6749 | 98.84 %       |
+--------+--------+--------+---------------+--------+--------+---------------+

Using ignore_index=False append is slightly faster, with ignore_index=True concat is slightly faster.

tl;dr No significant difference between concat and append.


One more thing you have to keep in mind that the APPEND() method in Pandas doesn't modify the original object. Instead it creates a new one with combined data. Because of involving creation and data buffer, its performance is not well. You'd better use CONCAT() function when doing multi-APPEND operations.