I understand that pandas is designed to load fully populated DataFrame
but I need to create an empty DataFrame then add rows, one by one.
What is the best way to do this ?
I successfully created an empty DataFrame with :
res = DataFrame(columns=('lib', 'qty1', 'qty2'))
Then I can add a new row and fill a field with :
res = res.set_value(len(res), 'qty1', 10.0)
It works but seems very odd :-/ (it fails for adding string value)
How can I add a new row to my DataFrame (with different columns type) ?
If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):
import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )
# now fill it up row by row
for x in np.arange(0, numberOfRows):
#loc or iloc both work here since the index is natural numbers
df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]:
lib qty1 qty2
0 -1 -1 -1
1 0 0 0
2 -1 0 -1
3 0 -1 0
4 -1 0 0
Speed comparison
In[30]: %timeit tryThis() # function wrapper for this answer
In[31]: %timeit tryOther() # function wrapper without index (see, for example, @fred)
1000 loops, best of 3: 1.23 ms per loop
100 loops, best of 3: 2.31 ms per loop
And - as from the comments - with a size of 6000, the speed difference becomes even larger:
Increasing the size of the array (12) and the number of rows (500) makes the speed difference more striking: 313ms vs 2.29s
mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
df.loc[len(df)] = row
For efficient appending see How to add an extra row to a pandas dataframe and Setting With Enlargement.
Add rows through loc/ix
on non existing key index data. e.g. :
In [1]: se = pd.Series([1,2,3])
In [2]: se
Out[2]:
0 1
1 2
2 3
dtype: int64
In [3]: se[5] = 5.
In [4]: se
Out[4]:
0 1.0
1 2.0
2 3.0
5 5.0
dtype: float64
Or:
In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
.....: columns=['A','B'])
.....:
In [2]: dfi
Out[2]:
A B
0 0 1
1 2 3
2 4 5
In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']
In [4]: dfi
Out[4]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
In [5]: dfi.loc[3] = 5
In [6]: dfi
Out[6]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
3 5 5 5
You can also build up a list of lists and convert it to a dataframe -
import pandas as pd
columns = ['i','double','square']
rows = []
for i in range(6):
row = [i, i*2, i*i]
rows.append(row)
df = pd.DataFrame(rows, columns=columns)
giving
i double square 0 0 0 0 1 1 2 1 2 2 4 4 3 3 6 9 4 4 8 16 5 5 10 25
For the sake of Pythonic way, here add my answer:
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())
lib qty1 qty2
0 NaN 10.0 NaN
You can append a single row as a dictionary using the ignore_index
option.
>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
Animal Color
0 cow blue
1 horse red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
Animal Color
0 cow blue
1 horse red
2 mouse black
This is not an answer to the OP question but a toy example to illustrate the answer of @ShikharDua above which I found very useful.
While this fragment is trivial, in the actual data I had 1,000s of rows, and many columns, and I wished to be able to group by different columns and then perform the stats below for more than one taget column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you @ShikharDua !
import pandas as pd
BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
'Territory' : ['West','East','South','West','East','South'],
'Product' : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData
columns = ['Customer','Num Unique Products', 'List Unique Products']
rows_list=[]
for name, group in BaseData.groupby('Customer'):
RecordtoAdd={} #initialise an empty dict
RecordtoAdd.update({'Customer' : name}) #
RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})
RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})
rows_list.append(RecordtoAdd)
AnalysedData = pd.DataFrame(rows_list)
print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)
Another way to do it (probably not very performant):
# add a row
def add_row(df, row):
colnames = list(df.columns)
ncol = len(colnames)
assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
return df.append(pd.DataFrame([row], columns=colnames))
You can also enhance the DataFrame class like this:
import pandas as pd
def add_row(self, row):
self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row
All you need is loc[df.shape[0]]
or loc[len(df)]
# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False]
or
df.loc[len(df)] = ['col1Value', 100, 'col3Value', False]
In case you can get all data for the data frame upfront, there is a much faster approach than appending to a data frame:
I had a similar task for which appending to a data frame row by row took 30 min, and creating a data frame from a list of dictionaries completed within seconds.
rows_list = []
for row in input_rows:
dict1 = {}
# get input row in dictionary format
# key = col_name
dict1.update(blah..)
rows_list.append(dict1)
df = pd.DataFrame(rows_list)
This will take care of adding an item to an empty DataFrame. The issue is that df.index.max() == nan
for the first index:
df = pd.DataFrame(columns=['timeMS', 'accelX', 'accelY', 'accelZ', 'gyroX', 'gyroY', 'gyroZ'])
df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = [x for x in range(7)]
Make it simple. By taking list as input which will be appended as row in data-frame:-
import pandas as pd
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
res_list = list(map(int, input().split()))
res = res.append(pd.Series(res_list,index=['lib','qty1','qty2']), ignore_index=True)
You could use pandas.concat()
or DataFrame.append()
. For details and examples, see Merge, join, and concatenate.
Instead of a list of dictionaries as in ShikharDua's answer, we can also represent our table as a dictionary of lists, where each list stores one column in row-order, given we know our columns beforehand. At the end we construct our DataFrame once.
For c columns and n rows, this uses 1 dictionary and c lists, versus 1 list and n dictionaries. The list of dictionaries method has each dictionary storing all keys and requires creating a new dictionary for every row. Here we only append to lists, which is constant time and theoretically very fast.
# current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}
# adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")
# at the end, construct our DataFrame
df = pd.DataFrame(data)
# Animal Color
# 0 cow blue
# 1 horse red
# 2 mouse black
initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}
df = pd.DataFrame(initial_data)
df
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
val_1 = [10]
val_2 = [14]
val_3 = [20]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
You can use for loop to iterate through values or can add arrays of values
val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
1 11 15 21
2 12 16 22
3 13 17 43
Here is the way to add/append a row in pandas DataFrame
def add_row(df, row):
df.loc[-1] = row
df.index = df.index + 1
return df.sort_index()
add_row(df, [1,2,3])
It can be used to insert/append a row in empty or populated pandas DataFrame
You can use generator object to create Dataframe, which will be more memory efficient over the list.
num = 10
# Generator function to generate generator object
def numgen_func(num):
for i in range(num):
yield ('name_{}'.format(i), (i*i), (i*i*i))
# Generator expression to generate generator object (Only once data get populated, can not be re used)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )
df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))
To add raw to existing DataFrame you can use append method.
df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400 }])
Figured out a simple and nice way:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6
Note the caveat with performance as noted in the comments
It's been a long time, but I faced the same problem too. And found here a lot of interesting answers. So I was confused what method to use.
In the case of adding a lot of rows to dataframe I interested in speed performance. So I tried 4 most popular methods and checked their speed.
UPDATED IN 2019 using new versions of packages. Also updated after @FooBar comment
Results (in secs):
|------------|-------------|-------------|-------------|
| Approach | 1000 rows | 5000 rows | 10 000 rows |
|------------|-------------|-------------|-------------|
| .append | 0.69 | 3.39 | 6.78 |
|------------|-------------|-------------|-------------|
| .loc w/o | 0.74 | 3.90 | 8.35 |
| prealloc | | | |
|------------|-------------|-------------|-------------|
| .loc with | 0.24 | 2.58 | 8.70 |
| prealloc | | | |
|------------|-------------|-------------|-------------|
| dict | 0.012 | 0.046 | 0.084 |
|------------|-------------|-------------|-------------|
Also thanks to @krassowski for useful comment - I updated the code.
So I use addition through the dictionary for myself.
Code:
import pandas as pd
import numpy as np
import time
del df1, df2, df3, df4
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)
# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
df2.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)
# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
df3.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)
# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
row_list.append(dict1)
df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)
P.S. I believe, my realization isn't perfect, and maybe there is some optimization.
Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?
Here are the most important reasons, taken from my post here.
dtypes
are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object
, which is bad.data = []
for a, b, c in some_function_that_yields_data():
data.append([a, b, c])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
append
or concat
inside a loop
append
and concat
aren't inherently bad in isolation. The
problem starts when you iteratively call them inside a loop - this
results in quadratic memory usage.
# Creates empty DataFrame and appends
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True)
# This is equally bad:
# df = pd.concat(
# [df, pd.Series({'A': i, 'B': b, 'C': c})],
# ignore_index=True)
Empty DataFrame of NaNs
Never create a DataFrame of NaNs as the columns are initialized with
object
(slow, un-vectorizable dtype).
# Creates DataFrame of NaNs and overwrites values.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
for a, b, c in some_function_that_yields_data():
df.loc[len(df)] = [a, b, c]
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.
It's posts like this that remind me why I'm a part of this community. People understand the importance of teaching folks getting the right answer with the right code, not the right answer with wrong code. Now you might argue that it is not an issue to use loc
or append
if you're only adding a single row to your DataFrame. However, people often look to this question to add more than just one row - often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.
If you have a data frame df
and want to add a list new_list
as a new row to df
, you can simply do:
df.loc[len(df)] = new_list
If you want to add a new data frame new_df
under data frame df
, then you can use:
df.append(new_df)
Create a new record(data frame) and add to old_data_frame.
pass list of values and corresponding column names to create a new_record (data_frame)
new_record = pd.DataFrame([[0,'abcd',0,1,123]],columns=['a','b','c','d','e'])
old_data_frame = pd.concat([old_data_frame,new_record])
if you want to add row at the end append it as a list
valuestoappend = [va1,val2,val3]
res = res.append(pd.Series(valuestoappend,index = ['lib', 'qty1', 'qty2']),ignore_index = True)
If all data in your Dataframe has the same dtype you might use a numpy array. You can write rows directly into the predefined array and convert it to a dataframe at the end. Seems to be even faster than converting a list of dicts.
import pandas as pd
import numpy as np
from string import ascii_uppercase
startTime = time.perf_counter()
numcols, numrows = 5, 10000
npdf = np.ones((numrows, numcols))
for row in range(numrows):
npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df5.shape)
pandas.DataFrame.append
DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=False) ? 'DataFrame'
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
With ignore_index set to True:
df.append(df2, ignore_index=True)
You can concatenate two DataFrames for this. I basically came across this problem to add a new row to an existing DataFrame with a character index(not numeric). So, I input the data for a new row in a duct() and index in a list.
new_dict = {put input for new row here}
new_list = [put your index here]
new_df = pd.DataFrame(data=new_dict, index=new_list)
df = pd.concat([existing_df, new_df])
We often see the construct df.loc[subscript] = …
to assign to one DataFrame row. Mikhail_Sam posted benchmarks containing, among others, this construct as well as the method using dict and create DataFrame in the end. He found the latter to be the fastest by far. But if we replace the df3.loc[i] = …
(with preallocated DataFrame) in his code with df3.values[i] = …
, the outcome changes significantly, in that that method performs similar to the one using dict. So we should more often take the use of df.values[subscript] = …
into consideration. However note that .values
takes a zero-based subscript, which may be different from the DataFrame.index.
before going to add a row, we have to convert the dataframe to dictionary there you can see the keys as columns in dataframe and values of the columns are again stored in the dictionary but there key for every column is the index number in dataframe. That idea make me to write the below code.
df2=df.to_dict()
values=["s_101","hyderabad",10,20,16,13,15,12,12,13,25,26,25,27,"good","bad"] #this is total row that we are going to add
i=0
for x in df.columns: #here df.columns gives us the main dictionary key
df2[x][101]=values[i] #here the 101 is our index number it is also key of sub dictionary
i+=1
Source: Stackoverflow.com