(INNER | (LEFT | RIGHT | FULL) OUTER) JOIN with pandas? merge? join? concat? update? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will touch on later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
This post will go through index-based joins using merge, join, and concat.
There are a few options, some simpler than others depending on the use case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
- supports inner/left/right/full
- can only join two at a time
- supports column-column, index-column, index-index joins
DataFrame.join (joins on index)
- supports inner/left (default)/right/full
- can join multiple DataFrames at a time
- supports index-index joins
pd.concat (joins on index)
- supports inner/full (default)
- can join multiple DataFrames at a time
- supports index-index joins
Setup & Basics
import pandas as pd
import numpy as np
np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)},
                    index=['A', 'B', 'C', 'D'])
right = pd.DataFrame(data={'value': np.random.randn(4)},
                     index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Other joins follow similar syntax.
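As a rough sketch of what "similar syntax" means, the snippet below runs the same index-on-index merge with each value of how. The frames are small made-up stand-ins with the same index layout as left and right above (the numeric values are placeholders, not the random data from the setup).

```python
import pandas as pd

# Hypothetical stand-ins mirroring the index layout of `left` and `right`.
left = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right = pd.DataFrame({'value': [5.0, 6.0, 7.0, 8.0]},
                     index=pd.Index(['B', 'D', 'E', 'F'], name='idxkey'))

# Same call, only `how` changes.
results = {
    how: left.merge(right, left_index=True, right_index=True, how=how)
    for how in ('inner', 'left', 'right', 'outer')
}

for how, df in results.items():
    # Show which keys survive each join type.
    print(how, sorted(df.index))
```

Keys that exist on only one side get NaN in the columns coming from the other side.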
Notable Alternatives
DataFrame.join joins on the index by default. It does a LEFT OUTER JOIN by default, so how='inner' is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Note that I needed to specify the lsuffix and rsuffix arguments, since join would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
This happens because the column names are the same. It would not be a problem if they were named differently.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so join='inner' is required here.
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For more information on concat, see this post.
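To make the default behaviour concrete, here is a small sketch comparing pd.concat along axis=1 with and without join='inner'. The frames and the column names left_value and right_value are made up for illustration.

```python
import pandas as pd

# Hypothetical frames with partially overlapping indexes.
left = pd.DataFrame({'left_value': [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right = pd.DataFrame({'right_value': [5.0, 6.0, 7.0, 8.0]},
                     index=pd.Index(['B', 'D', 'E', 'F'], name='idxkey'))

# Default: full outer join on the index (union of keys, NaN where missing).
outer = pd.concat([left, right], axis=1)

# Inner join: only keys present in every frame.
inner = pd.concat([left, right], axis=1, join='inner')

print(outer.shape, inner.shape)
```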
To perform an inner join using the index of left and a column of right, use DataFrame.merge with left_index=True and right_on=... together.
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.
join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.
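The set_index pre-step could look roughly like this. The frames are made-up stand-ins for left and right2 (placeholder values), and the suffixes are chosen only to match the merge output naming.

```python
import pandas as pd

# Hypothetical stand-ins: `left` keyed by its index, `right2` keyed by a column.
left = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right2 = pd.DataFrame({'colkey': ['B', 'D', 'E', 'F'],
                       'value': [5.0, 6.0, 7.0, 8.0]})

# Move the key column into the index first, then join index-on-index.
joined = left.join(right2.set_index('colkey'),
                   how='inner', lsuffix='_x', rsuffix='_y')
print(joined)
```

The same pre-step applies before pd.concat(..., axis=1), since it also aligns on the index.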
If your index is named, then from pandas >= 0.23, DataFrame.merge lets you pass the index name to on (or to left_on and right_on as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For the previous example of merging with the index of left and a column of right, you can use left_on with the index name of left:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
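As a quick sanity check, the sketch below compares the named-index form with the left_index/right_index form on small made-up frames (placeholder values, assuming pandas >= 0.23 as noted above). Both should select the same rows and values.

```python
import pandas as pd

# Hypothetical frames whose index is named 'idxkey' on both sides.
left = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right = pd.DataFrame({'value': [5.0, 6.0, 7.0, 8.0]},
                     index=pd.Index(['B', 'D', 'E', 'F'], name='idxkey'))

# Refer to the index by name ...
by_name = left.merge(right, on='idxkey')

# ... or by the explicit index flags.
by_flag = left.merge(right, left_index=True, right_index=True)

print(by_name)
print(by_flag)
```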
In this answer, I will consider two practical examples.
The first uses pandas.concat.
The second merges DataFrames on the index of one and a column of the other.
1. pandas.concat
Consider the following DataFrames, which have the same column names:
Preco2018 with size (8784, 5)
Preco2019 with size (8760, 5)
You can combine them using pandas.concat, simply by:
import pandas as pd
frames = [Preco2018, Preco2019]
df_merged = pd.concat(frames)
This results in a DataFrame of size (17544, 5).
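A minimal sketch of the same stacking, using zero-filled stand-ins with the sizes quoted above (the column names are invented, since the real ones are not shown). It also shows ignore_index=True, which is worth considering when the original row labels repeat.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for Preco2018 and Preco2019 (invented column names).
cols = ['c1', 'c2', 'c3', 'c4', 'c5']
Preco2018 = pd.DataFrame(np.zeros((8784, 5)), columns=cols)
Preco2019 = pd.DataFrame(np.zeros((8760, 5)), columns=cols)

# Stack rows; the original row labels are kept, so they repeat.
df_merged = pd.concat([Preco2018, Preco2019])

# Optionally renumber the rows from 0.
df_renumbered = pd.concat([Preco2018, Preco2019], ignore_index=True)

print(df_merged.shape, df_merged.index.is_unique, df_renumbered.index.is_unique)
```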
2. Merge by Column and Index
In this part, I will consider a specific case: merging the index of one DataFrame with a column of another.
Say one has the DataFrame Geo with 54 columns, one of which is the date column Data, of type datetime64[ns].
And the DataFrame Price, which has one column with the price and whose index corresponds to the dates.
In this specific case, to merge them, one uses pd.merge:
merged = pd.merge(Price, Geo, left_index=True, right_on='Data')
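For a version that can be run without the original data, here is a small sketch with tiny made-up Price and Geo frames. Only the Data column name comes from the description above; the price and region columns and all values are assumptions.

```python
import pandas as pd

# Hypothetical Price: one price column, dates in the index.
Price = pd.DataFrame(
    {'price': [10.0, 11.0, 12.0]},
    index=pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03']),
)

# Hypothetical Geo: the dates live in the 'Data' column.
Geo = pd.DataFrame({
    'Data': pd.to_datetime(['2019-01-02', '2019-01-03', '2019-01-04']),
    'region': ['north', 'south', 'east'],
})

# Index of Price against column 'Data' of Geo (inner join by default).
merged = pd.merge(Price, Geo, left_index=True, right_on='Data')
print(merged)
```

Only the dates present on both sides survive, because merge defaults to how='inner'.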
This post covers generalizing to multiple DataFrames (and why merge has shortcomings here).
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
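One common way to write the chained form for an arbitrary list is functools.reduce. This is only a sketch of the naive chaining approach on made-up frames, not a separate recommendation; it still performs one pairwise merge per DataFrame.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames sharing a 'key' column (placeholder values).
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': [1.0, 2.0, 3.0, 4.0]})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': [5.0, 6.0, 7.0, 8.0]})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': [1.0, 1.0, 1.0, 1.0]})
dfs = [A, B, C]

# Equivalent to A.merge(B, on='key').merge(C, on='key'), for any list length.
merged = reduce(lambda l, r: l.merge(r, on='key', how='inner'), dfs)
print(merged)
```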
Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.
# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
# merge on `key` column, you'll need to set the index before concatenating
pd.concat(
    [df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
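A short sketch of the full outer variant, on made-up frames indexed by key (placeholder values): every key from any frame appears, and missing cells are NaN.

```python
import pandas as pd

# Hypothetical frames already indexed by a unique 'key'.
A2 = pd.DataFrame({'valueA': [1.0, 2.0, 3.0, 4.0]},
                  index=pd.Index(['A', 'B', 'C', 'D'], name='key'))
B2 = pd.DataFrame({'valueB': [5.0, 6.0, 7.0, 8.0]},
                  index=pd.Index(['B', 'D', 'E', 'F'], name='key'))
C2 = pd.DataFrame({'valueC': [1.0, 1.0, 1.0, 1.0]},
                  index=pd.Index(['D', 'E', 'J', 'C'], name='key'))

# No join argument: union of all keys (full outer join).
full = pd.concat([A2, B2, C2], axis=1)

# Rows with no missing values are exactly the inner-join rows.
complete = full.dropna()
print(full.shape, list(complete.index))
```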
concat is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
    [df.set_index('key') for df in (B, C)], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0
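For completeness, a sketch of the default (left) behaviour when how is omitted, on made-up frames with placeholder values. The first call uses unique keys; the second uses a left frame with a duplicated key, which join still accepts.

```python
import pandas as pd

# Hypothetical frames indexed by 'key'.
A2 = pd.DataFrame({'valueA': [1.0, 2.0, 3.0, 4.0]},
                  index=pd.Index(['A', 'B', 'C', 'D'], name='key'))
A3 = pd.DataFrame({'valueA': [1.0, 2.0, 3.0, 4.0, 5.0]},
                  index=pd.Index(['A', 'B', 'C', 'D', 'D'], name='key'))
B2 = pd.DataFrame({'valueB': [5.0, 6.0, 7.0, 8.0]},
                  index=pd.Index(['B', 'D', 'E', 'F'], name='key'))
C2 = pd.DataFrame({'valueC': [1.0, 1.0, 1.0, 1.0]},
                  index=pd.Index(['D', 'E', 'J', 'C'], name='key'))

# Left join (the default): every key of the calling frame is kept.
left_unique = A2.join([B2, C2])

# Same call with a duplicated key on the left.
left_dupes = A3.join([B2, C2])

print(left_unique)
print(left_dupes)
```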
Source: Stackoverflow.com