Abstract:
merge two data frames using pd.concat() or pd.merge() of the Pandas
module.
Merging data frames
requires a sharing items between the two data frames, namely row
names or a column. The function pd.concat() is used for merging based
on column/row names, and pd.merge() is based on the column/row items.
1. Use of
pd.concat()
Combine the two data frames by row
names (axis=0) and column names (axis=1) using pd.concat().
>>>
df1 = pd.DataFrame(np.random.rand(3,4), columns=list("ABCD"),
index=list("abc"))
>>>
df2 = pd.DataFrame(np.random.rand(3,4), columns=list("BACD"),
index=list("cba"))
>>>
print df1
A B C D
a
0.371818 0.556964 0.307669 0.559687
b
0.898462 0.588832 0.596533 0.619365
c
0.007486 0.428250 0.385228 0.600094
[3
rows x 4 columns]
>>>
print df2
B A C D
c
0.979984 0.938620 0.602475 0.843502
b
0.298193 0.474888 0.067354 0.746102
a
0.431692 0.485134 0.956539 0.921407
[3
rows x 4 columns]
>>>
>>>
new_df=pd.concat([df1,df2], axis=1)
>>>
print 'Combined data frame by columns:\n', new_df
A B C D B A
C D
a
0.371818 0.556964 0.307669 0.559687 0.431692 0.485134 0.956539
0.921407
b
0.898462 0.588832 0.596533 0.619365 0.298193 0.474888 0.067354
0.746102
c
0.007486 0.428250 0.385228 0.600094 0.979984 0.938620 0.602475
0.843502
[3
rows x 8 columns]
>>>
new_df=pd.concat([df1,df2], axis=0)
>>>
print 'Combined data frame by rows:\n', new_df
A B C D
a
0.371818 0.556964 0.307669 0.559687
b
0.898462 0.588832 0.596533 0.619365
c
0.007486 0.428250 0.385228 0.600094
a
0.371818 0.556964 0.307669 0.559687
b
0.898462 0.588832 0.596533 0.619365
c
0.007486 0.428250 0.385228 0.600094
[6
rows x 4 columns]
>>>
With the above examples, the column or
row names between the data frames should matched to each other though
their order may be different. If some column/row names was different,
np.nan could be filled in the new data frame:
>>>
print df1, df2
A B C D
a
0.962725 0.302218 0.993607 0.691932
b
0.979072 0.351700 0.998637 0.316548
c
0.847911 0.488782 0.194264 0.153505
[3
rows x 4 columns]
A
B E F
c
0.544601 0.532101 0.950279 0.733621
b
0.967500 0.845823 0.528155 0.929168
d
0.219037 0.499900 0.747760 0.029587
[3
rows x 4 columns]
>>>
new_df=pd.concat([df1,df2], axis=0)
>>>
print 'Combined data frame by rows:\n', new_df
Combined
data frame by rows:
A B C D E F
a
0.962725 0.302218 0.993607 0.691932 NaN NaN
b
0.979072 0.351700 0.998637 0.316548 NaN NaN
c
0.847911 0.488782 0.194264 0.153505 NaN NaN
c
0.544601 0.532101 NaN NaN 0.950279 0.733621
b
0.967500 0.845823 NaN NaN 0.528155 0.929168
d
0.219037 0.499900 NaN NaN 0.747760 0.029587
[6
rows x 6 columns]
>>>
2. Use of pd.merge()
Combine the two data frames based
their shared on one or more column items using pd.merge(). Here, the
column name as 'id' is the shared column
>>>
df1 = pd.DataFrame({'id':['a','a','b','b','c','c'], 'no':range(6)})
>>>
df2 = pd.DataFrame({'id':['a','b','c'], 'level':range(3)})
>>>
new_df=pd.merge(df1,df2, on='id')
df1
id
no
0 a
0
1 a
1
2 b
2
3 b
3
4 c
4
5 c
5
[6
rows x 2 columns]
df2
id
level
0 a
0
1 b
1
2 c
2
[3
rows x 2 columns]
new_df
id
no level
0 a
0 0
1 a
1 0
2 b
2 1
3 b
3 1
4 c
4 2
5 c
5 2
[6
rows x 3 columns]
The defaults shared values in the
column would be reserved if the values in the column 'id' might not
match to each other. The values with ['a', 'c] in the column 'id'
would be reserved. The switch 'how='outer' should be added if you
want values to involved.
>>>
df1 = pd.DataFrame({'id':['a','a','b','b','c','c'], 'no':range(6)})
>>>
df2 = pd.DataFrame({'id':['a','b','d'], 'level':range(3)})
>>>
new_df=pd.merge(df1,df2, on='id')
df1
id
no
0 a
0
1 a
1
2 b
2
3 b
3
4 c
4
5 c
5
[6
rows x 2 columns]
df2
id
level
0 a
0
1 b
1
2 d
2
[3
rows x 2 columns]
new_df
id
no level
0 a
0 0
1 a
1 0
2 b
2 1
3 b
3 1
[4
rows x 3 columns]
>>>
>>>
new_df=pd.merge(df1,df2, on='id', how='outer')
id
no level
0 a
0 0
1 a
1 0
2 b
2 1
3 b
3 1
4 c
4 NaN
5 c
5 NaN
6 d
NaN 2
[7
rows x 3 columns]
>>>
if the columns names are not identical
in the two data frame, the switch 'left_on' and 'right_on' can be
used:
>>>
df1 = pd.DataFrame({'id1':['a','a','b','b','c','c'], 'no':range(6)})
>>>
df2 = pd.DataFrame({'id2':['a','b','c'], 'level':range(3)})
>>>
new_df=pd.merge(df1,df2, left_on='id1', right_on='id2')
>>>
print new_df
id1
no id2 level
0 a
0 a 0
1 a
1 a 0
2 b
2 b 1
3 b
3 b 1
4 c
4 c 2
5 c
5 c 2
[6
rows x 4 columns]
>>>
And one more factors can be used for
merging the data frame:
>>>
df1 = pd.DataFrame({'id1':['a','a','b','b','c','c'],
'id2':['E','E','F','F','G','G'], 'no':range(6)})
>>>
df2 = pd.DataFrame({'id1':['a','b','c'], 'id2':['E','F','G'],
'level':range(3)})
>>>
new_df=pd.merge(df1,df2, on=['id1','id2'])
>>>
print new_df
id1
id2 no level
0 a
E 0 0
1 a
E 1 0
2 b
F 2 1
3 b
F 3 1
4 c
G 4 2
5 c
G 5 2
[6
rows x 4 columns]
>>>
Here, we merge the data frame df1 with
its the column 'id' with the data frame df2 based on it row names:
>>>
df1 = pd.DataFrame({'id1':['a','a','b','b','c','c'],
'id2':['E','E','F','F','G','G'], 'no':range(6)})
>>>
df2 = pd.DataFrame(['E','F','G'],columns=['id2'],index=['a','b','c'])
>>>
new_df=pd.merge(df1,df2, left_on='id1', right_index=True)
>>>
print new_df
id1
id2_x no id2_y
0 a
E 0 E
1 a
E 1 E
2 b
F 2 F
3 b
F 3 F
4 c
G 4 G
5 c
G 5 G
[6
rows x 4 columns]
No comments:
Post a Comment