Sunday, November 8, 2015

Python: Pandas (2) merge data frames


Abstract: merge two data frames using pd.concat() or pd.merge() of the Pandas module.


Merging data frames requires a sharing items between the two data frames, namely row names or a column. The function pd.concat() is used for merging based on column/row names, and pd.merge() is based on the column/row items.

1. Use of pd.concat()

Combine the two data frames by row names (axis=0) and column names (axis=1) using pd.concat().
>>> df1 = pd.DataFrame(np.random.rand(3,4), columns=list("ABCD"), index=list("abc"))
>>> df2 = pd.DataFrame(np.random.rand(3,4), columns=list("BACD"), index=list("cba"))
>>> print df1
A B C D
a 0.371818 0.556964 0.307669 0.559687
b 0.898462 0.588832 0.596533 0.619365
c 0.007486 0.428250 0.385228 0.600094

[3 rows x 4 columns]
>>> print df2
B A C D
c 0.979984 0.938620 0.602475 0.843502
b 0.298193 0.474888 0.067354 0.746102
a 0.431692 0.485134 0.956539 0.921407

[3 rows x 4 columns]
>>>
>>> new_df=pd.concat([df1,df2], axis=1)
>>> print 'Combined data frame by columns:\n', new_df
A B C D B A C D
a 0.371818 0.556964 0.307669 0.559687 0.431692 0.485134 0.956539 0.921407
b 0.898462 0.588832 0.596533 0.619365 0.298193 0.474888 0.067354 0.746102
c 0.007486 0.428250 0.385228 0.600094 0.979984 0.938620 0.602475 0.843502
[3 rows x 8 columns]
>>> new_df=pd.concat([df1,df2], axis=0)
>>> print 'Combined data frame by rows:\n', new_df
A B C D
a 0.371818 0.556964 0.307669 0.559687
b 0.898462 0.588832 0.596533 0.619365
c 0.007486 0.428250 0.385228 0.600094
a 0.371818 0.556964 0.307669 0.559687
b 0.898462 0.588832 0.596533 0.619365
c 0.007486 0.428250 0.385228 0.600094
[6 rows x 4 columns]

>>>
With the above examples, the column or row names between the data frames should matched to each other though their order may be different. If some column/row names was different, np.nan could be filled in the new data frame:
>>> print df1, df2
A B C D
a 0.962725 0.302218 0.993607 0.691932
b 0.979072 0.351700 0.998637 0.316548
c 0.847911 0.488782 0.194264 0.153505
[3 rows x 4 columns]
A B E F
c 0.544601 0.532101 0.950279 0.733621
b 0.967500 0.845823 0.528155 0.929168
d 0.219037 0.499900 0.747760 0.029587
[3 rows x 4 columns]
>>> new_df=pd.concat([df1,df2], axis=0)
>>> print 'Combined data frame by rows:\n', new_df
Combined data frame by rows:
A B C D E F
a 0.962725 0.302218 0.993607 0.691932 NaN NaN
b 0.979072 0.351700 0.998637 0.316548 NaN NaN
c 0.847911 0.488782 0.194264 0.153505 NaN NaN
c 0.544601 0.532101 NaN NaN 0.950279 0.733621
b 0.967500 0.845823 NaN NaN 0.528155 0.929168
d 0.219037 0.499900 NaN NaN 0.747760 0.029587

[6 rows x 6 columns]
>>>
2. Use of pd.merge()
Combine the two data frames based their shared on one or more column items using pd.merge(). Here, the column name as 'id' is the shared column
>>> df1 = pd.DataFrame({'id':['a','a','b','b','c','c'], 'no':range(6)})
>>> df2 = pd.DataFrame({'id':['a','b','c'], 'level':range(3)})
>>> new_df=pd.merge(df1,df2, on='id')
df1
id no
0 a 0
1 a 1
2 b 2
3 b 3
4 c 4
5 c 5
[6 rows x 2 columns]

df2
id level
0 a 0
1 b 1
2 c 2
[3 rows x 2 columns]

new_df
id no level
0 a 0 0
1 a 1 0
2 b 2 1
3 b 3 1
4 c 4 2
5 c 5 2
[6 rows x 3 columns]

The defaults shared values in the column would be reserved if the values in the column 'id' might not match to each other. The values with ['a', 'c] in the column 'id' would be reserved. The switch 'how='outer' should be added if you want values to involved.
>>> df1 = pd.DataFrame({'id':['a','a','b','b','c','c'], 'no':range(6)})
>>> df2 = pd.DataFrame({'id':['a','b','d'], 'level':range(3)})
>>> new_df=pd.merge(df1,df2, on='id')
df1
id no
0 a 0
1 a 1
2 b 2
3 b 3
4 c 4
5 c 5
[6 rows x 2 columns]

df2
id level
0 a 0
1 b 1
2 d 2
[3 rows x 2 columns]

new_df
id no level
0 a 0 0
1 a 1 0
2 b 2 1
3 b 3 1
[4 rows x 3 columns]
>>>
>>> new_df=pd.merge(df1,df2, on='id', how='outer')
id no level
0 a 0 0
1 a 1 0
2 b 2 1
3 b 3 1
4 c 4 NaN
5 c 5 NaN
6 d NaN 2
[7 rows x 3 columns]
>>>
if the columns names are not identical in the two data frame, the switch 'left_on' and 'right_on' can be used:
>>> df1 = pd.DataFrame({'id1':['a','a','b','b','c','c'], 'no':range(6)})
>>> df2 = pd.DataFrame({'id2':['a','b','c'], 'level':range(3)})
>>> new_df=pd.merge(df1,df2, left_on='id1', right_on='id2')
>>> print new_df
id1 no id2 level
0 a 0 a 0
1 a 1 a 0
2 b 2 b 1
3 b 3 b 1
4 c 4 c 2
5 c 5 c 2
[6 rows x 4 columns]
>>>
And one more factors can be used for merging the data frame:
>>> df1 = pd.DataFrame({'id1':['a','a','b','b','c','c'], 'id2':['E','E','F','F','G','G'], 'no':range(6)})
>>> df2 = pd.DataFrame({'id1':['a','b','c'], 'id2':['E','F','G'], 'level':range(3)})
>>> new_df=pd.merge(df1,df2, on=['id1','id2'])
>>> print new_df
id1 id2 no level
0 a E 0 0
1 a E 1 0
2 b F 2 1
3 b F 3 1
4 c G 4 2
5 c G 5 2
[6 rows x 4 columns]
>>>
Here, we merge the data frame df1 with its the column 'id' with the data frame df2 based on it row names:
>>> df1 = pd.DataFrame({'id1':['a','a','b','b','c','c'], 'id2':['E','E','F','F','G','G'], 'no':range(6)})
>>> df2 = pd.DataFrame(['E','F','G'],columns=['id2'],index=['a','b','c'])
>>> new_df=pd.merge(df1,df2, left_on='id1', right_index=True)
>>> print new_df
id1 id2_x no id2_y
0 a E 0 E
1 a E 1 E
2 b F 2 F
3 b F 3 F
4 c G 4 G
5 c G 5 G
[6 rows x 4 columns]









No comments:

Post a Comment