Tiezheng Yuan Ph.D.: Python: Pandas (2) merge data frames

Abstract: merge two data frames using pd.concat() or pd.merge() of the Pandas module.

Merging data frames requires a sharing items between the two data frames, namely row names or a column. The function pd.concat() is used for merging based on column/row names, and pd.merge() is based on the column/row items.

1. Use of pd.concat()

Combine the two data frames by row names (axis=0) and column names (axis=1) using pd.concat().

>>> df1 = pd.DataFrame(np.random.rand(3,4), columns=list("ABCD"), index=list("abc"))

>>> df2 = pd.DataFrame(np.random.rand(3,4), columns=list("BACD"), index=list("cba"))

>>> print df1

A B C D

a 0.371818 0.556964 0.307669 0.559687

b 0.898462 0.588832 0.596533 0.619365

c 0.007486 0.428250 0.385228 0.600094

[3 rows x 4 columns]

>>> print df2

B A C D

c 0.979984 0.938620 0.602475 0.843502

b 0.298193 0.474888 0.067354 0.746102

a 0.431692 0.485134 0.956539 0.921407

[3 rows x 4 columns]

>>>

>>> new_df=pd.concat([df1,df2], axis=1)

>>> print 'Combined data frame by columns:\n', new_df

A B C D B A C D

a 0.371818 0.556964 0.307669 0.559687 0.431692 0.485134 0.956539 0.921407

b 0.898462 0.588832 0.596533 0.619365 0.298193 0.474888 0.067354 0.746102

c 0.007486 0.428250 0.385228 0.600094 0.979984 0.938620 0.602475 0.843502

[3 rows x 8 columns]

>>> new_df=pd.concat([df1,df2], axis=0)

>>> print 'Combined data frame by rows:\n', new_df

A B C D

a 0.371818 0.556964 0.307669 0.559687

b 0.898462 0.588832 0.596533 0.619365

c 0.007486 0.428250 0.385228 0.600094

a 0.371818 0.556964 0.307669 0.559687

b 0.898462 0.588832 0.596533 0.619365

c 0.007486 0.428250 0.385228 0.600094

[6 rows x 4 columns]

>>>

With the above examples, the column or row names between the data frames should matched to each other though their order may be different. If some column/row names was different, np.nan could be filled in the new data frame:

>>> print df1, df2

A B C D

a 0.962725 0.302218 0.993607 0.691932

b 0.979072 0.351700 0.998637 0.316548

c 0.847911 0.488782 0.194264 0.153505

[3 rows x 4 columns]

A B E F

c 0.544601 0.532101 0.950279 0.733621

b 0.967500 0.845823 0.528155 0.929168

d 0.219037 0.499900 0.747760 0.029587

[3 rows x 4 columns]

>>> new_df=pd.concat([df1,df2], axis=0)

>>> print 'Combined data frame by rows:\n', new_df

Combined data frame by rows:

A B C D E F

a 0.962725 0.302218 0.993607 0.691932 NaN NaN

b 0.979072 0.351700 0.998637 0.316548 NaN NaN

c 0.847911 0.488782 0.194264 0.153505 NaN NaN

c 0.544601 0.532101 NaN NaN 0.950279 0.733621

b 0.967500 0.845823 NaN NaN 0.528155 0.929168

d 0.219037 0.499900 NaN NaN 0.747760 0.029587

[6 rows x 6 columns]

>>>

2. Use of pd.merge()

Combine the two data frames based their shared on one or more column items using pd.merge(). Here, the column name as 'id' is the shared column

>>> df1 = pd.DataFrame({'id':['a','a','b','b','c','c'], 'no':range(6)})

>>> df2 = pd.DataFrame({'id':['a','b','c'], 'level':range(3)})

>>> new_df=pd.merge(df1,df2, on='id')

df1

id no

0 a 0

1 a 1

2 b 2

3 b 3

4 c 4

5 c 5

[6 rows x 2 columns]

df2

id level

0 a 0

1 b 1

2 c 2

[3 rows x 2 columns]

new_df

id no level

0 a 0 0

1 a 1 0

2 b 2 1

3 b 3 1

4 c 4 2

5 c 5 2

[6 rows x 3 columns]

The defaults shared values in the column would be reserved if the values in the column 'id' might not match to each other. The values with ['a', 'c] in the column 'id' would be reserved. The switch 'how='outer' should be added if you want values to involved.

>>> df1 = pd.DataFrame({'id':['a','a','b','b','c','c'], 'no':range(6)})

>>> df2 = pd.DataFrame({'id':['a','b','d'], 'level':range(3)})

>>> new_df=pd.merge(df1,df2, on='id')

df1

id no

0 a 0

1 a 1

2 b 2

3 b 3

4 c 4

5 c 5

[6 rows x 2 columns]

df2

id level

0 a 0

1 b 1

2 d 2

[3 rows x 2 columns]

new_df

id no level

0 a 0 0

1 a 1 0

2 b 2 1

3 b 3 1

[4 rows x 3 columns]

>>>

>>> new_df=pd.merge(df1,df2, on='id', how='outer')

id no level

0 a 0 0

1 a 1 0

2 b 2 1

3 b 3 1

4 c 4 NaN

5 c 5 NaN

6 d NaN 2

[7 rows x 3 columns]

>>>

if the columns names are not identical in the two data frame, the switch 'left_on' and 'right_on' can be used:

>>> df1 = pd.DataFrame({'id1':['a','a','b','b','c','c'], 'no':range(6)})

>>> df2 = pd.DataFrame({'id2':['a','b','c'], 'level':range(3)})

>>> new_df=pd.merge(df1,df2, left_on='id1', right_on='id2')

>>> print new_df

id1 no id2 level

0 a 0 a 0

1 a 1 a 0

2 b 2 b 1

3 b 3 b 1

4 c 4 c 2

5 c 5 c 2

[6 rows x 4 columns]

>>>

And one more factors can be used for merging the data frame:

>>> df1 = pd.DataFrame({'id1':['a','a','b','b','c','c'], 'id2':['E','E','F','F','G','G'], 'no':range(6)})

>>> df2 = pd.DataFrame({'id1':['a','b','c'], 'id2':['E','F','G'], 'level':range(3)})

>>> new_df=pd.merge(df1,df2, on=['id1','id2'])

>>> print new_df

id1 id2 no level

0 a E 0 0

1 a E 1 0

2 b F 2 1

3 b F 3 1

4 c G 4 2

5 c G 5 2

[6 rows x 4 columns]

>>>

Here, we merge the data frame df1 with its the column 'id' with the data frame df2 based on it row names:

>>> df1 = pd.DataFrame({'id1':['a','a','b','b','c','c'], 'id2':['E','E','F','F','G','G'], 'no':range(6)})

>>> df2 = pd.DataFrame(['E','F','G'],columns=['id2'],index=['a','b','c'])

>>> new_df=pd.merge(df1,df2, left_on='id1', right_index=True)

>>> print new_df

id1 id2_x no id2_y

0 a E 0 E

1 a E 1 E

2 b F 2 F

3 b F 3 F

4 c G 4 G

5 c G 5 G

[6 rows x 4 columns]

Tiezheng Yuan Ph.D.

Sunday, November 8, 2015

Python: Pandas (2) merge data frames

No comments:

Post a Comment