{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Merges\n", "\n", "Merge lets us combine multiple data frames. The general syntax is as follows to merge pandas dataframes df_1 and df_2:\n", " \n", " df_1.merge(df_2, how = [\"inner\", \"outer\", \"left\"] , left_on = left_column_name, right_on = right_column_name)\n", "\n", "the example above shows the three main inputs to the how arguments. When actually using a merge you should only specify one of these three values for the how arguments. The how argument specifies the type of merge and the left_on and right_on arguments tells pandas which columns to match up from each of the two dataframes when deciding which rows to keep. In the example above, df_1 is the left dataframe and df_2 is the right dataframe." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
degreedoc_namedoctor_id
0MDJon210
1MBBSPeter211
2MDKe212
3MDPat213
\n", "
" ], "text/plain": [ " degree doc_name doctor_id\n", "0 MD Jon 210\n", "1 MBBS Peter 211\n", "2 MD Ke 212\n", "3 MD Pat 213" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "#Create doctor df\n", "\n", "doctors = pd.DataFrame({\"doctor_id\":[210,211,212,213],\"doc_name\":[\"Jon\", \"Peter\", \"Ke\", \"Pat\"],\\\n", " \"degree\":[\"MD\",\"MBBS\", \"MD\", \"MD\"]})\n", "\n", "#Create visits\n", "visits = pd.DataFrame({\"doctor_id\":[210,214,215,212,212], \"patient_name\": [\"Julia\", \"TJ\", \"John\", \"James\", \"Jason\"],\\\n", " \"year\":[2015,2014, 2015, 2016, 2012]})\n", "\n", "doctors" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
doctor_idpatient_nameyear
0210Julia2015
1214TJ2014
2215John2015
3212James2016
4212Jason2012
\n", "
" ], "text/plain": [ " doctor_id patient_name year\n", "0 210 Julia 2015\n", "1 214 TJ 2014\n", "2 215 John 2015\n", "3 212 James 2016\n", "4 212 Jason 2012" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "visits" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "For every visit, I would like to create a table with the doctor that saw the patient, if one exists. We will merge on the doctor_id column with an inner merge. With an inner merge, if a match isn't found, then we do not include the row. If the two columns that you want to merge on have the same name, then you can simply specify the on argument instead of separate left_on and right_on arguments taking the same value." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
doctor_idpatient_nameyeardegreedoc_name
0210Julia2015MDJon
1212James2016MDKe
2212Jason2012MDKe
\n", "
" ], "text/plain": [ " doctor_id patient_name year degree doc_name\n", "0 210 Julia 2015 MD Jon\n", "1 212 James 2016 MD Ke\n", "2 212 Jason 2012 MD Ke" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Basic inner merge\n", "visits.merge(doctors, how = \"inner\", on = \"doctor_id\" )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we have lost the visits of TJ and John. This is because the doctors table does not have a doctor corresponding to ids 214 and 215. In the case, the inner merge will not include these visits in the dataframe return from the merge. We can slice the returned dataframe using .loc to get the desired columns or remove duplicates that arise because of the merge. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
patient_nameyeardoc_name
0Julia2015Jon
1James2016Ke
2Jason2012Ke
\n", "
" ], "text/plain": [ " patient_name year doc_name\n", "0 Julia 2015 Jon\n", "1 James 2016 Ke\n", "2 Jason 2012 Ke" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "visits.merge(doctors, how = \"inner\", on = \"doctor_id\" ).loc[:,[\"patient_name\", \"year\", \"doc_name\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now lets see how we can use a left merge to keep these visits that don't have a matching doctor." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
patient_nameyeardoc_name
0Julia2015Jon
1TJ2014NaN
2John2015NaN
3James2016Ke
4Jason2012Ke
\n", "
" ], "text/plain": [ " patient_name year doc_name\n", "0 Julia 2015 Jon\n", "1 TJ 2014 NaN\n", "2 John 2015 NaN\n", "3 James 2016 Ke\n", "4 Jason 2012 Ke" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Basic left merge\n", "visits.merge(doctors, how = \"left\", on = \"doctor_id\" ).loc[:,[\"patient_name\", \"year\", \"doc_name\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that in this example, we get a row for every visit, but for visits that don't have a matching doctor_id we get NaNs from the doctors table." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Finally, let's see an example of an outer merge. Consider the following two dataframes." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WeekRushPoints
0112
1232
2334
3412
\n", "
" ], "text/plain": [ " Week RushPoints\n", "0 1 12\n", "1 2 32\n", "2 3 34\n", "3 4 12" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Create Two data frames\n", "df1 = pd.DataFrame({ \"RushPoints\":[12,32,34,12], \"Week\":[1,2,3,4]})\n", "df1 = df1[[\"Week\", \"RushPoints\"]]\n", "df1" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WeekRecPoints
02101
13310
25234
\n", "
" ], "text/plain": [ " Week RecPoints\n", "0 2 101\n", "1 3 310\n", "2 5 234" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame({\"Week\":[2,3,5], \"RecPoints\":[101,310,234]})\n", "df2 = df2[[\"Week\", \"RecPoints\"]]\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets say I wanted to combine the records and put a zero if one of the points categories did not exists for the given week. In this case I want to merge on Week, but each table has weeks that other doesn't have so a left merge will delete rows no matter which dataframe I choose as the left one. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WeekRushPointsRecPoints
0112.0NaN
1232.0101.0
2334.0310.0
3412.0NaN
45NaN234.0
\n", "
" ], "text/plain": [ " Week RushPoints RecPoints\n", "0 1 12.0 NaN\n", "1 2 32.0 101.0\n", "2 3 34.0 310.0\n", "3 4 12.0 NaN\n", "4 5 NaN 234.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Basic outer merge\n", "final = df1.merge(df2, how = \"outer\", on = \"Week\")\n", "final" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the resulting data frame has a row for each week!" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WeekRushPointsRecPoints
0112.00.0
1232.0101.0
2334.0310.0
3412.00.0
450.0234.0
\n", "
" ], "text/plain": [ " Week RushPoints RecPoints\n", "0 1 12.0 0.0\n", "1 2 32.0 101.0\n", "2 3 34.0 310.0\n", "3 4 12.0 0.0\n", "4 5 0.0 234.0" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Replace the NAN with 0\n", "final.fillna(0, inplace = True)\n", "final" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" }, "toc": { "nav_menu": { "height": "84px", "width": "252px" }, "number_sections": true, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }