{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Merges\n",
    "\n",
    "Merge lets us combine multiple data frames. The general syntax is as follows to merge pandas dataframes df_1 and df_2:\n",
    "    \n",
    "    df_1.merge(df_2, how = [\"inner\", \"outer\", \"left\"] , left_on = left_column_name, right_on = right_column_name)\n",
    "\n",
    "the example above shows the three main inputs to the how arguments.  When actually using a merge you should only specify one of these three values for the how arguments. The how argument specifies the type of merge and the left_on and right_on arguments tells pandas which columns to match up from each of the two dataframes when deciding which rows to keep.  In the example above, df_1 is the left dataframe and df_2 is the right dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>degree</th>\n",
       "      <th>doc_name</th>\n",
       "      <th>doctor_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>MD</td>\n",
       "      <td>Jon</td>\n",
       "      <td>210</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>MBBS</td>\n",
       "      <td>Peter</td>\n",
       "      <td>211</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>MD</td>\n",
       "      <td>Ke</td>\n",
       "      <td>212</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>MD</td>\n",
       "      <td>Pat</td>\n",
       "      <td>213</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  degree doc_name  doctor_id\n",
       "0     MD      Jon        210\n",
       "1   MBBS    Peter        211\n",
       "2     MD       Ke        212\n",
       "3     MD      Pat        213"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "#Create doctor df\n",
    "\n",
    "doctors = pd.DataFrame({\"doctor_id\":[210,211,212,213],\"doc_name\":[\"Jon\", \"Peter\", \"Ke\", \"Pat\"],\\\n",
    "                       \"degree\":[\"MD\",\"MBBS\", \"MD\", \"MD\"]})\n",
    "\n",
    "#Create visits\n",
    "visits = pd.DataFrame({\"doctor_id\":[210,214,215,212,212], \"patient_name\": [\"Julia\", \"TJ\", \"John\", \"James\", \"Jason\"],\\\n",
    "                      \"year\":[2015,2014, 2015, 2016, 2012]})\n",
    "\n",
    "doctors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>doctor_id</th>\n",
       "      <th>patient_name</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>210</td>\n",
       "      <td>Julia</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>214</td>\n",
       "      <td>TJ</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>215</td>\n",
       "      <td>John</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>212</td>\n",
       "      <td>James</td>\n",
       "      <td>2016</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>212</td>\n",
       "      <td>Jason</td>\n",
       "      <td>2012</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   doctor_id patient_name  year\n",
       "0        210        Julia  2015\n",
       "1        214           TJ  2014\n",
       "2        215         John  2015\n",
       "3        212        James  2016\n",
       "4        212        Jason  2012"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "visits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "For every visit, I would like to create a table with the doctor that saw the patient, if one exists. We will merge on the doctor_id column with an inner merge.  With an inner merge, if a match isn't found, then we do not include the row. If the two columns that you want to merge on have the same name, then you can simply specify the on argument instead of separate left_on and right_on arguments taking the same value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>doctor_id</th>\n",
       "      <th>patient_name</th>\n",
       "      <th>year</th>\n",
       "      <th>degree</th>\n",
       "      <th>doc_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>210</td>\n",
       "      <td>Julia</td>\n",
       "      <td>2015</td>\n",
       "      <td>MD</td>\n",
       "      <td>Jon</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>212</td>\n",
       "      <td>James</td>\n",
       "      <td>2016</td>\n",
       "      <td>MD</td>\n",
       "      <td>Ke</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>212</td>\n",
       "      <td>Jason</td>\n",
       "      <td>2012</td>\n",
       "      <td>MD</td>\n",
       "      <td>Ke</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   doctor_id patient_name  year degree doc_name\n",
       "0        210        Julia  2015     MD      Jon\n",
       "1        212        James  2016     MD       Ke\n",
       "2        212        Jason  2012     MD       Ke"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Basic inner merge\n",
    "visits.merge(doctors, how  = \"inner\", on  = \"doctor_id\" )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that we have lost the visits of TJ and John. This is because the doctors table does not have a doctor corresponding to ids 214 and 215.  In the case, the inner merge will not include these visits in the dataframe return from the merge. We can slice the returned dataframe using .loc to get the desired columns or remove duplicates that arise because of the merge. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>patient_name</th>\n",
       "      <th>year</th>\n",
       "      <th>doc_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Julia</td>\n",
       "      <td>2015</td>\n",
       "      <td>Jon</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>James</td>\n",
       "      <td>2016</td>\n",
       "      <td>Ke</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Jason</td>\n",
       "      <td>2012</td>\n",
       "      <td>Ke</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  patient_name  year doc_name\n",
       "0        Julia  2015      Jon\n",
       "1        James  2016       Ke\n",
       "2        Jason  2012       Ke"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "visits.merge(doctors, how  = \"inner\", on  = \"doctor_id\" ).loc[:,[\"patient_name\", \"year\", \"doc_name\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets see how we can use a left merge to keep these visits that don't have a matching doctor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>patient_name</th>\n",
       "      <th>year</th>\n",
       "      <th>doc_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Julia</td>\n",
       "      <td>2015</td>\n",
       "      <td>Jon</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>TJ</td>\n",
       "      <td>2014</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>John</td>\n",
       "      <td>2015</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>James</td>\n",
       "      <td>2016</td>\n",
       "      <td>Ke</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Jason</td>\n",
       "      <td>2012</td>\n",
       "      <td>Ke</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  patient_name  year doc_name\n",
       "0        Julia  2015      Jon\n",
       "1           TJ  2014      NaN\n",
       "2         John  2015      NaN\n",
       "3        James  2016       Ke\n",
       "4        Jason  2012       Ke"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Basic left merge\n",
    "visits.merge(doctors, how  = \"left\", on  = \"doctor_id\" ).loc[:,[\"patient_name\", \"year\", \"doc_name\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that in this example, we get a row for every visit, but for visits that don't have a matching doctor_id we get NaNs from the doctors table."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Finally, let's see an example of an outer merge.  Consider the following two dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Week</th>\n",
       "      <th>RushPoints</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>32</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Week  RushPoints\n",
       "0     1          12\n",
       "1     2          32\n",
       "2     3          34\n",
       "3     4          12"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Create Two data frames\n",
    "df1 = pd.DataFrame({ \"RushPoints\":[12,32,34,12], \"Week\":[1,2,3,4]})\n",
    "df1 = df1[[\"Week\", \"RushPoints\"]]\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Week</th>\n",
       "      <th>RecPoints</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>310</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5</td>\n",
       "      <td>234</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Week  RecPoints\n",
       "0     2        101\n",
       "1     3        310\n",
       "2     5        234"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = pd.DataFrame({\"Week\":[2,3,5], \"RecPoints\":[101,310,234]})\n",
    "df2 = df2[[\"Week\", \"RecPoints\"]]\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets say I wanted to combine the records and put a zero if one of the points categories did not exists for the given week. In this case I want to merge on Week, but each table has weeks that other doesn't have so a left merge will delete rows no matter which dataframe I choose as the left one. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Week</th>\n",
       "      <th>RushPoints</th>\n",
       "      <th>RecPoints</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>12.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>32.0</td>\n",
       "      <td>101.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>34.0</td>\n",
       "      <td>310.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>12.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>234.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Week  RushPoints  RecPoints\n",
       "0     1        12.0        NaN\n",
       "1     2        32.0      101.0\n",
       "2     3        34.0      310.0\n",
       "3     4        12.0        NaN\n",
       "4     5         NaN      234.0"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Basic outer merge\n",
    "final = df1.merge(df2, how = \"outer\", on = \"Week\")\n",
    "final"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the resulting data frame has a row for each week!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Week</th>\n",
       "      <th>RushPoints</th>\n",
       "      <th>RecPoints</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>12.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>32.0</td>\n",
       "      <td>101.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>34.0</td>\n",
       "      <td>310.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>12.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0.0</td>\n",
       "      <td>234.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Week  RushPoints  RecPoints\n",
       "0     1        12.0        0.0\n",
       "1     2        32.0      101.0\n",
       "2     3        34.0      310.0\n",
       "3     4        12.0        0.0\n",
       "4     5         0.0      234.0"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Replace the NAN with 0\n",
    "final.fillna(0, inplace = True)\n",
    "final"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  },
  "toc": {
   "nav_menu": {
    "height": "84px",
    "width": "252px"
   },
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}