{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Merges\n",
"\n",
"Merge lets us combine multiple data frames. The general syntax is as follows to merge pandas dataframes df_1 and df_2:\n",
" \n",
" df_1.merge(df_2, how = [\"inner\", \"outer\", \"left\"] , left_on = left_column_name, right_on = right_column_name)\n",
"\n",
"the example above shows the three main inputs to the how arguments. When actually using a merge you should only specify one of these three values for the how arguments. The how argument specifies the type of merge and the left_on and right_on arguments tells pandas which columns to match up from each of the two dataframes when deciding which rows to keep. In the example above, df_1 is the left dataframe and df_2 is the right dataframe."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" degree | \n",
" doc_name | \n",
" doctor_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" MD | \n",
" Jon | \n",
" 210 | \n",
"
\n",
" \n",
" 1 | \n",
" MBBS | \n",
" Peter | \n",
" 211 | \n",
"
\n",
" \n",
" 2 | \n",
" MD | \n",
" Ke | \n",
" 212 | \n",
"
\n",
" \n",
" 3 | \n",
" MD | \n",
" Pat | \n",
" 213 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" degree doc_name doctor_id\n",
"0 MD Jon 210\n",
"1 MBBS Peter 211\n",
"2 MD Ke 212\n",
"3 MD Pat 213"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"#Create doctor df\n",
"\n",
"doctors = pd.DataFrame({\"doctor_id\":[210,211,212,213],\"doc_name\":[\"Jon\", \"Peter\", \"Ke\", \"Pat\"],\\\n",
" \"degree\":[\"MD\",\"MBBS\", \"MD\", \"MD\"]})\n",
"\n",
"#Create visits\n",
"visits = pd.DataFrame({\"doctor_id\":[210,214,215,212,212], \"patient_name\": [\"Julia\", \"TJ\", \"John\", \"James\", \"Jason\"],\\\n",
" \"year\":[2015,2014, 2015, 2016, 2012]})\n",
"\n",
"doctors"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" doctor_id | \n",
" patient_name | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 210 | \n",
" Julia | \n",
" 2015 | \n",
"
\n",
" \n",
" 1 | \n",
" 214 | \n",
" TJ | \n",
" 2014 | \n",
"
\n",
" \n",
" 2 | \n",
" 215 | \n",
" John | \n",
" 2015 | \n",
"
\n",
" \n",
" 3 | \n",
" 212 | \n",
" James | \n",
" 2016 | \n",
"
\n",
" \n",
" 4 | \n",
" 212 | \n",
" Jason | \n",
" 2012 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" doctor_id patient_name year\n",
"0 210 Julia 2015\n",
"1 214 TJ 2014\n",
"2 215 John 2015\n",
"3 212 James 2016\n",
"4 212 Jason 2012"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"visits"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"For every visit, I would like to create a table with the doctor that saw the patient, if one exists. We will merge on the doctor_id column with an inner merge. With an inner merge, if a match isn't found, then we do not include the row. If the two columns that you want to merge on have the same name, then you can simply specify the on argument instead of separate left_on and right_on arguments taking the same value."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" doctor_id | \n",
" patient_name | \n",
" year | \n",
" degree | \n",
" doc_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 210 | \n",
" Julia | \n",
" 2015 | \n",
" MD | \n",
" Jon | \n",
"
\n",
" \n",
" 1 | \n",
" 212 | \n",
" James | \n",
" 2016 | \n",
" MD | \n",
" Ke | \n",
"
\n",
" \n",
" 2 | \n",
" 212 | \n",
" Jason | \n",
" 2012 | \n",
" MD | \n",
" Ke | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" doctor_id patient_name year degree doc_name\n",
"0 210 Julia 2015 MD Jon\n",
"1 212 James 2016 MD Ke\n",
"2 212 Jason 2012 MD Ke"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Basic inner merge\n",
"visits.merge(doctors, how = \"inner\", on = \"doctor_id\" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that we have lost the visits of TJ and John. This is because the doctors table does not have a doctor corresponding to ids 214 and 215. In the case, the inner merge will not include these visits in the dataframe return from the merge. We can slice the returned dataframe using .loc to get the desired columns or remove duplicates that arise because of the merge. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" patient_name | \n",
" year | \n",
" doc_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Julia | \n",
" 2015 | \n",
" Jon | \n",
"
\n",
" \n",
" 1 | \n",
" James | \n",
" 2016 | \n",
" Ke | \n",
"
\n",
" \n",
" 2 | \n",
" Jason | \n",
" 2012 | \n",
" Ke | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" patient_name year doc_name\n",
"0 Julia 2015 Jon\n",
"1 James 2016 Ke\n",
"2 Jason 2012 Ke"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"visits.merge(doctors, how = \"inner\", on = \"doctor_id\" ).loc[:,[\"patient_name\", \"year\", \"doc_name\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now lets see how we can use a left merge to keep these visits that don't have a matching doctor."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" patient_name | \n",
" year | \n",
" doc_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Julia | \n",
" 2015 | \n",
" Jon | \n",
"
\n",
" \n",
" 1 | \n",
" TJ | \n",
" 2014 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" John | \n",
" 2015 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" James | \n",
" 2016 | \n",
" Ke | \n",
"
\n",
" \n",
" 4 | \n",
" Jason | \n",
" 2012 | \n",
" Ke | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" patient_name year doc_name\n",
"0 Julia 2015 Jon\n",
"1 TJ 2014 NaN\n",
"2 John 2015 NaN\n",
"3 James 2016 Ke\n",
"4 Jason 2012 Ke"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Basic left merge\n",
"visits.merge(doctors, how = \"left\", on = \"doctor_id\" ).loc[:,[\"patient_name\", \"year\", \"doc_name\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that in this example, we get a row for every visit, but for visits that don't have a matching doctor_id we get NaNs from the doctors table."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Finally, let's see an example of an outer merge. Consider the following two dataframes."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Week | \n",
" RushPoints | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 12 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 32 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 34 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 12 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Week RushPoints\n",
"0 1 12\n",
"1 2 32\n",
"2 3 34\n",
"3 4 12"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Create Two data frames\n",
"df1 = pd.DataFrame({ \"RushPoints\":[12,32,34,12], \"Week\":[1,2,3,4]})\n",
"df1 = df1[[\"Week\", \"RushPoints\"]]\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Week | \n",
" RecPoints | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2 | \n",
" 101 | \n",
"
\n",
" \n",
" 1 | \n",
" 3 | \n",
" 310 | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" 234 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Week RecPoints\n",
"0 2 101\n",
"1 3 310\n",
"2 5 234"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = pd.DataFrame({\"Week\":[2,3,5], \"RecPoints\":[101,310,234]})\n",
"df2 = df2[[\"Week\", \"RecPoints\"]]\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets say I wanted to combine the records and put a zero if one of the points categories did not exists for the given week. In this case I want to merge on Week, but each table has weeks that other doesn't have so a left merge will delete rows no matter which dataframe I choose as the left one. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Week | \n",
" RushPoints | \n",
" RecPoints | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 12.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 32.0 | \n",
" 101.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 34.0 | \n",
" 310.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 12.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" NaN | \n",
" 234.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Week RushPoints RecPoints\n",
"0 1 12.0 NaN\n",
"1 2 32.0 101.0\n",
"2 3 34.0 310.0\n",
"3 4 12.0 NaN\n",
"4 5 NaN 234.0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Basic outer merge\n",
"final = df1.merge(df2, how = \"outer\", on = \"Week\")\n",
"final"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the resulting data frame has a row for each week!"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Week | \n",
" RushPoints | \n",
" RecPoints | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 12.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 32.0 | \n",
" 101.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 34.0 | \n",
" 310.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 12.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 0.0 | \n",
" 234.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Week RushPoints RecPoints\n",
"0 1 12.0 0.0\n",
"1 2 32.0 101.0\n",
"2 3 34.0 310.0\n",
"3 4 12.0 0.0\n",
"4 5 0.0 234.0"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Replace the NAN with 0\n",
"final.fillna(0, inplace = True)\n",
"final"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
},
"toc": {
"nav_menu": {
"height": "84px",
"width": "252px"
},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"toc_cell": false,
"toc_position": {},
"toc_section_display": "block",
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}