{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Advanced Pandas\n", "\n", "This notebook covers the following advanced pandas techniques:\n", "\n", "- String methods\n", "- map\n", "- apply \n", "- iterrows\n", "\n", "In accordance with the slides, this tutorial will use the Titanic Data Set.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQ
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNS
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ
38953Wirz, Mr. Albertmale27.0003151548.6625NaNS
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS
58973Svensson, Mr. Johan Cervinmale14.00075389.2250NaNS
68983Connolly, Miss. Katefemale30.0003309727.6292NaNQ
78992Caldwell, Mr. Albert Francismale26.01124873829.0000NaNS
89003Abrahim, Mrs. Joseph (Sophie Halaut Easu)female18.00026577.2292NaNC
99013Davies, Mr. John Samuelmale21.020A/4 4887124.1500NaNS
109023Ilieff, Mr. YliomaleNaN003492207.8958NaNS
119031Jones, Mr. Charles Cressonmale46.00069426.0000NaNS
129041Snyder, Mrs. John Pillsbury (Nelle Stevenson)female23.0102122882.2667B45S
139052Howard, Mr. Benjaminmale63.0102406526.0000NaNS
149061Chaffee, Mrs. Herbert Fuller (Carrie Constance...female47.010W.E.P. 573461.1750E31S
159072del Carlo, Mrs. Sebastiano (Argenia Genovesi)female24.010SC/PARIS 216727.7208NaNC
169082Keane, Mr. Danielmale35.00023373412.3500NaNQ
179093Assaf, Mr. Geriosmale21.00026927.2250NaNC
189103Ilmakangas, Miss. Ida Livijafemale27.010STON/O2. 31012707.9250NaNS
199113Assaf Khalil, Mrs. Mariana (Miriam\")\"female45.00026967.2250NaNC
\n", "
" ], "text/plain": [ " PassengerId Pclass Name \\\n", "0 892 3 Kelly, Mr. James \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) \n", "2 894 2 Myles, Mr. Thomas Francis \n", "3 895 3 Wirz, Mr. Albert \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) \n", "5 897 3 Svensson, Mr. Johan Cervin \n", "6 898 3 Connolly, Miss. Kate \n", "7 899 2 Caldwell, Mr. Albert Francis \n", "8 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) \n", "9 901 3 Davies, Mr. John Samuel \n", "10 902 3 Ilieff, Mr. Ylio \n", "11 903 1 Jones, Mr. Charles Cresson \n", "12 904 1 Snyder, Mrs. John Pillsbury (Nelle Stevenson) \n", "13 905 2 Howard, Mr. Benjamin \n", "14 906 1 Chaffee, Mrs. Herbert Fuller (Carrie Constance... \n", "15 907 2 del Carlo, Mrs. Sebastiano (Argenia Genovesi) \n", "16 908 2 Keane, Mr. Daniel \n", "17 909 3 Assaf, Mr. Gerios \n", "18 910 3 Ilmakangas, Miss. Ida Livija \n", "19 911 3 Assaf Khalil, Mrs. Mariana (Miriam\")\" \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "0 male 34.5 0 0 330911 7.8292 NaN Q \n", "1 female 47.0 1 0 363272 7.0000 NaN S \n", "2 male 62.0 0 0 240276 9.6875 NaN Q \n", "3 male 27.0 0 0 315154 8.6625 NaN S \n", "4 female 22.0 1 1 3101298 12.2875 NaN S \n", "5 male 14.0 0 0 7538 9.2250 NaN S \n", "6 female 30.0 0 0 330972 7.6292 NaN Q \n", "7 male 26.0 1 1 248738 29.0000 NaN S \n", "8 female 18.0 0 0 2657 7.2292 NaN C \n", "9 male 21.0 2 0 A/4 48871 24.1500 NaN S \n", "10 male NaN 0 0 349220 7.8958 NaN S \n", "11 male 46.0 0 0 694 26.0000 NaN S \n", "12 female 23.0 1 0 21228 82.2667 B45 S \n", "13 male 63.0 1 0 24065 26.0000 NaN S \n", "14 female 47.0 1 0 W.E.P. 5734 61.1750 E31 S \n", "15 female 24.0 1 0 SC/PARIS 2167 27.7208 NaN C \n", "16 male 35.0 0 0 233734 12.3500 NaN Q \n", "17 male 21.0 0 0 2692 7.2250 NaN C \n", "18 female 27.0 1 0 STON/O2. 3101270 7.9250 NaN S \n", "19 female 45.0 0 0 2696 7.2250 NaN C " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df_titanic = pd.read_csv(\"Data/Titanic.csv\")\n", "\n", "df_titanic.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String Methods\n", "\n", "We can use the usual string methods to manipulate data inside of a dataframe. To invoke a string method use the .str attribute of a series. Since the only column that is stored as a string is the Name column we will work with that." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 kelly, mr. james\n", "1 wilkes, mrs. james (ellen needs)\n", "2 myles, mr. thomas francis\n", "3 wirz, mr. albert\n", "4 hirvonen, mrs. alexander (helga e lindqvist)\n", "Name: Name, dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using the lower() method\n", "\n", "df_titanic[\"Name\"].str.lower().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above bit of code returns the Name column as a str with all lowercase letters. Another useful string method is the contains() method, which returns a boolean if the given string contain the input string. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "run_control": { "marked": false } }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 True\n", "4 False\n", "Name: Name, dtype: bool" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#using contains() method to check for title Mr.\n", "df_titanic[\"Name\"].str.contains(\"Mr\\.\").head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we need the escape character \"\\\" to look for the \".\". We can easily the results of the returned series in a new column as follows." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedBool_Mr
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQTrue
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNSFalse
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQTrue
38953Wirz, Mr. Albertmale27.0003151548.6625NaNSTrue
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNSFalse
\n", "
" ], "text/plain": [ " PassengerId Pclass Name Sex \\\n", "0 892 3 Kelly, Mr. James male \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n", "2 894 2 Myles, Mr. Thomas Francis male \n", "3 895 3 Wirz, Mr. Albert male \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked Bool_Mr \n", "0 34.5 0 0 330911 7.8292 NaN Q True \n", "1 47.0 1 0 363272 7.0000 NaN S False \n", "2 62.0 0 0 240276 9.6875 NaN Q True \n", "3 27.0 0 0 315154 8.6625 NaN S True \n", "4 22.0 1 1 3101298 12.2875 NaN S False " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Creating new column\n", "df_titanic[\"Bool_Mr\"] = df_titanic[\"Name\"].str.contains(\"Mr\\.\")\n", "\n", "df_titanic.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a new boolean column and we can easily figure out, for example, the total number of \"Mr.\"s as follows" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.57416267942583732" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Computing the proportion of Mr.s\n", "df_titanic.Bool_Mr.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use these string methods on the column names. Let's say I want to replace all the underscores (there is only one) with blank spaces. I can do that with the replace method. Recall that I access the column names through the columns attribute of any dataframe." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',\n", " 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Bool_Mr'],\n", " dtype='object')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_titanic.columns" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedBoolMr
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQTrue
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNSFalse
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQTrue
38953Wirz, Mr. Albertmale27.0003151548.6625NaNSTrue
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNSFalse
\n", "
" ], "text/plain": [ " PassengerId Pclass Name Sex \\\n", "0 892 3 Kelly, Mr. James male \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n", "2 894 2 Myles, Mr. Thomas Francis male \n", "3 895 3 Wirz, Mr. Albert male \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked BoolMr \n", "0 34.5 0 0 330911 7.8292 NaN Q True \n", "1 47.0 1 0 363272 7.0000 NaN S False \n", "2 62.0 0 0 240276 9.6875 NaN Q True \n", "3 27.0 0 0 315154 8.6625 NaN S True \n", "4 22.0 1 1 3101298 12.2875 NaN S False " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Replace the underscore\n", "df_titanic.columns = df_titanic.columns.str.replace(\"_\", \"\")\n", "\n", "df_titanic.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the name of the column we created above has been changed. Before we move to the next section, I will delete this column." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQ
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNS
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ
38953Wirz, Mr. Albertmale27.0003151548.6625NaNS
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS
\n", "
" ], "text/plain": [ " PassengerId Pclass Name Sex \\\n", "0 892 3 Kelly, Mr. James male \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n", "2 894 2 Myles, Mr. Thomas Francis male \n", "3 895 3 Wirz, Mr. Albert male \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked \n", "0 34.5 0 0 330911 7.8292 NaN Q \n", "1 47.0 1 0 363272 7.0000 NaN S \n", "2 62.0 0 0 240276 9.6875 NaN Q \n", "3 27.0 0 0 315154 8.6625 NaN S \n", "4 22.0 1 1 3101298 12.2875 NaN S " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "del df_titanic[\"BoolMr\"]\n", "\n", "df_titanic.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Map\n", "\n", "The map method lets us map values in a column to other values. Let's use map to create a binary column that is 1 for females and 0 for males. We give the mapping that we want by providing the appropriate dictionary" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 0\n", "2 1\n", "3 1\n", "4 0\n", "Name: Sex, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using map\n", "df_titanic.Sex.map({\"male\":1, \"female\":0}).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a series, which we can again store as a column" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedBinary_Male
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQ1
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNS0
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ1
38953Wirz, Mr. Albertmale27.0003151548.6625NaNS1
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS0
\n", "
" ], "text/plain": [ " PassengerId Pclass Name Sex \\\n", "0 892 3 Kelly, Mr. James male \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n", "2 894 2 Myles, Mr. Thomas Francis male \n", "3 895 3 Wirz, Mr. Albert male \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked Binary_Male \n", "0 34.5 0 0 330911 7.8292 NaN Q 1 \n", "1 47.0 1 0 363272 7.0000 NaN S 0 \n", "2 62.0 0 0 240276 9.6875 NaN Q 1 \n", "3 27.0 0 0 315154 8.6625 NaN S 1 \n", "4 22.0 1 1 3101298 12.2875 NaN S 0 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Use map to create new column\n", "df_titanic[\"Binary_Male\"] = df_titanic.Sex.map({\"male\":1, \"female\":0})\n", "\n", "df_titanic.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compute the fraction of male passengers." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.6363636363636364" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Compute fraction of male passengers\n", "df_titanic.Binary_Male.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply\n", "\n", "The apply method applies built-in or custom functions to each row or column of a dataframe. Let's use the apply method to compute the average age and fair. We will need the mean function from numpy to do so." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Age 30.272590\n", "Fare 35.627188\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using apply - axis =0\n", "import numpy as np\n", "\n", "df_titanic[[\"Age\", \"Fare\"]].apply(np.mean,axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When using apply, you first specify the function you want applied and then you specify whether you want the function applied to each row (axis = 1) or each column (axis = 0). The result is a series where the index are the column names. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 21.16460\n", "1 27.00000\n", "2 35.84375\n", "3 17.83125\n", "4 17.14375\n", "dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using apply - axis = 1\n", "df_titanic[[\"Age\", \"Fare\"]].apply(np.mean,axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's say we also want to look at rounded versions of the columns corresponding to the rounded age and fare. We can use numpy's round function and the apply method to accomplish this. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeFare
034.08.0
147.07.0
262.010.0
327.09.0
422.012.0
\n", "
" ], "text/plain": [ " Age Fare\n", "0 34.0 8.0\n", "1 47.0 7.0\n", "2 62.0 10.0\n", "3 27.0 9.0\n", "4 22.0 12.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using apply with the round function\n", "rounded_cols = df_titanic[[\"Age\", \"Fare\"]].apply(np.round).head()\n", "rounded_cols.head() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main benefit of apply is using it with custom functions. Each passenger has a title (Mr., Mrs., etc ...). The title comes after the comma in the name. Let's write a custom function to get this title from each name a and store it in a column." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def Get_Title(name):\n", " \n", " parsed_name = name.split(\" \")\n", " for i in range(len(parsed_name)):\n", " if \",\" in parsed_name[i]:\n", " return parsed_name[i+1]\n", " \n", " \n", " return title" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Mr.\n", "1 Mrs.\n", "2 Mr.\n", "3 Mr.\n", "4 Mrs.\n", "Name: Name, dtype: object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Use apply with this custome function\n", "df_titanic.Name.apply(Get_Title).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like with the built in functions, we supply the name of the function we want applied to each value in the column. The result is a series that we can store as a new column." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedBinary_MaleTitle
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQ1Mr.
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNS0Mrs.
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ1Mr.
38953Wirz, Mr. Albertmale27.0003151548.6625NaNS1Mr.
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS0Mrs.
\n", "
" ], "text/plain": [ " PassengerId Pclass Name Sex \\\n", "0 892 3 Kelly, Mr. James male \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n", "2 894 2 Myles, Mr. Thomas Francis male \n", "3 895 3 Wirz, Mr. Albert male \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked Binary_Male Title \n", "0 34.5 0 0 330911 7.8292 NaN Q 1 Mr. \n", "1 47.0 1 0 363272 7.0000 NaN S 0 Mrs. \n", "2 62.0 0 0 240276 9.6875 NaN Q 1 Mr. \n", "3 27.0 0 0 315154 8.6625 NaN S 1 Mr. \n", "4 22.0 1 1 3101298 12.2875 NaN S 0 Mrs. " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Add in the new Title column\n", "df_titanic[\"Title\"] = df_titanic.Name.apply(Get_Title)\n", "\n", "df_titanic.head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Mr. 240\n", "Miss. 78\n", "Mrs. 72\n", "Master. 21\n", "Rev. 2\n", "Col. 2\n", "Dona. 1\n", "Ms. 1\n", "Dr. 1\n", "Name: Title, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Let's see the breakdown of each title\n", "df_titanic.Title.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above example demonstrates how we can use apply on single column or series. When we use apply on multiple columns we pass the function a row. Let's create a column called \"Old_Man\" that is 1 if the passenger is male and above the age of 60." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def Get_Old_Man(row):\n", " \n", " gender = row[\"Sex\"]\n", " age = row[\"Age\"]\n", " \n", " if gender == \"male\" and age>=60:\n", " return 1\n", " else:\n", " return 0" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedBinary_MaleTitleOld_Man
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQ1Mr.0
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNS0Mrs.0
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ1Mr.1
38953Wirz, Mr. Albertmale27.0003151548.6625NaNS1Mr.0
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS0Mrs.0
\n", "
" ], "text/plain": [ " PassengerId Pclass Name Sex \\\n", "0 892 3 Kelly, Mr. James male \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n", "2 894 2 Myles, Mr. Thomas Francis male \n", "3 895 3 Wirz, Mr. Albert male \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked Binary_Male Title \\\n", "0 34.5 0 0 330911 7.8292 NaN Q 1 Mr. \n", "1 47.0 1 0 363272 7.0000 NaN S 0 Mrs. \n", "2 62.0 0 0 240276 9.6875 NaN Q 1 Mr. \n", "3 27.0 0 0 315154 8.6625 NaN S 1 Mr. \n", "4 22.0 1 1 3101298 12.2875 NaN S 0 Mrs. \n", "\n", " Old_Man \n", "0 0 \n", "1 0 \n", "2 1 \n", "3 0 \n", "4 0 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Make the Old Man column\n", "df_titanic[\"Old_Man\"] = df_titanic.apply(Get_Old_Man, axis=1)\n", "df_titanic.head()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "13" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Lets compute the number of old men.\n", "df_titanic.Old_Man.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that since we are again applying the function to multiple columns, I have to specify axis=1 because I want the function applied to each row.\n", "\n", "Lastly, let's see how you can use apply with a custom function that take an input that is not just the row. Considering the problem above, let's allow for an input old_man_age that specifies the lower bound on the age of an old man. Let's recreate the column with this lower bound set to 50." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedBinary_MaleTitleOld_Man
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQ1Mr.0
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNS0Mrs.0
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ1Mr.1
38953Wirz, Mr. Albertmale27.0003151548.6625NaNS1Mr.0
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS0Mrs.0
\n", "
" ], "text/plain": [ " PassengerId Pclass Name Sex \\\n", "0 892 3 Kelly, Mr. James male \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n", "2 894 2 Myles, Mr. Thomas Francis male \n", "3 895 3 Wirz, Mr. Albert male \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked Binary_Male Title \\\n", "0 34.5 0 0 330911 7.8292 NaN Q 1 Mr. \n", "1 47.0 1 0 363272 7.0000 NaN S 0 Mrs. \n", "2 62.0 0 0 240276 9.6875 NaN Q 1 Mr. \n", "3 27.0 0 0 315154 8.6625 NaN S 1 Mr. \n", "4 22.0 1 1 3101298 12.2875 NaN S 0 Mrs. \n", "\n", " Old_Man \n", "0 0 \n", "1 0 \n", "2 1 \n", "3 0 \n", "4 0 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Function has new input\n", "def Get_Old_Man(row, old_man_age):\n", " \n", " gender = row[\"Sex\"]\n", " age = row[\"Age\"]\n", " \n", " if gender == \"male\" and age>=old_man_age:\n", " return 1\n", " else:\n", " return 0\n", " \n", "#Make the Old Man column\n", "#need to specify old_man_age explicitly\n", "df_titanic[\"Old_Man\"] = df_titanic.apply(Get_Old_Man, old_man_age=50, axis=1)\n", "df_titanic.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's recompute the number of old men, which should go up." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Lets compute the number of old m\n", "df_titanic.Old_Man.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iterrows\n", "\n", "Iterrows allows us to iterate through the row of a dataframe using a for loop. The iterrows method returns two things when I place it as the object in a for loop and hence I have to define two loop variables. The first item it returns is the index of the given data frame and the second item is the row, which is return as a series.\n", "Let's say I want to create a dictionary where the key is each unique title and the values are lists of ages of each person with that title." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false, "run_control": { "marked": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[9.0, 10.0, 13.0, 6.0, 2.0, 13.0, 7.0, 11.5, 8.0, 6.0, 0.33, nan, 0.75, 0.83, nan, nan, 14.5, 5.0, 6.0, 13.0, nan]\n" ] } ], "source": [ "#Initialize dictionary\n", "D= {}\n", "\n", "#Use iterrows to iterate over rows of the datafra,e\n", "for index, row in df_titanic.iterrows():\n", " \n", " title= row[\"Title\"].strip(\" .\")\n", " age = row[\"Age\"]\n", " \n", " if title in D.keys():\n", " D[title].append(age)\n", " else:\n", " D[title] = [age]\n", " \n", "print(D[\"Master\"])\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "102px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": false, "threshold": 4, "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }