{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Filtering 2\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lecture, we cover the groupby operation, which lets us summarize and filter a dataframe based on groupings of the rows that we specify. There are three types of methods that can be used along with a groupby:\n", "\n", "- aggregate: compute statistics (mean, std dev, max, min, etc.) of each group.\n", "- transformations: perform some group-specific operation on the data.\n", "- filter: filter the data based on some information from each group.\n", "\n", "The examples in this notebook will come from Apple stock data for most of 2017. This data is stored in AAPL.csv. Let's first read in the data and create a column for the month and weekday." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date Open High Low Close Adj Close \\\n", "0 2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760 \n", "1 2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815 \n", "2 2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473 \n", "3 2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915 \n", "4 2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812 \n", "\n", " Volume Month Weekday \n", "0 28781900 1 Tuesday \n", "1 21118100 1 Wednesday \n", "2 22193600 1 Thursday \n", "3 31751900 1 Friday \n", "4 33561900 1 Monday " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Read in the data\n", "df_aapl = pd.read_csv(\"../Data/AAPL.csv\", index_col= 0, parse_dates= [\"Date\"] )\n", "\n", "#Add a month column\n", "df_aapl[\"Month\"] = df_aapl.Date.dt.month\n", "\n", "#Add a weekday column\n", "df_aapl[\"Weekday\"] = df_aapl.Date.dt.weekday_name\n", "\n", "df_aapl.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(169, 9)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#shape of df\n", "df_aapl.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Date datetime64[ns]\n", "Open float64\n", "High float64\n", "Low float64\n", "Close float64\n", "Adj Close float64\n", "Volume int64\n", "Month int64\n", "Weekday object\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#How columns are stored\n", "df_aapl.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say I want to know the average open price for each month. I will groupby month, and for each of these groups I want to know the average open price."
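Before running it on the stock data, the groupby-then-mean pattern can be sanity-checked on a tiny hypothetical frame (the numbers below are made up, not taken from AAPL.csv):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the AAPL data (made-up numbers)
df = pd.DataFrame({
    "Month": [1, 1, 2, 2],
    "Open": [115.0, 117.0, 132.0, 134.0],
})

# Group the rows by Month, then average the Open column within each group.
# The result is a Series indexed by the grouping key.
avg_open = df.groupby("Month")["Open"].mean()
print(avg_open)  # month 1 -> 116.0, month 2 -> 133.0
```

Each group's rows (month 1 and month 2) collapse to a single mean, exactly the shape of result the next cell produces for the real data.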
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Month\n", "1 119.093499\n", "2 133.234738\n", "3 140.362174\n", "4 143.030001\n", "5 151.965908\n", "6 148.215001\n", "7 148.096500\n", "8 158.946958\n", "9 164.800003\n", "Name: Open, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Basic groupby\n", "df_aapl.groupby(by = \"Month\").Open.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you see, I get back a series with the average open price for each month.\n", "\n", "What if I want to know both the average open price and the average volume in each month? To do this, I just select the two columns instead of just the Open column." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Open Volume\n", "Month \n", "1 119.093499 2.815610e+07\n", "2 133.234738 3.026151e+07\n", "3 140.362174 2.441863e+07\n", "4 143.030001 1.964758e+07\n", "5 151.965908 2.971615e+07\n", "6 148.215001 3.109900e+07\n", "7 148.096500 2.109962e+07\n", "8 158.946958 2.874213e+07\n", "9 164.800003 1.659110e+07" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Summarize groups based on two columns\n", "df_aapl.groupby(by = \"Month\")[\"Open\", \"Volume\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, I get back a dataframe with the information for each month given in the two columns.\n", "\n", "What if I want the average open price for each month and each day of the week? I can specify a list of column names for the by argument in the groupby to accomplish this." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Month Weekday \n", "1 Friday 119.619999\n", " Monday 119.626666\n", " Thursday 118.972500\n", " Tuesday 118.722000\n", " Wednesday 118.752499\n", "2 Friday 132.945004\n", " Monday 133.116669\n", " Thursday 133.170000\n", " Tuesday 134.329998\n", " Wednesday 132.582500\n", "3 Friday 140.850000\n", " Monday 139.502499\n", " Thursday 140.982001\n", " Tuesday 140.345001\n", " Wednesday 139.956000\n", "Name: Open, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Groupby multiple columns\n", "df_aapl.groupby(by = [\"Month\", \"Weekday\"]).Open.mean().head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I now have a group for every month + weekday combination. The above is a series with a multi-level index. In the PPT I show how to slice the series in this case. Another way to get around this is to simply reset the index, which will move the multi-level index to columns in your dataframe."
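Since the slides referenced above are not reproduced here, a minimal sketch of slicing such a multi-level series (on hypothetical numbers, not the AAPL values):

```python
import pandas as pd

# Hypothetical two-level series shaped like the Month/Weekday result above
idx = pd.MultiIndex.from_tuples(
    [(1, "Friday"), (1, "Monday"), (2, "Friday"), (2, "Monday")],
    names=["Month", "Weekday"],
)
s = pd.Series([119.6, 119.7, 132.9, 133.1], index=idx)

# Slice on the outer level: all weekdays for month 1 (a one-level sub-series)
month_1 = s.loc[1]

# Give both levels to get a single value
one_value = s.loc[(2, "Monday")]

# Slice on the inner level with xs(): Fridays across all months
fridays = s.xs("Friday", level="Weekday")
```

Resetting the index, as done in the next cell, sidesteps all of this by turning both index levels back into ordinary columns.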
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Month Weekday Open\n", "0 1 Friday 119.619999\n", "1 1 Monday 119.626666\n", "2 1 Thursday 118.972500\n", "3 1 Tuesday 118.722000\n", "4 1 Wednesday 118.752499\n", "5 2 Friday 132.945004\n", "6 2 Monday 133.116669\n", "7 2 Thursday 133.170000\n", "8 2 Tuesday 134.329998\n", "9 2 Wednesday 132.582500\n", "10 3 Friday 140.850000\n", "11 3 Monday 139.502499\n", "12 3 Thursday 140.982001\n", "13 3 Tuesday 140.345001\n", "14 3 Wednesday 139.956000" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_aapl.groupby(by = [\"Month\", \"Weekday\"]).Open.mean().reset_index().head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregate\n", "\n", "I can use the aggregate method to summarize the groups using multiple functions. Let's say I want to know the avg and standard deviation of the open price for each month. After the groupby, I use the agg() method and inside I can specify a list of functions that will be used to summarize each group." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean std\n", "Month \n", "1 119.093499 1.891817\n", "2 133.234738 3.410836\n", "3 140.362174 1.728701\n", "4 143.030001 1.120391\n", "5 151.965908 3.476493\n", "6 148.215001 4.450870\n", "7 148.096500 3.479140\n", "8 158.946958 2.926976\n", "9 164.800003 NaN" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using agg\n", "df_aapl.groupby(by = \"Month\").Open.agg([np.mean, np.std])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now use agg() where we select both Open and Volume:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Open Volume \n", " mean std mean std\n", "Month \n", "1 119.093499 1.891817 2.815610e+07 6.572536e+06\n", "2 133.234738 3.410836 3.026151e+07 2.055055e+07\n", "3 140.362174 1.728701 2.441863e+07 7.416385e+06\n", "4 143.030001 1.120391 1.964758e+07 4.026376e+06\n", "5 151.965908 3.476493 2.971615e+07 1.014266e+07\n", "6 148.215001 4.450870 3.109900e+07 1.416738e+07\n", "7 148.096500 3.479140 2.109962e+07 4.182316e+06\n", "8 158.946958 2.926976 2.874213e+07 1.037540e+07\n", "9 164.800003 NaN 1.659110e+07 NaN" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using agg() with two selected columns\n", "result = df_aapl.groupby(by = \"Month\")[\"Open\", \"Volume\"].agg([np.mean, np.std])\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "This gives back a dataframe with multi-level columns, which can be sliced as follows:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean std\n", "Month \n", "1 119.093499 1.891817\n", "2 133.234738 3.410836\n", "3 140.362174 1.728701\n", "4 143.030001 1.120391\n", "5 151.965908 3.476493\n", "6 148.215001 4.450870\n", "7 148.096500 3.479140\n", "8 158.946958 2.926976\n", "9 164.800003 NaN" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get the columns for the open price\n", "result.loc[:, \"Open\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say that instead of applying both functions to both columns, I wanted the mean to be applied to the Open column and a maximum to be applied to the Volume column. To do this, I specify a dictionary in the agg() method where the key is the column and the value is a function to be applied to the groups of that column. In this case, since I specify the columns I want to focus on in the dictionary, I do not have to select them after the groupby." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Open Volume\n", "Month \n", "1 119.093499 49201000\n", "2 133.234738 111985000\n", "3 140.362174 43885000\n", "4 143.030001 30379400\n", "5 151.965908 50767700\n", "6 148.215001 72307300\n", "7 148.096500 32476300\n", "8 158.946958 69936800\n", "9 164.800003 16591100" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#agg with a dictionary\n", "df_aapl.groupby(by = \"Month\").agg({\"Open\": np.mean, \"Volume\": np.max})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can even use agg() with a custom function. Let's say I wanted the count of the number of days in each month where the open price was at or above the average open price for the month. First I'll write my custom function to be applied to each group." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def Compare_Open(group):\n", "    #Count the days at or above the group's average open price\n", "    count = 0\n", "    avg_price = group.mean()\n", "    for index in list(group.index):\n", "        open_price = group[index]\n", "        if open_price >= avg_price:\n", "            count += 1\n", "    return count" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Open\n", "Month \n", "1 11.0\n", "2 10.0\n", "3 10.0\n", "4 11.0\n", "5 15.0\n", "6 7.0\n", "7 11.0\n", "8 14.0\n", "9 1.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_aapl.groupby(by = \"Month\").agg({\"Open\": Compare_Open})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example above, the function Compare_Open is called once for each group (each month). The input group is a series that represents the open price for the given group." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transform\n", "\n", "Instead of aggregating each group, we can apply a transformation with the transform() method after the groupby. The agg() method takes a column of data and spits out a single number summarizing this column based on the specified groups. The transform() method takes a column and returns back a series of the same length. For example, let's say I use the transform method with the sum function on the column Volume, grouping by Month." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 563122000\n", "1 563122000\n", "2 563122000\n", "3 563122000\n", "4 563122000\n", "Name: Volume, dtype: int64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "volume_by_month = df_aapl.groupby(by = \"Month\").Volume.transform(np.sum)\n", "\n", "volume_by_month.head()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(169,)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Look at how many rows there are\n", "volume_by_month.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this series has 169 rows, which is the same number of days that we have stock information for.
What happened is that for each group we compute the sum, but instead of giving a single number for each group, it takes this summarizing number and matches it with the group that each row corresponds to. So let's say I wanted a column for the fraction of the month's volume that each day represents. I can first use a transform as I have done above to get a column for the total volume in each month, and then I can do a simple division to get the desired column." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date Open High Low Close Adj Close \\\n", "0 2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760 \n", "1 2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815 \n", "2 2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473 \n", "3 2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915 \n", "4 2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812 \n", "\n", " Volume Month Weekday Total_Month_Volumn Total_Month_Volume \n", "0 28781900 1 Tuesday 563122000 563122000 \n", "1 21118100 1 Wednesday 563122000 563122000 \n", "2 22193600 1 Thursday 563122000 563122000 \n", "3 31751900 1 Friday 563122000 563122000 \n", "4 33561900 1 Monday 563122000 563122000 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Use transform to create a column for the total volume in each month\n", "df_aapl[\"Total_Month_Volume\"] = df_aapl.groupby(by = \"Month\").Volume.transform(np.sum)\n", "\n", "df_aapl.head()" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date Open High Low Close Adj Close \\\n", "0 2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760 \n", "1 2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815 \n", "2 2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473 \n", "3 2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915 \n", "4 2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812 \n", "\n", " Volume Month Weekday Total_Month_Volumn Total_Month_Volume \\\n", "0 28781900 1 Tuesday 563122000 563122000 \n", "1 21118100 1 Wednesday 563122000 563122000 \n", "2 22193600 1 Thursday 563122000 563122000 \n", "3 31751900 1 Friday 563122000 563122000 \n", "4 33561900 1 Monday 563122000 563122000 \n", "\n", " Frac_Months_Vol \n", "0 0.051111 \n", "1 0.037502 \n", "2 0.039412 \n", "3 0.056385 \n", "4 0.059600 " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Create new column for the fraction of the month's volume\n", "df_aapl[\"Frac_Months_Vol\"] = df_aapl[\"Volume\"]/df_aapl[\"Total_Month_Volume\"]\n", "\n", "df_aapl.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can also use transform to create standardized columns for each group. So let's say I want to create a standardized open price column where for each open price I subtract off the mean open price for the month and divide by the standard deviation. We can use a lambda function to accomplish this." ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date Open High Low Close Adj Close \\\n", "0 2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760 \n", "1 2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815 \n", "2 2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473 \n", "3 2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915 \n", "4 2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812 \n", "\n", " Volume Month Weekday Total_Month_Volumn Total_Month_Volume \\\n", "0 28781900 1 Tuesday 563122000 563122000 \n", "1 21118100 1 Wednesday 563122000 563122000 \n", "2 22193600 1 Thursday 563122000 563122000 \n", "3 31751900 1 Friday 563122000 563122000 \n", "4 33561900 1 Monday 563122000 563122000 \n", "\n", " Frac_Months_Vol Standard_Open \n", "0 0.051111 -1.740917 \n", "1 0.037502 -1.714490 \n", "2 0.039412 -1.677489 \n", "3 0.056385 -1.222899 \n", "4 0.059600 -0.604447 " ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using the transform method to standardize\n", "df_aapl[\"Standard_Open\"] = df_aapl.groupby(by = \"Month\").Open.transform(lambda x : (x - x.mean())/x.std())\n", "\n", "df_aapl.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lambda function above is called only once for each group, where the input x will be a series representing the column Open for each of the Months that we grouped by. It should be noted that if you subtract a single number from a series, as is the case when this lambda function is called, then pandas knows to subtract this number from all the numbers in the series. The same goes for dividing a series by a single number." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering\n", "\n", "The filter() method after a groupby lets us select only the rows belonging to groups where a certain criterion regarding the group as a whole is satisfied.
The filter() method must take in a function that returns a boolean. The function will be run once for each group. For example, let's say I only want to look at rows for days in months where the average opening price for the month was at least 140." ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([3, 4, 5, 6, 7, 8, 9])" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "above_140 = df_aapl.groupby(by = \"Month\").filter(lambda group: group.Open.mean() >= 140)\n", "\n", "#Note that we lose Jan and Feb since these months had an avg open below 140.\n", "above_140.Month.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lambda function above is called once for each group, where the input group will be a dataframe with the rows corresponding to the given month that is being run through the lambda function." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "12px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }
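The filter() behavior described above can be checked end-to-end on a tiny hypothetical frame (made-up numbers, not the AAPL data):

```python
import pandas as pd

# Hypothetical frame: month 1's opens average below 140, month 3's at or above
df = pd.DataFrame({
    "Month": [1, 1, 3, 3],
    "Open": [115.0, 117.0, 140.0, 142.0],
})

# filter() calls the lambda once per group, passing that group's sub-frame,
# and keeps the rows of every group for which the lambda returns True
above_140 = df.groupby("Month").filter(lambda g: g["Open"].mean() >= 140)
print(above_140["Month"].unique())  # month 1 is dropped, month 3 survives
```

Note that, unlike agg() and transform(), filter() returns rows of the original frame untouched; only whole groups are kept or dropped.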