{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Filtering 2\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lecture, we cover the groupby operation, which lets us summarize and filter a dataframe based on groupings of the rows that we specify. There are three types of methods that can be used along with a groupby:\n", "\n", "- aggregate: compute statistics (mean, std dev, max, min, etc.) of each group.\n", "- transformations: perform some group-specific operation on the data.\n", "- filter: filter the data based on some information from each group.\n", "\n", "The examples in this notebook will come from Apple stock data for most of 2017. This data is stored in AAPL.csv. Let's first read in the data and create a column for the month and weekday." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date Open High Low Close Adj Close \\\n", "0 2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760 \n", "1 2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815 \n", "2 2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473 \n", "3 2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915 \n", "4 2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812 \n", "\n", " Volume Month Weekday \n", "0 28781900 1 Tuesday \n", "1 21118100 1 Wednesday \n", "2 22193600 1 Thursday \n", "3 31751900 1 Friday \n", "4 33561900 1 Monday " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Read in the data\n", "df_aapl = pd.read_csv(\"../Data/AAPL.csv\", index_col= 0, parse_dates= [\"Date\"] )\n", "\n", "#Add a month column\n", "df_aapl[\"Month\"] = df_aapl.Date.dt.month\n", "\n", "#Add a weekday column\n", "df_aapl[\"Weekday\"] = df_aapl.Date.dt.weekday_name\n", "\n", "df_aapl.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(169, 9)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#shape of df\n", "df_aapl.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Date datetime64[ns]\n", "Open float64\n", "High float64\n", "Low float64\n", "Close float64\n", "Adj Close float64\n", "Volume int64\n", "Month int64\n", "Weekday object\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#How columns are stored\n", "df_aapl.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say I want to know the average open price for each month. I will groupby month, and for each of these groups I want to know the average open price."
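Before running it on the stock data, the groupby-then-mean pattern can be sanity-checked on a tiny hypothetical frame (the numbers below are made up, not taken from AAPL.csv):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the AAPL data (made-up numbers)
df = pd.DataFrame({
    "Month": [1, 1, 2, 2],
    "Open": [115.0, 117.0, 132.0, 134.0],
})

# Group the rows by Month, then average the Open column within each group.
# The result is a Series indexed by the grouping key.
avg_open = df.groupby("Month")["Open"].mean()
print(avg_open)  # month 1 -> 116.0, month 2 -> 133.0
```

Each group's rows (month 1 and month 2) collapse to a single mean, exactly the shape of result the next cell produces for the real data.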
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Month\n", "1 119.093499\n", "2 133.234738\n", "3 140.362174\n", "4 143.030001\n", "5 151.965908\n", "6 148.215001\n", "7 148.096500\n", "8 158.946958\n", "9 164.800003\n", "Name: Open, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Basic groupby\n", "df_aapl.groupby(by = \"Month\").Open.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you see, I get back a series with the average open price for each month.\n", "\n", "What if I want to know both the average open price and the average volume in each month? To do this, I just select the two columns instead of just the Open column." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Open Volume\n", "Month \n", "1 119.093499 2.815610e+07\n", "2 133.234738 3.026151e+07\n", "3 140.362174 2.441863e+07\n", "4 143.030001 1.964758e+07\n", "5 151.965908 2.971615e+07\n", "6 148.215001 3.109900e+07\n", "7 148.096500 2.109962e+07\n", "8 158.946958 2.874213e+07\n", "9 164.800003 1.659110e+07" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Summarize groups based on two columns\n", "df_aapl.groupby(by = \"Month\")[\"Open\", \"Volume\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, I get back a dataframe with the information for each month given in the two columns.\n", "\n", "What if I want the average open price for each month and each day of the week? I can specify a list of column names for the by argument in the groupby to accomplish this." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Month Weekday \n", "1 Friday 119.619999\n", " Monday 119.626666\n", " Thursday 118.972500\n", " Tuesday 118.722000\n", " Wednesday 118.752499\n", "2 Friday 132.945004\n", " Monday 133.116669\n", " Thursday 133.170000\n", " Tuesday 134.329998\n", " Wednesday 132.582500\n", "3 Friday 140.850000\n", " Monday 139.502499\n", " Thursday 140.982001\n", " Tuesday 140.345001\n", " Wednesday 139.956000\n", "Name: Open, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Groupby multiple columns\n", "df_aapl.groupby(by = [\"Month\", \"Weekday\"]).Open.mean().head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I now have a group for every month + weekday combination. The above is a series with a multi-level index. In the PPT I show how to slice the series in this case. Another way to get around this is to simply reset the index, which will move the multi-level index to columns in your dataframe."
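Since the slides referenced above are not reproduced here, a minimal sketch of slicing such a multi-level series (on hypothetical numbers, not the AAPL values):

```python
import pandas as pd

# Hypothetical two-level series shaped like the Month/Weekday result above
idx = pd.MultiIndex.from_tuples(
    [(1, "Friday"), (1, "Monday"), (2, "Friday"), (2, "Monday")],
    names=["Month", "Weekday"],
)
s = pd.Series([119.6, 119.7, 132.9, 133.1], index=idx)

# Slice on the outer level: all weekdays for month 1 (a one-level sub-series)
month_1 = s.loc[1]

# Give both levels to get a single value
one_value = s.loc[(2, "Monday")]

# Slice on the inner level with xs(): Fridays across all months
fridays = s.xs("Friday", level="Weekday")
```

Resetting the index, as done in the next cell, sidesteps all of this by turning both index levels back into ordinary columns.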
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Month Weekday Open\n", "0 1 Friday 119.619999\n", "1 1 Monday 119.626666\n", "2 1 Thursday 118.972500\n", "3 1 Tuesday 118.722000\n", "4 1 Wednesday 118.752499\n", "5 2 Friday 132.945004\n", "6 2 Monday 133.116669\n", "7 2 Thursday 133.170000\n", "8 2 Tuesday 134.329998\n", "9 2 Wednesday 132.582500\n", "10 3 Friday 140.850000\n", "11 3 Monday 139.502499\n", "12 3 Thursday 140.982001\n", "13 3 Tuesday 140.345001\n", "14 3 Wednesday 139.956000" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_aapl.groupby(by = [\"Month\", \"Weekday\"]).Open.mean().reset_index().head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregate\n", "\n", "I can use the aggregate method to summarize the groups using multiple functions. Let's say I want to know the avg and standard deviation of the open price for each month. After the groupby, I use the agg() method and inside I can specify a list of functions that will be used to summarize each group." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean std\n", "Month \n", "1 119.093499 1.891817\n", "2 133.234738 3.410836\n", "3 140.362174 1.728701\n", "4 143.030001 1.120391\n", "5 151.965908 3.476493\n", "6 148.215001 4.450870\n", "7 148.096500 3.479140\n", "8 158.946958 2.926976\n", "9 164.800003 NaN" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using agg\n", "df_aapl.groupby(by = \"Month\").Open.agg([np.mean, np.std])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now use agg() where we select both Open and Volume:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Open Volume \n", " mean std mean std\n", "Month \n", "1 119.093499 1.891817 2.815610e+07 6.572536e+06\n", "2 133.234738 3.410836 3.026151e+07 2.055055e+07\n", "3 140.362174 1.728701 2.441863e+07 7.416385e+06\n", "4 143.030001 1.120391 1.964758e+07 4.026376e+06\n", "5 151.965908 3.476493 2.971615e+07 1.014266e+07\n", "6 148.215001 4.450870 3.109900e+07 1.416738e+07\n", "7 148.096500 3.479140 2.109962e+07 4.182316e+06\n", "8 158.946958 2.926976 2.874213e+07 1.037540e+07\n", "9 164.800003 NaN 1.659110e+07 NaN" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using agg() with two selected columns\n", "result = df_aapl.groupby(by = \"Month\")[\"Open\", \"Volume\"].agg([np.mean, np.std])\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "This gives back a dataframe with multi-level columns, which can be sliced as follows:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean std\n", "Month \n", "1 119.093499 1.891817\n", "2 133.234738 3.410836\n", "3 140.362174 1.728701\n", "4 143.030001 1.120391\n", "5 151.965908 3.476493\n", "6 148.215001 4.450870\n", "7 148.096500 3.479140\n", "8 158.946958 2.926976\n", "9 164.800003 NaN" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get the columns for the open price\n", "result.loc[:, \"Open\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say that instead of applying both functions to both columns, I wanted the mean to be applied to the Open column and a maximum to be applied to the Volume column. To do this, I specify a dictionary in the agg() method where the key is the column and the value is a function to be applied to the groups of that column. In this case, since I specify the columns I want to focus on in the dictionary, I do not have to select them after the groupby." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Open Volume\n", "Month \n", "1 119.093499 49201000\n", "2 133.234738 111985000\n", "3 140.362174 43885000\n", "4 143.030001 30379400\n", "5 151.965908 50767700\n", "6 148.215001 72307300\n", "7 148.096500 32476300\n", "8 158.946958 69936800\n", "9 164.800003 16591100" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#agg with a dictionary\n", "df_aapl.groupby(by = \"Month\").agg({\"Open\": np.mean, \"Volume\": np.max})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can even use agg() with a custom function. Let's say I wanted the count of the number of days in each month where the open price was at or above the average open price for the month. First I'll write my custom function to be applied to each group." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def Compare_Open(group):\n", "    #Count the days at or above the group's average open price\n", "    count = 0\n", "    avg_price = group.mean()\n", "    for index in list(group.index):\n", "        open_price = group[index]\n", "        if open_price >= avg_price:\n", "            count += 1\n", "    return count" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Open\n", "Month \n", "1 11.0\n", "2 10.0\n", "3 10.0\n", "4 11.0\n", "5 15.0\n", "6 7.0\n", "7 11.0\n", "8 14.0\n", "9 1.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_aapl.groupby(by = \"Month\").agg({\"Open\": Compare_Open})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example above, the function Compare_Open is called once for each group (each month). The input group is a series that represents the open price for the given group." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transform\n", "\n", "Instead of aggregating each group, we can apply a transformation with the transform() method after the groupby. The agg() method takes a column of data and spits out a single number summarizing this column based on the specified groups. The transform() method takes a column and returns back a series of the same length. For example, let's say I use the transform method with the sum function on the column Volume, grouping by Month." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 563122000\n", "1 563122000\n", "2 563122000\n", "3 563122000\n", "4 563122000\n", "Name: Volume, dtype: int64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "volume_by_month = df_aapl.groupby(by = \"Month\").Volume.transform(np.sum)\n", "\n", "volume_by_month.head()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(169,)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Look at how many rows there are\n", "volume_by_month.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this series has 169 rows, which is the same number of days that we have stock information for.
What happened is that for each group we compute the sum, but instead of giving a single number for each group, it takes this summarizing number and matches it with the group that each row corresponds to. So let's say I wanted a column for the fraction of the month's volume that each day represents. I can first use a transform as I have done above to get a column for the total volume in each month, and then I can do a simple division to get the desired column." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date Open High Low Close Adj Close \\\n", "0 2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760 \n", "1 2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815 \n", "2 2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473 \n", "3 2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915 \n", "4 2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812 \n", "\n", " Volume Month Weekday Total_Month_Volumn Total_Month_Volume \n", "0 28781900 1 Tuesday 563122000 563122000 \n", "1 21118100 1 Wednesday 563122000 563122000 \n", "2 22193600 1 Thursday 563122000 563122000 \n", "3 31751900 1 Friday 563122000 563122000 \n", "4 33561900 1 Monday 563122000 563122000 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Use transform to create a column for the total volume in each month\n", "df_aapl[\"Total_Month_Volume\"] = df_aapl.groupby(by = \"Month\").Volume.transform(np.sum)\n", "\n", "df_aapl.head()" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date Open High Low Close Adj Close \\\n", "0 2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760 \n", "1 2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815 \n", "2 2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473 \n", "3 2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915 \n", "4 2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812 \n", "\n", " Volume Month Weekday Total_Month_Volumn Total_Month_Volume \\\n", "0 28781900 1 Tuesday 563122000 563122000 \n", "1 21118100 1 Wednesday 563122000 563122000 \n", "2 22193600 1 Thursday 563122000 563122000 \n", "3 31751900 1 Friday 563122000 563122000 \n", "4 33561900 1 Monday 563122000 563122000 \n", "\n", " Frac_Months_Vol \n", "0 0.051111 \n", "1 0.037502 \n", "2 0.039412 \n", "3 0.056385 \n", "4 0.059600 " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Create new column for the fraction of the month's volume\n", "df_aapl[\"Frac_Months_Vol\"] = df_aapl[\"Volume\"]/df_aapl[\"Total_Month_Volume\"]\n", "\n", "df_aapl.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can also use transform to create standardized columns for each group. So let's say I want to create a standardized open price column where for each open price I subtract off the mean open price for the month and divide by the standard deviation. We can use a lambda function to accomplish this." ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date Open High Low Close Adj Close \\\n", "0 2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760 \n", "1 2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815 \n", "2 2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473 \n", "3 2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915 \n", "4 2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812 \n", "\n", " Volume Month Weekday Total_Month_Volumn Total_Month_Volume \\\n", "0 28781900 1 Tuesday 563122000 563122000 \n", "1 21118100 1 Wednesday 563122000 563122000 \n", "2 22193600 1 Thursday 563122000 563122000 \n", "3 31751900 1 Friday 563122000 563122000 \n", "4 33561900 1 Monday 563122000 563122000 \n", "\n", " Frac_Months_Vol Standard_Open \n", "0 0.051111 -1.740917 \n", "1 0.037502 -1.714490 \n", "2 0.039412 -1.677489 \n", "3 0.056385 -1.222899 \n", "4 0.059600 -0.604447 " ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using the transform method to standardize\n", "df_aapl[\"Standard_Open\"] = df_aapl.groupby(by = \"Month\").Open.transform(lambda x : (x - x.mean())/x.std())\n", "\n", "df_aapl.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lambda function above is called only once for each group, where the input x will be a series representing the column Open for each of the Months that we grouped by. It should be noted that if you subtract a single number from a series, as is the case when this lambda function is called, then pandas knows to subtract this number from all the numbers in the series. The same goes for dividing a series by a single number." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering\n", "\n", "The filter() method after a groupby lets us select only the rows belonging to groups where a certain criterion regarding the group as a whole is satisfied.
The filter() method must take in a function that returns a boolean. The function will be run once for each group. For example, let's say I only want to look at rows for days in months where the average opening price for the month was at least 140." ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([3, 4, 5, 6, 7, 8, 9])" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "above_140 = df_aapl.groupby(by = \"Month\").filter(lambda group: group.Open.mean() >= 140)\n", "\n", "#Note that we lose Jan and Feb since these months had an avg open below 140.\n", "above_140.Month.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lambda function above is called once for each group, where the input group will be a dataframe with the rows corresponding to the given month that is being run through the lambda function." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "12px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }
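The filter() behavior described above can be checked end-to-end on a tiny hypothetical frame (made-up numbers, not the AAPL data):

```python
import pandas as pd

# Hypothetical frame: month 1's opens average below 140, month 3's at or above
df = pd.DataFrame({
    "Month": [1, 1, 3, 3],
    "Open": [115.0, 117.0, 140.0, 142.0],
})

# filter() calls the lambda once per group, passing that group's sub-frame,
# and keeps the rows of every group for which the lambda returns True
above_140 = df.groupby("Month").filter(lambda g: g["Open"].mean() >= 140)
print(above_140["Month"].unique())  # month 1 is dropped, month 3 survives
```

Note that, unlike agg() and transform(), filter() returns rows of the original frame untouched; only whole groups are kept or dropped.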