{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas - Part 1\n", "\n", "Pandas will be the primary library that we use for the data analytics section of the course. It contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. The core data structures that we will use are pandas series and dataframes. In this part we cover:\n", "\n", "- Reading/Writing from/to a csv file.\n", "- Getting basic properties of the dataframe.\n", "- Indexing and Slicing dataframes\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading csv + Basic Attributes" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "#How we read in a pandas dataframe. The header=0 means column names are in the first row\n", "df=pd.read_csv(\"Data/Grades.csv\", header=0)\n", "\n", "#The head method returns the first five rows\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['Name', 'Previous_Part', 'Participation1', 'Mini_Exam1', 'Mini_Exam2',\n", " 'Participation2', 'Mini_Exam3', 'Final', 'Grade'],\n", " dtype='object')" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#There are column names\n", "df.columns" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#And there are row names\n", "list(df.index)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "9" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get the dimensions of the data frame with shape\n", "dimensions = df.shape\n", "numRows = dimensions[0]\n", "numCols = dimensions[1]\n", "numCols" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Name 
object\n", "Previous_Part float64\n", "Participation1 int64\n", "Mini_Exam1 float64\n", "Mini_Exam2 float64\n", "Participation2 int64\n", "Mini_Exam3 float64\n", "Final float64\n", "Grade object\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get the data type of each column\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Jake\n", "1 Joe\n", "2 Susan\n", "3 Sol\n", "4 Chris\n", "Name: Name, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We can pick out a column by referencing its name. The result is a series or one dimensional data frame\n", "df['Name'].head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 A\n", "1 A\n", "2 A-\n", "3 A\n", "4 A\n", "Name: Grade, dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#You can similarly pick out columns as attributes with the '.'\n", "df.Grade.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you pick out a single column as we have done above the result is a series, which is essentially a one-dimensional dataframe" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Jake\n", "1 Joe\n", "2 Susan\n", "3 Sol\n", "4 Chris\n", "Name: Name, dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "name_column = df[\"Name\"]\n", "\n", "name_column.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Jake\n", "1 Joe\n", "2 Susan\n", "Name: Name, dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "#We index and slice a series through the index\n", "name_column[0:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that when we slice a series, the second entry is non-inclusive." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameGrade
0JakeA
1JoeA
2SusanA-
3SolA
4ChrisA
\n", "
" ], "text/plain": [ " Name Grade\n", "0 Jake A\n", "1 Joe A\n", "2 Susan A-\n", "3 Sol A\n", "4 Chris A" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#You can pick out multiple columns by specifying a list of column names\n", "name_grade = df[['Name', 'Grade']].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Here is how we write a dataframe to a csv file (index=False leaves out the row index)\n", "name_grade.to_csv(\"Name_Grade.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we pick out multiple columns, as we have done above, the result is a dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Slicing and Indexing\n", "\n", "We will be using the .loc (just labels) approach. You can also slice with .iloc (just indices) or .ix (indices and labels; .ix is deprecated in newer versions of pandas)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Let's look at the data \n", "df.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Sol'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Pick out a single entry\n", "df.loc[3,\"Name\"]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Mini_Exam3FinalGrade
114.032.0A
210.533.0A-
313.034.0A
\n", "
" ], "text/plain": [ " Mini_Exam3 Final Grade\n", "1 14.0 32.0 A\n", "2 10.5 33.0 A-\n", "3 13.0 34.0 A" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Select contiguous rows and columns \n", "df.loc[1:3, \"Mini_Exam3\":\"Grade\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that when we slice with .loc, the endpoints are inclusive." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Previous_PartGrade
032.0A
230.0A-
430.0A
\n", "
" ], "text/plain": [ " Previous_Part Grade\n", "0 32.0 A\n", "2 30.0 A-\n", "4 30.0 A" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Select non-contiguous rows and columns\n", "df.loc[[0,2,4], [\"Previous_Part\",\"Grade\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas - Part 2\n", "\n", "In this part, we will cover:\n", "\n", "- Useful built-in column methods.\n", "- Creating new columns and deleting existing ones.\n", "- **Bonus**: Dealing with date-time columns." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Read in the data frame\n", "df=pd.read_csv(\"Data/Grades.csv\", header=0)\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "31.81578947368421" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Compute mean of Final column\n", "avg_final = df[\"Final\"].mean()\n", "avg_final" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can convert the type of a column using the **astype()** method." ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 1.0\n", "2 1.0\n", "3 1.0\n", "4 1.0\n", "Name: Participation1, dtype: float64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Returns the Participation 1 column as a float\n", "df[\"Participation1\"].astype(\"float64\").head()" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.01.019.520.0110.033.0A
1Joe32.01.020.016.0114.032.0A
2Susan30.01.019.019.0110.533.0A-
3Sol31.01.022.013.0113.034.0A
4Chris30.01.019.017.0112.533.5A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1.0 19.5 20.0 \n", "1 Joe 32.0 1.0 20.0 16.0 \n", "2 Susan 30.0 1.0 19.0 19.0 \n", "3 Sol 31.0 1.0 22.0 13.0 \n", "4 Chris 30.0 1.0 19.0 17.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A " ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Change Participation 1 column\n", "df[\"Participation1\"] = df[\"Participation1\"].astype(\"float64\")\n", "#Note that df has changed and the two Participation columns now have different types\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Name object\n", "Previous_Part float64\n", "Participation1 float64\n", "Mini_Exam1 float64\n", "Mini_Exam2 float64\n", "Participation2 int64\n", "Mini_Exam3 float64\n", "Final float64\n", "Grade object\n", "dtype: object" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the two participation columns have different types because of our conversion.\n", "\n", "The **unique()** method returns an array (think of it as a list) of the unique values in the column." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['A', 'A-', 'B', 'B+', 'A+', 'B-', 'C+'], dtype=object)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Let's look at how many unique grades there were\n", "list_grades = df[\"Grade\"].unique()\n", "\n", "list_grades" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['A-', 'B'], dtype=object)" ] },
"execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We can slice list_grades just like a list\n", "list_grades[1:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **value_counts()** method returns the counts of each unique value in the column as a series." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "A 8\n", "B 3\n", "B+ 2\n", "A- 2\n", "C+ 2\n", "A+ 1\n", "B- 1\n", "Name: Grade, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grade_breakdown = df[\"Grade\"].value_counts()\n", "grade_breakdown" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Recall that we index a series through its index\n", "grade_breakdown[\"A\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply any of these built-in functions to multiple columns." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Final 31.815789\n", "Mini_Exam3 10.526316\n", "dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Apply a function to multiple columns\n", "df[[\"Final\", \"Mini_Exam3\"]].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the end result is a series, where the column names become the index of the series. The **describe()** method gives you key stats (as a dataframe) for every numeric column." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Previous_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3Final
count19.00000019.00000019.00000019.00000019.019.00000019.000000
mean29.5000000.94736818.02631617.0526321.010.52631631.815789
std2.8480010.2294163.3601992.7532540.02.7961964.913729
min22.0000000.0000007.00000012.0000001.05.50000021.000000
25%28.5000001.00000017.50000015.5000001.09.00000031.000000
50%30.0000001.00000019.00000018.0000001.010.00000032.000000
75%31.5000001.00000020.00000019.0000001.012.75000034.000000
max33.0000001.00000022.00000021.0000001.017.00000042.000000
\n", "
" ], "text/plain": [ " Previous_Part Participation1 Mini_Exam1 Mini_Exam2 Participation2 \\\n", "count 19.000000 19.000000 19.000000 19.000000 19.0 \n", "mean 29.500000 0.947368 18.026316 17.052632 1.0 \n", "std 2.848001 0.229416 3.360199 2.753254 0.0 \n", "min 22.000000 0.000000 7.000000 12.000000 1.0 \n", "25% 28.500000 1.000000 17.500000 15.500000 1.0 \n", "50% 30.000000 1.000000 19.000000 18.000000 1.0 \n", "75% 31.500000 1.000000 20.000000 19.000000 1.0 \n", "max 33.000000 1.000000 22.000000 21.000000 1.0 \n", "\n", " Mini_Exam3 Final \n", "count 19.000000 19.000000 \n", "mean 10.526316 31.815789 \n", "std 2.796196 4.913729 \n", "min 5.500000 21.000000 \n", "25% 9.000000 31.000000 \n", "50% 10.000000 32.000000 \n", "75% 12.750000 34.000000 \n", "max 17.000000 42.000000 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Using the describe() method\n", "summary = df.describe()\n", "summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can index and slice the above dataframe like any other dataframe." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FinalPrevious_Part
min21.022.0
max42.033.0
\n", "
" ], "text/plain": [ " Final Previous_Part\n", "min 21.0 22.0\n", "max 42.0 33.0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#slicing summary dataframe\n", "summary.loc[[\"min\", \"max\"], [\"Final\", \"Previous_Part\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we look at how to create new columns" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeFinal_Perc
0Jake32.0119.520.0110.033.0A0.942857
1Joe32.0120.016.0114.032.0A0.914286
2Susan30.0119.019.0110.533.0A-0.942857
3Sol31.0122.013.0113.034.0A0.971429
4Chris30.0119.017.0112.533.5A0.957143
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade Final_Perc \n", "0 1 10.0 33.0 A 0.942857 \n", "1 1 14.0 32.0 A 0.914286 \n", "2 1 10.5 33.0 A- 0.942857 \n", "3 1 13.0 34.0 A 0.971429 \n", "4 1 12.5 33.5 A 0.957143 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Create a New Column that is a function of other columns\n", "df[\"Final_Perc\"] = df[\"Final\"]/35\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#I can then delete it with the drop method\n", "df.drop([\"Final_Perc\"], inplace = True, axis=1)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The inplace argument works as follows:\n", "\n", " - inplace = True : The dataframe itself will have the given column(s) deleted.\n", " - inplace = False: Will return a dataframe with the column(s) deleted.\n", " \n", " The axis argument works as follows:\n", " \n", " - axis = 1 : delete columns given\n", " - axis = 0 : delete rows given.\n", " \n", " Let's look at an example where we delete rows" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
1Joe32.0120.016.0114.032.0A
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
5Tarik31.0119.019.018.024.0B
6Malik31.5120.021.019.036.0A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "1 Joe 32.0 1 20.0 16.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "5 Tarik 31.0 1 19.0 19.0 \n", "6 Malik 31.5 1 20.0 21.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "1 1 14.0 32.0 A \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A \n", "5 1 8.0 24.0 B \n", "6 1 9.0 36.0 A " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Delete rows with index 0 and 2\n", "drop_rows = df.drop([0,2], inplace = False, axis=0)\n", "drop_rows.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at df" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that df was not changed! This is what happens when you set inplace=False." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how we can sort a data frame. The inplace argument has the same effect as in the drop method." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
8Ronaldo33.0120.020.0117.042.0A+
9Messi30.5117.018.019.037.0A-
6Malik31.5120.021.019.036.0A
17Vik31.5115.019.0113.035.0A
3Sol31.0122.013.0113.034.0A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "8 Ronaldo 33.0 1 20.0 20.0 \n", "9 Messi 30.5 1 17.0 18.0 \n", "6 Malik 31.5 1 20.0 21.0 \n", "17 Vik 31.5 1 15.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "8 1 17.0 42.0 A+ \n", "9 1 9.0 37.0 A- \n", "6 1 9.0 36.0 A \n", "17 1 13.0 35.0 A \n", "3 1 13.0 34.0 A " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Sort the data frame according to the Final column\n", "#Setting inplace=False returns the sorted dataframe without changing df\n", "df.sort_values(by = [\"Final\"], inplace=False, ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's sort by multiple columns; specifying more than one column essentially specifies a tie-break." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
3Sol31.0122.013.0113.034.0A
8Ronaldo33.0120.020.0117.042.0A+
1Joe32.0120.016.0114.032.0A
12Charlie32.0120.017.0111.034.0A
6Malik31.5120.021.019.036.0A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "3 Sol 31.0 1 22.0 13.0 \n", "8 Ronaldo 33.0 1 20.0 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "12 Charlie 32.0 1 20.0 17.0 \n", "6 Malik 31.5 1 20.0 21.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "3 1 13.0 34.0 A \n", "8 1 17.0 42.0 A+ \n", "1 1 14.0 32.0 A \n", "12 1 11.0 34.0 A \n", "6 1 9.0 36.0 A " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Sort by Mini Exam 1 and tie-break with Previous Part\n", "\n", "result_sorted = df.sort_values(by = [\"Mini_Exam1\", \"Previous_Part\"], inplace=False, ascending=False)\n", "result_sorted.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Now let's dive into the datetime column type with the Parking data set, where each row corresponds to a different parking ticket given in NYC." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0Registration_StateIssue_DateVehicle_Make
00NY6/30/16 2:17TOYOT
11NY7/4/16 1:18ME/BE
22NY7/11/16 6:15LINCO
33NY7/4/16 1:10NISSA
44NY7/1/16 6:30VOLKS
\n", "
" ], "text/plain": [ " Unnamed: 0 Registration_State Issue_Date Vehicle_Make\n", "0 0 NY 6/30/16 2:17 TOYOT\n", "1 1 NY 7/4/16 1:18 ME/BE\n", "2 2 NY 7/11/16 6:15 LINCO\n", "3 3 NY 7/4/16 1:10 NISSA\n", "4 4 NY 7/1/16 6:30 VOLKS" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Read in parking and let's have a look.\n", "df_parking = pd.read_csv(\"Data/Parking.csv\")\n", "df_parking.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I don't like having that Unnamed column. I can fix this by telling pandas that I want that column to be the index instead of a separate column." ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Registration_StateIssue_DateVehicle_Make
0NY6/30/16 2:17TOYOT
1NY7/4/16 1:18ME/BE
2NY7/11/16 6:15LINCO
3NY7/4/16 1:10NISSA
4NY7/1/16 6:30VOLKS
\n", "
" ], "text/plain": [ " Registration_State Issue_Date Vehicle_Make\n", "0 NY 6/30/16 2:17 TOYOT\n", "1 NY 7/4/16 1:18 ME/BE\n", "2 NY 7/11/16 6:15 LINCO\n", "3 NY 7/4/16 1:10 NISSA\n", "4 NY 7/1/16 6:30 VOLKS" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Read in parking and specify column that will serve as index\n", "df_parking = pd.read_csv(\"Data/Parking.csv\", index_col=0)\n", "df_parking.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's have a look at how pandas read in each column" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Registration_State object\n", "Issue_Date object\n", "Vehicle_Make object\n", "dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#look at how each column is stored\n", "df_parking.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I want Issue_Date to be a datetime and not a string! Let's convert it." 
] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Registration_State object\n", "Issue_Date datetime64[ns]\n", "Vehicle_Make object\n", "dtype: object" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Reset the column Issue Date to be a datetime\n", "df_parking[\"Issue_Date\"] = pd.to_datetime(df_parking[\"Issue_Date\"]) \n", "\n", "#Now its a datetime object\n", "df_parking.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first entry of this column, that is now a datetime column" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Timestamp('2016-06-30 02:17:00')" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_entry = df_parking.loc[0,\"Issue_Date\"]\n", "first_entry" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that it is a timestamp. Timestamps have lots of nice attributes that we can extract." 
] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get the day\n", "first_entry.day" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "6" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get the month\n", "first_entry.month" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Thursday'" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We can get the weekday name\n", "first_entry.weekday_name" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We can even see if the year is a leap year\n", "first_entry.is_leap_year" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to get these attributes for an entire column, then we have to throw in a .dt" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Thursday\n", "1 Monday\n", "2 Monday\n", "3 Monday\n", "4 Friday\n", "Name: Issue_Date, dtype: object" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get the day of the week for the entire column\n", "all_dow = df_parking[\"Issue_Date\"].dt.weekday_name\n", "all_dow.head()" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Registration_StateIssue_DateVehicle_MakeDOW
0NY2016-06-30 02:17:00TOYOTThursday
1NY2016-07-04 01:18:00ME/BEMonday
2NY2016-07-11 06:15:00LINCOMonday
3NY2016-07-04 01:10:00NISSAMonday
4NY2016-07-01 06:30:00VOLKSFriday
\n", "
" ], "text/plain": [ " Registration_State Issue_Date Vehicle_Make DOW\n", "0 NY 2016-06-30 02:17:00 TOYOT Thursday\n", "1 NY 2016-07-04 01:18:00 ME/BE Monday\n", "2 NY 2016-07-11 06:15:00 LINCO Monday\n", "3 NY 2016-07-04 01:10:00 NISSA Monday\n", "4 NY 2016-07-01 06:30:00 VOLKS Friday" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Lets add this column in \n", "df_parking[\"DOW\"] = df_parking[\"Issue_Date\"].dt.weekday_name\n", "df_parking.head()" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Thursday 81\n", "Tuesday 80\n", "Sunday 73\n", "Friday 73\n", "Wednesday 68\n", "Monday 64\n", "Saturday 60\n", "Name: DOW, dtype: int64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Let's see the most frequent days for parking tickets\n", "df_parking[\"DOW\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Timedelta('753 days 09:21:00')" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We can even do time arithmetic. 
We get a TimeDelta object\n", "delta = parking.Issue_Date.max() - parking.Issue_Date.min()\n", "delta" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(33660, 753)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We can pull out attributes\n", "delta.seconds, delta.days" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2016-06-30 02:17:00\n", "2016-06-29 21:17:00\n" ] } ], "source": [ "#Lets say I wanted to subtract 5 hours from the first parking ticket\n", "firstTicket = parking.loc[0,\"Issue_Date\"]\n", "\n", "#We can create a TimeDelta Object\n", "timeDiff = pd.Timedelta(hours = 5)\n", "\n", "print(firstTicket)\n", "print(firstTicket - timeDiff)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas - Part 3\n", "\n", "In this part, we will a collection of important miscellaneous concepts that include:\n", "\n", "- Changing columns names\n", "- Combining dataframes\n", "- Understanding the index\n", "- Missing Data\n", "- Reading from Excel" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "#Read in the data frame\n", "df=pd.read_csv(\"Data/Grades.csv\", header=0)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that we can get the column names through the attribute column" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['Name', 'Previous_Part', 'Participation1', 'Mini_Exam1', 'Mini_Exam2',\n", " 'Participation2', 'Mini_Exam3', 'Final', 'Grade'],\n", " dtype='object')" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get the column names\n", "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can change column names through the rename method" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation_1Mini_Exam1Mini_Exam2Participation_2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation_1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "\n", " Participation_2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A " ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Change the column names\n", "df.rename(columns={\"Participation1\":\"Participation_1\", \"Participation2\":\"Participation_2\"}, inplace=True)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The format for the columns input is {\"old_column_name\":\"new_column_name\"}. It should be noted that the rename method can also be applied to change the index by changing columns to index." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to better understand the index or row labels. Currently, for the dataframe df, the index is just the row numbers. But what happens if we want to access the Sol's row? As it stands, we would have to remeber that Sol is in row 3.\n", "\n", "Another approach is to make the Names column the index." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Previous_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
Name
Jake32.0119.520.0110.033.0A
Joe32.0120.016.0114.032.0A
Susan30.0119.019.0110.533.0A-
Sol31.0122.013.0113.034.0A
Chris30.0119.017.0112.533.5A
\n", "
" ], "text/plain": [ " Previous_Part Participation1 Mini_Exam1 Mini_Exam2 Participation2 \\\n", "Name \n", "Jake 32.0 1 19.5 20.0 1 \n", "Joe 32.0 1 20.0 16.0 1 \n", "Susan 30.0 1 19.0 19.0 1 \n", "Sol 31.0 1 22.0 13.0 1 \n", "Chris 30.0 1 19.0 17.0 1 \n", "\n", " Mini_Exam3 Final Grade \n", "Name \n", "Jake 10.0 33.0 A \n", "Joe 14.0 32.0 A \n", "Susan 10.5 33.0 A- \n", "Sol 13.0 34.0 A \n", "Chris 12.5 33.5 A " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Setting the column Name to be the index\n", "df.set_index(\"Name\", inplace = True)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " The inplace = True command will change the dataframe df. Now the Name column is our index and we can access Sol's info as easily as: " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Previous_Part 31\n", "Participation1 1\n", "Mini_Exam1 22\n", "Mini_Exam2 13\n", "Participation2 1\n", "Mini_Exam3 13\n", "Final 34\n", "Grade A\n", "Name: Sol, dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Access Sol's info\n", "df.loc[\"Sol\",:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When setting the index, make sure you choose a column that will uniquely identify each row.\n", "\n", "We can change the the index back to row numbers using the reset_index() method." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Resetting the index\n", "df.reset_index(drop=False, inplace=True)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are back to the original data frame. Setting drop = False (default) adds the old index as a new column in the dataframe instead of just deletiing it.\n", "\n", "Next, we see how to combine or concatenate two (or more) data frames." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#I can combine data frames with concat function\n", "head = df.head()\n", "tail = df.tail()\n", "\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Have a look at the variable head\n", "head" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
14Chrinstine29.0113.015.519.031.0B
15Josh23.5117.012.018.523.0C+
16Jackson28.0118.015.517.031.0B
17Vik31.5115.019.0113.035.0A
18Sarah22.0118.013.019.021.0C+
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "14 Chrinstine 29.0 1 13.0 15.5 \n", "15 Josh 23.5 1 17.0 12.0 \n", "16 Jackson 28.0 1 18.0 15.5 \n", "17 Vik 31.5 1 15.0 19.0 \n", "18 Sarah 22.0 1 18.0 13.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "14 1 9.0 31.0 B \n", "15 1 8.5 23.0 C+ \n", "16 1 7.0 31.0 B \n", "17 1 13.0 35.0 A \n", "18 1 9.0 21.0 C+ " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Have a look at the variable head\n", "tail" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
14Chrinstine29.0113.015.519.031.0B
15Josh23.5117.012.018.523.0C+
16Jackson28.0118.015.517.031.0B
17Vik31.5115.019.0113.035.0A
18Sarah22.0118.013.019.021.0C+
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "14 Chrinstine 29.0 1 13.0 15.5 \n", "15 Josh 23.5 1 17.0 12.0 \n", "16 Jackson 28.0 1 18.0 15.5 \n", "17 Vik 31.5 1 15.0 19.0 \n", "18 Sarah 22.0 1 18.0 13.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A \n", "14 1 9.0 31.0 B \n", "15 1 8.5 23.0 C+ \n", "16 1 7.0 31.0 B \n", "17 1 13.0 35.0 A \n", "18 1 9.0 21.0 C+ " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#axis=0 says stack them top to bottom. axis =1 stacks side to side \n", "dfConcat = pd.concat([head,tail], axis =0)\n", "dfConcat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the concat method takes a list of dataframes as the first input and also an axis input for whether you want to stock top to bottom or side to side. Note that after we stack, the index is messed up. Let's use the reset_index method to change the index back to row numbers." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGrade
0Jake32.0119.520.0110.033.0A
1Joe32.0120.016.0114.032.0A
2Susan30.0119.019.0110.533.0A-
3Sol31.0122.013.0113.034.0A
4Chris30.0119.017.0112.533.5A
5Chrinstine29.0113.015.519.031.0B
6Josh23.5117.012.018.523.0C+
7Jackson28.0118.015.517.031.0B
8Vik31.5115.019.0113.035.0A
9Sarah22.0118.013.019.021.0C+
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe 32.0 1 20.0 16.0 \n", "2 Susan 30.0 1 19.0 19.0 \n", "3 Sol 31.0 1 22.0 13.0 \n", "4 Chris 30.0 1 19.0 17.0 \n", "5 Chrinstine 29.0 1 13.0 15.5 \n", "6 Josh 23.5 1 17.0 12.0 \n", "7 Jackson 28.0 1 18.0 15.5 \n", "8 Vik 31.5 1 15.0 19.0 \n", "9 Sarah 22.0 1 18.0 13.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade \n", "0 1 10.0 33.0 A \n", "1 1 14.0 32.0 A \n", "2 1 10.5 33.0 A- \n", "3 1 13.0 34.0 A \n", "4 1 12.5 33.5 A \n", "5 1 9.0 31.0 B \n", "6 1 8.5 23.0 C+ \n", "7 1 7.0 31.0 B \n", "8 1 13.0 35.0 A \n", "9 1 9.0 21.0 C+ " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfConcat.reset_index(inplace= True, drop=True)\n", "\n", "dfConcat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling Missing Data\n", "\n", "Missing data is common in most data analysis applications. You have a number of options for filtering out missing data. One option is doing it by hand or you can use the *dropna* method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With dataframes objects, things get a little more complex. You may want to drop rows or columns which are all NA or just those containing any NAs. *dropna* by default drops any row containing a missing value." ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0Jake32.0119.520110.033.0A-1
1JoeNaN120.016114.032.0A23
2Sol31.0122.013113.034.0A34
3Chris30.0-119.0not available112.533.5A72
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20 \n", "1 Joe NaN 1 20.0 16 \n", "2 Sol 31.0 1 22.0 13 \n", "3 Chris 30.0 -1 19.0 not available \n", "\n", " Participation2 Mini_Exam3 Final Grade Temp \n", "0 1 10.0 33.0 A -1 \n", "1 1 14.0 32.0 A 23 \n", "2 1 13.0 34.0 A 34 \n", "3 1 12.5 33.5 A 72 " ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Here we have two pieces of missing data\n", "df_missing = pd.read_csv(\"Data/Missing_Data.csv\")\n", "df_missing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The isnull() method returns a series or dataframe of booleans corresponding to whether the particular entries are null or not." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1FalseTrueFalseFalseFalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 Participation2 \\\n", "0 False False False False False False \n", "1 False True False False False False \n", "2 False False False False False False \n", "3 False False False False False False \n", "\n", " Mini_Exam3 Final Grade Temp \n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False " ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#isnull method for a data frame\n", "df_missing.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can make sure they are all read in as NA values using the na_values input when we read in the file" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0Jake32.0119.520.0110.033.0A-1
1JoeNaN120.016.0114.032.0A23
2Sol31.0122.013.0113.034.0A34
3Chris30.0-119.0NaN112.533.5A72
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1 19.5 20.0 \n", "1 Joe NaN 1 20.0 16.0 \n", "2 Sol 31.0 1 22.0 13.0 \n", "3 Chris 30.0 -1 19.0 NaN \n", "\n", " Participation2 Mini_Exam3 Final Grade Temp \n", "0 1 10.0 33.0 A -1 \n", "1 1 14.0 32.0 A 23 \n", "2 1 13.0 34.0 A 34 \n", "3 1 12.5 33.5 A 72 " ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Notice that here the not available is turned into an NaN value\n", "df_missing_NA = pd.read_csv(\"Data/Missing_Data.csv\", na_values=[\"NaN\", \"not available\"])\n", "df_missing_NA" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 False\n", "Name: Previous_Part, dtype: bool" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Let's rerun the isnull() method on the Previous_Part column\n", "df_missing_NA.Previous_Part.isnull()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1FalseTrueFalseFalseFalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseTrueFalseFalseFalseFalseFalse
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 Participation2 \\\n", "0 False False False False False False \n", "1 False True False False False False \n", "2 False False False False False False \n", "3 False False False False True False \n", "\n", " Mini_Exam3 Final Grade Temp \n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False " ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Now on the entire dataframe\n", "df_missing_NA.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say we now realize that the -1 in the Participation column is a NA value. If we add -1 to the na_values input, we will also replace the -1 in the Temp column. Luckily, we can give a dictionary to the na_values input which specifies the NA values in each columns " ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0Jake32.01.019.520.0110.033.0A-1
1JoeNaN1.020.016.0114.032.0A23
2Sol31.01.022.013.0113.034.0A34
3Chris30.0NaN19.0NaN112.533.5A72
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1.0 19.5 20.0 \n", "1 Joe NaN 1.0 20.0 16.0 \n", "2 Sol 31.0 1.0 22.0 13.0 \n", "3 Chris 30.0 NaN 19.0 NaN \n", "\n", " Participation2 Mini_Exam3 Final Grade Temp \n", "0 1 10.0 33.0 A -1 \n", "1 1 14.0 32.0 A 23 \n", "2 1 13.0 34.0 A 34 \n", "3 1 12.5 33.5 A 72 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Note that the Temp column is unaffected\n", "df_missing_NA2 = pd.read_csv(\"Data/Missing_Data.csv\",\\\n", " na_values={\"Previous_Part\":\"NA\", \"Participation1\":-1,\"Mini_Exam2\":\"not available\"})\n", "df_missing_NA2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see how we can change or replace these NA values." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0Jake32.01.019.520.0110.033.0A-1
2Sol31.01.022.013.0113.034.0A34
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1.0 19.5 20.0 \n", "2 Sol 31.0 1.0 22.0 13.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade Temp \n", "0 1 10.0 33.0 A -1 \n", "2 1 13.0 34.0 A 34 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get rid of all rows with an NA\n", "df_missing_NA2.dropna(axis=0)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0Jake32.01.019.520.0110.033.0A-1
1JoeNaN1.020.016.0114.032.0A23
2Sol31.01.022.013.0113.034.0A34
3Chris30.0NaN19.0NaN112.533.5A72
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1.0 19.5 20.0 \n", "1 Joe NaN 1.0 20.0 16.0 \n", "2 Sol 31.0 1.0 22.0 13.0 \n", "3 Chris 30.0 NaN 19.0 NaN \n", "\n", " Participation2 Mini_Exam3 Final Grade Temp \n", "0 1 10.0 33.0 A -1 \n", "1 1 14.0 32.0 A 23 \n", "2 1 13.0 34.0 A 34 \n", "3 1 12.5 33.5 A 72 " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Passing how='all' will only drop rows that are all NA (doesn't change anything)\n", "df_missing_NA2.dropna(how='all')" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0Jake32.01.019.520.0110.033.0A-1
1JoeNaN1.020.016.0114.032.0A23
2Sol31.01.022.013.0113.034.0A34
3Chris30.0NaN19.0NaN112.533.5A72
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1.0 19.5 20.0 \n", "1 Joe NaN 1.0 20.0 16.0 \n", "2 Sol 31.0 1.0 22.0 13.0 \n", "3 Chris 30.0 NaN 19.0 NaN \n", "\n", " Participation2 Mini_Exam3 Final Grade Temp \n", "0 1 10.0 33.0 A -1 \n", "1 1 14.0 32.0 A 23 \n", "2 1 13.0 34.0 A 34 \n", "3 1 12.5 33.5 A 72 " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Dropping columns is just a matter of passing axis=1 (doesn't change anything)\n", "df_missing_NA2.dropna(axis=1,how='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than filtering out missing data, you may want to fill in the \"holes\" in any number of ways. For most purposes, calling the *fillna* method with a constant replaces every missing value with that constant." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0Jake32.01.019.520.0110.033.0A-1
1Joe0.01.020.016.0114.032.0A23
2Sol31.01.022.013.0113.034.0A34
3Chris30.00.019.00.0112.533.5A72
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1.0 19.5 20.0 \n", "1 Joe 0.0 1.0 20.0 16.0 \n", "2 Sol 31.0 1.0 22.0 13.0 \n", "3 Chris 30.0 0.0 19.0 0.0 \n", "\n", " Participation2 Mini_Exam3 Final Grade Temp \n", "0 1 10.0 33.0 A -1 \n", "1 1 14.0 32.0 A 23 \n", "2 1 13.0 34.0 A 34 \n", "3 1 12.5 33.5 A 72 " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_missing_NA2.fillna(0)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0Jake32.01.019.520.0110.033.0A-1
1Joe5.01.020.016.0114.032.0A23
2Sol31.01.022.013.0113.034.0A34
3Chris30.0NaN19.00.5112.533.5A72
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1.0 19.5 20.0 \n", "1 Joe 5.0 1.0 20.0 16.0 \n", "2 Sol 31.0 1.0 22.0 13.0 \n", "3 Chris 30.0 NaN 19.0 0.5 \n", "\n", " Participation2 Mini_Exam3 Final Grade Temp \n", "0 1 10.0 33.0 A -1 \n", "1 1 14.0 32.0 A 23 \n", "2 1 13.0 34.0 A 34 \n", "3 1 12.5 33.5 A 72 " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#You can pass fillna a dict which gives the replacement value for each column\n", "df_missing_NA2.fillna({\"Previous_Part\":5,\"Mini_Exam2\":0.5})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With *fillna* you can do lots of things with a little creativity. For example, you might pass the mean or median value of a series.\n" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrevious_PartParticipation1Mini_Exam1Mini_Exam2Participation2Mini_Exam3FinalGradeTemp
0Jake32.01.019.520.000000110.033.0A-1
1Joe31.01.020.016.000000114.032.0A23
2Sol31.01.022.013.000000113.034.0A34
3Chris30.01.019.016.333333112.533.5A72
\n", "
" ], "text/plain": [ " Name Previous_Part Participation1 Mini_Exam1 Mini_Exam2 \\\n", "0 Jake 32.0 1.0 19.5 20.000000 \n", "1 Joe 31.0 1.0 20.0 16.000000 \n", "2 Sol 31.0 1.0 22.0 13.000000 \n", "3 Chris 30.0 1.0 19.0 16.333333 \n", "\n", " Participation2 Mini_Exam3 Final Grade Temp \n", "0 1 10.0 33.0 A -1 \n", "1 1 14.0 32.0 A 23 \n", "2 1 13.0 34.0 A 34 \n", "3 1 12.5 33.5 A 72 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Replace with mean\n", "df_missing_NA2.fillna(df_missing_NA2.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Excel Files (.xlsx)\n", "\n", "We can use the read_excel/to_excel methods, which both take a sheet name as input. You can use string formatting to build the correct sheet name. Let's say I want to read in the workbook titled \"Excel_Reading.xlsx\", fill in its missing values with the column means, and write the result to a new file." ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1000500010000
00.03039920.0234840.023118
10.02368960.0228570.023055
20.02933860.0232600.022905
30.02711260.0250120.023738
40.02646170.0238540.023733
5Not avail0.0249790.023728
60.02885330.0236760.022813
70.02434540.0229290.023441
8NaN0.0230210.022286
90.02532620.0240340.023799
\n", "
" ], "text/plain": [ " 1000 5000 10000\n", "0 0.0303992 0.023484 0.023118\n", "1 0.0236896 0.022857 0.023055\n", "2 0.0293386 0.023260 0.022905\n", "3 0.0271126 0.025012 0.023738\n", "4 0.0264617 0.023854 0.023733\n", "5 Not avail 0.024979 0.023728\n", "6 0.0288533 0.023676 0.022813\n", "7 0.0243454 0.022929 0.023441\n", "8 NaN 0.023021 0.022286\n", "9 0.0253262 0.024034 0.023799" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Read in an excel file\n", "df1 = pd.read_excel(\"Data/Excel_Reading.xlsx\", \"Sheet1\")\n", "df1" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1000500010000
00.0303990.0234840.023118
10.0236900.0228570.023055
20.0293390.0232600.022905
30.0271130.0250120.023738
40.0264620.0238540.023733
5NaN0.0249790.023728
60.0288530.0236760.022813
70.0243450.0229290.023441
8NaN0.0230210.022286
90.0253260.0240340.023799
\n", "
" ], "text/plain": [ " 1000 5000 10000\n", "0 0.030399 0.023484 0.023118\n", "1 0.023690 0.022857 0.023055\n", "2 0.029339 0.023260 0.022905\n", "3 0.027113 0.025012 0.023738\n", "4 0.026462 0.023854 0.023733\n", "5 NaN 0.024979 0.023728\n", "6 0.028853 0.023676 0.022813\n", "7 0.024345 0.022929 0.023441\n", "8 NaN 0.023021 0.022286\n", "9 0.025326 0.024034 0.023799" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We can replace NAs with na_values as we did before\n", "df1 = pd.read_excel(\"Data/Excel_Reading.xlsx\", \"Sheet1\", na_values = [\"NA\", \"Not avail\"])\n", "df1" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1000500010000
00.0303990.0234840.023118
10.0236900.0228570.023055
20.0293390.0232600.022905
30.0271130.0250120.023738
40.0264620.0238540.023733
50.0269410.0249790.023728
60.0288530.0236760.022813
70.0243450.0229290.023441
80.0269410.0230210.022286
90.0253260.0240340.023799
\n", "
" ], "text/plain": [ " 1000 5000 10000\n", "0 0.030399 0.023484 0.023118\n", "1 0.023690 0.022857 0.023055\n", "2 0.029339 0.023260 0.022905\n", "3 0.027113 0.025012 0.023738\n", "4 0.026462 0.023854 0.023733\n", "5 0.026941 0.024979 0.023728\n", "6 0.028853 0.023676 0.022813\n", "7 0.024345 0.022929 0.023441\n", "8 0.026941 0.023021 0.022286\n", "9 0.025326 0.024034 0.023799" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Fill with mean\n", "df1.fillna(df1.mean())" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#We can write the file with to_excel. We can specify a start row and column\n", "df1.to_excel(\"NewFile.xlsx\", \"Sheet1\", startrow=5, startcol=5)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "138px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }