# Practice with apply

Consider the following dataframe:

In [9]:
import pandas as pd

#Create dataframe
names = ["Chapman, Matt", "Olson, Matt", "Davis, Khris"]
positions = ["3B", "1B", "OF"]
hrs = [25,30,45]
df_players = pd.DataFrame({"names":names, "pos":positions, "hr":hrs})
df_players

Unnamed: 0,hr,names,pos
0,25,"Chapman, Matt",3B
1,30,"Olson, Matt",1B
2,45,"Davis, Khris",OF


Use apply to add a column for each player's last name.

In [6]:
def Get_Last_Name(name):
    #Names are stored as "Last, First", so the last name precedes the comma
    return name.split(",")[0].strip()

#Write your code here
df_players["last_name"] = df_players.names.apply(Get_Last_Name)
df_players

Unnamed: 0,hr,names,pos,last_name
0,25,"Chapman, Matt",3B,Chapman
1,30,"Olson, Matt",1B,Olson
2,45,"Davis, Khris",OF,Davis
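A last-name column can also be built without apply, using pandas' vectorized string methods. The sketch below assumes every name has the "Last, First" form shown above; the frame `df_demo` and the column name `last_name_vec` are illustrative and not part of the exercise.

```python
import pandas as pd

# Illustrative copy of the names column (assumes the "Last, First" format)
df_demo = pd.DataFrame({"names": ["Chapman, Matt", "Olson, Matt", "Davis, Khris"]})

# Split each name on the comma, keep the first piece, and trim whitespace
df_demo["last_name_vec"] = df_demo["names"].str.split(",").str[0].str.strip()
print(df_demo)
```

For a three-row table the two approaches are interchangeable; the string-method version mainly saves writing a helper function.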


Add a column denoting whether the player is an all-star. We will say a player is an all-star if they play third base or outfield and have at least threshold_hr = 20 home runs. If the player is an all-star, the row in the new column should read "all-star"; otherwise it should read "not selected".

In [8]:
#Write your code here
def Check_All_Star(row, threshold_hr):
    
    pos = row["pos"]
    hr = row["hr"]
    if pos in ["3B", "OF"] and hr >= threshold_hr:
        return "all-star"
    else:
        return "not selected"
    
df_players["AS"] = df_players.apply(Check_All_Star, threshold_hr = 20, axis = 1)
df_players

Unnamed: 0,hr,names,pos,AS
0,25,"Chapman, Matt",3B,all-star
1,30,"Olson, Matt",1B,not selected
2,45,"Davis, Khris",OF,all-star
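The same rule can be expressed as a boolean mask instead of a row-wise apply. This is only a sketch on an illustrative frame: `df_demo`, `is_all_star`, and the column name `AS_vec` are made-up names, and "at least" is read as `>=`.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the two columns the rule depends on
df_demo = pd.DataFrame({"pos": ["3B", "1B", "OF"], "hr": [25, 30, 45]})
threshold_hr = 20

# True where the player plays 3B or OF and has at least threshold_hr home runs
is_all_star = df_demo["pos"].isin(["3B", "OF"]) & (df_demo["hr"] >= threshold_hr)

# Translate the mask into the two labels used in the exercise
df_demo["AS_vec"] = np.where(is_all_star, "all-star", "not selected")
print(df_demo)
```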


# Movie Review Sentiment Analysis

In this practice exercise, we will use logistic regression to classify the sentiment of movie reviews. The file "Movie_Reviews.tsv" contains phrases from movie reviews whose sentiment has already been classified. The "Sentiment" column has the following meaning:

- 0: negative
- 1: somewhat negative
- 2: neutral
- 3: somewhat positive
- 4: positive
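When reading results, it can help to translate these codes back into labels. A minimal sketch using `Series.map`; the dictionary name `sentiment_labels` and the sample codes are illustrative and not taken from the data file.

```python
import pandas as pd

# Code-to-label mapping copied from the list above
sentiment_labels = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}

# Illustrative codes, not rows from Movie_Reviews.tsv
codes = pd.Series([1, 2, 4])
labels = codes.map(sentiment_labels)
print(labels.tolist())
```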

In [1]:
#Place imports here
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression

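In [3]:
#stopwords is a corpus reader, not a word list, so a membership test on it fails;
#call stopwords.words('english') to get the list of words
"a" in stopwords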

TypeError: argument of type 'WordListCorpusReader' is not iterable

In [1]:
#Write your code here
movie_reviews = pd.read_csv("Data/Movie_Reviews.tsv", delimiter="\t")

movie_reviews.head(10)

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2
5,6,1,of escapades demonstrating the adage that what...,2
6,7,1,of,2
7,8,1,escapades demonstrating the adage that what is...,2
8,9,1,escapades,2
9,10,1,demonstrating the adage that what is good for ...,2


In order to run a logistic regression, we need to create features. Let's use something similar to the sentiment scores from HW #3 as features. In other words, let's use (# pos words/# total words) and (# neg words/# total words). To do this, let's again write a function that reads in the positive and negative words in lowercase.

In [10]:
def Read_Words(file_path):
    
    #Read the file, lowercase it and split it into words on spaces
    with open(file_path) as f:
        return f.read().lower().strip('\n').split(" ")

354

Let's get rid of all phrases that have fewer than 10 words. Write a function that counts the words in a phrase, then use apply to add a column with these counts.

In [40]:
def Count_Words(phrase):
    
    words = word_tokenize(phrase)
    total_words = [word for word in words if word.isalpha()]

    return len(total_words)
    

In [41]:
#Add phrase_length column
movie_reviews["Phrase_Length"] = movie_reviews.Phrase.apply(Count_Words)

movie_reviews.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,Phrase_Length
0,1,1,A series of escapades demonstrating the adage ...,1,35
1,2,1,A series of escapades demonstrating the adage ...,2,14
2,3,1,A series,2,2
3,4,1,A,2,1
4,5,1,series,2,1


In [42]:
#Get rid of phrases that have fewer than 10 words
#(.copy() gives an independent frame, so adding columns later avoids SettingWithCopyWarning)
movie_reviews = movie_reviews.loc[movie_reviews.Phrase_Length>=10,:].copy()

movie_reviews.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,Phrase_Length
0,1,1,A series of escapades demonstrating the adage ...,1,35
1,2,1,A series of escapades demonstrating the adage ...,2,14
5,6,1,of escapades demonstrating the adage that what...,2,12
7,8,1,escapades demonstrating the adage that what is...,2,11
9,10,1,demonstrating the adage that what is good for ...,2,10
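Note that the filtered table keeps the original row labels (0, 1, 5, 7, 9 above) rather than renumbering them. The toy sketch below shows the same behavior on made-up phrases; `df_demo` and `long_phrases` are illustrative names, and a whitespace split stands in for the tokenizer.

```python
import pandas as pd

# Made-up phrases: one short, one with exactly ten words
df_demo = pd.DataFrame({
    "Phrase": ["A series", "one two three four five six seven eight nine ten"]
})

# Simplified word count: split on whitespace instead of tokenizing
df_demo["Phrase_Length"] = df_demo["Phrase"].str.split().str.len()

# Keep rows with at least 10 words; the surviving row keeps its label 1
long_phrases = df_demo.loc[df_demo["Phrase_Length"] >= 10].copy()
print(long_phrases.index.tolist())

# Optional: renumber the rows from 0 if positional labels are preferred
renumbered = long_phrases.reset_index(drop=True)
```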


Now we need to write a function that counts the fraction of each type of word, which we can then pass to the apply function. This function will first have to parse the phrase and get counts of the total number of words (ignoring stop words) and the number of positive or negative words.

In [43]:
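def Get_Fraction(phrase,stop_words, all_sent_words):
    
    #Lowercase the tokens so they match the lowercase word lists
    words = [word.lower() for word in word_tokenize(phrase)]
    #Denominator: alphabetic tokens that are not stop words
    total_words = [word for word in words if word not in stop_words and word.isalpha()]
    #Numerator: alphabetic tokens that appear in the sentiment word list
    sent_words = [word for word in words if word in all_sent_words and word.isalpha()]
    
    #Avoid dividing by zero when a phrase contains only stop words
    if len(total_words) == 0:
        return 0.0
    return len(sent_words)/len(total_words)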
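To make the ratio concrete, here is a small worked example. It is a simplified sketch: `toy_fraction`, `toy_stop_words`, and `toy_pos_words` are hypothetical names with made-up word lists, and a plain whitespace split replaces `word_tokenize` so the snippet runs without NLTK data.

```python
# Hypothetical toy word lists; the exercise reads the real ones from files
toy_stop_words = {"the", "was", "a", "and"}
toy_pos_words = {"great", "fun"}

def toy_fraction(phrase, stop_words, sent_words):
    # Simplification: whitespace split instead of nltk's word_tokenize
    words = [w.lower() for w in phrase.split() if w.isalpha()]
    kept = [w for w in words if w not in stop_words]   # non-stop words
    hits = [w for w in words if w in sent_words]       # sentiment words
    return len(hits) / len(kept) if kept else 0.0

# Non-stop words: movie, great, cast, fun (4); positive words: great, fun (2)
score = toy_fraction("The movie was great and the cast was fun",
                     toy_stop_words, toy_pos_words)
print(score)  # 0.5
```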
    
    

Use the apply() method to add two columns to the data frame, one for each of these proportions.

In [44]:
#Get positive, negative and stop words
pos_words = Read_Words("Data/pos_words.txt")
neg_words = Read_Words("Data/neg_words.txt")
stop_words = stopwords.words('english')

#Add the two new columns
movie_reviews["Pos_Score"] = movie_reviews.Phrase.apply(Get_Fraction, stop_words = stop_words, all_sent_words = pos_words)
movie_reviews["Neg_Score"] = movie_reviews.Phrase.apply(Get_Fraction, stop_words = stop_words, all_sent_words = neg_words)
movie_reviews.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,Phrase_Length,Pos_Score,Neg_Score
0,1,1,A series of escapades demonstrating the adage ...,1,35,0.133333,0.0
1,2,1,A series of escapades demonstrating the adage ...,2,14,0.166667,0.0
5,6,1,of escapades demonstrating the adage that what...,2,12,0.2,0.0
7,8,1,escapades demonstrating the adage that what is...,2,11,0.2,0.0
9,10,1,demonstrating the adage that what is good for ...,2,10,0.25,0.0


Now we will use sklearn's LogisticRegression class to fit a multinomial logistic regression on these two features.

In [51]:
feature_list = ["Pos_Score", "Neg_Score"]
X = np.array(movie_reviews.loc[:, feature_list])
y = np.array(movie_reviews.Sentiment)
    
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression(multi_class = "multinomial", solver = 'newton-cg')
model = model.fit(X, y)
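A fitted multiclass model exposes the class labels it learned and can return a probability for each class, not just a single prediction. The sketch below illustrates this on random synthetic data, not on the movie reviews; `X_demo`, `y_demo`, and `demo_model` are illustrative names, and the model uses sklearn's default settings rather than the options chosen above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 60 rows, 2 features, 3 classes (not the review data)
rng = np.random.default_rng(0)
X_demo = rng.random((60, 2))
y_demo = np.repeat([0, 1, 2], 20)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Class labels, in the column order used by predict_proba
print(demo_model.classes_)

# One row of class probabilities per input row; each row sums to 1
proba = demo_model.predict_proba(X_demo[:2])
print(proba.shape)
```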

Now let's see how the model works by computing the two scores for new reviews and checking how it classifies them. Here is what it predicts for the following two reviews.

In [57]:
review_1 = "This movie was the worst movie I've ever seen.  The acting was horrible!"
review_2 = "Amazing! simply the best movie ever. I would love to see it again."
pos_score_1 = Get_Fraction(review_1 ,stop_words, pos_words)
neg_score_1 = Get_Fraction(review_1 ,stop_words, neg_words)
pos_score_2 = Get_Fraction(review_2 ,stop_words, pos_words)
neg_score_2 = Get_Fraction(review_2 ,stop_words, neg_words)

model.predict(np.array([[pos_score_1, neg_score_1], [pos_score_2, neg_score_2]]))

array([1, 3])