JUNE 2023 Data Science Course

THIS IS the 4th and final part of the complete Data Science with Python course, which started 3 months ago!

The 3rd part of the content can be accessed here.

'''
NLP – Natural Language Processing – analysing review comments to understand
the reasons behind positive and negative ratings.
Concepts covered: unigram, bigram, trigram

Steps we generally perform with NLP data:
1. Convert into lowercase
2. Decompose (non-unicode to unicode)
3. Remove accents: encode the content to ASCII values
4. Tokenization: break sentences into words
5. Stop words: remove words that are not important for the analysis
6. Lemmatization (done only on English words): convert the words into dictionary words
7. N-grams: sets of one word (unigram), two words (bigram), three words (trigram)
8. Plot the graph based on the number of occurrences and evaluate
'''
'''
Sample review comment (used in the sketch just below):
cardboard mousepad. Going worth price! Not bad
'''
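'''
Before the full pipeline below, here is a minimal, illustrative sketch (not part of
the original course script) of what unigrams, bigrams and trigrams look like for the
sample review above, using plain Python:
'''
sample = "cardboard mousepad. Going worth price! Not bad"
words = sample.lower().split()   # crude whitespace tokenization, just for this demo
unigrams = words
bigrams = [' '.join(p) for p in zip(words, words[1:])]
trigrams = [' '.join(t) for t in zip(words, words[1:], words[2:])]
print(unigrams)   # ['cardboard', 'mousepad.', 'going', 'worth', 'price!', 'not', 'bad']
print(bigrams)    # ['cardboard mousepad.', 'mousepad. going', ...]
print(trigrams)   # ['cardboard mousepad. going', ...]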

link = "D:/datasets/OnlineRetail/order_reviews.csv"
import pandas as pd
import unicodedata
import nltk
import matplotlib.pyplot as plt
# NLTK resources used below (download once, if not already present)
nltk.download('punkt')
nltk.download('stopwords')
df = pd.read_csv(link)
print(list(df.columns))
'''
['review_id', 'order_id', 'review_score', 'review_comment_title',
'review_comment_message', 'review_creation_date', 'review_answer_timestamp']
'''
#df['review_creation_date'] = pd.to_datetime(df['review_creation_date'])
#df['review_answer_timestamp'] = pd.to_datetime(df['review_answer_timestamp'])

# data preprocessing – making data ready for analysis
reviews_df = df[df['review_comment_message'].notnull()].copy()
#print(reviews_df)

# remove accents
def remove_accent(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', errors='ignore').decode('utf-8')

# STOP WORDS LIST:
STOP_WORDS = set(remove_accent(w) for w in nltk.corpus.stopwords.words('portuguese'))

'''
Write a function to perform the basic preprocessing steps
'''
def basic_preprocessing(text):
    # converting to lower case
    txt_pp = text.lower()
    #print(txt_pp)

    # remove the accents
    #txt_pp = unicodedata.normalize('NFKD', txt_pp).encode('ascii', errors='ignore').decode('utf-8')
    txt_pp = remove_accent(txt_pp)
    #print(txt_pp)

    # tokenize
    txt_token = nltk.tokenize.word_tokenize(txt_pp)
    #print(txt_token)

    # removing stop words
    txt_token = tuple(w for w in txt_token if w not in STOP_WORDS and w.isalpha())
    return txt_token
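'''
As a quick check (an illustrative call, not part of the original script), the sample
review from the top of this section can be passed through the function. Note that
STOP_WORDS above is Portuguese, so English words pass through unfiltered:
'''
print(basic_preprocessing("cardboard mousepad. Going worth price! Not bad"))
# expected to print something like:
# ('cardboard', 'mousepad', 'going', 'worth', 'price', 'not', 'bad')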



## write a function to create unigrams, bigrams, trigrams
def create_ngrams(words):
    unigrams, bigrams, trigrams = [], [], []
    for comment in words:
        unigrams.extend(comment)
        bigrams.extend(' '.join(bigram) for bigram in nltk.bigrams(comment))
        trigrams.extend(' '.join(trigram) for trigram in nltk.trigrams(comment))

    return unigrams, bigrams, trigrams


# applying basic preprocessing:
reviews_df['review_comment_words'] = \
    reviews_df['review_comment_message'].apply(basic_preprocessing)

# get positive reviews – all reviews with review_score of 5
reviews_5 = reviews_df[reviews_df['review_score'] == 5]

# get negative reviews – all reviews with review_score of 1
reviews_1 = reviews_df[reviews_df['review_score'] == 1]

# create n-grams for rating 5 and rating 1
uni_5, bi_5, tri_5 = create_ngrams(reviews_5['review_comment_words'])
print(uni_5)
print(bi_5)
print(tri_5)

# Assignment: perform similar tasks for reviews that are negative (review_score = 1)
#uni_1, bi_1, tri_1 = create_ngrams(reviews_1['review_comment_words'])
#print(uni_1)

# distribution plot
def plot_dist(words, color):
    nltk.FreqDist(words).plot(20, cumulative=False, color=color)

plot_dist(tri_5, "red")

# NLP – Natural Language Processing:
# sentiments: Positive, Neutral, Negative
#
'''
We will use the nltk library for NLP:
pip install nltk
'''
import nltk
# 1. Convert into lowercase
text = "Product is great but I amn't liking the colors as they are worst"
text = text.lower()

'''
2. Tokenize the content: break it into words or sentences
'''
text1 = text.split()
# using nltk
from nltk.tokenize import sent_tokenize, word_tokenize
text = word_tokenize(text)
#print("Text =\n", text)
#print("Text =\n", text1)

'''
3. Removing stop words: words which are not significant
for your analysis, e.g. an, a, the, is, are
'''
my_stopwords = ['is', 'i', 'the']
# loop over a copy of the list: removing items from the list
# we are iterating over would skip words
for w in list(text):
    if w in my_stopwords:
        text.remove(w)
print("Text after my stopwords:", text)

nltk.download("stopwords")
from nltk.corpus import stopwords
nltk_eng_stopwords = set(stopwords.words("english"))
#print("NLTK list of stop words in English: ", nltk_eng_stopwords)
'''
Just as an example: the word "but" appears in the stop words, but
we want to keep it in our text, so we remove it from the set.
'''
# removing 'but' from the NLTK stop words
nltk_eng_stopwords.remove('but')

for w in list(text):   # again, iterate over a copy while removing
    if w in nltk_eng_stopwords:
        text.remove(w)
print("Text after NLTK stopwords:", text)
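'''
A more idiomatic way to drop stop words (a sketch, equivalent to the loop above)
is a list comprehension, which avoids mutating the list while iterating:
'''
text = [w for w in text if w not in nltk_eng_stopwords]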

'''
4. Stemming: reducing a word to its root
e.g.: {help: [help, helped, helping, helper]}

One such method is the Porter stemmer.
'''
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
text = [stemmer.stem(w) for w in text]
''' the line above is equivalent to:
t_list = []
for w in text:
    a = stemmer.stem(w)
    t_list.append(a)
'''
print("Text after Stemming:", text)
'''
5. Part of Speech Tagging (POS Tagging)
Grammatical tagging: identifies the role each word plays in a sentence,
based on the 8 parts of speech – noun, verb, ...

Reference: https://www.educba.com/nltk-pos-tag/
POS Tagging will give tags like:

CC: coordinating conjunction
CD: cardinal digit
DT: determiner
EX: existential "there"
FW: foreign word
IN: preposition / subordinating conjunction
JJ: adjective
JJR and JJS: adjective, comparative and superlative
LS: list marker
MD: modal
NN: noun, singular
NNS, NNP, NNPS: noun plural, proper noun, proper noun plural
PDT: predeterminer
WRB: wh-adverb
WP$: possessive wh-pronoun
WP: wh-pronoun
WDT: wh-determiner
VBZ: verb, 3rd person singular present
VBP, VBN, VBG, VBD, VB: other verb forms
UH: interjection
TO: the word "to"
RP: particle
RBS, RB, RBR: adverb (superlative, base, comparative)
PRP, PRP$: personal pronoun and possessive pronoun

But to perform this, we need to download a tagger, e.g.:
nltk.download('averaged_perceptron_tagger')
'''
nltk.download('averaged_perceptron_tagger')

import nltk
from nltk.tag import DefaultTagger
py_tag = DefaultTagger('NN')              # tags every token as 'NN' by default
tag_eg1 = py_tag.tag(['Example', 'tag'])
print(tag_eg1)

#txt = "Example of nltk pos tag list"
#txt = ['product', 'great', 'but', 'not', 'like', 'color']
#txt = word_tokenize(txt)
#txt = ['Example', 'of', 'nltk', 'pos', 'tag', 'list']
pos_txt = nltk.pos_tag(text)
print("POS Tagging:", pos_txt)

'''
6. Lemmatizing
Takes a word back to its core (dictionary) meaning.
We need to download: wordnet
'''
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("Very good = ", lemmatizer.lemmatize("very good"))
print("Halves = ", lemmatizer.lemmatize("halves"))

text = "Product is great but I amn't liking the colors as they are worst"
text = word_tokenize(text)
text = [lemmatizer.lemmatize(w) for w in text]
print("Text after Lemmatizer: ", text)
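'''
By default lemmatize() treats every word as a noun; passing a part of speech
(a detail not covered above, shown here only as a small sketch) changes the result:
'''
print(lemmatizer.lemmatize("liking"))             # noun (default) -> 'liking'
print(lemmatizer.lemmatize("liking", pos="v"))    # verb -> 'like'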


# Sentiment analysis – read the sentiments of each sentence
'''
If you need more data for your analysis, this is a good source:
https://github.com/pycaret/pycaret/tree/master/datasets

We will use amazon.csv for this program.
'''
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

link = "https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/amazon.csv"
df = pd.read_csv(link)
print(df)

# Let's create a function to perform all the preprocessing steps
# of an NLP analysis
def preprocess_nlp(text):
    text = text.lower()          # lowercase
    text = word_tokenize(text)   # tokenize
    text = [w for w in text if w not in stopwords.words("english")]   # remove stop words
    # lemmatize
    lemm = WordNetLemmatizer()
    text = [lemm.lemmatize(w) for w in text]
    # now join the words back, since we will predict on each line of text
    text_out = ' '.join(text)
    return text_out

# import resource vader_lexicon
import nltk
nltk.download('vader_lexicon')


df['reviewText'] = df['reviewText'].apply(preprocess_nlp)
print(df)

# NLTK Sentiment Analyzer
# we will now define a function get_sentiment() which will return
# 1 for positive and 0 for non-positive
analyzer = SentimentIntensityAnalyzer()
def get_sentiment(text):
    score = analyzer.polarity_scores(text)
    sentiment = 1 if score['pos'] > 0 else 0
    return sentiment
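'''
polarity_scores() also returns a combined 'compound' score in [-1, 1]. A common
alternative (a sketch, not what the course result below is based on) is to
threshold that compound score instead of the raw 'pos' value:
'''
def get_sentiment_compound(text, threshold=0.05):
    # assumption: 0.05 is the conventional VADER cut-off for "positive"
    score = analyzer.polarity_scores(text)
    return 1 if score['compound'] >= threshold else 0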

df['sentiment'] = df['reviewText'].apply(get_sentiment)

print("Dataframe after analyzing the sentiments:\n", df)

# confusion matrix
from sklearn.metrics import confusion_matrix
print("Confusion matrix:\n", confusion_matrix(df['Positive'], df['sentiment']))

''' RESULT

Confusion matrix:
[[ 1131  3636]
 [  576 14657]]
Accuracy: (1131 + 14657) / (1131 + 14657 + 576 + 3636) = 15788/20000 = 78.94%
'''
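'''
The accuracy worked out above can also be computed directly with sklearn
(a small sketch, assuming the df from the run above):
'''
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(df['Positive'], df['sentiment']))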

# Visualization
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30, histtype='stepfilled', color="red")
plt.title("Histogram Display")
plt.xlabel("Marks")
plt.ylabel("Number of Students")
plt.show()
# Analyzing Hotel Bookings data
# https://github.com/swapnilsaurav/Dataset/blob/master/hotel_bookings.csv
link = "https://raw.githubusercontent.com/swapnilsaurav/Dataset/master/hotel_bookings.csv"
import pandas as pd
df = pd.read_csv(link)
#print("Shape of the data: ", df.shape)
#print("Data types of the columns:", df.dtypes)
import numpy as np
df_numeric = df.select_dtypes(include=[np.number])
#print(df_numeric)
numeric_cols = df_numeric.columns.values
print("Numeric column names: ", numeric_cols)
df_nonnumeric = df.select_dtypes(exclude=[np.number])
#print(df_nonnumeric)
nonnumeric_cols = df_nonnumeric.columns.values
print("Non Numeric column names: ", nonnumeric_cols)

####
#preprocessing the data
import seaborn as sns
import matplotlib.pyplot as plt
colors = ["#091AEA", "#EA5E09"]
cols = df.columns
sns.heatmap(df[cols].isnull(), cmap=sns.color_palette(colors))
plt.show()

cols_to_drop = []
for col in cols:
    pct_miss = np.mean(df[col].isnull()) * 100
    if pct_miss > 80:
        #print(f"{col} -> {pct_miss}")
        cols_to_drop.append(col)   # column list to drop

# remove these columns since they have more than 80% missing values
df = df.drop(cols_to_drop, axis=1)

for col in df.columns:
    pct_miss = np.mean(df[col].isnull()) * 100
    if pct_miss > 80:
        print(f"{col} -> {pct_miss}")
    # check each column for missing values and keep an indicator
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0:
        df[f'{col}_ismissing'] = missing
        #print(f"Created Missing Indicator for {col}")

### keeping track of the missing values
ismissing_cols = [col for col in df.columns if '_ismissing' in col]
df['num_missing'] = df[ismissing_cols].sum(axis=1)
print(df['num_missing'])

# drop rows with > 12 missing values
ind_missing = df[df['num_missing'] > 12].index
df = df.drop(ind_missing, axis=0)   # ROWS DROPPED

# count the remaining missing values
for col in df.columns:
    pct_miss = np.mean(df[col].isnull()) * 100
    if pct_miss > 0:
        print(f"{col} -> {pct_miss}")

'''
We are still left with the following missing values:
children -> 2.0498257606219004 # numeric
babies -> 11.311318858061922 # numeric
meal -> 11.467129071170085 # non-numeric
country -> 0.40879238707947996 # non-numeric
deposit_type -> 8.232810615199035 # non-numeric
agent -> 13.687005763302507 # numeric
'''
# HANDLING NUMERIC MISSING VALUES: fill with the median
df_numeric = df.select_dtypes(include=[np.number])
for col in df_numeric.columns.values:
    pct_miss = np.mean(df[col].isnull()) * 100
    if pct_miss > 0:
        med = df[col].median()
        df[col] = df[col].fillna(med)

# HANDLING NON-NUMERIC MISSING VALUES: fill with the most frequent value
df_nonnumeric = df.select_dtypes(exclude=[np.number])
for col in df_nonnumeric.columns.values:
    pct_miss = np.mean(df[col].isnull()) * 100
    if pct_miss > 0:
        mode = df[col].describe()['top']
        df[col] = df[col].fillna(mode)


print("#count for missing values")
for col in df.columns:
    pct_miss = np.mean(df[col].isnull()) * 100
    if pct_miss > 0:
        print(f"{col} -> {pct_miss}")

# drop duplicate values
print("Shape before dropping duplicates: ", df.shape)
df = df.drop('id', axis=1).drop_duplicates()
print("Shape after dropping duplicates: ", df.shape)

DAY 73: Power BI (Coming Soon)

DAY 74: Tableau (Coming Soon)

That's the end of the course. The entire content is presented across these 4 blog pages.