By: Ram on Jul 24, 2022
We get hundreds of emails every day, and not all of them are relevant to us. We often get irritated with the irrelevant emails and wish to segregate the spam emails from the genuine ones. Machine Learning algorithms can do this job quite effectively.
In this blog, we will consider a sample dataset and develop a Spam Classification Model.
An email contains several pieces of information, and ideally we could consider all of it to classify an email as Spam or not. Key email attributes include the sender, the subject, and the email content.
Since the data is textual, we need to leverage Natural Language Processing (NLP) to develop the Machine Learning model. We will discuss some of the NLP concepts in the relevant sections.
The overall approach to develop the model is outlined below. First, we import the required libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from bs4 import BeautifulSoup
from wordcloud import WordCloud, STOPWORDS
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

stop_words = stopwords.words('english')
wordnet_lemmatizer = WordNetLemmatizer()
We need to develop an ML model to classify emails into Spam and Non-Spam. For our algorithm to learn what Spam emails look like and how they differ from non-Spam emails, we need email data along with labels (Spam or Non-Spam).
There are a number of spam-labelled datasets available for us to use. In this example, we are using a dataset available on Kaggle.
We can download and read the data. The downloaded zipped folder contains two folders, spam and ham, and we need to read each of the emails saved in the respective folder.
We have used some of the Python functions from this workbook.
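The reading step can be sketched as below. This is a minimal sketch, not the exact workbook code: the folder names `spam`/`ham` follow the archive layout described above, and the `latin-1` encoding is an assumption about the raw email files.

```python
import os
import pandas as pd

def read_emails(folder, label):
    # Read every file in `folder` and tag it with `label` (1 = spam, 0 = ham).
    rows = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        # latin-1 is an assumption; raw email dumps often contain non-UTF-8 bytes.
        with open(path, encoding='latin-1') as f:
            rows.append({'text': f.read(), 'label': label})
    return rows

# Usage (folder names assumed from the downloaded archive layout):
# emails_df = pd.DataFrame(read_emails('spam', 1) + read_emails('ham', 0))
```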
Now that we have the dataset, it is always important to carry out Exploratory Data Analysis (EDA). It may give us insights and ideas to improve the data.
For a classification problem like Spam Classification, we must always check the distribution of the labels. We can visualize the spam/ham email counts using a bar chart.
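A quick sketch of that bar chart is shown below. The toy `emails_df` frame here stands in for the real dataset loaded earlier; only the `label` column (1 = spam, 0 = ham) is assumed.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame standing in for emails_df; the real labels come from the dataset.
emails_df = pd.DataFrame({'label': [1, 0, 0, 0, 1, 0]})

# sort_index() puts label 0 (ham) first, then label 1 (spam).
label_counts = emails_df['label'].value_counts().sort_index()
print(label_counts)

label_counts.plot(kind='bar')
plt.xticks([0, 1], ['Ham', 'Spam'], rotation=0)
plt.title('Spam vs Ham Email Counts')
plt.show()
```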
Since we have text content for each of these emails, we need to clean up the text data. Basic steps include lower-casing the text, removing punctuation and stop words, and lemmatizing the remaining words.
Once the text is cleaned, each email subject becomes a list of words, and the corpus becomes a list of such subjects.
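The cleaning steps can be sketched as a small helper. The inline stop-word set in the example is only for illustration; the real run would pass nltk's `stopwords.words('english')` and a `WordNetLemmatizer()` instance, as imported above.

```python
import re

def clean_text(text, stop_words, lemmatizer=None):
    # Lower-case, keep only alphabetic tokens, drop stop words,
    # and optionally lemmatize what remains.
    words = re.findall(r'[a-z]+', text.lower())
    words = [w for w in words if w not in stop_words]
    if lemmatizer is not None:
        words = [lemmatizer.lemmatize(w) for w in words]
    return words

# Tiny inline stop-word set for illustration only.
print(clean_text('Re: WIN a FREE prize now!!!', {'a', 're', 'now'}))
```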
We have a few preparation steps before creating the word clouds.
A word cloud helps in understanding the common words across the text corpus. To develop the word cloud, we need to find the words and their frequencies, and then plot their relative importance. So we need to find the n-grams of the text documents.
fig = plt.figure(figsize=(10,10))
words_dict = dict(words_freq[:200])
wordCloud = WordCloud(max_words=100, height=350, width=350, stopwords=stop_words)
wordCloud.generate_from_frequencies(words_dict)
plt.title('Word Cloud - Email Subjects')
plt.imshow(wordCloud, interpolation='bilinear')
plt.axis('off')
plt.show()
We can create separate word clouds for Spam and Ham emails.
For the model, the input data should be represented numerically. In this case we will convert the subjects' textual information into a numeric format: either TF-IDF vectors or word-count vectors. We will use the TF-IDF Vectorizer on the email subject data.
The TF-IDF Vectorizer computes a TF-IDF value for every word in the email subjects. TF-IDF gives a higher value to words that appear in fewer documents, so that commonly occurring words receive little importance while the meaningful, distinctive terms used in the email subjects receive more.
Term Frequency: TF(word) = (number of occurrences of the word in a document) / (total words in the document)
Inverse Document Frequency: IDF(word) = log((total number of documents) / (number of documents containing the word))
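The vectorization step can be sketched as below. The toy `emails_df` frame and its `subject` column name are assumptions for illustration; the real frame holds the cleaned subjects and labels from the dataset. The result is converted to a dense array since `GaussianNB`, used later, does not accept sparse input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Toy frame; the real emails_df holds the cleaned subjects and labels.
emails_df = pd.DataFrame({
    'subject': ['win a free prize now', 'meeting agenda attached',
                'free prize claim now', 'project status update'],
    'label': [1, 0, 1, 0],
})

tfidf = TfidfVectorizer()
# Dense (n_emails, n_terms) array; GaussianNB needs dense input.
X_idf = tfidf.fit_transform(emails_df['subject']).toarray()
print(X_idf.shape)
```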
y = emails_df['label']
X_train, X_test, y_train, y_test = train_test_split(X_idf, y, test_size=0.2, random_state=0)
print("Length of Train", len(X_train))
print("Length of Test", len(X_test))
Length of Train 2134
Length of Test 534
Training Text Classification Model - Spam Vs Ham
We can use any supervised ML method suitable for classification. Let's first fit our Naïve Bayes model on the TF-IDF vector-based X_train and the labelled sample y_train.
# Create a Gaussian Classifier
modelNB = GaussianNB()
# Train the model using the training sets (GaussianNB expects dense input)
modelNB.fit(X_train, y_train)
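The evaluation can be sketched as below; this is a self-contained illustration with random toy features standing in for the TF-IDF arrays, not the real dataset, so the printed numbers will differ from the metrics discussed next.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Toy dense features standing in for the TF-IDF train/test arrays.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5)); y_train = rng.integers(0, 2, 40)
X_test = rng.normal(size=(10, 5));  y_test = rng.integers(0, 2, 10)

modelNB = GaussianNB().fit(X_train, y_train)
y_pred = modelNB.predict(X_test)

# Rows are true labels (ham = 0, spam = 1); columns are predicted labels.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```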
Precision tells us the % of predicted cases that are correct: of the emails predicted as Spam, what % are actually Spam. In the above confusion matrix, 45% of the emails identified by the model as Spam are correctly classified.
Recall is also termed the capture rate: in this example, the % of all Spam emails captured by the model. In the above example, all the Spam emails are identified, so for Spam (label = 1) the recall is 1.
Since Precision is only 45% for the Spam class, we can find ways to improve the model.
More information on Precision & Recall - Blog
The current data sample is imbalanced: the two classes of the label variable are not equally represented. In this sample, there are 551 Spam emails but 2551 Ham emails. We can balance the data and then train the model.
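One simple way to balance the classes is to undersample the majority (Ham) class down to the Spam count, as sketched below on a toy frame; oversampling the minority class or SMOTE are alternatives. The real frame has 551 Spam and 2551 Ham emails.

```python
import pandas as pd

# Toy imbalanced frame; the real one has 551 spam vs 2551 ham emails.
emails_df = pd.DataFrame({'label': [1]*3 + [0]*9})

spam = emails_df[emails_df['label'] == 1]
ham = emails_df[emails_df['label'] == 0]

# Undersample ham to the spam count for a balanced training set.
balanced_df = pd.concat([spam, ham.sample(n=len(spam), random_state=0)])
print(balanced_df['label'].value_counts())
```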
We can explore different ML techniques such as SVM, XGBoost, or LSTM.
Currently, we have used only the subjects of the emails. We can also use the content of the emails, and we can create additional features from the subject and content.
We have learnt the steps to develop an email classification model that can classify emails as Spam or Ham. You can follow the same steps using the email content and subject together to improve the accuracy of the model.
Also try any or all of the above suggested steps to improve the model.