Sivasankar Blog: SMS Spam detection model using LSTM

Wednesday, February 26, 2020

SMS Spam detection model using LSTM

Dataset source: https://www.kaggle.com/uciml/sms-spam-collection-dataset

Notebook: https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_LSTM.ipynb

A machine learning model that classifies SMS texts as ham or spam.

· The dataset contains 11,144 data points (rows) in total.

· Split the data into training and test sets in an 80% / 20% ratio.

· Used the torchtext library for text processing.

· Used the NLTK library for word tokenization.

· Built the vocabulary from the top 10,500 words in the testing data.
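The 80% / 20% split listed above can be sketched with the standard library alone. This is only an illustration: the helper name `train_test_split`, the seed, and the toy rows are my assumptions, not code from the notebook.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    shuffled = list(rows)       # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy (label, text) rows; the real rows would come from the dataset file.
rows = [("ham", f"message {i}") for i in range(8)]
rows += [("spam", f"offer {i}") for i in range(2)]

train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Shuffling before slicing matters here because the rows may be ordered in the file, and a plain slice could then give the test set a different ham/spam mix than the training set.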

In this model I used an LSTM architecture for prediction. LSTM is an improved version of the RNN that is better at retaining information across long sequences.
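To make the difference from a plain RNN concrete, here is a minimal sketch of a single LSTM time step with a scalar input and a scalar hidden state. It is not the notebook's model, which would use a library layer with vector states; the function name `lstm_step` and the weight values are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for a scalar input and scalar state.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # the new information itself
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # additive cell-state update
    h = o * math.tanh(c)              # new hidden state
    return h, c

# Arbitrary illustrative weights, shared by all four gates.
gates = ("forget", "input", "candidate", "output")
weights = {gate: (0.5, 0.5, 0.0) for gate in gates}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:            # a toy input sequence
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

The line `c = f * c_prev + i * g` is the key difference: when the forget gate stays close to 1, the cell state is carried forward almost unchanged, which is what lets an LSTM remember information over more steps than a plain RNN.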

The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
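A minimal sketch of reading that two-column layout with Python's `csv` module follows. The two sample rows are invented and held in memory so the snippet runs on its own; the helper name `read_messages` and the spam = 1 / ham = 0 encoding are my assumptions.

```python
import csv
import io

# Two made-up rows standing in for the real file, which uses the same v1/v2 header.
sample = io.StringIO(
    'v1,v2\n'
    'ham,"Ok, see you at 5"\n'
    'spam,WINNER! Claim your free prize now\n'
)

def read_messages(handle):
    """Return (label, text) pairs from a CSV with v1 (label) and v2 (text) columns."""
    return [(row["v1"], row["v2"]) for row in csv.DictReader(handle)]

messages = read_messages(sample)

# Encode the labels as integers for binary classification: spam -> 1, ham -> 0.
labels = [1 if label == "spam" else 0 for label, _ in messages]
print(messages)
print(labels)
```

Using `csv.DictReader` rather than splitting on commas by hand keeps messages that contain commas intact, as in the first sample row.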

NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits the text on whitespace and punctuation.
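To show the idea without installing NLTK or its tokenizer data, here is a rough regex stand-in. It is not `word_tokenize()` and will differ from it on contractions, abbreviations and similar cases; the function name `simple_tokenize` is hypothetical.

```python
import re

# A run of word characters, or a single character that is neither word nor space.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Free entry!! Call now.")
print(tokens)  # ['Free', 'entry', '!', '!', 'Call', 'now', '.']
```

With NLTK installed and its tokenizer data downloaded, the call would be `nltk.word_tokenize(text)` instead.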

Next, we have to build a vocabulary. This is effectively a lookup table in which every unique word in the data set has a corresponding index (an integer). Each index is used to construct a one-hot vector for its word.
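The lookup table and the one-hot vectors can be sketched in a few lines. This is a plain-Python illustration, not the torchtext call from the notebook; the helper names and the `<unk>` / `<pad>` special tokens are assumptions that mirror common practice.

```python
from collections import Counter

def build_vocab(token_lists, max_size):
    """Map the max_size most frequent tokens to integer indices."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Reserve index 0 for unknown words and index 1 for padding.
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(itos)}

def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab.get(token, vocab["<unk>"])] = 1   # unseen words map to <unk>
    return vec

corpus = [["free", "prize", "now"], ["see", "you", "now"]]
vocab = build_vocab(corpus, max_size=3)

print(vocab["now"])              # 2
print(one_hot("now", vocab))     # [0, 0, 1, 0, 0]
print(one_hot("hello", vocab))   # [1, 0, 0, 0, 0]
```

Capping the vocabulary with `max_size` plays the same role as the 10,500-word limit: rare words fall outside the table and share the `<unk>` index.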

The model reached 86% training accuracy but only 74% test accuracy, a gap which suggests that it is overfitting the training data.
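For reference, accuracy for a binary classifier can be computed as below. This is a sketch with made-up probabilities and labels; the 0.5 threshold and the helper names are assumptions, not values taken from the notebook.

```python
def binarize(probabilities, threshold=0.5):
    """Turn predicted spam probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

probs = [0.9, 0.2, 0.6, 0.1]   # hypothetical model outputs
labels = [1, 0, 0, 0]          # 1 = spam, 0 = ham

acc = accuracy(binarize(probs), labels)
print(acc)  # 0.75
```

Running the same function on the training set and on the test set gives the two numbers compared above.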
