Wednesday, February 26, 2020

SMS Spam detection model using LSTM

DataSet Source: https://www.kaggle.com/uciml/sms-spam-collection-dataset

https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_LSTM.ipynb


SMS spam detection model (LSTM)


A machine learning model that classifies SMS texts as ham or spam.
·      The dataset contains 11,144 data points (rows) in total.
·      Split the data into training and test sets (80% and 20%).
·      Used the torchtext library for text processing.
·      Used the NLTK library for word tokenization.
·      Built the vocabulary from the 10,500 most frequent words in the training data.
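
To make the 80/20 split listed above concrete, here is a minimal sketch in plain Python. The helper name train_test_split, the fixed seed, and the toy rows are my own assumptions for illustration; the notebook itself may split the data differently (for example through torchtext).

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for repeatable splits
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for the labelled messages (not the real dataset).
rows = [("ham", f"message {i}") for i in range(8)]
rows += [("spam", f"offer {i}") for i in range(2)]

train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters because the rows of a file are not guaranteed to be in random order.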

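In this model I used an LSTM architecture for prediction. An LSTM (long short-term memory) network is an improved version of the plain RNN: it adds gates and a separate cell state, which help it retain information over longer sequences.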
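
To show what an LSTM adds over a plain RNN, here is a rough sketch of a single LSTM time step in plain Python with scalar values. This is only an illustration of the standard gate equations, not the model from the notebook; the function name lstm_step and the weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input, hidden state, and cell state.

    params maps each gate name ("i", "f", "o", "g") to a (w_x, w_h, b) tuple.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate: how much new information to write
    f = sigmoid(pre("f"))      # forget gate: how much of the old cell state to keep
    o = sigmoid(pre("o"))      # output gate: how much of the cell state to expose
    g = math.tanh(pre("g"))    # candidate value to write into the cell
    c = f * c_prev + i * g     # updated cell state (the long-term memory)
    h = o * math.tanh(c)       # updated hidden state (the output at this step)
    return h, c

# Hypothetical weights, chosen only for illustration.
params = {
    "i": (0.5, 0.1, 0.0),
    "f": (0.3, 0.2, 1.0),
    "o": (0.4, 0.1, 0.0),
    "g": (0.9, 0.3, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:    # a tiny input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

A plain RNN only has the hidden state h. The extra cell state c, controlled by the forget and input gates, is what lets an LSTM carry information across many time steps.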


The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
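
As a small sketch of reading this two-column layout, the snippet below parses CSV text with the standard csv module. The two rows are made up for illustration (they are not taken from the dataset), and read_messages is a hypothetical helper name.

```python
import csv
import io

# Two made-up rows in the v1/v2 layout described above (not real dataset rows).
sample_csv = io.StringIO(
    "v1,v2\n"
    'ham,"See you at lunch, ok?"\n'
    "spam,WIN a FREE prize now!\n"
)

def read_messages(handle):
    """Return a list of (label, text) pairs from a CSV file object with v1 and v2 columns."""
    reader = csv.DictReader(handle)
    return [(row["v1"], row["v2"]) for row in reader]

messages = read_messages(sample_csv)
for label, text in messages:
    print(label, "|", text)
```

With the real file, the same function should work on an opened file handle instead of the in-memory string.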

NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits the text into tokens based on whitespace and punctuation.
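
To illustrate the idea of splitting on whitespace and punctuation, here is a simplified stand-in written with a regular expression. It is not NLTK's word_tokenize(), which applies more elaborate rules (for example, it splits "Don't" into "Do" and "n't"). The name simple_tokenize and the lowercasing step are my own assumptions.

```python
import re

# Stand-in pattern: runs of word characters, or runs of punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (simplified, not NLTK)."""
    # Lowercasing is an assumption here; it keeps "FREE" and "free" as one word.
    return TOKEN_PATTERN.findall(text.lower())

tokens = simple_tokenize("WIN a FREE prize now!")
print(tokens)
```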

 

Next, we have to build a vocabulary. This is effectively a lookup table in which every unique word in the data set has a corresponding index (an integer). Each index is used to construct a one-hot vector for each word.
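
The lookup table and the one-hot vectors can be sketched in plain Python as below. This is only an illustration under my own assumptions: the helper names build_vocab and one_hot, the special tokens <unk> and <pad>, and the tiny max_size are hypothetical and stand in for what torchtext does in the notebook.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"   # assumed special tokens for unknown words and padding

def build_vocab(tokenized_texts, max_size):
    """Map the max_size most frequent tokens (plus special tokens) to integer indices."""
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Toy tokenized training messages (not real data).
train_tokens = [["free", "prize", "now"], ["see", "you", "now"], ["free", "now"]]
stoi, itos = build_vocab(train_tokens, max_size=3)

# Words outside the vocabulary fall back to the <unk> index.
ids = [stoi.get(tok, stoi[UNK]) for tok in ["free", "lunch", "now"]]
print(ids)
print(one_hot(ids[0], len(itos)))
```

Capping the vocabulary size keeps rare words out of the table; anything not in it is treated as the unknown token.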


The model reached 86% training accuracy but only 74% test accuracy, a gap that suggests it is overfitting the training data.
