Wednesday, February 26, 2020

SMS Spam detection model using RNN

Dataset source: https://www.kaggle.com/uciml/sms-spam-collection-dataset

Notebook: https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_RNN.ipynb

SMS spam detection model (RNN)


A machine learning model that classifies SMS texts as ham or spam:
·      The dataset contains 11,144 data points (rows) in total.
·      Split the data into train and test sets (80% and 20%).
·      Used the torchtext library for text processing.
·      Used the NLTK library for word tokenization.
·      Built a vocabulary from the top 10,500 words in the testing data.
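
The 80/20 split from the list above can be sketched in plain Python. This is only an illustration and not the code from the notebook, which uses torchtext. The function name train_test_split, the seed, and the toy rows are all assumptions made for this example.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test lists."""
    shuffled = list(rows)
    # A seeded Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy rows standing in for (label, text) pairs.
rows = [("ham", f"message {i}") for i in range(10)]
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Shuffling before cutting matters when the file is ordered in some way, otherwise the test set could end up with a different mix of ham and spam than the training set.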

The file contains one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
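
A minimal sketch of reading that two-column layout with Python's csv module. The in-memory sample and the helper name read_messages are assumptions for illustration; the real file would be opened from disk instead.

```python
import csv
import io

# Hypothetical two-row sample in the same v1/v2 layout as the dataset.
sample = io.StringIO(
    "v1,v2\n"
    'ham,"Ok, see you at 5"\n'
    'spam,"WINNER! Claim your prize now"\n'
)

def read_messages(handle):
    """Return a list of (label, text) pairs from a CSV with v1 and v2 columns."""
    reader = csv.DictReader(handle)
    return [(row["v1"], row["v2"]) for row in reader]

messages = read_messages(sample)
print(messages[0])  # ('ham', 'Ok, see you at 5')
```

When reading the downloaded CSV, the file may need an explicit encoding such as open(path, encoding="latin-1"), so it is worth checking if the default encoding raises a decode error.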

NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits tokens based on white space and punctuation.
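
To show the idea without needing NLTK and its tokenizer data installed, here is a rough regex stand-in that splits on white space and punctuation. It is not word_tokenize() itself: NLTK handles cases such as contractions differently (for example it splits "don't" into "do" and "n't"). The name simple_tokenize is made up for this sketch.

```python
import re

# A run of word characters, or any single character that is
# neither a word character nor white space (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Free entry! Call now.")
print(tokens)  # ['Free', 'entry', '!', 'Call', 'now', '.']
```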

 

Next, we have to build a vocabulary. This is effectively a lookup table where every unique word in the data set has a corresponding index (an integer). Each index is used to construct a one-hot vector for its word.
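
The lookup table and the one-hot vectors can be sketched in plain Python as follows. This is an illustration under assumptions, not the torchtext code from the notebook: the helper names build_vocab and one_hot, the "<unk>" and "<pad>" special tokens, and the toy texts are all chosen for this example.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Map the max_size most frequent words (plus special tokens) to indices."""
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    words = list(specials) + [w for w, _ in counts.most_common(max_size)]
    return {word: index for index, word in enumerate(words)}

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    # Words outside the vocabulary fall back to the "<unk>" index.
    vector[vocab.get(word, vocab["<unk>"])] = 1
    return vector

# Hypothetical tokenized messages.
texts = [["free", "prize", "now"], ["call", "me", "now"]]
vocab = build_vocab(texts, max_size=3)

# "now" is the most frequent word, so it gets the first non-special index.
print(vocab["now"])           # 2
print(one_hot("now", vocab))  # [0, 0, 1, 0, 0]
```

With a vocabulary of 10,500 words each one-hot vector would have over ten thousand entries, nearly all zero, which is why models usually store only the index and look the vector up when needed.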

The model reached 85% training accuracy but only 69% test accuracy, a gap that suggests it is overfitting the training data.
