SMS Spam detection model using LSTM
Dataset source: https://www.kaggle.com/uciml/sms-spam-collection-dataset
Notebook: https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_LSTM.ipynb
A machine learning model that classifies SMS messages as ham or spam.
· The dataset contains 11,144 data points (rows).
· Split the data into training and test sets (80% and 20%).
· Used the torchtext library for text processing.
· Used the NLTK library for word tokenization.
· Built the vocabulary from the 10,500 most frequent words in the testing data.
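The 80/20 split listed above can be sketched with the standard library alone. This is only an illustration, not the torchtext call used in the notebook; the helper name train_test_split, the seed, and the sample rows are all hypothetical.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (label, text) rows standing in for the real dataset.
rows = [("ham", f"message {i}") for i in range(8)]
rows += [("spam", f"offer {i}") for i in range(2)]

train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split identical between runs, which makes the reported accuracies easier to reproduce.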
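In this model I used an LSTM architecture for prediction. An LSTM is an improved version of the RNN: it adds gates that control what the network keeps in memory, which helps it learn from longer sequences.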
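To make the idea of an LSTM concrete, here is a minimal sketch of a single LSTM time step in plain Python, following the standard textbook formulation (input, forget, and output gates plus a candidate memory). It is not the notebook's code; in practice a library layer such as torch.nn.LSTM would be used. The function names, sizes, and random weights are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single example, using plain Python lists.

    x is the input vector; h_prev and c_prev are the previous hidden and
    cell states. params maps each gate name to (W, U, b): input weights,
    recurrent weights, and bias.
    """
    def affine(name):
        W, U, b = params[name]
        return [
            sum(w * xi for w, xi in zip(W[j], x))
            + sum(u * hi for u, hi in zip(U[j], h_prev))
            + b[j]
            for j in range(len(b))
        ]

    i = [sigmoid(v) for v in affine("input")]        # how much new info to write
    f = [sigmoid(v) for v in affine("forget")]       # how much old memory to keep
    o = [sigmoid(v) for v in affine("output")]       # how much memory to expose
    g = [math.tanh(v) for v in affine("candidate")]  # candidate memory content

    # New cell state: keep part of the old memory, add part of the candidate.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    # New hidden state: a gated view of the squashed cell state.
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases for the four gates."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        name: (
            matrix(hidden_size, input_size),
            matrix(hidden_size, hidden_size),
            [0.0] * hidden_size,
        )
        for name in ("input", "forget", "output", "candidate")
    }

# Run a toy sequence of three 4-dimensional word vectors through the cell.
params = init_params(input_size=4, hidden_size=3)
h, c = [0.0] * 3, [0.0] * 3
for x in ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]):
    h, c = lstm_step(x, h, c, params)
print(h)
```

The separate cell state c is what distinguishes this from a plain RNN step: the forget gate lets information pass through many steps largely unchanged.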
The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
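As a rough sketch of reading that layout, the snippet below parses a small in-memory sample with the csv module. The sample rows and the helper name read_messages are hypothetical; the real file would be opened from disk with an encoding that matches it.

```python
import csv
import io

# Hypothetical sample in the v1/v2 layout described above; the real file
# would be opened with open(path, newline="", encoding=...) instead.
sample = io.StringIO(
    'v1,v2\n'
    'ham,"Ok, see you at 5"\n'
    'spam,"WINNER! Claim your free prize now"\n'
)

def read_messages(handle):
    """Return a list of (label, text) pairs from a CSV with v1 and v2 columns."""
    reader = csv.DictReader(handle)
    return [(row["v1"], row["v2"]) for row in reader]

messages = read_messages(sample)
for label, text in messages:
    print(label, "->", text)
```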
NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits text into tokens based on whitespace and punctuation.
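To show the kind of output this produces without requiring NLTK and its tokenizer data, here is a rough regex stand-in. It is not NLTK's algorithm (for example, it handles contractions differently); simple_tokenize is a hypothetical helper used only for illustration.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for a word tokenizer: runs of word characters become
    tokens and each punctuation character becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

# With NLTK installed, the equivalent call would be:
#   from nltk.tokenize import word_tokenize
#   word_tokenize("Free entry!! Call now.")
tokens = simple_tokenize("Free entry!! Call now.")
print(tokens)
```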
Next, we have to build a vocabulary. This is effectively a lookup table where every unique word in the dataset has a corresponding index (an integer). Each index is used to construct a one-hot vector for each word.
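A minimal sketch of such a lookup table, assuming tokenized messages as input, is shown below. The names build_vocab and one_hot, the tiny corpus, and the two special tokens for unknown words and padding are assumptions that mirror a common convention; the notebook itself relies on torchtext for this step.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"

def build_vocab(tokenized_texts, max_size):
    """Map the max_size most frequent tokens to integer indices.

    Indices 0 and 1 are reserved for the unknown and padding tokens.
    """
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(itos)}

def one_hot(index, size):
    """Vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Hypothetical two-message corpus; the post caps the vocabulary at 10,500.
corpus = [["free", "prize", "now"], ["see", "you", "now"]]
vocab = build_vocab(corpus, max_size=3)

# Words outside the vocabulary fall back to the <unk> index.
ids = [vocab.get(tok, vocab[UNK]) for tok in ["free", "now", "hello"]]
print(ids)
print(one_hot(vocab["now"], len(vocab)))
```

Capping the vocabulary keeps the one-hot vectors, and therefore the model's input layer, at a manageable size; rarer words all share the unknown-token index.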
The model reached 86% training accuracy but only 74% test accuracy, a gap that suggests overfitting.