SMS Spam detection model using RNN
Dataset source: https://www.kaggle.com/uciml/sms-spam-collection-dataset
Notebook: https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_RNN.ipynb
SMS spam detection model (RNN)
The files contain one message per line. Each line consists of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
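As a rough illustration of that layout, the sketch below reads (label, text) pairs with Python's standard csv module. The two sample rows and the helper name load_messages are made up for the example; only the column names v1 and v2 come from the dataset description above.

```python
import csv
import io

# Hypothetical two-row sample in the layout described above (header row: v1,v2).
SAMPLE = (
    'v1,v2\n'
    'ham,"Ok lar... Joking wif u oni..."\n'
    'spam,"WINNER!! Claim your prize now, call 0900"\n'
)

def load_messages(handle):
    """Return a list of (label, text) pairs from a CSV with v1 and v2 columns."""
    reader = csv.DictReader(handle)
    return [(row["v1"], row["v2"]) for row in reader]

# io.StringIO stands in for an open file handle so the example is self-contained.
messages = load_messages(io.StringIO(SAMPLE))
for label, text in messages:
    print(label, "->", text)
```

With a real file, the same function can be given the handle returned by open() instead of the in-memory sample.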
NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits text into tokens on whitespace and punctuation.
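To show the idea without requiring NLTK or its tokenizer data, here is a minimal standard-library approximation. It is not word_tokenize() itself, which applies additional rules (for example for contractions); simple_tokenize and TOKEN_PATTERN are hypothetical names used only for this sketch.

```python
import re

# \w+ matches a run of letters, digits or underscores (a "word");
# [^\w\s] matches one punctuation character on its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens, discarding whitespace."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Free entry! Call now, win cash.")
print(tokens)
# ['Free', 'entry', '!', 'Call', 'now', ',', 'win', 'cash', '.']
```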
Next, we have to build a vocabulary. This is effectively a lookup table in which every unique word in the dataset has a corresponding index (an integer). Each index is used to construct a one-hot vector for that word.
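A small sketch of that lookup table and the one-hot vectors derived from it, using plain Python rather than torchtext. The toy corpus and the function names build_vocab and one_hot are assumptions made for illustration.

```python
def build_vocab(tokenized_messages):
    """Map each unique token to an integer index, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_messages:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def one_hot(index, size):
    """Return a list of zeros of length `size` with a single 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

# Toy corpus of already-tokenized messages (made up for this example).
corpus = [["free", "cash", "now"], ["call", "me", "now"]]
vocab = build_vocab(corpus)
print(vocab)                                # {'free': 0, 'cash': 1, 'now': 2, 'call': 3, 'me': 4}
print(one_hot(vocab["cash"], len(vocab)))   # [0, 1, 0, 0, 0]
```

Each vector is as long as the vocabulary, so in practice the integer indices are usually stored and expanded (or embedded) only when the model needs them.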
The model reached 85% training accuracy but only 69% test accuracy, a gap that suggests overfitting.
A machine learning model that classifies SMS texts as ham or spam:
· The dataset contains 11144 data points (rows) in total.
· Split the data into train and test sets (80% and 20%).
· Used the torchtext library for text processing.
· Used the NLTK library for word tokenization.
· Built the vocabulary from the 10,500 most frequent words in the testing data.
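The split and the size-capped vocabulary from the list above can be sketched in plain Python as follows. This is not the torchtext code from the notebook: the function names, the fixed seed, the "<unk>" placeholder for out-of-vocabulary words and the sample rows are all assumptions, and which split the vocabulary is built from is left to the caller.

```python
import random
from collections import Counter

def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_ratio)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]

def build_capped_vocab(tokenized_messages, max_size):
    """Index only the max_size most frequent tokens; index 0 is reserved for '<unk>'."""
    counts = Counter(token for tokens in tokenized_messages for token in tokens)
    vocab = {"<unk>": 0}
    for token, _count in counts.most_common(max_size):
        vocab[token] = len(vocab)
    return vocab

# Made-up (label, text) rows standing in for the real dataset.
rows = [("ham", "see you soon")] * 8 + [("spam", "win cash now")] * 2
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))

# Whitespace splitting keeps the example short; any tokenizer could be used here.
tokenized = [text.split() for _label, text in train_rows]
capped_vocab = build_capped_vocab(tokenized, max_size=10500)
print(len(capped_vocab))
```

Words outside the retained set map to the "<unk>" index at lookup time, for example with capped_vocab.get(token, 0).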