Sivasankar Blog: February 2020

Wednesday, February 26, 2020

SMS Spam detection model using LSTM


DataSet Source: https://www.kaggle.com/uciml/sms-spam-collection-dataset

https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_LSTM.ipynb

SMS spam detection model (LSTM)


A machine learning model that classifies SMS texts as ham or spam.

· The dataset contains 11,144 data points (rows) in total.

· Split the data into train and test sets (80% and 20%).

· Used the torchtext library for text processing.

· Used the NLTK library for word tokenization.

· Built the vocabulary from the 10,500 most frequent words in the testing data.
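The 80/20 split listed above can be sketched in plain Python. This is only an illustration, not the torchtext code from the notebook: the function name `train_test_split`, the fixed seed, and the toy rows are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible (the seed is an assumption).
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data standing in for the (label, text) rows.
rows = [("ham", f"message {i}") for i in range(10)]
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Shuffling before splitting matters when the file is ordered in some way, because otherwise the test set may not be representative of the training set.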

In this model I used an LSTM architecture for prediction. LSTM is an improved version of the basic RNN that uses gates to retain information over longer sequences.
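For reference, the textbook LSTM cell update is written out below. It is not taken from the notebook, and the symbols are notation introduced here: x_t is the input at step t, h_t the hidden state, c_t the cell state, W, U and b the learned weights and biases, σ the sigmoid function, and ⊙ element-wise multiplication. A plain RNN has only a single update of the form h_t = tanh(W x_t + U h_{t-1} + b).

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate f_t and input gate i_t decide how much of the old cell state is kept and how much new information is written, which is what lets the cell carry information across many time steps.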

The file contains one message per line. Each line has two columns: v1 holds the label (ham or spam) and v2 holds the raw text.
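A minimal sketch of reading this two-column layout with the standard library is shown below. The sample rows, the `load_messages` helper and the 0/1 label encoding are assumptions for illustration. The real file would be opened from disk instead of an in-memory string, and it may need an explicit encoding.

```python
import csv
import io

# Hypothetical two-row sample in the v1/v2 layout described above.
sample = io.StringIO(
    "v1,v2\n"
    'ham,"See you at lunch, ok?"\n'
    'spam,"WINNER! Claim your free prize now"\n'
)

LABEL_TO_ID = {"ham": 0, "spam": 1}  # assumed label encoding

def load_messages(handle):
    """Return a list of (label_id, text) pairs from a CSV with v1/v2 columns."""
    reader = csv.DictReader(handle)
    return [(LABEL_TO_ID[row["v1"]], row["v2"]) for row in reader]

messages = load_messages(sample)
print(messages[0])  # (0, 'See you at lunch, ok?')
```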

NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits the text on whitespace and punctuation.
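To show the idea without NLTK's tokenizer data, here is a rough standard-library approximation. It is not NLTK's algorithm: `simple_tokenize` and its regular expression are assumptions for this sketch, and word_tokenize() treats cases such as contractions differently (for example, it splits "don't" into "do" and "n't").

```python
import re

# A token is either a run of word characters or one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Free entry! Text WIN to 80086 now.")
print(tokens)
# ['Free', 'entry', '!', 'Text', 'WIN', 'to', '80086', 'now', '.']
```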

Next, we have to build a vocabulary. This is effectively a lookup table in which every unique word in the dataset has a corresponding index (an integer). Each index is used to construct a one-hot vector for each word.
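The lookup table and the one-hot construction can be sketched as follows. This is a plain-Python illustration rather than the torchtext call used in the notebook. The special tokens `<unk>` and `<pad>`, the helper names and the tiny corpus are assumptions.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # assumed special tokens

def build_vocab(tokenized_texts, max_size):
    """Map the `max_size` most frequent tokens to integer indices."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(itos)}

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Hypothetical tokenized messages; "now" is the most frequent token.
corpus = [["free", "prize", "now"], ["see", "you", "now"]]
vocab = build_vocab(corpus, max_size=3)

print(vocab["now"])                       # 2
print(one_hot(vocab["now"], len(vocab)))  # [0, 0, 1, 0, 0]
print(vocab.get("see", vocab[UNK]))       # 0, rare words fall back to <unk>
```

Capping the vocabulary size means that rare words share the `<unk>` index, which keeps the one-hot vectors (and any embedding layer built on them) at a fixed, manageable size.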

The model reached 86% training accuracy but only 74% test accuracy, a gap that suggests overfitting.

SMS Spam detection model using RNN


DataSet Source: https://www.kaggle.com/uciml/sms-spam-collection-dataset

https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_RNN.ipynb

SMS spam detection model (RNN)


A machine learning model that classifies SMS texts as ham or spam.

· The dataset contains 11,144 data points (rows) in total.

· Split the data into train and test sets (80% and 20%).

· Used the torchtext library for text processing.

· Used the NLTK library for word tokenization.

· Built the vocabulary from the 10,500 most frequent words in the testing data.

The file contains one message per line. Each line has two columns: v1 holds the label (ham or spam) and v2 holds the raw text.

NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits the text on whitespace and punctuation.

Next, we have to build a vocabulary. This is effectively a lookup table in which every unique word in the dataset has a corresponding index (an integer). Each index is used to construct a one-hot vector for each word.
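Once a vocabulary exists, each tokenized message becomes a sequence of indices that the RNN reads one step at a time. The sketch below shows that conversion plus padding to a common length. The small `vocab` dictionary, the helper names and the choice of 0 for unknown words and 1 for padding are assumptions, not values from the notebook.

```python
def numericalize(tokens, vocab, unk_index=0):
    """Replace each token with its vocabulary index (unknown words get unk_index)."""
    return [vocab.get(tok, unk_index) for tok in tokens]

def pad_batch(sequences, pad_index=1):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]

# Hypothetical vocabulary and messages.
vocab = {"<unk>": 0, "<pad>": 1, "call": 2, "now": 3, "free": 4}
batch = [
    numericalize(["call", "now"], vocab),
    numericalize(["free", "entry", "call", "now"], vocab),
]
padded = pad_batch(batch)
print(padded)  # [[2, 3, 1, 1], [4, 0, 2, 3]]
```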

The model reached 85% training accuracy but only 69% test accuracy, a gap that suggests overfitting.

Wednesday, February 19, 2020

Ship classification (CNN)

1. A computer vision hackathon on Analytics Vidhya in which participants had to classify images of ships into 5 classes (Cargo, Military, Carrier, Cruise, Tanker).

2. Most of the images were captured from a very long distance, so there is a high chance of noise in the images.

3. Used the pre-trained model ‘SENet154’ for training. Also unfroze some of the model parameters and fine-tuned them.

GitHub:

click here to get github url

Whale identification (CNN)

1. The challenge is to build an algorithm that identifies individual whales in images.

2. The task involves analysing Happywhale’s database of over 25,000 images, gathered from research institutions and public contributors.

3. The training data contains thousands of images of humpback whale flukes. Individual whales have been identified by researchers and given an Id.

4. Used the PyTorch framework to build the CNN model: a 2-layer network with batch normalization.
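As a rough illustration of the batch normalization mentioned in step 4, the sketch below normalizes one feature across a batch in plain Python. It is not the PyTorch layer used in the model. The function name, the default `gamma`, `beta` and `eps` values, and the sample activations are assumptions.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    # Biased (population) variance over the batch.
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

activations = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations for one feature
normalized = batch_norm(activations)
print(normalized)
```

After this step the values have a mean close to 0 and a variance close to 1, which tends to make training less sensitive to the scale of earlier layers.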

GitHub:

github solution url click here
