Wednesday, February 26, 2020

SMS Spam detection model using LSTM


DataSet Source: https://www.kaggle.com/uciml/sms-spam-collection-dataset

https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_LSTM.ipynb


SMS spam detection model (LSTM)


Machine learning model to classify SMS texts as ham or spam
·      Total 11,144 data points (rows) present in the dataset.
·      Split the data into train and test sets in an 80/20 ratio.
·      Used the torchtext library for text processing.
·      Used the NLTK library for word tokenization.
·      Built the vocabulary from the top 10,500 words in the training data.
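The 80/20 split above can be sketched with a simple shuffled split (a plain-Python sketch; the notebook itself may use torchtext's split utilities, and the names here are illustrative):

```python
import random

def train_test_split(rows, train_frac=0.8, seed=42):
    """Shuffle the rows and split them into train/test partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)      # deterministic shuffle
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# Example with dummy (label, text) rows
data = [("ham", f"message {i}") for i in range(100)]
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```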

In this model I used an LSTM architecture for prediction. An LSTM is an improved version of the plain RNN that handles long-range dependencies better.
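A minimal sketch of such an LSTM classifier in PyTorch (hyper-parameters like the embedding and hidden sizes are illustrative, not the notebook's actual values):

```python
import torch
import torch.nn as nn

class SpamLSTM(nn.Module):
    """Embedding -> LSTM -> linear head for binary ham/spam classification."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)   # one logit: spam vs. ham

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)    # (batch, seq, embed)
        _, (hidden, _) = self.lstm(embedded)    # hidden: (1, batch, hidden)
        return self.fc(hidden[-1]).squeeze(1)   # (batch,) logits

model = SpamLSTM(vocab_size=10_502)  # top 10,500 words + <unk>/<pad>
logits = model(torch.randint(0, 10_502, (4, 20)))  # batch of 4, 20 tokens each
print(logits.shape)  # torch.Size([4])
```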


The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
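Reading those two columns can be sketched with the standard csv module (the Kaggle file is spam.csv; a small in-memory sample stands in for the file here):

```python
import csv
import io

# In-memory stand-in for spam.csv, which has columns v1 (label) and v2 (text)
sample = io.StringIO(
    "v1,v2\n"
    "ham,Ok lar... Joking wif u oni...\n"
    "spam,Free entry in 2 a wkly comp to win FA Cup final tkts\n"
)

rows = [(r["v1"], r["v2"]) for r in csv.DictReader(sample)]
labels = [label for label, _ in rows]
print(labels)  # ['ham', 'spam']
```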

NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits text into tokens based on whitespace and punctuation.
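In the notebook this is nltk.word_tokenize (which needs the 'punkt' model downloaded); for illustration, its whitespace-and-punctuation behaviour can be approximated with a small regex:

```python
import re

def simple_tokenize(text):
    """Rough stand-in for nltk.word_tokenize: word runs vs. punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Free entry!! Text WIN to 80086 now.")
print(tokens)
# ['Free', 'entry', '!', '!', 'Text', 'WIN', 'to', '80086', 'now', '.']
```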

 

Next, we have to build a vocabulary. This is effectively a lookup table where every unique word in the dataset has a corresponding index (an integer). Each index is used to construct a one-hot vector for each word.
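Building that lookup table can be sketched with a Counter over the training tokens (torchtext's build_vocab does this internally; max_size here mirrors the 10,500-word cap, and the reserved-token names are illustrative):

```python
from collections import Counter

def build_vocab(token_lists, max_size=10_500):
    """Map the most frequent words to integer indices; 0/1 are reserved."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<unk>": 0, "<pad>": 1}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

vocab = build_vocab([["free", "entry", "now"], ["free", "win"]])
print(vocab["free"], vocab["<unk>"])  # "free" is most frequent -> index 2
```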


Got 86% training accuracy but only 74% test accuracy.

SMS Spam detection model using RNN


DataSet Source: https://www.kaggle.com/uciml/sms-spam-collection-dataset


https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_RNN.ipynb

SMS spam detection model (RNN)


Machine learning model to classify SMS texts as ham or spam
·      Total 11,144 data points (rows) present in the dataset.
·      Split the data into train and test sets in an 80/20 ratio.
·      Used the torchtext library for text processing.
·      Used the NLTK library for word tokenization.
·      Built the vocabulary from the top 10,500 words in the training data.
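This variant differs from the LSTM post mainly in the recurrent layer; a minimal PyTorch sketch with a vanilla nn.RNN (sizes are illustrative, not the notebook's exact configuration):

```python
import torch
import torch.nn as nn

class SpamRNN(nn.Module):
    """Embedding -> vanilla RNN -> linear head for ham/spam."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, hidden = self.rnn(embedded)          # hidden: (1, batch, hidden_dim)
        return self.fc(hidden[-1]).squeeze(1)   # (batch,) logits

rnn_model = SpamRNN(vocab_size=10_502)
rnn_logits = rnn_model(torch.randint(0, 10_502, (4, 20)))
print(rnn_logits.shape)  # torch.Size([4])
```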

The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits text into tokens based on whitespace and punctuation.

 

Next, we have to build a vocabulary. This is effectively a lookup table where every unique word in the dataset has a corresponding index (an integer). Each index is used to construct a one-hot vector for each word.

Got 85% training accuracy but only 69% test accuracy.

Wednesday, February 19, 2020

Ship classification (CNN)



1.     Computer Vision Hackathon on Analytics Vidhya, in which we were supposed to classify images of ships into 5 classes: Cargo, Military, Carrier, Cruise, and Tanker.
2.     Most of the images were captured from a very long distance, so there is a high chance of noise in the images.
3.     Used the pre-trained model 'SENet154'. Unfroze some of the model parameters and fine-tuned them.
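The freeze-then-unfreeze pattern used for fine-tuning can be sketched as below; a tiny stand-in network replaces SENet154 here (the real model comes from the pretrainedmodels package), and which layers get unfrozen is illustrative:

```python
import torch.nn as nn

# Tiny stand-in for a pre-trained backbone such as SENet154
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(),
    nn.Conv2d(8, 16, 3), nn.ReLU(),
)
head = nn.Linear(16, 5)  # 5 ship classes

# 1) Freeze every backbone parameter
for p in backbone.parameters():
    p.requires_grad = False

# 2) Unfreeze just the last conv block for fine-tuning
for p in backbone[2].parameters():
    p.requires_grad = True

trainable = [p for p in backbone.parameters() if p.requires_grad]
print(len(trainable))  # weight + bias of the unfrozen conv -> 2
```

Only the unfrozen parameters (plus the new classification head) receive gradient updates, which is what tuning a pre-trained model amounts to.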

 

      GitHub:


Whale identification (CNN)



1.     We are challenged to build an algorithm to identify individual whales in images.
2.     We need to analyse Happywhale's database of over 25,000 images, gathered from research institutions and public contributors.
3.     The training data contains thousands of images of humpback whale flukes. Individual whales have been identified by researchers and given an ID.
4.     Used the PyTorch framework to build the CNN model: a 2-layer NN with batch normalization.
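A minimal sketch of a 2-layer PyTorch network with batch normalization (the layer sizes and input resolution here are illustrative, not the notebook's actual values):

```python
import torch
import torch.nn as nn

class WhaleNet(nn.Module):
    """Two conv layers, each followed by batch normalization."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32),
            nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):                # x: (batch, 3, 64, 64)
        feats = self.features(x)         # -> (batch, 32, 16, 16)
        return self.classifier(feats.flatten(1))

net = WhaleNet(num_classes=10)
out = net(torch.randn(2, 3, 64, 64))
print(out.shape)  # torch.Size([2, 10])
```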

 

      GitHub:


Image noise comparison methods

1.     Reference-image techniques
       - peak_signal_noise_ratio (PSNR)
       - SSI
2.     Non-reference image technique
       - BRISQUE python pac...
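As a taste of the reference-image technique listed above, PSNR follows directly from the mean squared error, PSNR = 10·log10(MAX² / MSE); a pure-Python sketch over flat pixel lists (skimage.metrics.peak_signal_noise_ratio computes the same for image arrays):

```python
import math

def psnr(reference, noisy, max_val=255):
    """Peak signal-to-noise ratio between two equal-length pixel lists."""
    mse = sum((r - n) ** 2 for r, n in zip(reference, noisy)) / len(reference)
    return 10 * math.log10(max_val ** 2 / mse)

ref = [52, 55, 61, 59, 79, 61, 76, 61]
noisy = [v + 2 for v in ref]            # uniform +2 noise -> MSE = 4
print(round(psnr(ref, noisy), 2))  # 42.11
```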