Tuesday, October 6, 2020

Image noise comparison methods

 1. Reference-image techniques (a clean reference image is required)

    - peak_signal_noise_ratio (PSNR)

    - Structural similarity index (SSIM)

2. No-reference image techniques (no clean reference needed)

    - BRISQUE


python packages

1. skimage.metrics (see the sketch below)

2. sewar

3. pydicom

4. medpy
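
Below is a minimal sketch of the two reference-based metrics using skimage.metrics; the camera test image and the simulated Gaussian noise are stand-ins for a real clean/noisy pair.

from skimage import data, img_as_float
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from skimage.util import random_noise

clean = img_as_float(data.camera())           # reference (noise-free) image
noisy = random_noise(clean, mode='gaussian')  # simulated noisy version

print("PSNR:", peak_signal_noise_ratio(clean, noisy, data_range=1.0))  # higher = closer to reference
print("SSIM:", structural_similarity(clean, noisy, data_range=1.0))    # 1.0 = identical structure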

MRI denoising filters

1. Gaussian

2. Bilateral

3. Median

4. Total Variation

5. Wavelet 

6. Non local means

7. Block matching and 3D filtering (BM3D)
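
A minimal sketch applying several of the filters above to a 2D slice with scipy/scikit-image; the parameter values are illustrative only, and the camera test image stands in for an MRI slice.

from scipy.ndimage import gaussian_filter, median_filter
from skimage import data, img_as_float
from skimage.restoration import (denoise_bilateral, denoise_tv_chambolle,
                                 denoise_wavelet, denoise_nl_means)
from skimage.util import random_noise

noisy = random_noise(img_as_float(data.camera()), mode='gaussian')  # stand-in for a noisy MRI slice

denoised = {
    "gaussian":  gaussian_filter(noisy, sigma=1),
    "median":    median_filter(noisy, size=3),
    "bilateral": denoise_bilateral(noisy, sigma_color=0.05, sigma_spatial=2),
    "tv":        denoise_tv_chambolle(noisy, weight=0.1),
    "wavelet":   denoise_wavelet(noisy),
    "nl_means":  denoise_nl_means(noisy, h=0.05, patch_size=5, patch_distance=6),
}
# BM3D is not part of scikit-image; the separate bm3d package provides bm3d.bm3d(noisy, sigma_psd).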



Saturday, August 29, 2020

Spark must-know concepts

 Spark Main Concepts

 

Apache Spark is an open-source distributed general-purpose cluster-computing framework.

 

Why is Spark faster?

1.     In memory processing

2.     Supports data parallelism

3.     Lazy execution

 

Must-know concepts

1.     RDD

2.     Lineage graph

3.     DAG Scheduler

4.     Action

5.     Transformation (all five concepts are illustrated in the PySpark sketch below)
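
A minimal PySpark sketch (local mode) illustrating the concepts above; the numbers are just example data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("concepts").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 11))              # RDD: a distributed collection
squared = rdd.map(lambda x: x * x)              # transformation: lazy, only extends the lineage graph
evens = squared.filter(lambda x: x % 2 == 0)    # another transformation, still nothing executed

print(evens.toDebugString().decode())           # the lineage that the DAG scheduler turns into stages
print(evens.collect())                          # action: triggers execution -> [4, 16, 36, 64, 100]

spark.stop()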

Saturday, July 18, 2020

A better way of writing logs using a logger


As a data science programmer, you might have to build projects based on machine learning, deep learning, and other related work. It is easy to compile a project and check that its output matches your requirements.

During the development phase, most of the time we print the output and verify it. But in the production phase, it is very important to keep track of logs in a file while the program is running; otherwise you will not be able to tell which modules or lines of code had errors and problems, and fixing errors after a crash becomes very time-consuming. In this post, you will learn how to use logging in Python to keep a record of the events that occur within a program.

We created a solution that works in both development and production environments. The basic idea is to maintain a configuration file and read all logging-related information from it, so a developer can easily modify it at any point in time. In the development environment we can keep different config values as per our requirements.

In our solution we have 2 main files:
1.     logging.conf
2.     loghelper.py

logging.conf
The most important part is to define the loggers, handlers, and formatters:

# loggers: unique logger names
[loggers]
keys=root,simpleExample,filelogExample,timedRotatingFileHandler

# handlers: the different log destinations (console or file)
[handlers]
keys=consoleHandler,fileHandler,timedRotatingFileHandler

# formatters: message formats
[formatters]
keys=simpleFormatter

In my case, I defined 3 types of loggers:
1.     simpleExample to log errors to the console
2.     filelogExample to log errors to a specific log file
3.     timedRotatingFileHandler to log errors to a specific file, but a new file is created every midnight.
The handler and formatter sections these loggers refer to are sketched below.
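
The remaining sections tie each logger to its handler and formatter. The following is a plausible completion of the file, assuming python.log as the log-file name (the actual file is in the GitHub repository linked at the end).

[logger_root]
level=DEBUG
handlers=consoleHandler

[logger_simpleExample]
level=DEBUG
handlers=consoleHandler
qualname=simpleExample
propagate=0

[logger_filelogExample]
level=DEBUG
handlers=fileHandler
qualname=filelogExample
propagate=0

[logger_timedRotatingFileHandler]
level=DEBUG
handlers=timedRotatingFileHandler
qualname=timedRotatingFileHandler
propagate=0

[handler_consoleHandler]
class=StreamHandler
level=DEBUG
formatter=simpleFormatter
args=(sys.stdout,)

[handler_fileHandler]
class=FileHandler
level=DEBUG
formatter=simpleFormatter
args=('python.log', 'a')

[handler_timedRotatingFileHandler]
class=handlers.TimedRotatingFileHandler
level=DEBUG
formatter=simpleFormatter
args=('python.log', 'midnight', 1)

[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s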

loghelper.py
This is a simple file; it just reads the config file.

import logging
import logging.config

logging.config.fileConfig('logging.conf')


employee.py
Import loghelper, then create a logger object using getLogger, specifying the name of the logger you want to use.

import loghelper
import logging
import traceback
import time

logger = logging.getLogger('filelogExample')

Based on the config values, all logs go to the python.log file. To use the midnight-rotating log file instead, request that logger:

logger = logging.getLogger('timedRotatingFileHandler')

Exception logging

import loghelper
import logging
import traceback

logger = logging.getLogger('filelogExample')  # assumes the file logger defined above

try:
    1 / 0  # any operation that can fail
except ZeroDivisionError:
    logger.error("uncaught exception: %s", traceback.format_exc())
    # or let the logger capture the traceback itself:
    logger.exception("exception")

I uploaded the files to GitHub; check the URL here




Thursday, July 16, 2020

Web Scraping - website data scraping using BeautifulSoup


In recent days I worked on scraping data from multiple web sources. BeautifulSoup is a useful library, and the good thing is that it is also very easy to work with; if you have a little knowledge of HTML, it is very easy. Before exploring this package I also thought that web scraping was very difficult, but once you have a clear idea of what you want and in what format you want it, it is easy to get.

Step1: Import required packages
import requests
from bs4 import BeautifulSoup

Step2: Define the URL you want to gather data from
base_site = "https://en.wikipedia.org/wiki/Music"

In my case, I gathered data from the Wikipedia Music page.

Step3: Make a GET request and check whether it succeeded or failed using status_code
response = requests.get(base_site)
response.status_code

Step4: Extract the HTML content from the response object
html = response.content

Step5: From here our BeautifulSoup journey starts
Convert the HTML to a BeautifulSoup object. This will allow us to parse content out of the HTML more easily. We can use different parsers like “html.parser” and “lxml”; I have written another article about parsers, please read it here

soup = BeautifulSoup(html, "html.parser")

Step6: Save the HTML object into a local file and explore the structure. This is an optional step; you can also do this on the web page itself. Decide based on your requirement.

with open('Wiki_response.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))

# .prettify() modifies the HTML code with additional indentations for better readability

Step7: We mostly work with two methods, “find” and “find_all”
find returns the first occurrence of an element in the markup; it returns None if the tag is not found in the markup.

Here I am finding the video tag in the soup object

soup.find('video')

find_all returns all occurrences in the entire markup, as a list of all matching items.
Here I am finding all anchor tags in the soup object

links = soup.find_all('a')
  
Different ways to get the required element from the soup object:
1.     Fetching anchor tags by a specific class attribute. This returns all anchor tags that have the ‘mw-jump-link’ class. class is a reserved Python keyword, so we have to use class_ with a trailing underscore.

soup.find_all('a', class_ = 'mw-jump-link')

2.     Using both the class and href attributes

soup.find('a', class_ = 'mw-jump-link', href = '#p-search')

3.     We can pass a dictionary instead of separate attributes

soup.find('a', attrs={ 'class':'mw-jump-link', 'href':'#p-search' })
soup.find('div', {'id' : 'footer'})


In my specific case, I extracted content from multiple URLs, so I wrote a method to automate the requirement (sketched below). Please check my GitHub link below.
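
Here is a small sketch of such a helper; the URL list and the extracted fields (page title and outgoing links) are hypothetical stand-ins for what the actual notebook does.

import requests
from bs4 import BeautifulSoup

def scrape_pages(urls):
    results = {}
    for url in urls:
        response = requests.get(url)
        if response.status_code != 200:        # skip pages that failed to load
            continue
        soup = BeautifulSoup(response.content, "html.parser")
        results[url] = {
            "title": soup.title.string if soup.title else None,
            "links": [a.get('href') for a in soup.find_all('a', href=True)],
        }
    return results

pages = scrape_pages(["https://en.wikipedia.org/wiki/Music"])
print(len(pages["https://en.wikipedia.org/wiki/Music"]["links"]))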

github: https://github.com/kvsivasankar/web-scraping/blob/master/webscraping2.ipynb





Web scraping - Pulling data from public APIs


Sometimes we gather data from APIs, and JSON is the most common format in which we receive the output.
Here I am showing how to call APIs and handle errors.

step1: Import required packages
  • import requests
  • import json

step2: Call the API URL using requests; here "base_url" is the web service URL.
We can make a GET request to this API endpoint with requests.get
  • base_url = "https://api.exchangeratesapi.io/latest"
  • response = requests.get(base_url)

This method returns the response from the server
We store this response in a variable for future processing

step3: Verify whether the request went through OK or not
  • response.ok (returns True on success)
  • response.status_code (in my case it returned 200); we can look up what each status code means
  • In case of an error, a code such as 400 will be returned
  • response.json() will show the error details

step4: Verify data
  • response.text (returns the output in string format)
  • response.content (returns the content of the response in byte format)
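
Putting the steps above together, a minimal sketch might look like this; the base/date/rates fields are assumptions based on that API's documented response format, and the endpoint may now require an API key.

import requests

base_url = "https://api.exchangeratesapi.io/latest"

response = requests.get(base_url)
if response.ok:                                    # True for 2xx status codes
    data = response.json()                         # parse the JSON body into a dict
    print(data.get("base"), data.get("date"))
    print(data.get("rates", {}).get("USD"))
else:
    print("Request failed:", response.status_code, response.text)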



I uploaded the work to GitHub; please find the link below.
github: https://github.com/kvsivasankar/web-scraping/blob/master/webscraping1.ipynb


Wednesday, July 15, 2020

Difference between "html.parser" vs "lxml"


1.     html.parser - BeautifulSoup(htmlmarkup, "html.parser")
    • Advantages: 
      • Batteries included
      • Decent speed
      • Built-in - no extra dependencies needed
      • Lenient (as of Python 2.7.3 and 3.2.2)
    • Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

2.     lxml - BeautifulSoup(htmlmarkup, "lxml")
    • Advantages: 
      • Very fast
      • Lenient
      • Works well when webpage has broken HTML
    • Disadvantages: External C dependency




This is clearly mentioned in the documentation; check the URL here
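
For a quick illustration (adapted from the invalid fragment "<a></p>" used in the BeautifulSoup documentation), the two parsers repair broken markup differently; the exact output can vary by version.

from bs4 import BeautifulSoup

broken = "<a></p>"                                   # invalid markup: mismatched closing tag

print(BeautifulSoup(broken, "html.parser"))          # roughly: <a></a>
print(BeautifulSoup(broken, "lxml"))                 # roughly: <html><body><a></a></body></html>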

Wednesday, February 26, 2020

SMS Spam detection model using LSTM


DataSet Source: https://www.kaggle.com/uciml/sms-spam-collection-dataset

https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_LSTM.ipynb


SMS spam detection model (LSTM)


Machine learning model to classify SMS texts as ham or spam
·      A total of 11144 data points (rows) are present in the dataset
·      Split the data into train and test sets (80%/20% ratio)
·      Used the torchtext library for text processing
·      Used the NLTK library for word tokenization
·      Built the vocabulary using the top 10,500 words from the testing data

In this model I used an LSTM architecture for prediction. LSTM is an improved version of the vanilla RNN.


The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits text into tokens based on whitespace and punctuation.
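
A quick illustration; the sample message is made up, and word_tokenize needs the NLTK tokenizer data downloaded once.

import nltk
nltk.download('punkt', quiet=True)      # tokenizer data (newer NLTK versions may need 'punkt_tab')
from nltk.tokenize import word_tokenize

print(word_tokenize("Free entry! Text WIN to 80086 now."))
# ['Free', 'entry', '!', 'Text', 'WIN', 'to', '80086', 'now', '.']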

 

Next, we have to build a vocabulary. This is effectively a lookup table where every unique word in your dataset has a corresponding index (an integer). Each index is used to construct a one-hot vector for each word.
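
The actual model is in the linked notebook; below is only a minimal sketch of this kind of LSTM classifier in PyTorch, with illustrative layer sizes rather than the notebook's real ones.

import torch.nn as nn

class SpamLSTM(nn.Module):
    def __init__(self, vocab_size=10500, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # token index -> dense vector
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)                     # single logit: ham vs spam

    def forward(self, token_ids):                 # token_ids: (batch, seq_len) of vocabulary indices
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)      # final hidden state summarizes the message
        return self.fc(hidden[-1]).squeeze(1)     # raw logit; pair with BCEWithLogitsLoss in training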


Got a training accuracy of 86% but a test accuracy of only 74%.

SMS Spam detection model using RNN


DataSet Source: https://www.kaggle.com/uciml/sms-spam-collection-dataset


https://github.com/kvsivasankar/RNN-Models/blob/master/SMSSpam_BinaryTextClassification_Words_RNN.ipynb

SMS spam detection model (RNN)


Machine learning model to classify SMS texts as ham or spam
·      A total of 11144 data points (rows) are present in the dataset
·      Split the data into train and test sets (80%/20% ratio)
·      Used the torchtext library for text processing
·      Used the NLTK library for word tokenization
·      Built the vocabulary using the top 10,500 words from the testing data

The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

NLTK provides a function called word_tokenize() for splitting strings into tokens. It splits text into tokens based on whitespace and punctuation.

 

Next, we have to build a vocabulary. This is effectively a lookup table where every unique word in your dataset has a corresponding index (an integer). Each index is used to construct a one-hot vector for each word.

Got a training accuracy of 85% but a test accuracy of only 69%.

Wednesday, February 19, 2020

Ship classification (CNN)



1.     Computer Vision Hackathon on Analytics Vidhya in which we were supposed to classify different images of ships into 5 classes – (Cargo, Military, Carrier, Cruise, Tanker)
2.     Most of the images were captured from a very long distance, so there is a high chance of noise in the images.
3.     Used the pre-trained model ‘SENet154’ for training. Unfroze some of the model parameters and fine-tuned them (see the sketch below).
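
A sketch of the fine-tuning idea, assuming Cadene's pretrainedmodels package as the source of SENet154 ImageNet weights; the layers unfrozen here are illustrative, and the actual notebook may differ.

import torch.nn as nn
import pretrainedmodels

model = pretrainedmodels.__dict__['senet154'](num_classes=1000, pretrained='imagenet')

for param in model.parameters():                 # freeze the whole backbone first
    param.requires_grad = False

for param in model.layer4.parameters():          # unfreeze the last block for fine-tuning
    param.requires_grad = True

# replace the classifier head with a 5-class layer (Cargo, Military, Carrier, Cruise, Tanker)
model.last_linear = nn.Linear(model.last_linear.in_features, 5)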

 

      GitHub:

