Wednesday, August 14, 2019
Docker for Data Scientists
Introduction
Docker can be a very powerful tool and you can learn how to use it without going all the way down the rabbit hole. In this guide, you will get just enough Docker knowledge to improve your data science workflow and avoid common pitfalls.
This guide is based on my experiences as an independent consultant, helping data science teams to introduce reproducible and automated machine learning workflows.
What is Docker?
Docker is a container platform for deploying isolated applications on Linux/Windows. It includes a tool chain for creating, sharing, and building upon layered application stacks. Docker also forms the basis for more advanced services such as Docker Swarm from Docker Inc. and Kubernetes from Google.
As a Data Scientist, Why Should You Care?
Automation of deployment via Docker containers helps you to focus on your work, and not on maintaining complex software dependencies. By freezing the exact state of a deployed system inside an image, you also get easier reproducibility of your work and collaboration with your colleagues. Finally, you can use resources like Docker Hub to find pre-built recipes (Dockerfiles) from others that you can copy and build on.
Useful Docker commands
docker run -it --rm --name ds -p 8888:8888 jupyter/datascience-notebook
The above command pulls the pre-built jupyter/datascience-notebook image from Docker Hub and runs it as a container named "ds", publishing port 8888.
docker ps ---- lists running containers
docker build ---- builds an image from a Dockerfile (a Dockerfile sketch follows the advantages list below)
docker images ---- shows all downloaded images
docker rm ---- removes containers
docker rmi ---- removes images
docker rmi -f image-name ---- forces removal of an image
docker run -it -p 1234:80 --name <container-name> <image> ---- runs an image interactively, mapping host port 1234 to container port 80
docker --version ---- shows the installed Docker version
Ctrl+C exits a foreground container; use docker stop and docker rm to stop or remove containers.
Advantages:
- Return on investment & cost savings
- Standardization & productivity
- Compatibility & maintainability
- Simplicity & faster configurations
- Rapid Deployment
- Continuous Deployment & Testing
- Isolation
- Security
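As an illustration of what docker build consumes, here is a minimal sketch of a Dockerfile for a data science image. The base image, package list, and file layout are assumptions; adjust them to your project.

# A minimal sketch of a Dockerfile for a data science image
FROM python:3.7-slim

# Install the Python libraries your notebooks depend on (example list)
RUN pip install --no-cache-dir numpy pandas scikit-learn jupyter

# Copy your project into the image
WORKDIR /app
COPY . /app

# Start Jupyter, listening on the port we publish with -p 8888:8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--allow-root"]

Build and run it with, for example, docker build -t my-ds-image . followed by docker run -it --rm -p 8888:8888 my-ds-image (the image name here is just an example).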
Thursday, July 25, 2019
Text auto summarise - NLP
A text auto-summarisation NLP solution:
https://github.com/kvsivasankar/Text-AutoSummarize-NLP-Model
It condenses a large paragraph into a summary; you decide how many sentences you want.
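For illustration, here is a minimal frequency-based extractive summariser. This is only a sketch of the general idea, not the repository's exact code.

import re
from collections import Counter
from heapq import nlargest

def summarise(text, num_sentences=3):
    """Pick the num_sentences highest-scoring sentences, scored by word frequency."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'\w+', text.lower())
    freq = Counter(words)
    # Score each sentence by the frequency of the words it contains
    scores = {s: sum(freq[w] for w in re.findall(r'\w+', s.lower())) for s in sentences}
    best = nlargest(min(num_sentences, len(sentences)), scores, key=scores.get)
    # Keep the chosen sentences in their original order
    return ' '.join(s for s in sentences if s in best)

Passing num_sentences controls how short the summary is.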
Iris flower prediction. Get the dataset from Kaggle:
https://www.kaggle.com/uciml/iris
Problem solution (GitHub URL):
https://github.com/kvsivasankar/Iris-flower-prdiction-model
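For illustration, a minimal sketch of this kind of workflow using the Iris data bundled with scikit-learn; the repository's exact model choice may differ.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset (the same data as the Kaggle CSV)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a simple classifier and check hold-out accuracy
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))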
Monday, July 22, 2019
Decision Tree: Important Hyperparameters
Optimization criterion: Gini impurity or entropy/information gain
max_depth: build trees at most d levels deep, where d is counted as the number of nodes from the top of the tree
min_samples_split: The minimum number of samples required to split an internal node
min_samples_leaf: The minimum number of samples required to be at a leaf node
max_features: The number of features to consider when looking for the best split
min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to this value
class_weight: weights associated with classes in the form {class_label: weight} (a scikit-learn sketch using these hyperparameters follows below)
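These names map directly onto scikit-learn's DecisionTreeClassifier arguments. A minimal sketch; the values shown are only examples and should be tuned with cross-validation for your own data.

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion='gini',            # or 'entropy' for information gain
    max_depth=5,                 # limit tree depth
    min_samples_split=10,        # minimum samples to split an internal node
    min_samples_leaf=5,          # minimum samples at a leaf
    max_features='sqrt',         # features considered per split
    min_impurity_decrease=0.001, # required impurity decrease to split
    class_weight='balanced',     # or a dict {class_label: weight}
)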
Pros & Cons of Decision Trees
Pros:
Generation of clear human-understandable classification rules, e.g. "if age <25 and is interested in motorcycles, deny the loan". This property is called interpretability of the model.
Decision trees can be easily visualized, i.e. both the model itself (the tree) and prediction for a certain test object (a path in the tree) can "be interpreted".
Fast training and forecasting.
Small number of model parameters.
Supports both numerical and categorical features.
Cons:
The trees are very sensitive to the noise in input data; the whole model could change if the training set is slightly modified (e.g. remove a feature, add some objects). This impairs the interpretability of the model.
The separating boundary built by a decision tree has its limitations: it consists of hyperplanes perpendicular to one of the coordinate axes, which in practice is inferior in quality to some other methods.
We need to avoid overfitting by pruning, setting a minimum number of samples in each leaf, or defining a maximum depth for the tree. Note that overfitting is an issue for all machine learning methods.
Instability. Small changes to the data can significantly change the decision tree. This problem is tackled with decision tree ensembles.
The optimal decision tree search problem is NP-complete. Some heuristics are used in practice such as greedy search for a feature with maximum information gain, but it does not guarantee finding the globally optimal tree.
Difficulty supporting missing values in the data. Friedman estimated that it took about 50% of the code to support gaps in the data in CART (an improved version of this algorithm is implemented in sklearn).
The model can only interpolate but not extrapolate (the same is true for random forests and tree boosting). That is, a decision tree makes constant prediction for the objects that lie beyond the bounding box set by the training set in the feature space.
Saturday, July 13, 2019
Student classification problem from kaggle
kaggle dataset to download:
https://www.kaggle.com/aljarah/xAPI-Edu-Data
solved problem in my git repositiory:
https://github.com/kvsivasankar/StudentClassification/blob/master/StudentAnalysis.ipynb
https://www.kaggle.com/aljarah/xAPI-Edu-Data
solved problem in my git repositiory:
https://github.com/kvsivasankar/StudentClassification/blob/master/StudentAnalysis.ipynb
Friday, June 21, 2019
Type I and Type II errors in Data Science
Confusion Matrix
Type I and Type II errors
• Type I error, also known as a “false positive”: the error of rejecting a null hypothesis when it is actually true. In other words, this is the error of accepting an alternative hypothesis (the real hypothesis of interest) when the results can be attributed to chance. Plainly speaking, it occurs when we are observing a difference when in truth there is none (or more specifically, no statistically significant difference). So the probability of making a Type I error in a test with rejection region R is P(R | H0 is true).
• Type II error, also known as a "false negative": the error of not rejecting a null hypothesis when the alternative hypothesis is the true state of nature. In other words, this is the error of failing to accept an alternative hypothesis when you don't have adequate power. Plainly speaking, it occurs when we are failing to observe a difference when in truth there is one. So the probability of making a Type II error in a test with rejection region R is 1 − P(R | Ha is true). The power of the test is P(R | Ha is true).
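In classification terms, a Type I error is a false positive and a Type II error is a false negative, and both can be read straight off a confusion matrix. A minimal sketch with scikit-learn; the labels below are made up.

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # hypothetical ground truth
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]   # hypothetical predictions

# With labels [0, 1], the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print('False positives (Type I):', fp)
print('False negatives (Type II):', fn)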
What is AI | ML | NN
Artificial Intelligence is the broader umbrella under which Machine Learning and Deep Learning come, and Deep Learning is in turn a subset of Machine Learning. In other words, Deep Learning is a subset of Machine Learning, which is itself a subset of AI. So let us move on and understand how exactly they are different from each other.
“Algorithms that parse data, learn from that data, and then apply what they’ve learned to make informed decisions”
Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned
Deep learning structures algorithms in layers to create an “artificial neural network” that can learn and make intelligent decisions on its own
Deep learning is a subfield of machine learning. While both fall under the broad category of artificial intelligence, deep learning is what powers the most human-like artificial intelligence
Monday, June 17, 2019
Bias-variance strategies in neural network
- High bias corresponds to underfitting, and high variance corresponds to overfitting
- In neural networks, there are a few rules of thumb to counter bias and variance issues:
- High Bias/Underfitting:
- Bigger network: add more layers and more nodes per layer to counter underfitting
- More Epochs/Train Longer: Increase number of passes over the entire data to counter underfitting
- Different NN architecture: try a different network architecture
- High Variance/Overfitting:
- Get more data: Increase data points collected
- Make model simpler: Simplify the model in terms of less layers, and less nodes
- Regularization: Add l1/l2 regularization to weights
- Dropout: Dropout randomly turns off (does not train) some percentage of nodes during training; generally set to 10%-30%. A PyTorch sketch combining dropout with L2 regularization follows below.
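As an illustration of the high-variance fixes, a minimal PyTorch sketch combining dropout with L2 regularization (weight decay). Layer sizes, dropout rates, and the learning rate are example values only.

import torch.nn as nn
import torch.optim as optim

# A small feed-forward network with dropout between layers
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),   # turn off ~20% of activations during training
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(32, 1),
)

# weight_decay adds an L2 penalty on the weights
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)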
Sunday, June 16, 2019
Undo accidentally deleted cells in Jupyter notebook
If you need to undo something deleted inside a cell: CTRL/CMD+Z
If you need to recover an entire deleted cell : ESC+Z
Tuesday, June 11, 2019
Crack Data Science (Must know concepts)
Languages: Python or R
R vs Python
Get a good grip on the Python language; it helps in many ways.
I prefer Python because of:
- Speed
- Syntax
- Ease of use in production systems
Data Structures in python
- Arrays
- List
- DataFrame
- Dictionary
- Tuples
- Sets
Key packages in python
- Numpy
- Pandas
- Matplotlib
- Scipy
- Scikit-learn
- and others
- Pytorch (Neural Networks)
Broadcasting
Distributions - Normal, T-distribution, Bernoulli
Hypothesis testing
Central limit theorem
Supervised vs Un-Supervised
Loss functions
Bias / Variance tradeoff
Missing data analysis - Imputation methods
Linear Algebra
What is reshape(-1, 1)? (see the NumPy sketch after this list)
Dot product
Validation techniques
k-fold validation
Stratified k-fold
Stacking
Linear Regression
Logistic Regression
Tree based models
Adjusted R-squared
P-value explanation
Storing model in pickle file
Hyperparameters
Model parameters
GridSearchCV
RandomizedSearchCV
Regularization
ROC/AUC
Lasso - L1
Ridge - L2
Elastic Net
Decision Trees
Entropy
Information gain
CART
Ensemble models
Bagging - Random Forest
Boosting - XGBoost, Lightgbm
Stacking
PCA
Factor analysis
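Two items from the list above, reshape(-1, 1) and broadcasting, are easiest to see in a small NumPy sketch:

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# reshape(-1, 1): -1 tells NumPy to infer the number of rows,
# giving a column vector of shape (6, 1)
col = a.reshape(-1, 1)

# Broadcasting: the (6, 1) column and a (3,) row are stretched
# to a common (6, 3) shape before the addition
row = np.array([10, 20, 30])
print((col + row).shape)   # (6, 3)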
Friday, June 7, 2019
Engineering missing values (NA) in categorical variables
- Frequent category imputation
def impute_na(df_train, df_test, variable):
    most_frequent_category = df_train.groupby([variable])[variable].count().sort_values(ascending=False).index[0]
    df_train[variable].fillna(most_frequent_category, inplace=True)
    df_test[variable].fillna(most_frequent_category, inplace=True)
- Random sample imputation
def impute_na(df_train, df_test, variable):
    # get the most frequent label and replace NA in train and test set
    most_frequent_category = df_train.groupby([variable])[variable].count().sort_values(ascending=False).index[0]
    df_train[variable+'_frequent'] = df_train[variable].fillna(most_frequent_category)
    df_test[variable+'_frequent'] = df_test[variable].fillna(most_frequent_category)
    # random sampling
    df_train[variable+'_random'] = df_train[variable]
    df_test[variable+'_random'] = df_test[variable]
    # extract the random sample to fill the na
    random_sample_train = df_train[variable].dropna().sample(df_train[variable].isnull().sum(), random_state=0)
    random_sample_test = df_train[variable].dropna().sample(df_test[variable].isnull().sum(), random_state=0)
    # pandas needs to have the same index in order to merge datasets
    random_sample_train.index = df_train[df_train[variable].isnull()].index
    random_sample_test.index = df_test[df_test[variable].isnull()].index
    df_train.loc[df_train[variable].isnull(), variable+'_random'] = random_sample_train
    df_test.loc[df_test[variable].isnull(), variable+'_random'] = random_sample_test
- Adding a variable to capture NA
import numpy as np

def impute_na(df_train, df_test, variable):
    # treat NA as its own category by labelling it 'Missing'
    df_train[variable+'_NA'] = np.where(df_train[variable].isnull(), 'Missing', df_train[variable])
    df_test[variable+'_NA'] = np.where(df_test[variable].isnull(), 'Missing', df_test[variable])
Engineering missing values (NA) in numerical variables
- Mean and Median imputation
median = df[variable].median()
df[variable+'_median'] = df[variable].fillna(median)
- Random sample imputation
# random sampling
df[variable+'_random'] = df[variable]
# extract the random sample to fill the na
random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
# pandas needs to have the same index in order to merge datasets
random_sample.index = df[df[variable].isnull()].index
df.loc[df[variable].isnull(), variable+'_random'] = random_sample
- Adding a variable to capture NA
X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)
- Arbitrary value imputation
def impute_na(df, variable):
    df[variable+'_zero'] = df[variable].fillna(0)
    df[variable+'_hundred'] = df[variable].fillna(100)
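A quick usage sketch for the arbitrary-value impute_na defined just above, on a hypothetical toy DataFrame; the 'Age' column name is only an example.

import numpy as np
import pandas as pd

# Toy data with missing values in 'Age'
df = pd.DataFrame({'Age': [22, np.nan, 35, np.nan, 58]})
impute_na(df, 'Age')
print(df[['Age', 'Age_zero', 'Age_hundred']])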
Thursday, June 6, 2019
Machine learning models sensitive to feature scale
- Linear and Logistic Regression
- Neural Networks
- SVM
- KNN
- K-means clustering
- Linear Discriminant Analysis (LDA)
- Principal Component Analysis (PCA)
Tree based ML models insensitive to feature scale
- Classification and Regression Trees
- Random Forests
- Gradient Boosted Trees
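For the scale-sensitive models above, a minimal sketch of standardising features inside a scikit-learn pipeline, so the scaler is fit only on the training data during cross-validation. The choice of SVC here is just an example.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling happens inside the pipeline, so cross-validation will not leak
# test-fold statistics into the scaler
model = make_pipeline(StandardScaler(), SVC())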
Commonly used loss functions in Data Science
Loss Functions
- A function that evaluates the goodness of our models
- Mainly evaluates the difference between the actual and predicted values, Y − Ŷ, by applying some kind of function to it
- Many are already implemented in Python's ML library sklearn (e.g. RMSE, binary cross-entropy loss, etc.)
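A minimal sketch of computing two of these with scikit-learn's metrics module; the numbers are made up.

import numpy as np
from sklearn.metrics import mean_squared_error, log_loss

# Regression: root mean squared error between actual and predicted values
y_true = [3.0, 2.5, 4.0]
y_pred = [2.8, 2.9, 4.2]
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Classification: binary cross-entropy (log loss) on predicted probabilities
y_true_cls = [0, 1, 1]
y_prob = [0.1, 0.8, 0.6]
bce = log_loss(y_true_cls, y_prob)
print(rmse, bce)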