Wednesday, August 14, 2019

Docker for Data Scientists

Introduction

Docker can be a very powerful tool and you can learn how to use it without going all the way down the rabbit hole. In this guide, you will get just enough Docker knowledge to improve your data science workflow and avoid common pitfalls.
This guide is based on my experiences as an independent consultant, helping data science teams to introduce reproducible and automated machine learning workflows.

What is Docker?

Docker is a container platform for deploying isolated applications on Linux and Windows. It includes a tool chain for creating, sharing, and building upon layered application stacks. Docker also forms the basis for more advanced services such as Docker Swarm from Docker Inc. and Kubernetes from Google.

As a Data Scientist, Why Should You Care?

Automating deployment via Docker containers helps you focus on your work rather than on maintaining complex software dependencies. By freezing the exact state of a deployed system inside an image, you also get easier reproducibility of your work and collaboration with your colleagues. Finally, you can use resources like Docker Hub to find pre-built recipes (Dockerfiles) from others that you can copy and build on.


Useful docker commands

docker run -it --rm --name ds -p 8888:8888 jupyter/datascience-notebook

The command above pulls the pre-built jupyter/datascience-notebook image from Docker Hub and runs it as a container named "ds", publishing port 8888.

docker ps  --- list running containers

docker build  --- build an image from a Dockerfile

docker images  --- show all downloaded images

docker rm  --- remove containers

docker rmi  --- remove images

docker rmi -f image-name  --- force-remove an image

docker run -it -p 1234:80 --name hello-world  --- run a container named "hello-world" interactively, mapping host port 1234 to container port 80 (the image name goes at the end of the command)

docker --version  --- show the installed Docker version

Ctrl+C stops a container running in the foreground.

Advantages:

  • Return on investment & cost savings
  • Standardization & productivity
  • Compatibility & maintainability
  • Simplicity & faster configurations
  • Rapid Deployment
  • Continuous Deployment & Testing
  • Isolation
  • Security



Monday, July 22, 2019

Decision Tree Important Hyperparameters


  • criterion (optimization criterion): Gini impurity or entropy / information gain
  • max_depth: build trees at most d levels deep, where d is the number of levels from the root node
  • min_samples_split: The minimum number of samples required to split an internal node
  • min_samples_leaf: The minimum number of samples required to be at a leaf node
  • max_features: The number of features to consider when looking for the best split
  • min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to this value
  • class_weight: Weights associated with classes in the form {class_label: weight}
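
These hyperparameters map directly to arguments of scikit-learn's DecisionTreeClassifier. A minimal sketch (the dataset and parameter values below are placeholders, not recommendations):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion="gini",          # or "entropy" for information gain
    max_depth=4,               # limit how deep the tree can grow
    min_samples_split=10,      # min samples needed to split an internal node
    min_samples_leaf=5,        # min samples required at a leaf
    max_features="sqrt",       # features considered when looking for the best split
    min_impurity_decrease=0.0, # split only if impurity drops at least this much
    class_weight="balanced",   # re-weight classes by inverse frequency
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.score(X, y))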

Pros & Cons Decision trees

Pros:
  • Generation of clear human-understandable classification rules, e.g. "if age <25 and is interested in motorcycles, deny the loan". This property is called interpretability of the model.
  • Decision trees can be easily visualized, i.e. both the model itself (the tree) and prediction for a certain test object (a path in the tree) can "be interpreted".
  • Fast training and forecasting.
  • Small number of model parameters.
  • Supports both numerical and categorical features.

Cons:
  • The trees are very sensitive to the noise in input data; the whole model could change if the training set is slightly modified (e.g. remove a feature, add some objects). This impairs the interpretability of the model.
  • Separating border built by a decision tree has its limitations – it consists of hyperplanes perpendicular to one of the coordinate axes, which is inferior in quality to some other methods, in practice.
  • We need to avoid overfitting by pruning, setting a minimum number of samples in each leaf, or defining a maximum depth for the tree. Note that overfitting is an issue for all machine learning methods.
  • Instability. Small changes to the data can significantly change the decision tree. This problem is tackled with decision tree ensembles
  • The optimal decision tree search problem is NP-complete. Some heuristics are used in practice such as greedy search for a feature with maximum information gain, but it does not guarantee finding the globally optimal tree.
  • Difficulty supporting missing values in the data. Friedman estimated that it took about 50% of the code to support gaps in data in CART (an improved version of this algorithm is implemented in sklearn).
  • The model can only interpolate, not extrapolate (the same is true for random forests and tree boosting). That is, a decision tree makes a constant prediction for objects that lie beyond the bounding box set by the training set in the feature space.
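
A quick sketch of that last point (the data here is synthetic): a regression tree predicts a constant for inputs outside the range it was trained on.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_train = np.arange(0, 10, 0.5).reshape(-1, 1)   # training range is [0, 10)
y_train = 2 * X_train.ravel()                    # simple linear target

tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

# inside the training range the tree follows the trend,
# but at x = 20 and x = 50 it repeats the last leaf's constant value
print(tree.predict([[5], [20], [50]]))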


Tuesday, July 9, 2019

Need to split a string into multiple columns?

Use the str.split() method with expand=True to return a DataFrame, and assign the result to new columns of the original DataFrame.
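
A minimal sketch (the DataFrame and column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'name': ['John Smith', 'Jane Doe']})

# expand=True returns a DataFrame instead of a Series of lists
df[['first', 'last']] = df['name'].str.split(' ', expand=True)
print(df)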


Friday, June 21, 2019

Type I and Type II errors in Data Science


Confusion Matrix



Type I and Type II errors

• Type I error, also known as a "false positive": the error of rejecting a null hypothesis when it is actually true. In other words, this is the error of accepting an alternative hypothesis (the real hypothesis of interest) when the results can be attributed to chance. Plainly speaking, it occurs when we observe a difference when in truth there is none (or, more specifically, no statistically significant difference). So the probability of making a Type I error in a test with rejection region R is P(R | H0 is true).

• Type II error, also known as a "false negative": the error of not rejecting a null hypothesis when the alternative hypothesis is the true state of nature. In other words, this is the error of failing to accept an alternative hypothesis when you don't have adequate power. Plainly speaking, it occurs when we fail to observe a difference when in truth there is one. So the probability of making a Type II error in a test with rejection region R is 1 − P(R | Ha is true), and the power of the test is P(R | Ha is true).
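
As a small illustration (the labels below are made up), the off-diagonal cells of a binary confusion matrix correspond to the two error types:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# with labels=[0, 1], the layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("Type I errors (false positives):", fp)
print("Type II errors (false negatives):", fn)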

What is AI | ML | NN

Artificial Intelligence is the broader umbrella under which Machine Learning and Deep Learning fall. As the diagram shows, deep learning is itself a subset of machine learning. So deep learning is a subset of machine learning, which in turn is a subset of AI. Let us move on and understand how exactly they differ from each other.

“Algorithms that parse data, learn from that data, and then apply what they’ve learned to make informed decisions”


Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned


Deep learning structures algorithms in layers to create an “artificial neural network” that can learn and make intelligent decisions on its own

Deep learning is a subfield of machine learning. While both fall under the broad category of artificial intelligence, deep learning is what powers the most human-like artificial intelligence



Monday, June 17, 2019

Bias-variance strategies in neural network




  • High bias corresponds to underfitting, and high variance to overfitting
  • In neural networks, there are a few standard strategies to counter bias and variance issues:


  • High Bias/Underfitting:

  1. Bigger network: add more layers and more nodes per layer to counter underfitting
  2. More epochs / train longer: increase the number of passes over the entire data to counter underfitting
  3. Different NN architecture: try a different network architecture



  • High Variance/Overfitting:

  1. Get more data: increase the number of data points collected
  2. Make the model simpler: use fewer layers and fewer nodes
  3. Regularization: add L1/L2 regularization to the weights
  4. Dropout: a percentage of nodes is randomly turned off (not trained) during each pass; generally set to 10%-30%
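
A minimal PyTorch sketch of the overfitting counter-measures above (the layer sizes, dropout rate, and weight_decay value are arbitrary placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),   # randomly turn off ~20% of nodes during training
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

# weight_decay applies L2 regularization to the weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)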

Sunday, June 16, 2019

Undo accidentally deleted cells in Jupyter notebook

If you need to undo something deleted inside a cell: Ctrl/Cmd+Z

If you need to recover an entire deleted cell: press Esc, then Z (undo cell deletion in command mode)

Tuesday, June 11, 2019

Crack Data Science (Must know concepts)

Languages : Python or R
R vs Python

Get a good grip on the Python language; it helps in many ways.

I prefer Python because

  • Speed
  • Syntax
  • Ease of use in production systems

Data Structures in python

  • Arrays
  • List
  • DataFrame
  • Dictionary
  • Tuples
  • Sets

Key packages in python

  • Numpy
  • Pandas
  • Matplotlib
  • Scipy
  • Scikit-learn
  • Pytorch (Neural Networks)
  • and others
Some of must know concepts

Broadcasting
Distributions - Normal, T-distribution, Bernoulli
Hypothesis testing
Central limit theorem
Supervised vs unsupervised learning
Loss functions
Bias / Variance tradeoff
Missing data analysis - Imputation methods
Linear Algebra
What is reshape(-1, 1)? (see the NumPy sketch at the end of this list)

Dot product
Validation techniques
k-fold validation
Stratified k fold
Stacking

Linear Regression
Logistic Regression
Tree based models

Adjusted R-squared
P-value explanation
Storing model in pickle file


Hyper parameters
Model parameters

GridSearchCV
RandomizedSearchCV

Regularization
ROC/AUC
Lasso - L1
Ridge - L2
Elastic Net


Decision Trees
Entropy
Information gain
CART

ensemble models

Bagging - Random Forest
Boosting - XGBoost, Lightgbm
Stacking

PCA
Factor analysis
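
Regarding reshape(-1, 1) mentioned above, a quick NumPy sketch (the array values are arbitrary): -1 tells NumPy to infer that dimension, producing the single-column shape scikit-learn expects for a one-feature input.

import numpy as np

a = np.array([1, 2, 3, 4])     # shape (4,)
col = a.reshape(-1, 1)         # -1 means "infer this dimension" -> shape (4, 1)
print(col.shape)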

















Friday, June 7, 2019

Engineering missing values (NA) categorical variables


  • Frequent category imputation

def impute_na(df_train, df_test, variable):
    # find the most frequent category in the training set
    most_frequent_category = df_train.groupby([variable])[variable].count().sort_values(ascending=False).index[0]
    # replace NA in both train and test with that category
    df_train[variable].fillna(most_frequent_category, inplace=True)
    df_test[variable].fillna(most_frequent_category, inplace=True)

  • Random sample imputation
def impute_na(df_train, df_test, variable):
    # get the most frequent label and replace NA in train and test set
    most_frequent_category = df_train.groupby([variable])[variable].count().sort_values(ascending=False).index[0]
    df_train[variable+'_frequent'] = df_train[variable].fillna(most_frequent_category)
    df_test[variable+'_frequent'] = df_test[variable].fillna(most_frequent_category)
    
    # random sampling
    df_train[variable+'_random'] = df_train[variable]
    df_test[variable+'_random'] = df_test[variable]
    
    # extract the random sample to fill the na
    random_sample_train = df_train[variable].dropna().sample(df_train[variable].isnull().sum(), random_state=0)
    random_sample_test = df_train[variable].dropna().sample(df_test[variable].isnull().sum(), random_state=0)
    
    # pandas needs to have the same index in order to merge datasets
    random_sample_train.index = df_train[df_train[variable].isnull()].index
    random_sample_test.index = df_test[df_test[variable].isnull()].index
    
    df_train.loc[df_train[variable].isnull(), variable+'_random'] = random_sample_train
    df_test.loc[df_test[variable].isnull(), variable+'_random'] = random_sample_test
  • Adding a variable to capture NA
import numpy as np

def impute_na(df_train, df_test, variable):
    # new column: keep the original value, but label missing entries as 'Missing'
    df_train[variable+'_NA'] = np.where(df_train[variable].isnull(), 'Missing', df_train[variable])
    df_test[variable+'_NA'] = np.where(df_test[variable].isnull(), 'Missing', df_test[variable])

Engineering missing values (NA) in numerical variables


  • Mean and median imputation
median = df[variable].median()   # or df[variable].mean() for mean imputation
df[variable+'_median'] = df[variable].fillna(median)
  • Random sample imputation

    # random sampling
    df[variable+'_random'] = df[variable]
    # extract the random sample to fill the na
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    # pandas needs to have the same index in order to merge datasets
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable+'_random'] = random_sample

  • Adding a variable to capture NA 
X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)
  • Arbitrary value imputation

def impute_na(df, variable):
    df[variable+'_zero'] = df[variable].fillna(0)
    df[variable+'_hundred']= df[variable].fillna(100)



Thursday, June 6, 2019

Machine learning models sensitive to feature scale


Machine learning models sensitive to feature scale

  • Linear and Logistic Regression
  • Neural Networks
  • SVM
  • KNN
  • K-means clustering
  • Linear Discriminant Analysis (LDA)
  • Principal Component Analysis (PCA)

Tree based ML models insensitive to feature scale


  • Classification and Regression Trees
  • Random Forests
  • Gradient Boosted Trees
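
For the scale-sensitive models in the first list, a minimal scikit-learn sketch (the feature values are placeholders) of standardizing features before fitting; the tree-based models just listed generally do not need this step:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X = [[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0], [4.0, 4000.0]]
y = [0, 0, 1, 1]

# scale each feature to zero mean / unit variance before the SVM sees it
model = make_pipeline(StandardScaler(), SVC())
model.fit(X, y)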

Commonly used loss functions in Data Science

Loss Functions
  • A function that evaluates how good our model's predictions are
  • Mainly evaluates the difference between actual and predicted values (Act − Pred) via some function
  • Many are already implemented in Python's ML library sklearn (e.g. RMSE, binary cross-entropy loss, etc.)


For Regression

  • y_i = target value for the i-th example
  • ŷ_i = predicted value for the i-th example

For Classification

  • y_i = target value for the i-th example
  • ŷ_i = predicted value (or probability) for the i-th example
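
A small sketch (the arrays below are placeholders) of computing one regression loss and one classification loss with sklearn:

from sklearn.metrics import mean_squared_error, log_loss

# regression: mean squared error between targets y_i and predictions ŷ_i
y_true_reg = [3.0, 5.0, 2.5]
y_pred_reg = [2.5, 5.0, 4.0]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))

# classification: binary cross-entropy / log loss on predicted probabilities
y_true_clf = [0, 1, 1, 0]
y_pred_proba = [0.1, 0.9, 0.8, 0.3]
print("Log loss:", log_loss(y_true_clf, y_pred_proba))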

ROC, AUC 
    Precision and Recall







    Image noise comparison methods

     1. Reference-based techniques: peak_signal_noise_ratio (PSNR), SSI
     2. No-reference techniques: BRISQUE python pac...