Friday, June 21, 2019

Type I and Type II errors in Data Science


[Figure: Confusion matrix]



Type I and Type II errors

• Type I error, also known as a “false positive”: the error of rejecting a null hypothesis when it is actually true. In other words, this is the error of accepting the alternative hypothesis (the real hypothesis of interest) when the results can be attributed to chance. Plainly speaking, it occurs when we observe a difference when in truth there is none (or, more specifically, no statistically significant difference). So the probability of making a Type I error in a test with rejection region $R$ is $P(R \mid H_0 \text{ is true})$.

• Type II error, also known as a "false negative": the error of not rejecting the null hypothesis when the alternative hypothesis is the true state of nature. In other words, this is the error of failing to accept the alternative hypothesis when the test lacks adequate power. Plainly speaking, it occurs when we fail to observe a difference when in truth there is one. So the probability of making a Type II error in a test with rejection region $R$ is $1 - P(R \mid H_a \text{ is true})$, and the power of the test is $P(R \mid H_a \text{ is true})$.
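To make these definitions concrete, here is a minimal simulation sketch using a one-sample t-test (the significance level, sample size, effect size, and simulation count are illustrative choices, not from the original post):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, n_sims = 0.05, 30, 10_000

# Type I error rate: H0 is true (mean really is 0); count false rejections
rejections = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    rejections += p < alpha
print("Type I error rate ~", rejections / n_sims)   # should be close to alpha

# Type II error rate: Ha is true (mean is 0.5); count failures to reject
misses = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.5, scale=1.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    misses += p >= alpha
print("Type II error rate ~", misses / n_sims)
print("Power ~", 1 - misses / n_sims)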

What is AI | ML | NN

Artificial Intelligence is the broad umbrella under which Machine Learning and Deep Learning fall. As the diagram shows, Deep Learning is itself a subset of Machine Learning. So the three form nested subsets: Deep Learning sits inside Machine Learning, which sits inside AI. Let us move on and understand how exactly they differ from each other.

“Algorithms that parse data, learn from that data, and then apply what they’ve learned to make informed decisions”


Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned


Deep learning structures algorithms in layers to create an “artificial neural network” that can learn and make intelligent decisions on its own

Deep learning is a subfield of machine learning. While both fall under the broad category of artificial intelligence, deep learning is what powers the most human-like artificial intelligence



Monday, June 17, 2019

Bias-variance strategies in neural networks




  • High bias corresponds to underfitting; high variance corresponds to overfitting
  • In neural networks, there are a few standard strategies to counter bias and variance issues:


  • High Bias/Underfitting:

  1. Bigger network: add more layers and increase the number of nodes per layer to counter underfitting
  2. More epochs/train longer: increase the number of passes over the entire dataset to counter underfitting
  3. Different NN architecture: try a different network architecture



  • High Variance/Overfitting:

  1. Get more data: increase the number of data points collected
  2. Make the model simpler: simplify the model with fewer layers and fewer nodes
  3. Regularization: add L1/L2 regularization to the weights
  4. Dropout: some percentage of nodes is randomly turned off (not trained) on each pass; generally set to 10%-30% (see the sketch below)
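As an illustration, a minimal Keras sketch combining L2 weight regularization and dropout (assuming a TensorFlow/Keras setup; the layer sizes, dropout rate, input width, and binary-classification head are assumptions for the example):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(20,)),                                  # assumed: 20 input features
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),    # L2 penalty on weights
    layers.Dropout(0.2),                                       # ~20% of nodes dropped each step
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),                     # assumed binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])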

Sunday, June 16, 2019

Undo accidentally deleted cells in Jupyter notebook

If you need to undo something deleted inside a cell: CTRL/CMD+Z

If you need to recover an entire deleted cell: press ESC (to enter command mode), then Z

Tuesday, June 11, 2019

Crack Data Science (Must know concepts)

Languages: Python or R
R vs Python

Get a good grip on the Python language; it eases and helps in many ways.

I prefer Python because of its:

  • Speed
  • Syntax
  • Ease of use in production systems

Data Structures in python

  • Arrays
  • List
  • DataFrame
  • Dictionary
  • Tuples
  • Sets

Key packages in python

  • Numpy
  • Pandas
  • Matplotlib
  • Scipy
  • Scikit-learn
  • Pytorch (Neural Networks)
  • and others
Some must-know concepts

Broadcasting (NumPy)
Distributions - Normal, T-distribution, Bernoulli
Hypothesis testing
Central limit theorem
Supervised vs Unsupervised learning
Loss functions
Bias / Variance tradeoff
Missing data analysis - Imputation methods
Linear Algebra
What does reshape(-1, 1) do? (see the sketch below)
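A quick illustration of reshape(-1, 1): the -1 tells NumPy to infer that dimension, turning a flat array into a single-column 2-D array (the sample values are arbitrary):

import numpy as np

a = np.array([1, 2, 3, 4])
print(a.shape)            # (4,)
col = a.reshape(-1, 1)    # -1 lets NumPy infer the number of rows
print(col.shape)          # (4, 1)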

Dot product
Validation techniques
k-fold validation
Stratified k fold
Stacking

Linear Regression
Logistic Regression
Tree based models

Adjusted R-squared
P-value explanation
Storing model in pickle file
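A minimal sketch of storing a fitted model in a pickle file (the dataset and model here are placeholders for illustration):

import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)   # placeholder data
model = LogisticRegression(max_iter=1000).fit(X, y)

with open("model.pkl", "wb") as f:    # serialize the fitted model
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:    # load it back later
    loaded_model = pickle.load(f)
print(loaded_model.predict(X[:5]))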


Hyper parameters
Model parameters

GridSearchCV
RandomizedSearchCV
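For example, a minimal GridSearchCV sketch (the model, dataset, and parameter grid are illustrative; RandomizedSearchCV is used the same way, sampling from the grid instead of trying every combination):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)   # placeholder data
param_grid = {"C": [0.01, 0.1, 1, 10]}                      # hyperparameter grid
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)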

Regularization
ROC/AUC
Lasso - L1
Ridge - L2
Elastic Net


Decision Trees
Entropy
Information gain
CART

Ensemble models

Bagging - Random Forest
Boosting - XGBoost, Lightgbm
Stacking

PCA
Factor analysis


Friday, June 7, 2019

Engineering missing values (NA) categorical variables


  • Frequent category imputation

def impute_na(df_train, df_test, variable):
    # most frequent category, computed on the training set only
    most_frequent_category = df_train.groupby([variable])[variable].count().sort_values(ascending=False).index[0]
    # fill NA in both train and test with that category
    df_train[variable].fillna(most_frequent_category, inplace=True)
    df_test[variable].fillna(most_frequent_category, inplace=True)

  • Random sample imputation
def impute_na(df_train, df_test, variable):
    # get the most frequent label and replace NA in train and test set
    most_frequent_category = df_train.groupby([variable])[variable].count().sort_values(ascending=False).index[0]
    df_train[variable+'_frequent'] = df_train[variable].fillna(most_frequent_category)
    df_test[variable+'_frequent'] = df_test[variable].fillna(most_frequent_category)
    
    # random sampling
    df_train[variable+'_random'] = df_train[variable]
    df_test[variable+'_random'] = df_test[variable]
    
    # extract the random sample to fill the na
    random_sample_train = df_train[variable].dropna().sample(df_train[variable].isnull().sum(), random_state=0)
    random_sample_test = df_train[variable].dropna().sample(df_test[variable].isnull().sum(), random_state=0)
    
    # pandas needs to have the same index in order to merge datasets
    random_sample_train.index = df_train[df_train[variable].isnull()].index
    random_sample_test.index = df_test[df_test[variable].isnull()].index
    
    df_train.loc[df_train[variable].isnull(), variable+'_random'] = random_sample_train
    df_test.loc[df_test[variable].isnull(), variable+'_random'] = random_sample_test

  • Adding a variable to capture NA
import numpy as np

def impute_na(df_train, df_test, variable):
    # flag missing entries with an explicit 'Missing' label, keep observed values otherwise
    df_train[variable+'_NA'] = np.where(df_train[variable].isnull(), 'Missing', df_train[variable])
    df_test[variable+'_NA'] = np.where(df_test[variable].isnull(), 'Missing', df_test[variable])

Engineering missing values (NA) in numerical variables


  • Mean and median imputation
median = df[variable].median()   # use df[variable].mean() for mean imputation
df[variable+'_median'] = df[variable].fillna(median)
  • Random sample imputation

    # random sampling
    df[variable+'_random'] = df[variable]
    # extract the random sample to fill the na
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    # pandas needs to have the same index in order to merge datasets
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable+'_random'] = random_sample

  • Adding a variable to capture NA 
X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)   # 1 flags a missing Age value
  • Arbitrary value imputation

def impute_na(df, variable):
    df[variable+'_zero'] = df[variable].fillna(0)
    df[variable+'_hundred']= df[variable].fillna(100)



Thursday, June 6, 2019

Machine learning models sensitive to feature scale


  • Linear and Logistic Regression
  • Neural Networks
  • SVM
  • KNN
  • K-means clustering
  • Linear Discriminant Analysis (LDA)
  • Principal Component Analysis (PCA)
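For the scale-sensitive models above, it is common to standardize features first; a minimal scikit-learn sketch (the dataset is a placeholder):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training data only
X_test_scaled = scaler.transform(X_test)         # reuse the same scaling for test data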

Tree based ML models insensitive to feature scale


  • Classification and Regression Trees
  • Random Forests
  • Gradient Boosted Trees

Commonly used loss functions in Data Science

Loss Functions
  • A function that evaluates how good our model's predictions are
  • Mainly measures the difference between actual and predicted values (Act − Pred) via some function of the errors
  • Many are already implemented in Python's ML library sklearn (e.g., RMSE, binary cross-entropy loss, etc.)


For Regression (e.g., mean squared error):

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

  • $y_i$ = target value for the $i$-th example
  • $\hat{y}_i$ = predicted value for the $i$-th example

For Classification (e.g., binary cross-entropy):

$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[\,y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\,\right]$$

  • $y_i$ = target value (0 or 1) for the $i$-th example
  • $\hat{y}_i$ = predicted probability for the $i$-th example

ROC, AUC
Precision and Recall
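A quick sketch of computing these losses with sklearn (the targets, predictions, and probabilities are made up for illustration):

import numpy as np
from sklearn.metrics import mean_squared_error, log_loss

# regression: RMSE on made-up targets and predictions
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.0])
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))

# binary classification: cross-entropy (log loss) on made-up probabilities
y_true_cls = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])
print("Log loss:", log_loss(y_true_cls, y_prob))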

Image noise comparison methods

 1. Using a reference image technique
    - peak_signal_noise_ratio (PSNR)
    - SSI
 2. Non-reference image technique
    - BRISQUE python pac...