Saturday, October 26, 2019
Wednesday, August 14, 2019
Docker for Data Scientists
Introduction
Docker can be a very powerful tool and you can learn how to use it without going all the way down the rabbit hole. In this guide, you will get just enough Docker knowledge to improve your data science workflow and avoid common pitfalls.
This guide is based on my experiences as an independent consultant, helping data science teams to introduce reproducible and automated machine learning workflows.
What is Docker?
Docker is a container platform for deploying isolated applications on Linux/Windows. It includes a tool chain for creating, sharing, and building upon layered application stacks. Docker also forms the basis for more advanced services such as Docker Swarm from Docker Inc. and Kubernetes from Google.
As a Data Scientist, Why Should You Care?
Automating deployment with Docker containers helps you focus on your work rather than on maintaining complex software dependencies. By freezing the exact state of a deployed system inside an image, you also get easier reproducibility of your work and easier collaboration with your colleagues. Finally, you can use resources like Docker Hub to find pre-built recipes (Dockerfiles) from others that you can copy and build on.
Useful Docker commands
docker run -it --rm --name ds -p 8888:8888 jupyter/datascience-notebook
The command above pulls the pre-built jupyter/datascience-notebook image from Docker Hub (if it is not already present locally) and runs it as a container named "ds", with port 8888 published on the host.
docker ps --- list running containers
docker build --- build an image from a Dockerfile
docker images --- show all downloaded images
docker rm --- remove containers
docker rmi --- remove images
docker rmi -f image-name --- force removal of an image
docker run -it -p 1234:80 --name hello-world
docker --version --- show the installed Docker version
Ctrl+C --- exit and stop a container running in the foreground (with --rm, the container is also removed)
Advantages:
- Return on investment & cost savings
- Standardization & productivity
- Compatibility & maintainability
- Simplicity & faster configurations
- Rapid Deployment
- Continuous Deployment & Testing
- Isolation
- Security
Saturday, August 3, 2019
Thursday, July 25, 2019
Text auto summarise - NLP
Text auto-summarisation NLP solution:
https://github.com/kvsivasankar/Text-AutoSummarize-NLP-Model
It condenses a long paragraph into a summary. You can choose how many sentences the summary should contain.
Get the dataset from Kaggle:
https://www.kaggle.com/uciml/iris
Solution to the problem, GitHub URL:
https://github.com/kvsivasankar/Iris-flower-prdiction-model
Tuesday, July 23, 2019
Monday, July 22, 2019
Important Decision Tree Hyperparameters
Optimization criterion: Gini impurity or entropy (information gain)
Max depth: grow the tree at most d levels deep, where d is counted from the root (top) node
min_samples_split: The minimum number of samples required to split an internal node
min_samples_leaf: The minimum number of samples required to be at a leaf node
max_features: The number of features to consider when looking for the best split
min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to this value
class_weight: Weights associated with classes in the form {class_label: weight}
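As a rough illustration, here is how these hyperparameters can be passed to scikit-learn's DecisionTreeClassifier. The dataset (Iris) and every value below are my own arbitrary choices for the sketch, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset, used here only for illustration
X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion="gini",            # or "entropy" for information gain
    max_depth=3,                 # limit how deep the tree may grow
    min_samples_split=4,         # a node needs at least 4 samples to be split
    min_samples_leaf=2,          # every leaf keeps at least 2 samples
    max_features=None,           # consider all features at each split
    min_impurity_decrease=0.0,   # split on any non-negative impurity decrease
    class_weight="balanced",     # or a dict such as {0: 1, 1: 2, 2: 1}
    random_state=0,
)
clf.fit(X, y)

tree_depth = clf.get_depth()     # never exceeds max_depth
n_leaves = clf.get_n_leaves()
print(tree_depth, n_leaves)
```

Lowering max_depth or raising min_samples_leaf makes the tree simpler, which is the usual first step against overfitting.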
Pros & Cons of Decision Trees
Pros:
Generation of clear human-understandable classification rules, e.g. "if age <25 and is interested in motorcycles, deny the loan". This property is called interpretability of the model.
Decision trees can be easily visualized, i.e. both the model itself (the tree) and prediction for a certain test object (a path in the tree) can "be interpreted".
Fast training and forecasting.
Small number of model parameters.
Supports both numerical and categorical features.
Cons:
The trees are very sensitive to the noise in input data; the whole model could change if the training set is slightly modified (e.g. remove a feature, add some objects). This impairs the interpretability of the model.
The separating border built by a decision tree has its limitations: it consists of hyperplanes perpendicular to one of the coordinate axes, so in practice it is often of lower quality than the borders produced by some other methods.
We need to avoid overfitting by pruning, setting a minimum number of samples in each leaf, or defining a maximum depth for the tree. Note that overfitting is an issue for all machine learning methods.
Instability. Small changes to the data can significantly change the decision tree. This problem is tackled with decision tree ensembles.
The optimal decision tree search problem is NP-complete. Some heuristics are used in practice such as greedy search for a feature with maximum information gain, but it does not guarantee finding the globally optimal tree.
Missing values in the data are difficult to support. Friedman estimated that it took about 50% of the code to support gaps in data in CART (an improved version of this algorithm is implemented in sklearn).
The model can only interpolate but not extrapolate (the same is true for random forests and tree boosting). That is, a decision tree makes constant prediction for the objects that lie beyond the bounding box set by the training set in the feature space.
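The last point can be checked with a small sketch. The data (y = 2x for x from 0 to 9) and the query points are made up for illustration; the regressor is scikit-learn's DecisionTreeRegressor with default settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy training data: y = 2x for x = 0, 1, ..., 9
X_train = np.arange(10, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

inside = reg.predict([[4.0]])[0]     # a point seen during training
outside = reg.predict([[100.0]])[0]  # far beyond the training range

print(inside)   # 8.0
print(outside)  # 18.0, the value of the right-most leaf, not 200.0
```

Beyond x = 9 the tree keeps returning the value of its last leaf, so the linear trend is not continued.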
Sunday, July 14, 2019
Saturday, July 13, 2019
Student classification problem from kaggle
Kaggle dataset to download:
https://www.kaggle.com/aljarah/xAPI-Edu-Data
My solution in my Git repository:
https://github.com/kvsivasankar/StudentClassification/blob/master/StudentAnalysis.ipynb
Tuesday, July 9, 2019
Friday, June 21, 2019
Type I and Type II errors in Data Science
Confusion Matrix
Type I and Type II errors
• Type I error, also known as a “false positive”: the error of rejecting a null hypothesis when it is actually true. In other words, this is the error of accepting an alternative hypothesis (the real hypothesis of interest) when the results can be attributed to chance. Plainly speaking, it occurs when we are observing a difference when in truth there is none (or more specifically - no statistically significant difference). So the probability of making a type I error in a test with rejection region R is $P(R \mid H_0 \text{ is true})$.
• Type II error, also known as a "false negative": the error of not rejecting a null hypothesis when the alternative hypothesis is the true state of nature. In other words, this is the error of failing to accept an alternative hypothesis when you don't have adequate power. Plainly speaking, it occurs when we are failing to observe a difference when in truth there is one. So the probability of making a type II error in a test with rejection region R is $1 - P(R \mid H_a \text{ is true})$. The power of the test is $P(R \mid H_a \text{ is true})$.
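To connect the two error types to a confusion matrix, here is a minimal sketch in plain Python. The labels are made up, and treating label 0 as "the null hypothesis is true" and label 1 as "the alternative is true" is an analogy I am assuming for the example.

```python
def error_counts(y_true, y_pred):
    """Count confusion-matrix cells for binary labels (0 = negative, 1 = positive)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1   # Type I error: a difference is reported but none exists
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1   # Type II error: a real difference is missed
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Made-up labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

counts = error_counts(y_true, y_pred)
type_1_rate = counts["FP"] / (counts["FP"] + counts["TN"])  # false positive rate
type_2_rate = counts["FN"] / (counts["FN"] + counts["TP"])  # false negative rate
power = 1 - type_2_rate

print(counts)                           # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
print(type_1_rate, type_2_rate, power)  # 0.25 0.25 0.75
```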
What is AI | ML | NN
Artificial Intelligence is the broader umbrella under which Machine Learning and Deep Learning come. Deep learning is in turn a subset of machine learning, so the three are nested: deep learning sits inside machine learning, which sits inside AI. So let us move on and understand how exactly they differ from each other.
“Algorithms that parse data, learn from that data, and then apply what they’ve learned to make informed decisions”
Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned
Deep learning structures algorithms in layers to create an “artificial neural network” that can learn and make intelligent decisions on its own
Deep learning is a subfield of machine learning. While both fall under the broad category of artificial intelligence, deep learning is what powers the most human-like artificial intelligence
Tuesday, June 18, 2019
Monday, June 17, 2019
Bias-variance strategies in neural network
- Bias refers to underfitting and variance refers to overfitting
- For neural networks, there are a few standard rules to counter bias and variance issues
- High Bias/Underfitting:
  - Bigger network: Add more layers and more nodes per layer to counter underfitting
  - More epochs/train longer: Increase the number of passes over the entire data to counter underfitting
  - Different NN architecture: Try a different network architecture
- High Variance/Overfitting:
  - Get more data: Increase the number of data points collected
  - Make the model simpler: Simplify the model by using fewer layers and fewer nodes
  - Regularization: Add L1/L2 regularization to the weights
  - Dropout: A fraction of the nodes is randomly turned off (not trained) at each training step; the rate is generally set to 10%-30%
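Two of the overfitting remedies above, dropout and L2 regularization, can be sketched in a few lines of NumPy. This is a simplified illustration, not what a deep learning framework does internally; the function names, the 20% rate and the lambda value are assumptions made for the example.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Randomly zero a fraction `rate` of activations and rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a node is kept
    return activations * mask / keep_prob             # rescale to keep the expected sum

def l2_penalty(weights, lam):
    """L2 regularization term that is added to the loss."""
    return lam * np.sum(weights ** 2)

rng = np.random.default_rng(0)
activations = np.ones((4, 5))
dropped = inverted_dropout(activations, rate=0.2, rng=rng)  # entries are 0.0 or 1.25

penalty = l2_penalty(np.array([1.0, -2.0, 3.0]), lam=0.01)  # 0.01 * (1 + 4 + 9)
print(dropped)
print(penalty)
```

Dropout is applied only while training; at prediction time all nodes are used.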
Sunday, June 16, 2019
Undo accidentally deleted cells in Jupyter notebook
If you need to undo something deleted inside a cell: CTRL/CMD+Z
If you need to recover an entire deleted cell: press ESC (to enter command mode), then Z
Tuesday, June 11, 2019
Crack Data Science (Must know concepts)
Languages : Python or R
R vs Python
Get a good grip on the Python language. It helps in many ways
I prefer Python because of its
- Speed
- Syntax
- Ease of use in production systems
Data Structures in Python
- Arrays
- List
- DataFrame
- Dictionary
- Tuples
- Sets
Key packages in Python
- NumPy
- Pandas
- Matplotlib
- SciPy
- Scikit-learn
- PyTorch (Neural Networks)
- and others
Broadcasting
Distributions - Normal, T-distribution, Bernoulli
Hypothesis testing
Central limit theorem
Supervised vs Unsupervised
Loss functions
Bias / Variance tradeoff
Missing data analysis - Imputation methods
Linear Algebra
what is reshape(-1,1)
Dot product
Validation techniques
k-fold validation
Stratified k fold
Stacking
Linear Regression
Logistic Regression
Tree based models
Adjusted R-squared
P-value explanation
Storing model in pickle file
Hyper parameters
Model parameters
GridSearchCV
RandomizedSearchCV
Regularization
ROC/AUC
Lasso - L1
Ridge - L2
Elastic Net
Decision Trees
Entropy
Information gain
CART
ensemble models
Bagging - Random Forest
Boosting - XGBoost, LightGBM
Stacking
PCA
Factor analysis
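A few items from this list (reshape(-1,1), broadcasting and the dot product) are easiest to remember from a tiny NumPy example. The arrays below are arbitrary values chosen for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])

# reshape(-1, 1): "-1" lets NumPy infer the number of rows, "1" fixes one column
column = a.reshape(-1, 1)
print(column.shape)  # (3, 1)

# Broadcasting: a (3, 1) column plus a (2,) row is stretched to a (3, 2) grid
row = np.array([10, 20])
grid = column + row
print(grid)
# [[11 21]
#  [12 22]
#  [13 23]]

# Dot product: 1*4 + 2*5 + 3*6
dot = np.dot(a, np.array([4, 5, 6]))
print(dot)  # 32
```

reshape(-1, 1) is what scikit-learn expects when a single feature is passed as a 1-D array.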
Friday, June 7, 2019
Engineering missing values (NA) in categorical variables
- Frequent category imputation
def impute_na(df_train, df_test, variable):
    # find the most frequent category, using the training set only
    most_frequent_category = df_train[variable].value_counts().index[0]
    # assign the result back; inplace=True on a column selection may not update the DataFrame
    df_train[variable] = df_train[variable].fillna(most_frequent_category)
    df_test[variable] = df_test[variable].fillna(most_frequent_category)
- Random sample imputation
def impute_na(df_train, df_test, variable):
    # get the most frequent label and replace NA in train and test set
    most_frequent_category = df_train[variable].value_counts().index[0]
    df_train[variable+'_frequent'] = df_train[variable].fillna(most_frequent_category)
    df_test[variable+'_frequent'] = df_test[variable].fillna(most_frequent_category)
    # random sampling
    df_train[variable+'_random'] = df_train[variable]
    df_test[variable+'_random'] = df_test[variable]
    # extract the random sample to fill the na (always drawn from the train set)
    random_sample_train = df_train[variable].dropna().sample(df_train[variable].isnull().sum(), random_state=0)
    random_sample_test = df_train[variable].dropna().sample(df_test[variable].isnull().sum(), random_state=0)
    # pandas needs to have the same index in order to merge datasets
    random_sample_train.index = df_train[df_train[variable].isnull()].index
    random_sample_test.index = df_test[df_test[variable].isnull()].index
    df_train.loc[df_train[variable].isnull(), variable+'_random'] = random_sample_train
    df_test.loc[df_test[variable].isnull(), variable+'_random'] = random_sample_test
- Adding a variable to capture NA
import numpy as np  # needed for np.where

def impute_na(df_train, df_test, variable):
    # label missing values with an explicit 'Missing' category
    df_train[variable+'_NA'] = np.where(df_train[variable].isnull(), 'Missing', df_train[variable])
    df_test[variable+'_NA'] = np.where(df_test[variable].isnull(), 'Missing', df_test[variable])
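Here is a small self-contained run of the frequent-category and 'Missing'-label ideas with pandas. The column name "colour" and the values are made up for illustration, and I use mode() as one possible way to get the most frequent category.

```python
import pandas as pd

# Hypothetical train/test frames with missing categories
train = pd.DataFrame({"colour": ["red", "blue", "red", None, "red"]})
test = pd.DataFrame({"colour": [None, "blue"]})

# Frequent category imputation: learn the category from train, apply it to both sets
most_frequent = train["colour"].mode()[0]
train["colour_frequent"] = train["colour"].fillna(most_frequent)
test["colour_frequent"] = test["colour"].fillna(most_frequent)

# Capture NA as its own category
train["colour_NA"] = train["colour"].fillna("Missing")
test["colour_NA"] = test["colour"].fillna("Missing")

print(most_frequent)                    # red
print(test["colour_frequent"].tolist()) # ['red', 'blue']
print(train["colour_NA"].tolist())      # ['red', 'blue', 'red', 'Missing', 'red']
```

The category is always learned from the training set and then reused on the test set, so no information leaks from test to train.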
Engineering missing values (NA) in numerical variables
- Mean and Median imputation
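# compute the median on the training set (use .mean() instead for mean imputation)
median = df[variable].median()
df[variable+'_median'] = df[variable].fillna(median)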
- Random sample imputation
# random sampling
df[variable+'_random'] = df[variable]
# extract the random sample to fill the na
random_sample = X_train[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
# pandas needs to have the same index in order to merge datasets
random_sample.index = df[df[variable].isnull()].index
df.loc[df[variable].isnull(), variable+'_random'] = random_sample
- Adding a variable to capture NA
X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)
- Arbitrary value imputation
def impute_na(df, variable):
    df[variable+'_zero'] = df[variable].fillna(0)
    df[variable+'_hundred'] = df[variable].fillna(100)
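Median imputation and the NA indicator are often used together. A minimal sketch with pandas, where the "Age" values and the X_test frame are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames with missing ages
X_train = pd.DataFrame({"Age": [22.0, 35.0, np.nan, 58.0, 41.0]})
X_test = pd.DataFrame({"Age": [np.nan, 30.0]})

# Learn the median from the training set only
median = X_train["Age"].median()

for frame in (X_train, X_test):
    # indicator column: 1 where the value was missing, else 0
    frame["Age_NA"] = frame["Age"].isnull().astype(int)
    # fill the gaps with the training median
    frame["Age_median"] = frame["Age"].fillna(median)

print(median)                         # 38.0
print(X_train["Age_NA"].tolist())     # [0, 0, 1, 0, 0]
print(X_test["Age_median"].tolist())  # [38.0, 30.0]
```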
Thursday, June 6, 2019
Machine learning models sensitive to feature scale
- Linear and Logistic Regression
- Neural Networks
- SVM
- KNN
- K-means clustering
- Linear Discriminant Analysis (LDA)
- Principal Component Analysis (PCA)
Tree based ML models insensitive to feature scale
- Classification and Regression Trees
- Random Forests
- Gradient Boosted Trees
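To see why distance-based models such as KNN care about feature scale, here is a small NumPy sketch. The four (age, income) rows are made up for illustration, and the standardization is done by hand rather than with a library scaler.

```python
import numpy as np

# Columns: age (years), income (currency units); values are made up
X = np.array([
    [25, 50_000],
    [26, 51_000],
    [60, 50_100],
    [40, 60_000],
], dtype=float)

def nearest_to_first(data):
    """Row index of the nearest neighbour (Euclidean) of the first row."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

# Standardize each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_nn = nearest_to_first(X)            # income dominates the distance
scaled_nn = nearest_to_first(X_scaled)  # both features now contribute

print(raw_nn)     # 2 (the 60-year-old with a similar income)
print(scaled_nn)  # 1 (the 26-year-old)
```

A tree-based model only compares one feature with a threshold at each split, so rescaling a feature moves the threshold but does not change the resulting partition.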
Commonly used loss functions in Data Science
Loss Functions
- A function that evaluates the goodness of our models
- Mainly evaluates the difference between the actual and predicted values, Y − Ŷ (Act - Pred), by applying some function to it
- Many are already implemented in Python's ML library sklearn (e.g. RMSE, binary cross-entropy loss, etc.)
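The two losses named above can be written out by hand to see what they compute. This is a plain-Python sketch with made-up numbers; the function names and the eps clipping value are my own choices, not sklearn's API.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((Y - Y_hat)^2))."""
    squared = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    """Mean of -(y*log(p) + (1-y)*log(1-p)); p is clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Regression example: squared errors are 0.25, 0.25, 0 and 1
reg_loss = rmse([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])

# Classification example: both predictions are 90% confident and correct
clf_loss = binary_cross_entropy([1, 0], [0.9, 0.1])

print(round(reg_loss, 4))  # 0.6124
print(round(clf_loss, 4))  # 0.1054
```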