Monday, July 22, 2019

Decision Tree: Important Hyperparameters


  • criterion (optimization criterion): Gini impurity or entropy/information gain
  • max_depth: Build trees at most d levels deep, where d is the number of node levels counted from the root down
  • min_samples_split: The minimum number of samples required to split an internal node
  • min_samples_leaf: The minimum number of samples required to be at a leaf node
  • max_features: The number of features to consider when looking for the best split
  • min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to this value
  • class_weight: Weights associated with classes in the form {class_label: weight}
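
A minimal sketch of how these hyperparameters map onto scikit-learn's DecisionTreeClassifier; the dataset and the specific values below are illustrative, not recommendations:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(
    criterion="gini",            # or "entropy" for information gain
    max_depth=4,                 # limit tree depth to control overfitting
    min_samples_split=10,        # need at least 10 samples to split a node
    min_samples_leaf=5,          # each leaf must contain at least 5 samples
    max_features=None,           # consider all features at each split
    min_impurity_decrease=0.0,   # split only if impurity drops by at least this
    class_weight="balanced",     # or a dict like {0: 1.0, 1: 2.0}
    random_state=42,
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))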

Pros & Cons of Decision Trees

Pros:
  • Generation of clear human-understandable classification rules, e.g. "if age <25 and is interested in motorcycles, deny the loan". This property is called interpretability of the model.
  • Decision trees can be easily visualized: both the model itself (the tree) and the prediction for a given test object (a path through the tree) can be interpreted.
  • Fast training and prediction.
  • Small number of model parameters.
  • Supports both numerical and categorical features.

Cons:
  • Trees are very sensitive to noise in the input data; the whole model can change if the training set is modified slightly (e.g., a feature is removed or a few objects are added). This impairs the interpretability of the model.
  • The decision boundary built by a decision tree has its limitations: it consists of hyperplanes perpendicular to the coordinate axes, which in practice is often inferior in quality to other methods.
  • We need to avoid overfitting by pruning, setting a minimum number of samples in each leaf, or defining a maximum depth for the tree. Note that overfitting is an issue for all machine learning methods.
  • Instability: small changes to the data can significantly change the decision tree. This problem is tackled with decision tree ensembles.
  • The optimal decision tree search problem is NP-complete. In practice, heuristics such as a greedy search for the feature with maximum information gain are used, but they do not guarantee finding the globally optimal tree.
  • Difficulty supporting missing values in the data. Friedman estimated that about 50% of the CART code was devoted to handling gaps in the data (an improved version of this algorithm is implemented in sklearn).
  • The model can only interpolate, not extrapolate (the same is true for random forests and tree boosting). That is, a decision tree makes a constant prediction for objects that lie beyond the bounding box set by the training set in feature space, as the sketch below illustrates.
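
A small illustrative sketch of this interpolation-only behaviour, using a DecisionTreeRegressor on a made-up 1-D dataset (the data and values are toy examples):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D training data on [0, 10] following y = 2x + 1
X_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = 2 * X_train.ravel() + 1

tree = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)

# inside the training range the tree interpolates reasonably well...
print(tree.predict([[5.0]]))    # close to 11

# ...but beyond the training range it returns a constant value
print(tree.predict([[20.0]]))   # roughly the same as predict([[10.0]])
print(tree.predict([[100.0]]))  # still the same constant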


Tuesday, July 9, 2019

Need to split a string into multiple columns?

Use the str.split() method with expand=True to return a DataFrame, and assign the result to new columns of the original DataFrame, as in the sketch below.
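
A minimal sketch with pandas; the DataFrame and column names are made up for illustration:

import pandas as pd

df = pd.DataFrame({"name": ["John Smith", "Jane Doe"]})

# expand=True returns a DataFrame instead of a Series of lists,
# so it can be assigned directly to multiple new columns
df[["first", "last"]] = df["name"].str.split(" ", expand=True)
print(df)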


Image noise comparison methods

 1. Reference-image (full-reference) techniques
     - peak_signal_noise_ratio (PSNR)
     - structural similarity index (SSIM)
 2. Non-reference (no-reference) techniques
     - BRISQUE (Python pac...)
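
A minimal sketch of the full-reference metrics using scikit-image; the test image and noise level are illustrative. BRISQUE is a no-reference metric and needs a separate third-party package, so it is only noted in a comment:

from skimage import data, img_as_float
from skimage.util import random_noise
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# reference image and a noisy version of it
reference = img_as_float(data.camera())
noisy = random_noise(reference, mode="gaussian", var=0.01)

# full-reference metrics: both need the clean image as a baseline
psnr = peak_signal_noise_ratio(reference, noisy, data_range=1.0)
ssim = structural_similarity(reference, noisy, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")

# BRISQUE (no-reference) is not in scikit-image; it requires an
# external package installed separately from PyPI.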