
[Advanced Learning Algorithms] Decision Trees - 5

ghwangbo 2024. 1. 22. 20:28

1. Intuition


Ex) Cat Classification example

We have three features (ear shape, face shape, whiskers) to determine whether a given image of an animal is a cat or not.

 

Terminology

The decision tree algorithm in machine learning borrows the tree data structure from computer science. The top node of the tree is the root node, the nodes at the very bottom are the leaf nodes, and every node in between is a decision node.

 

2. Building a Decision Tree


Then how are we going to build a decision tree to determine if an animal is a cat or not?
In other words, how are we going to efficiently build a decision tree to solve the classification problem with high accuracy?

First, we need a way to quantify how well a split separates the dataset. We measure the purity of a node, that is, how close its data is to belonging to a single class, using entropy, which is a measure of impurity.

 

Step 1: Decide what feature to use for the root node
Step 2: Decide what feature to use at each subsequent node, using only the subset of examples that reaches it


Decision 1: How to choose what feature to split on at each node?

  • By choosing the split that maximizes purity (or, equivalently, minimizes impurity)

Purity: how close the data in a node is to being all cats or all not-cats

 

Decision 2: When do we stop splitting?

  • When a node consists of a single class (100% purity)
  • When splitting a node would make the tree exceed a maximum depth
    • We set a maximum depth to keep the tree from growing too deep and large, which could cause overfitting
  • When the improvement in purity score is below a threshold
  • When the number of examples in a node is below a threshold

3. Decision Tree Learning


3.1 How to measure purity

Measuring Purity

- We use the entropy function

- p1 = the fraction of examples that are cats

 

Let p0 be the fraction of examples that are NOT cats.

p0 = 1 - p1

 

H(p1) = -p1 * log2(p1) - p0 * log2(p0)

      = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)

 

If we graph this function for p1 between 0 and 1, it is 0 at p1 = 0 and p1 = 1 (a completely pure node) and reaches its maximum of 1 at p1 = 0.5 (a 50/50 mix).
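Below is a minimal NumPy sketch of this entropy function (the function name, the edge-case handling for p1 = 0 or 1, and the example values are my own choices for illustration):

```python
import numpy as np

def entropy(p1):
    """Entropy of a node where p1 is the fraction of positive (cat) examples."""
    # By convention 0 * log2(0) is treated as 0, so a pure node has entropy 0.
    if p1 == 0 or p1 == 1:
        return 0.0
    p0 = 1 - p1
    return -p1 * np.log2(p1) - p0 * np.log2(p0)

print(entropy(0.5))  # 1.0 -> maximally impure (half cats, half not cats)
print(entropy(1.0))  # 0.0 -> completely pure
```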

3.2 Choosing a split

 

We want to find the feature that gives us the lowest entropy (the measure of impurity) after the split. We compute this by taking the weighted average of the entropy of the split nodes, weighted by the fraction of examples that ends up in each branch. If we subtract this weighted average of the entropy at the leaf nodes from the entropy at the root node, we get a value called information gain.

 

3.3 Information Gain:

= Measures the reduction in entropy achieved by a split

= H(p1_root) - (w_left * H(p1_left) + w_right * H(p1_right)),

where w_left and w_right are the fractions of the root's examples that go to the left and right branches.
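A small sketch of this formula, reusing the entropy helper from the previous sketch (the underscore variable names and the example numbers are mine):

```python
def information_gain(p1_root, p1_left, p1_right, w_left, w_right):
    """Reduction in entropy from splitting a node into left/right branches.

    w_left and w_right are the fractions of the node's examples that go
    to the left and right branches (they sum to 1).
    """
    weighted_child_entropy = w_left * entropy(p1_left) + w_right * entropy(p1_right)
    return entropy(p1_root) - weighted_child_entropy

# Example: 10 animals (5 cats) split so that the left branch gets 5 animals
# (4 cats) and the right branch gets 5 animals (1 cat).
print(information_gain(p1_root=0.5, p1_left=0.8, p1_right=0.2,
                       w_left=0.5, w_right=0.5))  # ≈ 0.28
```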

 

Decision Tree Learning Process

  • Start with all examples at the root node
  • Calculate information gain for all possible features, and pick the one with the highest information gain
  • Split dataset according to selected feature, and create left and right branches of the tree
  • Keep repeating the splitting process until a stopping criterion is met:
    • When a node is 100% one class
    • When splitting a node would make the tree exceed the maximum depth
    • When the information gain from additional splits is below a threshold
    • When the number of examples in a node is below a threshold

The key is recursively finding the feature to use for splitting by computing the information gain of each feature.   
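To make the whole loop concrete, here is a simplified recursive sketch under some assumptions of my own: binary (0/1) features, a dictionary-based tree representation, and the entropy/information_gain helpers from the sketches above. It is not the lecture's reference implementation.

```python
import numpy as np

def build_tree(X, y, depth=0, max_depth=3, min_examples=2):
    """X: (m, n) array of binary features, y: (m,) array of binary labels (1 = cat)."""
    # Stopping criteria: pure node, maximum depth reached, or too few examples.
    if len(np.unique(y)) == 1 or depth == max_depth or len(y) < min_examples:
        return {"leaf": True, "prediction": int(round(y.mean()))}

    # Pick the feature with the highest information gain (uses the helpers above).
    best_feature, best_gain = None, 0.0
    for f in range(X.shape[1]):
        left, right = y[X[:, f] == 1], y[X[:, f] == 0]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = information_gain(y.mean(), left.mean(), right.mean(),
                                len(left) / len(y), len(right) / len(y))
        if gain > best_gain:
            best_feature, best_gain = f, gain

    if best_feature is None:  # no split improves purity -> make a leaf
        return {"leaf": True, "prediction": int(round(y.mean()))}

    mask = X[:, best_feature] == 1
    return {
        "leaf": False,
        "feature": best_feature,
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_examples),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_examples),
    }
```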

 

3.4 Features with more than 2 values.

- If a categorical feature can take on more than two values, we can represent it with one-hot encoding.

One-hot encoding turns each value into a vector of binary features, exactly one of which is 1.
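For example, if ear shape can be pointy, floppy, or oval, a one-hot encoding could look like this sketch (the column order is my own choice):

```python
import numpy as np

ear_shapes = ["pointy", "floppy", "oval", "pointy"]
categories = ["pointy", "floppy", "oval"]

# Each value becomes a vector with a single 1 in the column of its category.
one_hot = np.array([[1 if value == c else 0 for c in categories]
                    for value in ear_shapes])
print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]
```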

 

Then how do we handle a continuous-valued feature?

To split on a continuous feature, we have to choose a threshold that separates the examples into two groups. We consider candidate thresholds and pick the one that gives the highest information gain.
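A possible sketch of this search, trying the midpoints between consecutive sorted values as candidate thresholds and reusing the information_gain helper above (the weight/label toy data and the function name are my own illustration):

```python
import numpy as np

def best_threshold(values, y):
    """Return the threshold with the highest information gain for one feature."""
    order = np.argsort(values)
    values, y = np.asarray(values)[order], np.asarray(y)[order]
    best_t, best_gain = None, 0.0
    for i in range(len(values) - 1):
        t = (values[i] + values[i + 1]) / 2  # midpoint between neighboring values
        left, right = y[values <= t], y[values > t]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = information_gain(y.mean(), left.mean(), right.mean(),
                                len(left) / len(y), len(right) / len(y))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy example: animal weights (lbs) and cat labels.
weights = [7, 8, 9, 9, 13, 15, 18, 20]
labels  = [1, 1, 1, 1, 0, 1, 0, 0]
print(best_threshold(weights, labels))
```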

 

Regression with Decision Trees: Predicting a Number

Choosing a split

- Unlike classification problems, which use impurity (entropy) to choose the feature to split on, regression problems use the variance of the target values.

- We compute the weighted average variance of the child nodes and pick the split that gives the greatest reduction in variance.
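A minimal sketch of this variance criterion (the function name and toy numbers are mine; it plays the same role here that information gain plays for classification):

```python
import numpy as np

def variance_reduction(y_root, y_left, y_right):
    """How much the weighted variance of the children drops relative to the parent."""
    w_left = len(y_left) / len(y_root)
    w_right = len(y_right) / len(y_root)
    weighted_child_variance = w_left * np.var(y_left) + w_right * np.var(y_right)
    return np.var(y_root) - weighted_child_variance

# Toy example: predicting animal weight, split by some feature.
root = np.array([7.2, 8.4, 9.2, 10.2, 13.4, 15.0, 18.8, 20.0])
left, right = root[:5], root[5:]
print(variance_reduction(root, left, right))
```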

 

4. Tree Ensembles


4.1 Intuition

A single decision tree is very sensitive to the training data: changing even one example can change the entropy and information gain calculations and produce a completely different tree.

- We train multiple trees so that the overall model is less sensitive to the data, and we call this collection a tree ensemble.

 

4.2 Creating Tree Ensemble

Sampling with replacement

- This is a method for creating a new dataset.

- The idea is to randomly draw one example at a time from the original dataset; after an example is drawn, it is put back, so it can be drawn again. We repeat this until the new dataset has the same size as the original. This sampling method is called sampling with replacement.
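A one-line NumPy sketch of sampling with replacement over example indices (the index-based approach and the seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10  # size of the original training set

# Draw m indices with replacement: some examples repeat, some never appear.
indices = rng.choice(m, size=m, replace=True)
print(indices)
# new_X, new_y = X[indices], y[indices]  # the resampled training set
```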

 

Bagged tree algorithm

for b = 1 to B

    Use sampling with replacement to create a new training set of size m.

    Train a decision tree on the new dataset
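A runnable sketch of this loop, assuming scikit-learn's DecisionTreeClassifier as the base tree and majority voting for prediction (the helper names, B, and the voting rule are my additions, not the lecture's reference code):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, seed=0):
    """Train B decision trees, each on a bootstrap sample of (X, y) as NumPy arrays."""
    rng = np.random.default_rng(seed)
    m = len(y)
    trees = []
    for _ in range(B):
        idx = rng.choice(m, size=m, replace=True)  # sampling with replacement
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_ensemble(trees, X):
    """Majority vote over all trees in the ensemble."""
    votes = np.array([tree.predict(X) for tree in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```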

 

- With plain bagging, the trees often end up using the same feature at the root node and at a few nodes near the root, so the trees look very similar.

- We can randomize the feature choice by allowing the algorithm to choose only from a random subset of the features at each node (a typical choice is about √n of the n features).

- This algorithm is called the Random Forest algorithm.
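In practice this is rarely implemented by hand; scikit-learn's RandomForestClassifier combines the bootstrap sampling with the random feature subset at each split. The parameter values below are illustrative, not from the lecture:

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees, each split considering only a random subset of ~sqrt(n) features.
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
```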

Boosting tree algorithm

 

Intuition: instead of building each new training set by picking examples uniformly at random, we use sampling with replacement that is more likely to pick examples the previously trained trees misclassified. Each new tree therefore focuses on the examples the ensemble is still getting wrong, which makes learning more efficient.

 

Ex) XGBoost (eXtreme Gradient Boosting)
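A typical way to use it from Python, assuming the xgboost package is installed (the hyperparameter values are illustrative, not from the lecture):

```python
from xgboost import XGBClassifier

# Boosted classification trees; XGBRegressor is the regression counterpart.
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
```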

5. Decision Trees vs. Neural Networks


Decision Tree

- Works well on structured (tabular) data

- Not recommended for unstructured data (images, audio, text)

- Fast

- Small decision trees may be human interpretable

Neural Networks

- Works well on all types of data, including structured and unstructured data

- May be slower to train than a decision tree

- Works with transfer learning

- When building a system of multiple models working together, it might be easier to string together multiple neural networks

 

 
