
CH 2 ML with scikit-learn

ghwangbo 2025. 2. 12. 23:13

1. Intro


First ML project - Predicting the species of iris flowers

 

We use classification, a typical supervised learning method, to predict the species of an iris flower.

We are also going to use a decision tree as the algorithm.

 

 

Dataset

 

We first have to take a glance at the dataset.

It has four features (sepal length, sepal width, petal length, petal width) and a label. The label represents the species of the sample and takes the values 0, 1, and 2, each representing a different species: Setosa, Versicolor, and Virginica.
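As a rough sketch of that first glance (assuming scikit-learn's built-in load_iris dataset and the variable names iris_data / iris_label used below), we can load the data and view it as a pandas DataFrame:

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_data = iris.data      # feature matrix: sepal/petal length and width
iris_label = iris.target   # labels: 0 (Setosa), 1 (Versicolor), 2 (Virginica)

# View the features and label together as a DataFrame
iris_df = pd.DataFrame(data=iris_data, columns=iris.feature_names)
iris_df['label'] = iris_label
print(iris_df.head(3))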

 

Splitting Dataset

 

We have to split the dataset into training and testing set in order to determine the performance of the trained model. 

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_label, test_size=0.2, random_state=11)

We use the train_test_split() function to split the dataset.

The test_size parameter determines the proportion of the whole dataset that is used as the test set.

random_state is an arbitrary integer we choose so that we get the same fixed train/test split every time.

 

Training and testing the model

 

Then we are going to use an algorithm to train the model and test its performance.

from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=11)
dt_clf.fit(X_train, y_train)     # train the decision tree on the training set
pred = dt_clf.predict(X_test)    # predict labels for the test set
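To quantify the performance, a common next step (a minimal sketch, not part of the snippet above) is to compare the predictions with the true test labels using accuracy_score:

from sklearn.metrics import accuracy_score

# Fraction of test samples whose predicted species matches the true label
print('Prediction accuracy: {0:.4f}'.format(accuracy_score(y_test, pred)))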

 

This is the basic process of classification using an ML algorithm.

 

Understanding Scikit-learn's dataset

 

The dataset you load from scikit-learn is a dictionary-like object. You first have to take a look at its keys to understand the dataset.
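A quick sketch of that inspection, assuming the iris dataset loaded with load_iris():

from sklearn.datasets import load_iris

iris = load_iris()
# A dictionary-like Bunch object; typical keys include data, target, target_names, feature_names, DESCR
print(iris.keys())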

 

2. Model Selection Modules


 

Cross Validation

 

We have previously split the dataset into training and testing sets. However, evaluating and tuning the model against a single fixed test set can cause an overfitting problem, in which the trained model gets lower accuracy on new data it has not seen. To avoid this situation, we use cross-validation.

How?
We further split the training dataset into training and validation sets. A common method is K-fold cross-validation.

 

K Fold Cross Validation

 

In this method, the dataset is split into K parts (folds). The model is trained on (K - 1) folds, and the remaining fold that was not used for training is used to validate the performance. On the next turn, a different fold becomes the validation set. This is repeated K times, and the final cross-validation result is the average of those K validation accuracies.
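A minimal sketch with scikit-learn's KFold, assuming the iris_data, iris_label, and dt_clf objects defined earlier:

import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)
cv_accuracy = []

# Each iteration yields index arrays for the training folds and the validation fold
for train_index, val_index in kfold.split(iris_data):
    X_tr, X_val = iris_data[train_index], iris_data[val_index]
    y_tr, y_val = iris_label[train_index], iris_label[val_index]
    dt_clf.fit(X_tr, y_tr)
    cv_accuracy.append(dt_clf.score(X_val, y_val))

# The cross-validation result is the average of the K validation accuracies
print('Average accuracy:', np.mean(cv_accuracy))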

 

Stratified K Fold

 

If the dataset is split into K parts without any conditions, a training fold might not be a representative sample of the original dataset, for example because the class labels in a fold are badly imbalanced. Therefore, the splits should keep the label distribution balanced. Stratified K Fold fixes this issue by splitting the dataset so that each fold preserves the label proportions.
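A rough sketch of StratifiedKFold; unlike KFold, its split() also needs the labels so that each fold keeps the class proportions (again assuming iris_data and iris_label from earlier):

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)

# Each validation fold now contains roughly the same number of samples from every class
for train_index, val_index in skf.split(iris_data, iris_label):
    print('Validation fold label counts:', np.bincount(iris_label[val_index]))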

 

GridSearchCV

 

By using GridSearchCV, we can find the optimal hyperparameters for the model while cross-validating each candidate combination.
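A minimal sketch of GridSearchCV on the decision tree from earlier; the parameter grid below is just an illustrative assumption, and X_train / y_train are the training data from the split above:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical grid of hyperparameter values to try
param_grid = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}

grid_dt = GridSearchCV(DecisionTreeClassifier(), param_grid=param_grid, cv=3)
grid_dt.fit(X_train, y_train)   # runs cross-validation for every parameter combination

print('Best parameters:', grid_dt.best_params_)
print('Best CV accuracy:', grid_dt.best_score_)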

 

3. Data Preprocessing


Data preprocessing is crucial in machine learning. To get a highly accurate model, we have to clean up the data. For example, we replace NULL values with reasonable ones, or drop a feature that has too many NULL values. Also, string values must be converted into numbers before most algorithms can use them; turning such values into numeric vectors is called feature vectorization.
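For example, a rough pandas sketch of handling NULL values (the column names and values here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'Age': [22.0, None, 35.0], 'Cabin': [None, None, 'C85']})

# Replace missing Age values with a reasonable value (here, the mean)
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Drop a feature that has too many missing values
df = df.drop(columns=['Cabin'])
print(df)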

 

3.1 Data Encoding

There are two popular methods of encoding: label encoding and one-hot encoding.

 

Label Encoding

- Label encoding substitutes each string category with an integer code that represents the string.

- The size of the integer code has no real meaning, but some algorithms may mistakenly treat it as a magnitude or ordering.

- For this reason it is usually not applied to algorithms that are sensitive to numeric magnitude; rather, it is used with tree-based algorithms, which are not affected by the code values.
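A minimal sketch of scikit-learn's LabelEncoder (the item names are made up):

from sklearn.preprocessing import LabelEncoder

items = ['TV', 'Fridge', 'Microwave', 'TV', 'Fan']

encoder = LabelEncoder()
labels = encoder.fit_transform(items)   # each string is replaced by an integer code

print('Encoded labels:', labels)
print('Classes:', encoder.classes_)     # which string each integer code represents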

 

One Hot Encoding

- One-hot encoding represents each string category as a vector that has the value 1 only in the position corresponding to that category and 0 everywhere else.
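A rough sketch with scikit-learn's OneHotEncoder, using the same made-up item names (the input must be a 2-D array, hence the reshape):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

items = np.array(['TV', 'Fridge', 'Microwave', 'TV', 'Fan']).reshape(-1, 1)

encoder = OneHotEncoder()
onehot = encoder.fit_transform(items)   # sparse matrix with a 1 only in each item's own column

print(onehot.toarray())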

 

 

3.2 Feature Scaling 

Two popular methods are standardization and normalization.

 

Standardization

- Standardization transforms the data so that each feature has a mean of 0 and a standard deviation of 1.

 

StandardScaler is a class that standardizes data.
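A rough sketch of StandardScaler on the iris features; after fit_transform, each feature has mean 0 and standard deviation 1:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris_data = load_iris().data

scaler = StandardScaler()
iris_scaled = scaler.fit_transform(iris_data)

print('Feature means:', iris_scaled.mean(axis=0).round(3))   # ~0 for every feature
print('Feature stds: ', iris_scaled.std(axis=0).round(3))    # ~1 for every feature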

 

Normalization

- Normalization is to normalize features with different sizes. For example, a feature A representing the distance, ranging from 0 ~ 100km and a feature B, representing money, ranging from 0 ~ 100,000,000,000won are given.

To compare these two different features in a single unit, we use normalization

 

MinMaxScaler is a class that normalizes data (by default it rescales each feature to the range [0, 1]).
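A rough sketch of MinMaxScaler on the iris features; each feature ends up in the range [0, 1]:

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris_data = load_iris().data

scaler = MinMaxScaler()
iris_scaled = scaler.fit_transform(iris_data)

print('Feature minimums:', iris_scaled.min(axis=0))   # 0 for every feature
print('Feature maximums:', iris_scaled.max(axis=0))   # 1 for every feature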

 

3.3 Caution when scaling training and test data

 

When we scale both the training and test data, we use the fit and transform functions. These two functions each have a specific role: fit() learns the scaling parameters from the data, and transform() applies the fitted scaler to the data. If we fit the scaler again on the test data, the scaling criterion changes, so the training and test data end up on different scales. The scaler applied to the training and test data must therefore be the same one, fitted on the training data.

 

Solutions to this problem:

- If possible, scale the whole dataset before splitting it into training and test sets.
- Otherwise, reuse the scaler that was fitted on the training data to transform the test data (see the sketch below).
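A minimal sketch of the correct pattern, assuming the X_train / X_test arrays from the earlier split:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# fit_transform() learns the scaling parameters from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# The test data is transformed with the same fitted scaler; do NOT call fit() on it again
X_test_scaled = scaler.transform(X_test)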

 

 

4. Titanic Survival Prediction


https://colab.research.google.com/drive/1YiLnu5kk550pkhpB9lr_l3q4M19jkSSh
