
CH 2 ML with scikit-learn

ghwangbo 2025. 2. 12. 23:13

1. Intro


First ML project - Predicting the species of iris flowers

 

We use classification, a typical supervised learning method, to predict the species of an iris flower.

We are also going to use a decision tree as the algorithm.

 

 

Dataset

 

We first have to take a glance at the dataset.

It has four features (sepal length, sepal width, petal length, petal width) and a label. The label represents the species of the sample and takes the values 0, 1, and 2, each representing a different species: Setosa, Versicolor, and Virginica.
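As a rough sketch of that first glance (assuming scikit-learn's built-in load_iris dataset and the variable names iris_data / iris_label used below), we can load the data and view it as a pandas DataFrame:

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_data = iris.data      # feature matrix: sepal/petal length and width
iris_label = iris.target   # labels: 0 (Setosa), 1 (Versicolor), 2 (Virginica)

# View the features and label together as a DataFrame
iris_df = pd.DataFrame(data=iris_data, columns=iris.feature_names)
iris_df['label'] = iris_label
print(iris_df.head(3))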

 

Splitting Dataset

 

We have to split the dataset into training and testing set in order to determine the performance of the trained model. 

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_label, test_size=0.2, random_state=11)

We use the train_test_split() function to split the dataset.

The test_size parameter determines the proportion of the whole dataset that is used as the test set.

random_state is an arbitrary integer we choose so that we get the same fixed train/test split every time.

 

Training and testing the model

 

Then we are going to use an algorithm to train the model and test its performance.

from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=11)
dt_clf.fit(X_train, y_train)     # train the decision tree on the training set
pred = dt_clf.predict(X_test)    # predict labels for the test set
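To quantify the performance, a common next step (a minimal sketch, not part of the snippet above) is to compare the predictions with the true test labels using accuracy_score:

from sklearn.metrics import accuracy_score

# Fraction of test samples whose predicted species matches the true label
print('Prediction accuracy: {0:.4f}'.format(accuracy_score(y_test, pred)))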

 

This is the basic process of classification using an ML algorithm.

 

Understanding Scikit-learn's dataset

 

The dataset you load from scikit-learn is a dictionary-like object. You first have to take a look at its keys to understand the dataset.
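A quick sketch of that inspection, assuming the iris dataset loaded with load_iris():

from sklearn.datasets import load_iris

iris = load_iris()
# A dictionary-like Bunch object; typical keys include data, target, target_names, feature_names, DESCR
print(iris.keys())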

 

2. Model Selection Modules


 

Cross Validation

 

We have previously split the dataset into training and testing sets. However, evaluating and tuning the model against a single fixed test set can cause an overfitting problem, in which the trained model gets lower accuracy on new data it has not seen. To avoid this situation, we use cross-validation.

How?
We further split the training dataset into training and validation sets. A common method is K-fold cross-validation.

 

K Fold Cross Validation

 

In this method, the dataset is split into K parts (folds). The model is trained on (K - 1) folds, and the remaining fold that was not used for training is used to validate the performance. On the next turn, a different fold becomes the validation set. This is repeated K times, and the final cross-validation result is the average of those K validation accuracies.
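A minimal sketch with scikit-learn's KFold, assuming the iris_data, iris_label, and dt_clf objects defined earlier:

import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)
cv_accuracy = []

# Each iteration yields index arrays for the training folds and the validation fold
for train_index, val_index in kfold.split(iris_data):
    X_tr, X_val = iris_data[train_index], iris_data[val_index]
    y_tr, y_val = iris_label[train_index], iris_label[val_index]
    dt_clf.fit(X_tr, y_tr)
    cv_accuracy.append(dt_clf.score(X_val, y_val))

# The cross-validation result is the average of the K validation accuracies
print('Average accuracy:', np.mean(cv_accuracy))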

 

Stratified K Fold

 

If the dataset is split into K parts without any conditions, a training fold might not be a representative sample of the original dataset, for example because the class labels in a fold are badly imbalanced. Therefore, the splits should keep the label distribution balanced. Stratified K Fold fixes this issue by splitting the dataset so that each fold preserves the label proportions.
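A rough sketch of StratifiedKFold; unlike KFold, its split() also needs the labels so that each fold keeps the class proportions (again assuming iris_data and iris_label from earlier):

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)

# Each validation fold now contains roughly the same number of samples from every class
for train_index, val_index in skf.split(iris_data, iris_label):
    print('Validation fold label counts:', np.bincount(iris_label[val_index]))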

 

GridSearchCV

 

By using GridSearchCV, we can find the optimal hyperparameters for the model while cross-validating each candidate combination.
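A minimal sketch of GridSearchCV on the decision tree from earlier; the parameter grid below is just an illustrative assumption, and X_train / y_train are the training data from the split above:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical grid of hyperparameter values to try
param_grid = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}

grid_dt = GridSearchCV(DecisionTreeClassifier(), param_grid=param_grid, cv=3)
grid_dt.fit(X_train, y_train)   # runs cross-validation for every parameter combination

print('Best parameters:', grid_dt.best_params_)
print('Best CV accuracy:', grid_dt.best_score_)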

 

3. Data Preprocessing


Data preprocessing is crucial in machine learning. To get a highly accurate model, we have to clean up the data. For example, we replace NULL values with reasonable ones, or drop a feature that has too many NULL values. Also, string values must be converted into numbers before most algorithms can use them; turning such values into numeric vectors is called feature vectorization.
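For example, a rough pandas sketch of handling NULL values (the column names and values here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'Age': [22.0, None, 35.0], 'Cabin': [None, None, 'C85']})

# Replace missing Age values with a reasonable value (here, the mean)
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Drop a feature that has too many missing values
df = df.drop(columns=['Cabin'])
print(df)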

 

3.1 Data Encoding

There are two popular methods of encoding: label encoding and one-hot encoding.

 

Label Encoding

- Label encoding substitutes each string category with an integer code that represents the string.

- The size of the integer code has no real meaning, but some algorithms may mistakenly treat it as a magnitude or ordering.

- For this reason it is usually not applied to algorithms that are sensitive to numeric magnitude; rather, it is used with tree-based algorithms, which are not affected by the code values.
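A minimal sketch of scikit-learn's LabelEncoder (the item names are made up):

from sklearn.preprocessing import LabelEncoder

items = ['TV', 'Fridge', 'Microwave', 'TV', 'Fan']

encoder = LabelEncoder()
labels = encoder.fit_transform(items)   # each string is replaced by an integer code

print('Encoded labels:', labels)
print('Classes:', encoder.classes_)     # which string each integer code represents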

 

One Hot Encoding

- One-hot encoding represents each string category as a vector that has the value 1 only in the position corresponding to that category and 0 everywhere else.
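A rough sketch with scikit-learn's OneHotEncoder, using the same made-up item names (the input must be a 2-D array, hence the reshape):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

items = np.array(['TV', 'Fridge', 'Microwave', 'TV', 'Fan']).reshape(-1, 1)

encoder = OneHotEncoder()
onehot = encoder.fit_transform(items)   # sparse matrix with a 1 only in each item's own column

print(onehot.toarray())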

 

 

3.2 Feature Scaling 

Two popular methods are standardization and normalization.

 

Standardization

- Standardization transforms the data so that each feature has a mean of 0 and a standard deviation of 1.

 

StandardScaler is a class that standardizes data.
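A rough sketch of StandardScaler on the iris features; after fit_transform, each feature has mean 0 and standard deviation 1:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris_data = load_iris().data

scaler = StandardScaler()
iris_scaled = scaler.fit_transform(iris_data)

print('Feature means:', iris_scaled.mean(axis=0).round(3))   # ~0 for every feature
print('Feature stds: ', iris_scaled.std(axis=0).round(3))    # ~1 for every feature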

 

Normalization

- Normalization is to normalize features with different sizes. For example, a feature A representing the distance, ranging from 0 ~ 100km and a feature B, representing money, ranging from 0 ~ 100,000,000,000won are given.

To compare these two different features in a single unit, we use normalization

 

MinMaxScaler is a class that normalizes data (by default it rescales each feature to the range [0, 1]).
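A rough sketch of MinMaxScaler on the iris features; each feature ends up in the range [0, 1]:

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris_data = load_iris().data

scaler = MinMaxScaler()
iris_scaled = scaler.fit_transform(iris_data)

print('Feature minimums:', iris_scaled.min(axis=0))   # 0 for every feature
print('Feature maximums:', iris_scaled.max(axis=0))   # 1 for every feature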

 

3.3 Caution when scaling training and test data

 

When we scale both the training and test data, we use the fit and transform functions. These two functions each have a specific role: fit() learns the scaling parameters from the data, and transform() applies the fitted scaler to the data. If we fit the scaler again on the test data, the scaling criterion changes, so the training and test data end up on different scales. The scaler applied to the training and test data must therefore be the same one, fitted on the training data.

 

Solutions to this problem:

- If possible, scale the whole dataset before splitting it into training and test sets.
- Otherwise, reuse the scaler that was fitted on the training data to transform the test data (see the sketch below).
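A minimal sketch of the correct pattern, assuming the X_train / X_test arrays from the earlier split:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# fit_transform() learns the scaling parameters from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# The test data is transformed with the same fitted scaler; do NOT call fit() on it again
X_test_scaled = scaler.transform(X_test)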

 

 

4. Titanic Survival Prediction


https://colab.research.google.com/drive/1YiLnu5kk550pkhpB9lr_l3q4M19jkSSh
