Streamline Your Machine Learning Workflow with Scikit-learn Pipelines

Published on March 11, 2024 by daydream

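Scikit-learn pipelines simplify your preprocessing and modeling steps: they reduce code complexity, keep data preprocessing consistent, help with hyperparameter tuning, and make your workflow more organized and easier to maintain. By bundling multiple transformations and the final model into a single object, a pipeline also improves reproducibility and keeps the workflow compact.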

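In this tutorial, we will train a random forest classifier on the Bank Churn dataset from Kaggle. We will compare the conventional approach to data preprocessing and model training with a more efficient approach that uses Scikit-learn's Pipeline and ColumnTransformer.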

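Data Processing Pipeline

In the data processing pipeline, we will learn how to transform categorical and numerical columns separately. We will start with the conventional coding style and then show a better way to do the same processing.

After extracting the data from the zip file, load train.csv with "id" as the index column. Drop the unnecessary columns and shuffle the dataset.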

import pandas as pd

bank_df = pd.read_csv("train.csv", index_col="id")
bank_df = bank_df.drop(['CustomerId', 'Surname'], axis=1)
bank_df = bank_df.sample(frac=1)
bank_df.head()

We have categorical, integer, and float columns. The dataset looks fairly clean.

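Simple Scikit-learn Code

Our goal is to fill in the missing values in the categorical and numerical features. To do that, we will use SimpleImputer with a different strategy for each feature type.

After filling in the missing values, we will convert the categorical features to integers and apply min-max scaling to the numerical features.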

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler

cat_col = [1, 2]
num_col = [0, 3, 4, 5, 6, 7, 8, 9]

# Filling missing categorical values
cat_impute = SimpleImputer(strategy="most_frequent")
bank_df.iloc[:, cat_col] = cat_impute.fit_transform(bank_df.iloc[:, cat_col])

# Filling missing numerical values
num_impute = SimpleImputer(strategy="median")
bank_df.iloc[:, num_col] = num_impute.fit_transform(bank_df.iloc[:, num_col])

# Encode categorical features as an integer array.
cat_encode = OrdinalEncoder()
bank_df.iloc[:, cat_col] = cat_encode.fit_transform(bank_df.iloc[:, cat_col])

# Scaling numerical values.
scaler = MinMaxScaler()
bank_df.iloc[:, num_col] = scaler.fit_transform(bank_df.iloc[:, num_col])

bank_df.head()

As a result, we get a clean, transformed dataset that contains only integer or float values.

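Scikit-learn Pipelines Code

Let's convert the code above using Pipeline and ColumnTransformer. Instead of applying each preprocessing technique one at a time, we will create two pipelines: one for the numerical columns and one for the categorical columns.

In the numerical pipeline, we use SimpleImputer with the "mean" strategy to fill in missing values (the earlier code used "median"), followed by a min-max scaler for normalization.

In the categorical pipeline, we use SimpleImputer with the "most_frequent" strategy to fill in missing values, followed by an ordinal encoder that converts the categories to numerical values.

We combine the two pipelines with ColumnTransformer and give each one the column indices it should handle. This applies each pipeline only to specific columns. For example, the categorical transformer pipeline is applied only to columns 1 and 2.

Note: remainder="passthrough" means that the columns not handled by any transformer are appended at the end. In our case, that is the target column.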

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Identify numerical and categorical columns
cat_col = [1, 2]
num_col = [0, 3, 4, 5, 6, 7, 8, 9]

# Transformers for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler())
])

# Transformers for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder())
])

# Combine transformers into a ColumnTransformer
preproc_pipe = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_col),
        ('cat', categorical_transformer, cat_col)
    ],
    remainder="passthrough"
)

# Apply the preprocessing pipeline.
# The result is a NumPy array, so it is stored under a new name;
# bank_df stays a DataFrame for the later bank_df.drop(...) call.
bank_arr = preproc_pipe.fit_transform(bank_df)
bank_arr[0]

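After the transformation, the resulting array contains the transformed numerical values first and the transformed categorical values after them, following the order of the pipelines in the column transformer. The passed-through target column comes last.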

array([0.712     , 0.24324324, 0.6       , 0.        , 0.33333333,
       1.        , 1.        , 0.76443485, 2.        , 0.        ,
       0.        ])

You can run the pipeline object in a Jupyter Notebook to visualize the pipeline. Make sure you are using a recent version of Scikit-learn.

preproc_pipe
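If the diagram does not appear, or you are working outside a notebook, the following sketch may help. It assumes a Scikit-learn version that supports the "diagram" display option and exposes estimator_html_repr; demo_pipe is a hypothetical stand-in for the tutorial's pipeline.

```python
from sklearn import set_config
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import estimator_html_repr

# Ask Scikit-learn to render estimators as HTML diagrams in notebooks.
set_config(display="diagram")

# Hypothetical two-step pipeline used only for this illustration.
demo_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler()),
    ]
)

# The same diagram as an HTML string, usable outside a notebook.
diagram_html = estimator_html_repr(demo_pipe)
print(len(diagram_html) > 0)
```

The HTML string can be written to an .html file and opened in a browser.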

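Data Training Pipeline

To train and evaluate our model, we need to split the dataset into two subsets: a training set and a test set.

To do that, we first create the independent variables (X) and the dependent variable (y) and convert them to NumPy arrays. Then we use the train_test_split function to split the dataset into the two subsets.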

from sklearn.model_selection import train_test_split

X = bank_df.drop("Exited", axis=1).values
y = bank_df.Exited.values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=125
)
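Churn labels are typically imbalanced, so it can be worth keeping the class ratio equal in both subsets. The tutorial does not do this; the sketch below shows the optional stratify argument of train_test_split on hypothetical labels (80 negatives and 20 positives).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 negatives, 20 positives.
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 80 + [1] * 20)

# stratify=toy_y keeps the 80/20 class ratio in both subsets.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=125, stratify=toy_y
)

print(ys_train.mean(), ys_test.mean())
```

Both subsets end up with the same share of positive labels, which makes the test score less sensitive to how the shuffle happens to fall.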

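Simple Scikit-learn Code

The conventional way to write training code is to first perform feature selection with SelectKBest and then feed the selected features to our random forest classifier.

We first train the model on the training set and then evaluate it on the test set.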

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

KBest = SelectKBest(chi2, k="all")
X_train = KBest.fit_transform(X_train, y_train)
X_test = KBest.transform(X_test)

model = RandomForestClassifier(n_estimators=100, random_state=125)

model.fit(X_train, y_train)

model.score(X_test, y_test)

We get a fairly good accuracy score.

0.8613035487063481
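Note that k="all" keeps every feature, so in the code above SelectKBest acts as a placeholder step. The sketch below contrasts it with an integer k on hypothetical synthetic data (one feature that tracks the label and two noise features). The chi2 score function requires non-negative inputs, which the min-max scaling guarantees in the tutorial.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
toy_y = np.array([0, 1] * 50)

# Three non-negative features: the first tracks the label,
# the other two are uniform noise.
toy_X = np.column_stack([
    toy_y + rng.random(100) * 0.1,
    rng.random(100),
    rng.random(100),
])

keep_all = SelectKBest(chi2, k="all").fit(toy_X, toy_y)
keep_one = SelectKBest(chi2, k=1).fit(toy_X, toy_y)

print(keep_all.transform(toy_X).shape)  # every column is retained
print(keep_one.get_support())           # boolean mask of the selected columns
```

With k=1 only the highest-scoring column survives, and get_support shows which one.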

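Scikit-learn Pipelines Code

Let's use the Pipeline class to combine the two training steps into a single pipeline. We can then fit the model on the training set and evaluate it on the test set.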

KBest = SelectKBest(chi2, k="all")
model = RandomForestClassifier(n_estimators=100, random_state=125)

train_pipe = Pipeline(
    steps=[
        ("KBest", KBest),
        ("RFmodel", model),
    ]
)

train_pipe.fit(X_train, y_train)

train_pipe.score(X_test, y_test)

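We get the same score, but the code is more compact and easier to read. Adding steps to the training pipeline, or removing them, also becomes quite easy.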

0.8613035487063481
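A pipeline also makes hyperparameter tuning straightforward, because each step's parameters can be addressed as <step name>__<parameter name>. The sketch below is not part of the original tutorial: it runs GridSearchCV over a pipeline with the same step names, on synthetic stand-in data from make_classification, with an assumed small parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption; not the Bank Churn dataset).
demo_X, demo_y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=125
)

demo_pipe = Pipeline(
    steps=[
        ("scaler", MinMaxScaler()),  # makes the training inputs non-negative for chi2
        ("KBest", SelectKBest(chi2)),
        ("RFmodel", RandomForestClassifier(random_state=125)),
    ]
)

# Illustrative grid; the values are arbitrary choices for the sketch.
param_grid = {
    "KBest__k": [2, 4, "all"],
    "RFmodel__n_estimators": [10, 50],
}

# Every step is refit inside each cross-validation fold.
search = GridSearchCV(demo_pipe, param_grid, cv=3)
search.fit(demo_X, demo_y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the scaler and the selector are refit on each training fold, the validation folds do not leak into the preprocessing.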

Run the pipeline object to visualize the pipeline.

train_pipe

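Combining the Processing and Training Pipelines

Now we will combine the preprocessing and training pipelines by creating another pipeline and adding both of them to it.

Here is the complete code: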

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

# Loading the data
bank_df = pd.read_csv("train.csv", index_col="id")
bank_df = bank_df.drop(['CustomerId', 'Surname'], axis=1)
bank_df = bank_df.sample(frac=1)

# Splitting data into training and testing sets
X = bank_df.drop(["Exited"], axis=1)
y = bank_df.Exited

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=125
)

# Identify numerical and categorical columns
cat_col = [1, 2]
num_col = [0, 3, 4, 5, 6, 7, 8, 9]

# Transformers for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler())
])

# Transformers for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder())
])

# Combine pipelines using ColumnTransformer
preproc_pipe = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_col),
        ('cat', categorical_transformer, cat_col)
    ],
    remainder="passthrough"
)

# Selecting the best features
KBest = SelectKBest(chi2, k="all")

# Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=125)

# KBest and model pipeline
train_pipe = Pipeline(
    steps=[
        ("KBest", KBest),
        ("RFmodel", model),
    ]
)

# Combining the preprocessing and training pipelines
complete_pipe = Pipeline(
    steps=[
        ("preprocessor", preproc_pipe),
        ("train", train_pipe),
    ]
)

# Running the complete pipeline
complete_pipe.fit(X_train, y_train)

# Model accuracy
complete_pipe.score(X_test, y_test)

Output:

0.8592837955201874

Visualize the complete pipeline.

complete_pipe

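Saving and Loading the Model

A major advantage of using pipelines is that you can save the pipeline together with the model. At inference time, you only need to load the pipeline object, and it is ready to process raw data and return predictions. You don't need to rewrite the processing and transformation functions in your application file, because the loaded pipeline works out of the box. This makes the machine learning workflow more efficient and saves time.

Let's first save the pipeline using the skops-dev/skops library.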

import skops.io as sio

sio.dump(complete_pipe, "bank_pipeline.skops")

Then load the saved pipeline and display it.

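# Newer skops releases may reject trusted=True; pass the reviewed list of untrusted types instead
unknown_types = sio.get_untrusted_types(file="bank_pipeline.skops")
new_pipe = sio.load("bank_pipeline.skops", trusted=unknown_types)
new_pipe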

As we can see, the pipeline has been loaded successfully.
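A quick way to confirm that a save and load round trip preserves behavior is to compare predictions before and after. The sketch below swaps in joblib, which ships alongside Scikit-learn, rather than skops, and uses a hypothetical small pipeline on random data. Note that joblib files are pickle-based, so only load files you trust.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw (unscaled) inputs and labels for illustration.
rng = np.random.default_rng(125)
raw_X = rng.random((60, 4)) * 100
raw_y = (raw_X[:, 0] > 50).astype(int)

small_pipe = Pipeline(
    steps=[
        ("scaler", MinMaxScaler()),
        ("RFmodel", RandomForestClassifier(n_estimators=10, random_state=125)),
    ]
).fit(raw_X, raw_y)

# Save to a temporary directory and load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "small_pipe.joblib")
    joblib.dump(small_pipe, path)
    restored_pipe = joblib.load(path)

# The restored pipeline scales and predicts on raw data by itself.
same_predictions = np.array_equal(
    small_pipe.predict(raw_X), restored_pipe.predict(raw_X)
)
print(same_predictions)
```

The restored object carries the fitted scaler along with the model, so no preprocessing code has to be repeated at inference time.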

To evaluate the loaded pipeline, we will make predictions on the test set and then calculate the accuracy and F1 score.

from sklearn.metrics import accuracy_score, f1_score

predictions = new_pipe.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions, average="macro")

print("Accuracy:", str(round(accuracy, 2) * 100) + "%", "F1:", round(f1, 2))

It turns out that we need to focus on the minority class to improve the F1 score.

Accuracy: 86.0% F1: 0.76
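The gap between accuracy and macro F1 is typical for imbalanced labels: macro F1 averages the per-class scores with equal weight, so a weak minority class pulls it down. The sketch below uses hypothetical labels (not the tutorial's predictions) to show the effect.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 8 negatives and 2 positives, one positive missed.
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

toy_accuracy = accuracy_score(toy_true, toy_pred)
per_class_f1 = f1_score(toy_true, toy_pred, average=None)  # one score per class
macro_f1 = f1_score(toy_true, toy_pred, average="macro")   # unweighted mean of those

print("Accuracy:", round(toy_accuracy, 2))
print("F1 per class:", per_class_f1.round(2))
print("Macro F1:", round(macro_f1, 2))
```

Here accuracy is 0.9, while the minority class F1 is about 0.67 and drags the macro F1 down to about 0.8.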

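The project files and code are available in the Deepnote Workspace. The workspace contains two notebooks: one that uses Scikit-learn pipelines and one that does not.

Conclusion

In this tutorial, we learned how Scikit-learn pipelines streamline machine learning workflows by chaining a sequence of data transformations and model training. By combining preprocessing and model training into a single Pipeline object, we can simplify the code, keep data transformations consistent, and make our workflow more organized and reproducible.

Source: https://www.kdnuggets.com/streamline-your-machine-learning-workflow-with-scikit-learn-pipelines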
