交叉验证完整指南

2025年01月13日由佚名发表 371 0

微信截图_20250114111305

作者提供的图片

机器学习模型通常需要大量数据，但它们如何在实时中处理新数据至关重要。交叉验证是一种通过将数据分成多个部分，使用部分数据训练模型，剩余部分测试模型来测试模型效果的方法。这有助于发现过拟合和欠拟合，提供模型在实际情况中表现的一个概念。

本指南将带您了解交叉验证的基础知识、类型以及最佳使用方法，以提高您的机器学习效果。

先决条件

在开始实际的交叉验证之前，请确保您对以下内容有良好的理解：

机器学习基础：了解概念如过拟合和欠拟合，以及如何衡量模型的效果
Python技能：熟练掌握Python基础并使用工具如Scikit-learn、Pandas和NumPy
准备您的数据：知道如何将数据分割为训练集和测试集，以及我们这样做的原因

要跟随我们的示例，请确保您的计算机上安装了Python和以下库：

pip install numpy pandas scikit-learn matplotlib

什么是交叉验证？

让我们从头开始：交叉验证是最广泛使用的数据重采样方法之一，用于评估预测模型的泛化能力并防止过拟合。与简单的训练-测试分割不同，它通过轮换训练和测试集提供更全面的理解。这有助于确保每个数据点都有机会被测试，并有助于获得可靠的性能指标。

交叉验证的关键点：

一致地评估模型性能
通过在多样化的数据子集上测试来最小化偏差
通过重复验证循环优化超参数

交叉验证有多种类型，每种类型适用于不同的数据结构和技术。让我们来看看最常用的技术。

1. K折交叉验证

过程：

数据集被分成k个子集（折叠）
模型在k-1个折叠上进行训练，并在剩余的折叠上进行测试
这个过程对每个折叠重复，确保每个子集都用于测试
最终的性能指标是所有测试运行的平均值

优点：

适用于大多数数据集
通过平均结果减少方差

注意事项：

选择合适的k值（通常为5或10）是至关重要的

代码示例：

from sklearn.model_selection import KFoldfrom sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import accuracy_scoreimport numpy as np# 示例数据X = np.random.rand(100, 5)  # 100个样本，5个特征y = np.random.randint(0, 2, 100)  # 二元目标变量kf = KFold(n_splits=5)model = LogisticRegression()accuracies = []for train_index, test_index in kf.split(X):    X_train, X_test = X[train_index], X[test_index]    y_train, y_test = y[train_index], y[test_index]        model.fit(X_train, y_train)    predictions = model.predict(X_test)    accuracies.append(accuracy_score(y_test, predictions))print("平均准确率:", np.mean(accuracies))

2. 分层K折交叉验证

分层k折交叉验证类似于普通的k折，但它确保每个折叠具有与整个数据集相同的类别分布。这使得它特别适合于不平衡的数据集。

代码示例：

from sklearn.model_selection import StratifiedKFoldskf = StratifiedKFold(n_splits=5)accuracies = []for train_index, test_index in skf.split(X, y):    X_train, X_test = X[train_index], X[test_index]    y_train, y_test = y[train_index], y[test_index]        model.fit(X_train, y_train)    predictions = model.predict(X_test)    accuracies.append(accuracy_score(y_test, predictions))print("分层平均准确率:", np.mean(accuracies))

留一法交叉验证

在每次运行中，留一法交叉验证（LOOC）使用一个数据点进行测试，其余数据用于训练。这个过程非常彻底，但对于大型数据集来说计算量很大。

代码示例：

from sklearn.model_selection import StratifiedKFoldskf = StratifiedKFold(n_splits=5)accuracies = []for train_index, test_index in skf.split(X, y):    X_train, X_test = X[train_index], X[test_index]    y_train, y_test = y[train_index], y[test_index]        model.fit(X_train, y_train)    predictions = model.predict(X_test)    accuracies.append(accuracy_score(y_test, predictions))print("分层平均准确率:", np.mean(accuracies))

时间序列交叉验证

时间序列交叉验证具有以下特点：

专为时间相关数据设计
在较早的时间段进行训练，在较晚的时间段进行测试
有助于保持数据的时间顺序

代码示例：

from sklearn.model_selection import TimeSeriesSplittscv = TimeSeriesSplit(n_splits=3)accuracies = []for train_index, test_index in tscv.split(X):    X_train, X_test = X[train_index], X[test_index]    y_train, y_test = y[train_index], y[test_index]        model.fit(X_train, y_train)    predictions = model.predict(X_test)    accuracies.append(accuracy_score(y_test, predictions))print("时间序列交叉验证准确率:", np.mean(accuracies))

组K折交叉验证

以下是组k折交叉验证与其他交叉验证方法的不同之处：

确保数据点的组（例如，来自同一用户或批次）要么完全在训练集，要么完全在测试集中
它还防止了分组数据集中的数据泄漏，训练在较早的时间段进行，测试在较晚的时间段进行

代码示例：

from sklearn.model_selection import GroupKFold# 模拟分组数据groups = np.random.randint(0, 5, len(X))  # 5个独特的组gkf = GroupKFold(n_splits=5)accuracies = []for train_index, test_index in gkf.split(X, y, groups):    X_train, X_test = X[train_index], X[test_index]    y_train, y_test = y[train_index], y[test_index]        model.fit(X_train, y_train)    predictions = model.predict(X_test)    accuracies.append(accuracy_score(y_test, predictions))print("组K折交叉验证准确率:", np.mean(accuracies))

交叉验证的好处

为什么我们要进行这种重采样过程来构建模型？以下是一些重要原因：

提高模型可靠性：提供更稳健的性能测量
防止过拟合：除了过拟合，它多次在不同的数据分割上测试模型，从而增强模型的泛化能力
优化超参数：有助于更好地调整超参数以获得最佳性能
广泛评估：确保所有数据点都用于训练和测试
减少方差：多次训练/测试分割产生更可靠的性能指标
适用于各种模型：对简单和复杂模型都适用

最佳实践

以下是使用交叉验证的一些重要最佳实践：

选择合适的交叉验证技术：技术应与数据集和问题类型相对应
注意数据泄漏：测试数据不应泄漏到训练数据中
结合网格搜索：交叉验证可用于优化超参数
平衡计算成本和彻底性：像LOOCV这样的技术计算量很大
使用可视化：图表将有助于可视化跨折叠的性能趋势

可视化示例：

import matplotlib.pyplot as pltfrom sklearn.datasets import make_classificationfrom sklearn.model_selection import KFoldX, y = make_classification(n_samples=100,                            n_features=2,                            n_classes=2,                            random_state=42,                            n_informative=2,                            n_redundant=0)kf = KFold(n_splits=5)plt.figure(figsize=(10, 6))for i, (train_index, test_index) in enumerate(kf.split(X)):    plt.scatter(X[test_index, 0], X[test_index, 1], label=f'折叠 {i + 1}')plt.title('K折交叉验证分割')plt.xlabel('特征 1')plt.ylabel('特征 2')plt.legend()plt.show()

上面的可视化展示了数据集如何被分成5个不同的折叠用于交叉验证。每种颜色代表在每个折叠中分配给测试集的数据点。

结论

交叉验证是机器学习中的一个重要工具，确保模型能够很好地泛化，并提供对模型性能的洞察。通过正确的交叉验证技术和最佳实践，你可以在构建用于实际应用的稳健、可靠模型时多一个工具。

参考文献与进一步阅读

Scikit-Learn 文档: scikit-learn 的官方文档，涵盖交叉验证技术和示例
Jake VanderPlas 的《Python 数据科学手册》: 一本关于 Python 数据科学的综合指南，包括模型评估和验证
Andrew Ng 的《机器学习实战》: 一本免费书籍，解释了模型评估策略和最佳实践。

文章来源：https://www.kdnuggets.com/complete-guide-cross-validation

标签：

机器学习

0 评论

欢迎关注ATYUN官方公众号

商务合作及内容投稿请联系邮箱:bd@atyun.com

上一篇人工通用智能（AGI）如何永远重塑社会

下一篇如果我能重新开始，我会在2025年如何学习Python

评论登录

要发表评论，您必须先登录。

jonatasgrosman/wav2vec2-large-xlsr-53-english facebook/dino-vitb16 bert-base-uncased xlm-roberta-large xlm-roberta-base gpt2 microsoft/resnet-50 facebook/dino-vits8

AGENTIC AI如何塑造未来