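This dataset is a curated collection of datasets from OpenML, assembled to benchmark the performance of machine learning algorithms.
The benchmark consists of four tabular learning tasks: numerical classification, categorical classification, numerical regression, and categorical regression.
The dataset contains four splits (folders), organized by task and by the datasets each task includes.
To load a specific dataset, pass `task_name/dataset_name` to the `data_files` argument of `load_dataset`, as in the following example: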

    from datasets import load_dataset

    dataset = load_dataset("inria-soda/tabular-benchmark", data_files="reg_cat/house_sales.csv")

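When loading several datasets, it can help to build the `data_files` path programmatically. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `benchmark_data_file` and `load_benchmark_dataset` are hypothetical, and the sketch assumes every file follows the `task_name/dataset_name.csv` pattern of the example above, which has not been checked for each dataset.

```python
def benchmark_data_file(task_name: str, dataset_name: str) -> str:
    """Build the relative CSV path passed to the `data_files` argument."""
    # Assumption: files are laid out as <task_name>/<dataset_name>.csv,
    # as in the "reg_cat/house_sales.csv" example.
    return f"{task_name}/{dataset_name}.csv"


def load_benchmark_dataset(task_name: str, dataset_name: str):
    """Load one dataset of the benchmark (requires network access)."""
    # Imported lazily so the path helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "inria-soda/tabular-benchmark",
        data_files=benchmark_data_file(task_name, dataset_name),
    )


# Build paths for two datasets listed under the "reg_cat" folder.
paths = [benchmark_data_file("reg_cat", name) for name in ("house_sales", "diamonds")]
print(paths)
```

Calling `load_benchmark_dataset("reg_cat", "house_sales")` would then be equivalent to the direct `load_dataset` call shown above.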
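The benchmark was curated to compare the performance of tree-based models against neural networks. The criteria used to select the datasets are described in the paper.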
Numerical Classification
dataset_name | n_samples | n_features | original_link | new_link |
---|---|---|---|---|
electricity | 38474.0 | 7.0 | 1234321 | 1235321 |
covertype | 566602.0 | 10.0 | 1236321 | 1237321 |
pol | 10082.0 | 26.0 | 1238321 | 1239321 |
house_16H | 13488.0 | 16.0 | 12310321 | 12311321 |
MagicTelescope | 13376.0 | 10.0 | 12312321 | 12313321 |
bank-marketing | 10578.0 | 7.0 | 12314321 | 12315321 |
Bioresponse | 3434.0 | 419.0 | 12316321 | 12317321 |
MiniBooNE | 72998.0 | 50.0 | 12318321 | 12319321 |
default-of-credit-card-clients | 13272.0 | 20.0 | 12320321 | 12321321 |
Higgs | 940160.0 | 24.0 | 12322321 | 12323321 |
eye_movements | 7608.0 | 20.0 | 12324321 | 12325321 |
Diabetes130US | 71090.0 | 7.0 | 12326321 | 12327321 |
jannis | 57580.0 | 54.0 | 12328321 | 12329321 |
heloc | 10000.0 | 22.0 | 12330321 | 12331321 |
credit | 16714.0 | 10.0 | 12332321 | 12333321 |
california | 20634.0 | 8.0 | 12334321 | 12335321 |
Categorical Classification
dataset_name | n_samples | n_features | original_link | new_link |
---|---|---|---|---|
electricity | 38474.0 | 8.0 | 1234321 | 12337321 |
eye_movements | 7608.0 | 23.0 | 12324321 | 12339321 |
covertype | 423680.0 | 54.0 | 12340321 | 12341321 |
albert | 58252.0 | 31.0 | 12342321 | 12343321 |
compas-two-years | 4966.0 | 11.0 | 12344321 | 12345321 |
default-of-credit-card-clients | 13272.0 | 21.0 | 12320321 | 12347321 |
road-safety | 111762.0 | 32.0 | 12348321 | 12349321 |
Numerical Regression
dataset_name | n_samples | n_features | original_link | new_link |
---|---|---|---|---|
cpu_act | 8192.0 | 21.0 | 12350321 | 12351321 |
pol | 15000.0 | 26.0 | 12352321 | 12353321 |
elevators | 16599.0 | 16.0 | 12354321 | 12355321 |
wine_quality | 6497.0 | 11.0 | 12356321 | 12357321 |
Ailerons | 13750.0 | 33.0 | 12358321 | 12359321 |
yprop_4_1 | 8885.0 | 42.0 | 12360321 | 12361321 |
houses | 20640.0 | 8.0 | 12362321 | 12363321 |
house_16H | 22784.0 | 16.0 | 12364321 | 12365321 |
delays_zurich_transport | 5465575.0 | 9.0 | 12366321 | 12367321 |
diamonds | 53940.0 | 6.0 | 12368321 | 12369321 |
Brazilian_houses | 10692.0 | 8.0 | 12370321 | 12371321 |
Bike_Sharing_Demand | 17379.0 | 6.0 | 12372321 | 12373321 |
nyc-taxi-green-dec-2016 | 581835.0 | 9.0 | 12374321 | 12375321 |
house_sales | 21613.0 | 15.0 | 12376321 | 12377321 |
sulfur | 10081.0 | 6.0 | 12378321 | 12379321 |
medical_charges | 163065.0 | 5.0 | 12380321 | 12381321 |
MiamiHousing2016 | 13932.0 | 14.0 | 12382321 | 12383321 |
superconduct | 21263.0 | 79.0 | 12384321 | 12385321 |
Categorical Regression
dataset_name | n_samples | n_features | original_link | new_link |
---|---|---|---|---|
topo_2_1 | 8885.0 | 255.0 | 12386321 | 12387321 |
analcatdata_supreme | 4052.0 | 7.0 | 12388321 | 12389321 |
visualizing_soil | 8641.0 | 4.0 | 12390321 | 12391321 |
delays_zurich_transport | 5465575.0 | 12.0 | 12366321 | 12393321 |
diamonds | 53940.0 | 9.0 | 12368321 | 12395321 |
Allstate_Claims_Severity | 188318.0 | 124.0 | 12396321 | 12397321 |
Mercedes_Benz_Greener_Manufacturing | 4209.0 | 359.0 | 12398321 | 12399321 |
Brazilian_houses | 10692.0 | 11.0 | 12370321 | 123101321 |
Bike_Sharing_Demand | 17379.0 | 11.0 | 12372321 | 123103321 |
Airlines_DepDelay_1M | 1000000.0 | 5.0 | 123104321 | 123105321 |
nyc-taxi-green-dec-2016 | 581835.0 | 16.0 | 12374321 | 123107321 |
abalone | 4177.0 | 8.0 | 123108321 | 123109321 |
house_sales | 21613.0 | 17.0 | 12376321 | 123111321 |
seattlecrime6 | 52031.0 | 4.0 | 123112321 | 123113321 |
medical_charges | 163065.0 | 5.0 | 12380321 | 123115321 |
particulate-matter-ukair-2017 | 394299.0 | 6.0 | 123116321 | 123117321 |
SGEMM_GPU_kernel_performance | 241600.0 | 9.0 | 123118321 | 123119321 |
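
The tables above can also be read programmatically, for example to pick datasets by size before loading them. The sketch below is illustrative only: `DatasetInfo` and `parse_row` are hypothetical names, the three rows are copied from the categorical regression table with the link columns omitted, and the 10,000-sample threshold is an arbitrary choice.

```python
from typing import NamedTuple


class DatasetInfo(NamedTuple):
    name: str
    n_samples: int
    n_features: int


def parse_row(row: str) -> DatasetInfo:
    """Parse one `dataset_name | n_samples | n_features | ...` table row."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    name, n_samples, n_features = cells[:3]
    # Counts are printed as floats such as "21613.0"; convert them to int.
    return DatasetInfo(name, int(float(n_samples)), int(float(n_features)))


# Three rows from the categorical regression table (link columns omitted).
rows = [
    "house_sales | 21613.0 | 17.0 |",
    "diamonds | 53940.0 | 9.0 |",
    "abalone | 4177.0 | 8.0 |",
]
infos = [parse_row(row) for row in rows]

# Keep only datasets with at least 10,000 samples (arbitrary threshold).
large = [info.name for info in infos if info.n_samples >= 10_000]
print(large)
```
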
Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux.
[More Information Needed]
Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? NeurIPS 2022 Datasets and Benchmarks Track, Nov 2022, New Orleans, United States. ⟨hal-03723551v2⟩