Dataset: yuweiyin/FinBench

Dataset Card for FinBench

Dataset Statistics

[Introduction]

Task Statistics

The following table reports, for each dataset, the task description, the dataset name (used for dataset loading), the number of classification classes (2 for all datasets), the number of features, and the size and positive-label ratio of the train/validation/test splits.

| Task | Description | Dataset | #Classes | #Features | #Train [Pos%] | #Val [Pos%] | #Test [Pos%] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Credit-card Default | Predict whether a user will default on the credit card or not. | cd1 | 2 | 9 | 2738 [7.0%] | 305 [6.9%] | 1305 [6.2%] |
| | | cd2 | 2 | 23 | 18900 [22.3%] | 2100 [22.3%] | 9000 [21.8%] |
| Loan Default | Predict whether a user will default on the loan or not. | ld1 | 2 | 12 | 2118 [8.9%] | 236 [8.5%] | 1010 [9.0%] |
| | | ld2 | 2 | 11 | 18041 [21.7%] | 2005 [20.8%] | 8592 [21.8%] |
| | | ld3 | 2 | 35 | 142060 [21.6%] | 15785 [21.3%] | 67648 [22.1%] |
| Credit-card Fraud | Predict whether a user will commit fraud or not. | cf1 | 2 | 19 | 5352 [0.67%] | 595 [1.1%] | 2550 [0.90%] |
| | | cf2 | 2 | 120 | 5418 [6.0%] | 603 [7.3%] | 2581 [6.0%] |
| Customer Churn | Predict whether a user will churn (customer attrition) or not. | cc1 | 2 | 9 | 4189 [23.5%] | 466 [22.7%] | 1995 [22.4%] |
| | | cc2 | 2 | 10 | 6300 [20.8%] | 700 [20.6%] | 3000 [19.47%] |
| | | cc3 | 2 | 21 | 4437 [26.1%] | 493 [24.9%] | 2113 [27.8%] |

| Task | #Train | #Val | #Test |
| --- | --- | --- | --- |
| Credit-card Default | 21638 | 2405 | 10305 |
| Loan Default | 162219 | 18026 | 77250 |
| Credit-card Fraud | 10770 | 1198 | 5131 |
| Customer Churn | 14926 | 1659 | 7108 |
| Total | 209553 | 23288 | 99794 |
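
These split sizes and positive-label ratios can be cross-checked directly from the loaded data. The following is a minimal sketch (it uses load_dataset as described in the Data Loading section below, and assumes the positive class is labeled 1) that recomputes the #instances and Pos% columns for one dataset:

from datasets import load_dataset

ds_name = "cd1"  # any of: cd1, cd2, ld1, ld2, ld3, cf1, cf2, cc1, cc2, cc3
dataset = load_dataset("yuweiyin/FinBench", ds_name)

for split_name, split in dataset.items():
    labels = split["y"]  # binary labels; positive class assumed to be 1
    pos_ratio = 100.0 * sum(labels) / len(labels)
    print(f"{ds_name} {split_name}: {len(labels)} instances [{pos_ratio:.1f}% positive]")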

Data Source

| Task | Dataset | Source |
| --- | --- | --- |
| Credit-card Default | cd1 | Kaggle |
| | cd2 | Kaggle |
| Loan Default | ld1 | Kaggle |
| | ld2 | Kaggle |
| | ld3 | Kaggle |
| Credit-card Fraud | cf1 | Kaggle |
| | cf2 | Kaggle |
| Customer Churn | cc1 | Kaggle |
| | cc2 | Kaggle |
| | cc3 | Kaggle |
  • Language: English

Dataset Structure

Data Fields

import datasets

datasets.Features(
    {
        "X_ml": [datasets.Value(dtype="float")],  # (The tabular data array of the current instance)
        "X_ml_unscale": [datasets.Value(dtype="float")],  # (Scaled tabular data array of the current instance)
        "y": datasets.Value(dtype="int64"),  # (The label / ground-truth)
        "num_classes": datasets.Value("int64"),  # (The total number of classes)
        "num_features": datasets.Value("int64"),  # (The total number of features)
        "num_idx": [datasets.Value("int64")],  # (The indices of the numerical datatype columns)
        "cat_idx": [datasets.Value("int64")],  # (The indices of the categorical datatype columns)
        "cat_dim": [datasets.Value("int64")],  # (The dimension of each categorical column)
        "cat_str": [[datasets.Value("string")]],  # (The category names of categorical columns)
        "col_name": [datasets.Value("string")],  # (The name of each column)
        "X_instruction_for_profile": datasets.Value("string"),  # instructions (from tabular data) for profiles
        "X_profile": datasets.Value("string"),  # customer profiles built from instructions via LLMs
    }
)
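
Once a split is loaded (see Data Loading below), this schema can be inspected programmatically. A minimal sketch:

from datasets import load_dataset

dataset = load_dataset("yuweiyin/FinBench", "cd1")
train_set = dataset["train"]

print(train_set.features)        # the datasets.Features schema listed above
print(train_set.column_names)    # the field names, e.g. "X_ml", "y", "col_name", ...
print(train_set[0]["col_name"])  # the column (feature) names of the underlying tabular data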

Data Loading

HuggingFace Login (Optional)

# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)
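
To avoid hard-coding the token in source files, it can instead be read from an environment variable; a minimal sketch (assuming the token was exported beforehand, e.g. export HF_TOKEN=hf_xxx):

import os

from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])  # reads the access token from the HF_TOKEN environment variable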

Loading a Dataset

from datasets import load_dataset

ds_name = "cd1"  # change the dataset name here
dataset = load_dataset("yuweiyin/FinBench", ds_name)

Loading the Splits

from datasets import load_dataset

ds_name = "cd1"  # change the dataset name here
dataset = load_dataset("yuweiyin/FinBench", ds_name)

train_set = dataset["train"] if "train" in dataset else []
validation_set = dataset["validation"] if "validation" in dataset else []
test_set = dataset["test"] if "test" in dataset else []
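
For classical tabular models it is often convenient to pull the features and labels into NumPy arrays. A minimal sketch (assuming numpy is installed; it uses the scaled X_ml field, but X_ml_unscale works the same way):

import numpy as np
from datasets import load_dataset

dataset = load_dataset("yuweiyin/FinBench", "cd1")

X_train = np.array(dataset["train"]["X_ml"], dtype=np.float32)  # shape: (n_train, num_features)
y_train = np.array(dataset["train"]["y"], dtype=np.int64)       # shape: (n_train,)
X_test = np.array(dataset["test"]["X_ml"], dtype=np.float32)
y_test = np.array(dataset["test"]["y"], dtype=np.int64)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)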

Loading the Instances

from datasets import load_dataset

ds_name = "cd1"  # change the dataset name here
dataset = load_dataset("yuweiyin/FinBench", ds_name)
train_set = dataset["train"] if "train" in dataset else []

for train_instance in train_set:
    X_ml = train_instance["X_ml"]  # List[float] (The scaled tabular feature array of the current instance)
    X_ml_unscale = train_instance["X_ml_unscale"]  # List[float] (The unscaled, original-value tabular feature array)
    y = train_instance["y"]  # int (The label / ground-truth)
    num_classes = train_instance["num_classes"]  # int (The total number of classes)
    num_features = train_instance["num_features"]  # int (The total number of features)
    num_idx = train_instance["num_idx"]  # List[int] (The indices of the numerical datatype columns)
    cat_idx = train_instance["cat_idx"]  # List[int] (The indices of the categorical datatype columns)
    cat_dim = train_instance["cat_dim"]  # List[int] (The dimension of each categorical column)
    cat_str = train_instance["cat_str"]  # List[List[str]] (The category names of categorical columns)
    col_name = train_instance["col_name"]  # List[str] (The name of each column)
    X_instruction_for_profile = train_instance["X_instruction_for_profile"]  # instructions for building profiles
    X_profile = train_instance["X_profile"]  # customer profiles built from instructions via LLMs
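
As a usage example, the fields above can feed any standard binary classifier. The following is a minimal sketch using scikit-learn's LogisticRegression (scikit-learn is an assumption here, not a requirement of the dataset, and FinBench does not prescribe a particular model):

import numpy as np
from datasets import load_dataset
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

dataset = load_dataset("yuweiyin/FinBench", "cd1")
X_train = np.array(dataset["train"]["X_ml"], dtype=np.float32)
y_train = np.array(dataset["train"]["y"], dtype=np.int64)
X_test = np.array(dataset["test"]["X_ml"], dtype=np.float32)
y_test = np.array(dataset["test"]["y"], dtype=np.int64)

clf = LogisticRegression(max_iter=1000, class_weight="balanced")  # classes are imbalanced (see Task Statistics)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Test F1:", f1_score(y_test, y_pred))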

Contributions

[Contributions]

Citation

yin2023finbench

References

[References]