ccdv/arxiv-classification | ATYUN.COM 官网-人工智能教程资讯全方位服务平台

数据集:

ccdv/arxiv-classification

任务:

文本分类

子任务:

multi-class-classification topic-classification

语言:

大小:

10K<n<100K

其他:

long context long+context

数据集介绍文件清单

中文

Arxiv Classification: a classification of Arxiv Papers (11 classes).

This dataset is intended for long context classification (documents have all > 4k tokens). Copied from "Long Document Classification From Local Word Glimpses via Recurrent Attention Learning"

@ARTICLE{8675939,
  author={He, Jun and Wang, Liqun and Liu, Liu and Feng, Jiao and Wu, Hao},
  journal={IEEE Access}, 
  title={Long Document Classification From Local Word Glimpses via Recurrent Attention Learning}, 
  year={2019},
  volume={7},
  number={},
  pages={40707-40718},
  doi={10.1109/ACCESS.2019.2907992}
  }

It contains 11 slightly unbalanced classes, 33k Arxiv Papers divided into 3 splits: train (28k), val (2.5k) and test (2.5k).

2 configs:

default
no_ref, removes references to the class inside the document (eg: [cs.LG] -> [])

Compatible with run_glue.py script:

export MODEL_NAME=roberta-base
export MAX_SEQ_LENGTH=512

python run_glue.py \
  --model_name_or_path $MODEL_NAME \
  --dataset_name ccdv/arxiv-classification  \
  --do_train \
  --do_eval \
  --max_seq_length $MAX_SEQ_LENGTH \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-5 \
  --num_train_epochs 1 \
  --max_eval_samples 500 \
  --output_dir tmp/arxiv

作者:

ccdv

数据集大小:

1.87 GB