数据集:

ccdv/patent-classification

中文

Patent Classification: a classification of Patents and abstracts (9 classes).

This dataset is intended for long context classification (non abstract documents are longer that 512 tokens). Data are sampled from "BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization." by Eva Sharma, Chen Li and Lu Wang

It contains 9 unbalanced classes, 35k Patents and abstracts divided into 3 splits: train (25k), val (5k) and test (5k).

Note that documents are uncased and space separated (by authors)

Compatible with run_glue.py script:

export MODEL_NAME=roberta-base
export MAX_SEQ_LENGTH=512

python run_glue.py \
  --model_name_or_path $MODEL_NAME \
  --dataset_name ccdv/patent-classification  \
  --do_train \
  --do_eval \
  --max_seq_length $MAX_SEQ_LENGTH \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-5 \
  --num_train_epochs 1 \
  --max_eval_samples 500 \
  --output_dir tmp/patent