Model:
microsoft/deberta-v2-xxlarge
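DeBERTa improves on the BERT and RoBERTa models by using disentangled attention and an enhanced mask decoder. With 80GB of training data, it outperforms BERT and RoBERTa on the majority of NLU tasks.
Please check the official repository for more details and updates.
This is the DeBERTa V2 xxlarge model, with 48 layers and a hidden size of 1536. It has 1.5B parameters in total and was trained on 160GB of raw data.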
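As a rough sanity check on the stated size, the parameter count can be estimated from the layer count and hidden size given above. The sketch below is a back-of-the-envelope approximation, not an exact accounting: it assumes a feed-forward width of 4x the hidden size and a vocabulary of about 128K tokens (neither figure appears in this card), and it ignores biases, LayerNorm weights, and relative-position embeddings.

```python
# Back-of-the-envelope parameter estimate for a Transformer encoder.
# From the model card: 48 layers, hidden size 1536.
NUM_LAYERS = 48
HIDDEN_SIZE = 1536

# Assumptions (not stated in the card):
FFN_MULTIPLIER = 4    # feed-forward width = 4 * hidden size
VOCAB_SIZE = 128_000  # approximate vocabulary size


def estimate_parameters(num_layers, hidden, ffn_multiplier, vocab_size):
    """Estimate weight-matrix parameters, ignoring biases and LayerNorm."""
    attention = 4 * hidden * hidden                      # Q, K, V, output projections
    feed_forward = 2 * ffn_multiplier * hidden * hidden  # up and down projections
    encoder = num_layers * (attention + feed_forward)
    embeddings = vocab_size * hidden                     # token embedding table
    return encoder + embeddings


total = estimate_parameters(NUM_LAYERS, HIDDEN_SIZE, FFN_MULTIPLIER, VOCAB_SIZE)
print(f"Estimated parameters: {total / 1e9:.2f}B")
```

Under these assumptions the estimate comes to roughly 1.56B, which is in line with the 1.5B figure quoted above.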
We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.
| Model | SQuAD 1.1 | SQuAD 2.0 | MNLI-m/mm | SST-2 | QNLI | CoLA | RTE | MRPC | QQP | STS-B |
|---|---|---|---|---|---|---|---|---|---|---|
| | F1/EM | F1/EM | Acc | Acc | Acc | MCC | Acc | Acc/F1 | Acc/F1 | P/S |
| BERT-Large | 90.9/84.1 | 81.8/79.0 | 86.6/- | 93.2 | 92.3 | 60.6 | 70.4 | 88.0/- | 91.3/- | 90.0/- |
| RoBERTa-Large | 94.6/88.9 | 89.4/86.5 | 90.2/- | 96.4 | 93.9 | 68.0 | 86.6 | 90.9/- | 92.2/- | 92.4/- |
| XLNet-Large | 95.1/89.7 | 90.6/87.9 | 90.8/- | 97.0 | 94.9 | 69.0 | 85.9 | 90.8/- | 92.3/- | 92.5/- |
| DeBERTa-Large<sup>1</sup> | 95.5/90.1 | 90.7/88.0 | 91.3/91.1 | 96.5 | 95.3 | 69.5 | 91.0 | 92.6/94.6 | 92.3/- | 92.8/92.5 |
| DeBERTa-XLarge<sup>1</sup> | -/- | -/- | 91.5/91.2 | 97.0 | - | - | 93.1 | 92.1/94.3 | - | 92.9/92.7 |
| DeBERTa-V2-XLarge<sup>1</sup> | 95.8/90.8 | 91.4/88.9 | 91.7/91.6 | 97.5 | 95.8 | 71.1 | 93.9 | 92.0/94.2 | 92.3/89.8 | 92.9/92.9 |
| DeBERTa-V2-XXLarge<sup>1,2</sup> | 96.1/91.4 | 92.2/89.7 | 91.7/91.9 | 97.2 | 96.0 | 72.0 | 93.5 | 93.1/94.9 | 92.7/90.3 | 93.2/93.1 |
To run with DeepSpeed:
```bash
pip install datasets
pip install deepspeed

# Download the deepspeed config file
wget https://huggingface.co/microsoft/deberta-v2-xxlarge/resolve/main/ds_config.json -O ds_config.json

export TASK_NAME=mnli
output_dir="ds_results"
num_gpus=8
batch_size=8

python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_glue.py \
  --model_name_or_path microsoft/deberta-v2-xxlarge \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 256 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 3e-6 \
  --num_train_epochs 3 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 10 \
  --logging_dir $output_dir \
  --deepspeed ds_config.json
```
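The launch settings above imply a global batch size that is easy to overlook when changing the GPU count. The sketch below derives it from the `num_gpus` and `batch_size` values in the command and estimates the number of optimizer steps for the run. It rests on two assumptions that this card does not state: that no gradient accumulation is configured (on the command line or in `ds_config.json`), and that the MNLI training split has 392,702 examples.

```python
import math

# Values taken from the launch command above.
NUM_GPUS = 8
PER_DEVICE_BATCH_SIZE = 8
NUM_EPOCHS = 3

# Assumptions (not stated in the card):
GRAD_ACCUM_STEPS = 1           # no gradient accumulation
MNLI_TRAIN_EXAMPLES = 392_702  # size of the MNLI training split


def global_batch_size(num_gpus, per_device, grad_accum):
    """Number of examples consumed per optimizer step across all GPUs."""
    return num_gpus * per_device * grad_accum


def total_optimizer_steps(num_examples, batch, epochs):
    """Optimizer steps over the whole run, counting the last partial batch."""
    return math.ceil(num_examples / batch) * epochs


batch = global_batch_size(NUM_GPUS, PER_DEVICE_BATCH_SIZE, GRAD_ACCUM_STEPS)
steps = total_optimizer_steps(MNLI_TRAIN_EXAMPLES, batch, NUM_EPOCHS)
print(f"global batch size: {batch}, total optimizer steps: {steps}")
```

Under these assumptions the learning rate of 3e-6 is applied to 64 examples per step, so running with a different `num_gpus` or `batch_size` changes the global batch size and may call for retuning.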
You can also run with `--sharded_ddp`:
```bash
cd transformers/examples/text-classification/
export TASK_NAME=mnli

python -m torch.distributed.launch --nproc_per_node=8 run_glue.py \
  --model_name_or_path microsoft/deberta-v2-xxlarge \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 256 \
  --per_device_train_batch_size 8 \
  --learning_rate 3e-6 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir \
  --sharded_ddp \
  --fp16
```
If you find DeBERTa useful for your work, please cite the following paper:
```bibtex
@inproceedings{he2021deberta,
  title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
  author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
  booktitle={International Conference on Learning Representations},
  year={2021},
  url={https://openreview.net/forum?id=XPZIaotutsD}
}
```