模型:
microsoft/deberta-v2-xlarge
DeBERTa 使用分离的注意力和增强的掩码解码器改进了BERT和RoBERTa模型。在80GB的训练数据上,它在大多数NLU任务上优于BERT和RoBERTa。
请查看 official repository 了解更多详细信息和更新。
这是DeBERTa V2 xlarge模型,具有24层,1536隐藏大小。总参数为900M,使用160GB原始数据进行训练。
我们在SQuAD 1.1/2.0和几个GLUE基准任务上展示了开发结果。
Model | SQuAD 1.1 | SQuAD 2.0 | MNLI-m/mm | SST-2 | QNLI | CoLA | RTE | MRPC | QQP | STS-B |
---|---|---|---|---|---|---|---|---|---|---|
F1/EM | F1/EM | Acc | Acc | Acc | MCC | Acc | Acc/F1 | Acc/F1 | P/S | |
BERT-Large | 90.9/84.1 | 81.8/79.0 | 86.6/- | 93.2 | 92.3 | 60.6 | 70.4 | 88.0/- | 91.3/- | 90.0/- |
RoBERTa-Large | 94.6/88.9 | 89.4/86.5 | 90.2/- | 96.4 | 93.9 | 68.0 | 86.6 | 90.9/- | 92.2/- | 92.4/- |
XLNet-Large | 95.1/89.7 | 90.6/87.9 | 90.8/- | 97.0 | 94.9 | 69.0 | 85.9 | 90.8/- | 92.3/- | 92.5/- |
1234321 1 | 95.5/90.1 | 90.7/88.0 | 91.3/91.1 | 96.5 | 95.3 | 69.5 | 91.0 | 92.6/94.6 | 92.3/- | 92.8/92.5 |
1235321 1 | -/- | -/- | 91.5/91.2 | 97.0 | - | - | 93.1 | 92.1/94.3 | - | 92.9/92.7 |
1236321 1 | 95.8/90.8 | 91.4/88.9 | 91.7/91.6 | 97.5 | 95.8 | 71.1 | 93.9 | 92.0/94.2 | 92.3/89.8 | 92.9/92.9 |
1237321 1,2 | 96.1/91.4 | 92.2/89.7 | 91.7/91.9 | 97.2 | 96.0 | 72.0 | 93.5 | 93.1/94.9 | 92.7/90.3 | 93.2/93.1 |
cd transformers/examples/text-classification/ export TASK_NAME=mrpc python -m torch.distributed.launch --nproc_per_node=8 run_glue.py --model_name_or_path microsoft/deberta-v2-xxlarge \\\\ --task_name $TASK_NAME --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 4 \\\\ --learning_rate 3e-6 --num_train_epochs 3 --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --sharded_ddp --fp16
如果您发现DeBERTa对您的工作有用,请引用以下论文:
@inproceedings{ he2021deberta, title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION}, author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen}, booktitle={International Conference on Learning Representations}, year={2021}, url={https://openreview.net/forum?id=XPZIaotutsD} }