Model:
microsoft/xtremedistil-l12-h384-uncased
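XtremeDistilTransformers is a distilled, task-agnostic transformer model. It uses task transfer to learn a small universal model that can be applied to arbitrary tasks and languages, as described in the paper XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation.
We combine task transfer with multi-task distillation techniques from the papers XtremeDistil: Multi-stage Distillation for Massive Multilingual Models and MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers, and run the experiments with the accompanying GitHub code.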
This l12-h384 checkpoint has 12 layers, a hidden size of 384, and 12 attention heads. It corresponds to 33 million parameters and a 2.7x speedup over BERT-base.
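
As a rough sanity check on the parameter count, the sketch below estimates the size of a BERT-style encoder for the l12-h384 configuration named in the model ID. The vocabulary size (30522), maximum position count (512), segment vocabulary (2), and the 4x feed-forward width are assumptions borrowed from the standard BERT uncased setup, not values stated in this card, and `estimate_bert_params` is a hypothetical helper written for illustration.

```python
def estimate_bert_params(num_layers, hidden_size, vocab_size=30522,
                         max_positions=512, type_vocab_size=2,
                         ffn_multiplier=4):
    """Rough parameter count for a BERT-style encoder without a task head.

    All default values are assumptions taken from the standard BERT
    uncased configuration; they are not stated in the model card.
    """
    h = hidden_size
    ffn = ffn_multiplier * h
    # Embeddings: token, position and segment tables, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * h + 2 * h
    # Self-attention: Q, K, V and output projections (weights and biases),
    # plus one LayerNorm.
    attention = 4 * (h * h + h) + 2 * h
    # Feed-forward: two dense layers (weights and biases), plus one LayerNorm.
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + 2 * h
    # Pooler: one dense layer applied to the first token.
    pooler = h * h + h
    return embeddings + num_layers * (attention + feed_forward) + pooler


total = estimate_bert_params(num_layers=12, hidden_size=384)
print(f"{total / 1e6:.1f}M parameters")  # roughly 33 million
```

Under these assumptions the estimate lands close to the 33 million parameters listed for XtremeDistil-l12-h384 in the results table. The embedding tables account for about a third of the total, which is why halving the layer count does not halve the model size.
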
Other available checkpoints: xtremedistil-l6-h256-uncased and xtremedistil-l6-h384-uncased.
The following table shows results on the GLUE dev set and SQuAD-v2.
| Models | #Params (M) | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQUAD2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
| XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
| XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
| XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |
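
The Avg column appears to be the unweighted mean of the seven task scores (MNLI through SQUAD2), rounded to one decimal. The card does not state this explicitly, so the short check below treats it as an assumption and recomputes the average for a few rows copied from the table; `task_scores` and `reported_avg` are names introduced here for illustration.

```python
# Scores copied from the table, in column order:
# MNLI, QNLI, QQP, RTE, SST, MRPC, SQUAD2.
task_scores = {
    "BERT": [84.5, 91.7, 91.3, 68.6, 93.2, 87.3, 76.8],
    "XtremeDistil-l6-h256": [83.9, 89.5, 90.6, 80.1, 91.2, 90.0, 74.1],
    "XtremeDistil-l6-h384": [85.4, 90.3, 91.0, 80.9, 92.3, 90.0, 76.6],
    "XtremeDistil-l12-h384": [87.2, 91.9, 91.3, 85.6, 93.1, 90.4, 80.2],
}

# Avg values as reported in the last column of the table.
reported_avg = {
    "BERT": 84.8,
    "XtremeDistil-l6-h256": 85.6,
    "XtremeDistil-l6-h384": 86.6,
    "XtremeDistil-l12-h384": 88.5,
}

# Assumption: Avg is the plain mean of the seven scores, rounded to 0.1.
computed_avg = {
    name: round(sum(scores) / len(scores), 1)
    for name, scores in task_scores.items()
}

for name, avg in computed_avg.items():
    print(f"{name}: computed {avg}, reported {reported_avg[name]}")
```

For these rows the recomputed means agree with the reported column, which supports reading Avg as a simple average across the seven tasks.
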
Tested with tensorflow 2.3.1, transformers 4.1.1 and torch 1.6.0.
If you use this checkpoint in your work, please cite:
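
A minimal sketch of loading the checkpoint with the `transformers` Auto classes and PyTorch is given below. The card itself does not include usage code, so this is one plausible way to obtain token-level hidden states rather than the authors' own recipe. `encode_sentences` is a hypothetical helper, and the snippet assumes `transformers` and `torch` are installed and the Hugging Face Hub is reachable.

```python
MODEL_NAME = "microsoft/xtremedistil-l12-h384-uncased"


def encode_sentences(sentences, model_name=MODEL_NAME):
    """Return token-level hidden states for a list of sentences.

    Hypothetical helper for illustration; the checkpoint is downloaded
    from the Hugging Face Hub on first use.
    """
    # Imported lazily so that defining this helper does not require
    # the heavy dependencies to be installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    # Pad to the longest sentence in the batch and return PyTorch tensors.
    inputs = tokenizer(sentences, padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (batch size, sequence length, hidden size)
    return outputs.last_hidden_state
```

Calling `encode_sentences(["Distillation makes models smaller."])` should return a tensor whose last dimension matches the hidden size of 384. The checkpoint is task-agnostic, so downstream use would normally add a task head and fine-tune.
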
    @misc{mukherjee2021xtremedistiltransformers,
      title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation},
      author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
      year={2021},
      eprint={2106.04563},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }