Model:

microsoft/xtremedistil-l6-h384-uncased

Language: English

XtremeDistilTransformers for Distilling Massive Neural Networks

XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer to learn a small universal model applicable to arbitrary tasks and languages, as described in the paper XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation.

We leverage task transfer combined with multi-task distillation techniques from the papers XtremeDistil: Multi-stage Distillation for Massive Multilingual Models and MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers; see the Github code for details.

This l6-h384 checkpoint has 6 layers, a hidden size of 384, and 12 attention heads, corresponding to 22 million parameters with a 5.3x speedup over BERT-base.
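
As a quick sanity check, the checkpoint can be loaded like any other encoder from the Hugging Face Hub. The snippet below is a minimal PyTorch sketch; the model name matches this card, everything else is generic transformers usage.

```python
# Minimal loading sketch, assuming the checkpoint name on the Hub as listed above.
from transformers import AutoTokenizer, AutoModel

model_name = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("XtremeDistil is a small universal model.", return_tensors="pt")
outputs = model(**inputs)
# Hidden size is 384, so the output shape is (1, sequence_length, 384).
print(outputs.last_hidden_state.shape)
```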

Other available checkpoints: xtremedistil-l6-h256-uncased and xtremedistil-l12-h384-uncased

The following table shows the results on the GLUE dev set and SQuAD-v2.

| Models | #Params (M) | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQUAD2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
| XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
| XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
| XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |
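
The GLUE numbers above come from fine-tuning the distilled checkpoint on each downstream task. The sketch below shows how one might fine-tune this checkpoint on SST-2 with the transformers Trainer; the dataset loading and hyperparameters are illustrative assumptions, not the exact settings used to produce the table.

```python
# Minimal fine-tuning sketch on SST-2; hyperparameters are assumptions, not
# the settings behind the reported results.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

model_name = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the SST-2 split of GLUE.
dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="xtremedistil-sst2",
    per_device_train_batch_size=32,
    learning_rate=3e-5,   # assumed; tune per task
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```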

Tested with tensorflow 2.3.1, transformers 4.1.1, and torch 1.6.0.

If you use this checkpoint in your work, please cite:

@misc{mukherjee2021xtremedistiltransformers,
      title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation}, 
      author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
      year={2021},
      eprint={2106.04563},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}