Model:

microsoft/xtremedistil-l6-h256-uncased

Language: English

XtremeDistilTransformers for Distilling Massive Neural Networks

XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer to learn a small universal model that can be applied to arbitrary tasks and languages, as outlined in the paper XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation.

We leverage task transfer combined with multi-task distillation techniques from the papers XtremeDistil: Multi-stage Distillation for Massive Multilingual Models and MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers, with the accompanying Github code.

This l6-h256 checkpoint, with 6 layers, 256 hidden units, and 8 attention heads, corresponds to 13 million parameters and an 8.7x speedup over BERT-base.

Other available checkpoints: xtremedistil-l6-h384-uncased and xtremedistil-l12-h384-uncased
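For a quick check, the checkpoint can be loaded with the Hugging Face transformers library like any BERT-style encoder. The following is a minimal sketch; the example sentence and the printed fields are illustrative, not part of the original card:

```python
from transformers import AutoConfig, AutoTokenizer, AutoModel

# Load the distilled checkpoint like any other BERT-style encoder.
model_name = "microsoft/xtremedistil-l6-h256-uncased"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Inspect the architecture described above (layers, hidden size, attention heads).
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)

# Encode a sentence; the last hidden states can serve as contextual embeddings.
inputs = tokenizer("XtremeDistil is a small universal encoder.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 256)
```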

The table below shows the results on the GLUE dev set and SQuAD-v2.

| Models | #Params (M) | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQUAD2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
| XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
| XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
| XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |

Tested with tensorflow 2.3.1, transformers 4.1.1, torch 1.6.0.
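The GLUE numbers above come from task-specific fine-tuning of the distilled checkpoint. The sketch below shows what such a fine-tuning run could look like with the transformers Trainer and the datasets library; the choice of task (RTE), the hyperparameters, and the output directory are illustrative assumptions, not the configuration used to produce the reported results:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical fine-tuning sketch on a GLUE task (RTE); not the authors' exact recipe.
model_name = "microsoft/xtremedistil-l6-h256-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the sentence pairs of the RTE task.
dataset = load_dataset("glue", "rte")
def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)
dataset = dataset.map(tokenize, batched=True)

# Standard sequence-classification fine-tuning loop; hyperparameters are illustrative.
args = TrainingArguments(output_dir="xtremedistil-rte", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=3e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["validation"])
trainer.train()
```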

If you use this checkpoint in your work, please cite:

@misc{mukherjee2021xtremedistiltransformers,
      title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation}, 
      author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
      year={2021},
      eprint={2106.04563},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}