Model:
distilbert-base-cased
This model is a distilled version of the BERT base model. It was introduced in this paper. The code for the distillation process can be found here. This model is cased: it does make a difference between english and English.
All the details about the pre-training, uses, limitations and potential biases (included below) are the same as for DistilBERT-base-uncased. We strongly encourage you to check it out if you want to know more.
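As a quick illustration of the cased behaviour (a sketch, not part of the original card), the tokenizer preserves capitalization, so the two spellings are tokenized differently:

```python
from transformers import DistilBertTokenizer

# The cased tokenizer keeps capitalization, so "english" and "English"
# map to different token sequences.
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
print(tokenizer.tokenize("english"))
print(tokenizer.tokenize("English"))
```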
DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus as the BERT base model in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process that relies on the BERT base model to generate inputs and labels from those texts. More precisely, it was pretrained with three objectives:

- Distillation loss: the model was trained to return the same probabilities as the BERT base model.
- Masked language modeling (MLM): 15% of the words in a sentence are randomly masked and the model has to predict them, which lets it learn a bidirectional representation of the sentence.
- Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.
In this way, the model learns the same inner representation of the English language as its teacher model, while being faster for inference and downstream tasks.
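For intuition only, here is a minimal PyTorch sketch of how such a three-part objective could be combined; the actual weights, temperature and implementation live in the distillation code linked above, so everything below (function name, loss weights, temperature) is illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_objectives(student_logits, teacher_logits,
                            student_hidden, teacher_hidden,
                            labels, temperature=2.0):
    """Sketch of the three pre-training losses; weights and temperature are illustrative."""
    # 1. Distillation loss: match the teacher's output distribution (soft targets).
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 2. Masked language modeling loss on the hard labels (-100 marks unmasked positions).
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # 3. Cosine embedding loss: align student and teacher hidden states.
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0))
    loss_cos = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    return loss_ce + loss_mlm + loss_cos
```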
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.
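For example, a fine-tuning setup for sequence classification could start from the corresponding head class; the snippet below is only a sketch (the two-label setup, the toy batch and the labels are placeholders, not part of the original card):

```python
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Load the cased checkpoint with a freshly initialized classification head.
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)

# Tokenize a toy batch and run a forward pass; passing labels returns a training loss.
inputs = tokenizer(["great movie!", "terrible movie!"], padding=True, return_tensors='pt')
outputs = model(**inputs, labels=torch.tensor([1, 0]))
print(outputs.loss, outputs.logits)
```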
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.05292855575680733,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.03968575969338417,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a business model. [SEP]",
  'score': 0.034743521362543106,
  'token': 2449,
  'token_str': 'business'},
 {'sequence': "[CLS] hello i'm a model model. [SEP]",
  'score': 0.03462274372577667,
  'token': 2944,
  'token_str': 'model'},
 {'sequence': "[CLS] hello i'm a modeling model. [SEP]",
  'score': 0.018145186826586723,
  'token': 11643,
  'token_str': 'modeling'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import DistilBertTokenizer, DistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import DistilBertTokenizer, TFDistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
Even if the training data used for this model could be characterized as fairly neutral, the model can still make biased predictions. It also inherits some of the bias of its teacher model:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("The White man worked as a [MASK].")

[{'sequence': '[CLS] the white man worked as a blacksmith. [SEP]',
  'score': 0.1235365942120552,
  'token': 20987,
  'token_str': 'blacksmith'},
 {'sequence': '[CLS] the white man worked as a carpenter. [SEP]',
  'score': 0.10142576694488525,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the white man worked as a farmer. [SEP]',
  'score': 0.04985016956925392,
  'token': 7500,
  'token_str': 'farmer'},
 {'sequence': '[CLS] the white man worked as a miner. [SEP]',
  'score': 0.03932540491223335,
  'token': 18594,
  'token_str': 'miner'},
 {'sequence': '[CLS] the white man worked as a butcher. [SEP]',
  'score': 0.03351764753460884,
  'token': 14998,
  'token_str': 'butcher'}]

>>> unmasker("The Black woman worked as a [MASK].")

[{'sequence': '[CLS] the black woman worked as a waitress. [SEP]',
  'score': 0.13283951580524445,
  'token': 13877,
  'token_str': 'waitress'},
 {'sequence': '[CLS] the black woman worked as a nurse. [SEP]',
  'score': 0.12586183845996857,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the black woman worked as a maid. [SEP]',
  'score': 0.11708822101354599,
  'token': 10850,
  'token_str': 'maid'},
 {'sequence': '[CLS] the black woman worked as a prostitute. [SEP]',
  'score': 0.11499975621700287,
  'token': 19215,
  'token_str': 'prostitute'},
 {'sequence': '[CLS] the black woman worked as a housekeeper. [SEP]',
  'score': 0.04722772538661957,
  'token': 22583,
  'token_str': 'housekeeper'}]
```
This bias will also affect all fine-tuned versions of this model.
DistilBERT was pretrained on the same data as BERT, which is BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables and headers).
The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model then take the form:
```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; otherwise, sentence B is a random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the two "sentences" together are shorter than 512 tokens.
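The tokenizer reproduces this input format when given a pair of texts; a quick check (illustrative, not from the original card):

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

# Encoding a text pair yields [CLS] <tokens of A> [SEP] <tokens of B> [SEP].
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
```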
The details of the masking procedure for each sentence are as follows:

- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
- In the remaining 10% of the cases, the masked tokens are left as they are.
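This 80/10/10 scheme matches what `DataCollatorForLanguageModeling` applies by default in the transformers library; a small sketch (the example sentence is arbitrary):

```python
from transformers import DistilBertTokenizer, DataCollatorForLanguageModeling

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

# mlm_probability=0.15 selects 15% of the tokens; of those, 80% become [MASK],
# 10% become a random token and 10% are left unchanged.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("Distillation makes transformer models smaller and faster.")])
print(batch['input_ids'])  # some ids replaced by the [MASK] id
print(batch['labels'])     # -100 everywhere except the selected positions
```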
The model was trained on 8 16 GB V100 GPUs for 90 hours. See the training code for all hyperparameter details.
When fine-tuned on downstream tasks, this model achieves the following results:
GLUE test results:
| Task | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|      | 81.5 | 87.8 | 88.2 | 90.4  | 47.2 | 85.5  | 85.6 | 60.6 |
```bibtex
@article{Sanh2019DistilBERTAD,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.01108}
}
```