Model:
bert-large-cased-whole-word-masking
Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is case-sensitive: it makes a difference between english and English.
Differently from other BERT models, this model was trained with a new technique: whole word masking. In this case, all of the tokens corresponding to a word are masked at once. The overall masking rate remains the same.
The training is otherwise identical -- each masked WordPiece token is predicted independently.
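As an illustration, below is a minimal, hypothetical sketch of selecting whole words to mask over WordPiece tokens. It assumes the standard "##" continuation-piece convention and is not the original training code.

import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Mask whole words: every WordPiece of a selected word is masked together."""
    # Group token indices into words: a "##" piece belongs to the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:  # mask every piece of the chosen word at once
                masked[i] = mask_token
    return masked

tokens = ["the", "man", "jumped", "up", ",", "put", "his", "basket",
          "on", "phil", "##am", "##mon", "'", "s", "head"]
print(whole_word_mask(tokens))

With per-token masking, only one of the pieces "phil", "##am", "##mon" might be masked; with whole word masking, either all three are masked or none are.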
Disclaimer: The team releasing BERT did not write a model card for this model, so this model card has been written by the Hugging Face team.
BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
- Next sentence prediction (NSP): during pretraining, the model concatenates two masked sentences as inputs. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not (a short scoring sketch follows this list).
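A minimal sketch of scoring a sentence pair with the NSP head, assuming the pretrained next-sentence-prediction weights are included in the checkpoint; the example sentences are made up.

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-large-cased-whole-word-masking')
model = BertForNextSentencePrediction.from_pretrained('bert-large-cased-whole-word-masking')

prompt = "The man went to the store."
next_sentence = "He bought a gallon of milk."
encoding = tokenizer(prompt, next_sentence, return_tensors='pt')

with torch.no_grad():
    logits = model(**encoding).logits
# Index 0 ~ "sentence B follows sentence A", index 1 ~ "sentence B is random".
print(torch.softmax(logits, dim=-1))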
This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
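As a minimal sketch of that workflow (not part of the original card), the snippet below uses the [CLS] hidden state as a fixed-size sentence feature and trains a scikit-learn LogisticRegression on invented placeholder sentences and labels.

import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained('bert-large-cased-whole-word-masking')
model = BertModel.from_pretrained('bert-large-cased-whole-word-masking')
model.eval()

# Hypothetical labeled sentences; replace with your own dataset.
sentences = ["I loved this movie.", "This was a waste of time."]
labels = [1, 0]

features = []
with torch.no_grad():
    for text in sentences:
        encoded = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
        output = model(**encoded)
        # Use the [CLS] token's last hidden state as a fixed-size sentence feature.
        features.append(output.last_hidden_state[:, 0, :].squeeze(0).numpy())

clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))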
This model has the following configuration:

- 24 layers
- 1024 hidden dimension
- 16 attention heads
- 336M parameters
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions of a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.
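For instance, a sequence-classification fine-tune typically goes through BertForSequenceClassification. The sketch below only sets up the model and runs a single training step on dummy data, assuming a binary classification task; it is illustrative rather than a complete training recipe.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-large-cased-whole-word-masking')
# A freshly initialized classification head is added on top of the pretrained encoder.
model = BertForSequenceClassification.from_pretrained(
    'bert-large-cased-whole-word-masking', num_labels=2)
model.train()

batch = tokenizer(["A great film.", "Terrible acting."],
                  padding=True, truncation=True, return_tensors='pt')
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # the loss is computed internally
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))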
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-large-cased-whole-word-masking')
>>> unmasker("Hello I'm a [MASK] model.")
[
    {
        "sequence": "[CLS] Hello I'm a fashion model. [SEP]",
        "score": 0.1474294513463974,
        "token": 4633,
        "token_str": "fashion"
    },
    {
        "sequence": "[CLS] Hello I'm a magazine model. [SEP]",
        "score": 0.05430116504430771,
        "token": 2435,
        "token_str": "magazine"
    },
    {
        "sequence": "[CLS] Hello I'm a male model. [SEP]",
        "score": 0.039395421743392944,
        "token": 2581,
        "token_str": "male"
    },
    {
        "sequence": "[CLS] Hello I'm a former model. [SEP]",
        "score": 0.036936815828084946,
        "token": 1393,
        "token_str": "former"
    },
    {
        "sequence": "[CLS] Hello I'm a professional model. [SEP]",
        "score": 0.03663451969623566,
        "token": 1848,
        "token_str": "professional"
    }
]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-large-cased-whole-word-masking')
model = BertModel.from_pretrained("bert-large-cased-whole-word-masking")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-large-cased-whole-word-masking')
model = TFBertModel.from_pretrained("bert-large-cased-whole-word-masking")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-large-cased-whole-word-masking')
>>> unmasker("The man worked as a [MASK].")
[
    {
        "sequence": "[CLS] The man worked as a carpenter. [SEP]",
        "score": 0.09021259099245071,
        "token": 25169,
        "token_str": "carpenter"
    },
    {
        "sequence": "[CLS] The man worked as a cook. [SEP]",
        "score": 0.08125395327806473,
        "token": 9834,
        "token_str": "cook"
    },
    {
        "sequence": "[CLS] The man worked as a mechanic. [SEP]",
        "score": 0.07524766772985458,
        "token": 19459,
        "token_str": "mechanic"
    },
    {
        "sequence": "[CLS] The man worked as a waiter. [SEP]",
        "score": 0.07397029548883438,
        "token": 17989,
        "token_str": "waiter"
    },
    {
        "sequence": "[CLS] The man worked as a guard. [SEP]",
        "score": 0.05848982185125351,
        "token": 3542,
        "token_str": "guard"
    }
]
>>> unmasker("The woman worked as a [MASK].")
[
    {
        "sequence": "[CLS] The woman worked as a maid. [SEP]",
        "score": 0.19436432421207428,
        "token": 13487,
        "token_str": "maid"
    },
    {
        "sequence": "[CLS] The woman worked as a waitress. [SEP]",
        "score": 0.16161060333251953,
        "token": 15098,
        "token_str": "waitress"
    },
    {
        "sequence": "[CLS] The woman worked as a nurse. [SEP]",
        "score": 0.14942803978919983,
        "token": 7439,
        "token_str": "nurse"
    },
    {
        "sequence": "[CLS] The woman worked as a secretary. [SEP]",
        "score": 0.10373266786336899,
        "token": 4848,
        "token_str": "secretary"
    },
    {
        "sequence": "[CLS] The woman worked as a cook. [SEP]",
        "score": 0.06384387612342834,
        "token": 9834,
        "token_str": "cook"
    }
]
This bias will also affect all fine-tuned versions of this model.
The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers).
The texts are tokenized using WordPiece with a cased vocabulary of 28,996 tokens (since the model is cased, the texts are not lowercased). The inputs of the model are then of the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; in the other cases, sentence B is another random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two "sentences" has a combined length of less than 512 tokens.
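To see this input format concretely, passing two (arbitrary example) sentences to the tokenizer reproduces the [CLS] ... [SEP] ... [SEP] layout described above.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-cased-whole-word-masking')
encoded = tokenizer("The cat sat on the mat.", "It fell asleep soon after.")
# Decoding the input ids shows the special tokens inserted around the two segments;
# token_type_ids distinguish sentence A (0) from sentence B (1).
print(tokenizer.decode(encoded['input_ids']))
# [CLS] The cat sat on the mat. [SEP] It fell asleep soon after. [SEP]
print(encoded['token_type_ids'])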
The details of the masking procedure for each sentence are the following (a short sketch of the replacement rule follows this list):

- 15% of the tokens are masked, with all WordPiece tokens of a selected word masked together.
- In 80% of the cases, the masked tokens are replaced by [MASK].
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
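A toy sketch of the 80/10/10 replacement rule applied to positions that have already been selected for masking; the vocabulary and tokens are placeholders, and this is not the original training implementation.

import random

def apply_bert_masking(tokens, positions, vocab, mask_token="[MASK]"):
    """Apply the 80/10/10 replacement rule to pre-selected masked positions."""
    out = list(tokens)
    for i in positions:
        r = random.random()
        if r < 0.8:
            out[i] = mask_token            # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = random.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return out

vocab = ["cat", "dog", "house", "tree", "blue"]
tokens = ["The", "cat", "sat", "on", "the", "mat"]
print(apply_bert_masking(tokens, positions=[1, 5], vocab=vocab))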
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 tokens for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, β1 = 0.9 and β2 = 0.999, a weight decay of 0.01, learning rate warmup for the first 10,000 steps and linear decay of the learning rate afterwards.
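This warmup-then-linear-decay schedule can be approximated with transformers' get_linear_schedule_with_warmup; the snippet below drives it with torch's AdamW and the stated hyperparameters on a placeholder parameter, as an approximation of (not a replica of) the original TensorFlow setup.

import torch
from transformers import get_linear_schedule_with_warmup

# Placeholder parameter; in practice these would be the BERT model's parameters.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=1_000_000)

for step in range(20_000):   # training loop sketch: warmup, then linear decay
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())  # learning rate after 20,000 steps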
When fine-tuned on downstream tasks, this model achieves the following results:
Model | SQUAD 1.1 F1/EM | Multi NLI Accuracy |
---|---|---|
BERT-Large, Cased (Whole Word Masking) | 92.9/86.7 | 86.46 |
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}