Model:
hfl/chinese-macbert-large
This repository contains the resources of our paper "Revisiting Pre-trained Models for Chinese Natural Language Processing", which will be published in the Findings of EMNLP. You can read the camera-ready version of our paper through the ACL Anthology or the arXiv pre-print.
Revisiting Pre-trained Models for Chinese Natural Language Processing
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, Guoping Hu
You may also be interested in:
More resources by HFL: https://github.com/ymcui/HFL-Anthology
MacBERT is an improved BERT with a novel MLM-as-correction pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.
Instead of masking with the [MASK] token, which never appears in the fine-tuning stage, we propose to use similar words for the masking purpose. A similar word is obtained through word2vec (Mikolov et al., 2013) similarity calculations. If an N-gram is selected to be masked, we find a similar word for each of its words individually. In rare cases, when no similar word is available, we fall back to replacement with a random word.
Here is an example of our pre-training task.
| | Example |
| --- | --- |
| Original Sentence | we use a language model to predict the probability of the next word. |
| MLM | we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word . |
| Whole word masking | we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word . |
| N-gram masking | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word . |
| MLM as correction | we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word . |
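Below is a minimal, illustrative sketch of the similar-word corruption idea, not the authors' released pre-training code. The word2vec vectors, the file path, and the 15% selection rate are assumptions for illustration; gensim is used only as a convenient way to query word2vec similarities.

```python
import random
from gensim.models import KeyedVectors  # any word2vec-style vectors would do; this is an assumption

def mac_mask(tokens, kv, mask_rate=0.15):
    """Toy version of MLM as correction: corrupt selected tokens with
    similar words (via word2vec similarity) instead of a [MASK] symbol.
    The real MacBERT pipeline also combines whole-word and N-gram masking;
    this sketch shows only the similar-word substitution step."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            if tok in kv.key_to_index:
                # Replace with the most similar word instead of [MASK].
                similar, _ = kv.most_similar(tok, topn=1)[0]
                corrupted.append(similar)
            else:
                # Rare case: no similar word available, fall back to a random word.
                corrupted.append(random.choice(kv.index_to_key))
            labels.append(tok)       # the model must recover the original token
        else:
            corrupted.append(tok)
            labels.append(None)      # position not selected for prediction
    return corrupted, labels

# Hypothetical usage; the vector file is a placeholder, not a released artifact.
# kv = KeyedVectors.load_word2vec_format("word2vec.vec")
# print(mac_mask("we use a language model to predict the next word".split(), kv))
```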
In addition to the new pre-training task, we also incorporate the following techniques:

- Whole Word Masking (WWM)
- N-gram masking
- Sentence-Order Prediction (SOP)
Note that our MacBERT can directly replace the original BERT, as there are no differences in the main neural architecture.
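Because the architecture matches BERT, the checkpoint can be loaded with the standard BERT classes from Hugging Face Transformers. A minimal usage sketch (the example sentence is arbitrary):

```python
from transformers import BertTokenizer, BertModel

# MacBERT shares BERT's architecture, so the standard BERT classes apply.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-macbert-large")
model = BertModel.from_pretrained("hfl/chinese-macbert-large")

inputs = tokenizer("欢迎使用MacBERT。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 1024) for the large model
```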
For more technical details, please check our paper: Revisiting Pre-trained Models for Chinese Natural Language Processing
If you find our resources or paper useful, please consider including the following citation in your paper.
    @inproceedings{cui-etal-2020-revisiting,
      title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
      author = "Cui, Yiming and Che, Wanxiang and Liu, Ting and Qin, Bing and Wang, Shijin and Hu, Guoping",
      booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
      month = nov,
      year = "2020",
      address = "Online",
      publisher = "Association for Computational Linguistics",
      url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
      pages = "657--668",
    }