Model:

uer/roberta-large-wwm-chinese-cluecorpussmall

Task:

Fill-Mask

Libraries:

PyTorch Transformers

Datasets:

CLUECorpusSmall

Language:

Chinese

Other:

bert AutoTrain Compatible

arXiv:

arxiv:1909.05658 arxiv:1908.08962

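Chinese Whole Word Masking RoBERTa Miniatures

Model description

This is a set of Chinese Whole Word Masking RoBERTa miniature models pre-trained by UER-py.

Turc et al. showed that the standard BERT recipe is effective across a wide range of model sizes. Following their paper, we release 6 Chinese Whole Word Masking RoBERTa models. To make the results easy to reproduce, we use a publicly available corpus and word segmentation tool, and provide all training details.

You can download the 6 Chinese RoBERTa miniatures directly from the links below: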

Model	Link
Tiny	12312321
Mini	12313321
Small	12314321
Medium	12315321
Base	12316321
Large	12317321

Here are the scores on the development sets of six Chinese tasks:

Model	Score	book_review	chnsenticorp	lcqmc	tnews(CLUE)	iflytek(CLUE)	ocnli(CLUE)
RoBERTa-Tiny-WWM	72.2	83.6	91.8	81.8	62.1	55.4	58.6
RoBERTa-Mini-WWM	76.3	86.2	93.0	86.8	64.4	58.7	68.8
RoBERTa-Small-WWM	77.6	88.1	93.8	87.2	65.2	59.6	71.4
RoBERTa-Medium-WWM	78.6	89.5	94.4	88.8	66.0	59.9	73.2
RoBERTa-Base-WWM	80.2	90.3	95.8	89.4	67.5	61.8	76.2
RoBERTa-Large-WWM	81.1	91.3	95.8	90.0	68.5	62.1	79.1
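
The Score column appears to be the unweighted mean of the six task scores, rounded to one decimal place. The card does not state this explicitly, so the check below is only a sanity-check sketch using the numbers copied from the table; the names `task_scores`, `reported`, and `mean_score` are illustrative.

```python
# Dev-set scores copied from the table above, in column order:
# book_review, chnsenticorp, lcqmc, tnews, iflytek, ocnli.
task_scores = {
    "RoBERTa-Tiny-WWM":   [83.6, 91.8, 81.8, 62.1, 55.4, 58.6],
    "RoBERTa-Mini-WWM":   [86.2, 93.0, 86.8, 64.4, 58.7, 68.8],
    "RoBERTa-Small-WWM":  [88.1, 93.8, 87.2, 65.2, 59.6, 71.4],
    "RoBERTa-Medium-WWM": [89.5, 94.4, 88.8, 66.0, 59.9, 73.2],
    "RoBERTa-Base-WWM":   [90.3, 95.8, 89.4, 67.5, 61.8, 76.2],
    "RoBERTa-Large-WWM":  [91.3, 95.8, 90.0, 68.5, 62.1, 79.1],
}

# Values of the "Score" column, copied from the same table.
reported = {
    "RoBERTa-Tiny-WWM": 72.2,
    "RoBERTa-Mini-WWM": 76.3,
    "RoBERTa-Small-WWM": 77.6,
    "RoBERTa-Medium-WWM": 78.6,
    "RoBERTa-Base-WWM": 80.2,
    "RoBERTa-Large-WWM": 81.1,
}


def mean_score(scores):
    """Unweighted mean of the per-task scores."""
    return sum(scores) / len(scores)


for name, scores in task_scores.items():
    print(f"{name}: mean={mean_score(scores):.2f} reported={reported[name]}")
```

Every computed mean lands within rounding distance (0.05) of the reported value, which is consistent with the assumption but does not prove how the authors aggregated the scores.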

For each task, we selected the best fine-tuning hyperparameters from the lists below and trained with a sequence length of 128:

epochs: 3, 5, 8
batch sizes: 32, 64
learning rates: 3e-5, 1e-4, 3e-4
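
The search space above can be enumerated in a few lines. This is a minimal sketch that assumes the full Cartesian product of the three lists was tried, which the card does not state; `evaluate` is a hypothetical callable that fine-tunes with one configuration and returns its development-set score.

```python
from itertools import product

# Candidate values copied from the lists above.
EPOCHS = [3, 5, 8]
BATCH_SIZES = [32, 64]
LEARNING_RATES = [3e-5, 1e-4, 3e-4]


def candidate_configs():
    """Return every combination of epochs, batch size, and learning rate."""
    return [
        {"epochs": e, "batch_size": b, "learning_rate": lr}
        for e, b, lr in product(EPOCHS, BATCH_SIZES, LEARNING_RATES)
    ]


def select_best(configs, evaluate):
    """Pick the configuration with the highest score.

    `evaluate` is a hypothetical function: config dict -> dev-set score.
    """
    return max(configs, key=evaluate)


configs = candidate_configs()
print(len(configs), "configurations per task")
```

In practice `evaluate` would wrap a full fine-tuning run, so the loop over 18 configurations dominates the cost of model selection.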

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='uer/roberta-tiny-wwm-chinese-cluecorpussmall')
>>> unmasker("北京是[MASK]国的首都。")
[
    {'score': 0.294228732585907, 
     'token': 704, 
     'token_str': '中', 
     'sequence': '北 京 是 中 国 的 首 都 。'},
    {'score': 0.19691626727581024, 
     'token': 1266, 
     'token_str': '北', 
     'sequence': '北 京 是 北 国 的 首 都 。'},
    {'score': 0.1070084273815155, 
     'token': 7506, 
     'token_str': '韩', 
     'sequence': '北 京 是 韩 国 的 首 都 。'},
    {'score': 0.031527262181043625, 
     'token': 2769, 
     'token_str': '我', 
     'sequence': '北 京 是 我 国 的 首 都 。'},
    {'score': 0.023054633289575577, 
     'token': 1298, 
     'token_str': '南', 
     'sequence': '北 京 是 南 国 的 首 都 。'}
]
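
The pipeline returns a plain list of dicts, so it can be post-processed without any extra library. The sketch below reuses the first three predictions shown above as sample data; `best_fill` is an illustrative helper, and stripping spaces from `sequence` assumes the text is pure Chinese, where the tokenizer inserts a space between every character.

```python
# First three predictions copied from the example output above.
predictions = [
    {"score": 0.294228732585907, "token": 704, "token_str": "中",
     "sequence": "北 京 是 中 国 的 首 都 。"},
    {"score": 0.19691626727581024, "token": 1266, "token_str": "北",
     "sequence": "北 京 是 北 国 的 首 都 。"},
    {"score": 0.1070084273815155, "token": 7506, "token_str": "韩",
     "sequence": "北 京 是 韩 国 的 首 都 。"},
]


def best_fill(preds):
    """Return (token, sentence) for the highest-scoring prediction.

    Spaces are removed from the sentence; this is only appropriate for
    text without meaningful whitespace, such as pure Chinese.
    """
    top = max(preds, key=lambda p: p["score"])
    return top["token_str"], top["sequence"].replace(" ", "")


token, sentence = best_fill(predictions)
print(token, sentence)
```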

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('uer/roberta-base-wwm-chinese-cluecorpussmall')
model = BertModel.from_pretrained("uer/roberta-base-wwm-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

and in TensorFlow:

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('uer/roberta-base-wwm-chinese-cluecorpussmall')
model = TFBertModel.from_pretrained("uer/roberta-base-wwm-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

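Training data

CLUECorpusSmall is used as training data.

Training procedure

Models are pre-trained by UER-py on Tencent Cloud. We pre-train for 1,000,000 steps with a sequence length of 128, then for an additional 250,000 steps with a sequence length of 512. We use the same hyperparameters for all model sizes.

jieba is used as the word segmentation tool.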
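
To illustrate why word segmentation matters here, the following is a simplified sketch of whole word masking: when a word is selected, all of its characters are masked together. It is not UER-py's implementation. It assumes character-level tokens, always substitutes `[MASK]` (the usual random-replace and keep-original cases are omitted), and uses a hand-written segmentation in place of a real jieba call such as `jieba.lcut(text)`. The names `whole_word_mask` and `mask_prob` are illustrative.

```python
import random

MASK = "[MASK]"


def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Mask every character of each selected word.

    words: list of segmented words (e.g. what a segmenter would return).
    Returns (tokens, labels): tokens is the character sequence with masked
    words replaced by [MASK] per character; labels holds the original
    character at masked positions and None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    tokens, labels = [], []
    for word in words:
        # One decision per word, so a word is never partially masked.
        masked = rng.random() < mask_prob
        for ch in word:
            tokens.append(MASK if masked else ch)
            labels.append(ch if masked else None)
    return tokens, labels


# Hand-written segmentation, standing in for the segmenter's output.
words = ["北京", "是", "中国", "的", "首都", "。"]
tokens, labels = whole_word_mask(words, mask_prob=0.5, rng=random.Random(0))
print(tokens)
print(labels)
```

Character-level masking would instead make one decision per character, which can leave half of a word such as 中国 visible and make the prediction much easier.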

Taking Whole Word Masking RoBERTa-Medium as an example:

Stage 1:

python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path cluecorpussmall_seq128_dataset.pt \
                      --processes_num 32 --seq_length 128 \
                      --dynamic_masking --data_processor mlm

python3 pretrain.py --dataset_path cluecorpussmall_seq128_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/bert/medium_config.json \
                    --output_model_path models/cluecorpussmall_wwm_roberta_medium_seq128_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-4 --batch_size 64 \
                    --whole_word_masking \
                    --data_processor mlm --target mlm

Stage 2:

python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path cluecorpussmall_seq512_dataset.pt \
                      --processes_num 32 --seq_length 512 \
                      --dynamic_masking --data_processor mlm

python3 pretrain.py --dataset_path cluecorpussmall_seq512_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/cluecorpussmall_wwm_roberta_medium_seq128_model.bin-1000000 \
                    --config_path models/bert/medium_config.json \
                    --output_model_path models/cluecorpussmall_wwm_roberta_medium_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
                    --learning_rate 5e-5 --batch_size 16 \
                    --whole_word_masking \
                    --data_processor mlm --target mlm

Finally, we convert the pre-trained model into Hugging Face's format:

python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_wwm_roberta_medium_seq512_model.bin-250000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 8 --type mlm

BibTeX entry and citation info

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}

Author:

UER

Dataset size:

1.21 GB