Chinese CPT-Base

News

December 30, 2022

We have released updated versions of CPT and Chinese BART. The new versions change the following:

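Vocabulary: We replaced the old BERT vocabulary with a larger one of size 51271 built from the training data. In the new vocabulary, we:
added more than 6800 missing Chinese characters (most of them traditional Chinese characters);
removed redundant tokens (e.g. Chinese character tokens with the "##" prefix);
added some English tokens to reduce OOV.
Position embeddings: We extended max_position_embeddings from 512 to 1024.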
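As a rough illustration of the "redundant token" cleanup, the sketch below flags vocabulary entries that consist of the "##" continuation prefix followed by a single Chinese character. The helper names, the toy vocabulary, and the use of the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) as the definition of "Chinese character" are assumptions made for this example; the actual filtering rule used for the release is not stated here.

```python
def is_cjk_char(text: str) -> bool:
    """True if text is one character in the basic CJK Unified Ideographs block."""
    return len(text) == 1 and "\u4e00" <= text <= "\u9fff"


def redundant_cjk_subword_tokens(vocab):
    """Return tokens shaped like '##' + one Chinese character (hypothetical rule)."""
    return [tok for tok in vocab if tok.startswith("##") and is_cjk_char(tok[2:])]


# Toy vocabulary for illustration only; not taken from the released vocab file.
toy_vocab = ["[CLS]", "北", "##北", "京", "##京", "play", "##ing"]

redundant = redundant_cjk_subword_tokens(toy_vocab)
redundant_set = set(redundant)
kept = [tok for tok in toy_vocab if tok not in redundant_set]

print(redundant)  # the '##'-prefixed Chinese character tokens
print(kept)       # everything else, including English subwords such as '##ing'
```

English continuation pieces such as "##ing" are left alone, since the release notes single out only the Chinese character tokens.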

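We initialized the new models from the old checkpoints with vocabulary alignment: token embeddings found in the old checkpoints are copied, and the other newly added parameters are randomly initialized. We then trained the new CPT and Chinese BART for 50K steps with batch size 2048, maximum sequence length 1024, peak learning rate 2e-5, and warmup ratio 0.1.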
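The vocabulary alignment step can be pictured with a small, framework-free sketch: rows for tokens that exist in the old vocabulary are copied, and rows for new tokens are drawn at random. The function name, the toy vocabularies, and the Gaussian initialization with standard deviation 0.02 are assumptions for illustration; the notes above only say that new parameters are "randomly initialized".

```python
import random


def align_embeddings(old_vocab, old_embeddings, new_vocab, dim, seed=0):
    """Build an embedding table for new_vocab from an old table.

    Rows for tokens present in old_vocab are copied; all other rows are
    randomly initialized. Returns (new_embeddings, copied_tokens).
    """
    rng = random.Random(seed)
    old_index = {tok: i for i, tok in enumerate(old_vocab)}
    new_embeddings, copied = [], []
    for tok in new_vocab:
        if tok in old_index:
            # Token existed before: reuse its trained embedding row.
            new_embeddings.append(list(old_embeddings[old_index[tok]]))
            copied.append(tok)
        else:
            # Assumed initializer (Gaussian, std 0.02); not confirmed by the source.
            new_embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return new_embeddings, copied


# Toy data for illustration only.
old_vocab = ["[PAD]", "北", "##北", "京"]
old_emb = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
new_vocab = ["[PAD]", "北", "京", "龍"]  # "龍" stands in for a newly added character

new_emb, copied = align_embeddings(old_vocab, old_emb, new_vocab, dim=2)
print(copied)      # tokens whose rows were copied from the old table
print(new_emb[2])  # the row for "京", copied from old index 3
```

Copying by token string rather than by index matters here because removing and adding tokens shifts the ids: "京" moves from index 3 to index 2 in the toy example but keeps its trained row.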

The results compared with the previous checkpoints are as follows:

Model	AFQMC	IFLYTEK	CSL-sum	LCSTS	AVG
Previous
bart-base	73.0	60	62.1	37.8	58.23
cpt-base	75.1	60.5	63.0	38.2	59.20
bart-large	75.7	62.1	64.2	40.6	60.65
cpt-large	75.9	61.8	63.7	42.0	60.85
Updated
bart-base	73.03	61.25	61.51	38.78	58.64
cpt-base	74.40	61.23	62.09	38.81	59.13
bart-large	75.81	61.52	64.62	40.90	60.71
cpt-large	75.97	61.63	63.83	42.08	60.88
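The AVG column appears to be the unweighted mean of the four task scores. The short check below recomputes it from the table and reports the change in AVG for each model between the two releases. The dictionary layout and function name are our own, and the half-up rounding to two decimals is an assumption that happens to reproduce every reported value.

```python
from decimal import Decimal, ROUND_HALF_UP

# (AFQMC, IFLYTEK, CSL-sum, LCSTS, reported AVG), copied from the table above.
RESULTS = {
    "previous": {
        "bart-base": ("73.0", "60", "62.1", "37.8", "58.23"),
        "cpt-base": ("75.1", "60.5", "63.0", "38.2", "59.20"),
        "bart-large": ("75.7", "62.1", "64.2", "40.6", "60.65"),
        "cpt-large": ("75.9", "61.8", "63.7", "42.0", "60.85"),
    },
    "updated": {
        "bart-base": ("73.03", "61.25", "61.51", "38.78", "58.64"),
        "cpt-base": ("74.40", "61.23", "62.09", "38.81", "59.13"),
        "bart-large": ("75.81", "61.52", "64.62", "40.90", "60.71"),
        "cpt-large": ("75.97", "61.63", "63.83", "42.08", "60.88"),
    },
}


def mean_score(scores):
    """Unweighted mean of the task scores, rounded half-up to two decimals."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# True for each row whose recomputed mean equals the reported AVG.
avg_matches = {
    (release, model): mean_score(row[:4]) == Decimal(row[4])
    for release, rows in RESULTS.items()
    for model, row in rows.items()
}

# Change in reported AVG, updated minus previous.
avg_delta = {
    model: Decimal(RESULTS["updated"][model][4]) - Decimal(RESULTS["previous"][model][4])
    for model in RESULTS["previous"]
}

print(all(avg_matches.values()))
for model, delta in avg_delta.items():
    print(model, delta)
```

Decimal arithmetic is used instead of floats so that a mean such as 58.225 rounds the same way as in the table.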

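The results show that the updated models perform on par with the previous checkpoints. In a few cases the updated model is still slightly worse than the previous one, for the following reasons:

the few additional training steps did not bring significant performance improvement;

some downstream tasks are not affected by the newly added tokens or the longer encoding sequences, but are sensitive to the fine-tuning hyperparameters.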

Note: To use the updated models, please update the model file modeling_cpt.py (new version download: Here) and the vocabulary (refresh the cache).
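One way to refresh the cache is to pass force_download=True to from_pretrained, which re-fetches the files instead of reusing local copies. The sketch below wraps that in a helper and adds a small sanity check against the numbers given in the release notes. The helper names are hypothetical, and the check assumes the tokenizer reports the stated vocabulary size and that the model config exposes max_position_embeddings.

```python
EXPECTED_VOCAB_SIZE = 51271    # vocabulary size stated in the release notes
EXPECTED_MAX_POSITIONS = 1024  # max_position_embeddings stated in the release notes


def looks_updated(vocab_size: int, max_positions: int) -> bool:
    """True if the loaded sizes match the values announced for the new release."""
    return vocab_size == EXPECTED_VOCAB_SIZE and max_positions == EXPECTED_MAX_POSITIONS


def reload_cpt(model_name: str = "fnlp/cpt-base"):
    """Re-download tokenizer and weights, bypassing the local cache.

    Requires network access, the transformers package, and the updated
    modeling_cpt.py on the import path.
    """
    # Imported lazily so looks_updated() can be used without these packages.
    from transformers import BertTokenizer
    from modeling_cpt import CPTForConditionalGeneration

    tokenizer = BertTokenizer.from_pretrained(model_name, force_download=True)
    model = CPTForConditionalGeneration.from_pretrained(model_name, force_download=True)
    return tokenizer, model


# Example usage (commented out because it downloads the model):
# tokenizer, model = reload_cpt()
# print(looks_updated(tokenizer.vocab_size, model.config.max_position_embeddings))
```

If the check prints False after a forced download, the likely culprit is a stale modeling_cpt.py or a pinned older revision rather than the cache.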

Model description

This is an implementation of CPT-Base. To use CPT, import the file modeling_cpt.py (download: Here), which defines the CPT architecture, into your project.

CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Fei Yang, Li Zhe, Hujun Bao, Xipeng Qiu

GitHub link: https://github.com/fastnlp/CPT

Usage

>>> # modeling_cpt.py must be on the import path (see "Model description")
>>> from modeling_cpt import CPTForConditionalGeneration
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-base")
>>> model = CPTForConditionalGeneration.from_pretrained("fnlp/cpt-base")
>>> # Input sentence: "Beijing is the capital of [MASK]"
>>> input_ids = tokenizer.encode("北京是[MASK]的首都", return_tensors='pt')
>>> pred_ids = model.generate(input_ids, num_beams=4, max_length=20)
>>> # pred_ids has one row per input sequence; decode the first (and only) one
>>> print(tokenizer.convert_ids_to_tokens(pred_ids[0]))
    ['[SEP]', '[CLS]', '北', '京', '是', '中', '国', '的', '首', '都', '[SEP]']

Note: Please use BertTokenizer for the model vocabulary. DO NOT use the original BartTokenizer.

Citation

@article{shao2021cpt,
  title={CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation}, 
  author={Yunfan Shao and Zhichao Geng and Yitao Liu and Junqi Dai and Fei Yang and Li Zhe and Hujun Bao and Xipeng Qiu},
  journal={arXiv preprint arXiv:2109.05729},
  year={2021}
}

Author:

Fudan NLP

Dataset size:

553.36 MB