roberta-long-japanese (jumanpp + sentencepiece, mC4 Japanese)

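This is a longer-input version of the RoBERTa Japanese model, pretrained on approximately 200M Japanese sentences. max_position_embeddings has been increased to 1282, allowing the model to handle longer inputs than the basic RoBERTa model.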

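The tokenization model and logic are exactly the same as in nlp-waseda/roberta-base-japanese. The input text should be pre-segmented by Juman++ v2.0.0-rc3, and SentencePiece tokenization is then applied to the whitespace-separated token sequence. See tokenizer_config.json for details.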
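The pre-segmentation step can be sketched as a small wrapper around the Juman++ command line. This is a minimal sketch, not the authors' pipeline: the helper name `presegment` is hypothetical, and the default command assumes a Juman++ v2 binary named `jumanpp` is on the PATH and accepts a `--segment` flag for segmentation-only output.

```python
import subprocess


def presegment(text, command=("jumanpp", "--segment")):
    """Return `text` as one line of whitespace-separated segments.

    `command` is an assumption: a Juman++ v2 binary on PATH, run in
    segmentation-only mode. Any program that reads text on stdin and
    writes space-separated segments on stdout can be substituted.
    """
    result = subprocess.run(
        list(command),
        input=text.strip() + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if the segmenter fails
    )
    # Collapse all whitespace (including newlines) into single spaces.
    return " ".join(result.stdout.split())
```

The returned string is what the tokenizer expects as input, for example `tokenizer(presegment(raw_text), return_tensors="pt")`.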

How to use

Please install Juman++ v2.0.0-rc3 and SentencePiece in advance.

You can load the model and the tokenizer via AutoModel and AutoTokenizer, respectively.

from transformers import AutoModel, AutoTokenizer

# Load the pretrained weights and the matching tokenizer.
model = AutoModel.from_pretrained("megagonlabs/roberta-long-japanese")
tokenizer = AutoTokenizer.from_pretrained("megagonlabs/roberta-long-japanese")
# The input must already be segmented by Juman++ (whitespace-separated tokens).
model(**tokenizer("まさに オール マイ ティー な 商品 だ 。", return_tensors="pt")).last_hidden_state
tensor([[[ 0.1549, -0.7576,  0.1098,  ...,  0.7124,  0.8062, -0.9880],
         [-0.6586, -0.6138, -0.5253,  ...,  0.8853,  0.4822, -0.6463],
         [-0.4502, -1.4675, -0.4095,  ...,  0.9053, -0.2017, -0.7756],
         ...,
         [ 0.3505, -1.8235, -0.6019,  ..., -0.0906, -0.5479, -0.6899],
         [ 1.0524, -0.8609, -0.6029,  ...,  0.1022, -0.6802,  0.0982],
         [ 0.6519, -0.2042, -0.6205,  ..., -0.0738, -0.0302, -0.1955]]],
       grad_fn=<NativeLayerNormBackward0>)

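Model architecture

The model architecture is almost the same as that of nlp-waseda/roberta-base-japanese, except that max_position_embeddings is increased to 1282: 12 layers, a hidden state size of 768 dimensions, and 12 attention heads.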
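As a worked example of what the 1282 figure means in practice: in the huggingface/transformers RoBERTa implementation, position ids for real tokens start at `padding_idx + 1`, so the lowest indices of the position table are never used by input tokens. The sketch below assumes the RoBERTa default `padding_idx` of 1 (this model's config may be checked to confirm); the function name `max_input_tokens` is hypothetical.

```python
def max_input_tokens(max_position_embeddings, padding_idx=1):
    """Longest token sequence (special tokens included) that fits the
    position embedding table, assuming RoBERTa-style position ids that
    start at padding_idx + 1."""
    return max_position_embeddings - (padding_idx + 1)


print(max_input_tokens(1282))  # 1280
print(max_input_tokens(514))   # 512, the original roberta-base setting
```

Under this assumption, a truncation limit such as `max_length=1280` keeps inputs within the position table.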

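Training data and libraries

The model is trained on the Japanese text extracted from mC4, the multilingual web-crawl corpus derived from Common Crawl. We used Sudachi to split the text into sentences, and applied a simple rule-based filter to remove non-linguistic segments of the mC4 multilingual corpus. The extracted text contains over 600M sentences in total, and we used approximately 200M of them for pretraining.

We used the huggingface/transformers RoBERTa implementation for pretraining. Pretraining took about 700 hours on a GCP instance with 8 A100 GPUs, with Automatic Mixed Precision enabled.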
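The actual filtering rules are not described here, so the following is only an illustrative sketch of what a sentence split followed by a rule-based filter could look like. Everything in it is an assumption: the punctuation-based splitter stands in for Sudachi, and the thresholds (`min_chars`, `min_japanese_ratio`) and function names are hypothetical.

```python
import re

# One sentence = a run of non-terminator characters plus an optional terminator.
_SENTENCE = re.compile(r"[^。！？\n]+[。！？]?")
# Hiragana, katakana, and common CJK ideographs.
_JAPANESE = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def split_sentences(text):
    """Naive punctuation-based splitter (a stand-in for Sudachi)."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def looks_linguistic(sentence, min_chars=5, min_japanese_ratio=0.5):
    """Hypothetical rule: keep sentences that are long enough and
    consist mostly of Japanese characters."""
    chars = [c for c in sentence if not c.isspace()]
    if len(chars) < min_chars:
        return False
    japanese = sum(1 for c in chars if _JAPANESE.match(c))
    return japanese / len(chars) >= min_japanese_ratio


text = "今日はとても良い天気です。<div class=nav>Home | About</div>。明日は雨が降るでしょう。"
kept = [s for s in split_sentences(text) if looks_linguistic(s)]
print(kept)
```

In this sketch the markup-like middle segment is dropped because it contains no Japanese characters, while the two natural-language sentences are kept.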

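License

The pretrained models are distributed under the terms of the MIT License.

Citation

Contains information from mC4, which is made available under the ODC Attribution License.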

@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}

Author:

Megagon Labs

Dataset size:

429.12 MB