google/reformer-enwik8

Model:

google/reformer-enwik8

Task:

Text generation

Libraries:

PyTorch Transformers

Other:

reformer

A character-level Reformer language model trained on the enwik8 dataset.

enwik8 is a dataset based on Wikipedia and is often used to measure a model's ability to compress data, for example in the scope of the Hutter Prize: https://en.wikipedia.org/wiki/Hutter_Prize .
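
As a rough illustration of how a language model's loss relates to compression, the sketch below converts a mean cross-entropy loss (in nats per character) into bits per character and an idealized compressed size. The helper names are hypothetical, and the size estimate assumes an ideal entropy coder driven by the model's predictions; it does not describe how Hutter Prize entries are scored.

```python
import math


def bits_per_char(mean_nll_nats: float) -> float:
    """Convert a mean negative log-likelihood in nats per character to bits per character."""
    return mean_nll_nats / math.log(2)


def estimated_compressed_bytes(mean_nll_nats: float, n_chars: int) -> float:
    """Idealized compressed size in bytes, assuming a perfect entropy coder (an assumption)."""
    return bits_per_char(mean_nll_nats) * n_chars / 8


# A loss of ln(2) nats per character corresponds to exactly 1 bit per character.
loss = math.log(2)
bpc = bits_per_char(loss)
size = estimated_compressed_bytes(loss, n_chars=800)
```

A lower loss on held-out text therefore maps directly to fewer bits per character.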

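reformer-enwik8 was pretrained on the first 90M characters of enwik8, with the text chunked into batches of 65536 characters (= 2^16). The model's weights were taken from https://console.cloud.google.com/storage/browser/trax-ml/reformer/enwik8 and converted to Hugging Face's PyTorch ReformerLM model, ReformerModelWithLMHead.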
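
The chunking described above can be sketched as follows. This is a minimal illustration, not the original training pipeline: the function name `chunk_bytes` is hypothetical, characters are treated as raw bytes, and dropping the trailing partial chunk is an assumption.

```python
SEQ_LEN = 2 ** 16          # 65536, the chunk size stated above
TRAIN_CHARS = 90_000_000   # the first 90M characters, as stated above


def chunk_bytes(data: bytes, seq_len: int = SEQ_LEN, limit: int = TRAIN_CHARS):
    """Split the first `limit` bytes of `data` into consecutive chunks of `seq_len` bytes.

    A trailing partial chunk is dropped (an assumption made for this sketch).
    """
    data = data[:limit]
    n_full = len(data) // seq_len
    return [data[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Stand-in data; the real input would be the raw enwik8 file contents.
chunks = chunk_bytes(b"a" * 200_000)
```

With 200,000 bytes of stand-in data this yields three full chunks of 65536 bytes each; the remaining 3392 bytes are discarded.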

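The model is a language model that operates on characters, so it does not need a tokenizer. The following functions can be used for encoding and decoding instead: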

import torch

# Encoding
def encode(list_of_strings, pad_token_id=0):
    # convert every input to bytes first so that lengths are measured in bytes,
    # not in characters (the two differ for non-ASCII text)
    byte_strings = [s if isinstance(s, bytes) else s.encode("utf-8") for s in list_of_strings]
    max_length = max(len(b) for b in byte_strings)

    # create empty tensors
    attention_masks = torch.zeros((len(byte_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(byte_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, b in enumerate(byte_strings):
        # shift every byte value by 2 so that ids 0 and 1 never represent a byte
        input_ids[idx, :len(b)] = torch.tensor([x + 2 for x in b])
        attention_masks[idx, :len(b)] = 1

    return input_ids, attention_masks
    
# Decoding
def decode(outputs_ids):
    decoded_outputs = []
    for output_ids in outputs_ids.tolist():
        # transform each id back to a char; ids < 2 are mapped to ""
        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
    return decoded_outputs
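
To make the id scheme concrete, the torch-free sketch below reproduces the same mapping on plain Python lists: each byte value is shifted by 2, shorter rows are padded with id 0, and ids below 2 are skipped when decoding. The helper names `to_ids`, `pad`, and `from_ids` are hypothetical and exist only for this illustration; the example sticks to ASCII text.

```python
def to_ids(text, offset=2):
    """Map a string to ids by shifting each UTF-8 byte value by `offset`."""
    return [b + offset for b in text.encode("utf-8")]


def pad(rows, pad_token_id=0):
    """Right-pad id rows to a common width and build the matching attention mask."""
    width = max(len(r) for r in rows)
    ids = [r + [pad_token_id] * (width - len(r)) for r in rows]
    mask = [[1] * len(r) + [0] * (width - len(r)) for r in rows]
    return ids, mask


def from_ids(ids, offset=2):
    """Map ids back to characters, skipping ids below `offset` (such as padding)."""
    return "".join(chr(i - offset) for i in ids if i >= offset)


ids, mask = pad([to_ids("Hi"), to_ids("Hello")])
# ids[0]  is [74, 107, 0, 0, 0]   ("H" is byte 72, "i" is byte 105)
# mask[0] is [1, 1, 0, 0, 0]
round_trip = [from_ids(row) for row in ids]
```

Decoding drops the padding, so `round_trip` contains the two original strings.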

Text can be generated as follows:

from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))

# gives:
# In 1965, Brooks left IBM to found the Department of Journalism in 1968. IBM had jurisdiction himself in 1980, while Brooks resolved, nevertheless thro

Note: Language generation using ReformerModelWithLMHead is not optimized yet and is rather slow.

Author:

Google AI

Dataset size:

569.13 MB