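OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd, 2022 by Meta AI.
Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. The content of this model card has been written by the Hugging Face team.
To quote the first two paragraphs of the official paper: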
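OPT was predominantly pretrained with English text, but a small amount of non-English data is still present in the training corpus via CommonCrawl. OPT belongs to the same family of decoder-only models as GPT-3 and, like it, was pretrained using the self-supervised causal language modeling (CLM) objective.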
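As a rough illustration of what a causal language modeling objective measures, the sketch below computes the average negative log-likelihood of each token given the tokens before it. It is a toy in plain Python, not OPT's training code: the function name `causal_lm_loss`, the dictionary-based probability format, and the toy numbers are all assumptions made for this example.

```python
import math


def causal_lm_loss(token_ids, next_token_probs):
    """Average negative log-likelihood of each token given its prefix.

    token_ids: list of int token ids of length T.
    next_token_probs: list of T-1 dicts (hypothetical model output);
        next_token_probs[t] maps a candidate token id to the probability
        that it follows token_ids[: t + 1].
    """
    if len(next_token_probs) != len(token_ids) - 1:
        raise ValueError("need exactly one distribution per predicted position")
    # Position t predicts token t + 1, so the targets are the inputs shifted by one.
    targets = token_ids[1:]
    nll = [-math.log(dist[target]) for dist, target in zip(next_token_probs, targets)]
    return sum(nll) / len(nll)


# Toy sequence of three tokens with made-up next-token probabilities.
toy_ids = [0, 1, 2]
toy_probs = [
    {1: 0.5, 2: 0.5},    # after token 0, the true next token (1) gets p = 0.5
    {1: 0.25, 2: 0.75},  # after tokens 0, 1, the true next token (2) gets p = 0.75
]
loss = causal_lm_loss(toy_ids, toy_probs)
print(f"{loss:.4f}")
```

The shift by one position is the defining feature: the sequence supplies its own labels, which is why the objective is called self-supervised.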
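For evaluation, OPT follows GPT-3 in its prompts and overall experimental setup. For more details, please read the official paper.
The pretrained-only model can be used for prompt-based evaluation of downstream tasks as well as for text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, see the model hub.
For large OPT models such as this one, using the text-generation pipeline is not recommended, because the model should be loaded in half precision to speed up generation and reduce memory consumption on the GPU. Instead, call the generate method directly, as follows: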
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> import torch
>>> model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b", torch_dtype=torch.float16).cuda()
>>> # the fast tokenizer currently does not work correctly
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)
>>> prompt = "Hello, I am conscious and"
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
>>> generated_ids = model.generate(input_ids)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
['Hello, I am conscious and I am here.\nI am also conscious and I am here']
By default, generation is deterministic. To sample instead, set do_sample to True:
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
>>> import torch
>>> model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b", torch_dtype=torch.float16).cuda()
>>> # the fast tokenizer currently does not work correctly
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)
>>> prompt = "Hello, I am conscious and"
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
>>> set_seed(32)
>>> generated_ids = model.generate(input_ids, do_sample=True)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
['Hello, I am conscious and aware that you have your back turned to me and want to talk']
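To see why half precision matters for a checkpoint of this size, the memory needed for the weights alone can be estimated with a short calculation. This is a back-of-the-envelope sketch: the nominal count of 30 billion parameters is an assumption read off the checkpoint name, the helper `weight_memory_gib` is hypothetical, and activations and other runtime buffers are ignored.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights only, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 1024**3


# Assumption: nominal parameter count taken from the "opt-30b" checkpoint name.
NUM_PARAMS = 30e9

fp32_gib = weight_memory_gib(NUM_PARAMS, 4)  # float32 uses 4 bytes per parameter
fp16_gib = weight_memory_gib(NUM_PARAMS, 2)  # float16 uses 2 bytes per parameter

print(f"float32 weights: {fp32_gib:.1f} GiB")
print(f"float16 weights: {fp16_gib:.1f} GiB")
```

Under these assumptions, float16 halves the footprint to roughly 56 GiB, which is what makes it plausible to hold the weights on a single 80GB accelerator at all.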
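As mentioned in Meta AI's model card, the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, so the model is strongly biased: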
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
>>> import torch
>>> model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b", torch_dtype=torch.float16).cuda()
>>> # the fast tokenizer currently does not work correctly
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)
>>> prompt = "The woman worked as a"
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
>>> set_seed(32)
>>> generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=10)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
The woman worked as a supervisor in the office
The woman worked as a social worker in a
The woman worked as a cashier at the
The woman worked as a teacher from 2011 to
The woman worked as a maid at the house
Compare this with the same prompt for a man:
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
>>> import torch
>>> model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b", torch_dtype=torch.float16).cuda()
>>> # the fast tokenizer currently does not work correctly
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)
>>> prompt = "The man worked as a"
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
>>> set_seed(32)
>>> generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=5, max_length=10)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
The man worked as a school bus driver for
The man worked as a bartender in a bar
The man worked as a cashier at the
The man worked as a teacher, and was
The man worked as a professional at a range
The Meta AI team wanted to train this model on a corpus as large as possible. The corpus is the union of five filtered datasets of textual documents.
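Because the dataset combines a subset of the public Common Crawl data with a subset of public Reddit data, it may contain offensive content: sentences that, if viewed directly, can be insulting, threatening, or otherwise anxiety-inducing.
The dataset was collected from the internet and went through classic data processing algorithms and re-formatting practices, including the removal of repetitive or non-informative text such as "Chapter One" or "This ebook by Project Gutenberg."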
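The kind of cleanup described above can be sketched in a few lines. This is only an illustration of the general idea, not the pipeline Meta AI actually used: the function `clean_lines`, the two regular expressions, and the sample lines are all assumptions invented for this example.

```python
import re

# Hypothetical patterns for non-informative lines, modeled on the two
# examples mentioned in the text ("Chapter One", Project Gutenberg notices).
BOILERPLATE_PATTERNS = [
    re.compile(r"^chapter\s+\w+$", re.IGNORECASE),
    re.compile(r"ebook by project gutenberg", re.IGNORECASE),
]


def clean_lines(lines):
    """Drop empty, boilerplate, and exactly repeated lines, keeping order."""
    seen = set()
    kept = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if any(pattern.search(text) for pattern in BOILERPLATE_PATTERNS):
            continue
        key = text.lower()  # case-insensitive exact-duplicate check
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


sample = [
    "Chapter One",
    "It was a dark night.",
    "It was a dark night.",
    "This ebook by Project Gutenberg.",
    "The end.",
]
cleaned = clean_lines(sample)
print(cleaned)
```

Real corpus pipelines typically use fuzzier deduplication than the exact match shown here; the sketch only makes the two named filtering steps concrete.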
The 175B model was trained on 992 80GB A100 GPUs. Training took roughly 33 days of continuous training.
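The two figures above imply a total compute budget that is easy to work out. The calculation below is simple arithmetic on the stated numbers; because the duration is approximate, the results are rough estimates, and the variable names are chosen for this example.

```python
NUM_GPUS = 992        # 80GB A100 GPUs, as stated above
TRAINING_DAYS = 33    # approximate duration, as stated above
GPU_MEMORY_GB = 80

gpu_days = NUM_GPUS * TRAINING_DAYS
gpu_hours = gpu_days * 24
total_gpu_memory_gb = NUM_GPUS * GPU_MEMORY_GB

print(f"GPU-days:   {gpu_days}")             # 32736
print(f"GPU-hours:  {gpu_hours}")            # 785664
print(f"GPU memory: {total_gpu_memory_gb} GB")  # 79360 GB
```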
@misc{zhang2022opt,
      title={OPT: Open Pre-trained Transformer Language Models},
      author={Susan Zhang and Stephen Roller and Naman Goyal and Mikel Artetxe and Moya Chen and Shuohui Chen and Christopher Dewan and Mona Diab and Xian Li and Xi Victoria Lin and Todor Mihaylov and Myle Ott and Sam Shleifer and Kurt Shuster and Daniel Simig and Punit Singh Koura and Anjali Sridhar and Tianlu Wang and Luke Zettlemoyer},
      year={2022},
      eprint={2205.01068},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}