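Model:
akoksal/LongForm-OPT-2.7B
The LongForm dataset is created by leveraging English corpus examples with augmented instructions. We select a diverse set of human-written documents from existing corpora such as C4 and Wikipedia and generate instructions for these documents via LLMs. We then extend these examples with structured corpus examples such as Stack Exchange and WikiHow, and with task examples such as question answering, email writing, grammar error correction, story/poem generation, and text summarization.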
GitHub repo: https://github.com/akoksal/LongForm
LongForm-T5-XL: https://huggingface.co/akoksal/LongForm-T5-XL
LongForm-OPT-6.7B: https://huggingface.co/akoksal/LongForm-OPT-6.7B
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("akoksal/LongForm-OPT-2.7B")
tokenizer = AutoTokenizer.from_pretrained("akoksal/LongForm-OPT-2.7B")

# [EOI] marks the end of the instruction.
instruction = "Write an essay about meditation. [EOI]"
torch.manual_seed(42)  # fix the seed so that sampling is repeatable
input_ids = tokenizer(instruction, return_tensors="pt").input_ids
# Nucleus sampling (top_p=0.9), generating at most 50 new tokens.
target_ids = model.generate(input_ids, do_sample=True, max_new_tokens=50, top_p=0.9)
print(tokenizer.decode(target_ids[0], skip_special_tokens=True))
# Output:
# > Write an essay about meditation. [EOI]Do you need some inspiration to\
# meditate? Do you know someone who is a great meditator but you aren't sure\
# what to say to them? This might be the perfect opportunity to tell them.\
# The ability to listen and learn and grow can
```
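The decoded text in the example still contains the instruction and the `[EOI]` marker. A minimal sketch of two helpers for working with this format follows. `build_prompt` and `extract_generation` are hypothetical names introduced here for illustration, not part of the LongForm repository, and the sketch assumes the marker appears verbatim in the decoded text, as it does in the sample output.

```python
EOI = "[EOI]"  # end-of-instruction marker used in the example above


def build_prompt(instruction: str) -> str:
    """Append the [EOI] marker unless the instruction already ends with it."""
    instruction = instruction.strip()
    if instruction.endswith(EOI):
        return instruction
    return f"{instruction} {EOI}"


def extract_generation(decoded: str) -> str:
    """Return the text after the first [EOI] marker.

    If the marker is absent, the decoded text is returned unchanged.
    """
    _, marker, generated = decoded.partition(EOI)
    return generated.strip() if marker else decoded


prompt = build_prompt("Write an essay about meditation.")
print(prompt)

# Stand-in for tokenizer.decode(...): the prompt followed by generated text.
decoded = prompt + "Do you need some inspiration to meditate?"
print(extract_generation(decoded))
```

Stripping the prompt on the string level keeps the helper independent of the tokenizer; slicing `target_ids` by the prompt length before decoding would be an alternative.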
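We provide an in-depth evaluation of LongForm models and baselines in the paper. Here we report the METEOR scores of the models on out-of-domain datasets. Across all tasks, namely recipe generation (RGen), long-form question answering (ELI5), and short story generation (WritingPrompts/WP), LongForm models outperform prior instruction-tuned models.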
Model | All | Recipe Generation | ELI5 | Writing Prompts |
---|---|---|---|---|
T0++ | 10.9 | 18.7 | 3.8 | 10.2 |
Tk-Instruct | 6.3 | 12.9* | 3.6 | 2.4 |
Flan-T5 | 10.6 | 20.9* | 3.5 | 7.4 |
Alpaca-LLaMA-7B | 14.6 | 19.5 | 12.5 | 11.8 |
OPT-30B | 11.1 | 18.6 | 12.2 | 2.6 |
LongForm-T5-XL | 16.3 | 20.2 | 18.3 | 10.6 |
LongForm-OPT-2.7B | 17.8 | 15.5 | 17.9 | 19.9 |
LongForm-OPT-6.7B | 17.7 | 16.9 | 17.2 | 19.0 |
LongForm-LLaMA-7B ‡ | 19.7 | 21.7 | 18.6 | 18.9 |
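
For readers unfamiliar with the metric, the sketch below computes the recall-weighted F-mean that sits at the core of METEOR, in its original formulation `Fmean = 10PR / (R + 9P)`. It is a simplified illustration under stated assumptions: it uses exact, lowercased whitespace tokens only, and it omits the stemming, synonym matching, and fragmentation penalty of full METEOR. The function name `unigram_fmean` is hypothetical, and which METEOR implementation produced the scores in the table is not stated here.

```python
from collections import Counter


def unigram_fmean(hypothesis: str, reference: str) -> float:
    """Recall-weighted harmonic mean of unigram precision and recall.

    Simplification: exact-match tokens only, no fragmentation penalty.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Each hypothesis token can match at most one reference token,
    # so the number of matches is the size of the multiset intersection.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    return 10 * precision * recall / (recall + 9 * precision)


score = unigram_fmean("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))
```

Because recall is weighted nine times as heavily as precision, a short output that covers little of the reference scores low even if every word in it is correct, which makes the metric sensitive to output length in long-form generation.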
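Smaller versions of the LongForm-OPT models are also available.
‡: Due to the restrictions of the LLaMA models, we can only publicly release the difference between LongForm-LLaMA-7B and the pretrained LLaMA-7B.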
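To illustrate what releasing a weight difference means, the sketch below computes a per-parameter delta and applies it back to the base weights, i.e. `recovered = base + diff`. This is a toy illustration only: the parameter names and values are made up, plain Python lists stand in for tensors, and `make_diff` and `apply_diff` are hypothetical helpers. The actual file format and recovery procedure for LongForm-LLaMA-7B are not described in this card.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def make_diff(finetuned: Weights, base: Weights) -> Weights:
    """Element-wise difference finetuned - base for every parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in finetuned
    }


def apply_diff(base: Weights, diff: Weights) -> Weights:
    """Recover fine-tuned weights as base + diff."""
    if base.keys() != diff.keys():
        raise ValueError("base and diff must contain the same parameters")
    recovered = {}
    for name, values in base.items():
        if len(values) != len(diff[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        recovered[name] = [b + d for b, d in zip(values, diff[name])]
    return recovered


# Made-up values, chosen to be exactly representable as floats.
base = {"layer.weight": [0.5, -1.0], "layer.bias": [0.0]}
finetuned = {"layer.weight": [0.75, -1.25], "layer.bias": [0.5]}

diff = make_diff(finetuned, base)
print(diff)
print(apply_diff(base, diff) == finetuned)
```

The diff on its own is not a usable model; anyone recovering the fine-tuned weights this way must already have the base weights.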
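The LongForm dataset and models focus mainly on long text generation and have limitations on structured prediction tasks in NLP. In addition, we observe that LongForm models may exhibit hallucination problems similar to those found in LLMs.
The LongForm project is subject to the MIT License, with additional custom restrictions imposed by OpenAI (for the instruction generation part) and by the licenses of the language models (OPT, LLaMA, and T5).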
```
@misc{koksal2023longform,
  title={LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction},
  author={Abdullatif Köksal and Timo Schick and Anna Korhonen and Hinrich Schütze},
  year={2023},
  eprint={2304.08460},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```