Dataset:

MBZUAI/LaMini-instruction

Tasks:

Text-to-Text Generation

Languages:

en

Size:

1M<n<10M

ArXiv:

arxiv:2304.14402

Dataset Card for "LaMini-Instruction"

Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, Alham Fikri Aji

Dataset Description

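We distill knowledge from large language models by performing sentence/offline distillation (Kim and Rush, 2016). We generate a total of 2.58M instruction-response pairs using gpt-3.5-turbo, based on several existing prompt resources, including self-instruct (Wang et al., 2022), P3 (Sanh et al., 2022), FLAN (Longpre et al., 2023), and Alpaca (Taori et al., 2023). For more information about how we generated our instruction dataset, please refer to our paper.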

We use LaMini-Instruction to fine-tune various small language models, resulting in the LaMini-LM model series.

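Each sample contains the following three fields:

  • instruction : string , describes the task that the model is expected to perform.
  • response : string , the response to the instruction, generated by gpt-3.5-turbo.
  • instruction_source : string , the source of the instruction. Possible values are {'alpaca', 'generated_flan', 'generated_p3', 'original_flan', 'original_p3', 'self_instruct_with_topic', 'self_instruct_without_topic'}.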
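The field list above can be checked mechanically. The following is a minimal sketch, not part of the dataset's tooling: the helper name `validate_sample` and the wording of its messages are assumptions made for illustration, while the field names and the seven allowed sources come from the list above.

```python
from typing import Any, Dict, List

# The seven instruction_source values listed in this card.
ALLOWED_SOURCES = frozenset({
    "alpaca",
    "generated_flan",
    "generated_p3",
    "original_flan",
    "original_p3",
    "self_instruct_with_topic",
    "self_instruct_without_topic",
})

# The three fields every sample is described as having.
REQUIRED_FIELDS = ("instruction", "response", "instruction_source")


def validate_sample(sample: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in sample:
            problems.append(f"missing field: {field}")
        elif not isinstance(sample[field], str):
            problems.append(f"field {field} is not a string")
    # Only check the source value when it is present and is a string.
    source = sample.get("instruction_source")
    if isinstance(source, str) and source not in ALLOWED_SOURCES:
        problems.append(f"unknown instruction_source: {source}")
    return problems
```

A sample that passes returns an empty list, so the helper can be used as a filter predicate, for example `[s for s in samples if not validate_sample(s)]`.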

Two examples from our LaMini-Instruction dataset:

{
    "instruction": "What is the capital of Australia?\n\nIs Canberra the only city chosen to serve as a capital due to a compromise between Sydney and Melbourne?", 
    "response": "The capital of Australia is Canberra. Yes, Canberra was chosen as the capital of Australia in 1908 due to a compromise between Sydney and Melbourne, as neither city wanted the other to become the capital.",
    "instruction_source": "generated_flan"
}
{
    "instruction": "Create a poem that has the word \"ocean\" in every other line.", 
    "response": "The ocean's depth is a mystery,\nIts waves crashing with energy.\nSalt air fills the coastal scene,\nThe ocean's blue hues pure and clean.\n\nSeagulls cry out a melancholy song,\nThe ocean's rhythm steady and strong.\nSunsets paint the water gold,\nThe ocean's secrets forever untold.",
    "instruction_source": "self_instruct_without_topic"
}
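The two examples above are printed as back-to-back JSON objects rather than as a single JSON array, so `json.loads` would reject the text as a whole. The sketch below shows one way to read such a stream with the standard library and tally records by `instruction_source`. The helper name `iter_json_objects` and the shortened records in `SAMPLE_TEXT` are illustrative assumptions, not actual dataset rows.

```python
import json
from collections import Counter


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Shortened, made-up records in the same shape as the examples above.
SAMPLE_TEXT = """
{
    "instruction": "What is the capital of Australia?",
    "response": "The capital of Australia is Canberra.",
    "instruction_source": "generated_flan"
}
{
    "instruction": "Create a poem about the ocean.",
    "response": "The ocean's depth is a mystery.",
    "instruction_source": "self_instruct_without_topic"
}
"""

records = list(iter_json_objects(SAMPLE_TEXT))
counts = Counter(record["instruction_source"] for record in records)
```

Here `counts` maps each source label to the number of records carrying it, which gives a quick view of how the instruction sources are distributed in a sample.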

Considerations for Using the Data

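Because the outputs are distilled from ChatGPT, this data contains the errors and biases that ChatGPT produces. Models trained on this dataset will inherit those errors and biases. We encourage users to use this data with caution and to propose new methods for filtering or improving these imperfections.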

Licensing Information

The dataset is available under the Creative Commons NonCommercial (CC BY-NC 4.0) license.

Citation Information

Please cite our paper if you use our data or models.

@article{lamini-lm,
  author       = {Minghao Wu and
                  Abdul Waheed and
                  Chiyu Zhang and
                  Muhammad Abdul-Mageed and
                  Alham Fikri Aji
                  },
  title        = {LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions},
  journal      = {CoRR},
  volume       = {abs/2304.14402},
  year         = {2023},
  url          = {https://arxiv.org/abs/2304.14402},
  eprinttype   = {arXiv},
  eprint       = {2304.14402}
}