Model:

databricks/dolly-v2-12b

Task:

Text Generation

Libraries:

PyTorch Transformers

Datasets:

databricks/databricks-dolly-15k

Languages:

Other:

gpt_neox text-generation-inference

License:

mit


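dolly-v2-12b Model Card

Summary

Databricks' dolly-v2-12b is an instruction-following large language model trained on the Databricks machine learning platform and licensed for commercial use. Based on pythia-12b, it is trained on databricks-dolly-15k, a set of roughly 15,000 instruction/response fine-tuning records that Databricks employees wrote in the capability domains from the InstructGPT paper: brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. dolly-v2-12b is not a state-of-the-art model, but it does exhibit surprisingly high-quality instruction-following behavior that is not characteristic of the foundation model it is based on.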

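Dolly v2 is also available in these smaller model sizes:

dolly-v2-7b, a 6.9 billion parameter model based on pythia-6.9b
dolly-v2-3b, a 2.8 billion parameter model based on pythia-2.8b

For tips on running inference on various GPU configurations, see the dolly GitHub repo.

Owner: Databricks, Inc.

Model Overview

dolly-v2-12b is a 12 billion parameter causal language model created by Databricks. It is derived from EleutherAI's Pythia-12b and fine-tuned on a ~15K record instruction corpus that was generated by Databricks employees and released under the CC-BY-SA license.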

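Usage

To use the model with the transformers library on a machine with GPUs, first make sure that the transformers and accelerate libraries are installed. In a Databricks notebook you could run: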

%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"

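The instruction-following pipeline can be loaded with the pipeline function, as shown below. This loads a custom InstructionTextGenerationPipeline found in the model repo, which is why trust_remote_code=True is required. Including torch_dtype=torch.bfloat16 is generally recommended where that type is supported, because it reduces memory usage and does not appear to affect output quality. It can be removed if there is enough memory.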

import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

You can then use the pipeline to answer instructions:

res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
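
As a rough illustration of why torch_dtype=torch.bfloat16 reduces memory usage, the sketch below estimates the storage needed for the weights alone at 4 bytes per parameter (float32) and 2 bytes per parameter (bfloat16). It assumes the nominal 12 billion parameter figure from this card rather than an exact count, and it ignores activations, the attention cache, and framework overhead, so real usage will be higher.

```python
# Weight-only memory estimate. Assumption: the nominal "12 billion"
# parameter count from the model card, not an exact figure.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 12e9  # nominal size of dolly-v2-12b

for dtype in BYTES_PER_PARAM:
    # Prints roughly 48 GB for float32 and 24 GB for bfloat16.
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is often the difference between fitting the model on the available GPUs and not fitting it.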

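Alternatively, if you prefer not to use trust_remote_code=True, you can download instruct_pipeline.py, store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer: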

import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

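LangChain Usage

To use the pipeline with LangChain, you must set return_full_text=True, because LangChain expects the full text to be returned, whereas the pipeline returns only the new text by default.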

import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto", return_full_text=True)

You can create a prompt that contains only an instruction, or one that contains an instruction with context:

from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# template for an instruction with no input
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="{instruction}")

# template for an instruction with input
prompt_with_context = PromptTemplate(
    input_variables=["instruction", "context"],
    template="{instruction}\n\nInput:\n{context}")

hf_pipeline = HuggingFacePipeline(pipeline=generate_text)

llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)

Example of predicting with a simple instruction:

print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())

Example of predicting with an instruction and context:

context = """George Washington (February 22, 1732[b] - December 14, 1799) was an American military officer, statesman,
and Founding Father who served as the first president of the United States from 1789 to 1797."""

print(llm_context_chain.predict(instruction="When was George Washington president?", context=context).lstrip())
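
To inspect the prompt text that the two templates above produce without loading the model, plain str.format can stand in for PromptTemplate. This is a sketch under the assumption that the templates use ordinary brace placeholders, as they do in the code above; the render helper is hypothetical and not part of LangChain.

```python
# Same template strings as in the LangChain example above.
INSTRUCTION_ONLY = "{instruction}"
INSTRUCTION_WITH_CONTEXT = "{instruction}\n\nInput:\n{context}"


def render(template: str, **values: str) -> str:
    """Hypothetical helper: fill brace placeholders with str.format."""
    return template.format(**values)


prompt_text = render(
    INSTRUCTION_WITH_CONTEXT,
    instruction="When was George Washington president?",
    context=(
        "George Washington served as the first president "
        "of the United States from 1789 to 1797."
    ),
)
print(prompt_text)
```

The rendered string puts the instruction first, then a blank line, then an "Input:" label followed by the context passage.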

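Known Limitations

Performance Limitations

dolly-v2-12b is not a state-of-the-art generative language model and, although quantitative benchmarking is ongoing, it is not designed to compete with more modern model architectures or with models trained on larger pretraining corpora.

The Dolly model family is under active development, so any list of shortcomings is unlikely to be exhaustive. We nonetheless list known limitations and misfires here to document our preliminary findings and share them with the community. In particular, dolly-v2-12b struggles with syntactically complex prompts, programming problems, mathematical operations, factual errors, dates and times, open-ended question answering, hallucination, enumerating lists of a specific length, stylistic mimicry, and having a sense of humor, among others. Moreover, dolly-v2-12b lacks some capabilities that are present in the base model, such as well-formatted letter writing.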

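Dataset Limitations

Like all language models, dolly-v2-12b reflects the content and limitations of its training corpora.

The Pile: GPT-J's pre-training corpus contains content mostly collected from the public internet and, like most web-scale datasets, it contains content that many users would find objectionable. As a result, the model is likely to reflect these shortcomings, sometimes overtly, as when it is explicitly asked to produce objectionable content, and sometimes subtly, as in biased or harmful implicit associations.
databricks-dolly-15k: The data on which dolly-v2-12b is instruction-tuned was generated by Databricks employees and includes passages from Wikipedia as reference texts for instruction categories such as closed QA and summarization. To our knowledge it does not contain obscenity, intellectual property, or personally identifying information about non-public figures, but it may contain typos and factual errors. The dataset may also reflect biases found in Wikipedia. Finally, the dataset likely reflects the interests and semantic choices of Databricks employees, a demographic that is not representative of the global population at large.

Databricks is committed to ongoing research and development to build helpful, honest, and harmless AI technologies that maximize the potential of all individuals and organizations.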

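Benchmark Metrics

Below is the benchmark performance of various models on the EleutherAI LLM Evaluation Harness; results are sorted by geometric mean to produce an intelligible ordering. As noted above, these results show that dolly-v2-12b is not state of the art and in fact underperforms dolly-v1-6b on some evaluation benchmarks. We believe this is due to the composition and size of the underlying fine-tuning datasets, but a robust statement about the sources of these differences requires further study.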

model	openbookqa	arc_easy	winogrande	hellaswag	arc_challenge	piqa	boolq	gmean
EleutherAI/pythia-2.8b	0.348	0.585859	0.589582	0.591217	0.323379	0.73395	0.638226	0.523431
EleutherAI/pythia-6.9b	0.368	0.604798	0.608524	0.631548	0.343857	0.761153	0.6263	0.543567
databricks/dolly-v2-3b	0.384	0.611532	0.589582	0.650767	0.370307	0.742655	0.575535	0.544886
EleutherAI/pythia-12b	0.364	0.627104	0.636148	0.668094	0.346416	0.760065	0.673394	0.559676
EleutherAI/gpt-j-6B	0.382	0.621633	0.651144	0.662617	0.363481	0.761153	0.655963	0.565936
databricks/dolly-v2-12b	0.408	0.63931	0.616417	0.707927	0.388225	0.757889	0.568196	0.56781
databricks/dolly-v2-7b	0.392	0.633838	0.607735	0.686517	0.406997	0.750816	0.644037	0.573487
databricks/dolly-v1-6b	0.41	0.62963	0.643252	0.676758	0.384812	0.773667	0.687768	0.583431
EleutherAI/gpt-neox-20b	0.402	0.683923	0.656669	0.7142	0.408703	0.784004	0.695413	0.602236
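
The gmean column can be recomputed from the seven per-task scores. The sketch below does this for two rows copied from the table, using statistics.geometric_mean from the Python standard library, and also lists the tasks on which dolly-v2-12b scores below dolly-v1-6b. It assumes gmean is the plain, unweighted geometric mean of the seven task columns.

```python
from statistics import geometric_mean

# Task columns in table order.
TASKS = ["openbookqa", "arc_easy", "winogrande", "hellaswag",
         "arc_challenge", "piqa", "boolq"]

# Per-task scores copied from the table above.
SCORES = {
    "databricks/dolly-v2-12b": [0.408, 0.63931, 0.616417, 0.707927,
                                0.388225, 0.757889, 0.568196],
    "databricks/dolly-v1-6b": [0.41, 0.62963, 0.643252, 0.676758,
                               0.384812, 0.773667, 0.687768],
}

# Unweighted geometric mean across the seven tasks for each model.
gmeans = {model: geometric_mean(scores) for model, scores in SCORES.items()}

# Tasks where dolly-v2-12b scores lower than dolly-v1-6b.
behind = [
    task
    for task, v2, v1 in zip(
        TASKS,
        SCORES["databricks/dolly-v2-12b"],
        SCORES["databricks/dolly-v1-6b"],
    )
    if v2 < v1
]

for model, value in gmeans.items():
    print(f"{model}: gmean = {value:.6f}")
print("dolly-v2-12b trails dolly-v1-6b on:", ", ".join(behind))
```

The recomputed values should agree with the table's gmean column to within rounding, and the comparison shows which individual benchmarks account for the lower overall ranking of dolly-v2-12b.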

Citation

@online{DatabricksBlog2023DollyV2,
    author    = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
    title     = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
    year      = {2023},
    url       = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
    urldate   = {2023-06-30}
}

Happy Hacking!

Author:

Databricks

Dataset size:

22.2 GB