Model:

TheBloke/falcon-40b-instruct-3bit-GPTQ

Task:

Text generation

Library:

Transformers

Datasets:

tiiuae/falcon-refinedweb

Other:

RefinedWeb custom_code text-generation-inference

Preprints:

arxiv:2205.14135 arxiv:1911.02150 arxiv:2005.14165 arxiv:2104.09864

License:

apache-2.0

Chat & support: my new Discord server

Want to contribute? TheBloke's Patreon page

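Falcon-40B-Instruct 3bit GPTQ

This repo contains an experimental GPTQ 3bit model for Falcon-40B-Instruct.

It is the result of quantising to 3bit using AutoGPTQ.

Repositories available

Experimental

Please note this is an experimental GPTQ model. Support for it is currently quite limited.

It is also expected to be very slow. This is currently unavoidable, but is being looked at.

This is a 3bit model, with the aim of being loadable in 24GB of VRAM. In my testing so far it has not exceeded 24GB VRAM when returning up to 512 tokens. Beyond that it may exceed 24GB and run into problems.

Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.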
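
To put the 3bit, 24GB and 0.7 tokens/s figures in perspective, here is a rough back-of-the-envelope sketch. It assumes a nominal 40e9 parameters, all stored at the stated bit width, which is a simplification: real checkpoints also hold unquantised tensors and quantisation metadata, and inference needs extra VRAM for activations and the attention cache.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def generation_seconds(n_tokens, tokens_per_second):
    """Time needed to generate n_tokens at a steady rate."""
    return n_tokens / tokens_per_second


N_PARAMS = 40e9  # assumption: nominal "40B" parameter count

print(f"3-bit weights:  ~{weight_memory_gb(N_PARAMS, 3):.1f} GB")
print(f"16-bit weights: ~{weight_memory_gb(N_PARAMS, 16):.1f} GB")
print(f"512 tokens at 0.7 tokens/s: ~{generation_seconds(512, 0.7) / 60:.1f} minutes")
```

Under these assumptions the 3bit weights alone fit on a 24GB card with some headroom, while a 512-token reply takes on the order of ten minutes or more.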

AutoGPTQ

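AutoGPTQ must be installed: pip install auto-gptq

AutoGPTQ provides pre-compiled wheels for Windows and Linux, built for CUDA toolkit 11.7 or 11.8.

If you are running CUDA toolkit 12.x, you will need to compile it yourself by following these instructions:

git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .

These manual steps require that you have the Nvidia CUDA toolkit installed.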
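
As a small sketch of the decision above, the helper below reports whether a given CUDA version falls outside the 11.7 / 11.8 wheels mentioned in this section. The function name is hypothetical. The optional check at the bottom reads `torch.version.cuda`, which is the CUDA version PyTorch was built against; treating that as a stand-in for the installed toolkit is an assumption.

```python
def needs_source_build(cuda_version):
    """Return True if no pre-compiled wheel is listed for this CUDA version.

    Only CUDA 11.7 and 11.8 are treated as covered, following the text above.
    """
    major_minor = tuple(int(part) for part in cuda_version.split(".")[:2])
    return major_minor not in {(11, 7), (11, 8)}


if __name__ == "__main__":
    try:
        import torch  # optional: only used to detect the CUDA version
        detected = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        detected = None

    if detected is None:
        print("No CUDA-enabled PyTorch found; check your toolkit version manually.")
    elif needs_source_build(detected):
        print(f"CUDA {detected}: compile AutoGPTQ from source.")
    else:
        print(f"CUDA {detected}: pip install auto-gptq should pick up a wheel.")
```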

text-generation-webui

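There is provisional AutoGPTQ support in text-generation-webui.

This requires text-generation-webui as of commit 204731952ae59d79ea3805a425c73dd171d943c3.

So please first update text-generation-webui to the latest version.

How to download and use this model in text-generation-webui

1. Launch text-generation-webui.
2. Click the Model tab.
3. Untick Autoload model.
4. Under Download custom model or LoRA, enter TheBloke/falcon-40b-instruct-3bit-GPTQ.
5. Click Download.
6. Wait for it to finish downloading.
7. Click the refresh icon next to Model in the top left.
8. In the Model drop-down, choose the model you just downloaded, falcon-40b-instruct-3bit-GPTQ.
9. Make sure Loader is set to AutoGPTQ. This model will not work with ExLlama or GPTQ-for-LLaMa.
10. Tick Trust Remote Code, then save the settings.
11. Click Reload.
12. Once it has loaded, click the Text Generation tab and enter a prompt!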

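About trust_remote_code

Please be aware that this command line argument causes Python code provided by Falcon to be executed on your machine.

This code is required at the moment because Falcon is not yet supported natively by Hugging Face transformers. At some point in the future transformers will support the model natively, and then trust_remote_code will no longer be needed.

In this repo you can see two .py files. These are the files that get executed. They are copied from the base repo at Falcon-7B-Instruct.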
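
Since those .py files run on your machine, it can be worth reviewing them before loading the model. The sketch below simply lists the top-level .py files in a local copy of the repo; the helper name and the example path are placeholders, not part of any library.

```python
from pathlib import Path


def list_remote_code_files(model_dir):
    """Return the sorted names of the top-level .py files in model_dir."""
    return sorted(p.name for p in Path(model_dir).glob("*.py"))


if __name__ == "__main__":
    # Placeholder path: point this at your local download of the repo.
    model_dir = Path("/path/to/falcon40b-instruct-3bit-gptq")
    if model_dir.is_dir():
        for name in list_remote_code_files(model_dir):
            print(f"Will be executed with trust_remote_code: {name}")
    else:
        print(f"{model_dir} does not exist; download the model first.")
```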

Simple Python example code

To run this code you need to install AutoGPTQ and einops:

pip install auto-gptq
pip install einops

You can then run this example code:

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Download the model from HF and store it locally, then reference its location here:
quantized_model_dir = "/path/to/falcon40b-instruct-3bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)

# trust_remote_code=True lets the Falcon modelling code shipped in the repo run
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False, use_safetensors=True, torch_dtype=torch.float32, trust_remote_code=True)

prompt = "Write a story about llamas"
prompt_template = f"### Instruction: {prompt}\n### Response:"

# Tokenise the prompt and move the input IDs to the GPU
tokens = tokenizer(prompt_template, return_tensors="pt").to("cuda:0").input_ids
output = model.generate(input_ids=tokens, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0]))
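
The example above expects the model to be on disk already. One possible way to fetch it is `snapshot_download` from the `huggingface_hub` package, sketched below. The helper names and the `models/<repo name>` layout are choices made for this sketch, and nothing is downloaded until `download_model()` is actually called.

```python
from pathlib import Path

REPO_ID = "TheBloke/falcon-40b-instruct-3bit-GPTQ"


def local_dir_for(repo_id, root):
    """Map 'owner/name' to '<root>/name' (a layout chosen for this sketch)."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id=REPO_ID, root="models"):
    """Download every file in the repo and return the local directory."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return str(target)


print(local_dir_for(REPO_ID, "models"))
```

The returned path can then be used as `quantized_model_dir` in the example code.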

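Provided files

gptq_model-3bit--1g.safetensors

This will work with AutoGPTQ as of commit 3cb1bf5 (3cb1bf5a6d43a06dc34c6442287965d1838303d3).

It was created without groupsize to reduce VRAM requirements, and with desc_act (act-order) to improve inference quality.

gptq_model-3bit--1g.safetensors
- Works only with the latest AutoGPTQ CUDA, compiled from source as of commit 3cb1bf5
  - At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in future.
- Works with text-generation-webui using --autogptq --trust_remote_code
  - At this time it does NOT work with the one-click installers
- Does not work with any version of GPTQ-for-LLaMa
- Parameters: Groupsize = None. Act order (desc_act)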
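
For readers who start text-generation-webui from a script, the flags listed above could be assembled as follows. This is only a sketch: the `server.py` entry point, the `--model` flag, and the `TheBloke_...` folder name are assumptions about a typical text-generation-webui checkout and are not stated in this card, so adjust them to your installation.

```python
import shlex


def build_webui_command(model_name):
    """Build an argv list that launches the web UI with the flags from this card."""
    return [
        "python", "server.py",      # assumption: run from the webui checkout
        "--autogptq",               # from this card: use the AutoGPTQ loader
        "--trust_remote_code",      # from this card: allow the repo's .py files to run
        "--model", model_name,      # assumption: folder name under models/
    ]


cmd = build_webui_command("TheBloke_falcon-40b-instruct-3bit-GPTQ")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd)` from inside the text-generation-webui directory.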

Discord

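For further support, and discussions on these models and AI in general, join us at:

TheBloke AI's Discord server

Thanks, and how to contribute

Thanks to the chirper.ai team!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Patreon special mentions: Aemon Algiz, Dmitriy Samsonov, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, Jonathan Leane, Talal Aujan, V. Lukas, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Sebastain Graf, Johann-Peter Hartman.

Thank you to all my generous patrons and donaters!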

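✨ Original model card: Falcon-40B-Instruct

✨ Falcon-40B-Instruct

Falcon-40B-Instruct is a model built by TII, based on Falcon-40B and finetuned on a mixture of Baize data. It is made available under the TII Falcon LLM License.

Paper coming soon.

Why use Falcon-40B-Instruct?

- You are looking for a ready-to-use chat/instruct model based on Falcon-40B.
- Falcon-40B is the best open-source model available at the time of writing. It outperforms LLaMA, StableLM, RedPajama, MPT, etc. See the OpenLLM Leaderboard.
- It features an architecture optimized for inference, with FlashAttention (Dao et al., 2022) and multiquery (Shazeer et al., 2019).

This is an instruct model, which may not be ideal for further finetuning. If you are interested in building your own instruct/chat model, we recommend starting from Falcon-40B.

Looking for a smaller, less expensive model? Falcon-7B-Instruct is Falcon-40B-Instruct's little brother!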

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

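Model Card for Falcon-40B-Instruct

Model Details

Model Description

Developed by: https://www.tii.ae;
Model type: Causal decoder-only;
Language(s) (NLP): English and French;
License: TII Falcon LLM License;
Finetuned from model: Falcon-40B.

Model Source

Paper: coming soon.

Uses

Direct Use

Falcon-40B-Instruct has been finetuned on a chat dataset.

Out-of-Scope Use

Production use without adequate assessment of risks and mitigation; any use case which may be considered irresponsible or harmful.

Bias, Risks, and Limitations

Falcon-40B-Instruct is mostly trained on English data, and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.

Recommendations

We recommend that users of Falcon-40B-Instruct develop guardrails and take appropriate precautions for any production use.

How to Get Started with the Model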

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

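Training Details

Training Data

Falcon-40B-Instruct was finetuned on 150M tokens from Baize, mixed with 5% of RefinedWeb data.

The data was tokenized with the Falcon-7B/40B tokenizer.

Evaluation

Paper coming soon.

See the OpenLLM Leaderboard for early results.

Technical Specifications

For more information about pretraining, see Falcon-40B.

Model Architecture and Objective

Falcon-40B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).

The architecture is broadly adapted from the GPT-3 paper (Brown et al., 2020), with the following differences:

- Positional embeddings: rotary (Su et al., 2021);
- Attention: multiquery (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022);
- Decoder-block: parallel attention/MLP with a single layer norm.

For multiquery, we are using an internal variant which uses independent key and values per tensor parallel degree.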
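
To illustrate the general idea of multiquery attention, here is a minimal NumPy sketch of the plain form described by Shazeer et al.: every query head attends over one shared key/value head, with a causal mask. It is not TII's internal variant (which keeps separate keys and values per tensor-parallel degree), and the function name and shapes are choices made for this example.

```python
import numpy as np


def multiquery_attention(q, k, v):
    """Causal multiquery attention.

    q: (n_heads, seq_len, head_dim), one set of queries per head
    k, v: (seq_len, head_dim), a single key/value head shared by all query heads
    Returns an array of shape (n_heads, seq_len, head_dim).
    """
    seq_len, head_dim = k.shape
    scores = q @ k.T / np.sqrt(head_dim)            # (n_heads, seq_len, seq_len)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)      # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 5, 8))   # 4 query heads, 5 tokens, head_dim 8
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 8))
print(multiquery_attention(q, k, v).shape)
```

Because the keys and values are shared, the cache that must be kept during generation is smaller by a factor of the number of query heads compared with standard multi-head attention.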

Hyperparameter	Value	Comment
Layers	60
d_model	8192
head_dim	64	Reduced to optimise for FlashAttention
Vocabulary	65024
Sequence length	2048
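
As a rough sanity check on the table, the hyperparameters can be turned into a head count and an approximate parameter count. The estimate below rests on assumptions that are not stated in this card: an MLP with 4x expansion, query and output projections of size d_model x d_model, a single embedding matrix, and the small shared key/value projections, biases and layer norms ignored.

```python
N_LAYERS = 60
D_MODEL = 8192
HEAD_DIM = 64
VOCAB = 65024


def n_query_heads(d_model, head_dim):
    """Number of query heads implied by the model width and head size."""
    return d_model // head_dim


def approx_params(n_layers, d_model, vocab, mlp_ratio=4):
    """Rough parameter count under the assumptions listed in the text above."""
    attention = 2 * d_model * d_model          # query + output projections
    mlp = 2 * mlp_ratio * d_model * d_model    # up + down projections
    embeddings = vocab * d_model
    return n_layers * (attention + mlp) + embeddings


heads = n_query_heads(D_MODEL, HEAD_DIM)
total = approx_params(N_LAYERS, D_MODEL, VOCAB)
print(f"query heads: {heads}")
print(f"approx. parameters: {total / 1e9:.1f}B")
```

Under these assumptions the estimate lands close to the nominal 40B in the model name.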

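Compute Infrastructure

Hardware

Falcon-40B-Instruct was trained on AWS SageMaker, on 64 A100 40GB GPUs in P4d instances.

Software

Falcon-40B-Instruct was trained using a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.).

Citation

Paper coming soon.

License

Falcon-40B-Instruct is made available under the TII Falcon LLM License. Broadly speaking,

- You can freely use our models for research and/or personal purposes;
- You are allowed to share and build derivatives of these models, but you are required to give attribution and to share alike under the same license;
- For commercial use, you are exempt from royalty payments if the attributable revenues are below $1M/year; otherwise you should enter into a commercial agreement with TII.

Contact

falconllm@tii.ae

Author:

Tom Jobbins

Dataset size:

16.26 GB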