Model:

TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ

License:

apache-2.0

Other:

text-generation-inference, h2o-llmstudio, large language model, llm, gpt, custom_code, RefinedWeb

Language:

English

Datasets:

OpenAssistant/oasst1

Library:

Transformers

Task:

Text generation

Chat & support: my new Discord server

Want to contribute? TheBloke's Patreon page

H2O's GPT-GM-OASST1-Falcon 40B v2 GPTQ

These files are GPTQ 4bit model files for H2O's GPT-GM-OASST1-Falcon 40B v2.

They are the result of quantising to 4bit using AutoGPTQ.

Repositories available

Prompt template

<|prompt|>prompt<|endoftext|>
<|answer|>
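
As a minimal sketch, the template can be filled in with a small helper. The function name `build_prompt` and its `newline_before_answer` flag are illustrative and not part of any library. The flag exists because the template above puts `<|answer|>` on its own line, while the Python examples in this card join the two tokens directly.

```python
def build_prompt(user_message: str, newline_before_answer: bool = False) -> str:
    """Wrap a user message in the <|prompt|> ... <|answer|> template."""
    # Optionally place <|answer|> on its own line, as in the template listing.
    separator = "\n" if newline_before_answer else ""
    return f"<|prompt|>{user_message}<|endoftext|>{separator}<|answer|>"


print(build_prompt("Tell me about AI"))
```

Which of the two variants the model responds to best is not stated in this card, so it may be worth trying both.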

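EXPERIMENTAL

Please note this is an experimental GPTQ model. Support for it is currently quite limited.

It is also expected to be very slow. This is currently unavoidable, but a fix is being looked into.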

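How to download and use this model in text-generation-webui

Launch text-generation-webui.

Click the "Model" tab.

Untick "Autoload model".

Under "Download custom model or LoRA", enter "TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ".

Click "Download".

Wait until the download has finished.

Click the "Refresh" icon next to "Model" in the top left.

In the "Model" drop-down, choose the model you just downloaded, "TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ".

Make sure "Loader" is set to "AutoGPTQ". This model will not work with ExLlama or GPTQ-for-LLaMa.

Tick "Trust Remote Code", then click "Save Settings".

Click "Reload".

Once it says the model is loaded, click the "Text Generation" tab and enter a prompt!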
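
For scripted setups, the download step can also be done outside the web UI. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function. The helper names `default_local_dir` and `download_model`, and the `models/<user>_<repo>` folder layout, are assumptions for illustration; check where your text-generation-webui installation expects model folders.

```python
from typing import Optional

REPO_ID = "TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ"


def default_local_dir(repo_id: str, models_root: str = "models") -> str:
    """Build a local folder name from the repo id (assumed layout)."""
    return f"{models_root}/{repo_id.replace('/', '_')}"


def download_model(repo_id: str = REPO_ID, local_dir: Optional[str] = None) -> str:
    """Download every file in the repo and return the local path."""
    # Imported lazily so the path helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = local_dir or default_local_dir(repo_id)
    return snapshot_download(repo_id=repo_id, local_dir=target)


# Call download_model() to fetch the files; the download is large.
print(default_local_dir(REPO_ID))
```

The loader and "Trust Remote Code" settings described above still have to be set in the web UI after a manual download.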

How to use this GPTQ model from Python code

First make sure you have AutoGPTQ installed:

pip install auto-gptq

Then try the following example code:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ"
model_basename = "gptq_model-4bit--1g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# Note: check the prompt template is correct for this model.
prompt = "Tell me about AI"
prompt_template=f'''<|prompt|>{prompt}<|endoftext|><|answer|>'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
# do_sample=True is needed for temperature to have any effect.
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
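
Both calls above return the prompt together with the completion. A minimal sketch for keeping only the answer, assuming the output still contains the `<|answer|>` and `<|endoftext|>` markers as plain text (the helper name `extract_answer` is illustrative):

```python
def extract_answer(generated_text: str) -> str:
    """Return only the text after the last <|answer|> marker."""
    # Keep what follows the answer marker (the whole text if it is absent).
    answer = generated_text.split("<|answer|>")[-1]
    # Drop the end-of-text token and anything after it.
    answer = answer.split("<|endoftext|>")[0]
    return answer.strip()


sample = "<|prompt|>Tell me about AI<|endoftext|><|answer|>AI is a field of study.<|endoftext|>"
print(extract_answer(sample))
```

If the tokenizer decodes with `skip_special_tokens=True`, the markers may be stripped already and this helper returns the text unchanged apart from trimming whitespace.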

Provided files

gptq_model-4bit--1g.safetensors

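This file works with AutoGPTQ. As stated in the text-generation-webui instructions, this model will not work with ExLlama or GPTQ-for-LLaMa.

It was created without group_size to lower VRAM requirements, and with --act-order (desc_act) to boost inference accuracy as much as possible.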

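gptq_model-4bit--1g.safetensors
- Works with AutoGPTQ in CUDA or Triton modes.
- LLaMa models also work in [ExLlama](https://github.com/turboderp/exllama), which usually provides much higher performance and uses less VRAM than AutoGPTQ. This is a Falcon model, so ExLlama does not apply here.
- Not expected to work with GPTQ-for-LLaMa, in line with the loader note in the text-generation-webui instructions.
- Works with text-generation-webui, including one-click-installers.
- Parameters: Groupsize = -1. Act Order / desc_act = True.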
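
To illustrate why leaving group_size unset lowers memory use, here is a rough back-of-the-envelope sketch. It assumes each quantisation group stores one 16-bit scale and one 4-bit zero point per output channel; that storage layout is a simplification for illustration and may not match AutoGPTQ's packing exactly. The function name `effective_bits_per_weight` is hypothetical.

```python
def effective_bits_per_weight(bits: int, group_size: int, in_features: int,
                              scale_bits: int = 16) -> float:
    """Average storage per weight for one quantised linear layer.

    Assumes one scale (scale_bits) and one zero point (bits) per group and
    output channel; group_size=-1 means one group spanning all in_features.
    """
    group_len = in_features if group_size == -1 else group_size
    return bits + (scale_bits + bits) / group_len


# Compare no grouping (-1) with a group size of 128 for an 8192-wide layer.
for gs in (-1, 128):
    print(gs, effective_bits_per_weight(bits=4, group_size=gs, in_features=8192))
```

Under these assumptions the per-weight overhead of the scales and zero points is far smaller with a single group per column than with groups of 128, at the cost of coarser quantisation.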

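FAQ

About trust-remote-code

Please be aware that this command line argument causes Python code provided by Falcon to be executed on your machine.

This code is required at the moment because Falcon is too new to be supported by Hugging Face transformers. At some point in the future transformers will support the model natively, and then trust_remote_code will no longer be needed.

In this repo you can see two .py files; these are the files that get executed. They are copied from the base repo at Falcon-40B-Instruct.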

Discord

For further support, and discussions on these models and AI in general, join us at:

TheBloke AI's Discord server

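Thanks, and how to contribute

Thanks to the chirper.ai team!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.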

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Special thanks to: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

Patreon special mentions: Mano Prime, Fen Risland, Derek Yates, Preetika Verma, webtim, Sean Connelly, Alps Aficionado, Karl Bernard, Junyu Yang, Nathan LeClaire, Chris McCloskey, Lone Striker, Asp the Wyvern, Eugene Pentland, Imad Khwaja, trip7s trip, WelcomeToTheClub, John Detwiler, Artur Olbinski, Khalefa Al-Ahmad, Trenton Dambrowitz, Talal Aujan, Kevin Schuppel, Luke Pendergrass, Pyrater, Joseph William Delisle, terasurfer, vamX, Gabriel Puliatti, David Flickinger, Jonathan Leane, Iucharbius, Luke, Deep Realms, Cory Kujawski, ya boyyy, Illia Dulskyi, senxiiz, Johann-Peter Hartmann, John Villwock, K, Ghost, Spiking Neurons AB, Nikolai Manek, Rainer Wilmers, Pierre Kircher, biorpg, Space Cruiser, Ai Maven, subjectnull, Willem Michiel, Ajan Kanaga, Kalila, chris gileta, Oscar Rangel.

Thank you to all my generous patrons and donors!

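Original model card: H2O's GPT-GM-OASST1-Falcon 40B v2

Model Card

Summary

This model was trained using H2O LLM Studio.

Base model: tiiuae/falcon-40b
Dataset preparation: OpenAssistant/oasst1

Usage

To use the model with the transformers library on a machine with GPUs, first make sure you have the transformers, accelerate and torch libraries installed.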

pip install transformers==4.29.2
pip install bitsandbytes==0.39.0
pip install accelerate==0.19.0
pip install torch==2.0.0
pip install einops==0.6.1
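
As a small convenience sketch, the installed versions can be compared against the pins above with the standard-library `importlib.metadata` module. The helper name `check_pins` is illustrative, and the exact-match comparison is an assumption: a build such as `2.0.0+cu117` for torch will be reported as differing even though it corresponds to the pinned release.

```python
from importlib import metadata

# Versions pinned in the install commands above.
PINNED = {
    "transformers": "4.29.2",
    "bitsandbytes": "0.39.0",
    "accelerate": "0.19.0",
    "torch": "2.0.0",
    "einops": "0.6.1",
}


def check_pins(pins: dict) -> dict:
    """Map each package to (wanted, installed); installed is None if missing."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in check_pins(PINNED).items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: wanted {wanted}, installed {installed} ({status})")
```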

import torch
from transformers import pipeline, BitsAndBytesConfig, AutoTokenizer

model_kwargs = {}

quantization_config = None
# optional quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)
model_kwargs["quantization_config"] = quantization_config

tokenizer = AutoTokenizer.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    use_fast=False,
    padding_side="left",
    trust_remote_code=True,
)

generate_text = pipeline(
    model="h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    use_fast=False,
    device_map={"": "cuda:0"},
    model_kwargs=model_kwargs,
)

res = generate_text(
    "Why is drinking water so healthy?",
    min_new_tokens=2,
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)
print(res[0]["generated_text"])

You can print a sample prompt after the preprocessing step to see how it is fed to the tokenizer:

print(generate_text.preprocess("Why is drinking water so healthy?")["prompt_text"])

<|prompt|>Why is drinking water so healthy?<|endoftext|><|answer|>

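Alternatively, you can download h2oai_pipeline.py, store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer: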

import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = None
# optional quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

tokenizer = AutoTokenizer.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    use_fast=False,
    padding_side="left",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map={"": "cuda:0"},
    quantization_config=quantization_config
).eval()
generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text(
    "Why is drinking water so healthy?",
    min_new_tokens=2,
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)
print(res[0]["generated_text"])

You may also skip the pipeline and call the loaded model and tokenizer directly, applying the preprocessing steps yourself:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Important: The prompt needs to be in the same format the model was trained with.
# You can find an example prompt in the experiment logs.
prompt = "<|prompt|>How are you?<|endoftext|><|answer|>"

quantization_config = None
# optional quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

tokenizer = AutoTokenizer.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    use_fast=False,
    padding_side="left",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map={"": "cuda:0"},
    quantization_config=quantization_config
).eval()

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

# generate configuration can be modified to your needs
tokens = model.generate(
    **inputs,
    min_new_tokens=2,
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)[0]

tokens = tokens[inputs["input_ids"].shape[1]:]
answer = tokenizer.decode(tokens, skip_special_tokens=True)
print(answer)

Model Architecture

RWForCausalLM(
  (transformer): RWModel(
    (word_embeddings): Embedding(65024, 8192)
    (h): ModuleList(
      (0-59): 60 x DecoderLayer(
        (ln_attn): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
        (ln_mlp): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
        (self_attention): Attention(
          (maybe_rotary): RotaryEmbedding()
          (query_key_value): Linear(in_features=8192, out_features=9216, bias=False)
          (dense): Linear(in_features=8192, out_features=8192, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): MLP(
          (dense_h_to_4h): Linear(in_features=8192, out_features=32768, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear(in_features=32768, out_features=8192, bias=False)
        )
      )
    )
    (ln_f): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=8192, out_features=65024, bias=False)
)
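
The parameter count can be estimated directly from the shapes printed above. The sketch below is plain arithmetic on those shapes; it counts `lm_head` separately from `word_embeddings`, which is an assumption, since the two may share storage in the actual checkpoint.

```python
HIDDEN = 8192
VOCAB = 65024
LAYERS = 60

# (in_features, out_features) of the bias-free Linear modules in one DecoderLayer.
LINEAR_SHAPES = {
    "query_key_value": (HIDDEN, 9216),
    "dense": (HIDDEN, HIDDEN),
    "dense_h_to_4h": (HIDDEN, 32768),
    "dense_4h_to_h": (32768, HIDDEN),
}

linear_per_layer = sum(i * o for i, o in LINEAR_SHAPES.values())
layernorm_per_layer = 2 * (2 * HIDDEN)  # ln_attn and ln_mlp, weight + bias each
decoder_total = LAYERS * (linear_per_layer + layernorm_per_layer)
embedding = VOCAB * HIDDEN
final_layernorm = 2 * HIDDEN  # ln_f weight + bias
lm_head = HIDDEN * VOCAB  # counted separately; may share storage with the embedding

total = decoder_total + embedding + final_layernorm + lm_head
print(f"linear weights per layer: {linear_per_layer:,}")
print(f"total parameters: {total:,}")
```

Nearly all of the parameters sit in the decoder's Linear modules, which is why 4-bit quantisation of those layers shrinks the model so much.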

Model Configuration

This model was trained using H2O LLM Studio with the configuration in cfg.yaml. Visit H2O LLM Studio to learn how to train your own large language models.

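Disclaimer

Please read this disclaimer carefully before using the large language model provided in this repository. Your use of the model signifies your agreement to the following terms and conditions.

Biases and Offensiveness: The large language model is trained on a diverse range of internet text data, which may contain biased, racist, offensive, or otherwise inappropriate content. By using this model, you acknowledge and accept that the generated content may sometimes exhibit biases or be offensive or inappropriate. The developers of this repository do not endorse, support, or promote any such content or viewpoints.
Limitations: The large language model is an AI-based tool and not a human. It may produce incorrect, nonsensical, or irrelevant responses. It is the user's responsibility to critically evaluate the generated content and use it at their discretion.
Use at Your Own Risk: Users of this large language model must assume full responsibility for any consequences that may arise from their use of the tool. The developers and contributors of this repository shall not be held liable for any damages, losses, or harm resulting from the use or misuse of the provided model.
Ethical Considerations: Users are encouraged to use the large language model responsibly and ethically. By using this model, you agree not to use it for purposes that promote hate speech, discrimination, harassment, or any form of illegal or harmful activities.
Reporting Issues: If you encounter any biased, offensive, or otherwise inappropriate content generated by the large language model, please report it to the repository maintainers through the provided channels. Your feedback will help improve the model and mitigate potential issues.
Changes to this Disclaimer: The developers of this repository reserve the right to modify or update this disclaimer at any time without prior notice. It is the user's responsibility to periodically review the disclaimer to stay informed about any changes.

By using the large language model provided in this repository, you agree to accept and comply with the terms and conditions outlined in this disclaimer. If you do not agree with any part of this disclaimer, you should refrain from using the model and any content generated by it.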

Author:

Tom Jobbins

Dataset size:

21 GB