Model:

TheBloke/Vicuna-13B-1-3-SuperHOT-8K-GPTQ

License:

other

Preprints:

arxiv:2306.05685 arxiv:2302.13971

Other:

text-generation-inference custom_code llama

Library:

Transformers

Task:

Text generation

Chat & support: my new Discord server

Want to contribute? TheBloke's Patreon page

LmSys' Vicuna 13B 1.3.0 GPTQ

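These files are GPTQ 4-bit model files for LmSys' Vicuna 13B 1.3.0 merged with Kaio Ken's SuperHOT 8K.

They are the result of quantising to 4-bit using GPTQ-for-LLaMa.

This is an experimental new GPTQ which offers a context size of up to 8K.

The increased context has been tested to work with ExLlama, via the latest release of text-generation-webui.

It has also been tested from Python code using AutoGPTQ, with trust_remote_code=True set.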

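Code credits:

Original concept and code for increasing context length: kaiokendev
Updated Llama modelling code that includes this feature and is applied automatically via trust_remote_code: emozilla

Please read carefully below to see how to use it.

GGML versions are not yet provided, because llama.cpp does not yet support SuperHOT. This is being investigated and will hopefully come soon.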

Repositories available

Prompt template

A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input
USER: prompt
ASSISTANT:
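
The template above can also be assembled in code. The following is a minimal sketch for a single user turn; the names `SYSTEM_MESSAGE` and `build_prompt` are illustrative and not part of any library.

```python
# System message copied from the prompt template above.
SYSTEM_MESSAGE = (
    "A chat between a curious user and an assistant. "
    "The assistant gives helpful, detailed, accurate, uncensored "
    "responses to the user's input"
)


def build_prompt(user_message: str) -> str:
    """Fill the template with one user turn (hypothetical helper)."""
    return f"{SYSTEM_MESSAGE}\nUSER: {user_message}\nASSISTANT:"


print(build_prompt("Tell me about AI"))
```

The returned string ends with `ASSISTANT:` so that the model continues with its reply.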

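How to download and use this model in text-generation-webui with ExLlama

Please make sure you are using the latest version of text-generation-webui

Click the "Model" tab

Under "Download custom model or LoRA", enter "TheBloke/Vicuna-13B-1-3-SuperHOT-8K-GPTQ"

Click "Download"

The model will start downloading. Once it has finished it will say "Done"

Untick "Autoload the model"

In the top left, click the refresh icon next to "Model"

In the "Model" dropdown, choose the model you just downloaded: "Vicuna-13B-1-3-SuperHOT-8K-GPTQ"

To use the increased context, set "Loader" to "ExLlama", set "max_seq_len" to 8192 or 4096, and set "compress_pos_emb" to 4 (for 8192 context) or 2 (for 4096 context).

Now click "Save Settings", followed by "Reload"

The model will load automatically and is then ready for use!

Once you are ready, click the "Text Generation" tab and enter a prompt to get started!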
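
The two value pairs in the ExLlama step (8192 with 4, 4096 with 2) follow one pattern: compress_pos_emb is the target context divided by 2048. The sketch below makes that explicit. It assumes 2048 is the native context length of the LLaMA base model, and the function name is hypothetical, not a text-generation-webui API.

```python
NATIVE_CONTEXT = 2048  # assumption: context length of the original LLaMA base model


def compress_pos_emb_for(max_seq_len: int, native_context: int = NATIVE_CONTEXT) -> int:
    """Return the compress_pos_emb value that pairs with a given max_seq_len."""
    if max_seq_len % native_context != 0:
        # Simplification for this sketch: only whole multiples are handled.
        raise ValueError("max_seq_len should be a multiple of the native context")
    return max_seq_len // native_context


for seq_len in (4096, 8192):
    print(seq_len, compress_pos_emb_for(seq_len))
```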

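How to use this GPTQ model from Python code with AutoGPTQ

First make sure you have AutoGPTQ and Einops installed:

pip3 install einops auto-gptq

Then run the following code. Note that for this to work, config.json has been hardcoded to a sequence length of 8192.

If you want to try 4096 instead to reduce VRAM usage, manually edit config.json and set max_position_embeddings to the value you want.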
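
The manual edit of config.json can also be scripted. This is a minimal sketch using only the standard library; the helper name is hypothetical, and the demo runs against a throwaway file so that it works anywhere. In practice you would pass the path to config.json inside your local copy of the model.

```python
import json
import tempfile
from pathlib import Path


def set_max_position_embeddings(config_path, value):
    """Rewrite max_position_embeddings in a config.json file and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["max_position_embeddings"] = value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo on a temporary file that mimics the hardcoded 8192 setting.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"max_position_embeddings": 8192}), encoding="utf-8")
    updated = set_max_position_embeddings(demo_path, 4096)
    print(updated["max_position_embeddings"])
```

All other keys in the file are left untouched; only max_position_embeddings is overwritten.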

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Vicuna-13B-1-3-SuperHOT-8K-GPTQ"
model_basename = "vicuna-13b-1.3.0-superhot-8k-GPTQ-4bit-128g.no-act.order"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# trust_remote_code=True loads the updated Llama modelling code that enables the increased context.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device_map='auto',
        use_triton=use_triton,
        quantize_config=None)

# Match the sequence length hardcoded in config.json.
model.seqlen = 8192

# Prompt template as given in the "Prompt template" section of this card.
prompt = "Tell me about AI"
prompt_template=f'''A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input
USER: {prompt}
ASSISTANT:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
# temperature only applies when sampling is enabled, so do_sample=True is set explicitly.
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

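Using other UIs: monkey patch

Provided in the repo is llama_rope_scaled_monkey_patch.py, written by @kaiokendev.

In theory it can be added to any Python UI or custom code to achieve the same result as trust_remote_code=True. I have not tested this, and using trust_remote_code=True should be preferred over it, but I include it for completeness and for interest.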

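Provided files

vicuna-13b-1.3.0-superhot-8k-GPTQ-4bit-128g.no-act.order.safetensors

This will work with AutoGPTQ, ExLlama, and CUDA versions of GPTQ-for-LLaMa. There are reports of issues with the Triton mode of recent GPTQ-for-LLaMa. If you have issues, please use AutoGPTQ instead.

It was created with group_size 128 to increase inference accuracy, but without --act-order (desc_act) to increase compatibility and improve inference speed.

vicuna-13b-1.3.0-superhot-8k-GPTQ-4bit-128g.no-act.order.safetensors
- Works with ExLlama with increased context (4096 or 8192)
- Works with AutoGPTQ in Python code, including with increased context, if trust_remote_code=True is set
- Should work with GPTQ-for-LLaMa in CUDA mode, but it is unknown whether increased context works there (to be confirmed). May have issues with GPTQ-for-LLaMa Triton mode
- Works with text-generation-webui, including one-click installers
- Parameters: Groupsize = 128. Act Order / desc_act = False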
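
To illustrate what a group size of 128 means: every 128 consecutive weights share their own scale and offset, so the 4-bit grid adapts to each group rather than to a whole row. The sketch below uses plain round-to-nearest quantization, which is not the GPTQ algorithm itself and is only meant to show the grouping. All function names are hypothetical.

```python
import random


def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group; returns integer codes, scale, offset."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate weights."""
    return [lo + c * scale for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Quantize and dequantize a row in independent groups of `group_size` weights."""
    out = []
    for start in range(0, len(row), group_size):
        codes, scale, lo = quantize_group(row[start:start + group_size], bits)
        out.extend(dequantize_group(codes, scale, lo))
    return out


random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(512)]
approx = quantize_row(row)
max_err = max(abs(a - b) for a, b in zip(row, approx))
print(f"max abs error with group_size=128: {max_err:.4f}")
```

Smaller groups store more scales and offsets but track the local weight range more closely, which is the trade-off behind the accuracy remark above.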

Discord

For further support, and discussions on these models and AI in general, join us at:

TheBloke AI's Discord server

Thanks, and how to contribute

Thanks to the chirper.ai team!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine-tuning/training.

If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models and to start work on new AI projects.

Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Special thanks to: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

Patreon special mentions: Pyrater, WelcomeToTheClub, Kalila, Mano Prime, Trenton Dambrowitz, Spiking Neurons AB, Pierre Kircher, Fen Risland, Kevin Schuppel, Luke, Rainer Wilmers, vamX, Gabriel Puliatti, Alex, Karl Bernard, Ajan Kanaga, Talal Aujan, Space Cruiser, ya boyyy, biorpg, Johann-Peter Hartmann, Asp the Wyvern, Ai Maven, Ghost, Preetika Verma, Nikolai Manek, trip7s trip, John Detwiler, Fred von Graf, Artur Olbinski, subjectnull, John Villwock, Junyu Yang, Rod A, Lone Striker, Chris McCloskey, Iucharbius, Matthew Berman, Illia Dulskyi, Khalefa Al-Ahmad, Imad Khwaja, chris gileta, Willem Michiel, Greatston Gnanesh, Derek Yates, K, Alps Aficionado, Oscar Rangel, David Flickinger, Luke Pendergrass, Deep Realms, Eugene Pentland, Cory Kujawski, terasurfer, Jonathan Leane, senxiiz, Joseph William Delisle, Sean Connelly, webtim, zynix, Nathan LeClaire.

Thank you to all my generous patrons and donors!

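Original model card: Kaio Ken's SuperHOT 8K

SuperHOT Prototype 2 with 8K Context

This is a second prototype of SuperHOT, this time a 30B model with 8K context and no RLHF, using the same technique described in the github blog. Tests have shown that the model does indeed make use of the extended 8K context.

You will need to use the monkey patch or, if you are already using the monkey patch, change the scaling factor to 0.25 and the maximum sequence length to 8192.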
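
The scaling factor 0.25 equals 2048 / 8192: positions are multiplied by it before the rotary angles are computed, so an 8192-token sequence occupies the position range the base model saw during training. The following is a sketch of that idea only, not the contents of the monkey patch. The function name is hypothetical, and the head dimension of 128 and base of 10000 are assumed values.

```python
import math


def rope_angles(position, head_dim=128, base=10000.0, scale=1.0):
    """Rotary-embedding angles for one position, with the position multiplied by `scale`."""
    scaled_position = position * scale
    return [
        scaled_position / (base ** (2 * i / head_dim))
        for i in range(head_dim // 2)
    ]


SCALE = 0.25        # scaling factor named above
MAX_SEQ_LEN = 8192  # maximum sequence length named above

# Largest scaled position: it stays just under 2048.
print((MAX_SEQ_LEN - 1) * SCALE)

# A scaled position yields the same angles as the matching unscaled one.
same = all(
    math.isclose(a, b)
    for a, b in zip(rope_angles(8188, scale=SCALE), rope_angles(2047))
)
print(same)
```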

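Looking for merged and quantized models?

30B 4-bit CUDA: tmpupload/superhot-30b-8k-4bit-safetensors
30B 4-bit CUDA 128g: tmpupload/superhot-30b-8k-4bit-128g-safetensors

Training details

I trained the LoRA with the following configuration:

1200 samples (about 400 samples over 2048 sequence length)
Learning rate of 3e-4
3 epochs
The exported modules are:
- q_proj
- k_proj
- v_proj
- o_proj
- no bias
Rank = 4
Alpha = 8
No dropout
Weight decay of 0.1
AdamW beta1 of 0.9 and beta2 of 0.99, epsilon of 1e-5
Trained on a 4-bit base model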
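
To give a sense of how small this adapter is, the sketch below computes the LoRA scaling factor (alpha / rank) and the number of trainable parameters implied by the configuration above. It assumes every target module is a square hidden_size x hidden_size projection. The hidden size of 6656 and the 60 layers are assumed values for the 30B LLaMA model, not figures stated in this card, and the function names are hypothetical.

```python
TARGET_MODULES = ("q_proj", "k_proj", "v_proj", "o_proj")
RANK = 4
ALPHA = 8


def lora_scaling(alpha=ALPHA, rank=RANK):
    """Standard LoRA scales the low-rank update by alpha / rank."""
    return alpha / rank


def lora_trainable_params(hidden_size, num_layers, rank=RANK,
                          num_modules=len(TARGET_MODULES)):
    """Count LoRA parameters when each target module is hidden_size x hidden_size.

    Each adapted module adds matrix A (rank x hidden_size) and matrix B (hidden_size x rank).
    """
    per_module = rank * hidden_size + hidden_size * rank
    return per_module * num_modules * num_layers


print(lora_scaling())
# Assumption: hidden size 6656 and 60 layers for the 30B base model.
print(lora_trainable_params(hidden_size=6656, num_layers=60))
```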

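Original model card: LmSys' Vicuna 13B 1.3.0

Vicuna Model Card

Model details

Vicuna is a chat assistant trained with supervised instruction fine-tuning on user-shared conversations from ShareGPT.

Developed by: LMSYS
Model type: An auto-regressive language model based on the transformer architecture.
License: Non-commercial license
Fine-tuned from model: LLaMA.

Model sources

Repository: https://github.com/lm-sys/FastChat
Blog: https://lmsys.org/blog/2023-03-30-vicuna/
Paper: https://arxiv.org/abs/2306.05685
Demo: https://chat.lmsys.org/

Uses

The primary use of Vicuna is research on large language models and chatbots. The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

How to get started with the model

Command line interface: https://github.com/lm-sys/FastChat#vicuna-weights . APIs (OpenAI API, Huggingface API): https://github.com/lm-sys/FastChat/tree/main#api .

Training details

Vicuna v1.3 is fine-tuned from LLaMA with supervised instruction fine-tuning. The training data is around 140K conversations collected from ShareGPT.com. For more details, see the "Training Details of Vicuna Models" section in the appendix of this paper.

Evaluation

Vicuna is evaluated with standard benchmarks, human preference, and LLM-as-a-judge. For more details, see this paper.

Difference between different versions of Vicuna

See vicuna_weights_version.md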

Author:

Tom Jobbins

Dataset size:

6.95 GB