Model:

TheBloke/Manticore-13B-Chat-Pyg-SuperHOT-8K-GPTQ


Chat & support: my new Discord server

Want to contribute? TheBloke's Patreon page

OpenAccess AI Collective's Manticore 13B Chat GPTQ

These files are GPTQ 4bit model files for OpenAccess AI Collective's Manticore 13B Chat merged with Kaio Ken's SuperHOT 8K.

It is the result of quantising to 4bit using GPTQ-for-LLaMa.

This is an experimental new GPTQ which offers up to 8K context size.

The increased context has been tested to work with ExLlama, via the latest release of text-generation-webui.

It has also been tested from Python code using AutoGPTQ, with trust_remote_code=True.

Code credits:

  • Original concept and code for increasing context length: kaiokendev
  • Updated Llama modelling code that includes this feature, applied automatically via trust_remote_code: emozilla

Please read carefully below to see how to use this model.

GGML versions are not yet provided, as llama.cpp does not yet support SuperHOT. This is being investigated and will hopefully be resolved soon.

Repositories available

How to easily download and use this model in text-generation-webui

Please make sure you're using the latest version of text-generation-webui.

  • Click the Model tab.
  • Under Download custom model or LoRA, enter TheBloke/Manticore-13B-Chat-Pyg-SuperHOT-8K-GPTQ
  • Click Download
  • The model will start downloading. Once it's finished it will say "Done".
  • Untick Autoload the model
  • In the top left, click the refresh icon next to Model.
  • In the Model dropdown, choose the model you just downloaded: Manticore-13B-Chat-Pyg-SuperHOT-8K-GPTQ
  • To use the increased context, set Loader to ExLlama, set max_seq_len to 8192 or 4096, and set compress_pos_emb to 4 for 8192 context or to 2 for 4096 context (see the note after this list).
  • Now click Save Settings, followed by Reload
  • The model will automatically load, and is now ready for use!
  • Once you're ready, click the Text Generation tab and enter a prompt to get started!
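
A quick note on those two settings: compress_pos_emb is simply the desired context length divided by the model's native 2048-token context, as the values above show. A minimal illustration in plain Python (not part of text-generation-webui):

    # compress_pos_emb = desired context length / native LLaMA context (2048)
    NATIVE_CTX = 2048

    def compress_pos_emb_for(max_seq_len: int) -> int:
        """Return the compress_pos_emb value matching a desired max_seq_len."""
        return max_seq_len // NATIVE_CTX

    print(compress_pos_emb_for(8192))  # -> 4
    print(compress_pos_emb_for(4096))  # -> 2
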
    How to use this GPTQ model from Python code with AutoGPTQ

    First make sure you have AutoGPTQ and Einops installed:

    pip3 install einops auto-gptq
    

    Then run the following code. Note that for this to work, config.json has been hardcoded to a sequence length of 8192.

    If you want to try 4096 instead to reduce VRAM usage, please edit config.json manually and set max_position_embeddings to the value you want.
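
    For example, a minimal sketch of that manual edit, assuming the model has already been downloaded to a local directory (the path below is illustrative):

    import json

    # Illustrative local path to the downloaded model files
    config_path = "Manticore-13B-Chat-Pyg-SuperHOT-8K-GPTQ/config.json"

    with open(config_path) as f:
        config = json.load(f)

    # Reduce the hardcoded 8192 context to 4096 to lower VRAM usage
    config["max_position_embeddings"] = 4096

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)

    With that change saved, the code below will run with a 4096 context instead.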

    from transformers import AutoTokenizer, pipeline, logging
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    
    model_name_or_path = "TheBloke/Manticore-13B-Chat-Pyg-SuperHOT-8K-GPTQ"
    model_basename = "manticore-13b-chat-pyg-superhot-8k-GPTQ-4bit-128g.no-act.order"
    
    use_triton = False
    
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
    
    model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
            model_basename=model_basename,
            use_safetensors=True,
            trust_remote_code=True,
            device_map='auto',
            use_triton=use_triton,
            quantize_config=None)
    
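    # seqlen must match the max_position_embeddings set in config.json (8192 here)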
    model.seqlen = 8192
    
    # Note: check the prompt template is correct for this model.
    prompt = "Tell me about AI"
    prompt_template=f'''USER: {prompt}
    ASSISTANT:'''
    
    print("\n\n*** Generate:")
    
    input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
    output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
    print(tokenizer.decode(output[0]))
    
    # Inference can also be done using transformers' pipeline
    
    # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
    logging.set_verbosity(logging.CRITICAL)
    
    print("*** Pipeline:")
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.15
    )
    
    print(pipe(prompt_template)[0]['generated_text'])
    

    Using other UIs: monkey patch

    The repo contains llama_rope_scaled_monkey_patch.py, written by @kaiokendev.

    It can theoretically be added to any Python UI or custom code to achieve the same result as trust_remote_code=True. I have not tested this, and it should be superseded by using trust_remote_code=True, but it is included for completeness and interest.
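
    Untested, but usage would presumably look something like the sketch below; the function name replace_llama_rope_with_scaled_rope is taken from kaiokendev's original patch and is an assumption here:

    # Hypothetical usage -- untested. Assumes llama_rope_scaled_monkey_patch.py is
    # on the Python path and exposes replace_llama_rope_with_scaled_rope(), as in
    # kaiokendev's original patch.
    from llama_rope_scaled_monkey_patch import replace_llama_rope_with_scaled_rope

    # Apply the patch BEFORE instantiating the model, so the scaled RoPE replaces
    # the stock LLaMA rotary embedding implementation
    replace_llama_rope_with_scaled_rope()

    # ...then load the model as usual, e.g. with AutoGPTQ as shown above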

    Provided files

    manticore-13b-chat-pyg-superhot-8k-GPTQ-4bit-128g.no-act.order.safetensors

    This will work with AutoGPTQ, ExLlama, and CUDA versions of GPTQ-for-LLaMa. There have been reports of problems with the Triton mode of recent GPTQ-for-LLaMa. If you have issues, please use AutoGPTQ instead.

    It was created with group_size 128 to increase inference accuracy, but without --act-order (desc_act) to increase compatibility and improve inference speed (these settings are sketched as a quantisation config after the list below).

    • manticore-13b-chat-pyg-superhot-8k-GPTQ-4bit-128g.no-act.order.safetensors
      • Works with ExLlama with increased context (4096 or 8192)
      • Works with AutoGPTQ from Python code, including with increased context, if trust_remote_code=True is set
      • Should work with GPTQ-for-LLaMa in CUDA mode, but whether the increased context works is unknown - to be confirmed. May have issues with GPTQ-for-LLaMa Triton mode.
      • Works with text-generation-webui, including the one-click installers.
      • Parameters: Groupsize = 128. Act Order / desc_act = False.
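
    For reference, a sketch of how those parameters map onto AutoGPTQ's quantisation config (only needed when quantising; for inference, quantize_config=None as in the example above is sufficient):

    from auto_gptq import BaseQuantizeConfig

    # 4-bit, group size 128 for accuracy, act-order (desc_act) disabled for
    # compatibility and inference speed -- matching the provided file
    quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False,
    )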

    Discord

    For further support, and discussions on these models and AI in general, join us at:

    TheBloke AI's Discord server

    Thanks, and how to contribute

    Thanks to the chirper.ai team!

    I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

    If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

    Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

    Special thanks to: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

    Patreon special mentions: zynix, ya boyyy, Trenton Dambrowitz, Imad Khwaja, Alps Aficionado, chris gileta, John Detwiler, Willem Michiel, RoA, Mano Prime, Rainer Wilmers, Fred von Graf, Matthew Berman, Ghost, Nathan LeClaire, Iucharbius, Ai Maven, Illia Dulskyi, Joseph William Delisle, Space Cruiser, Lone Striker, Karl Bernard, Eugene Pentland, Greatston Gnanesh, Jonathan Leane, Randy H, Pierre Kircher, Willian Hasse, Stephen Murray, Alex, terasurfer, Edmond Seymore, Oscar Rangel, Luke Pendergrass, Asp the Wyvern, Junyu Yang, David Flickinger, Luke, Spiking Neurons AB, subjectnull, Pyrater, Nikolai Manek, senxiiz, Ajan Kanaga, Johann-Peter Hartmann, Artur Olbinski, Kevin Schuppel, Derek Yates, Kalila, K, Talal Aujan, Khalefa Al-Ahmad, Gabriel Puliatti, John Villwock, WelcomeToTheClub, Daniel P. Andersen, Preetika Verma, Deep Realms, Fen Risland, trip7s trip, webtim, Sean Connelly, Michael Levine, Chris McCloskey, biorpg, vamX, Viktor Bowallius, Cory Kujawski.

    Thank you to all my generous patrons and donaters!

    Original model card: Kaio Ken's SuperHOT 8K

    SuperHOT Prototype 2 w/ 8K Context

    This is a second prototype of SuperHOT, this time 30B with 8K context and no RLHF, using the same technique described in the github blog. Tests have shown that the model does indeed leverage the extended context at 8K.

    You will need to use either the monkeypatch or, if you are already using the monkeypatch, change the scaling factor to 0.25 and the maximum sequence length to 8192.
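
    The 0.25 scaling factor is simply the ratio of the native context to the extended context:

    # RoPE interpolation scale = native context / extended context
    scale = 2048 / 8192   # = 0.25
    max_sequence_length = 8192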

    Looking for Merged & Quantized Models?

    Training Details

    I trained the LoRA with the following configuration (a rough code sketch follows the list):

    • 1200 samples (~400 samples over the 2048 sequence length)
    • learning rate of 3e-4
    • 3 epochs
    • The exported modules are:
      • q_proj
      • k_proj
      • v_proj
      • o_proj
      • no bias
    • Rank = 4
    • Alpha = 8
    • no dropout
    • weight decay of 0.1
    • AdamW beta1 of 0.9 and beta2 0.99, epsilon of 1e-5
    • Trained on a 4-bit base model
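
    A rough sketch of that configuration using the peft and transformers libraries (mapping the listed hyperparameters onto these APIs is my assumption; the original training code is not reproduced here):

    from peft import LoraConfig
    from transformers import TrainingArguments

    # LoRA hyperparameters as listed above
    lora_config = LoraConfig(
        r=4,
        lora_alpha=8,
        lora_dropout=0.0,
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    # Optimiser and schedule settings as listed above
    training_args = TrainingArguments(
        output_dir="superhot-lora",  # illustrative output path
        learning_rate=3e-4,
        num_train_epochs=3,
        weight_decay=0.1,
        adam_beta1=0.9,
        adam_beta2=0.99,
        adam_epsilon=1e-5,
    )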

    Original model card: OpenAccess AI Collective's Manticore 13B Chat

    Manticore 13B Chat

    Manticore 13B Chat builds on Manticore with new datasets, including a de-duped subset of the Pygmalion dataset. It also removes all Alpaca-style prompts using ### in favour of chat-only style prompts using USER:, ASSISTANT:, as well as pygmalion/metharme prompting using <|system|>, <|user|> and <|model|> tokens.
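
    For illustration, the two prompt styles might be assembled like this (a sketch; the exact metharme token layout is an assumption based on the description above):

    prompt = "Tell me about AI"

    # CHAT style prompt, as used in the Python example earlier
    chat_prompt = f'''USER: {prompt}
    ASSISTANT:'''

    # pygmalion/metharme style prompt (token layout is an assumption)
    metharme_prompt = f"<|system|>You are a helpful assistant.<|user|>{prompt}<|model|>"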

    Questions, comments, feedback, looking to donate, or want to help? Reach out on our Discord or email wing@openaccessaicollective.org.

    Training Datasets

    Manticore 13B Chat was fine-tuned on the following datasets, along with the datasets from the original Manticore 13B.

    Manticore 13B Chat was trained on 25% of the datasets below. The datasets were merged, shuffled, and then sharded into 4 parts.
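
    A minimal sketch of that preparation step using the Hugging Face datasets library (the dataset names below are placeholders, not the actual training mix):

    from datasets import concatenate_datasets, load_dataset

    # Placeholder dataset names -- see above and the repo for the real training mix
    parts = [load_dataset(name, split="train") for name in ("dataset_a", "dataset_b")]

    # Merge, shuffle, then shard into 4 parts; one shard (25%) is used for training
    merged = concatenate_datasets(parts).shuffle(seed=42)
    train_shard = merged.shard(num_shards=4, index=0)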

    Manticore 13B

    Not added from Manticore 13B:

    • mmlu - the mmlu datasets were not added to this model, as the test splits are used for benchmarks

    Shoutouts

    Special thanks to Nanobit for help with Axolotl, TheBloke for quantizing these models to make them more accessible to all, ehartford for cleaned datasets, and 0x000011b for the RP dataset.

    Demo

    Try out the model in HF Spaces. The demo uses a quantized GGML version of the model to quickly return predictions on smaller GPUs (and even CPUs). Quantized GGML may have some minimal loss of model quality.

    Release Notes

    Build

    Manticore was built with Axolotl on 8xA100 80GB

    • 3 epochs taking approximately 8 hours. No further epochs will be released.
    • The configuration to duplicate this build is provided in this repo's /config folder.

    Bias, Risks, and Limitations

    Manticore has not been aligned to human preferences with techniques like RLHF, nor deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so). It was fine-tuned from the base model LlaMa 13B; please see its model card's Limitations section for relevant information.

    Examples

    USER: make up a sonnet in the style of shakespeare about a skier skiing in the glades on a bluebird day.
    ASSISTANT: A skier doth glide 'neath sun’s glow, His edges sharp to cut the snow, Bluebird sky above, While he doth revel on the slopes below, The cold air biting his nose, But not his spirits, For he is filled with a sense of joy, As he carves through the powdery drifts. His body moves in graceful rhythm, And each turn he makes with ease, He feels like he could ski forever, With no care for the world outside these glades.
    
    USER: tell me a joke about alpacas
    ASSISTANT: Did you hear about the famous alpaca ballet dancer? She was a prima llama.
    USER: okay, that is actually pretty good
    ASSISTANT: What do you call an alpaca in a bowtie? A llama-cholla!
    USER: why is that funny?
    ASSISTANT: Because it’s a play on words. “Llama-cholla” sounds like “lame chola,” which is Spanish slang for someone who acts cool or pretentious. So, the joke is implying that alpacas are vain and act cool with their bowties.
    
    USER: What is 44 + 77?
    ASSISTANT: 121
    USER: What is the sum of 33 and 77?
    ASSISTANT: 110
    USER: tell me 5 words that start with “ex” and the sum of 33 and 77.
    ASSISTANT: exotic, expressive, explicit, experimental, and extravagant. The sum of 33 and 77 is 110.
    USER: tell me 3 words that start with “egr” and end in “s”
    ASSISTANT: egret, eggplant, and egrets.
    
    • oof
    USER: Sally has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
    ASSISTANT: Sally has a total of 6 siblings, so she has a total of 6 - 2 = 4 sisters.
    USER: Sally has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Use multi-step reasoning to explain your answer.
    ASSISTANT: Sally has a total of three brothers, so she has a total of three * 2 = six siblings. Her family consists of three boys and six girls in total. This means that Sally has three sisters since there are six female relatives and only three male ones.
    USER: But when brothers have sisters, aren’t they shared or the same?
    ASSISTANT: No, not every sibling shares their mother. For example, I have two brothers who share one mother but my third brother has a different mother from both of them.
    USER: Let’s assume they all share the same mother and father. Now explain with multi-step reasoning.
    ASSISTANT: If Sally has three brothers then that means she has six siblings altogether. Each brother has two sisters, which means that there are twelve female relatives in her family - one for each of her brothers and herself. Since there are only six females in the entire family, that means that Sally’s mother has no other children than her own three sons - so she can’t be the mother of any of the sisters. So all those sisters have to belong to one of the brothers, and since they are shared by two brothers, then each brother has a single sister. Therefore, there is only one sister for Sally, making the final answer 1.
    
    • Pretty sure the reasoning is incorrect, and the final answer of only one sister is a lucky guess.