Model:

TheBloke/WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ

License:

other

Other:

text-generation-inference custom_code llama

Libraries:

Transformers

Task:

Text generation

Chat & support: my new Discord server

Want to contribute? TheBloke's Patreon page

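Panchovix's merge of WizardLM 33B V1.0 Uncensored and SuperHOT 8K GPTQ

These files are GPTQ 4-bit model files for Panchovix's merge of WizardLM 33B V1.0 Uncensored and SuperHOT 8K.

This is an experimental new GPTQ that offers a context size of up to 8K.

The increased context has been tested and works with ExLlama, via the latest release of text-generation-webui.

It has also been tested from Python code using AutoGPTQ with trust_remote_code=True.

Code credits:

Original concept and code for increasing the context length: kaiokendev
Updated Llama modelling code that includes this feature automatically via trust_remote_code: emozilla.

Please read the following carefully to see how to use it.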

Note: using the full 8K context on a 30B model will exceed 24GB of VRAM.
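To make the VRAM note concrete, here is a rough back-of-the-envelope sketch of the key/value cache size alone. It assumes the usual LLaMA 33B shape (60 layers, hidden size 6656) and an FP16 cache; actual usage depends on the loader and its buffers, so treat the numbers as an estimate rather than a measurement.

```python
# Assumed LLaMA 33B shape: 60 transformer layers, hidden size 6656.
N_LAYERS = 60
HIDDEN_SIZE = 6656
BYTES_PER_VALUE = 2  # assumption: the cache is stored in FP16


def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    per_token = 2 * N_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE  # 2 = key + value
    return per_token * context_tokens


for context in (2048, 4096, 8192):
    gib = kv_cache_bytes(context) / 2**30
    print(f"{context:>5} tokens: {gib:.2f} GiB of KV cache")
```

Under these assumptions the cache alone comes to roughly 12 GiB at 8192 tokens. Added to roughly 16 GB for the 4-bit weights of a 33B-parameter model, that is already past 24GB before any working buffers are counted.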

GGML versions are not yet provided, because llama.cpp does not yet support SuperHOT. This is being investigated and will hopefully come soon.

Repositories available

No GGML quants are available yet, because llama.cpp does not yet support SuperHOT.

How to easily download and use this model in text-generation-webui with ExLlama

Please make sure you are using the latest version of text-generation-webui.

Click the Model tab.

Under Download custom model or LoRA, enter TheBloke/WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ.

Click Download.

The model will start downloading. Once it has finished, it will say "Done".

Untick Autoload the model.

In the top left, click the refresh icon next to Model.

In the Model dropdown, choose the model you just downloaded: WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ

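To use the increased context, set the Loader to ExLlama, set max_seq_len to 8192 or 4096, and set compress_pos_emb to 4 for 8192 context or to 2 for 4096 context.

Now click Save Settings, followed by Reload.

The model will load automatically and is then ready for use.

Once you're ready, click the Text Generation tab and enter a prompt to get started.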
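The two pairs of settings in the steps above follow one pattern: compress_pos_emb is the target context divided by the 2048-token context that the original LLaMA models were trained with. The sketch below only illustrates that arithmetic; the helper name is hypothetical and is not part of text-generation-webui or ExLlama.

```python
NATIVE_CONTEXT = 2048  # context length the original LLaMA models were trained with


def compress_pos_emb_for(max_seq_len: int, native_context: int = NATIVE_CONTEXT) -> int:
    """Return the position compression factor for a target context length."""
    if max_seq_len % native_context != 0:
        raise ValueError("max_seq_len should be a multiple of the native context")
    return max_seq_len // native_context


for max_seq_len in (4096, 8192):
    factor = compress_pos_emb_for(max_seq_len)
    # With compression, the last token's position is scaled back into
    # the range the model saw during training.
    last_scaled_position = (max_seq_len - 1) / factor
    print(f"max_seq_len={max_seq_len}: compress_pos_emb={factor}, "
          f"last scaled position={last_scaled_position}")
```

In both cases the largest scaled position stays below 2048, which is the intuition behind pairing each max_seq_len with its matching compress_pos_emb.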

How to use this GPTQ model from Python code with AutoGPTQ

First make sure you have AutoGPTQ and Einops installed:

pip3 install einops auto-gptq
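Before loading a model of this size, it can be worth confirming that both packages are importable. The check below is a small sketch using only the standard library; it assumes the pip packages install the modules auto_gptq and einops, and the helper name is made up for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: auto-gptq installs the module auto_gptq; einops installs einops.
missing = missing_modules(["auto_gptq", "einops"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("AutoGPTQ and Einops are importable.")
```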

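Then run the following code. Note that, to make this work, the sequence length in config.json has been hardcoded to 8192.

If you want to try 4096 instead to reduce VRAM usage, manually edit config.json and set max_position_embeddings to the value you want.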

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ"
model_basename = "wizardlm-33b-v1.0-uncensored-superhot-8k-GPTQ-4bit--1g.act.order"

# Set to True to use the Triton kernels instead of the CUDA ones.
use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# trust_remote_code=True loads the updated Llama modelling code shipped in the
# repository, which is what enables the increased context.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device_map='auto',
        use_triton=use_triton,
        quantize_config=None)

# Match the sequence length hardcoded in config.json.
model.seqlen = 8192

# Note: check the prompt template is correct for this model.
prompt = "Tell me about AI"
prompt_template=f'''USER: {prompt}
ASSISTANT:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
# do_sample=True is required for temperature to have any effect.
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,  # required for temperature and top_p to take effect
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
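The manual config.json edit mentioned before the script can also be scripted. This is a minimal sketch, assuming the repository has been downloaded to a local directory; the helper name and the path in the usage comment are hypothetical.

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, value):
    """Rewrite config.json in place with a new max_position_embeddings."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["max_position_embeddings"] = value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Hypothetical usage, assuming the repository was downloaded to ./model:
# set_max_position_embeddings("model/config.json", 4096)
```

If the value is changed to 4096, it would presumably make sense to set model.seqlen to the same number in the script above, although that pairing has not been verified here.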

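Using other UIs: monkey patch

The repository includes llama_rope_scaled_monkey_patch.py, written by @kaiokendev.

In theory it can be added to any Python UI or custom code to achieve the same result as trust_remote_code=True. I have not tested this, and it should be superseded by using trust_remote_code=True, but I include it for completeness and for interest.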
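For readers unfamiliar with the technique, the sketch below shows the general idea of a monkey patch that swaps in a rotary embedding with compressed positions. It is not the contents of llama_rope_scaled_monkey_patch.py: the classes, the modeling namespace, and the 0.25 scale are stand-ins chosen for illustration.

```python
import types


class RotaryEmbedding:
    """Stand-in for a model's rotary position embedding (hypothetical)."""

    def __init__(self, dim, base=10000.0):
        # One inverse frequency per pair of embedding dimensions.
        self.inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]

    def angles(self, position):
        return [position * freq for freq in self.inv_freq]


class ScaledRotaryEmbedding(RotaryEmbedding):
    """Same embedding, with positions compressed by a fixed factor."""

    scale = 0.25  # assumption: 2048 / 8192

    def angles(self, position):
        return super().angles(position * self.scale)


# Stand-in for the library module in which model code looks up the class.
modeling = types.SimpleNamespace(RotaryEmbedding=RotaryEmbedding)


def apply_patch():
    """Replace the class on the module so later lookups get the scaled one."""
    modeling.RotaryEmbedding = ScaledRotaryEmbedding


original = modeling.RotaryEmbedding(dim=8)
apply_patch()
patched = modeling.RotaryEmbedding(dim=8)

# Position 8188 after patching lands where position 2047 was before.
print(patched.angles(8188) == original.angles(2047))
```

The key point is ordering: a patch of this kind has to run before the model is constructed, because objects created earlier keep the unpatched class.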

Provided files

wizardlm-33b-v1.0-uncensored-superhot-8k-GPTQ-4bit--1g.act.order.safetensors

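This will work with AutoGPTQ, ExLlama, and CUDA versions of GPTQ-for-LLaMa. There are reports of issues with the Triton mode of recent GPTQ-for-LLaMa versions. If you have issues, please use AutoGPTQ instead.

It was created without group_size to lower VRAM requirements, and with --act-order (desc_act) to boost inference accuracy as much as possible.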

wizardlm-33b-v1.0-uncensored-superhot-8k-GPTQ-4bit--1g.act.order.safetensors
- Works with AutoGPTQ in CUDA or Triton modes.
- LLaMa models also work with ExLlama (https://github.com/turboderp/exllama), which usually provides higher performance than AutoGPTQ and uses less VRAM.
- Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
- Works with text-generation-webui, including the one-click installers.
- Parameters: Groupsize = -1. Act Order / desc_act = True.
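The listed parameters are also visible in the file name, where 4bit, -1g and act.order appear to encode the bit width, group size and act-order setting. The parser below is a sketch of that reading; the naming convention is inferred from this one file name, and the handling of a no- prefix is a guess at how a file without act-order might be named.

```python
import re

# Inferred pattern: "<bits>bit-<groupsize>g.[no-]act.order"
PATTERN = re.compile(r"(\d+)bit-(-?\d+)g\.(no-)?act\.order")


def parse_gptq_filename(filename):
    """Extract bits, group size and act-order from a GPTQ file name."""
    match = PATTERN.search(filename)
    if match is None:
        raise ValueError(f"unrecognised GPTQ file name: {filename}")
    bits, group_size, no_prefix = match.groups()
    return {
        "bits": int(bits),
        "group_size": int(group_size),
        "desc_act": no_prefix is None,  # guess: a "no-" prefix means act-order is off
    }


name = "wizardlm-33b-v1.0-uncensored-superhot-8k-GPTQ-4bit--1g.act.order.safetensors"
print(parse_gptq_filename(name))
```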

Discord

For further support, and for discussion of these models and AI in general, join us at: TheBloke AI's Discord server

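Thanks, and how to contribute

Thanks to the chirper.ai team!

Many people have asked whether they can contribute. I enjoy providing models and helping people, and I would love to be able to spend more time doing it, as well as expanding into new projects such as fine-tuning and training.

If you are able and willing to contribute, it will be most gratefully received and will help me to keep providing more models and to start work on new AI projects.

Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, and other benefits.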

Patreon: https://patreon.com/TheBlokeAI
Ko-Fi: https://ko-fi.com/TheBlokeAI

Special thanks to: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

Patreon special mentions: Pyrater, WelcomeToTheClub, Kalila, Mano Prime, Trenton Dambrowitz, Spiking Neurons AB, Pierre Kircher, Fen Risland, Kevin Schuppel, Luke, Rainer Wilmers, vamX, Gabriel Puliatti, Alex, Karl Bernard, Ajan Kanaga, Talal Aujan, Space Cruiser, ya boyyy, biorpg, Johann-Peter Hartmann, Asp the Wyvern, Ai Maven, Ghost, Preetika Verma, Nikolai Manek, trip7s trip, John Detwiler, Fred von Graf, Artur Olbinski, subjectnull, John Villwock, Junyu Yang, Rod A, Lone Striker, Chris McCloskey, Iucharbius, Matthew Berman, Illia Dulskyi, Khalefa Al-Ahmad, Imad Khwaja, chris gileta, Willem Michiel, Greatston Gnanesh, Derek Yates, K, Alps Aficionado, Oscar Rangel, David Flickinger, Luke Pendergrass, Deep Realms, Eugene Pentland, Cory Kujawski, terasurfer, Jonathan Leane, senxiiz, Joseph William Delisle, Sean Connelly, webtim, zynix, Nathan LeClaire.

Thank you to all my generous patrons and donors!

Original model card: Panchovix's merge of WizardLM 33B V1.0 Uncensored and SuperHOT 8K

WizardLM 33B V1.0 Uncensored merged with kaiokendev's 33b SuperHOT 8k LoRA, without quantization. (Full FP16 model)

Author:

Tom Jobbins

Dataset size:

15.78 GB