Model:

TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16


Chat & support: my new Discord server

Want to contribute? TheBloke's Patreon page

WizardLM's WizardLM 13B V1.1 fp16

These are fp16 pytorch format model files for WizardLM's WizardLM 13B V1.1 merged with Kaio Ken's SuperHOT 8K.

Kaio Ken's SuperHOT 13b LoRA is merged on to the base model, and then 8K context can be achieved at inference time by using trust_remote_code=True.

Note that config.json has been set to a sequence length of 8192. This can be modified to 4096 if you want to try a smaller sequence length.

Repositories available

How to use this model from Python code

First make sure you have einops installed:

pip3 install einops

Then run the following code. config.json has been set by default to a sequence length of 8192, but you can also configure this in your Python code.

The provided modelling code, activated with trust_remote_code=True, will automatically set the scale parameter from the configured max_position_embeddings. For example, for 8192 the scale is set to 4.
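
As an illustration only (not code you need to run), the scale value follows from the ratio of the configured context length to the base model's original context, assumed here to be the standard LLaMA pretraining length of 2048 tokens:

ORIGINAL_CONTEXT = 2048  # assumed original LLaMA pretraining context length

def rope_scale(max_position_embeddings: int) -> float:
    # The RoPE interpolation scale is target context / original context
    return max_position_embeddings / ORIGINAL_CONTEXT

print(rope_scale(8192))  # 4.0 -- matches "scale is set to 4" above
print(rope_scale(4096))  # 2.0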

from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, pipeline

model_name_or_path = "TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
# Change this to the sequence length you want
config.max_position_embeddings = 8192

# trust_remote_code=True loads the repo's custom modelling code,
# which applies the RoPE scaling for the configured sequence length
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
        config=config,
        trust_remote_code=True,
        device_map='auto')

# Note: check to confirm that this prompt template is correct for this model!
prompt = "Tell me about AI"
prompt_template=f'''USER: {prompt}
ASSISTANT:'''

print("\n\n*** Generate:")

# Tokenize the prompt, generate up to 512 new tokens on the GPU, and decode the output
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

Using other UIs: monkey patch

Provided in the repository is llama_rope_scaled_monkey_patch.py, written by @kaiokendev.

In theory this can be added to any Python UI or custom code to achieve the same result as trust_remote_code=True. I have not tested this, and it should be superseded by using trust_remote_code=True, but I include it for completeness and interest.

Discord

For further support, and discussions on these models and AI in general, join us at:

TheBloke AI's Discord server

Thanks, and how to contribute

Thanks to the chirper.ai team!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models and to start work on new AI projects.

Donators will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Special thanks to: Luke from CarbonQuill, Aemon Algiz.

Patreon special mentions: RoA, Lone Striker, Gabriel Puliatti, Derek Yates, Randy H, Jonathan Leane, Eugene Pentland, Karl Bernard, Viktor Bowallius, senxiiz, Daniel P. Andersen, Pierre Kircher, Deep Realms, Cory Kujawski, Oscar Rangel, Fen Risland, Ajan Kanaga, LangChain4j, webtim, Nikolai Manek, Trenton Dambrowitz, Raven Klaugh, Kalila, Khalefa Al-Ahmad, Chris McCloskey, Luke @flexchar, Ai Maven, Dave, Asp the Wyvern, Sean Connelly, Imad Khwaja, Space Cruiser, Rainer Wilmers, subjectnull, Alps Aficionado, Willian Hasse, Fred von Graf, Artur Olbinski, Johann-Peter Hartmann, WelcomeToTheClub, Willem Michiel, Michael Levine, Iucharbius, Spiking Neurons AB, K, biorpg, John Villwock, Pyrater, Greatston Gnanesh, Mano Prime, Junyu Yang, Stephen Murray, John Detwiler, Luke Pendergrass, terasurfer, Pieter, zynix, Edmond Seymore, theTransient, Nathan LeClaire, vamX, Kevin Schuppel, Preetika Verma, ya boyyy, Alex, SuperWojo, Ghost, Joseph William Delisle, Matthew Berman, Talal Aujan, chris gileta, Illia Dulskyi.

Thank you to all my generous patrons and donators!

Original model card: Kaio Ken's SuperHOT 8K

Second prototype of SuperHOT with 8K context

This is the second prototype of SuperHOT, an NSFW-focused LoRA, this time 7B with 8K context and no RLHF, using the same technique described in the github blog.

Looking for Merged & Quantized Models?

Make some please :)

Using the monkey-patch?

You will NEED to apply the monkeypatch or, if you are already using the monkeypatch, change the scaling factor to 0.25 and the maximum sequence length to 8192.

The monkeypatch is only necessary if you are using a front-end/back-end that does not already support scaling and said front-end/back-end is Python-based (i.e. Huggingface Transformers). To apply the patch, you will need to copy the llama_rope_scaled_monkey_patch.py into your working directory and call the exported function replace_llama_rope_with_scaled_rope at the very start of your Python program. It will modify the Transformers library's implementation of RoPE to properly apply the scaling factor.
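
For example, a minimal sketch of what that looks like (assuming llama_rope_scaled_monkey_patch.py has been copied into the working directory, with the scaling factor and maximum sequence length set as described above; this is an untested illustration, not an official example):

# Apply the patch before any model is loaded, so the patched RoPE
# implementation is the one the Transformers library uses.
from llama_rope_scaled_monkey_patch import replace_llama_rope_with_scaled_rope
replace_llama_rope_with_scaled_rope()

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map='auto')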

Using Oobabooga with Exllama?

Switch your loader to exllama or exllama_hf. Add the arguments max_seq_len 8192 and compress_pos_emb 4. While the model may work well with compress_pos_emb 2, it was trained on 4, so that is what I advocate for you to use.

Example in the command-line:

  • python server.py --max_seq_len 8192 --compress_pos_emb 4 --loader exllama_hf

In the UI, you will see the loader option in the Models tab. Once you select either exllama or exllama_hf, the max_seq_len and compress_pos_emb settings will appear.

Training Details

I trained the LoRA with the following configuration (see the illustrative sketch after this list):

  • 1200 samples (~400 samples over 2048 sequence length)
  • learning rate of 3e-4
  • 3 epochs
  • The exported modules are:
    • q_proj
    • k_proj
    • v_proj
    • o_proj
    • no bias
  • Rank = 4
  • Alpha = 8
  • no dropout
  • weight decay of 0.1
  • AdamW beta1 of 0.9 and beta2 0.99, epsilon of 1e-5
  • Trained on 4-bit base model
  • Cutoff length: 4096
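
Purely as an illustration of how these hyperparameters might map onto a standard peft / transformers training setup (an assumption about tooling, not the script actually used to train this LoRA):

from peft import LoraConfig
from transformers import TrainingArguments

# The "exported modules" above correspond to LoRA target_modules
lora_config = LoraConfig(
    r=4,                  # Rank = 4
    lora_alpha=8,         # Alpha = 8
    lora_dropout=0.0,     # no dropout
    bias="none",          # no bias
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="superhot-lora",  # hypothetical output path
    learning_rate=3e-4,          # learning rate of 3e-4
    num_train_epochs=3,          # 3 epochs
    weight_decay=0.1,            # weight decay of 0.1
    adam_beta1=0.9,              # AdamW beta1 of 0.9
    adam_beta2=0.99,             # beta2 of 0.99
    adam_epsilon=1e-5,           # epsilon of 1e-5
)

# The cutoff length of 4096 would be applied as the tokenizer's
# max_length / truncation setting when building the training samples.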

Original model card: WizardLM's WizardLM 13B V1.1

This is the full-weight version of the WizardLM-13B V1.1 model.

Repository: https://github.com/nlpxucan/WizardLM

Twitter: https://twitter.com/WizardLM_AI/status/1677282955490918401