Model: TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16
Chat & support: my new Discord server
Want to contribute? TheBloke's Patreon page
These are fp16 PyTorch format model files for WizardLM's WizardLM 13B V1.1 merged with Kaio Ken's SuperHOT 8K.
Kaio Ken's SuperHOT 13B LoRA is merged onto the base model, and 8K context is then achieved during inference by using trust_remote_code=True.
Note that config.json has been set to a sequence length of 8192. You can modify this to 4096 if you want to try a smaller sequence length.
First make sure you have Einops installed:

```
pip3 install einops
```
Then run the following code. config.json defaults to a sequence length of 8192, but you can also configure this in your Python code.
The provided modelling code, activated with trust_remote_code=True, will automatically set the scale parameter from the configured max_position_embeddings. For example, for 8192 the scale is set to 4.
```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, pipeline

model_name_or_path = "TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
# Change this to the sequence length you want
config.max_position_embeddings = 8192

model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
        config=config,
        trust_remote_code=True,
        device_map='auto')

# Note: check that this prompt template is correct for this model!
prompt = "Tell me about AI"
prompt_template = f'''USER: {prompt}
ASSISTANT:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
```
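For reference, the scale is simply the extended context divided by LLaMA's original 2048-token context. A minimal sketch (the helper rope_scale is just for illustration and is not part of the repository):

```python
# Illustrative helper (not part of the repository): the RoPE scale is the
# extended context length divided by LLaMA's original 2048-token context.
def rope_scale(max_position_embeddings: int, base_context: int = 2048) -> float:
    return max_position_embeddings / base_context

print(rope_scale(8192))  # 4.0 -- matches scale 4 / compress_pos_emb 4 below
print(rope_scale(4096))  # 2.0 -- use this if you drop the sequence length to 4096
```

The monkey-patch route described below expresses the same thing as a scaling factor of 1/scale, i.e. 0.25 for 8K.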
The repository includes llama_rope_scaled_monkey_patch.py, written by @kaiokendev.
In theory it can be added to any Python UI or custom code to achieve the same result as trust_remote_code=True. I have not tested this, and it should be superseded by using trust_remote_code=True, but I include it for completeness and interest.
For further support, and discussions on these models and AI in general, join us at:
Thanks to the chirper.ai team!
I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
Special thanks to: Luke from CarbonQuill, Aemon Algiz.
Patreon special mentions: RoA, Lone Striker, Gabriel Puliatti, Derek Yates, Randy H, Jonathan Leane, Eugene Pentland, Karl Bernard, Viktor Bowallius, senxiiz, Daniel P. Andersen, Pierre Kircher, Deep Realms, Cory Kujawski, Oscar Rangel, Fen Risland, Ajan Kanaga, LangChain4j, webtim, Nikolai Manek, Trenton Dambrowitz, Raven Klaugh, Kalila, Khalefa Al-Ahmad, Chris McCloskey, Luke @flexchar, Ai Maven, Dave, Asp the Wyvern, Sean Connelly, Imad Khwaja, Space Cruiser, Rainer Wilmers, subjectnull, Alps Aficionado, Willian Hasse, Fred von Graf, Artur Olbinski, Johann-Peter Hartmann, WelcomeToTheClub, Willem Michiel, Michael Levine, Iucharbius, Spiking Neurons AB, K, biorpg, John Villwock, Pyrater, Greatston Gnanesh, Mano Prime, Junyu Yang, Stephen Murray, John Detwiler, Luke Pendergrass, terasurfer, Pieter, zynix, Edmond Seymore, theTransient, Nathan LeClaire, vamX, Kevin Schuppel, Preetika Verma, ya boyyy, Alex, SuperWojo, Ghost, Joseph William Delisle, Matthew Berman, Talal Aujan, chris gileta, Illia Dulskyi.
Thank you to all my generous patrons and donaters!
This is a second prototype of SuperHOT, an NSFW-focused LoRA, this time 7B with 8K context and no RLHF, using the same technique described in the github blog.
Looking for Merged & Quantized Models? Make some please :)
Using the monkey-patch? You will NEED to apply the monkeypatch or, if you are already using the monkeypatch, change the scaling factor to 0.25 and the maximum sequence length to 8192.
The monkeypatch is only necessary if you are using a front-end/back-end that does not already support scaling and said front-end/back-end is Python-based (i.e. Huggingface Transformers). To apply the patch, you will need to copy the llama_rope_scaled_monkey_patch.py into your working directory and call the exported function replace_llama_rope_with_scaled_rope at the very start of your Python program. It will modify the Transformers library's implementation of RoPE to properly apply the scaling factor.
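A minimal sketch of what that looks like in practice, assuming the patch file sits in your working directory and that the exported function takes no arguments (check the file for its exact signature):

```python
# Hypothetical usage sketch of @kaiokendev's monkey patch -- apply it before the model is loaded.
from llama_rope_scaled_monkey_patch import replace_llama_rope_with_scaled_rope
from transformers import AutoTokenizer, AutoModelForCausalLM

replace_llama_rope_with_scaled_rope()  # patches Transformers' LLaMA RoPE to apply the scaling factor

model_name_or_path = "TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map='auto')
```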
Using Oobabooga with Exllama? Switch your loader to exllama or exllama_hf, and add the arguments max_seq_len 8192 and compress_pos_emb 4. While the model may work well with compress_pos_emb 2, it was trained on 4, so that is what I advocate you use.
Example in the command-line:
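A sketch of what this might look like, assuming text-generation-webui's server.py entry point and the flags named above:

```
python server.py --loader exllama_hf --max_seq_len 8192 --compress_pos_emb 4
```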
In the UI, you will see the loader option in the Models tab. Once you select either exllama or exllama_hf, the max_seq_len and compress_pos_emb settings will appear.
Training Details
I trained the LoRA with the following configuration:
This is the full-weight release of the WizardLM-13B V1.1 model.
Repository: https://github.com/nlpxucan/WizardLM
Twitter: https://twitter.com/WizardLM_AI/status/1677282955490918401