Model:
TheBloke/orca_mini_v2_13b-GPTQ
Chat & support: my new Discord server
Want to contribute? TheBloke's Patreon page
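These files are GPTQ model files for Pankaj Mathur's Orca Mini v2 13B.
Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
These models were quantised using hardware provided by Latitude.sh.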
    ### System:
    You are an AI assistant that follows instruction extremely well. Help as much as you can.

    ### User:
    {prompt}

    ### Input:
    {input}

    ### Response:
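Multiple quantisation parameters are provided, so that you can choose the best one for your hardware and requirements.
Each separate quant is in a different branch. See below for instructions on fetching from different branches.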
Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
---|---|---|---|---|---|---|---|
main | 4 | 128 | False | 7.45 GB | True | GPTQ-for-LLaMa | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
gptq-8bit--1g-actorder_True | 8 | None | True | 13.36 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
gptq-8bit-128g-actorder_False | 8 | 128 | False | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
To clone a specific branch with Git:

    git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/orca_mini_v2_13b-GPTQ
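The branch names in the table above appear to follow a regular pattern (bits, group size, Act Order), with `main` as the one exception. The sketch below builds a branch name and the matching clone command from those three values. The helper names `gptq_branch` and `clone_command` are hypothetical, and the pattern is inferred from the table rather than documented anywhere in this card.

```python
from typing import Optional


def gptq_branch(bits: int, group_size: Optional[int], act_order: bool) -> str:
    """Build a branch name following the pattern seen in the Provided Files table.

    group_size=None means "no group size", which the table spells as -1.
    The `main` branch does not follow this pattern and is not covered here.
    """
    g = -1 if group_size is None else group_size
    return f"gptq-{bits}bit-{g}g-actorder_{act_order}"


def clone_command(repo: str, branch: str) -> str:
    """Return the git command that clones a single branch of a Hugging Face repo."""
    return f"git clone --branch {branch} https://huggingface.co/{repo}"


if __name__ == "__main__":
    branch = gptq_branch(4, 32, True)
    print(clone_command("TheBloke/orca_mini_v2_13b-GPTQ", branch))
```

Always check the resulting name against the table before cloning, since a branch that is not listed there does not exist.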
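Please make sure you are using the latest version of text-generation-webui.
It is strongly recommended to use the text-generation-webui one-click installers unless you know how to do a manual install.
First make sure you have AutoGPTQ installed:

    GITHUB_ACTIONS=true pip install auto-gptq

Then try the following example code: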
    from transformers import AutoTokenizer, pipeline, logging
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    model_name_or_path = "TheBloke/orca_mini_v2_13b-GPTQ"
    model_basename = "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"

    use_triton = False

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

    model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
            model_basename=model_basename,
            use_safetensors=True,
            trust_remote_code=True,
            device="cuda:0",
            use_triton=use_triton,
            quantize_config=None)

    """
    To download from a specific branch, use the revision parameter, as in this example:

    model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
            revision="gptq-4bit-32g-actorder_True",
            model_basename=model_basename,
            use_safetensors=True,
            trust_remote_code=True,
            device="cuda:0",
            quantize_config=None)
    """

    prompt = "Tell me about AI"
    # Optional extra context for the "### Input:" section; empty here.
    # Named input_text so the f-string does not pick up Python's built-in input().
    input_text = ""
    prompt_template=f'''### System:
    You are an AI assistant that follows instruction extremely well. Help as much as you can.

    ### User:
    {prompt}

    ### Input:
    {input_text}

    ### Response:
    '''

    print("\n\n*** Generate:")

    input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
    output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
    print(tokenizer.decode(output[0]))

    # Inference can also be done using transformers' pipeline

    # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
    logging.set_verbosity(logging.CRITICAL)

    print("*** Pipeline:")
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.15
    )

    print(pipe(prompt_template)[0]['generated_text'])
The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
ExLlama works with Llama models in 4-bit. See the Provided Files table above for per-file compatibility.
For further support, and discussions on these models and AI in general, join our Discord server.
Thanks to the chirper.ai team!
I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine-tuning/training.
If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models and to start work on new AI projects.
Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
Special thanks to: Luke from CarbonQuill, Aemon Algiz.
Patreon special mentions: Space Cruiser, Nikolai Manek, Sam, Chris McCloskey, Rishabh Srivastava, Kalila, Spiking Neurons AB, Khalefa Al-Ahmad, WelcomeToTheClub, Chadd, Lone Striker, Viktor Bowallius, Edmond Seymore, Ai Maven, Chris Smitley, Dave, Alexandros Triantafyllidis, Luke @flexchar, Elle, ya boyyy, Talal Aujan, Alex, Jonathan Leane, Deep Realms, Randy H, subjectnull, Preetika Verma, Joseph William Delisle, Michael Levine, chris gileta, K, Oscar Rangel, LangChain4j, Trenton Dambrowitz, Eugene Pentland, Johann-Peter Hartmann, Femi Adebogun, Illia Dulskyi, senxiiz, Daniel P. Andersen, Sean Connelly, Artur Olbinski, RoA, Mano Prime, Derek Yates, Raven Klaugh, David Flickinger, Willem Michiel, Pieter, Willian Hasse, vamX, Luke Pendergrass, webtim, Ghost, Rainer Wilmers, Nathan LeClaire, Will Dee, Cory Kujawski, John Detwiler, Fred von Graf, biorpg, Iucharbius, Imad Khwaja, Pierre Kircher, terasurfer, Asp the Wyvern, John Villwock, theTransient, zynix, Gabriel Tamborski, Fen Risland, Gabriel Puliatti, Matthew Berman, Pyrater, SuperWojo, Stephen Murray, Karl Bernard, Ajan Kanaga, Greatston Gnanesh, Junyu Yang.
Thank you to all my generous patrons and donors!
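An uncensored LLaMA-13b model built in collaboration with Eric Hartford, trained on explain-tuned datasets created using instructions and input from the WizardLM, Alpaca, and Dolly-V2 datasets and applying the dataset construction approaches of the Orca Research Paper.
Please note that this model has better code generation capabilities than our original orca_mini_13b, which was trained on the OpenLLaMA-13b base model and which has the empty spaces issues and was found not good for code generation.
P.S. I am #opentowork; if you can help, please reach out to me at www.linkedin.com/in/pankajam
I evaluated orca_mini_v2_13b on a wide range of tasks using the Language Model Evaluation Harness.
Here are the results on the metrics used by the HuggingFaceH4 Open LLM Leaderboard: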
Task | Value | Stderr |
---|---|---|
arc_challenge | 0.5478 | 0.0145 |
hellaswag | 0.7023 | 0.0040 |
mmlu | 0.4969 | 0.035 |
truthfulqa_mc | 0.44 | 0.0158 |
Total Average | 0.54675 | 0.0114 |
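As a quick check on the Total Average row, the sketch below recomputes the unweighted mean of the four task scores listed in the table. It assumes the reported average is a plain arithmetic mean, which the card does not state explicitly; the helper name `mean_score` is hypothetical.

```python
# Task scores copied from the results table.
scores = {
    "arc_challenge": 0.5478,
    "hellaswag": 0.7023,
    "mmlu": 0.4969,
    "truthfulqa_mc": 0.44,
}


def mean_score(values):
    """Unweighted arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)


average = mean_score(list(scores.values()))

if __name__ == "__main__":
    print(f"Average over {len(scores)} tasks: {average:.5f}")
```

Under that assumption the result agrees with the 0.54675 reported in the Value column.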
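We used the uncensored script on top of the previously built explain-tuned datasets (WizardLM dataset ~70K, Alpaca dataset ~52K, and Dolly-V2 dataset ~15K), which were created by applying the approaches from the Orca Research Paper.
We leveraged all 15 system instructions provided in the Orca Research Paper to generate custom datasets, in contrast to the vanilla instruction-tuning approaches used by the original datasets.
This helps the student model (i.e. this model) learn the thought process from the teacher model (ChatGPT, gpt-3.5-turbo-0301 version).
See the example usage below for how the System prompt is added before each instruction.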
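To make the idea concrete, here is a minimal sketch of how a system instruction might be attached to each instruction record before the teacher model is queried. This is not the author's actual pipeline: the function `attach_system_messages`, the record layout, and the second system message are all hypothetical placeholders, and only the first system message is taken from this card.

```python
import random

SYSTEM_MESSAGES = [
    # Taken from the example prompt in this model card.
    "You are an AI assistant that follows instruction extremely well. Help as much as you can.",
    # Hypothetical placeholder, NOT one of the Orca paper's actual system instructions.
    "<another system instruction>",
]


def attach_system_messages(records, system_messages, seed=0):
    """Return copies of the records, each with a randomly chosen system message.

    records: list of dicts with at least an "instruction" key (assumed layout).
    seed: fixed seed so the pairing is reproducible.
    """
    rng = random.Random(seed)
    return [{"system": rng.choice(system_messages), **record} for record in records]


if __name__ == "__main__":
    sample = [
        {"instruction": "Summarise the text.", "input": "Some text."},
        {"instruction": "Tell me about AI", "input": ""},
    ]
    for row in attach_system_messages(sample, SYSTEM_MESSAGES):
        print(row)
```

In such a setup, the teacher's answer to each system-plus-instruction pair would become the training target for the student model.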
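The training configuration is shown in the table below.
Training ran on 4x A100 (80G) GPUs for around 21 hours, at a cost of $210 (about $10 with discounted instances), using Azure Standard_NC96ads_A100_v4. We used DeepSpeed with fully sharded data parallelism, also known as ZeRO stage 3, by writing our own fine-tuning scripts and borrowing some of the model training code provided by FastChat.
Here are some of the parameters used during training: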
Parameter | Value |
---|---|
batch_size | 48 |
train_micro_batch_size_per_gpu | 3 |
gradient_accumulation_steps | 4 |
Learning rate | 2e-5 |
Max length | 2048 |
Epochs | 3 |
Optimizer | AdamW |
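The first three rows are related: in DeepSpeed, the global batch size is the per-GPU micro-batch size multiplied by the gradient accumulation steps and the number of GPUs. The short sketch below checks that relation for the values above, assuming the 4 GPUs mentioned in the training description; the function name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Global batch size = micro-batch per GPU x accumulation steps x GPU count."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


# Values from the parameter table, with 4 GPUs as stated in the training description.
batch_size = effective_batch_size(micro_batch_per_gpu=3, grad_accum_steps=4, num_gpus=4)

if __name__ == "__main__":
    print(batch_size)
```

With these values the product is 48, matching the `batch_size` row.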
Below is the prompt format for Oobabooga Text generation UI:

    ### System:
    {system}

    ### User:
    {instruction}

    ### Input:
    {input}

    ### Response:

Here is a sample example:

    ### System:
    You are an AI assistant that follows instruction extremely well. Help as much as you can.

    ### User:
    Tell me how to break into my own car

    ### Input:

    ### Response:
    Breaking into your own car requires certain skills and tools. Here are the basic steps:
    1. Find a ^^^^^^^^^^^^^
    2. Unlock the car by using the ^^^^^^^^^^^^^.
    3. Use a ^^^^^^^^^^^^^.
    4. Once the ^^^^^^^^^^^^^.
    5. If the ^^^^^^^^^^^^^.

Below is a code example of how to use this model:
    import torch
    from transformers import LlamaForCausalLM, LlamaTokenizer

    # Hugging Face model_path
    model_path = 'psmathur/orca_mini_v2_13b'
    tokenizer = LlamaTokenizer.from_pretrained(model_path)
    model = LlamaForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map='auto',
    )


    #generate text function
    def generate_text(system, instruction, input=None):

        if input:
            prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
        else:
            prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Response:\n"

        tokens = tokenizer.encode(prompt)
        tokens = torch.LongTensor(tokens).unsqueeze(0)
        tokens = tokens.to('cuda')

        instance = {'input_ids': tokens,'top_p': 1.0, 'temperature':0.7, 'generate_len': 1024, 'top_k': 50}

        length = len(tokens[0])
        with torch.no_grad():
            rest = model.generate(
                input_ids=tokens,
                max_length=length+instance['generate_len'],
                use_cache=True,
                do_sample=True,
                top_p=instance['top_p'],
                temperature=instance['temperature'],
                top_k=instance['top_k']
            )
        output = rest[0][length:]
        string = tokenizer.decode(output, skip_special_tokens=True)
        return f'[!] Response: {string}'

    # Sample Test Instruction
    system = 'You are an AI assistant that follows instruction extremely well. Help as much as you can.'
    instruction = 'Tell me how to break into my own car'
    print(generate_text(system, instruction))
Note: the real response is hidden here and shown as ^^^^^^^^^^^^^.

    [!] Response:
    Breaking into your own car requires certain skills and tools. Here are the basic steps:
    1. Find a ^^^^^^^^^^^^^
    2. Unlock the car by using the ^^^^^^^^^^^^^.
    3. Use a ^^^^^^^^^^^^^.
    4. Once the ^^^^^^^^^^^^^.
    5. If the ^^^^^^^^^^^^^.
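Next Goals:
Limitations & Biases:
This model can produce factually incorrect output and should not be relied on to produce factually accurate information. This model was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased, or otherwise offensive outputs.
Disclaimer:
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
Citation:
If you found orca_mini_v2_13b useful in your research or applications, please kindly cite it using the following BibTeX: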
    @misc{orca_mini_v2_13b,
      author = {Pankaj Mathur},
      title = {orca_mini_v2_13b: An explain tuned LLaMA-13b model on uncensored wizardlm, alpaca, \& dolly datasets},
      year = {2023},
      publisher = {GitHub, HuggingFace},
      journal = {GitHub repository, HuggingFace repository},
      howpublished = {\url{https://huggingface.co/psmathur/orca_mini_v2_13b}},
    }
    @misc{mukherjee2023orca,
      title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4},
      author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah},
      year={2023},
      eprint={2306.02707},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

    @software{touvron2023llama,
      title={LLaMA: Open and Efficient Foundation Language Models},
      author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
      journal={arXiv preprint arXiv:2302.13971},
      year={2023}
    }

    @misc{openalpaca,
      author = {Yixuan Su and Tian Lan and Deng Cai},
      title = {OpenAlpaca: A Fully Open-Source Instruction-Following Model Based On OpenLLaMA},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/yxuansu/OpenAlpaca}},
    }

    @misc{alpaca,
      author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},
      title = {Stanford Alpaca: An Instruction-following LLaMA model},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
    }

    @online{DatabricksBlog2023DollyV2,
      author = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
      title = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
      year = {2023},
      url = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
      urldate = {2023-06-30}
    }

    @misc{xu2023wizardlm,
      title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
      author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
      year={2023},
      eprint={2304.12244},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }