
Chat & support: my new Discord server

Want to contribute? TheBloke's Patreon page

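Pankaj Mathur's Orca Mini v2 13B GPTQ

These files are GPTQ model files for Pankaj Mathur's Orca Mini v2 13B.

Multiple GPTQ parameter permutations are provided; see Provided files below for details of the options provided, their parameters, and the software used to create them.

These models were quantized using hardware provided by Latitude.sh.

Repositories available

Prompt template: orca_mini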

### System:
You are an AI assistant that follows instruction extremely well. Help as much as you can.

### User:
{prompt}

### Input:
{input}

### Response:
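
As a minimal sketch, the template above can be filled in with a small helper. The function name `build_orca_prompt` and its parameters are hypothetical (they are not part of any library), and leaving out the `### Input:` section when there is no input is an assumption that follows the `generate_text` example in the original model card further down.

```python
from typing import Optional

DEFAULT_SYSTEM = (
    "You are an AI assistant that follows instruction extremely well. "
    "Help as much as you can."
)


def build_orca_prompt(prompt: str, input_text: Optional[str] = None,
                      system: str = DEFAULT_SYSTEM) -> str:
    """Fill in the orca_mini template (hypothetical helper, for illustration)."""
    parts = [f"### System:\n{system}", f"### User:\n{prompt}"]
    if input_text:
        # Assumption: the Input section is only included when there is input.
        parts.append(f"### Input:\n{input_text}")
    # The prompt ends with the Response header so the model continues from it.
    return "\n\n".join(parts) + "\n\n### Response:\n"


print(build_orca_prompt("Tell me about AI"))
```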

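Provided files

Multiple quantization parameters are provided, so that you can choose the best one for your hardware and requirements.

Each separate quantization is in a different branch. See below for instructions on fetching from different branches.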

| Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| main | 4 | 128 | False | 7.45 GB | True | GPTQ-for-LLaMa | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
| gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-8bit--1g-actorder_True | 8 | None | True | 13.36 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
| gptq-8bit-128g-actorder_False | 8 | 128 | False | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
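
One way to narrow down the options is to encode the table as data and filter it. The sketch below is only an illustration: the `Quant` class and `candidates` function are hypothetical names, and file size is used as a rough stand-in for memory needs (actual VRAM usage will differ from the file size).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Quant:
    branch: str
    bits: int
    group_size: Optional[int]  # None means no group size
    act_order: bool
    size_gb: float
    exllama: bool


# Values copied from the table above.
QUANTS: List[Quant] = [
    Quant("main", 4, 128, False, 7.45, True),
    Quant("gptq-4bit-32g-actorder_True", 4, 32, True, 8.00, True),
    Quant("gptq-4bit-64g-actorder_True", 4, 64, True, 7.51, True),
    Quant("gptq-4bit-128g-actorder_True", 4, 128, True, 7.26, True),
    Quant("gptq-8bit--1g-actorder_True", 8, None, True, 13.36, False),
    Quant("gptq-8bit-128g-actorder_False", 8, 128, False, 13.65, False),
]


def candidates(max_size_gb: float, need_exllama: bool = False) -> List[Quant]:
    """Return branches whose file fits the size budget, largest file first."""
    fits = [q for q in QUANTS
            if q.size_gb <= max_size_gb and (q.exllama or not need_exllama)]
    return sorted(fits, key=lambda q: q.size_gb, reverse=True)


for q in candidates(max_size_gb=8.0, need_exllama=True):
    print(q.branch, q.size_gb)
```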

How to download from branches

  • In text-generation-webui, you can add :branch to the end of the download name, e.g. TheBloke/orca_mini_v2_13b-GPTQ:gptq-4bit-32g-actorder_True
  • With Git, you can clone a branch with:
git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/orca_mini_v2_13b-GPTQ
  • In Python Transformers code, the branch is the revision parameter; see below.
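
The first two options above differ only in how the repository and branch names are combined. A small sketch, using hypothetical helper names (`webui_download_name`, `git_clone_command`) that are not part of any tool, makes the two formats explicit:

```python
from typing import List, Optional

HF_BASE = "https://huggingface.co"


def webui_download_name(repo_id: str, branch: Optional[str] = None) -> str:
    """Name to enter in text-generation-webui: repo, plus ':branch' if given."""
    return f"{repo_id}:{branch}" if branch else repo_id


def git_clone_command(repo_id: str, branch: Optional[str] = None) -> List[str]:
    """Argument list for cloning one branch of the repository with Git."""
    cmd = ["git", "clone"]
    if branch:
        cmd += ["--branch", branch]
    return cmd + [f"{HF_BASE}/{repo_id}"]


repo = "TheBloke/orca_mini_v2_13b-GPTQ"
branch = "gptq-4bit-32g-actorder_True"
print(webui_download_name(repo, branch))
print(" ".join(git_clone_command(repo, branch)))
```

The argument list returned by `git_clone_command` could be passed to `subprocess.run` if you want to run the clone from Python.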

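How to easily download and use this model in text-generation-webui

Please make sure you are using the latest version of text-generation-webui.

It is strongly recommended to use the text-generation-webui one-click installers unless you know how to do a manual install.

  • Click the Model tab.
  • Under Download custom model or LoRA, enter TheBloke/orca_mini_v2_13b-GPTQ.
    • To download from a specific branch, enter for example TheBloke/orca_mini_v2_13b-GPTQ:gptq-4bit-32g-actorder_True
    • See Provided files above for the list of branches for each option.
  • Click Download.
  • The model will start downloading. Once it has finished, it will say "Done".
  • In the top left, click the refresh icon next to Model.
  • In the Model dropdown, choose the model you just downloaded: orca_mini_v2_13b-GPTQ
  • The model will load automatically and is now ready for use!
  • If you want any custom settings, set them, then click Save settings for this model followed by Reload the Model in the top right.
    • Note that you no longer need to set GPTQ parameters manually. They are set automatically from the file quantize_config.json.
  • Once you're ready, click the Text Generation tab and enter a prompt to get started!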
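
To give a rough idea of what "set automatically from quantize_config.json" means, the sketch below parses a sample config and describes it. The JSON shown is illustrative only: its values mirror the main branch row of the Provided files table, the real file may contain additional fields, and `describe_quantization` is a hypothetical helper.

```python
import json

# Illustrative contents only: the values mirror the "main" row of the
# Provided files table; the real file may contain additional fields.
SAMPLE_QUANTIZE_CONFIG = """
{
  "bits": 4,
  "group_size": 128,
  "desc_act": false
}
"""


def describe_quantization(raw_json: str) -> str:
    """Summarise the GPTQ parameters found in a quantize_config.json string."""
    cfg = json.loads(raw_json)
    # Assumption: a group_size of -1 (or a missing value) means "no group size".
    group = cfg.get("group_size", -1)
    group_text = "no group size" if group in (-1, None) else f"group size {group}"
    act_order = "with" if cfg.get("desc_act", False) else "without"
    return f"{cfg['bits']}-bit, {group_text}, {act_order} Act Order"


print(describe_quantization(SAMPLE_QUANTIZE_CONFIG))
```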
    How to use this GPTQ model from Python code

    First make sure you have AutoGPTQ installed:

    GITHUB_ACTIONS=true pip install auto-gptq

    Then try the following example code:

    from transformers import AutoTokenizer, pipeline, logging
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    
    model_name_or_path = "TheBloke/orca_mini_v2_13b-GPTQ"
    model_basename = "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
    
    use_triton = False
    
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
    
    model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
            model_basename=model_basename,
            use_safetensors=True,
            trust_remote_code=True,
            device="cuda:0",
            use_triton=use_triton,
            quantize_config=None)
    
    """
    To download from a specific branch, use the revision parameter, as in this example:
    
    model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
            revision="gptq-4bit-32g-actorder_True",
            model_basename=model_basename,
            use_safetensors=True,
            trust_remote_code=True,
            device="cuda:0",
            quantize_config=None)
    """
    
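    prompt = "Tell me about AI"
    # Optional extra context for the "### Input:" section. Without this variable,
    # {input} in the f-string would format Python's built-in input function.
    prompt_input = ""
    prompt_template=f'''### System:
    You are an AI assistant that follows instruction extremely well. Help as much as you can.
    
    ### User:
    {prompt}
    
    ### Input:
    {prompt_input}
    
    ### Response:
    '''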
    
    print("\n\n*** Generate:")
    
    input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
    # do_sample=True enables sampling, so that temperature takes effect
    output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
    print(tokenizer.decode(output[0]))
    
    # Inference can also be done using transformers' pipeline
    
    # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
    logging.set_verbosity(logging.CRITICAL)
    
    print("*** Pipeline:")
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.15
    )
    
    print(pipe(prompt_template)[0]['generated_text'])
    

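    Compatibility

    The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.

    ExLlama works with Llama models in 4-bit. See the Provided files table above for per-file compatibility.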

    Discord

    For further support, and discussions on these models and AI in general, join us at:

    TheBloke AI's Discord server

    Thanks, and how to contribute

    Thanks to the chirper.ai team!

    I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine-tuning/training.

    If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models and to start work on new AI projects.

    Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

    Special thanks to: Luke from CarbonQuill, Aemon Algiz.

    Patreon special mentions: Space Cruiser, Nikolai Manek, Sam, Chris McCloskey, Rishabh Srivastava, Kalila, Spiking Neurons AB, Khalefa Al-Ahmad, WelcomeToTheClub, Chadd, Lone Striker, Viktor Bowallius, Edmond Seymore, Ai Maven, Chris Smitley, Dave, Alexandros Triantafyllidis, Luke @flexchar, Elle, ya boyyy, Talal Aujan, Alex, Jonathan Leane, Deep Realms, Randy H, subjectnull, Preetika Verma, Joseph William Delisle, Michael Levine, chris gileta, K, Oscar Rangel, LangChain4j, Trenton Dambrowitz, Eugene Pentland, Johann-Peter Hartmann, Femi Adebogun, Illia Dulskyi, senxiiz, Daniel P. Andersen, Sean Connelly, Artur Olbinski, RoA, Mano Prime, Derek Yates, Raven Klaugh, David Flickinger, Willem Michiel, Pieter, Willian Hasse, vamX, Luke Pendergrass, webtim, Ghost, Rainer Wilmers, Nathan LeClaire, Will Dee, Cory Kujawski, John Detwiler, Fred von Graf, biorpg, Iucharbius, Imad Khwaja, Pierre Kircher, terasurfer, Asp the Wyvern, John Villwock, theTransient, zynix, Gabriel Tamborski, Fen Risland, Gabriel Puliatti, Matthew Berman, Pyrater, SuperWojo, Stephen Murray, Karl Bernard, Ajan Kanaga, Greatston Gnanesh, Junyu Yang.

    Thank you to all my generous patrons and donors!

    Original model card: Pankaj Mathur's Orca Mini v2 13B

    orca_mini_v2_13b

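    An uncensored LLaMA-13b model, built in collaboration with Eric Hartford, trained on data created using instructions and inputs from the WizardLM, Alpaca, and Dolly-V2 datasets and applying the dataset construction approaches of the Orca Research Paper.

    Please note that this model has better code generation capabilities than our original orca_mini_13b, which was trained on the OpenLLaMA-13b base model and had empty-spaces issues that made it a poor fit for code generation.

    P.S. I am #opentowork; if you can help, please reach out to me at www.linkedin.com/in/pankajam

    Evaluation

    I evaluated orca_mini_v2_13b on a wide range of tasks using the Language Model Evaluation Harness.

    Here are the results on the metrics used by the HuggingFaceH4 Open LLM Leaderboard: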

    | Task | Value | Stderr |
    | --- | --- | --- |
    | arc_challenge | 0.5478 | 0.0145 |
    | hellaswag | 0.7023 | 0.0040 |
    | mmlu | 0.4969 | 0.035 |
    | truthfulqa_mc | 0.44 | 0.0158 |
    | Total Average | 0.54675 | 0.0114 |
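
As a quick check on how the Total Average value is obtained, the snippet below takes the unweighted mean of the four task scores. It assumes a plain arithmetic mean, which is what the reported 0.54675 corresponds to; the Stderr column is not recomputed here.

```python
# Scores copied from the results above.
scores = {
    "arc_challenge": 0.5478,
    "hellaswag": 0.7023,
    "mmlu": 0.4969,
    "truthfulqa_mc": 0.44,
}

# Unweighted mean across the four tasks.
total_average = sum(scores.values()) / len(scores)
print(round(total_average, 5))
```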

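    Dataset

    We applied an uncensoring script on top of the previously built explain-tuned datasets (WizardLM dataset ~70K, Alpaca dataset ~52K, and Dolly-V2 dataset ~15K), following the approach from the Orca Research Paper.

    We leverage all 15 system instructions provided in the Orca Research Paper to generate the custom datasets, in contrast to the vanilla instruction-tuning approaches used by the original datasets.

    This helps the student model (i.e., this model) learn the thought process from the teacher model, which is ChatGPT (gpt-3.5-turbo-0301 version).

    See the example usage below for how the System prompt is added before each instruction.

    Training

    The training configurations are provided in the table below.

    Training ran on 4x A100 (80G) GPUs for around 21 hours, at a cost of $210 (about $10 with a discounted spot instance), using Azure Standard_NC96ads_A100_v4. We used DeepSpeed with fully sharded data parallelism, also known as ZeRO stage 3, writing our own fine-tuning scripts and leveraging some of the model training code provided by FastChat.

    Here are some of the parameters used during training: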

    batch_size 48
    train_micro_batch_size_per_gpu 3
    gradient_accumulation_steps 4
    Learning rate 2e-5
    Max length 2048
    Epochs 3
    Optimizer AdamW
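
The batch-size figures above are consistent with each other: the global batch size is the micro batch per GPU times the gradient accumulation steps times the number of GPUs. The sketch below arranges the listed values as a DeepSpeed-style config dictionary and checks that relationship. It is a hypothetical reconstruction, not the author's actual config file; mapping batch_size to `train_batch_size` and using ZeRO stage 3 for fully sharded data parallelism are assumptions.

```python
NUM_GPUS = 4  # from the training description: 4x A100 (80G)

# Hypothetical DeepSpeed-style config assembled from the listed parameters;
# not the author's actual file.
ds_config = {
    "train_batch_size": 48,
    "train_micro_batch_size_per_gpu": 3,
    "gradient_accumulation_steps": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "zero_optimization": {"stage": 3},  # assumption: fully sharded = ZeRO stage 3
}


def effective_batch_size(cfg: dict, num_gpus: int) -> int:
    """Global batch size = micro batch per GPU x accumulation steps x GPUs."""
    return (cfg["train_micro_batch_size_per_gpu"]
            * cfg["gradient_accumulation_steps"]
            * num_gpus)


# 3 x 4 x 4 must match the configured global batch size.
assert effective_batch_size(ds_config, NUM_GPUS) == ds_config["train_batch_size"]
print(effective_batch_size(ds_config, NUM_GPUS))
```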

    Example usage

    Below is the prompt format for the Oobabooga Text generation UI:

    ### System:
    {system}
    
    ### User:
    {instruction}
    
    ### Input:
    {input}
    
    ### Response:
    

    Here is a sample example:

    ### System:
    You are an AI assistant that follows instruction extremely well. Help as much as you can.
    
    ### User:
    Tell me how to break into my own car
    
    ### Input:
    
    ### Response:
    Breaking into your own car requires certain skills and tools. Here are the basic steps:
    
    1. Find a ^^^^^^^^^^^^^
    2. Unlock the car by using the ^^^^^^^^^^^^^.
    3. Use a ^^^^^^^^^^^^^.
    4. Once the ^^^^^^^^^^^^^.
    5. If the ^^^^^^^^^^^^^.
    

    Below is a code example showing how to use this model:

    import torch
    from transformers import LlamaForCausalLM, LlamaTokenizer
    
    # Hugging Face model_path
    model_path = 'psmathur/orca_mini_v2_13b'
    tokenizer = LlamaTokenizer.from_pretrained(model_path)
    model = LlamaForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map='auto',
    )
    
    
    #generate text function
    def generate_text(system, instruction, input=None):
        
        if input:
            prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
        else:
            prompt = f"### System:\n{system}\n\n### User:\n{instruction}\n\n### Response:\n"
        
        tokens = tokenizer.encode(prompt)
        tokens = torch.LongTensor(tokens).unsqueeze(0)
        tokens = tokens.to('cuda')
    
        instance = {'input_ids': tokens,'top_p': 1.0, 'temperature':0.7, 'generate_len': 1024, 'top_k': 50}
    
        length = len(tokens[0])
        with torch.no_grad():
            rest = model.generate(
                input_ids=tokens, 
                max_length=length+instance['generate_len'], 
                use_cache=True, 
                do_sample=True, 
                top_p=instance['top_p'],
                temperature=instance['temperature'],
                top_k=instance['top_k']
            )    
        output = rest[0][length:]
        string = tokenizer.decode(output, skip_special_tokens=True)
        return f'[!] Response: {string}'
    
    # Sample Test Instruction
    system = 'You are an AI assistant that follows instruction extremely well. Help as much as you can.'
    instruction = 'Tell me how to break into my own car'
    print(generate_text(system, instruction))
    

    Note: the real response is hidden here and shown as ^^^^^^^^^^^^^.

    [!] Response:
    Breaking into your own car requires certain skills and tools. Here are the basic steps:
    
    1. Find a ^^^^^^^^^^^^^
    2. Unlock the car by using the ^^^^^^^^^^^^^.
    3. Use a ^^^^^^^^^^^^^.
    4. Once the ^^^^^^^^^^^^^.
    5. If the ^^^^^^^^^^^^^.
    

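    Next goals:

  • Try more data, such as actually using FLAN-v2, just like the Orca Research Paper (I am open to suggestions)
  • Provide more options for the text generation UI (perhaps https://github.com/oobabooga/text-generation-webui can help here)
  • Provide 4-bit GGML/GPTQ quantized models (perhaps TheBloke can help here)

    Limitations and biases:

    This model can produce factually incorrect output and should not be relied on to produce factually accurate information. It was trained on various public datasets. While great efforts have been made to clean the pretraining data, it is possible that this model could generate lewd, biased, or otherwise offensive outputs.

    Disclaimer:

    The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

    Citation:

    If you found orca_mini_v2_13b useful in your research or applications, please cite it using the following BibTeX: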

    @misc{orca_mini_v2_13b,
      author = {Pankaj Mathur},
      title = {orca_mini_v2_13b: An explain tuned LLaMA-13b model on uncensored wizardlm, alpaca, & dolly datasets},
      year = {2023},
      publisher = {GitHub, HuggingFace},
      journal = {GitHub repository, HuggingFace repository},
      howpublished = {\url{https://huggingface.co/psmathur/orca_mini_v2_13b}},
    }
    
    @misc{mukherjee2023orca,
          title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4}, 
          author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah},
          year={2023},
          eprint={2306.02707},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
    
    @software{touvron2023llama,
      title={LLaMA: Open and Efficient Foundation Language Models},
      author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
      journal={arXiv preprint arXiv:2302.13971},
      year={2023}
    }
    
    @misc{openalpaca,
      author = {Yixuan Su and Tian Lan and Deng Cai},
      title = {OpenAlpaca: A Fully Open-Source Instruction-Following Model Based On OpenLLaMA},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/yxuansu/OpenAlpaca}},
    }
    
    @misc{alpaca,
      author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
      title = {Stanford Alpaca: An Instruction-following LLaMA model},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
    }
    
    @online{DatabricksBlog2023DollyV2,
        author    = {Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin},
        title     = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
        year      = {2023},
        url       = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm},
        urldate   = {2023-06-30}
    }
    
    @misc{xu2023wizardlm,
          title={WizardLM: Empowering Large Language Models to Follow Complex Instructions}, 
          author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
          year={2023},
          eprint={2304.12244},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }