TheBloke/open-llama-7b-open-instruct-GPTQ

Chat & support: my new Discord server

Want to contribute? TheBloke's Patreon page

VMWare's open-llama-7B-open-instruct GPTQ

These files are GPTQ 4bit model files for VMWare's open-llama-7B-open-instruct.

It is the result of quantising to 4bit using GPTQ-for-LLaMa.

Repositories available

Prompt template

Standard Alpaca:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction: prompt
### Response:
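As a quick illustration of how this template is typically filled in before being passed to the tokenizer, here is a minimal Python sketch (the variable names are only for illustration, not part of any API):

    # Minimal sketch: fill the Alpaca-style template with an instruction.
    # "instruction" and "prompt" are illustrative variable names only.
    instruction = "Tell me about AI"

    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction: {instruction}\n"
        "### Response:"
    )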

How to easily download and use this model in text-generation-webui

Please make sure you're using the latest version of text-generation-webui.

  • Click the Model tab.
  • Under Download custom model or LoRA, enter "TheBloke/open-llama-7b-open-instruct-GPTQ".
  • Click Download. (If you prefer to download from the command line instead, see the sketch after this list.)
  • The model will start downloading. Once it's finished it will say "Done".
  • In the top left, click the refresh icon next to Model.
  • In the Model dropdown, choose the model you just downloaded: open-llama-7b-open-instruct-GPTQ
  • The model will automatically load, and is now ready for use!
  • If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.
    • Note that you no longer need to set GPTQ parameters manually. These are set automatically from the file quantize_config.json.
  • Once you're ready, click the Text Generation tab and enter a prompt to get started!
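If you would rather fetch the files from the command line, the Hugging Face Hub client can download the whole repository. This is only a sketch of one possible approach, not part of the original instructions; the local directory name below is an arbitrary choice:

    # Sketch: download the full model repo with huggingface_hub (pip install huggingface_hub).
    # The local_dir path is an arbitrary example, not a required location.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="TheBloke/open-llama-7b-open-instruct-GPTQ",
        local_dir="open-llama-7b-open-instruct-GPTQ",
    )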
How to use this GPTQ model from Python code

First make sure you have AutoGPTQ installed:

    pip install auto-gptq

Then try the following example code:

    from transformers import AutoTokenizer, pipeline, logging
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    model_name_or_path = "TheBloke/open-llama-7b-open-instruct-GPTQ"
    model_basename = "open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order"

    use_triton = False

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

    model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
            model_basename=model_basename,
            use_safetensors=True,
            trust_remote_code=True,
            device="cuda:0",
            use_triton=use_triton,
            quantize_config=None)

    # Build the prompt with the standard Alpaca template this model was trained on (see above)
    prompt = "Tell me about AI"
    prompt_template = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction: {prompt}\n"
        "### Response:"
    )

    print("\n\n*** Generate:")

    input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
    output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
    print(tokenizer.decode(output[0]))

    # Inference can also be done using transformers' pipeline

    # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
    logging.set_verbosity(logging.CRITICAL)

    print("*** Pipeline:")
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.15
    )

    print(pipe(prompt_template)[0]['generated_text'])
    

Provided files

open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order.safetensors

This will work with AutoGPTQ and CUDA versions of GPTQ-for-LLaMa. There are reports of issues with the Triton mode of recent GPTQ-for-LLaMa. If you have problems, please use AutoGPTQ instead.

It was created with group_size 128 to increase inference accuracy, but without --act-order (desc_act) to increase compatibility and improve inference speed.

  • open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order.safetensors
    • Works with AutoGPTQ in CUDA or Triton modes.
    • Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
    • Works with text-generation-webui, including one-click-installers.
    • Parameters: Groupsize = 128. Act Order / desc_act = False. (A loading sketch matching these settings follows this list.)
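For reference, the parameters above correspond to an AutoGPTQ quantisation config along the following lines. This is only an illustrative sketch: when loading the model, AutoGPTQ reads quantize_config.json from the repo automatically, so you do not need to construct this yourself.

    from auto_gptq import BaseQuantizeConfig

    # Illustrative only: mirrors the parameters listed above
    # (4-bit weights, group_size 128, no act-order / desc_act).
    quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False,
    )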

Discord

For further support, and discussions on these models and AI in general, join us at:

TheBloke AI's Discord server

Thanks, and how to contribute

Thanks to the chirper.ai team!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Special thanks to: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

Patreon special mentions: Ajan Kanaga, Kalila, Derek Yates, Sean Connelly, Luke, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, trip7s trip, Jonathan Leane, Talal Aujan, Artur Olbinski, Cory Kujawski, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Johann-Peter Hartmann.

Thank you to all my generous patrons and donaters!

Original model card: VMWare's open-llama-7B-open-instruct

VMware/open-llama-7B-open-instruct

Instruction-tuned version of the fully trained Open LLama 7B model. The model is open for commercial use.

NOTE: The model was trained using the Alpaca prompt template.

License

Nomenclature

  • Model: Open-llama
  • Model size: 7B parameters
  • Dataset: Open-instruct-v1 (oasst, dolly, hhrlhf)

Use in Transformers

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = 'VMware/open-llama-7B-open-instruct'

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='sequential')

    prompt_template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

    prompt = 'Explain in simple terms how the attention mechanism of a transformer model works'

    input_text = prompt_template.format(instruction=prompt)
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

    output1 = model.generate(input_ids, max_length=512)
    input_length = input_ids.shape[1]
    output1 = output1[:, input_length:]
    output = tokenizer.decode(output1[0])

    print(output)
    
    '''
     Attention is a mechanism used in deep learning models, such as transformer models, to capture global dependencies between different parts of the input. In a transformer model, the attention mechanism works by computing a weighted sum of the input vectors and then applying a non-linear activation function to the result.
    
    The attention mechanism in a transformer model works in two steps:
    
    1. Query-Key Mapping: First, the input sequence is divided into two parts: the query vector and the key vector. The query vector represents the input at the current position, and the key vector represents the input at a previous position.
    
    2. Attention Weight Calculation: Second, the attention weights are calculated using the dot product between the query vector and each key vector. The attention weights represent the importance of the input at the previous position to the current position.
    
    The attention weights are then used to compute the attention score for each input element. The attention score represents the relevance of the input element to the current position.
    
    The attention mechanism in a transformer model is designed to capture global dependencies between different parts of the input. By attending to input elements from different positions, the model can learn to understand the relationships between different parts of the input. This allows the model to perform more complex tasks, such as understanding the relationships between words in a sentence or pixels in an image.</s>
    
    '''
    

Finetuning details

The finetuning scripts will be available in our RAIL Github Repository.

Evaluation

TODO