Model:
TheBloke/open-llama-7b-open-instruct-GPTQ
Chat & support: my new Discord server
Want to contribute? TheBloke's Patreon page
These files are GPTQ 4-bit model files for VMware's open-llama-7B-open-instruct.
They are the result of quantising the model to 4 bits using GPTQ-for-LLaMa.
Standard Alpaca prompt template:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:
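Purely as an illustrative sketch (plain Python string formatting, no library API; the example instruction is made up), filling this template looks like:

# Minimal sketch: substitute an instruction into the standard Alpaca template.
instruction = "Explain what 4-bit quantisation does to a language model."  # example text, made up
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{instruction}\n\n### Response:"
)
print(prompt)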
Please make sure you are using the latest version of text-generation-webui.
First, make sure you have AutoGPTQ installed:
pip install auto-gptq
Then try the following example code:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/open-llama-7b-open-instruct-GPTQ"
model_basename = "open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None,
)

# Build the prompt with the standard Alpaca template the model was trained on
prompt = "Tell me about AI"
prompt_template = f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline
# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

print(pipe(prompt_template)[0]['generated_text'])
open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order.safetensors
This will work with AutoGPTQ and the CUDA versions of GPTQ-for-LLaMa. There are reports of problems with GPTQ-for-LLaMa's recent Triton mode; if you have issues, please use AutoGPTQ instead.
It was created with group_size 128 to increase inference accuracy, but without --act-order (desc_act) to increase compatibility and improve inference speed.
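For context only, the sketch below shows roughly how a file with these settings could be produced using AutoGPTQ's BaseQuantizeConfig. It is not the exact procedure or calibration data used for this repository; the output directory name and the single calibration sentence are placeholders.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model = "VMware/open-llama-7B-open-instruct"
quantized_dir = "open-llama-7b-open-instruct-gptq"  # placeholder output directory

# Settings matching this file: 4-bit, group_size 128, no act-order (desc_act=False)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model, use_fast=True)

# A real quantisation run would use a much larger calibration set; one sentence is only a placeholder.
examples = [tokenizer("Below is an instruction that describes a task.")]

# Load the unquantised model, run GPTQ calibration, and save the 4-bit weights as safetensors.
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_dir, use_safetensors=True)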
For further support, and discussions on these models and AI in general, join us at:
Thanks to the chirper.ai team!
A lot of people ask whether they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine-tuning/training.
If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
Special thanks to: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.
Patreon special mentions: Ajan Kanaga, Kalila, Derek Yates, Sean Connelly, Luke, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, trip7s trip, Jonathan Leane, Talal Aujan, Artur Olbinski, Cory Kujawski, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Johann-Peter Hartmann.
Thank you to all my generous patrons and donaters!
An instruction-tuned version of the fully trained Open LLama 7B model. The model is open for commercial use.
NOTE: The model was trained using the Alpaca prompt template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'VMware/open-llama-7B-open-instruct'

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='sequential')

# Alpaca prompt template the model was trained with
prompt_template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

prompt = 'Explain in simple terms how the attention mechanism of a transformer model works'
input_text = prompt_template.format(instruction=prompt)

input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

output1 = model.generate(input_ids, max_length=512)

# Strip the prompt tokens so only the newly generated response is decoded
input_length = input_ids.shape[1]
output1 = output1[:, input_length:]
output = tokenizer.decode(output1[0])
print(output)

'''
Attention is a mechanism used in deep learning models, such as transformer models, to capture global dependencies between different parts of the input.

In a transformer model, the attention mechanism works by computing a weighted sum of the input vectors and then applying a non-linear activation function to the result. The attention mechanism in a transformer model works in two steps:

1. Query-Key Mapping: First, the input sequence is divided into two parts: the query vector and the key vector. The query vector represents the input at the current position, and the key vector represents the input at a previous position.

2. Attention Weight Calculation: Second, the attention weights are calculated using the dot product between the query vector and each key vector. The attention weights represent the importance of the input at the previous position to the current position.

The attention weights are then used to compute the attention score for each input element. The attention score represents the relevance of the input element to the current position.

The attention mechanism in a transformer model is designed to capture global dependencies between different parts of the input. By attending to input elements from different positions, the model can learn to understand the relationships between different parts of the input. This allows the model to perform more complex tasks, such as understanding the relationships between words in a sentence or pixels in an image.</s>
'''
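The decoded example output above ends with the model's </s> end-of-sequence token. If you want that stripped, transformers' decode accepts skip_special_tokens; a minimal variation of the decode step above:

# Same decode as above, but drop special tokens such as </s> from the returned text
output = tokenizer.decode(output1[0], skip_special_tokens=True)
print(output)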
The fine-tuning scripts will be made available in our RAIL Github Repository.
TODO