Model:
SearchUnify-ML/xgen-7b-8k-open-instruct-gptq
These are GPTQ 4-bit model files for VMware's XGen 7B 8K Open Instruct.
They are the result of quantizing the model to 4 bit using GPTQ-for-LLaMa.
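For reference, a comparable 4-bit, group-size-128 quantization can also be reproduced with AutoGPTQ (the library used for inference below) rather than GPTQ-for-LLaMa. The following is only a minimal sketch under stated assumptions, not the exact procedure used for this repository: the base model id and the calibration texts are illustrative placeholders.

# Minimal sketch of an equivalent 4-bit / group-size-128 quantization with AutoGPTQ.
# The base model id and calibration texts are assumptions for illustration only.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "VMware/xgen-7b-8k-open-instruct"  # assumed source model

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False, trust_remote_code=True)

# A handful of calibration samples; a real run would use a larger, representative set.
calibration_texts = [
    "### Instruction: Summarise the benefits of regular exercise.\n### Response: Regular exercise improves cardiovascular health.",
]
examples = [tokenizer(text) for text in calibration_texts]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config, trust_remote_code=True)
model.quantize(examples)
model.save_quantized("xgen-7b-8k-open-instruct-gptq", use_safetensors=False)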
First, make sure AutoGPTQ is installed:
pip install auto-gptq
Next, install tiktoken, which the tokenizer requires:
pip install tiktoken
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
model_basename = "gptq_model-4bit-128g"

use_triton = False

# The XGen tokenizer is custom, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False, trust_remote_code=True)

# Load the 4-bit GPTQ weights onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=False,
                                           trust_remote_code=True,
                                           device="cuda:0",
                                           use_triton=use_triton)

# Note: check the prompt template is correct for this model.
prompt = "Explain the rules of field hockey to a novice."
prompt_template = f'''### Instruction: {prompt}
### Response:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
# do_sample=True is needed for the temperature setting to take effect.
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.3, max_new_tokens=512)
print(f"\n\n {tokenizer.decode(output[0]).split('### Response:')[1]}")
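Inference can also be run through transformers' text-generation pipeline on top of the model loaded above. The sketch below reuses the model, tokenizer and prompt_template from the previous snippet; the sampling parameters are illustrative defaults, not values recommended for this model.

# Optional: run the same prompt through transformers' pipeline API.
# Generation parameters here are illustrative, not tuned recommendations.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.15,
)

print(pipe(prompt_template)[0]["generated_text"])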