Model:

SearchUnify-ML/xgen-7b-8k-open-instruct-gptq

Language: English

These are GPTQ 4-bit model files for VMware's XGen 7B 8K Open Instruct.

They are the result of quantising the model to 4-bit using GPTQ-for-LLaMa.
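
For reference, here is a minimal sketch of downloading the quantised files locally with huggingface_hub; the snapshot_download call is standard, but its use here is an illustration rather than part of the original card.

from huggingface_hub import snapshot_download

# Download the repository (quantised weights, tokenizer and config files) into the local cache.
local_dir = snapshot_download(repo_id="SearchUnify-ML/xgen-7b-8k-open-instruct-gptq")
print(local_dir)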

How to use this GPTQ model in Python code

First, make sure AutoGPTQ is installed:

pip install auto-gptq

Second, install tiktoken, which is required by the tokenizer:

pip install tiktoken

Then load the tokenizer and the quantised model, and run generation:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
model_basename = "gptq_model-4bit-128g"

use_triton = False

# Load the tokenizer; XGen uses a custom tokenizer, so use_fast=False and trust_remote_code=True are required.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                          use_fast=False,
                                          trust_remote_code=True)

# Load the 4-bit quantised model onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=False,
                                           trust_remote_code=True,
                                           device="cuda:0",
                                           use_triton=use_triton)

# Note: check the prompt template is correct for this model.
prompt = "Explain the rules of field hockey to a novice."
prompt_template = f'''### Instruction: {prompt}
### Response:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
# do_sample must be enabled for the temperature setting to take effect.
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.3, max_new_tokens=512)
print(f"\n\n {tokenizer.decode(output[0]).split('### Response:')[1]}")