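Model:
ethzanalytics/RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g
A GPTQ quantization of RedPajama-INCITE-Chat-3B-v1, produced with auto-gptq. The model file is only 2 GB.
Note that auto_gptq cannot yet load a model directly from the hub. If needed, you can use this function to download it by repo name.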
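One way to fetch the files by repo name is `snapshot_download` from the `huggingface_hub` package. This is a minimal sketch, not necessarily the function linked above. It assumes `huggingface_hub` is installed, and the helper names `local_dir_for` and `download_repo` are hypothetical.

```python
from pathlib import Path
from typing import Optional


def local_dir_for(repo_id: str, root: Optional[Path] = None) -> Path:
    """Map 'owner/name' to '<root>/name' (root defaults to the current directory)."""
    root = Path.cwd() if root is None else root
    return root / repo_id.split("/")[-1]


def download_repo(repo_id: str, root: Optional[Path] = None) -> Path:
    """Download every file of a Hub repo into a local folder and return its path."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=target)
    return target
```

Calling `download_repo("ethzanalytics/RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g")` from the working directory would place the files in the same folder that `model_repo` points to in the loading snippet.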
First, install auto-gptq:
pip install ninja auto-gptq[triton]
Load:
import torch
from pathlib import Path
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Local folder that holds the downloaded model files
model_repo = Path.cwd() / "RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = AutoGPTQForCausalLM.from_quantized(
    model_repo,
    device=device,
    use_safetensors=True,
    use_triton=device != "cpu",  # comment/remove if not on Linux
).to(device)
Inference:
import re

prompt = "How can I further strive to increase shareholder value even further?"
# The chat model expects the <human>/<bot> turn format
prompt = f"<human>: {prompt}\n<bot>:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    # With do_sample=True this samples from the top_k tokens; penalty_alpha
    # only takes effect in contrastive search (do_sample=False).
    penalty_alpha=0.6,
    top_k=4,
    temperature=0.7,
    do_sample=True,
    max_new_tokens=192,
    length_penalty=0.9,  # only used by beam-based generation
    pad_token_id=model.config.eos_token_id,
)
result = tokenizer.batch_decode(
    outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True
)

# Keep the text after each "<bot>:" up to the next "<human>" tag or end of output
bot_responses = re.findall(r'<bot>:(.*?)(<human>|$)', result[0], re.DOTALL)
bot_responses = [response[0].strip() for response in bot_responses]
print(bot_responses[0])
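The prompt formatting and reply extraction can be tried without loading the model. The sketch below wraps both steps in two hypothetical helpers, `build_prompt` and `extract_bot_responses`. The single-turn format comes from the snippet above. Chaining earlier turns in the same `<human>`/`<bot>` style is an assumption about how multi-turn prompts would be laid out.

```python
import re


def build_prompt(question, history=()):
    """Format earlier (human, bot) turns plus a new question in <human>/<bot> style."""
    lines = []
    for human, bot in history:
        lines.append(f"<human>: {human}")
        lines.append(f"<bot>: {bot}")
    lines.append(f"<human>: {question}")
    lines.append("<bot>:")  # left open so the model writes the reply
    return "\n".join(lines)


def extract_bot_responses(text):
    """Return every <bot> reply, each cut at the next <human> tag or end of text."""
    matches = re.findall(r"<bot>:(.*?)(?:<human>|$)", text, re.DOTALL)
    return [match.strip() for match in matches]
```

With history in the prompt, the newest reply is the last element, for example `extract_bot_responses(result[0])[-1]`. For a single-turn prompt this is the same string as `bot_responses[0]`.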