Model:

Thireus/Vicuna13B-v1.1-8bit-128g

Task:

Text generation

Libraries:

PyTorch, Transformers

Other:

llama, vicuna, text-generation-inference

Preprints:

arxiv:2210.17323, arxiv:2105.03536, arxiv:2212.09720, arxiv:2301.00774

License:

other


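This is an 8-bit GPTQ version of Vicuna 13B v1.1 HF (not to be confused with 8-bit RTN).

Q: Why quantize to 8 bits instead of 4 bits? A: For evaluation purposes. In theory, an 8-bit quantized model should give slightly better perplexity than a 4-bit quantized version (possibly not noticeable, to be evaluated...). If you have more than 15GB of GPU VRAM available, you may want to try it. Note that 8-bit quantization is not the same as loading the model in 8-bit precision. Loading the model in 8-bit precision (--load-in-8bit) comes with a noticeable quality (perplexity) degradation.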

References:

This model is an 8-bit quantized version of Vicuna 13B v1.1.

13B parameters
Group size: 128
wbits: 8
true-sequential: yes
act-order: yes
8-bit GPTQ
c4 (calibration dataset)
Conversion process: LLaMa 13B -> LLaMa 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g
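
To get a feel for what wbits and group size mean for memory, here is a rough back-of-the-envelope sketch. It is not taken from GPTQ-for-LLaMa; the function name gptq_weight_gib is made up for illustration, and it assumes that every one of the 13B parameters is quantized and that each group of 128 weights stores one fp16 scale plus one zero point packed at wbits. Real checkpoints keep some layers unquantized, so treat the result as an order-of-magnitude estimate only.

```python
def gptq_weight_gib(n_params: float, wbits: int, group_size: int) -> float:
    """Rough size of group-quantized weights in GiB (weights only)."""
    weight_bits = n_params * wbits
    n_groups = n_params / group_size
    # Assumption: per group, one fp16 scale (16 bits) and one zero point of wbits bits.
    overhead_bits = n_groups * (16 + wbits)
    return (weight_bits + overhead_bits) / 8 / 2**30


for wbits in (4, 8):
    size = gptq_weight_gib(13e9, wbits, 128)
    print(f"{wbits}-bit, group size 128: ~{size:.1f} GiB of weights")
```

Measured VRAM usage will be higher than this, since it also covers things the estimate ignores, such as unquantized layers and activations.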

Benchmarks

Tested with https://github.com/qwopqwop200/GPTQ-for-LLaMa/. Best results are shown in bold.

--benchmark 2048 --check results:

Model	wikitext2 PPL	ptb PPL	c4 PPL	VRAM Utilization (MiB)
4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g	8.517391204833984	20.888103485107422	7.058407783508301	8670.26953125
8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g	8.508771896362305	20.75649070739746	7.105874538421631	14840.26171875

--eval results:

Model	wikitext2 PPL	ptb PPL	c4 PPL
4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g	7.119165420532227	25.692861557006836	9.06746768951416
8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g	6.988043308258057	24.882535934448242	8.991846084594727

--new-eval --eval results:

Model	wikitext2 PPL	ptb-new PPL	c4-new PPL
4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g	7.119165420532227	35.637290954589844	9.550592422485352
8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g	6.988043308258057	34.264320373535156	9.426002502441406
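
To put the tables in perspective, the short sketch below works out the trade-off between the two models. It only does arithmetic on numbers copied from the tables above (wikitext2 PPL from the --eval run, VRAM from the --benchmark run); the variable names are mine.

```python
# Values copied from the tables above.
ppl_4bit, ppl_8bit = 7.119165420532227, 6.988043308258057  # wikitext2 PPL, --eval
vram_4bit, vram_8bit = 8670.26953125, 14840.26171875       # VRAM Utilization, --benchmark

# Relative perplexity reduction when going from 4-bit to 8-bit.
ppl_gain = (ppl_4bit - ppl_8bit) / ppl_4bit
# How much more VRAM the 8-bit model used compared with the 4-bit one.
vram_ratio = vram_8bit / vram_4bit

print(f"wikitext2 PPL reduction: {ppl_gain:.2%}")
print(f"VRAM ratio (8-bit / 4-bit): {vram_ratio:.2f}x")
```

On these figures the 8-bit model buys a perplexity improvement of under 2% on wikitext2 for roughly 1.7 times the VRAM.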

PPL = perplexity (lower is better) - https://huggingface.co/docs/transformers/perplexity
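
For readers new to the metric, perplexity is the exponential of the average negative log-likelihood per token. The snippet below is a minimal illustration of that definition on a toy list of log-probabilities; it is not the evaluation code that produced the tables, and the function name is only illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Toy case: a model that assigns probability 1/4 to every token it sees.
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))
```

A model that spreads its probability evenly over four choices at every step has a perplexity of about 4, which is why lower values mean the model is less "surprised" by the text.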

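Basic installation procedure

It was a nightmare, so I will only briefly describe what you need. Getting WSL to work was quite painful.
I cannot provide installation support, sorry.
You can certainly use llama.cpp and other loaders that support 8-bit quantization; I simply chose oobabooga/text-generation-webui.
Expect many errors before text-generation-webui loads, ranging from missing PATH or environment variables to having to manually pip uninstall/install packages.
The notes below will likely become outdated once both text-generation-webui and GPTQ-for-LLaMa receive the appropriate bug fixes.
If the model generates answers very slowly (1 token/s), you are either not using CUDA for bitsandbytes or your hardware needs an upgrade.
If the model generates answers containing weird characters, you are using a broken commit of qwopqwop200/GPTQ-for-LLaMa.
If the model generates off-topic answers or talks to itself, you are using a broken commit of qwopqwop200/GPTQ-for-LLaMa.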
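
If you want to check whether you are near the 1 token/s mark, a simple timing wrapper is enough. The sketch below is generic: tokens_per_second and fake_generate are hypothetical names, and fake_generate is only a stand-in that sleeps instead of running a model. Swap in whatever generation call your loader exposes, as long as it returns the list of generated tokens.

```python
import time


def tokens_per_second(generate, prompt, max_new_tokens):
    """Time one call to `generate` and return generated tokens per second."""
    start = time.perf_counter()
    tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")


def fake_generate(prompt, max_new_tokens):
    # Stand-in for a real model call (assumption): sleeps briefly per token.
    tokens = []
    for i in range(max_new_tokens):
        time.sleep(0.001)
        tokens.append(i)
    return tokens


rate = tokens_per_second(fake_generate, "Hello", 20)
print(f"{rate:.1f} tokens/s")
```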

Recommended - Triton (fast tokens/s) - works on Windows via WSL (what I used) or on Linux:

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
#git fetch origin pull/1229/head:triton # Since been merged # This is the version that supports Triton - https://github.com/oobabooga/text-generation-webui/pull/1229
git checkout triton
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git # -b cuda
cd GPTQ-for-LLaMa
#git checkout 508de42 # Since been fixed # Before qwopqwop200 broke everything... - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/183
git checkout 210c379 # Optional - This is a commit I have verified, you may want to try the latest commit instead, if the latest commit doesn't work revert to an older one such as this one
pip install -r requirements.txt

Not recommended - Cuda (slow tokens/s) and output issues https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/128 :

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda # Make sure you obtain the qwopqwop200 version, not the oobabooga one! (because "act-order: yes")
cd GPTQ-for-LLaMa
git checkout 505c2c7 # Optional - This is a commit I have verified, you may want to try the latest commit instead, if the latest commit doesn't work revert to an older one such as this one
pip install -r requirements.txt
python setup_cuda.py install
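
Since several of the symptoms above come down to running the wrong GPTQ-for-LLaMa commit, it can help to confirm which commit is actually checked out. The helper below is a small sketch of my own, not part of either project: current_commit shells out to git rev-parse HEAD, and matches_pinned compares the result against a short hash such as 210c379 or 505c2c7 from the steps above. The repository path in the usage comment is an assumption based on the directory layout created above.

```python
import subprocess


def current_commit(repo_dir):
    """Return the full hash of HEAD for the git repository at repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def matches_pinned(head_sha, pinned_prefix):
    """True if the checked-out hash starts with the pinned short hash."""
    return head_sha.lower().startswith(pinned_prefix.lower())


# Usage (path is an assumption, adjust to your checkout):
#   head = current_commit("text-generation-webui/repositories/GPTQ-for-LLaMa")
#   print(matches_pinned(head, "210c379"))

# Demonstration with a made-up full hash (padding is not a real commit id).
made_up_head = "210c379" + "0" * 33
print(matches_pinned(made_up_head, "210c379"))
```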