Model: michaelfeil/ct2fast-flan-alpaca-base

Speeds up inference 2x-8x using int8 inference in C++.

Quantized version of declare-lab/flan-alpaca-base.
```bash
# quotes keep the shell from treating ">=" as a redirection
pip install "hf_hub_ctranslate2>=1.0.0" "ctranslate2>=3.13.0"
```
Checkpoint compatible with ctranslate2 and hf-hub-ctranslate2.
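For reference, quantized checkpoints like this one can be produced with CTranslate2's Transformers converter. A minimal sketch, assuming the upstream declare-lab/flan-alpaca-base weights; the output directory name here is an arbitrary, hypothetical choice:

```python
# Sketch: convert the upstream Transformers checkpoint into a CTranslate2
# model quantized to int8 with float16 compute (assumed settings; requires
# the transformers package to be installed).
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter("declare-lab/flan-alpaca-base")
converter.convert("ct2fast-flan-alpaca-base", quantization="int8_float16")  # hypothetical output dir
```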
```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-alpaca-base"

# TranslatorCT2fromHfHub wraps encoder-decoder models such as this one;
# GeneratorCT2fromHfHub is the counterpart for decoder-only models.
model = TranslatorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)

outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to german: How are you doing?"],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5,
)
print(outputs)
```
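The hf_hub_ctranslate2 wrapper can also be bypassed and the checkpoint driven with ctranslate2 directly. A minimal sketch, assuming the converted model files sit in a local ct2fast-flan-alpaca-base directory and that the upstream declare-lab/flan-alpaca-base tokenizer applies (both are assumptions):

```python
# Sketch: raw CTranslate2 usage without the hf_hub_ctranslate2 wrapper.
import ctranslate2
import transformers

# local model path and matching tokenizer are assumptions, not part of this card
translator = ctranslate2.Translator(
    "ct2fast-flan-alpaca-base", device="cuda", compute_type="int8_float16"
)
tokenizer = transformers.AutoTokenizer.from_pretrained("declare-lab/flan-alpaca-base")

# CTranslate2 consumes token strings rather than token ids
tokens = tokenizer.convert_ids_to_tokens(
    tokenizer.encode("Translate to german: How are you doing?")
)
results = translator.translate_batch([tokens], beam_size=5, max_decoding_length=32)
output_tokens = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens)))
```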
This is just a quantized version. The license conditions are expected to be identical to those of the original Hugging Face repository.