Model:

michaelfeil/ct2fast-flan-alpaca-base

English

Fast inference with CTranslate2

Speed up inference by 2x-8x using int8 inference in C++

Quantized version of declare-lab/flan-alpaca-base
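The exact conversion settings for this checkpoint are not documented here, but a CTranslate2 export of this kind is typically produced with CTranslate2's Transformers converter. A minimal sketch, assuming transformers (and torch) are installed; the output directory name is illustrative:

# Minimal sketch of exporting declare-lab/flan-alpaca-base to a CTranslate2
# checkpoint with int8 weights. The output path and options are assumptions,
# not the exact command used for this repository.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter("declare-lab/flan-alpaca-base")
converter.convert(
    output_dir="flan-alpaca-base-ct2",  # hypothetical local output directory
    quantization="int8",                # store weights as int8
    force=True,                         # overwrite the output directory if it exists
)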

pip install "hf_hub_ctranslate2>=1.0.0" "ctranslate2>=3.13.0"

Checkpoint compatible with ctranslate2 and hf-hub-ctranslate2:

  • compute_type=int8_float16 for device="cuda"
  • compute_type=int8 for device="cpu" (see the CPU sketch after the example below)
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-alpaca-base"
# flan-alpaca-base is a T5-based encoder-decoder model, so the Translator
# wrapper is used; GeneratorCT2fromHfHub is for decoder-only models.
model = TranslatorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to german: How are you doing?"],
    min_decoding_length=24,  # generate at least 24 tokens
    max_decoding_length=32,  # and at most 32 tokens
    max_input_length=512,    # truncate inputs longer than 512 tokens
    beam_size=5,
)
print(outputs)
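
For machines without a GPU, the same checkpoint can be loaded with compute_type="int8" on device="cpu". A minimal CPU sketch using plain ctranslate2 together with the original model's tokenizer; the use of huggingface_hub.snapshot_download and the declare-lab/flan-alpaca-base tokenizer are assumptions for illustration, not taken from this card:

# Minimal CPU sketch, assuming the repository root contains the converted
# CTranslate2 model files and that the original tokenizer can be reused.
import ctranslate2
import transformers
from huggingface_hub import snapshot_download

model_dir = snapshot_download("michaelfeil/ct2fast-flan-alpaca-base")
translator = ctranslate2.Translator(model_dir, device="cpu", compute_type="int8")
tokenizer = transformers.AutoTokenizer.from_pretrained("declare-lab/flan-alpaca-base")

# T5-style models take SentencePiece tokens as input.
input_tokens = tokenizer.convert_ids_to_tokens(
    tokenizer.encode("Translate to german: How are you doing?")
)
results = translator.translate_batch([input_tokens], beam_size=5, max_decoding_length=32)
output_tokens = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens)))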

License and other remarks:

This is only a quantized version. The license conditions are expected to be identical to those of the original Hugging Face repository.