Model:
TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ
Chat & support: my new Discord server
Want to contribute? TheBloke's Patreon page
These files are GPTQ 4-bit model files for H2O's GPT-GM-OASST1-Falcon 40B v2.
It is the result of quantising to 4-bit using AutoGPTQ.
Prompt template: <|prompt|>prompt<|endoftext|><|answer|>
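For example, a user question is wrapped in this template before being sent to the model (a minimal illustration; the question text is arbitrary):

prompt = "Tell me about AI"
prompt_template = f"<|prompt|>{prompt}<|endoftext|><|answer|>"
# The model is expected to continue the text after <|answer|>.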
Please note this is an experimental GPTQ model. Support for it is currently quite limited.
It is also expected to be VERY SLOW. This is unavoidable at the moment, but is being looked at.
First make sure you have AutoGPTQ installed:
pip install auto-gptq
Then try the following example code:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ"
model_basename = "gptq_model-4bit--1g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# Note: check that the prompt template is correct for this model.
prompt = "Tell me about AI"
prompt_template = f'''<|prompt|>{prompt}<|endoftext|><|answer|>'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline
# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
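Optionally, if you want output printed token by token rather than all at once, transformers' TextStreamer can be passed to generate(). This is a sketch not part of the original example; it assumes model, tokenizer, and input_ids from the code above are already defined:

from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated, skipping the prompt echo
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512, streamer=streamer)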
Provided file: gptq_model-4bit--1g.safetensors
This will work with AutoGPTQ, ExLlama, and CUDA versions of GPTQ-for-LLaMa. There are reports of issues with the Triton mode of recent GPTQ-for-LLaMa; if you have issues, please use AutoGPTQ instead.
It was created without group_size to lower VRAM requirements, and with --act-order (desc_act) to boost inference accuracy as much as possible.
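For reference, a quantisation with these settings could be reproduced with AutoGPTQ roughly as follows. This is a hedged sketch, not the exact procedure used to produce this repo; the calibration example and output directory are placeholders.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2"  # unquantised base model

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantisation
    group_size=-1,   # no group size, to lower VRAM requirements
    desc_act=True,   # act-order, to boost inference accuracy
)

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config, trust_remote_code=True)

# Placeholder calibration data; a real run would use a representative dataset.
examples = [tokenizer("<|prompt|>Tell me about AI<|endoftext|><|answer|>")]

model.quantize(examples)
model.save_quantized("h2ogpt-falcon-40b-v2-GPTQ", use_safetensors=True)  # hypothetical output dir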
Please be aware that the trust_remote_code=True argument causes Python code provided by Falcon to be executed on your machine.
This code is required at the moment because Falcon is too new to be supported by Hugging Face transformers. At some point in the future transformers will support the model natively, and then trust_remote_code will no longer be needed.
In this repo you can see two .py files: these are the files that get executed. They are copied from the base repo at Falcon-40B-Instruct.
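For clarity, the argument in question is trust_remote_code=True, passed when loading the model, as in the example above:

from auto_gptq import AutoGPTQForCausalLM

# trust_remote_code=True permits the repo's custom Falcon .py files to run locally
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ",
    model_basename="gptq_model-4bit--1g",
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
)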
For further support, and discussions on these models and AI in general, join us at:
Thanks to the chirper.ai team!
I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
Special thanks to: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.
Patreon special mentions: Mano Prime, Fen Risland, Derek Yates, Preetika Verma, webtim, Sean Connelly, Alps Aficionado, Karl Bernard, Junyu Yang, Nathan LeClaire, Chris McCloskey, Lone Striker, Asp the Wyvern, Eugene Pentland, Imad Khwaja, trip7s trip, WelcomeToTheClub, John Detwiler, Artur Olbinski, Khalefa Al-Ahmad, Trenton Dambrowitz, Talal Aujan, Kevin Schuppel, Luke Pendergrass, Pyrater, Joseph William Delisle, terasurfer, vamX, Gabriel Puliatti, David Flickinger, Jonathan Leane, Iucharbius, Luke, Deep Realms, Cory Kujawski, ya boyyy, Illia Dulskyi, senxiiz, Johann-Peter Hartmann, John Villwock, K, Ghost, Spiking Neurons AB, Nikolai Manek, Rainer Wilmers, Pierre Kircher, biorpg, Space Cruiser, Ai Maven, subjectnull, Willem Michiel, Ajan Kanaga, Kalila, chris gileta, Oscar Rangel.
Thank you to all my generous patrons and donaters!
This model was trained using H2O LLM Studio.
To use the model with the transformers library on a machine with GPUs, first make sure you have the transformers, accelerate, and torch libraries installed.
pip install transformers==4.29.2
pip install bitsandbytes==0.39.0
pip install accelerate==0.19.0
pip install torch==2.0.0
pip install einops==0.6.1
import torch
from transformers import pipeline, BitsAndBytesConfig, AutoTokenizer

model_kwargs = {}

quantization_config = None
# optional quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)
model_kwargs["quantization_config"] = quantization_config

tokenizer = AutoTokenizer.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    use_fast=False,
    padding_side="left",
    trust_remote_code=True,
)

generate_text = pipeline(
    model="h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    use_fast=False,
    device_map={"": "cuda:0"},
    model_kwargs=model_kwargs,
)

res = generate_text(
    "Why is drinking water so healthy?",
    min_new_tokens=2,
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)
print(res[0]["generated_text"])
You can print a sample prompt after the preprocessing step to see how it is fed to the tokenizer:
print(generate_text.preprocess("Why is drinking water so healthy?")["prompt_text"])
<|prompt|>Why is drinking water so healthy?<|endoftext|><|answer|>
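If you prefer, you can build the same string yourself before tokenisation; a minimal sketch, assuming generate_text is the pipeline constructed above:

question = "Why is drinking water so healthy?"
prompt_text = f"<|prompt|>{question}<|endoftext|><|answer|>"

# Should match what the pipeline's preprocessing produces
assert prompt_text == generate_text.preprocess(question)["prompt_text"]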
Alternatively, you can download h2oai_pipeline.py, store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer:
import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = None
# optional quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

tokenizer = AutoTokenizer.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    use_fast=False,
    padding_side="left",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map={"": "cuda:0"},
    quantization_config=quantization_config
).eval()

generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text(
    "Why is drinking water so healthy?",
    min_new_tokens=2,
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)
print(res[0]["generated_text"])
You may also skip the pipeline entirely and run generation from the loaded model and tokenizer yourself, taking care of the preprocessing (prompt formatting) step:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Important: The prompt needs to be in the same format the model was trained with.
# You can find an example prompt in the experiment logs.
prompt = "<|prompt|>How are you?<|endoftext|><|answer|>"

quantization_config = None
# optional quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

tokenizer = AutoTokenizer.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    use_fast=False,
    padding_side="left",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map={"": "cuda:0"},
    quantization_config=quantization_config
).eval()

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

# generate configuration can be modified to your needs
tokens = model.generate(
    **inputs,
    min_new_tokens=2,
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)[0]

tokens = tokens[inputs["input_ids"].shape[1]:]
answer = tokenizer.decode(tokens, skip_special_tokens=True)
print(answer)
RWForCausalLM(
  (transformer): RWModel(
    (word_embeddings): Embedding(65024, 8192)
    (h): ModuleList(
      (0-59): 60 x DecoderLayer(
        (ln_attn): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
        (ln_mlp): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
        (self_attention): Attention(
          (maybe_rotary): RotaryEmbedding()
          (query_key_value): Linear(in_features=8192, out_features=9216, bias=False)
          (dense): Linear(in_features=8192, out_features=8192, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): MLP(
          (dense_h_to_4h): Linear(in_features=8192, out_features=32768, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear(in_features=32768, out_features=8192, bias=False)
        )
      )
    )
    (ln_f): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=8192, out_features=65024, bias=False)
)
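The summary above is simply what PyTorch prints for the loaded module; you can reproduce it yourself with something along these lines, assuming model is the AutoModelForCausalLM instance loaded in the earlier examples:

# Print the module hierarchy shown above
print(model)

# Rough parameter count for the loaded model
num_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {num_params:,}")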
This model was trained using H2O LLM Studio and with the configuration in cfg.yaml. Visit H2O LLM Studio to learn how to train your own large language models.
Please read this disclaimer carefully before using the large language model provided in this repository. Your use of the model signifies your agreement to the following terms and conditions.
By using the large language model provided in this repository, you agree to accept and comply with the terms and conditions outlined in this disclaimer. If you do not agree with any part of this disclaimer, you should refrain from using the model and any content generated by it.