Model:
Kirili4ik/ruDialoGpt3-medium-finetuned-telegram
DialoGPT trained on Russian and fine-tuned on my Telegram chats.
The underlying model was created by sberbank-ai and trained on Russian forums (see Grossmend's model). You can find information on how it was trained on habr (in Russian). I built a simple pipeline and fine-tuned that model on my own exported Telegram chat (a ~30 MB JSON file). It is in fact very easy to get the data out of Telegram and fine-tune a model on it, so I made a Colab tutorial for it: https://colab.research.google.com/drive/1fnAVURjyZRK9VQg1Co_-SKUQnRES8l9R?usp=sharing
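For reference, the preprocessing step looks roughly like the sketch below. This is a minimal, hypothetical version, not the exact Colab pipeline: it assumes the Telegram Desktop export layout (a `result.json` with a `"messages"` list whose entries carry `"from"` and `"text"` fields) and reuses the same `|speaker|length|` control-token scheme that the inference snippet further down expects. `build_dataset`, `length_bucket`, and `MY_NAME` are illustrative names, not part of the released code.

```python
import json

# Minimal sketch (assumed, not the exact Colab pipeline): turn a Telegram
# Desktop export into fine-tuning lines in the |speaker|length|text<eos> format.
# Assumes the standard export layout:
#   result.json -> {"messages": [{"from": "...", "text": "..."}, ...]}
MY_NAME = "Kirill"  # hypothetical: the chat participant the model should imitate


def length_bucket(n_tokens: int) -> str:
    """Same length bucketing as get_length_param in the inference snippet."""
    if n_tokens <= 15:
        return '1'
    if n_tokens <= 50:
        return '2'
    if n_tokens <= 256:
        return '3'
    return '-'


def build_dataset(export_path: str, tokenizer, eos: str) -> list[str]:
    with open(export_path, encoding="utf-8") as f:
        messages = json.load(f)["messages"]
    lines = []
    for msg in messages:
        text = msg.get("text")
        if not isinstance(text, str) or not text:  # skip stickers, media, formatted entities
            continue
        speaker = '1' if msg.get("from") == MY_NAME else '0'  # 1 = machine, 0 = human
        bucket = length_bucket(len(tokenizer.encode(text)))
        lines.append(f"|{speaker}|{bucket}|{text}{eos}")
    return lines
```

Called as `build_dataset("result.json", tokenizer, tokenizer.eos_token)`, this yields one `|speaker|length|text<eos>` line per message, matching the format the chat loop below reconstructs at inference time.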
⚠️ Due to the specifics of the data, the hosted inference API may not work properly ⚠️
🤗 To try it out, use my Spaces demo 🤗
```python
# Download model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

checkpoint = "Kirili4ik/ruDialoGpt3-medium-finetuned-telegram"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()


# util function to get expected len after tokenizing
def get_length_param(text: str, tokenizer) -> str:
    tokens_count = len(tokenizer.encode(text))
    if tokens_count <= 15:
        len_param = '1'
    elif tokens_count <= 50:
        len_param = '2'
    elif tokens_count <= 256:
        len_param = '3'
    else:
        len_param = '-'
    return len_param


# util function to get next person number (1/0) for Machine or Human in the dialogue
def get_user_param(text: dict, machine_name_in_chat: str) -> str:
    if text['from'] == machine_name_in_chat:
        return '1'  # machine
    else:
        return '0'  # human


# dtype must be torch.long so it can be concatenated with tokenizer output
chat_history_ids = torch.zeros((1, 0), dtype=torch.long)

while True:
    next_who = input("Who's phrase?\t")  # "H" for Human or "G" for GPT

    # In case Human
    if next_who == "H" or next_who == "Human":
        input_user = input("===> Human: ")

        # encode the new user input with speaker/length control tokens, return a PyTorch tensor
        new_user_input_ids = tokenizer.encode(
            f"|0|{get_length_param(input_user, tokenizer)}|" + input_user + tokenizer.eos_token,
            return_tensors="pt",
        )
        # append the new user input tokens to the chat history
        chat_history_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)

    if next_who == "G" or next_who == "GPT":
        next_len = input("Phrase len? 1/2/3/-\t")  # expected length bucket of the reply

        # encode the control tokens that ask the model to speak next
        new_user_input_ids = tokenizer.encode(f"|1|{next_len}|", return_tensors="pt")
        # append them to the chat history
        chat_history_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)

        # print(tokenizer.decode(chat_history_ids[-1]))  # uncomment to see full gpt input

        # save previous len so we can slice out only the newly generated tokens
        input_len = chat_history_ids.shape[-1]

        # generate a response; you can read about the parameters at hf.co/blog/how-to-generate
        chat_history_ids = model.generate(
            chat_history_ids,
            num_return_sequences=1,  # use >1 for more variants, but then print [i]
            max_length=512,
            no_repeat_ngram_size=3,
            do_sample=True,
            top_k=50,
            top_p=0.9,
            temperature=0.6,  # lower values make sampling closer to greedy
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

        # pretty print last output tokens from the bot
        print(f"===> GPT-3: {tokenizer.decode(chat_history_ids[:, input_len:][0], skip_special_tokens=True)}")
```
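The `|speaker|length|` prefix is the key convention here: every turn, whether in the fine-tuning data or in the chat loop above, is encoded as `|0 or 1|length bucket|text` followed by the eos token. As a small illustration of how `get_user_param` and `get_length_param` fit together (the message dict mirrors the Telegram export shape that `get_user_param` expects; the values are made up):

```python
# Illustrative only: a message dict shaped like a Telegram export entry.
msg = {"from": "Kirill", "text": "Привет!"}

speaker = get_user_param(msg, machine_name_in_chat="Kirill")  # -> '1' (machine)
prompt = f"|{speaker}|{get_length_param(msg['text'], tokenizer)}|" + msg["text"] + tokenizer.eos_token
print(prompt)  # speaker tag, length bucket, text, then the eos token
```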