Model:
AlexWortega/instruct_rugptlarge
This is ruGPTlarge with continued training on instruction data ("instruct"-style tuning). It performs better in zero-shot and few-shot settings, and outperforms XGLM1.7b and mgpt on Russian.
```python
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

# Load the tokenizer and register the special tokens used during instruction tuning.
tokenizer = GPT2TokenizerFast.from_pretrained("AlexWortega/instruct_rugptlarge")
special_tokens_dict = {'additional_special_tokens': ['<code>', '</code>', '<instructionS>', '<instructionE>', '<next>']}
tokenizer.add_special_tokens(special_tokens_dict)

device = 'cuda'
model = GPT2LMHeadModel.from_pretrained("AlexWortega/instruct_rugptlarge")
model.to(device)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix for the added tokens

def generate_seqs(q, model, k=2):
    gen_kwargs = {
        "min_length": 20,
        "max_new_tokens": 100,
        "top_k": 50,
        "top_p": 0.7,
        "do_sample": True,
        "early_stopping": True,
        "no_repeat_ngram_size": 2,
        "eos_token_id": tokenizer.eos_token_id,
        "pad_token_id": tokenizer.eos_token_id,
        "use_cache": True,
        "repetition_penalty": 1.5,
        "length_penalty": 1.2,
        "num_beams": 4,
        "num_return_sequences": k,
    }
    q = q + '<instructionS>'  # append the instruction-start marker the model was trained with
    t = tokenizer.encode(q, return_tensors='pt').to(device)
    g = model.generate(t, **gen_kwargs)
    generated_sequences = tokenizer.batch_decode(g, skip_special_tokens=True)
    return generated_sequences
```
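For a quick smoke test, here is a minimal usage sketch; the Russian prompt is an illustrative assumption, not something from the model card:

```python
# Hypothetical prompt ("Write a short poem about winter"); any Russian instruction works the same way.
answers = generate_seqs("Напиши короткое стихотворение о зиме", model, k=2)
for a in answers:
    print(a)
```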
Note that the best parameters for generation are the following:
```python
gen_kwargs = {
    "min_length": 20,
    "max_new_tokens": 100,
    "top_k": 50,
    "top_p": 0.9,
    "do_sample": True,
    "early_stopping": True,
    "no_repeat_ngram_size": 2,
    "eos_token_id": tokenizer.eos_token_id,
    "pad_token_id": tokenizer.eos_token_id,
    "use_cache": True,
    "repetition_penalty": 1.5,
    "length_penalty": 0.8,
    "num_beams": 4,
    "num_return_sequences": k,
}
```
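These values differ from the ones hard-coded in generate_seqs above (top_p 0.9 instead of 0.7, length_penalty 0.8 instead of 1.2). A minimal sketch of wiring them in, assuming the tokenizer, model, and device from the first snippet; the generate_tuned helper and the prompt are my own illustration:

```python
k = 2  # gen_kwargs above references k, so bind it before building the dict

def generate_tuned(q, model):
    # Same flow as generate_seqs, but using the recommended gen_kwargs above.
    t = tokenizer.encode(q + '<instructionS>', return_tensors='pt').to(device)
    g = model.generate(t, **gen_kwargs)
    return tokenizer.batch_decode(g, skip_special_tokens=True)

print(generate_tuned("Почему небо голубое?", model))  # hypothetical prompt: "Why is the sky blue?"
```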
License: The weights of ruGPT Small v0.1a are licensed under version 2.0 of the Apache License.
I used the Novograd optimizer with a learning rate of 2e-5 and a global batch size of 6 (3 per data-parallel worker), training with both data parallelism and pipeline parallelism. During training, input sequences were truncated to 1024 tokens; sequences shorter than 1024 tokens were concatenated into one long sequence to improve data efficiency.
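A minimal sketch of that packing scheme (my reconstruction, not the actual training code; using the eos token as the separator between concatenated examples is an assumption):

```python
def pack_sequences(tokenized_examples, max_len=1024, sep_id=tokenizer.eos_token_id):
    # Build one long token stream: truncate long examples, concatenate short ones.
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids[:max_len])
        stream.append(sep_id)  # assumed separator between examples
    # Cut the stream into full max_len chunks; the trailing remainder is dropped.
    return [stream[i:i + max_len] for i in range(0, len(stream) - max_len + 1, max_len)]
```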
# Metrics
One day, people, one daaay.
```
@article{
  title={GPT2xl is underrated task solver},
  author={Nickolich Aleksandr, 5Q, datascience, Ilya Gusev, Alex Kukushkin, Karina Romanova, Arseniy Shahmatov, Maksim Gersimenko},
  year={2023}
}
```