Model:
stanfordnlp/SteamSHP-flan-t5-xl
SteamSHP-XL is a preference model trained to predict which response humans will find more helpful, given some context and two possible responses. It can be used for evaluating natural language generation or as a reward model for RLHF (reinforcement learning from human feedback).
It is a FLAN-T5-xl model (3B parameters), finetuned on the Stanford Human Preferences Dataset (SHP) and the helpfulness data in Anthropic's HH-RLHF dataset.
There is also a smaller variant called SteamSHP-Large, obtained by finetuning FLAN-T5-large (780M parameters). Despite being a quarter of the size, it is on average only 0.75 percentage points less accurate on the SHP + Anthropic test data across all domains.
The input text should be of the following format:
POST: { the context, such as the 'history' column in SHP (not containing any newlines \n) } RESPONSE A: { first possible continuation (not containing any newlines \n) } RESPONSE B: { second possible continuation (not containing any newlines \n) } Which response is better? RESPONSE
The output generated by SteamSHP-XL will either be A or B.
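As a minimal sketch of assembling such an input (the `make_input` helper and its whitespace handling are our own illustrative choices, not part of the released code):

    def make_input(context: str, response_a: str, response_b: str) -> str:
        # Illustrative helper: builds a SteamSHP-style prompt.
        # The context and responses must not contain raw newlines, so collapse whitespace.
        clean = lambda s: " ".join(s.split())
        return (
            f"POST: {clean(context)} \n\n "
            f"RESPONSE A: {clean(response_a)}\n\n "
            f"RESPONSE B: {clean(response_b)}\n\n "
            "Which response is better? RESPONSE"
        )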
Here's how to use the model:
    >> from transformers import T5ForConditionalGeneration, T5Tokenizer
    >> device = 'cuda' # if you have a GPU
    >> tokenizer = T5Tokenizer.from_pretrained('stanfordnlp/SteamSHP-flan-t5-xl')
    >> model = T5ForConditionalGeneration.from_pretrained('stanfordnlp/SteamSHP-flan-t5-xl').to(device)
    >> input_text = "POST: Instacart gave me 50 pounds of limes instead of 5 pounds... what the hell do I do with 50 pounds of limes? I've already donated a bunch and gave a bunch away. I'm planning on making a bunch of lime-themed cocktails, but... jeez. Ceviche? \n\n RESPONSE A: Lime juice, and zest, then freeze in small quantities.\n\n RESPONSE B: Lime marmalade lol\n\n Which response is better? RESPONSE"
    >> x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
    >> y = model.generate(x, max_new_tokens=1)
    >> tokenizer.batch_decode(y, skip_special_tokens=True)
    ['A']
If the input exceeds the 512-token limit, you can use pysbd to break the input up into sentences and include only what fits within 512 tokens. When trying to cram an example into 512 tokens, we recommend truncating the context as much as possible and leaving the responses unchanged.
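A rough sketch of this truncation, assuming the pysbd package and the tokenizer loaded above (`truncate_context` is a hypothetical helper, not part of the model's code), might look like:

    import pysbd

    seg = pysbd.Segmenter(language="en", clean=False)

    def truncate_context(context, response_a, response_b, tokenizer, max_tokens=512):
        # Hypothetical helper: drop the earliest sentences of the context until
        # the full prompt fits, leaving the two responses untouched.
        sentences = seg.segment(context)
        while sentences:
            prompt = (f"POST: {' '.join(sentences)} \n\n RESPONSE A: {response_a}\n\n "
                      f"RESPONSE B: {response_b}\n\n Which response is better? RESPONSE")
            if len(tokenizer(prompt).input_ids) <= max_tokens:
                return prompt
            sentences = sentences[1:]  # dropping from the front keeps the most recent context
        return None  # the responses alone do not fit within the limit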
If you want to use SteamSHP-XL as a reward model, to get a score for a single response, then you need to structure the input such that RESPONSE A is the response you want to score and RESPONSE B is just an empty input:
POST: { the context, such as the 'history' column in SHP (not containing any newlines \n) } RESPONSE A: { continuation (not containing any newlines \n) } RESPONSE B: . Which response is better? RESPONSE
Then calculate the probability assigned to the label A. This probability (or the logit) is the score for the response:
>> input_text = "POST: Instacart gave me 50 pounds of limes instead of 5 pounds... what the hell do I do with 50 pounds of limes? I've already donated a bunch and gave a bunch away. I'm planning on making a bunch of lime-themed cocktails, but... jeez. Ceviche? \n\n RESPONSE A: Lime juice, and zest, then freeze in small quantities.\n\n RESPONSE B: .\n\n Which response is better? RESPONSE" >> x = tokenizer([input_text], return_tensors='pt').input_ids.to(device) >> outputs = model.generate(x, return_dict_in_generate=True, output_scores=True, max_new_tokens=1) >> torch.exp(outputs.scores[0][:, 71]) / torch.exp(outputs.scores[0][:,:]).sum(axis=1).item() # index 71 corresponds to the token for 'A' 0.819
Since RESPONSE B is just the empty input, the probability will usually be very high (in the 0.8 to 1.0 range), so you may want to normalize it.
You can also compare the two probabilities assigned independently to each response (given the same context) to infer the preference label. For example, if one response gets a probability of 0.95 and the other gets 0.80, then the former is preferred. Inferring preference labels this way only leads to an average drop of 0.006 in accuracy on the SHP + HH-RLHF test data, meaning there is only a small penalty for using SteamSHP-XL as a reward model instead of as a preference model.
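As an illustration of this usage (a minimal sketch reusing the model and tokenizer loaded above; `score_response` and the example strings are our own, not an official API), you could score two responses independently and pick the higher one:

    import torch

    def score_response(context, response, model, tokenizer, device='cuda'):
        # Hypothetical helper: put the response in the RESPONSE A slot, leave
        # RESPONSE B empty, and return the probability assigned to the label 'A'.
        input_text = (f"POST: {context} \n\n RESPONSE A: {response}\n\n "
                      f"RESPONSE B: .\n\n Which response is better? RESPONSE")
        x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
        out = model.generate(x, return_dict_in_generate=True, output_scores=True, max_new_tokens=1)
        probs = torch.softmax(out.scores[0], dim=-1)
        return probs[0, 71].item()  # index 71 corresponds to the token for 'A'

    post = "Instacart gave me 50 pounds of limes instead of 5 pounds... what do I do with 50 pounds of limes?"
    score_1 = score_response(post, "Lime juice, and zest, then freeze in small quantities.", model, tokenizer)
    score_2 = score_response(post, "Lime marmalade lol", model, tokenizer)
    preferred = "A" if score_1 >= score_2 else "B"  # the higher-scoring response is predicted to be preferred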
SteamSHP-XL was only finetuned on 125K of the 392K training examples that were available, since we found that:
We evaluated the model with accuracy on the SHP and HH-RLHF test data, but only on examples that could be truncated to fit within 500 tokens (18,621 of the 20,753 available test examples). SteamSHP-XL gets an average accuracy of 72.8% across all domains:
Domain | Accuracy |
---|---|
askculinary | 0.7199 |
askhr | 0.7743 |
askdocs | 0.7210 |
askanthropology | 0.7594 |
asksciencefiction | 0.7283 |
askacademia | 0.7442 |
askengineers | 0.7183 |
legaladvice | 0.8068 |
explainlikeimfive | 0.7392 |
askbaking | 0.6741 |
askphysics | 0.8000 |
askscience | 0.7114 |
askphilosophy | 0.6907 |
askvet | 0.7742 |
changemyview | 0.7043 |
askcarguys | 0.7568 |
askhistorians | 0.7476 |
asksocialscience | 0.7308 |
anthropic (helpfulness) | 0.7310 |
ALL (unweighted) | 0.7278 |
As mentioned above, if you use SteamSHP as a reward model and try to infer the preference label from the probability assigned to each response independently, that also works! But doing so costs about 0.006 in accuracy on the test data (on average across all domains), meaning there is a small penalty.
SteamSHP was trained to predict which response humans will find more helpful, not which response is less harmful. It should not be used to detect toxicity, make ethical judgments, or for similar purposes.
Biases and misinformation in the datasets used to train SteamSHP may also propagate into the model's predictions. Although SHP excluded posts with NSFW (over 18) content and drew from subreddits that are well-moderated and have policies against harassment and bigotry, some of the data may still contain discriminatory or harmful language. Responses that humans collectively found more helpful are also not guaranteed to be more factual.
The people whose preferences are captured in SHP and HH-RLHF are not representative of the broader population. Although no specific demographic information is available, Reddit users (whose preferences are recorded in SHP) are on the whole disproportionately male and from developed, Western, and English-speaking countries (Pew Research).
Past work by Anthropic has found that models optimized for human preferences can do so at the expense of the truth.
Please contact kawin@stanford.edu if you have any questions about this model. This model was created by Kawin Ethayarajh, Heidi (Chenyu) Zhang, Yizhong Wang, and Dan Jurafsky.
We have a forthcoming paper, but until then, please cite:
    @InProceedings{pmlr-v162-ethayarajh22a,
      title = {Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information},
      author = {Ethayarajh, Kawin and Choi, Yejin and Swayamdipta, Swabha},
      booktitle = {Proceedings of the 39th International Conference on Machine Learning},
      pages = {5988--6008},
      year = {2022},
      editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
      volume = {162},
      series = {Proceedings of Machine Learning Research},
      month = {17--23 Jul},
      publisher = {PMLR},
    }