Model:
stanfordnlp/SteamSHP-flan-t5-large
SteamSHP-Large is a preference model trained to predict which of two possible responses humans will find more helpful, given some context. It can be used for evaluating natural language generation or as a reward model for reinforcement learning.
It is a FLAN-T5-large model (780M parameters), finetuned on the Stanford Human Preferences (SHP) dataset and the helpfulness data in Anthropic's HH-RLHF dataset.
There is also a larger variant, SteamSHP-XL, obtained by finetuning the FLAN-T5-xl model (3B parameters).
The input text should be of the following format:
POST: { the context, such as the 'history' column in SHP (not containing any newlines \n) }
RESPONSE A: { first possible continuation (not containing any newlines \n) }
RESPONSE B: { second possible continuation (not containing any newlines \n) }
Which response is better? RESPONSE
The output generated by SteamSHP-Large will either be A or B.
Here's how to use the model:
>> from transformers import T5ForConditionalGeneration, T5Tokenizer
>> device = 'cuda' # if you have a GPU
>> tokenizer = T5Tokenizer.from_pretrained('stanfordnlp/SteamSHP-flan-t5-large')
>> model = T5ForConditionalGeneration.from_pretrained('stanfordnlp/SteamSHP-flan-t5-large').to(device)
>> input_text = "POST: Instacart gave me 50 pounds of limes instead of 5 pounds... what the hell do I do with 50 pounds of limes? I've already donated a bunch and gave a bunch away. I'm planning on making a bunch of lime-themed cocktails, but... jeez. Ceviche? \n\n RESPONSE A: Lime juice, and zest, then freeze in small quantities.\n\n RESPONSE B: Lime marmalade lol\n\n Which response is better? RESPONSE"
>> x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
>> y = model.generate(x, max_new_tokens=1)
>> tokenizer.batch_decode(y, skip_special_tokens=True)
['B']
If the input exceeds the 512-token limit, you can use pysbd to break the input up into sentences and include only what fits into 512 tokens. When trying to compress an example down to 512 tokens, we recommend truncating the context as much as possible and leaving the responses unchanged.
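For example, here is a minimal sketch of that truncation strategy (the template string and the variables post, response_a, and response_b are assumptions for illustration, not part of the model card):

>> import pysbd
>> seg = pysbd.Segmenter(language="en", clean=False)
>> template = "POST: {}\n\n RESPONSE A: {}\n\n RESPONSE B: {}\n\n Which response is better? RESPONSE"
>> sentences = seg.segment(post)  # 'post' holds the over-long context (assumed variable)
>> # drop trailing context sentences until the full prompt fits, keeping the responses intact
>> while sentences and len(tokenizer(template.format(' '.join(sentences), response_a, response_b)).input_ids) > 512:
>>     sentences = sentences[:-1]
>> input_text = template.format(' '.join(sentences), response_a, response_b)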
If you want to use SteamSHP-Large as a reward model, to get a score for a single response, then you need to structure the input such that RESPONSE A is the response you want to score and RESPONSE B is just an empty input:
POST: { the context, such as the 'history' column in SHP (not containing any newlines \n) }
RESPONSE A: { continuation (not containing any newlines \n) }
RESPONSE B: .
Which response is better? RESPONSE
Then calculate the probability assigned to the label A. This probability (or the logit, depending on what you want) is the score for the response:
>> import torch
>> input_text = "POST: Instacart gave me 50 pounds of limes instead of 5 pounds... what the hell do I do with 50 pounds of limes? I've already donated a bunch and gave a bunch away. I'm planning on making a bunch of lime-themed cocktails, but... jeez. Ceviche? \n\n RESPONSE A: Lime juice, and zest, then freeze in small quantities.\n\n RESPONSE B: .\n\n Which response is better? RESPONSE"
>> x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
>> outputs = model.generate(x, return_dict_in_generate=True, output_scores=True, max_new_tokens=1)
>> torch.exp(outputs.scores[0][:, 71]) / torch.exp(outputs.scores[0][:,:]).sum(axis=1).item() # index 71 corresponds to the token for 'A'
0.8617
This probability will almost always be high (in the 0.8 to 1.0 range), since RESPONSE B is just an empty input. You may therefore want to normalize the probability.
You can also infer a preference label by comparing the probabilities assigned independently (under the same context) to each of two responses. For example, if one response has a probability of 0.95 and the other 0.80, the former would be preferred. Inferring preference labels this way only drops the average accuracy on the SHP + HH-RLHF test data by 0.005, meaning that the penalty for using SteamSHP as a reward model instead of as a preference model is very small.
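Here is a minimal sketch of that comparison, reusing the snippet above to score each response in turn (post, response_a, and response_b are placeholder variables, not part of the model card):

>> template = "POST: {}\n\n RESPONSE A: {}\n\n RESPONSE B: .\n\n Which response is better? RESPONSE"
>> scores = []
>> for response in (response_a, response_b):
>>     x = tokenizer([template.format(post, response)], return_tensors='pt').input_ids.to(device)
>>     outputs = model.generate(x, return_dict_in_generate=True, output_scores=True, max_new_tokens=1)
>>     logits = outputs.scores[0][0]
>>     scores.append((torch.exp(logits[71]) / torch.exp(logits).sum()).item())  # P('A'), the score for this response
>> preferred = 'A' if scores[0] > scores[1] else 'B'  # whichever response scores higher is preferred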
SteamSHP-Large was only finetuned on 125K of the 392K training examples that were available, since we found that:
We evaluated the model with accuracy on the SHP and HH-RLHF test data, but only on the test examples that could be truncated to fit within 500 tokens (18621 of the 20753 available test examples). SteamSHP-Large achieves an average accuracy of 72.0% across all domains:
| Domain | Accuracy | 
|---|---|
| askculinary | 0.7199 | 
| askhr | 0.7507 | 
| askdocs | 0.6920 | 
| askanthropology | 0.7925 | 
| asksciencefiction | 0.7266 | 
| askacademia | 0.7442 | 
| askengineers | 0.7146 | 
| legaladvice | 0.7958 | 
| explainlikeimfive | 0.7312 | 
| askbaking | 0.6656 | 
| askphysics | 0.7888 | 
| askscience | 0.6926 | 
| askphilosophy | 0.6837 | 
| askvet | 0.7696 | 
| changemyview | 0.6984 | 
| askcarguys | 0.7297 | 
| askhistorians | 0.7476 | 
| asksocialscience | 0.8231 | 
| anthropic (helpfulness) | 0.7310 | 
| ALL (unweighted) | 0.7203 | 
As mentioned above, if you use SteamSHP as a reward model and try to infer preference labels from the probability independently assigned to each response, that also works well! But doing so drops the average accuracy on the test data by 0.005, meaning there is a small penalty.
SteamSHP is trained to predict which response humans will find more helpful, not which response is less harmful. Therefore, it should not be used to detect toxicity, make ethical judgments, or for similar purposes.
Biases and misinformation in the datasets used to train SteamSHP may be propagated into the model's predictions. Although SHP filtered out posts with NSFW (over 18) content and chose subreddits that are well-moderated and have policies against harassment and bigotry, some of the data may still contain discriminatory or harmful language. Responses that humans collectively found more helpful are also not guaranteed to be more factually accurate.
The people whose preferences are captured in SHP and HH-RLHF are not representative of the broader population. Although specific demographic information is not available, the Reddit users whose preferences are captured in SHP are, on the whole, disproportionately male and from developed, Western, English-speaking countries (per Pew Research).
Anthropic's research has found that models optimized for human preferences can do so at the expense of truthfulness.
Please contact kawin@stanford.edu if you have any questions about the model. This model was created by Kawin Ethayarajh, Heidi (Chenyu) Zhang, Yizhong Wang, and Dan Jurafsky.
We have a forthcoming paper, but until then, please cite:
@InProceedings{pmlr-v162-ethayarajh22a,
  title = 	 {Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information},
  author =       {Ethayarajh, Kawin and Choi, Yejin and Swayamdipta, Swabha},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {5988--6008},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher = {PMLR},
}