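Building a Reward Model for Your LLM Using RLHF in Python

Published September 20, 2023 by alex

Reinforcement learning from human feedback (RLHF) is a powerful technique for improving the performance of language models such as GPT-3. A key part of RLHF is training a reward model that guides the fine-tuning process. In this article, we walk through the steps of creating a dataset for collecting human preferences and training a reward model with the excellent trl library.

The workflow we follow, shown in the accompanying figure, comes from Argilla, an open-source data curation platform for LLMs.

Step 1: Set up the Argilla server

You need a running Argilla server. If you have not set one up yet, follow Argilla's quickstart or installation instructions.

Step 2: Install the required packages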

!pip install -U argilla pandas trl plotly -qqq

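This command installs the required Python packages: argilla (the Argilla client), pandas (data manipulation), trl (Transformer Reinforcement Learning, which provides the reward model trainer), and plotly (charting).

Step 3: Import the required libraries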

import random
import torch
from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)
from trl import RewardTrainer
import argilla as rg

These imports cover data handling, model training, and working with Argilla.

Step 4: Initialize the Argilla client (optional)

rg.init(
    api_url="http://localhost:6900",  # Replace with your Argilla server's URL
    api_key="admin.apikey"  # Replace with your API key if applicable
)

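This step initializes the Argilla client with a URL and an API key. It is needed if you are running Argilla with the Docker quickstart or on Hugging Face Spaces; if you run Argilla locally without either of these, you may be able to skip it.

Step 5: Load the dataset

Load a dataset with the load_dataset function from the datasets library. Here the dataset is named "argilla/dolly-curated-comparison-falcon-7b-instruct", and we select its "train" split.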

hf_dataset = load_dataset("argilla/dolly-curated-comparison-falcon-7b-instruct", split="train")

Step 6: Convert to a pandas DataFrame

After loading the dataset, convert it to a pandas DataFrame to make it easier to manipulate and explore.

df = hf_dataset.to_pandas()
df # printing the dataframe

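Step 7: Define the fields for the dataset

To build a reward model, we want annotators to evaluate the quality of two responses to a given prompt and rank them from most to least preferred. To do that, we need to set up the fields to display and formulate the specific question to put to annotators.

In this case we have three fields: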

fields = [
    rg.TextField(name="instruction", title="User instruction"),
    rg.TextField(name="response-1"),
    rg.TextField(name="response-2")
]

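1. instruction: the user's instruction or prompt.

2. response-1: the first response.

3. response-2: the second response.

Step 8: Configure the annotation question

We set up a single question for annotators to answer. In this use case, annotators are asked to pick the better of the two responses provided. You can also configure more complex ranking tasks, but for simplicity we focus on choosing the better of two options.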

question = rg.RatingQuestion(
    name="choose-best",
    title="Choose the best response:",
    description="Choose the most helpful, harmless, and truthful response. Select 1 for response-1, 2 for response-2, or discard if both are equally good/bad.",
    values=[1,2],
    required=True
)

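1. name: the name assigned to this question.

2. title: the question title shown to annotators while labeling.

3. description: detailed instructions for annotators on how to choose the best response. It states the selection criteria and the available options (1 for response-1, 2 for response-2, or discard the record when both are equally good or bad).

4. values: the values annotators can choose from (1 or 2 in this example).

5. required: annotators must answer this question.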
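Step 8 notes that more complex ranking tasks are possible. If annotators ranked more than two responses per prompt, one common way to feed a pairwise reward model is to expand each ranking into every (chosen, rejected) pair it implies. The sketch below illustrates that idea in plain Python; the function name ranking_to_pairs is hypothetical and is not part of Argilla or trl.

```python
from itertools import combinations
from typing import List, Tuple


def ranking_to_pairs(ranked_responses: List[str]) -> List[Tuple[str, str]]:
    """Turn a best-to-worst ranking into (chosen, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair always ranks higher than the second.
    return list(combinations(ranked_responses, 2))


pairs = ranking_to_pairs(["best answer", "okay answer", "worst answer"])
for chosen, rejected in pairs:
    print(f"chosen={chosen!r}  rejected={rejected!r}")
```

A ranking of K responses yields K * (K - 1) / 2 pairs, so three ranked responses produce three comparisons.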

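Step 9: Provide annotation guidelines

We provide annotation guidelines to direct annotators. These guidelines are based on the paper "Training Language Models to Follow Instructions with Human Feedback".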

guidelines = """These guidelines are based on the paper 
[Training Language Models to Follow Instructions with Human Feedback]. 
(You can include your specific guidelines here.)
"""

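These guidelines help annotators understand the task and make informed decisions when choosing the best response.

Step 10: Build the comparison records

In this step we create the comparison records used to collect feedback. Each record pairs an instruction (the prompt) with two responses. The code below assigns the human-written original response to "response-1" and a Falcon-generated response to "response-2"; it does not shuffle the two, so randomize the order yourself if you want to keep annotators from learning which slot holds the human answer.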

# Building records from the hf dataset
records = [
    rg.FeedbackRecord(fields={"instruction": r["prompt"], "response-1": r["original_response"], "response-2": r["response-2"]})
    for r in hf_dataset
]
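If you do want the two responses in random order, a minimal sketch could look like the following. It works on plain dictionaries with the same column names used above (prompt, original_response, response-2); the helper name build_shuffled_fields and the human_slot key are assumptions made for illustration, not Argilla features.

```python
import random
from typing import Any, Dict, Optional


def build_shuffled_fields(
    row: Dict[str, Any], rng: Optional[random.Random] = None
) -> Dict[str, Any]:
    """Place the human and model responses in random slots."""
    rng = rng or random.Random()
    human = row["original_response"]
    model = row["response-2"]
    # Flip a fair coin to decide which response is shown first.
    human_first = rng.random() < 0.5
    first, second = (human, model) if human_first else (model, human)
    return {
        "fields": {
            "instruction": row["prompt"],
            "response-1": first,
            "response-2": second,
        },
        # Remember where the human answer went so it can be recovered later.
        "human_slot": "response-1" if human_first else "response-2",
    }


example_row = {
    "prompt": "What is depreciation?",
    "original_response": "human answer",
    "response-2": "model answer",
}
print(build_shuffled_fields(example_row, random.Random(0)))
```

The "fields" dictionary has the same shape as the one passed to rg.FeedbackRecord above, so it can be used in its place.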

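Step 11: Create the dataset configuration

Now we define a dataset configuration that includes the fields, the question, and the guidelines for annotators.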

# Creating a dataset configuration
dataset = rg.FeedbackDataset(
    fields=fields,
    questions=[question],
    guidelines=guidelines
)

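Step 12: Add the records and publish the dataset

We add the records built in Step 10 to the dataset and make it available to annotators. The dataset is named "comparison-data-falcon".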

# Adding records to the dataset and publishing it
dataset.add_records(records)
dataset.push_to_argilla(name="comparison-data-falcon")

The dataset is now ready, and annotators can submit their input through the configured feedback interface.

Step 13 (optional): Push to the Hugging Face Hub

If you want to share this dataset with others for reproduction and reuse, you can upload it to the Hugging Face Hub.

# Pushing the dataset to the Hugging Face Hub (optional)
dataset.push_to_huggingface("comparison-data-falcon")

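This step lets other users access the dataset, and provides information about its structure, guidelines, and how to import it.

Step 14: Retrieve the labeled dataset

Depending on whether you have labeled any data points in the Argilla UI, there are two options for retrieving the labeled dataset.

1. If you have not labeled any responses, you can retrieve a pre-labeled dataset from the Hugging Face Hub. It already contains ranked responses and is ready to use.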

# If you haven't labeled any responses with the UI, run this cell to retrieve the labeled dataset from Hugging Face Hub.
feedback_dataset = rg.FeedbackDataset.from_huggingface("argilla/comparison-data-falcon-with-feedback")

2. If you have labeled some responses in the UI, you can retrieve the labeled dataset from Argilla.

# If you have labeled some examples, run this cell to retrieve the labeled dataset from Argilla.
feedback_dataset = rg.FeedbackDataset.from_argilla('comparison-data-falcon')

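Step 15: Prepare the dataset for reward modeling

Next, we prepare the dataset in the standard format for training a reward model: for each record, a chosen response and a rejected response, selected according to the annotators' feedback. We create a training task instance for reward modeling and use a formatting function to prepare the data.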

from collections import Counter
from typing import Any, Dict

from argilla.feedback import TrainingTask


def formatting_func(sample: Dict[str, Any]):
    # Keep only the answers that annotators actually submitted.
    values = [
        annotation["value"]
        for annotation in sample["choose-best"]
        if annotation["status"] == "submitted"
    ]
    # Majority vote across annotators. Note: this raises IndexError for a
    # record with no submitted answers.
    winning_response = Counter(values).most_common(1)[0][0]
    if winning_response == 1:
        chosen = sample["response-1"]
        rejected = sample["response-2"]
    else:
        chosen = sample["response-2"]
        rejected = sample["response-1"]
    return chosen, rejected


task = TrainingTask.for_reward_modeling(formatting_func=formatting_func)
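To see what the formatting function does without an Argilla server, the sketch below runs the same majority vote on hand-written samples. The sample dictionaries mimic the structure the function reads (a "choose-best" list of annotations with value and status); returning None for records with no submitted answer is this sketch's own convention, and the name pick_chosen_rejected is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Optional, Tuple


def pick_chosen_rejected(sample: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) by majority vote, or None if nobody voted."""
    values = [
        annotation["value"]
        for annotation in sample["choose-best"]
        if annotation["status"] == "submitted"
    ]
    if not values:
        return None  # nothing to learn from this record
    # On a tie, most_common keeps the value that was seen first.
    winner = Counter(values).most_common(1)[0][0]
    if winner == 1:
        return sample["response-1"], sample["response-2"]
    return sample["response-2"], sample["response-1"]


sample = {
    "response-1": "short, correct answer",
    "response-2": "long, rambling answer",
    "choose-best": [
        {"value": 1, "status": "submitted"},
        {"value": 2, "status": "submitted"},
        {"value": 1, "status": "submitted"},
        {"value": 2, "status": "discarded"},
    ],
}
print(pick_chosen_rejected(sample))
```

Here two of the three submitted votes favor response-1, so it becomes the chosen response; the discarded vote is ignored.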

Step 16: Inspect the resulting dataset

After preparing the data for training with TRL, you can inspect the resulting dataset.

dataset = feedback_dataset.prepare_for_training(framework="trl", task=task)
dataset
#### Output starts ####
Dataset({
    features: ['chosen', 'rejected'],
    num_rows: 7401
})
#### Output ends ####

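This shows information about the dataset, including its features and number of rows. You can also access individual data points, each holding a chosen and a rejected response.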

dataset[0]
#### Output ####
{'chosen': "Depreciation is the drop in value of an asset due to wear and 
tear, age and obsolescence (going out of date) as recorded in an 
organization's financial records.",
 'rejected': 'What is Depreciation – 10 Important Facts to Know?\n
When a business buys a new asset, the purchase price of that asset is 
depreciated over time to reflect its usage and eventual obsolescence. 
Depreciation expense can be a tax deductible expense and is usually a 
non-cash expense reported on a company’s income statement and balance sheet. 
The amount of depreciation expense a company reports each year is the 
difference between the original purchase price of the asset and what the 
current value of that asset might be. Here are 10 important facts to know 
about depreciation:\n1. Depreciation is a non-cash expense. It is an expense 
that is reported in a business’s income statement and balance sheet and not 
a cash flow expense.\n2. Depreciation is an accounting standard and it is 
required to be disclosed in a business’s financial statements.\n3. 
The amount of depreciation is usually a tax expense and not a cash expense 
reported on a company’s income statement'}

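The dataset is now ready to be used as comparison data for training the reward model.

Step 17: Choose a base model

Before training a reward model, you need to choose a base model to fine-tune. Typically this is the supervised fine-tuned model obtained from the instruction-tuning step. In this example we use distilroberta-base, but you can experiment with other models.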

model_name = "distilroberta-base"

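Step 18: Initialize the ArgillaTrainer

Create an instance of ArgillaTrainer, which handles the training process for the reward model. You pass it the dataset, the task, the framework (trl), the chosen base model, and other training configuration options.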

from argilla.feedback import ArgillaTrainer  # not imported in Step 3

trainer = ArgillaTrainer(
    dataset=feedback_dataset,
    task=task,
    framework="trl",
    model=model_name,
    train_size=0.8,
)

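1. dataset: the annotated dataset (feedback_dataset from Step 14).

2. task: the reward modeling task configuration.

3. framework: set to "trl" for reward modeling.

4. model: the base model you chose to fine-tune.

5. train_size: the proportion of the dataset used for training (80% in this example).

Step 19: Update the training configuration

Adjust training configuration options such as batch size, evaluation strategy, and logging frequency. Here we set the batch size, evaluate at regular step intervals, and specify the logging interval.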

trainer.update_config(
    per_device_train_batch_size=16,
    evaluation_strategy="steps",
    logging_steps=200,
)

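Step 20: Start training

Start the training process for the reward model. The model learns from the examples in the dataset to tell preferred responses from rejected ones. Once trained, it assigns higher scores to preferred responses and lower scores to rejected ones.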

trainer.train("./reward_model")
#### Output Starts #####
{'eval_loss': 0.1626577377319336, 
'eval_accuracy': 0.937204591492235,
'eval_runtime': 6.5907, 
'eval_samples_per_second': 224.709, 
'eval_steps_per_second': 28.221,
'epoch': 1.0}     
#### Output Ends #####
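To make eval_loss and eval_accuracy more concrete, here is a minimal sketch of how such metrics can be computed from reward scores. It assumes the standard pairwise objective for reward models, -log(sigmoid(chosen_score - rejected_score)), and counts a pair as correct when the chosen response scores higher; the function names and the scores are made up for illustration and do not come from the training run above.

```python
import math
from typing import List, Tuple


def pairwise_loss(chosen_score: float, rejected_score: float) -> float:
    """Compute -log(sigmoid(chosen - rejected)) in a numerically stable way."""
    margin = chosen_score - rejected_score
    # -log(sigmoid(m)) equals log(1 + exp(-m)); pick the branch that keeps
    # the argument of exp() non-positive to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(score_pairs: List[Tuple[float, float]]) -> float:
    """Fraction of pairs in which the chosen response outscores the rejected one."""
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)


# Hypothetical (chosen_score, rejected_score) pairs.
scores = [(2.0, -1.0), (0.5, 0.7), (3.0, 0.0), (1.0, -2.0)]
mean_loss = sum(pairwise_loss(c, r) for c, r in scores) / len(scores)
print("mean loss:", mean_loss)
print("accuracy:", pairwise_accuracy(scores))
```

The loss shrinks as the chosen score pulls ahead of the rejected one, which is why a low loss and a high accuracy tend to go together.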

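Step 21: Load the tokenizer and model

The model is fully open source and available on the Hugging Face Hub. You can now use it on your own data.

Load the tokenizer and model from the Hugging Face Hub. In this example, we use AutoTokenizer and AutoModelForSequenceClassification to load the pretrained reward model.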

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained(
                    "argilla/roberta-base-reward-model-falcon-dolly")
model = AutoModelForSequenceClassification.from_pretrained(
                    "argilla/roberta-base-reward-model-falcon-dolly")

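Step 22: Define a function to get scores

Create a function named get_score that takes the model, the tokenizer, a prompt, and a response as input. The function tokenizes the input sequences, runs a forward pass through the model, and extracts the logits, returning the single logit as the score.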

def get_score(model, tokenizer, prompt, response):
    # Tokenize the input sequences
    inputs = tokenizer.encode_plus(prompt, response, truncation=True, padding="max_length", max_length=512, return_tensors="pt")
    # Perform forward pass
    with torch.no_grad():
        outputs = model(**inputs)
    # Extract the logits
    logits = outputs.logits
    return logits.item()

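Step 23: Use the reward model

You can now use get_score to score different responses to a given prompt. The score indicates how strongly the reward model expects annotators to prefer that response.

Here is an example: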

prompt = "What is Depreciation"
example_less_pref_response = "What is Depreciation – 10 Important Facts to Know? [..]"  # Insert the actual response here
example_preferred_response = "Depreciation is the drop in value of an asset due to wear and tear [..]"  # Insert the actual response here
# Get the score for the less preferred response
score_less_pref = get_score(model, tokenizer, prompt, example_less_pref_response)
print("Score for less preferred response:", score_less_pref)
# >> -3.915163993835449
# Get the score for the preferred response
score_preferred = get_score(model, tokenizer, prompt, example_preferred_response)
print("Score for preferred response:", score_preferred)
# >> 7.460323333740234
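With more than two candidates, the same scores can be used to rank responses. The sketch below sorts candidates by score, highest first. To stay runnable without downloading a model it uses a stand-in scorer, toy_score, which simply prefers shorter text; both rank_responses and toy_score are hypothetical names introduced here.

```python
from typing import Callable, List, Tuple


def rank_responses(
    prompt: str,
    responses: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every response and return (response, score) pairs, best first."""
    scored = [(response, score_fn(prompt, response)) for response in responses]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_score(prompt: str, response: str) -> float:
    # Stand-in for a real reward model: shorter responses score higher.
    return -float(len(response))


candidates = [
    "Depreciation is the drop in value of an asset over time.",
    "What is Depreciation? Here are 10 important facts to know about depreciation ...",
    "Depreciation: loss of asset value.",
]
for response, score in rank_responses("What is Depreciation", candidates, toy_score):
    print(score, response)
```

To rank with the reward model itself, pass a wrapper around get_score instead, for example lambda p, r: get_score(model, tokenizer, p, r).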

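Conclusion

Training a reward model for RLHF can significantly improve the performance of language models such as GPT-3. By following the steps outlined in this article, you can create a dataset for collecting human preferences, train a reward model, and use it to rank responses based on user feedback. This approach helps fine-tune language models to produce more helpful, truthful, and relevant responses, which ultimately makes them more useful across a range of applications.

Source: https://medium.com/python-in-plain-english/building-a-reward-model-for-your-llm-using-rlhf-in-python-49abaf4906f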
