Dataset:
HuggingFaceH4/stack-exchange-preferences
This dataset contains questions and answers from the Stack Overflow Data Dump for use in preference model training. Importantly, the questions have been filtered to satisfy the following criterion for preference models (following Askell et al. 2021): each question has at least 2 answers. This data could also be used for instruction fine-tuning and language model training.
Questions are grouped with their answers, and each answer is assigned a score following the Anthropic paper:
score = log2(1 + upvotes), rounded to the nearest integer, plus 1 if the answer was accepted by the questioner (we assign a score of −1 if the number of upvotes is negative).
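For concreteness, the rule above can be written out in code. This is a minimal sketch, not part of the released tooling; the `upvotes` and `accepted` inputs are assumptions, since the published data only exposes the final per-answer `pm_score`:

```python
import math

def pm_score(upvotes: int, accepted: bool) -> int:
    # Hypothetical reimplementation of the scoring rule described above.
    if upvotes < 0:
        return -1  # negative vote counts are pinned to a score of -1
    score = round(math.log2(1 + upvotes))  # rounded to the nearest integer
    if accepted:
        score += 1  # +1 if the questioner accepted the answer
    return score
```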
Some important notes when using this dataset for preference model pretraining (PMP), which can be ignored for other uses:
Subsequently, we created a binary dataset by applying a ‘binarization’ procedure to the ranked dataset. That is, for every ranked pair A > B, we transform it into two independent binary comparisons:

GOOD:A > BAD:A
BAD:B > GOOD:B
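To make the transformation concrete, here is a minimal sketch of that binarization step (the function name and the (preferred, dispreferred) tuple convention are illustrative, not part of the dataset's tooling):

```python
def binarize(a: str, b: str) -> list[tuple[str, str]]:
    # Given a ranked pair a > b, emit two independent binary comparisons
    # as (preferred, dispreferred) tuples, following Askell et al. 2021.
    return [
        ("GOOD: " + a, "BAD: " + a),  # GOOD:A > BAD:A
        ("BAD: " + b, "GOOD: " + b),  # BAD:B > GOOD:B
    ]
```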
To see all the Stack Exchanges used in this data, see this file.
Unfortunately, sharing the binarized data directly without metadata would violate the license, so we have instead provided a script for binarization.
Here is the script used by our internal tooling to create the binarized dataset:
```python
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
from argparse import ArgumentParser
from pathlib import Path

import numpy as np
from datasets import Dataset, concatenate_datasets, load_dataset

from h4.data.utils import save_dataset_shards

H4_DIR = Path(__file__).resolve().parents[3]
DATA_DIR = H4_DIR / "data"

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--debug", action="store_true", help="Added print statements / limit data size for debugging")
    parser.add_argument(
        "--output_dir",
        default=f"{DATA_DIR}/pmp-binarized",
        type=str,
        help="Where to save the processed dataset",
    )
    parser.add_argument(
        "--exchange_name",
        type=str,
        default=None,
        help="Optional argument to specify a specific subsection of the dataset",
    )
    parser.add_argument(
        "--binary_score", type=int, default=8, help="Score assigned to binarized pairs for preference data."
    )
    parser.add_argument(
        "--stream_data", action="store_true", help="Optionally stream data, which can be useful with weaker computers"
    )
    parser.set_defaults(debug=False, stream_data=False)  # default will process full dataset

    args = parser.parse_args()
    specific_exchange = args.exchange_name
    stream_dataset = args.stream_data
    binary_score = args.binary_score

    if specific_exchange:
        data_dir = "data/" + args.exchange_name
    else:
        data_dir = None

    if args.debug:
        data_len_limit = 10000
    else:
        data_len_limit = np.inf

    dataset = load_dataset(
        "HuggingFaceH4/pmp-stack-exchange",
        data_dir=data_dir,
        split="train",
        streaming=stream_dataset,
    )

    pmp_data = []
    for i, d in enumerate(iter(dataset)):
        # check debug limit, quit if in debug mode (don't save)
        if i > data_len_limit:
            print("Early exit for debug mode!")
            print(pmp_data)
            break

        question = d["question"]
        answers = d["answers"]
        num_answers = len(answers)

        answer_scores = [a["pm_score"] for a in answers]
        if len(np.unique(answer_scores)) < 2:
            print(f"PM Scores are {answer_scores}, skipping this question {i}")
        else:
            # Sample 2 unique scores for binarization
            dif_scores = False
            while not dif_scores:
                # print("infinite loop...?")
                two_answers = random.sample(answers, 2)

                if two_answers[0]["pm_score"] != two_answers[1]["pm_score"]:
                    dif_scores = True

            answer_0 = two_answers[0]
            answer_1 = two_answers[1]
            text_0 = "Question: " + question + "\n" + "Answer: " + answer_0["text"]
            text_1 = "Question: " + question + "\n" + "Answer: " + answer_1["text"]
            score_0 = binary_score
            score_1 = binary_score

            pmp_data.append({"context": text_0, "score": score_0})
            pmp_data.append({"context": text_1, "score": score_1})

    # Save binarized data
    sublist_len = 100000

    print(f"Dataset length is {len(pmp_data)}")
    # bypass known issue in arrow https://issues.apache.org/jira/browse/ARROW-17137
    print(f"Processed dataset length > {sublist_len}, processing to HF dataset in chunks")
    chunks = [pmp_data[x : x + sublist_len] for x in range(0, len(pmp_data), sublist_len)]
    ds_chunks = [Dataset.from_list(ch) for ch in chunks]
    ds = concatenate_datasets(ds_chunks)

    save_dataset_shards(ds, args.output_dir, subset="stackexchange", shard_size="100MB")
```
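If you only want to inspect the released preference data rather than rebuild it, a minimal loading sketch follows; it assumes the public dataset exposes the same `question` / `answers` / `pm_score` fields that the script above reads:

```python
from datasets import load_dataset

# Stream to avoid downloading the full dump up front.
dataset = load_dataset(
    "HuggingFaceH4/stack-exchange-preferences",
    split="train",
    streaming=True,
)

example = next(iter(dataset))
print(example["question"][:200])
for answer in example["answers"]:
    print(answer["pm_score"], answer["text"][:80])
```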
This is intended to be English only, but other languages may be present. Some of the omitted Stack Exchanges include:

Spanish: es.meta.stackoverflow.com, es.stackoverflow.com
Japanese: ja.meta.stackoverflow.com, ja.stackoverflow.com
Portuguese: pt.stackoverflow.com, pt.meta.stackoverflow.com
Russian: ru.stackoverflow, ru.meta.stackoverflow
License: https://creativecommons.org/licenses/by-sa/4.0/
The cc-by-sa 4.0 license, while intentionally permissive, requires attribution:

Attribution — You must indicate, in the manner specified by the author or licensor, that the work comes from the Stack Exchange network. This can be satisfied with a discreet text blurb or some other unobtrusive but clear visual indication.

Ensure that any Internet use of the content includes a hyperlink directly to the original question on the source site (e.g., http://stackoverflow.com/questions/12345).

Visually display or otherwise indicate the author names for every question and answer used.

Ensure that any Internet use of the content hyperlinks each author name directly back to their user profile page on the source site (e.g., http://stackoverflow.com/users/12345/username), pointing directly to the Stack Exchange domain in standard HTML (i.e., not through a TinyURL or any other such indirection, obfuscation, or redirection), without any "nofollow" directive or any other such means of avoiding detection by search engines, and visible even with JavaScript disabled. See the Stack Exchange Terms of Service for details.
@online{h4stackexchange,
  author = {Lambert, Nathan and Tunstall, Lewis and Rajani, Nazneen and Thrush, Tristan},
  title = {HuggingFace H4 Stack Exchange Preference Dataset},
  year = 2023,
  url = {https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences},
}