Dataset:
HuggingFaceH4/stack-exchange-preferences
This dataset contains questions and answers from the Stack Overflow Data Dump for use in preference model training. Importantly, the questions have been filtered to satisfy the following criterion for preference models (following Askell et al. 2021): each question has at least 2 answers. This data could also be used for instruction fine-tuning and language model training.
Questions are grouped with their answers, and each answer is assigned a score following the Anthropic paper:
score = log2(1 + upvotes), rounded to the nearest integer, plus 1 if the answer was accepted by the questioner (we assign a score of −1 if the number of upvotes is negative).
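For concreteness, the rule above can be written out in code. This is a minimal sketch, not part of the released tooling; the `upvotes` and `accepted` inputs are assumptions, since the published data only exposes the final per-answer `pm_score`:

```python
import math

def pm_score(upvotes: int, accepted: bool) -> int:
    # Hypothetical reimplementation of the scoring rule described above.
    if upvotes < 0:
        return -1  # negative vote counts are pinned to a score of -1
    score = round(math.log2(1 + upvotes))  # rounded to the nearest integer
    if accepted:
        score += 1  # +1 if the questioner accepted the answer
    return score
```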
Some important notes when using this dataset for preference model pretraining (PMP), which can be ignored for other uses:
Subsequently, we created a binary dataset by applying a ‘binarization’ procedure to the ranked dataset. That is, for every ranked pair A > B, we transform it into two independent binary comparisons:

GOOD:A > BAD:A
BAD:B > GOOD:B
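To make the transformation concrete, here is a minimal sketch of that binarization step (the function name and the (preferred, dispreferred) tuple convention are illustrative, not part of the dataset's tooling):

```python
def binarize(a: str, b: str) -> list[tuple[str, str]]:
    # Given a ranked pair a > b, emit two independent binary comparisons
    # as (preferred, dispreferred) tuples, following Askell et al. 2021.
    return [
        ("GOOD: " + a, "BAD: " + a),  # GOOD:A > BAD:A
        ("BAD: " + b, "GOOD: " + b),  # BAD:B > GOOD:B
    ]
```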
To see all the Stack Exchanges used in this data, see this file.
Unfortunately, sharing the binarized data directly without metadata would violate the license, so we have instead provided a script for binarization.
Here is the script used by our internal tooling to create the binarized dataset:
```python
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
from argparse import ArgumentParser
from pathlib import Path

import numpy as np
from datasets import Dataset, concatenate_datasets, load_dataset

from h4.data.utils import save_dataset_shards

H4_DIR = Path(__file__).resolve().parents[3]
DATA_DIR = H4_DIR / "data"

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--debug", action="store_true", help="Added print statements / limit data size for debugging")
    parser.add_argument(
        "--output_dir",
        default=f"{DATA_DIR}/pmp-binarized",
        type=str,
        help="Where to save the processed dataset",
    )
    parser.add_argument(
        "--exchange_name",
        type=str,
        default=None,
        help="Optional argument to specify a specific subsection of the dataset",
    )
    parser.add_argument(
        "--binary_score", type=int, default=8, help="Score assigned to binarized pairs for preference data."
    )
    parser.add_argument(
        "--stream_data", action="store_true", help="Optionally stream data, which can be useful with weaker computers"
    )
    parser.set_defaults(debug=False, stream_data=False)  # default will process full dataset

    args = parser.parse_args()
    specific_exchange = args.exchange_name
    stream_dataset = args.stream_data
    binary_score = args.binary_score

    if specific_exchange:
        data_dir = "data/" + args.exchange_name
    else:
        data_dir = None

    if args.debug:
        data_len_limit = 10000
    else:
        data_len_limit = np.inf

    dataset = load_dataset(
        "HuggingFaceH4/pmp-stack-exchange",
        data_dir=data_dir,
        split="train",
        streaming=stream_dataset,
    )

    pmp_data = []
    for i, d in enumerate(iter(dataset)):
        # check debug limit, quit if in debug mode (don't save)
        if i > data_len_limit:
            print("Early exit for debug mode!")
            print(pmp_data)
            break

        question = d["question"]
        answers = d["answers"]
        num_answers = len(answers)

        answer_scores = [a["pm_score"] for a in answers]
        if len(np.unique(answer_scores)) < 2:
            print(f"PM Scores are {answer_scores}, skipping this question {i}")
        else:
            # Sample 2 unique scores for binarization
            dif_scores = False
            while not dif_scores:
                # print("infinite loop...?")
                two_answers = random.sample(answers, 2)

                if two_answers[0]["pm_score"] != two_answers[1]["pm_score"]:
                    dif_scores = True

            answer_0 = two_answers[0]
            answer_1 = two_answers[1]
            text_0 = "Question: " + question + "\n" + "Answer: " + answer_0["text"]
            text_1 = "Question: " + question + "\n" + "Answer: " + answer_1["text"]
            score_0 = binary_score
            score_1 = binary_score

            pmp_data.append({"context": text_0, "score": score_0})
            pmp_data.append({"context": text_1, "score": score_1})

    # Save binarized data
    sublist_len = 100000

    print(f"Dataset length is {len(pmp_data)}")
    # bypass known issue in arrow https://issues.apache.org/jira/browse/ARROW-17137
    print(f"Processed dataset length > {sublist_len}, processing to HF dataset in chunks")
    chunks = [pmp_data[x : x + sublist_len] for x in range(0, len(pmp_data), sublist_len)]
    ds_chunks = [Dataset.from_list(ch) for ch in chunks]
    ds = concatenate_datasets(ds_chunks)

    save_dataset_shards(ds, args.output_dir, subset="stackexchange", shard_size="100MB")
```
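If you only want to inspect the released preference data rather than rebuild it, a minimal loading sketch follows; it assumes the public dataset exposes the same `question` / `answers` / `pm_score` fields that the script above reads:

```python
from datasets import load_dataset

# Stream to avoid downloading the full dump up front.
dataset = load_dataset(
    "HuggingFaceH4/stack-exchange-preferences",
    split="train",
    streaming=True,
)

example = next(iter(dataset))
print(example["question"][:200])
for answer in example["answers"]:
    print(answer["pm_score"], answer["text"][:80])
```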
This is intended to be English only, but other languages may be present. Some of the omitted Stack Exchanges include:

Spanish: es.meta.stackoverflow.com, es.stackoverflow.com
Japanese: ja.meta.stackoverflow.com, ja.stackoverflow.com
Portuguese: pt.stackoverflow.com, pt.meta.stackoverflow.com
Russian: ru.stackoverflow, ru.meta.stackoverflow
License: https://creativecommons.org/licenses/by-sa/4.0/
The cc-by-sa 4.0 license, while intentionally permissive, requires attribution:

Attribution — You must indicate, in the manner specified by the author or licensor, that the work comes from the Stack Exchange network. This can be satisfied with a discreet text blurb or some other unobtrusive but clear visual indication.

Ensure that any Internet use of the content includes a hyperlink directly to the original question on the source site (e.g., http://stackoverflow.com/questions/12345).

Visually display or otherwise indicate the author names for every question and answer used.

Ensure that any Internet use of the content hyperlinks each author name directly back to their user profile page on the source site (e.g., http://stackoverflow.com/users/12345/username), pointing directly to the Stack Exchange domain in standard HTML (i.e., not through a TinyURL or any other such indirection, obfuscation, or redirection), without any "nofollow" directive or any other such means of avoiding detection by search engines, and visible even with JavaScript disabled. See the Stack Exchange Terms of Service for details.
@online{h4stackexchange,
  author = {Lambert, Nathan and Tunstall, Lewis and Rajani, Nazneen and Thrush, Tristan},
  title = {HuggingFace H4 Stack Exchange Preference Dataset},
  year = 2023,
  url = {https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences},
}