数据集:
BramVanroy/alpaca-cleaned-dutch
此数据集包含51,712个荷兰语的AI助手和(虚假的)“人类”(生成的)之间的对话。这些对话是从 Alpaca Cleaned Dataset 进行的翻译。
☕️ Want to help me out? 使用OpenAI API和提示测试翻译这些数据,我花费了?$57.99?。如果您喜欢这个数据集,请考虑 buying me a coffee ,以抵消部分费用,我非常感谢!☕️
{ 'id': 7, 'instruction': 'Leg uit waarom de volgende breuk gelijk is aan 1/4', 'input': '4/16', 'output': 'De breuk 4/16 is gelijk aan 1/4 omdat zowel de teller als de ' 'noemer deelbaar zijn door 4. Door zowel de teller als de noemer ' 'door 4 te delen, krijgen we de breuk 1/4.' }
使用OpenAI的API为gpt-3.5-turbo进行了指令、输入和输出的翻译。max_tokens=1024,temperature=0作为参数。
翻译的提示模板如下(其中src_lang为英语,tgt_lang为荷兰语):
TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional input to the task, and the output of the task, from {src_lang} into {tgt_lang}. Here are the requirements that you should adhere to: 1. maintain the format: the task consists of a task instruction (marked `instruction: `), optional input to the task (marked `input: `) and output for the task marked with `output: `; 2. do not translate the identifiers `instruction: `, `input: `, and `output: ` but instead copy them to your output; 3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias; 4. translate the instruction and input text using informal, but standard, language; 5. make sure to avoid biases (such as gender bias, grammatical bias, social bias); 6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the input in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang}; 7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the input, nor the translation in the output (just copy them as-is); 8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English. Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.\n\n"""
这个提示与指令、可选的输入和输出串联在一起。在代码中,最后一个部分如下所示:
text = f'instruction: "{instruction}"\n\n' if inputstr: text += f'input: "{inputstr}"\n\n' text += f'output: "{outputstr}"'
系统消息是:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
请注意,有1个项目(0.0001%)未能成功翻译。由于翻译中缺少了期望的输入、指令或输出关键字,因此无法获得翻译。缺少项目的ID为[23019]。
初始数据由 Tatsu lab 生成,由 Yahma 进行清理。
谁是源语言的生产者?最初的数据集是使用OpenAI的text-davinci-003生成的。
请注意,此新数据集中的翻译未经人工验证。
与任何由机器生成的文本一样,用户应注意本数据集可能包含的潜在偏见。尽管提示明确包含了“确保避免偏见(如性别偏见、语法偏见、社会偏见)”,但这种命令的影响是未知的。数据集中可能仍然存在偏见,请谨慎使用。
翻译质量尚未经过验证。使用时请自行承担风险!
根据OpenAI的使用条款,该数据集不能用于构建 a commercial system that competes with OpenAI's services 。与原始的Alpaca数据集类似,此数据集发布在CC NC 4.0下。
此文本是使用GPT-3(gpt-3.5-turbo)生成的(部分或全部)。作者在生成草稿语言后,对语言进行了审核、编辑和修订,对本出版物的内容负有最终责任。
如果您使用此数据集,您还必须遵守 Sharing 和 Usage 政策。
根据他们的 Terms of Use 中明确说明的,特别是2c.iii部分,"[您不能]使用从服务中获取的输出来开发与OpenAI商业竞争的模型"。这意味着您不能使用此数据集来构建旨在与OpenAI商业竞争的模型。 As far as I am aware ,这是一个特定的限制,应作为当前许可证的附录。
如果您使用此数据集,请引用:
Vanroy, B. (2023). Alpaca Cleaned Dutch [数据集]. Hugging Face. https://doi.org/10.57967/HF/0530
@misc{https://doi.org/10.57967/hf/0530, doi = {10.57967/HF/0530}, url = {https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch}, author = {Vanroy, Bram}, title = {{A}lpaca {C}leaned {D}utch}, publisher = {Hugging Face}, year = {2023} }
感谢 Tatsu lab 提供的初始机器生成的数据集和 cleaning it 。