Dataset: allenai/soda
Task: dialogue
Sub-task: dialogue-generation
Language: en
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: machine-generated
Preprint: arxiv:2212.10465
License: cc-by-4.0

SODA is the first publicly available, million-scale, high-quality dialogue dataset covering a wide range of social interactions. Dialogues are distilled from a PLM (InstructGPT; Ouyang et al., 2022) by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x; West et al., 2022). Human evaluation shows that dialogues in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets such as DailyDialog (Li et al., 2017) and BlendedSkillTalk (Smith et al., 2020). In addition, since the social commonsense knowledge encompasses emotional reactions (i.e., the xReact relation), SODA includes 385K conversations labeled with 1.7K unique emotions, along with information about the experiencer and the cause of the emotion, i.e., PersonX and the head event in the symbolic commonsense knowledge triple.
English
field | type | description |
---|---|---|
head | str | the head event in the symbolic commonsense knowledge triple |
relation | str | the relationship between head and tail events |
tail | str | the tail event in the symbolic commonsense knowledge triple |
literal | str | the symbolic commonsense knowledge in sentence-form |
narrative | str | narrative based on the literal |
dialogue | list of str | dialogue grounded in the narrative |
speakers | list of str | the speakers for each turn in the dialogue |
PersonX | str | the assigned name for PersonX in the commonsense knowledge triple |
PersonY | str|null | the assigned name for PersonY in the commonsense knowledge triple |
PersonZ | str|null | the assigned name for PersonZ in the commonsense knowledge triple |
original_index | int | the original index from Atomic10x |
split | str | the split information: {train, valid, test} |
head_answer | str | the answer for whether the head is included in the narrative : {Yes, Unknown} |
pmi_head_answer | str | the answer for whether the head is included in the narrative with point-wise mutual information applied: {Yes, No, Unknown} |
relation_tail_answer | str | the answer for whether the relation - tail is included in the dialogue : {Yes, No, Unknown} |
pmi_relation_tail_answer | str | the answer for whether the relation - tail is included in the dialogue with point-wise mutual information applied: {Yes, No, Unknown} |
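A single SODA example can be pictured as a Python dict following the schema above. The field values below are invented for illustration; they are not taken from the dataset.

```python
# Illustrative SODA-style record matching the schema above.
# All values are made up for demonstration purposes.
example = {
    "head": "PersonX wants to relax",
    "relation": "xWant",
    "tail": "to watch a movie",
    "literal": "Madison wants to relax. Madison wants to watch a movie.",
    "narrative": "After a long week at work, Madison decides to unwind with a movie.",
    "dialogue": [
        "I'm exhausted. I think I'll just watch a movie tonight.",
        "That sounds like a great idea. Anything in mind?",
    ],
    "speakers": ["Madison", "Alex"],
    "PersonX": "Madison",
    "PersonY": None,   # null when the triple has no PersonY
    "PersonZ": None,   # null when the triple has no PersonZ
    "original_index": 12345,
    "split": "train",
    "head_answer": "Yes",
    "pmi_head_answer": "Yes",
    "relation_tail_answer": "Yes",
    "pmi_relation_tail_answer": "Yes",
}

# Each turn in `dialogue` pairs with the speaker at the same index.
assert len(example["dialogue"]) == len(example["speakers"])
```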
To create SODA, we distill dialogues from InstructGPT by contextualizing social commonsense knowledge, i.e., adding context information in multiple steps: (1) retrieve social commonsense from the symbolic commonsense knowledge graph, (2) convert it into sentence form, (3) generate a narrative from the sentence, (4) infer the speakers from the narrative, and finally (5) derive contentful conversation grounded in the narrative and speakers. Anchoring the PLM in commonsense knowledge for deriving conversations offers two key advantages: (1) minimizing nonsensical conversations and (2) maximizing diversity. For more details, please refer to our paper.
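The five contextualization steps above can be sketched as a simple pipeline. The `plm` callable stands in for InstructGPT, and the prompts are illustrative placeholders, not the actual prompts from the paper.

```python
# Sketch of the five-step contextualization pipeline described above.
# `plm` is a hypothetical stand-in for a PLM call (prompt -> text);
# the prompt strings are illustrative, not the paper's actual prompts.

def distill_dialogue(triple, plm):
    head, relation, tail = triple

    # (1) the retrieved commonsense triple is the input itself
    # (2) verbalize the triple into sentence form
    literal = plm(f"Rewrite as sentences: {head} {relation} {tail}")

    # (3) expand the sentence form into a short narrative
    narrative = plm(f"Write a short two-sentence story: {literal}")

    # (4) infer the speakers from the narrative
    speakers = plm(f"Name the speakers in: {narrative}").split(", ")

    # (5) generate a dialogue grounded in the narrative and speakers
    dialogue = plm(f"Write a dialogue between {speakers} about: {narrative}").split("\n")

    return {
        "literal": literal,
        "narrative": narrative,
        "speakers": speakers,
        "dialogue": dialogue,
    }
```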
With SODA, we train COSMO: a generalizable conversation agent that outperforms previous best-performing agents on both in-domain and out-of-domain datasets. COSMO-3B is available here!
For a brief summary of our paper, please see this tweet.
If you find the resources in this repository useful, please cite our work:
```
@article{kim2022soda,
  title={SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization},
  author={Hyunwoo Kim and Jack Hessel and Liwei Jiang and Peter West and Ximing Lu and Youngjae Yu and Pei Zhou and Ronan Le Bras and Malihe Alikhani and Gunhee Kim and Maarten Sap and Yejin Choi},
  journal={ArXiv},
  year={2022},
  volume={abs/2212.10465}
}
```