Dataset:

allenai/soda

English

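Dataset Card for SODA

Dataset Summary

SODA is the first publicly available, million-scale, high-quality dialogue dataset covering a wide range of social interactions. The dialogues are distilled from a pre-trained language model (PLM), InstructGPT (Ouyang et al., 2022), by contextualizing social commonsense knowledge from a knowledge graph, Atomic10x (West et al., 2022). Human evaluation shows that dialogues in SODA are more consistent, more specific, and (surprisingly) more natural than those in prior human-authored datasets such as DailyDialog (Li et al., 2017) and BlendedSkillTalk (Smith et al., 2020). In addition, because social commonsense knowledge includes emotional reactions (i.e., the xReact relation), SODA contains 385K conversations labeled with 1.7K unique emotions, together with information about the experiencer and the cause of the event, i.e., PersonX and the head event in the symbolic commonsense knowledge triple.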

Languages

English

Dataset Structure

Field                     Type         Description
head                      str          the head event in the symbolic commonsense knowledge triple
relation                  str          the relationship between the head and tail events
tail                      str          the tail event in the symbolic commonsense knowledge triple
literal                   str          the symbolic commonsense knowledge in sentence form
narrative                 str          narrative based on the literal
dialogue                  list of str  dialogue grounded in the narrative
speakers                  list of str  the speakers for each turn in the dialogue
PersonX                   str          the assigned name for PersonX in the commonsense knowledge triple
PersonY                   str|null     the assigned name for PersonY in the commonsense knowledge triple
PersonZ                   str|null     the assigned name for PersonZ in the commonsense knowledge triple
original_index            int          the original index from Atomic10x
split                     str          the split information: {train, valid, test}
head_answer               str          the answer for whether the head is included in the narrative: {Yes, Unknown}
pmi_head_answer           str          the answer for whether the head is included in the narrative, with pointwise mutual information applied: {Yes, No, Unknown}
relation_tail_answer      str          the answer for whether the relation-tail is included in the dialogue: {Yes, No, Unknown}
pmi_relation_tail_answer  str          the answer for whether the relation-tail is included in the dialogue, with pointwise mutual information applied: {Yes, No, Unknown}
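
As a rough illustration of how the fields above fit together, the sketch below pairs each entry in `speakers` with the matching turn in `dialogue` and reads an emotion label from an xReact triple. The sample record is hypothetical (its values are invented, not taken from SODA), and treating the `tail` of an xReact triple as the emotion label is an assumption based on the summary above.

```python
from typing import Any, Dict, Optional

# Hypothetical record following the field list above; values are invented
# for illustration and do not come from the dataset.
sample: Dict[str, Any] = {
    "head": "PersonX thanks PersonY for the help",
    "relation": "xReact",
    "tail": "grateful",
    "PersonX": "Maya",
    "PersonY": "Leo",
    "speakers": ["Maya", "Leo"],
    "dialogue": ["Thanks for helping me move.", "Anytime, happy to help."],
}


def format_dialogue(record: Dict[str, Any]) -> str:
    """Render a record as 'speaker: utterance' lines, one per turn."""
    speakers, turns = record["speakers"], record["dialogue"]
    if len(speakers) != len(turns):
        raise ValueError("speakers and dialogue must have the same length")
    return "\n".join(f"{s}: {t}" for s, t in zip(speakers, turns))


def emotion_label(record: Dict[str, Any]) -> Optional[str]:
    """Return the tail as an emotion label for xReact triples, else None.

    Assumption: the emotion label is stored in the 'tail' field.
    """
    return record["tail"] if record["relation"] == "xReact" else None


print(format_dialogue(sample))
print(emotion_label(sample))
```

With the Hugging Face `datasets` library installed, `load_dataset("allenai/soda")` should yield records that can be passed to these helpers in place of `sample`; the exact split names are best checked against the loaded object.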

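Dataset Creation

To create SODA, we distill dialogues from InstructGPT by contextualizing social commonsense knowledge, i.e., by adding context information in multiple steps: (1) retrieve social commonsense from the symbolic commonsense knowledge graph, (2) convert it into sentence form, (3) generate a narrative from the sentence, (4) infer the speakers from the narrative, and finally (5) derive a contentful conversation grounded in the narrative and the speakers. Anchoring the PLM in commonsense knowledge when deriving conversations offers two key advantages: (1) minimizing nonsensical conversations and (2) maximizing diversity. For more details, please refer to our paper.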
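
The data flow of steps (2) through (5) can be sketched as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the sentence template, the prompt wording, the name-matching heuristic for speakers, and the `generate` callable (a stand-in for a call to a language model) are all hypothetical. Step (1), retrieving the triple from the knowledge graph, is assumed to have already happened.

```python
from typing import Callable, Dict


def to_literal(triple: Dict[str, str], names: Dict[str, str]) -> str:
    """Step 2: turn a (head, relation, tail) triple into a sentence.

    Only xReact is handled, with a made-up template.
    """
    if triple["relation"] != "xReact":
        raise NotImplementedError("this sketch only covers xReact")
    head = triple["head"]
    for placeholder, name in names.items():
        head = head.replace(placeholder, name)
    return f"{head}. {names['PersonX']} feels {triple['tail']}."


def contextualize(
    triple: Dict[str, str],
    names: Dict[str, str],
    generate: Callable[[str], str],
) -> Dict[str, object]:
    """Add context step by step: literal -> narrative -> speakers -> dialogue."""
    literal = to_literal(triple, names)                                # step 2
    narrative = generate(f"Expand into a short narrative: {literal}")  # step 3
    # Step 4 (stand-in heuristic): every assigned name found in the
    # narrative is treated as a speaker.
    speakers = [name for name in names.values() if name in narrative]
    prompt = f"{narrative}\nWrite a conversation between {' and '.join(speakers)}."
    dialogue = generate(prompt)                                        # step 5
    return {
        "literal": literal,
        "narrative": narrative,
        "speakers": speakers,
        "dialogue": dialogue,
    }


def echo_generate(prompt: str) -> str:
    """Placeholder for a language-model call; it only echoes the prompt."""
    return f"GEN[{prompt}]"


example = contextualize(
    {"head": "PersonX thanks PersonY", "relation": "xReact", "tail": "grateful"},
    {"PersonX": "Maya", "PersonY": "Leo"},
    echo_generate,
)
print(example["literal"])
```

Passing the model call in as a function keeps the sketch runnable offline and makes each intermediate output (`literal`, `narrative`, `speakers`) inspectable, mirroring the corresponding fields in the dataset.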

Further Details, Social Impacts, and Limitations

Please refer to our paper.

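Trained Model

Using SODA, we train COSMO: a generalizable conversation agent that outperforms the previous best-performing agents on both in-domain and out-of-domain datasets. COSMO-3B is available here!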

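Additional Information

For a brief summary of our paper, please see this tweet.

Citation

Please cite our work if you find the resources in this repository useful: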

@article{kim2022soda,
    title={SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization},
    author={Hyunwoo Kim and Jack Hessel and Liwei Jiang and Peter West and Ximing Lu and Youngjae Yu and Pei Zhou and Ronan Le Bras and Malihe Alikhani and Gunhee Kim and Maarten Sap and Yejin Choi},
    journal={ArXiv},
    year={2022},
    volume={abs/2212.10465}
}