Dataset: allenai/soda
Task: dialogue
Sub-task: dialogue-generation
Language: en
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: machine-generated
Preprint: arxiv:2212.10465
License: cc-by-4.0

SODA is the first publicly available, million-scale, high-quality dialogue dataset covering a wide range of social interactions. Dialogues are distilled from a PLM (InstructGPT; Ouyang et al., 2022) by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x; West et al., 2022). Human evaluation shows that dialogues in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets such as DailyDialog (Li et al., 2017) and BlendedSkillTalk (Smith et al., 2020). In addition, since the social commonsense knowledge encompasses emotional reactions (i.e., the xReact relation), SODA includes 385K conversations labeled with 1.7K unique emotions, along with information about the experiencer and the cause of the emotion, i.e., PersonX and the head event in the symbolic commonsense knowledge triple.
English
field | type | description |
---|---|---|
head | str | the head event in the symbolic commonsense knowledge triple |
relation | str | the relationship between head and tail events |
tail | str | the tail event in the symbolic commonsense knowledge triple |
literal | str | the symbolic commonsense knowledge in sentence-form |
narrative | str | narrative based on the literal |
dialogue | list of str | dialogue grounded in the narrative |
speakers | list of str | the speakers for each turn in the dialogue |
PersonX | str | the assigned name for PersonX in the commonsense knowledge triple |
PersonY | str|null | the assigned name for PersonY in the commonsense knowledge triple |
PersonZ | str|null | the assigned name for PersonZ in the commonsense knowledge triple |
original_index | int | the original index from Atomic10x |
split | str | the split information: {train, valid, test} |
head_answer | str | the answer for whether the head is included in the narrative : {Yes, Unknown} |
pmi_head_answer | str | the answer for whether the head is included in the narrative with point-wise mutual information applied: {Yes, No, Unknown} |
relation_tail_answer | str | the answer for whether the relation - tail is included in the dialogue : {Yes, No, Unknown} |
pmi_relation_tail_answer | str | the answer for whether the relation - tail is included in the dialogue with point-wise mutual information applied: {Yes, No, Unknown} |
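A single SODA example can be pictured as a Python dict following the schema above. The field values below are invented for illustration; they are not taken from the dataset.

```python
# Illustrative SODA-style record matching the schema above.
# All values are made up for demonstration purposes.
example = {
    "head": "PersonX wants to relax",
    "relation": "xWant",
    "tail": "to watch a movie",
    "literal": "Madison wants to relax. Madison wants to watch a movie.",
    "narrative": "After a long week at work, Madison decides to unwind with a movie.",
    "dialogue": [
        "I'm exhausted. I think I'll just watch a movie tonight.",
        "That sounds like a great idea. Anything in mind?",
    ],
    "speakers": ["Madison", "Alex"],
    "PersonX": "Madison",
    "PersonY": None,   # null when the triple has no PersonY
    "PersonZ": None,   # null when the triple has no PersonZ
    "original_index": 12345,
    "split": "train",
    "head_answer": "Yes",
    "pmi_head_answer": "Yes",
    "relation_tail_answer": "Yes",
    "pmi_relation_tail_answer": "Yes",
}

# Each turn in `dialogue` pairs with the speaker at the same index.
assert len(example["dialogue"]) == len(example["speakers"])
```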
To create SODA, we distill dialogues from InstructGPT by contextualizing social commonsense knowledge, i.e., adding context information in multiple steps: (1) retrieve social commonsense from the symbolic commonsense knowledge graph, (2) convert it into sentence form, (3) generate a narrative from the sentence, (4) infer the speakers from the narrative, and finally (5) derive contentful conversation grounded in the narrative and speakers. Anchoring the PLM in commonsense knowledge for deriving conversations offers two key advantages: (1) minimizing nonsensical conversations and (2) maximizing diversity. For more details, please refer to our paper.
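The five contextualization steps above can be sketched as a simple pipeline. The `plm` callable stands in for InstructGPT, and the prompts are illustrative placeholders, not the actual prompts from the paper.

```python
# Sketch of the five-step contextualization pipeline described above.
# `plm` is a hypothetical stand-in for a PLM call (prompt -> text);
# the prompt strings are illustrative, not the paper's actual prompts.

def distill_dialogue(triple, plm):
    head, relation, tail = triple

    # (1) the retrieved commonsense triple is the input itself
    # (2) verbalize the triple into sentence form
    literal = plm(f"Rewrite as sentences: {head} {relation} {tail}")

    # (3) expand the sentence form into a short narrative
    narrative = plm(f"Write a short two-sentence story: {literal}")

    # (4) infer the speakers from the narrative
    speakers = plm(f"Name the speakers in: {narrative}").split(", ")

    # (5) generate a dialogue grounded in the narrative and speakers
    dialogue = plm(f"Write a dialogue between {speakers} about: {narrative}").split("\n")

    return {
        "literal": literal,
        "narrative": narrative,
        "speakers": speakers,
        "dialogue": dialogue,
    }
```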
With SODA, we train COSMO: a generalizable conversation agent that outperforms previous best-performing agents on both in-domain and out-of-domain datasets. COSMO-3B is available here!
For a brief summary of our paper, please see this tweet.
If you find the resources in this repository useful, please cite our work:
```
@article{kim2022soda,
  title={SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization},
  author={Hyunwoo Kim and Jack Hessel and Liwei Jiang and Peter West and Ximing Lu and Youngjae Yu and Pei Zhou and Ronan Le Bras and Malihe Alikhani and Gunhee Kim and Maarten Sap and Yejin Choi},
  journal={ArXiv},
  year={2022},
  volume={abs/2212.10465}
}
```