Datasets:

knkarthick/dialogsum

Languages:

en

Multilinguality:

monolingual

Size Categories:

10K<n<100K

Language Creators:

expert-generated

Annotations Creators:

expert-generated

Source Datasets:

original

License:

mit

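Dataset Card for DIALOGSum Corpus

Dataset Description

Links

Dataset Summary

DialogSum is a large-scale dialogue summarization dataset consisting of 13,460 dialogues (plus 100 holdout dialogues for topic generation), each with a corresponding manually written summary and topic.

Languages

English

Dataset Structure

Data Instances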

DialogSum is a large-scale dialogue summarization dataset consisting of 13,460 dialogues (+1000 tests), split into train, test and validation sets. The first instance in the training set is:

{'id': 'train_0', 'summary': "Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll give some information about their classes and medications to help Mr. Smith quit smoking.", 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.", 'topic': "get a check-up"}
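The `dialogue` field in the instance above stores all turns in one string, separated by `\n` and prefixed with tags such as `#Person1#`. The following is a minimal sketch of splitting such a string into (speaker, utterance) pairs. The tag pattern is inferred from this single instance rather than from a format specification, and `split_turns` is a hypothetical helper name, not part of the dataset or any library.

```python
import re
from typing import List, Tuple

# Assumption: speaker tags look like "#Person1#", as in the sample instance.
TURN_PATTERN = re.compile(r"^(#Person\d+#):\s*(.*)$")


def split_turns(dialogue: str) -> List[Tuple[str, str]]:
    """Split a DialogSum-style dialogue string into (speaker, utterance) pairs."""
    turns: List[Tuple[str, str]] = []
    for line in dialogue.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = TURN_PATTERN.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif turns:
            # Assumption: an untagged line continues the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}")
        else:
            # Untagged text before any speaker tag is kept with an empty speaker.
            turns.append(("", line))
    return turns


# First two turns of the training instance shown above.
sample = (
    "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n"
    "#Person2#: I found it would be a good idea to get a check-up."
)
turns = split_turns(sample)
for speaker, utterance in turns:
    print(speaker, "->", utterance)
```

Keeping the speaker tag as a separate element makes it straightforward to count turns per speaker or to re-join the utterances with a different separator before feeding them to a summarization model.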

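Data Fields

  • dialogue: text of the dialogue.
  • summary: human-written summary of the dialogue.
  • topic: human-written topic/one-liner of the dialogue.
  • id: unique file id of an example.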
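A record with these four fields can be sanity-checked before use. The sketch below assumes every field is a non-empty string, which matches the sample instance but is not stated anywhere in the card; `find_problems` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Sequence

# Field names are taken from the list above.
REQUIRED_FIELDS = ("id", "dialogue", "summary", "topic")


def find_problems(
    record: Dict[str, Any], required: Sequence[str] = REQUIRED_FIELDS
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems: List[str] = []
    for name in required:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            # Assumption: all fields are plain strings, as in the sample instance.
            problems.append(f"field {name} is not a string")
        elif not record[name].strip():
            problems.append(f"field {name} is empty")
    return problems


complete = {
    "id": "train_0",
    "dialogue": "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?",
    "summary": "Mr. Smith's getting a check-up.",
    "topic": "get a check-up",
}
incomplete = {"id": "train_0", "dialogue": "#Person1#: Hi."}

print(find_problems(complete))
print(find_problems(incomplete))
```

If the Hugging Face `datasets` library is installed, records of this shape could be obtained with `load_dataset("knkarthick/dialogsum")`, using the dataset id from the metadata at the top of this card; that call needs network access and is not exercised here.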

Data Splits

  • train: 12460
  • val: 500
  • test: 1500
  • holdout: 100 [only 3 features: id, dialogue, topic]
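The split sizes can be reconciled with the totals quoted earlier in the card. One reading, offered as an interpretation rather than something the card states, is that train, validation and test together hold the 13,460 dialogues plus the 1,000 extra test items, and that the 100 holdout dialogues are counted separately. The sketch below only checks that arithmetic; the dictionary keys are illustrative labels and not necessarily the split names used by any loader.

```python
# Split sizes as listed above; keys are illustrative labels.
SPLIT_SIZES = {"train": 12460, "validation": 500, "test": 1500}
HOLDOUT_SIZE = 100  # holdout items carry only id, dialogue and topic

total = sum(SPLIT_SIZES.values())

# Percentage share of each split, rounded to one decimal place.
shares = {name: round(100 * size / total, 1) for name, size in SPLIT_SIZES.items()}

print("total (train + validation + test):", total)
print("matches 13,460 + 1,000:", total == 13460 + 1000)
print("shares (%):", shares)
print("total including holdout:", total + HOLDOUT_SIZE)
```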

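Dataset Creation

Curation Rationale

In paper: We collect dialogue data for DialogSum from three public dialogue corpora, namely Dailydialog (Li et al., 2017), DREAM (Sun et al., 2019) and MuTual (Cui et al., 2019), as well as an English speaking practice website. These datasets contain face-to-face spoken dialogues that cover a wide range of daily-life topics, including schooling, work, medication, shopping, leisure and travel. Most conversations take place between friends, between colleagues, and between service providers and customers.

Compared with previous datasets, dialogues from DialogSum have the following distinct characteristics. They:

  • take place in rich real-life scenarios, including more diverse task-oriented scenarios;
  • have clear communication patterns and intents, which makes them valuable as summarization sources;
  • have a reasonable length, which suits the purpose of automatic summarization.

We ask annotators to summarize each dialogue based on the following criteria. A summary should:

  • convey the most salient information;
  • be brief;
  • preserve important named entities within the conversation;
  • be written from an observer perspective;
  • be written in formal language.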

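Who are the source language producers?

Linguists

Who are the annotators?

Language experts

Licensing Information

MIT License

Citation Information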

@inproceedings{chen-etal-2021-dialogsum,
    title = "{D}ialog{S}um: {A} Real-Life Scenario Dialogue Summarization Dataset",
    author = "Chen, Yulong  and
      Liu, Yang  and
      Chen, Liang  and
      Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.449",
    doi = "10.18653/v1/2021.findings-acl.449",
    pages = "5062--5074",
}

Contributions

Thanks to @cylnlp for adding this dataset.