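Dataset: knkarthick/dialogsum
Language: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: expert-generated
Annotation creators: expert-generated
Source datasets: original
License: mit

DialogSum is a large-scale dialogue summarization dataset consisting of 13,460 dialogues (plus 100 holdout dialogues for topic generation) with corresponding manually labeled summaries and topics.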
English
DialogSum is a large-scale dialogue summarization dataset consisting of 13,460 dialogues (plus 1,000 test dialogues), split into train, test, and validation sets. The first instance in the train split is:
{'id': 'train_0', 'summary': "Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll give some information about their classes and medications to help Mr. Smith quit smoking.", 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.", 'topic': "get a check-up"}
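Each instance stores the whole conversation in a single `dialogue` string, with one turn per line and a `#PersonN#:` speaker tag at the start of each turn. The following is a minimal sketch of how such a string could be split into speaker/utterance pairs. The `record` literal is an abbreviated copy of the instance above (first three turns only), and `split_turns` and `TURN_PATTERN` are hypothetical helper names introduced for illustration, not part of the dataset or of any library.

```python
import re

# Abbreviated copy of the first training instance (first three turns only).
# In practice the record would presumably come from the Hugging Face
# `datasets` library, e.g. load_dataset("knkarthick/dialogsum"); that call
# is not exercised here so the sketch runs offline.
record = {
    "id": "train_0",
    "summary": (
        "Mr. Smith's getting a check-up, and Doctor Hawkins advises him "
        "to have one every year. Hawkins'll give some information about "
        "their classes and medications to help Mr. Smith quit smoking."
    ),
    "dialogue": (
        "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n"
        "#Person2#: I found it would be a good idea to get a check-up.\n"
        "#Person1#: Yes, well, you haven't had one for 5 years. "
        "You should have one every year."
    ),
    "topic": "get a check-up",
}

# Assumption: every turn starts with a tag of the form "#Person<digits>#:".
TURN_PATTERN = re.compile(r"^(#Person\d+#):\s*(.*)$")


def split_turns(dialogue):
    """Split a dialogue string into a list of (speaker, utterance) pairs."""
    turns = []
    for raw_line in dialogue.split("\n"):
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        match = TURN_PATTERN.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif turns:
            # A line without a speaker tag is treated as a continuation
            # of the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, text + " " + line)
    return turns


turns = split_turns(record["dialogue"])
speakers = sorted({speaker for speaker, _ in turns})

for speaker, utterance in turns:
    print(f"{speaker} -> {utterance}")
```

Keeping the speaker tags as separate fields makes it straightforward to count turns per speaker or to reformat the dialogue for a particular summarization model, while the `summary` and `topic` fields can be used unchanged as targets.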
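From the paper: We collect dialogue data for DialogSum from three public dialogue corpora, namely Dailydialog (Li et al., 2017), DREAM (Sun et al., 2019), and MuTual (Cui et al., 2019), as well as an English speaking practice website. These datasets contain face-to-face spoken dialogues covering a wide range of daily-life topics, including schooling, work, medication, shopping, leisure, and travel. Most conversations take place between friends, between colleagues, or between service providers and customers.
Compared with previous datasets, the dialogues in DialogSum have the following distinct characteristics:
They occur in rich real-life scenarios, including more diverse task-oriented scenarios; they have clear communication patterns and intents, which makes them valuable as summarization sources; and they have a reasonable length, which suits the purpose of automatic summarization.
We ask annotators to summarize each dialogue according to the following criteria:
Convey the most salient information; be brief; preserve important named entities within the conversation; be written from an observer perspective; be written in formal language.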
Linguists
Language experts
MIT License
@inproceedings{chen-etal-2021-dialogsum,
    title = "{D}ialog{S}um: {A} Real-Life Scenario Dialogue Summarization Dataset",
    author = "Chen, Yulong and Liu, Yang and Chen, Liang and Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.449",
    doi = "10.18653/v1/2021.findings-acl.449",
    pages = "5062--5074",
}
Thanks to @cylnlp for adding this dataset.