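Dataset: multi_woz_v22
Language: en
Multilinguality: monolingual
Size: 10K<n<100K
Annotation creators: machine-generated
Source datasets: original
Preprint: arxiv:1810.00278
License: apache-2.0

MultiWOZ is a fully annotated, multi-domain Wizard-of-Oz dataset of human-human conversations spanning multiple domains and topics. MultiWOZ 2.1 (Eric et al., 2019) identified and fixed many erroneous annotations and user utterances in the original version, resulting in an improved version of the dataset. MultiWOZ 2.2 is a further improved version. It fixes dialogue state annotation errors across 17.3% of the utterances in MultiWOZ 2.1, and it redefines the ontology by disallowing vocabularies for slots with a large number of possible values (e.g., restaurant name, booking time), introducing standardized slot span annotations for these slots instead.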
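The dataset supports a range of tasks.
The text in the dataset is in English (en).
A data instance is a full multi-turn dialogue between a user and a system. Each turn has a single utterance, e.g.: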
['What fun places can I visit in the East?', 'We have five spots which include boating, museums and entertainment. Any preferences that you have?']
The user's utterances are also annotated with their intent and belief state:
[{'service': ['attraction'], 'slots': [{'copy_from': [], 'copy_from_value': [], 'exclusive_end': [], 'slot': [], 'start': [], 'value': []}], 'state': [{'active_intent': 'find_attraction', 'requested_slots': [], 'slots_values': {'slots_values_list': [['east']], 'slots_values_name': ['attraction-area']}}]}, {'service': [], 'slots': [], 'state': []}]
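The belief state in the example above is stored as two parallel lists, `slots_values_name` and `slots_values_list`. The following is a minimal sketch of how they might be zipped into a plain dictionary; the helper name `belief_state` is hypothetical and not part of the dataset or any library, and the frame is copied from the example.

```python
# Frame copied from the first (user) turn in the example above.
frame = {
    "service": ["attraction"],
    "slots": [{"copy_from": [], "copy_from_value": [], "exclusive_end": [],
               "slot": [], "start": [], "value": []}],
    "state": [{
        "active_intent": "find_attraction",
        "requested_slots": [],
        "slots_values": {
            "slots_values_list": [["east"]],
            "slots_values_name": ["attraction-area"],
        },
    }],
}


def belief_state(frame):
    """Map each slot name to its list of values (hypothetical helper)."""
    result = {}
    for state in frame["state"]:
        slots_values = state["slots_values"]
        # The two lists are parallel: position i of one matches position i of the other.
        for name, values in zip(slots_values["slots_values_name"],
                                slots_values["slots_values_list"]):
            result[name] = values
    return result


print(belief_state(frame))  # {'attraction-area': ['east']}
```

The second entry in the example (the system turn) has empty lists throughout, so the same helper returns an empty dictionary for it.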
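Finally, each utterance is annotated with dialog acts, which provide a structured representation of what the user or system is asking about or giving information about.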
[{'dialog_act': {'act_slots': [{'slot_name': ['east'], 'slot_value': ['area']}], 'act_type': ['Attraction-Inform']}, 'span_info': {'act_slot_name': ['area'], 'act_slot_value': ['east'], 'act_type': ['Attraction-Inform'], 'span_end': [39], 'span_start': [35]}}, {'dialog_act': {'act_slots': [{'slot_name': ['none'], 'slot_value': ['none']}, {'slot_name': ['boating', 'museums', 'entertainment', 'five'], 'slot_value': ['type', 'type', 'type', 'choice']}], 'act_type': ['Attraction-Select', 'Attraction-Inform']}, 'span_info': {'act_slot_name': ['type', 'type', 'type', 'choice'], 'act_slot_value': ['boating', 'museums', 'entertainment', 'five'], 'act_type': ['Attraction-Inform', 'Attraction-Inform', 'Attraction-Inform', 'Attraction-Inform'], 'span_end': [40, 49, 67, 12], 'span_start': [33, 42, 54, 8]}}]
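The `span_info` lists are also parallel, and in the example above `span_start` and `span_end` appear to work as a Python slice into the utterance (end exclusive). This is an observation from the sample rather than a documented guarantee. A small sketch that recovers the surface text for each span, using a hypothetical helper `spans`:

```python
# Utterance and span_info copied from the first turn in the example above.
utterance = "What fun places can I visit in the East?"
span_info = {
    "act_slot_name": ["area"],
    "act_slot_value": ["east"],
    "act_type": ["Attraction-Inform"],
    "span_end": [39],
    "span_start": [35],
}


def spans(utterance, span_info):
    """Return (act_type, slot_name, slot_value, surface_text) tuples (hypothetical helper)."""
    return [
        (act, name, value, utterance[start:end])
        for act, name, value, start, end in zip(
            span_info["act_type"],
            span_info["act_slot_name"],
            span_info["act_slot_value"],
            span_info["span_start"],
            span_info["span_end"],
        )
    ]


print(spans(utterance, span_info))
# [('Attraction-Inform', 'area', 'east', 'East')]
```

Note that the annotated value (`east`) is lowercased while the surface text (`East`) keeps the original casing, so any comparison between the two should be case-insensitive.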
Each dialogue instance has the following fields:
{
  "slots": [
    {
      "slot": String of slot name.
      "start": Int denoting the index of the starting character in the utterance corresponding to the slot value.
      "exclusive_end": Int denoting the index of the character just after the last character corresponding to the slot value in the utterance. In Python, utterance[start:exclusive_end] gives the slot value.
      "value": String of value. It equals utterance[start:exclusive_end], where utterance is the current utterance as a string.
    }
  ]
}
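The relation `value == utterance[start:exclusive_end]` described above can be checked mechanically. The sketch below uses an illustrative utterance and slot annotation that are made up for this example (not taken from the dataset), and a hypothetical helper `check_span`.

```python
def check_span(utterance, slot):
    """Return True if the annotated value matches the sliced utterance text."""
    return utterance[slot["start"]:slot["exclusive_end"]] == slot["value"]


# Illustrative data only; the utterance and slot annotation are assumptions.
utterance = "I need a table at 18:30 please."
value = "18:30"
start = utterance.index(value)  # index of the first character of the value
slot = {
    "slot": "restaurant-booktime",
    "start": start,
    "exclusive_end": start + len(value),  # one past the last character
    "value": value,
}

print(check_span(utterance, slot))  # True

# Shifting the start by one character breaks the invariant.
shifted = dict(slot, start=slot["start"] + 1)
print(check_span(utterance, shifted))  # False
```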
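There are also some non-categorical slots whose values are carried over from another slot in the dialogue state. Their values do not appear explicitly in the utterance. For example, a user utterance may be "I also need a taxi from the restaurant to the hotel.", where the state values of "taxi-departure" and "taxi-destination" are taken from the state values of "restaurant-name" and "hotel-name", respectively. Such slots are not annotated with spans; instead, a "copy from" annotation identifies the slot the value is copied from. This annotation has the following format: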
{
  "slots": [
    {
      "slot": Slot name string.
      "copy_from": The slot to copy from.
      "value": A list of slot values being copied. It corresponds to the state values of the "copy_from" slot.
    }
  ]
}
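As a rough illustration of the "copy from" annotation, the sketch below resolves copied slots against a dialogue state for the taxi example. The state values (`pizza hut`, `acorn guest house`) and the helper `resolve_copies` are assumptions made up for this example, not taken from the dataset.

```python
# Hypothetical dialogue state accumulated from earlier turns.
state = {
    "restaurant-name": ["pizza hut"],
    "hotel-name": ["acorn guest house"],
}

# "copy from" annotations in the format described above.
slots = [
    {"slot": "taxi-departure", "copy_from": "restaurant-name",
     "value": ["pizza hut"]},
    {"slot": "taxi-destination", "copy_from": "hotel-name",
     "value": ["acorn guest house"]},
]


def resolve_copies(slots, state):
    """Fill each copied slot with the state values of its source slot."""
    resolved = {}
    for annotation in slots:
        source = annotation.get("copy_from")
        if source:  # span-annotated slots carry no "copy_from" and are skipped
            resolved[annotation["slot"]] = state.get(source, [])
    return resolved


print(resolve_copies(slots, state))
# {'taxi-departure': ['pizza hut'], 'taxi-destination': ['acorn guest house']}
```

In this sketch the resolved values agree with the `value` list on each annotation, which matches the description that `value` corresponds to the state values of the `copy_from` slot.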
The dataset is split into train, validation, and test sets, with the following sizes:
| | train | validation | test |
|---|---|---|---|
| Number of dialogues | 8438 | 1000 | 1000 |
| Number of turns | 42190 | 5000 | 5000 |
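The initial dataset (versions 1.0 and 2.0) was created by a team of researchers from the Cambridge Dialogue Systems Group. Version 2.1 was developed on top of v2.0 by a team from Amazon, and v2.2 was developed by a team of Google researchers.
The dataset is released under the Apache License 2.0.
You can cite the following for the relevant versions of MultiWOZ: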
Version 1.0
@inproceedings{ramadan2018large,
  title={Large-Scale Multi-Domain Belief Tracking with Knowledge Sharing},
  author={Ramadan, Osman and Budzianowski, Pawe{\l} and Gasic, Milica},
  booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
  volume={2},
  pages={432--437},
  year={2018}
}
Version 2.0
@inproceedings{budzianowski2018large,
  author={Budzianowski, Pawe{\l} and Wen, Tsung-Hsien and Tseng, Bo-Hsiang and Casanueva, I{\~n}igo and Ultes, Stefan and Ramadan, Osman and Ga{\v{s}}i\'c, Milica},
  title={MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}
Version 2.1
@article{eric2019multiwoz,
  title={MultiWOZ 2.1: Multi-Domain Dialogue State Corrections and State Tracking Baselines},
  author={Eric, Mihail and Goel, Rahul and Paul, Shachi and Sethi, Abhishek and Agarwal, Sanchit and Gao, Shuyang and Hakkani-Tur, Dilek},
  journal={arXiv preprint arXiv:1907.01669},
  year={2019}
}
Version 2.2
@inproceedings{zang2020multiwoz,
  title={MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines},
  author={Zang, Xiaoxue and Rastogi, Abhinav and Sunkara, Srinivas and Gupta, Raghav and Zhang, Jianguo and Chen, Jindong},
  booktitle={Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, ACL 2020},
  pages={109--117},
  year={2020}
}
Thanks to @yjernite for adding this dataset.