Dataset: multi_woz_v22
Language: English (en)
Multilinguality: monolingual
Size: 10K<n<100K
Annotations creators: machine-generated
Source datasets: original
ArXiv: arxiv:1810.00278
License: Apache License 2.0
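MultiWOZ is a fully labeled, multi-domain Wizard-of-Oz dataset of human-human conversations spanning multiple domains and topics. MultiWOZ 2.1 (Eric et al., 2019) identified and fixed many erroneous annotations and user utterances in the original version, resulting in an improved version of the dataset. MultiWOZ 2.2 is a further improved version: building on MultiWOZ 2.1, it fixes dialogue state annotation errors in 17.3% of the utterances, and it redefines the ontology by disallowing vocabularies for slots with a large number of possible values (e.g., restaurant name, time of booking) and introducing standardized slot span annotations for these slots.
The dataset supports a range of tasks.
The text in the dataset is in English (en).
A data instance is a full multi-turn dialogue between a user and a system. Each turn has a single utterance, for example: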
['What fun places can I visit in the East?',
'We have five spots which include boating, museums and entertainment. Any preferences that you have?']
The user's utterances are additionally annotated with frames that denote the user's intent and belief state:
[{'service': ['attraction'],
  'slots': [{'copy_from': [],
             'copy_from_value': [],
             'exclusive_end': [],
             'slot': [],
             'start': [],
             'value': []}],
  'state': [{'active_intent': 'find_attraction',
             'requested_slots': [],
             'slots_values': {'slots_values_list': [['east']],
                              'slots_values_name': ['attraction-area']}}]},
 {'service': [], 'slots': [], 'state': []}]
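The belief state in a frame is stored as two parallel lists, `slots_values_name` and `slots_values_list`. The following is a minimal sketch of folding them into an ordinary dictionary, assuming the frame layout shown in the example above. The helper name `belief_state` is hypothetical and not part of the dataset or of any library.

```python
def belief_state(frames):
    """Collect {slot name: list of values} from every frame of one user turn."""
    merged = {}
    for frame in frames:
        for state in frame["state"]:
            values = state["slots_values"]
            names = values["slots_values_name"]
            lists = values["slots_values_list"]
            # The two lists are parallel: names[i] goes with lists[i].
            for name, slot_values in zip(names, lists):
                merged[name] = list(slot_values)
    return merged


# Frames copied from the example above (the empty `slots` entry is shortened).
frames = [
    {
        "service": ["attraction"],
        "slots": [],
        "state": [
            {
                "active_intent": "find_attraction",
                "requested_slots": [],
                "slots_values": {
                    "slots_values_list": [["east"]],
                    "slots_values_name": ["attraction-area"],
                },
            }
        ],
    },
    {"service": [], "slots": [], "state": []},
]

print(belief_state(frames))
```

A frame with an empty `state` list, like the second one here, simply contributes nothing.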
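Finally, each utterance is annotated with dialog acts, which provide a structured representation of what the user or system is asking for or informing about: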
[{'dialog_act': {'act_slots': [{'slot_name': ['east'],
                                'slot_value': ['area']}],
                 'act_type': ['Attraction-Inform']},
  'span_info': {'act_slot_name': ['area'],
                'act_slot_value': ['east'],
                'act_type': ['Attraction-Inform'],
                'span_end': [39],
                'span_start': [35]}},
 {'dialog_act': {'act_slots': [{'slot_name': ['none'], 'slot_value': ['none']},
                               {'slot_name': ['boating', 'museums', 'entertainment', 'five'],
                                'slot_value': ['type', 'type', 'type', 'choice']}],
                 'act_type': ['Attraction-Select', 'Attraction-Inform']},
  'span_info': {'act_slot_name': ['type', 'type', 'type', 'choice'],
                'act_slot_value': ['boating', 'museums', 'entertainment', 'five'],
                'act_type': ['Attraction-Inform',
                             'Attraction-Inform',
                             'Attraction-Inform',
                             'Attraction-Inform'],
                'span_end': [40, 49, 67, 12],
                'span_start': [33, 42, 54, 8]}}]
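The `span_info` offsets can be checked against the utterance text. The sketch below assumes, based only on the two example turns above, that `span_end` is an exclusive character index and that values are stored lowercased. The function name `extract_spans` is hypothetical.

```python
def extract_spans(utterance, span_info):
    """Return (annotated value, text sliced from the utterance) for each span."""
    triples = zip(
        span_info["act_slot_value"],
        span_info["span_start"],
        span_info["span_end"],
    )
    return [(value, utterance[start:end]) for value, start, end in triples]


user_utterance = "What fun places can I visit in the East?"
user_spans = {
    "act_slot_name": ["area"],
    "act_slot_value": ["east"],
    "span_start": [35],
    "span_end": [39],
}

system_utterance = (
    "We have five spots which include boating, museums and entertainment. "
    "Any preferences that you have?"
)
system_spans = {
    "act_slot_name": ["type", "type", "type", "choice"],
    "act_slot_value": ["boating", "museums", "entertainment", "five"],
    "span_start": [33, 42, 54, 8],
    "span_end": [40, 49, 67, 12],
}

for value, text in extract_spans(user_utterance, user_spans):
    # Compare case-insensitively: the annotation is "east", the text is "East".
    print(value, repr(text), value.lower() == text.lower())

for value, text in extract_spans(system_utterance, system_spans):
    print(value, repr(text), value.lower() == text.lower())
```

The comparison is case-insensitive because the annotated value in the user turn is `east` while the utterance contains `East`.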
Each dialogue instance has the following fields:
{
  "slots": [
    {
      "slot": String of slot name.
      "start": Int denoting the index of the starting character in the utterance corresponding to the slot value.
      "exclusive_end": Int denoting the index of the character just after the last character corresponding to the slot value in the utterance. In Python, utterance[start:exclusive_end] gives the slot value.
      "value": String of value. It equals utterance[start:exclusive_end], where utterance is the current utterance as a string.
    }
  ]
}
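The invariant `utterance[start:exclusive_end] == value` can be used both to build and to validate a span annotation. This is a minimal sketch under that assumption. The helper names, the utterance, and the slot name `restaurant-booktime` are made up for illustration and are not taken from the dataset.

```python
def make_span_slot(slot, utterance, value):
    """Build a span annotation for the first occurrence of `value`."""
    start = utterance.index(value)  # raises ValueError if the value is absent
    return {
        "slot": slot,
        "start": start,
        "exclusive_end": start + len(value),
        "value": value,
    }


def span_is_consistent(utterance, annotation):
    """Check that slicing the utterance reproduces the annotated value."""
    sliced = utterance[annotation["start"]:annotation["exclusive_end"]]
    return sliced == annotation["value"]


utterance = "I need a table at 18:30 please."  # hypothetical utterance
annotation = make_span_slot("restaurant-booktime", utterance, "18:30")

print(annotation)
print(span_is_consistent(utterance, annotation))
```

Because `exclusive_end` points one past the last character, the span length is `exclusive_end - start`, which matches Python's slice convention.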
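There are also some non-categorical slots whose values are carried over from another slot in the dialogue state. Their values do not appear explicitly in the utterance. For example, a user utterance can be "I also need a taxi from the restaurant to the hotel.", in which the state values of "taxi-departure" and "taxi-destination" are carried over from those of "restaurant-name" and "hotel-name", respectively. These slots are not annotated as spans; instead, a "copy from" annotation identifies the slot the value is copied from. This annotation is formatted as follows:
{
  "slots": [
    {
      "slot": Slot name string.
      "copy_from": The slot to copy from.
      "value": A list of the slot values being copied. It corresponds to the state values of the "copy_from" slot.
    }
  ]
}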
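Resolving a "copy from" annotation amounts to looking up the source slot in the current dialogue state. The sketch below assumes the list-of-dicts annotation format described above. The function name `resolve_copied_slots` and the restaurant and hotel names are hypothetical.

```python
def resolve_copied_slots(slot_annotations, state):
    """Map each copying slot to the values of the slot it copies from."""
    resolved = {}
    for annotation in slot_annotations:
        source = annotation.get("copy_from")
        if source is None:
            continue  # span-annotated slot: nothing to copy
        resolved[annotation["slot"]] = list(state.get(source, []))
    return resolved


# Dialogue state before "I also need a taxi from the restaurant to the hotel."
# The values are invented for illustration.
state = {
    "restaurant-name": ["example bistro"],
    "hotel-name": ["example guest house"],
}

slot_annotations = [
    {"slot": "taxi-departure", "copy_from": "restaurant-name",
     "value": ["example bistro"]},
    {"slot": "taxi-destination", "copy_from": "hotel-name",
     "value": ["example guest house"]},
]

resolved = resolve_copied_slots(slot_annotations, state)
print(resolved)

# The annotation's own "value" list should agree with the looked-up state.
print(all(resolved[a["slot"]] == a["value"] for a in slot_annotations))
```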
The dataset is split into train, validation, and test sets with the following sizes:
| | train | validation | test |
|---|---|---|---|
| Number of dialogues | 8438 | 1000 | 1000 |
| Number of turns | 42190 | 5000 | 5000 |
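As a quick sanity check on the table, the per-split share of dialogues and the average number of turns per dialogue can be computed directly from the counts above. This is only an arithmetic sketch; the name `split_summary` is hypothetical.

```python
# (number of dialogues, number of turns) per split, copied from the table.
SPLITS = {
    "train": (8438, 42190),
    "validation": (1000, 5000),
    "test": (1000, 5000),
}


def split_summary(splits):
    """Return each split's share of all dialogues and its turns per dialogue."""
    total_dialogues = sum(dialogues for dialogues, _ in splits.values())
    return {
        name: {
            "share": dialogues / total_dialogues,
            "turns_per_dialogue": turns / dialogues,
        }
        for name, (dialogues, turns) in splits.items()
    }


summary = split_summary(SPLITS)
for name, stats in summary.items():
    print(f"{name}: {stats['share']:.1%} of dialogues, "
          f"{stats['turns_per_dialogue']:.1f} turns per dialogue")
```

With the counts as listed, every split averages five turns per dialogue.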
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
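The initial dataset (versions 1.0 and 2.0) was created by a team of researchers from the Cambridge Dialogue Systems Group. Version 2.1 was developed on top of v2.0 by a team from Amazon, and v2.2 was developed by a team of Google researchers.
The dataset is released under the Apache License 2.0.
You can cite the following for the relevant version of MultiWOZ: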
Version 1.0
@inproceedings{ramadan2018large,
title={Large-Scale Multi-Domain Belief Tracking with Knowledge Sharing},
author={Ramadan, Osman and Budzianowski, Pawe{\l} and Ga{\v{s}}i{\'c}, Milica},
booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
volume={2},
pages={432--437},
year={2018}
}
Version 2.0
@inproceedings{budzianowski2018large,
author={Budzianowski, Pawe{\l} and Wen, Tsung-Hsien and Tseng, Bo-Hsiang and Casanueva, I{\~n}igo and Ultes, Stefan and Ramadan, Osman and Ga{\v{s}}i{\'c}, Milica},
title={MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling},
booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2018}
}
Version 2.1
@article{eric2019multiwoz,
title={MultiWOZ 2.1: Multi-Domain Dialogue State Corrections and State Tracking Baselines},
author={Eric, Mihail and Goel, Rahul and Paul, Shachi and Sethi, Abhishek and Agarwal, Sanchit and Gao, Shuyang and Hakkani-Tur, Dilek},
journal={arXiv preprint arXiv:1907.01669},
year={2019}
}
Version 2.2
@inproceedings{zang2020multiwoz,
title={MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines},
author={Zang, Xiaoxue and Rastogi, Abhinav and Sunkara, Srinivas and Gupta, Raghav and Zhang, Jianguo and Chen, Jindong},
booktitle={Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, ACL 2020},
pages={109--117},
year={2020}
}
Thanks to @yjernite for adding this dataset.