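Dataset: taskmaster1
Sub-tasks: dialogue-modeling
Languages: English
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: original
ArXiv: arxiv:1909.05358
License: Creative Commons Attribution 4.0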
Taskmaster-1 is a goal-oriented conversational dataset. It includes 13,215 task-based dialogs spanning six domains. Two procedures were used to create this collection, each with unique advantages. The first is a two-person, spoken "Wizard of Oz" (WOz) approach in which trained agents and crowdsourced workers interact to complete the task. The second is a "self-dialog" approach in which crowdsourced workers write the entire dialog themselves.
The dataset is in English.
A typical example looks like this:
{
  "conversation_id":"dlg-336c8165-068e-4b4b-803d-18ef0676f668",
  "instruction_id":"restaurant-table-2",
  "utterances":[
    {
      "index":0,
      "segments":[],
      "speaker":"USER",
      "text":"Hi, I'm looking for a place that sells spicy wet hotdogs, can you think of any?"
    },
    {
      "index":1,
      "segments":[
        {
          "annotations":[
            {
              "name":"restaurant_reservation.name.restaurant.reject"
            }
          ],
          "end_index":37,
          "start_index":16,
          "text":"Spicy Wet Hotdogs LLC"
        }
      ],
      "speaker":"ASSISTANT",
      "text":"You might enjoy Spicy Wet Hotdogs LLC."
    },
    {
      "index":2,
      "segments":[],
      "speaker":"USER",
      "text":"That sounds really good, can you make me a reservation?"
    },
    {
      "index":3,
      "segments":[],
      "speaker":"ASSISTANT",
      "text":"Certainly, when would you like a reservation?"
    },
    {
      "index":4,
      "segments":[
        {
          "annotations":[
            {
              "name":"restaurant_reservation.num.guests"
            },
            {
              "name":"restaurant_reservation.num.guests"
            }
          ],
          "end_index":20,
          "start_index":18,
          "text":"50"
        }
      ],
      "speaker":"USER",
      "text":"I have a party of 50 who want a really sloppy dog on Saturday at noon."
    }
  ]
}
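The start_index and end_index values in this example appear to be character offsets into the utterance text, with end_index exclusive. A minimal sketch that checks this reading against the two annotated utterances in the example (the helper name segment_span is hypothetical, not part of the dataset or any library):

```python
def segment_span(utterance_text: str, start_index: int, end_index: int) -> str:
    """Return the substring covered by a segment, treating end_index as exclusive."""
    return utterance_text[start_index:end_index]


# Utterance texts and offsets copied from the example conversation.
assistant_text = "You might enjoy Spicy Wet Hotdogs LLC."
user_text = "I have a party of 50 who want a really sloppy dog on Saturday at noon."

restaurant = segment_span(assistant_text, 16, 37)
guests = segment_span(user_text, 18, 20)

print(restaurant)  # Spicy Wet Hotdogs LLC
print(guests)      # 50
```

Both slices reproduce the "text" value stored in the corresponding segment, which is consistent with Python-style half-open ranges.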
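Each conversation in the data file has the following structure: conversation_id (a unique identifier string, prefixed with dlg- in the example), instruction_id (the identifier of the instructions given for the conversation, such as restaurant-table-2), and utterances (the list of utterances that make up the conversation).
Each utterance has the following fields: index (the 0-based position of the utterance in the conversation), speaker (USER or ASSISTANT), text (the utterance text), and segments (a list of annotated text spans, possibly empty).
Each segment has the following fields: start_index and end_index (the character offsets of the span in the utterance text), text (the annotated span), and annotations (a list of annotation entries for the span).
Each annotation has a single field: name (the annotation label, such as restaurant_reservation.num.guests).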
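The four nesting levels (conversation, utterance, segment, annotation) can be mirrored with small typed containers. The sketch below is one possible way to do that, not an official schema: the class names and parse_conversation are hypothetical, and the sample record is a shortened copy of the example conversation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    start_index: int
    end_index: int
    text: str
    # Each annotation has a single "name" field, so only the names are kept.
    annotation_names: List[str] = field(default_factory=list)


@dataclass
class Utterance:
    index: int
    speaker: str
    text: str
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Conversation:
    conversation_id: str
    instruction_id: str
    utterances: List[Utterance] = field(default_factory=list)


def parse_conversation(raw: dict) -> Conversation:
    """Convert one raw conversation dict into the typed containers above."""
    utterances = []
    for u in raw["utterances"]:
        segments = [
            Segment(
                start_index=s["start_index"],
                end_index=s["end_index"],
                text=s["text"],
                annotation_names=[a["name"] for a in s.get("annotations", [])],
            )
            for s in u.get("segments", [])
        ]
        utterances.append(Utterance(u["index"], u["speaker"], u["text"], segments))
    return Conversation(raw["conversation_id"], raw["instruction_id"], utterances)


# Shortened copy of the example conversation (first two utterances only).
raw = {
    "conversation_id": "dlg-336c8165-068e-4b4b-803d-18ef0676f668",
    "instruction_id": "restaurant-table-2",
    "utterances": [
        {
            "index": 0,
            "segments": [],
            "speaker": "USER",
            "text": "Hi, I'm looking for a place that sells spicy wet hotdogs, can you think of any?",
        },
        {
            "index": 1,
            "segments": [
                {
                    "annotations": [
                        {"name": "restaurant_reservation.name.restaurant.reject"}
                    ],
                    "end_index": 37,
                    "start_index": 16,
                    "text": "Spicy Wet Hotdogs LLC",
                }
            ],
            "speaker": "ASSISTANT",
            "text": "You might enjoy Spicy Wet Hotdogs LLC.",
        },
    ],
}

conversation = parse_conversation(raw)

# Collect every annotated span together with the speaker who produced it.
annotated = [
    (u.speaker, s.text, s.annotation_names)
    for u in conversation.utterances
    for s in u.segments
]
print(annotated)
```

Flattening annotations to a list of names is a design choice that only holds while each annotation carries nothing but a name.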
The data in the one_person_dialogs config is split into train, validation (dev), and test splits.
| | train | validation | test |
|---|---|---|---|
| N. Instances | 6168 | 770 | 770 |
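A hedged sketch of how these splits might be loaded and checked with the Hugging Face datasets library. It assumes the dataset is reachable under the identifier taskmaster1 with the config name one_person_dialogs, as named on this card; the identifier may differ on the current Hub, and the call needs the datasets package and network access. The function names are hypothetical.

```python
# Split sizes as listed in the table above.
EXPECTED_SIZES = {"train": 6168, "validation": 770, "test": 770}


def load_self_dialogs():
    """Load the one_person_dialogs config (assumed identifier, needs network access)."""
    # Imported lazily so that defining this function does not require the package.
    from datasets import load_dataset

    return load_dataset("taskmaster1", "one_person_dialogs")


def split_sizes(dataset_dict) -> dict:
    """Map each split name to its number of examples."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Example usage (not run here because it downloads data):
#     sizes = split_sizes(load_self_dialogs())
#     print(sizes == EXPECTED_SIZES)
```

split_sizes works on any mapping from split name to a sized collection, so it can be tried on plain lists before touching the real data.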
The data in the woz_dialogs config has no default splits, so all of its dialogs are in a single train split.
| | train |
|---|---|
| N. Instances | 5507 |
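As a quick consistency check, the split sizes in the two tables can be added up and compared with the 13,215 dialogs stated in the description. The variable names below are illustrative only.

```python
# Instance counts copied from the two split tables.
one_person_dialogs = {"train": 6168, "validation": 770, "test": 770}
woz_dialogs = {"train": 5507}

self_total = sum(one_person_dialogs.values())
grand_total = self_total + sum(woz_dialogs.values())

# Share of each split within the one_person_dialogs config.
fractions = {name: round(n / self_total, 3) for name, n in one_person_dialogs.items()}

print(self_total)   # 7708
print(grand_total)  # 13215
print(fractions)    # {'train': 0.8, 'validation': 0.1, 'test': 0.1}
```

The totals match the stated dataset size, and the one_person_dialogs config follows a roughly 80/10/10 split.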
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
The dataset is licensed under the Creative Commons Attribution 4.0 License.
@inproceedings{48484,
title = {Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset},
author = {Bill Byrne and Karthik Krishnamoorthi and Chinnadhurai Sankar and Arvind Neelakantan and Daniel Duckworth and Semih Yavuz and Ben Goodrich and Amit Dubey and Kyu-Young Kim and Andy Cedilnik},
year = {2019}
}
Thanks to @patil-suraj for adding this dataset.