数据集:

taskmaster1

任务:

文本生成

填充掩码

子任务:

dialogue-modeling

语言:

计算机处理:

monolingual

大小:

1K<n<10K

语言创建人:

crowdsourced

批注创建人:

crowdsourced

源数据集:

original

预印本库:

arxiv:1909.05358

许可:

cc-by-4.0

数据集介绍文件清单

中文

Dataset Card for Taskmaster-1

Dataset Summary

Taskmaster-1 is a goal-oriented conversational dataset. It includes 13,215 task-based dialogs comprising six domains. Two procedures were used to create this collection, each with unique advantages. The first involves a two-person, spoken "Wizard of Oz" (WOz) approach in which trained agents and crowdsourced workers interact to complete the task while the second is "self-dialog" in which crowdsourced workers write the entire dialog themselves.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The dataset is in English language.

Dataset Structure

Data Instances

A typical example looks like this

{
    "conversation_id":"dlg-336c8165-068e-4b4b-803d-18ef0676f668",
    "instruction_id":"restaurant-table-2",
    "utterances":[
      {
        "index":0,
        "segments":[
          
        ],
        "speaker":"USER",
        "text":"Hi, I'm looking for a place that sells spicy wet hotdogs, can you think of any?"
      },
      {
        "index":1,
        "segments":[
          {
            "annotations":[
              {
                "name":"restaurant_reservation.name.restaurant.reject"
              }
            ],
            "end_index":37,
            "start_index":16,
            "text":"Spicy Wet Hotdogs LLC"
          }
        ],
        "speaker":"ASSISTANT",
        "text":"You might enjoy Spicy Wet Hotdogs LLC."
      },
      {
        "index":2,
        "segments":[
          
        ],
        "speaker":"USER",
        "text":"That sounds really good, can you make me a reservation?"
      },
      {
        "index":3,
        "segments":[
          
        ],
        "speaker":"ASSISTANT",
        "text":"Certainly, when would you like a reservation?"
      },
      {
        "index":4,
        "segments":[
          {
            "annotations":[
              {
                "name":"restaurant_reservation.num.guests"
              },
              {
                "name":"restaurant_reservation.num.guests"
              }
            ],
            "end_index":20,
            "start_index":18,
            "text":"50"
          }
        ],
        "speaker":"USER",
        "text":"I have a party of 50 who want a really sloppy dog on Saturday at noon."
      }
    ]
  }

Data Fields

Each conversation in the data file has the following structure:

conversation_id : A universally unique identifier with the prefix 'dlg-'. The ID has no meaning.
utterances : A list of utterances that make up the conversation.
instruction_id : A reference to the file(s) containing the user (and, if applicable, agent) instructions for this conversation.

Each utterance has the following fields:

index : A 0-based index indicating the order of the utterances in the conversation.
speaker : Either USER or ASSISTANT, indicating which role generated this utterance.
text : The raw text of the utterance. In case of self dialogs (one_person_dialogs), this is written by the crowdsourced worker. In case of the WOz dialogs, 'ASSISTANT' turns are written and 'USER' turns are transcribed from the spoken recordings of crowdsourced workers.
segments : A list of various text spans with semantic annotations.

Each segment has the following fields:

start_index : The position of the start of the annotation in the utterance text.
end_index : The position of the end of the annotation in the utterance text.
text : The raw text that has been annotated.
annotations : A list of annotation details for this segment.

Each annotation has a single field:

name : The annotation name.

Data Splits

one_person_dialogs

The data in one_person_dialogs config is split into train , dev and test splits.

train	validation	test
N. Instances	6168	770	770

woz_dialogs

The data in woz_dialogs config has no default splits.

train
N. Instances	5507

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The dataset is licensed under Creative Commons Attribution 4.0 License

Citation Information

[More Information Needed]

@inproceedings{48484,
title	= {Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset},
author	= {Bill Byrne and Karthik Krishnamoorthi and Chinnadhurai Sankar and Arvind Neelakantan and Daniel Duckworth and Semih Yavuz and Ben Goodrich and Amit Dubey and Kyu-Young Kim and Andy Cedilnik},
year	= {2019}
}

Contributions

Thanks to @patil-suraj for adding this dataset.

作者:

佚名

数据集大小:

21.37 KB