Datasets:

silicone

Tasks:

Text Generation

Fill-Mask

Text Classification

Sub-tasks:

dialogue-modeling language-modeling masked-language-modeling

Languages:

English

Multilinguality:

monolingual

Size categories:

100K<n<1M 10K<n<100K 1K<n<10K

Language creators:

expert-generated

Annotations creators:

expert-generated

Source datasets:

original

ArXiv:

arxiv:2009.11152

Other:

emotion-classification dialogue-act-classification

License:

cc-by-sa-4.0

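Dataset Card for SILICONE Benchmark

Dataset Summary

SILICONE (the Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE) is a collection of resources for training, evaluating, and analyzing natural language understanding systems designed specifically for spoken language. All datasets are in English and cover a variety of domains, including daily life, scripted scenarios, joint task completion, phone call conversations, and television dialogue. Some datasets additionally include emotion and/or sentiment labels.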

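Supported Tasks and Leaderboards

[More Information Needed]

Languages

English.

Dataset Structure

Data Instances

DailyDialog Act Corpus (Dialogue Act)

For the dyda_da configuration, one example from the dataset is: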

{
  'Utterance': "the taxi drivers are on strike again .",
  'Dialogue_Act': 2, # "inform"
  'Dialogue_ID': "2"
}

DailyDialog Act Corpus (Emotion)

For the dyda_e configuration, one example from the dataset is:

{
  'Utterance': "'oh , breaktime flies .'",
  'Emotion': 5, # "sadness"
  'Dialogue_ID': "997"
}

Interactive Emotional Dyadic Motion Capture (IEMOCAP) database

For the iemocap configuration, one example from the dataset is:

{
  'Dialogue_ID': "Ses04F_script03_2",
  'Utterance_ID': "Ses04F_script03_2_F025",
  'Utterance': "You're quite insufferable.  I expect it's because you're drunk.",
  'Emotion': 0, # "ang"
}

HCRC MapTask Corpus

For the maptask configuration, one example from the dataset is:

{
  'Speaker': "f",
  'Utterance': "i think that would bring me over the crevasse",
  'Dialogue_Act': 4, # "explain"
}

Multimodal EmotionLines Dataset (Emotion)

For the meld_e configuration, one example from the dataset is:

{
  'Utterance': "'Push 'em out , push 'em out , harder , harder .'",
  'Speaker': "Joey",
  'Emotion': 3, # "joy"
  'Dialogue_ID': "1",
  'Utterance_ID': "2"
}

Multimodal EmotionLines Dataset (Sentiment)

For the meld_s configuration, one example from the dataset is:

{
  'Utterance': "'Okay , y'know what ? There is no more left , left !'",
  'Speaker': "Rachel",
  'Sentiment': 0, # "negative"
  'Dialogue_ID': "2",
  'Utterance_ID': "4"
}

ICSI MRDA Corpus

For the mrda configuration, one example from the dataset is:

{
  'Utterance_ID': "Bed006-c2_0073656_0076706",
  'Dialogue_Act': 0, # "s"
  'Channel_ID': "Bed006-c2",
  'Speaker': "mn015",
  'Dialogue_ID': "Bed006",
  'Utterance': "keith is not technically one of us yet ."
}

BT OASIS Corpus

For the oasis configuration, one example from the dataset is:

{
  'Speaker': "b",
  'Utterance': "when i rang up um when i rang to find out why she said oh well your card's been declined",
  'Dialogue_Act': 21, # "inform"
}

SEMAINE database

For the sem configuration, one example from the dataset is:

{
  'Utterance': "can you think of somebody who is like that ?",
  'NbPairInSession': "11",
  'Dialogue_ID': "59",
  'SpeechTurn': "674",
  'Speaker': "Agent",
  'Sentiment': 1, # "Neutral"
}

Switchboard Dialog Act (SwDA) Corpus

For the swda configuration, one example from the dataset is:

{
  'Utterance': "but i 'd probably say that 's roughly right .",
  'Dialogue_Act': 33, # "aap_am"
  'From_Caller': "1255",
  'To_Caller': "1087",
  'Topic': "CRIME",
  'Dialogue_ID': "818",
  'Conv_ID': "sw2836",
}
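
Each configuration shown above is loaded separately. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the benchmark is published under the identifier `silicone` with the configuration names used in this card. The helper name `load_silicone_config` is hypothetical, and recent versions of the library may require a different repository identifier.

```python
# Configuration names as listed in this card.
SILICONE_CONFIGS = (
    "dyda_da", "dyda_e", "iemocap", "maptask", "meld_e",
    "meld_s", "mrda", "oasis", "sem", "swda",
)


def load_silicone_config(name, split="train"):
    """Hypothetical helper: validate the configuration name, then load one split."""
    if name not in SILICONE_CONFIGS:
        raise ValueError(f"unknown SILICONE configuration: {name!r}")
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset
    # Assumption: the benchmark is hosted under the identifier "silicone".
    return load_dataset("silicone", name, split=split)


# Example call (needs network access, so it is left commented out):
# train = load_silicone_config("dyda_da")
# print(train[0])
```

Validating the name first gives a clear error for a typo before any download is attempted.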

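Data Fields

For the dyda_da configuration, the different fields are:

Utterance: utterance as a string.
Dialogue_Act: dialogue act label of the utterance. It can be one of "commissive" (0), "directive" (1), "inform" (2) or "question" (3).
Dialogue_ID: identifier of the dialogue as a string.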
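
Because labels are stored as integers, they usually need to be mapped back to names before inspection. The sketch below does this for dyda_da using only the index order listed above; the helper name `decode_dialogue_act` is hypothetical and not part of the dataset.

```python
# Label names for dyda_da, in the index order given in this card.
DYDA_DA_LABELS = ["commissive", "directive", "inform", "question"]


def decode_dialogue_act(example, labels=DYDA_DA_LABELS):
    """Return a copy of the example with the integer label replaced by its name."""
    decoded = dict(example)
    decoded["Dialogue_Act"] = labels[example["Dialogue_Act"]]
    return decoded


# The dyda_da instance shown earlier in this card.
example = {
    "Utterance": "the taxi drivers are on strike again .",
    "Dialogue_Act": 2,
    "Dialogue_ID": "2",
}
print(decode_dialogue_act(example)["Dialogue_Act"])  # inform
```

The same pattern applies to the other configurations once their label lists are substituted.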

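For the dyda_e configuration, the different fields are:

Utterance: utterance as a string.
Emotion: emotion label of the utterance. It can be one of "anger" (0), "disgust" (1), "fear" (2), "happiness" (3), "no emotion" (4), "sadness" (5) or "surprise" (6).
Dialogue_ID: identifier of the dialogue as a string.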

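For the iemocap configuration, the different fields are:

Dialogue_ID: identifier of the dialogue as a string.
Utterance_ID: identifier of the utterance as a string.
Utterance: utterance as a string.
Emotion: emotion label of the utterance. It can be one of "Anger" (0), "Disgust" (1), "Excitement" (2), "Fear" (3), "Frustration" (4), "Happiness" (5), "Neutral" (6), "Other" (7), "Sadness" (8), "Surprise" (9) or "Unknown" (10).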

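For the maptask configuration, the different fields are:

Speaker: identifier of the speaker as a string.
Utterance: utterance as a string.
Dialogue_Act: dialogue act label of the utterance. It can be one of "acknowledge" (0), "align" (1), "check" (2), "clarify" (3), "explain" (4), "instruct" (5), "query_w" (6), "query_yn" (7), "ready" (8), "reply_n" (9), "reply_w" (10) or "reply_y" (11).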

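For the meld_e configuration, the different fields are:

Utterance: utterance as a string.
Speaker: speaker as a string.
Emotion: emotion label of the utterance. It can be one of "anger" (0), "disgust" (1), "fear" (2), "joy" (3), "neutral" (4), "sadness" (5) or "surprise" (6).
Dialogue_ID: identifier of the dialogue as a string.
Utterance_ID: identifier of the utterance as a string.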
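
Since the benchmark targets sequence labelling, utterances are typically regrouped into dialogues before modelling. The following is a sketch of that step for meld_e, assuming `Utterance_ID` values are numeric strings that give the order within a dialogue. The helper `group_by_dialogue` and two of the rows are hypothetical and marked as such in the code.

```python
from collections import defaultdict

# Label names for meld_e, in the index order given in this card.
MELD_E_LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]


def group_by_dialogue(rows):
    """Group utterance rows by Dialogue_ID, ordered by numeric Utterance_ID."""
    dialogues = defaultdict(list)
    for row in rows:
        dialogues[row["Dialogue_ID"]].append(row)
    for turns in dialogues.values():
        # IDs are stored as strings, so cast before sorting ("10" < "2" as text).
        turns.sort(key=lambda r: int(r["Utterance_ID"]))
    return dict(dialogues)


rows = [
    # Hypothetical rows, made up for illustration only.
    {"Utterance": "placeholder b", "Speaker": "A", "Emotion": 4,
     "Dialogue_ID": "1", "Utterance_ID": "10"},
    {"Utterance": "placeholder a", "Speaker": "B", "Emotion": 6,
     "Dialogue_ID": "1", "Utterance_ID": "0"},
    # The meld_e instance shown earlier in this card.
    {"Utterance": "'Push 'em out , push 'em out , harder , harder .'",
     "Speaker": "Joey", "Emotion": 3, "Dialogue_ID": "1", "Utterance_ID": "2"},
]

for dialogue_id, turns in group_by_dialogue(rows).items():
    label_sequence = [MELD_E_LABELS[t["Emotion"]] for t in turns]
    print(dialogue_id, label_sequence)  # 1 ['surprise', 'joy', 'neutral']
```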

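For the meld_s configuration, the different fields are:

Utterance: utterance as a string.
Speaker: speaker as a string.
Sentiment: sentiment label of the utterance. It can be one of "negative" (0), "neutral" (1) or "positive" (2).
Dialogue_ID: identifier of the dialogue as a string.
Utterance_ID: identifier of the utterance as a string.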

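For the mrda configuration, the different fields are:

Utterance_ID: identifier of the utterance as a string.
Dialogue_Act: dialogue act label of the utterance. It can be one of "s" (0) [Statement/Subjective Statement], "d" (1) [Declarative Question], "b" (2) [Backchannel], "f" (3) [Follow-me] or "q" (4) [Question].
Channel_ID: identifier of the channel as a string.
Speaker: identifier of the speaker as a string.
Dialogue_ID: identifier of the dialogue as a string.
Utterance: utterance as a string.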

For the oasis configuration, the different fields are:

Speaker: identifier of the speaker as a string.
Utterance: utterance as a string.
Dialogue_Act: dialogue act label of the utterance. It can be one of "accept" (0), "ackn" (1), "answ" (2), "answElab" (3), "appreciate" (4), "backch" (5), "bye" (6), "complete" (7), "confirm" (8), "correct" (9), "direct" (10), "directElab" (11), "echo" (12), "exclaim" (13), "expressOpinion" (14), "expressPossibility" (15), "expressRegret" (16), "expressWish" (17), "greet" (18), "hold" (19), "identifySelf" (20), "inform" (21), "informCont" (22), "informDisc" (23), "informIntent" (24), "init" (25), "negate" (26), "offer" (27), "pardon" (28), "raiseIssue" (29), "refer" (30), "refuse" (31), "reqDirect" (32), "reqInfo" (33), "reqModal" (34), "selfTalk" (35), "suggest" (36), "thank" (37), "informIntent-hold" (38), "correctSelf" (39), "expressRegret-inform" (40) or "thank-identifySelf" (41).

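For the sem configuration, the different fields are:

Utterance: utterance as a string.
NbPairInSession: number of utterance pairs in a dialogue.
Dialogue_ID: identifier of the dialogue as a string.
SpeechTurn: speech turn as a string.
Speaker: speaker as a string.
Sentiment: sentiment label of the utterance. It can be "Negative", "Neutral" or "Positive".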

For the swda configuration, the different fields are:

Utterance: utterance as a string.
Dialogue_Act: dialogue act label of the utterance. It can be one of "sd" (0) [Statement-non-opinion], "b" (1) [Acknowledge (Backchannel)], "sv" (2) [Statement-opinion], "%" (3) [Uninterpretable], "aa" (4) [Agree/Accept], "ba" (5) [Appreciation], "fc" (6) [Conventional-closing], "qw" (7) [Wh-Question], "nn" (8) [No Answers], "bk" (9) [Response Acknowledgement], "h" (10) [Hedge], "qy^d" (11) [Declarative Yes-No-Question], "bh" (12) [Backchannel in Question Form], "^q" (13) [Quotation], "bf" (14) [Summarize/Reformulate], 'fo_o_fw_" by_bc' (15) [Other], 'fo_o_fw_by_bc "' (16) [Other], "na" (17) [Affirmative Non-yes Answers], "ad" (18) [Action-directive], "^2" (19) [Collaborative Completion], "b^m" (20) [Repeat-phrase], "qo" (21) [Open-Question], "qh" (22) [Rhetorical-Question], "^h" (23) [Hold Before Answer/Agreement], "ar" (24) [Reject], "ng" (25) [Negative Non-no Answers], "br" (26) [Signal-non-understanding], "no" (27) [Other Answers], "fp" (28) [Conventional-opening], "qrr" (29) [Or-Clause], "arp_nd" (30) [Dispreferred Answers], "t3" (31) [3rd-party-talk], "oo_co_cc" (32) [Offers, Options Commits], "aap_am" (33) [Maybe/Accept-part], "t1" (34) [Self-talk], "bd" (35) [Downplayer], "^g" (36) [Tag-Question], "qw^d" (37) [Declarative Wh-Question], "fa" (38) [Apology], "ft" (39) [Thanking], "+" (40) [Unknown], "x" (41) [Unknown], "ny" (42) [Unknown], "sv_fx" (43) [Unknown], "qy_qr" (44) [Unknown] or "ba_fe" (45) [Unknown].
From_Caller: identifier of the originating caller as a string.
To_Caller: identifier of the receiving caller as a string.
Topic: topic as a string.
Dialogue_ID: identifier of the dialogue as a string.
Conv_ID: identifier of the conversation as a string.

Data Splits

Dataset name	Train	Valid	Test
dyda_da	87170	8069	7740
dyda_e	87170	8069	7740
iemocap	7213	805	2021
maptask	20905	2963	2894
meld_e	9989	1109	2610
meld_s	9989	1109	2610
mrda	83944	9815	15470
oasis	12076	1513	1478
sem	4264	485	878
swda	190709	21203	2714
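
The split sizes above vary by more than an order of magnitude across configurations. As a rough sketch, the snippet below totals each row and reports the train and test shares. Applying the card's size tags to the per-configuration total is an assumption made here for illustration, and `size_category` is a hypothetical helper.

```python
# Utterance counts per configuration (train, valid, test), copied from the table above.
SPLITS = {
    "dyda_da": (87170, 8069, 7740),
    "dyda_e": (87170, 8069, 7740),
    "iemocap": (7213, 805, 2021),
    "maptask": (20905, 2963, 2894),
    "meld_e": (9989, 1109, 2610),
    "meld_s": (9989, 1109, 2610),
    "mrda": (83944, 9815, 15470),
    "oasis": (12076, 1513, 1478),
    "sem": (4264, 485, 878),
    "swda": (190709, 21203, 2714),
}


def size_category(n):
    """Assumed bucketing, using the size tags listed in this card."""
    if 1_000 < n < 10_000:
        return "1K<n<10K"
    if 10_000 < n < 100_000:
        return "10K<n<100K"
    if 100_000 < n < 1_000_000:
        return "100K<n<1M"
    return "other"


for name, (train, valid, test) in SPLITS.items():
    total = train + valid + test
    print(f"{name:8s} total={total:6d} train={train / total:.1%} "
          f"test={test / total:.1%} {size_category(total)}")
```

Under that assumption, the ten configurations fall into exactly the three size tags listed in the card's metadata.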

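Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]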

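Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Benchmark Curators

Emile Chapuis, Pierre Colombo, Ebenge Usip.

Licensing Information

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Unported License.

Citation Information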

@inproceedings{chapuis-etal-2020-hierarchical,
    title = "Hierarchical Pre-training for Sequence Labelling in Spoken Dialog",
    author = "Chapuis, Emile  and
      Colombo, Pierre  and
      Manica, Matteo  and
      Labeau, Matthieu  and
      Clavel, Chlo{\'e}",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.239",
    doi = "10.18653/v1/2020.findings-emnlp.239",
    pages = "2636--2648",
    abstract = "Sequence labelling tasks like Dialog Act and Emotion/Sentiment identification are a key component of spoken dialog systems. In this work, we propose a new approach to learn generic representations adapted to spoken dialog, which we evaluate on a new benchmark we call Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE benchmark (SILICONE). SILICONE is model-agnostic and contains 10 different datasets of various sizes. We obtain our representations with a hierarchical encoder based on transformer architectures, for which we extend two well-known pre-training objectives. Pre-training is performed on OpenSubtitles: a large corpus of spoken dialog containing over 2.3 billion of tokens. We demonstrate how hierarchical encoders achieve competitive results with consistently fewer parameters compared to state-of-the-art models and we show their importance for both pre-training and fine-tuning.",
}

Contributions

Thanks to @eusip and @lhoestq for adding this dataset.

Author:

Anonymous

Dataset size:

92.03 KB