Dataset: silicone
Language:
Multilinguality: monolingual
Language creators: expert-generated
Annotations creators: expert-generated
Source datasets: original
ArXiv: arxiv:2009.11152
License:
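SILICONE (Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE) is a collection of resources for training, evaluating, and analyzing natural language understanding systems designed specifically for spoken language. All datasets are in English and cover a variety of domains, including daily life, scripted scenarios, joint task completion, phone call conversations, and television dialogue. Some datasets additionally include emotion and/or sentiment labels.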
English.
DailyDialog Act Corpus (Dialogue Act)

An example from the dyda_da configuration looks as follows:

    {
        'Utterance': "the taxi drivers are on strike again .",
        'Dialogue_Act': 2,  # "inform"
        'Dialogue_ID': "2"
    }

DailyDialog Act Corpus (Emotion)
An example from the dyda_e configuration looks as follows:

    {
        'Utterance': "'oh , breaktime flies .'",
        'Emotion': 5,  # "sadness"
        'Dialogue_ID': "997"
    }

Interactive Emotional Dyadic Motion Capture (IEMOCAP) database
An example from the iemocap configuration looks as follows:

    {
        'Dialogue_ID': "Ses04F_script03_2",
        'Utterance_ID': "Ses04F_script03_2_F025",
        'Utterance': "You're quite insufferable. I expect it's because you're drunk.",
        'Emotion': 0,  # "ang"
    }

HCRC MapTask Corpus
An example from the maptask configuration looks as follows:

    {
        'Speaker': "f",
        'Utterance': "i think that would bring me over the crevasse",
        'Dialogue_Act': 4,  # "explain"
    }

Multimodal EmotionLines Dataset (Emotion)
An example from the meld_e configuration looks as follows:

    {
        'Utterance': "'Push 'em out , push 'em out , harder , harder .'",
        'Speaker': "Joey",
        'Emotion': 3,  # "joy"
        'Dialogue_ID': "1",
        'Utterance_ID': "2"
    }

Multimodal EmotionLines Dataset (Sentiment)
An example from the meld_s configuration looks as follows:

    {
        'Utterance': "'Okay , y'know what ? There is no more left , left !'",
        'Speaker': "Rachel",
        'Sentiment': 0,  # "negative"
        'Dialogue_ID': "2",
        'Utterance_ID': "4"
    }

ICSI MRDA Corpus
An example from the mrda configuration looks as follows:

    {
        'Utterance_ID': "Bed006-c2_0073656_0076706",
        'Dialogue_Act': 0,  # "s"
        'Channel_ID': "Bed006-c2",
        'Speaker': "mn015",
        'Dialogue_ID': "Bed006",
        'Utterance': "keith is not technically one of us yet ."
    }

BT OASIS Corpus
An example from the oasis configuration looks as follows:

    {
        'Speaker': "b",
        'Utterance': "when i rang up um when i rang to find out why she said oh well your card's been declined",
        'Dialogue_Act': 21,  # "inform"
    }

SEMAINE database
An example from the sem configuration looks as follows:

    {
        'Utterance': "can you think of somebody who is like that ?",
        'NbPairInSession': "11",
        'Dialogue_ID': "59",
        'SpeechTurn': "674",
        'Speaker': "Agent",
        'Sentiment': 1,  # "Neutral"
    }

Switchboard Dialog Act (SwDA) Corpus
An example from the swda configuration looks as follows:

    {
        'Utterance': "but i 'd probably say that 's roughly right .",
        'Dialogue_Act': 33,  # "aap_am"
        'From_Caller': "1255",
        'To_Caller': "1087",
        'Topic': "CRIME",
        'Dialogue_ID': "818",
        'Conv_ID': "sw2836",
    }
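The trailing comments in the examples (such as `# "aap_am"`) indicate that labels are stored as integers that index into a list of tag names. The following is a minimal sketch of that decoding for the swda configuration, assuming the tag order given in this card's swda field description. `SWDA_TAGS`, `decode_dialogue_act`, and the `Dialogue_Act_Name` key are illustrative names introduced here, not part of the dataset.

```python
# Tag names in index order, copied from the swda field description in this card.
SWDA_TAGS = [
    "sd", "b", "sv", "%", "aa", "ba", "fc", "qw", "nn", "bk",
    "h", "qy^d", "bh", "^q", "bf", 'fo_o_fw_" by_bc', 'fo_o_fw_by_bc "',
    "na", "ad", "^2", "b^m", "qo", "qh", "^h", "ar", "ng", "br", "no",
    "fp", "qrr", "arp_nd", "t3", "oo_co_cc", "aap_am", "t1", "bd", "^g",
    "qw^d", "fa", "ft", "+", "x", "ny", "sv_fx", "qy_qr", "ba_fe",
]


def decode_dialogue_act(example, tags=SWDA_TAGS):
    """Return a copy of `example` with the tag name added under a new key.

    'Dialogue_Act_Name' is a hypothetical key chosen for this sketch.
    """
    decoded = dict(example)  # shallow copy so the input record is not modified
    decoded["Dialogue_Act_Name"] = tags[example["Dialogue_Act"]]
    return decoded


# The swda example record shown in this card.
example = {
    "Utterance": "but i 'd probably say that 's roughly right .",
    "Dialogue_Act": 33,
    "From_Caller": "1255",
    "To_Caller": "1087",
    "Topic": "CRIME",
    "Dialogue_ID": "818",
    "Conv_ID": "sw2836",
}

decoded = decode_dialogue_act(example)
print(decoded["Dialogue_Act_Name"])  # aap_am
```

If the data is loaded through the Hugging Face `datasets` library, the label column is usually exposed as a `ClassLabel` feature, whose `int2str` method should provide the same mapping without a hand-copied list.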
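For the dyda_da configuration, the different fields are:
For the dyda_e configuration, the different fields are:
For the iemocap configuration, the different fields are:
For the maptask configuration, the different fields are:
For the meld_e configuration, the different fields are:
For the meld_s configuration, the different fields are:
For the mrda configuration, the different fields are:
For the oasis configuration, the different fields are:
For the sem configuration, the different fields are: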
For the swda configuration, the different fields are:

- Utterance: the utterance, as a string.
- Dialogue_Act: the dialogue act label of the utterance. It can be one of "sd" (0) [Statement-non-opinion], "b" (1) [Acknowledge (Backchannel)], "sv" (2) [Statement-opinion], "%" (3) [Uninterpretable], "aa" (4) [Agree/Accept], "ba" (5) [Appreciation], "fc" (6) [Conventional-closing], "qw" (7) [Wh-Question], "nn" (8) [No Answers], "bk" (9) [Response Acknowledgement], "h" (10) [Hedge], "qy^d" (11) [Declarative Yes-No-Question], "bh" (12) [Backchannel in Question Form], "^q" (13) [Quotation], "bf" (14) [Summarize/Reformulate], 'fo_o_fw_" by_bc' (15) [Other], 'fo_o_fw_by_bc "' (16) [Other], "na" (17) [Affirmative Non-yes Answers], "ad" (18) [Action-directive], "^2" (19) [Collaborative Completion], "b^m" (20) [Repeat-phrase], "qo" (21) [Open-Question], "qh" (22) [Rhetorical-Question], "^h" (23) [Hold Before Answer/Agreement], "ar" (24) [Reject], "ng" (25) [Negative Non-no Answers], "br" (26) [Signal-non-understanding], "no" (27) [Other Answers], "fp" (28) [Conventional-opening], "qrr" (29) [Or-Clause], "arp_nd" (30) [Dispreferred Answers], "t3" (31) [3rd-party-talk], "oo_co_cc" (32) [Offers, Options Commits], "aap_am" (33) [Maybe/Accept-part], "t1" (34) [Downplayer], "bd" (35) [Self-talk], "^g" (36) [Tag-Question], "qw^d" (37) [Declarative Wh-Question], "fa" (38) [Apology], "ft" (39) [Thanking], "+" (40) [Unknown], "x" (41) [Unknown], "ny" (42) [Unknown], "sv_fx" (43) [Unknown], "qy_qr" (44) [Unknown], or "ba_fe" (45) [Unknown].
- From_Caller: identifier of the source caller, as a string.
- To_Caller: identifier of the target caller, as a string.
- Topic: the topic, as a string.
- Dialogue_ID: identifier of the dialogue, as a string.
- Conv_ID: identifier of the conversation, as a string.

Dataset name | Train | Valid | Test |
---|---|---|---|
dyda_da | 87170 | 8069 | 7740 |
dyda_e | 87170 | 8069 | 7740 |
iemocap | 7213 | 805 | 2021 |
maptask | 20905 | 2963 | 2894 |
meld_e | 9989 | 1109 | 2610 |
meld_s | 9989 | 1109 | 2610 |
mrda | 83944 | 9815 | 15470 |
oasis | 12076 | 1513 | 1478 |
sem | 4264 | 485 | 878 |
swda | 190709 | 21203 | 2714 |
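The split sizes in the table vary widely between configurations, so it can help to look at totals and relative proportions before choosing an evaluation setup. The sketch below copies the counts from the table and computes both; `SPLIT_SIZES` and `split_fractions` are illustrative names introduced here.

```python
# (train, valid, test) example counts copied from the table in this card.
SPLIT_SIZES = {
    "dyda_da": (87170, 8069, 7740),
    "dyda_e": (87170, 8069, 7740),
    "iemocap": (7213, 805, 2021),
    "maptask": (20905, 2963, 2894),
    "meld_e": (9989, 1109, 2610),
    "meld_s": (9989, 1109, 2610),
    "mrda": (83944, 9815, 15470),
    "oasis": (12076, 1513, 1478),
    "sem": (4264, 485, 878),
    "swda": (190709, 21203, 2714),
}


def split_fractions(train, valid, test):
    """Return the total count and each split's share of it, rounded to 3 places."""
    total = train + valid + test
    fractions = tuple(round(n / total, 3) for n in (train, valid, test))
    return total, fractions


for name, sizes in SPLIT_SIZES.items():
    total, fractions = split_fractions(*sizes)
    print(f"{name:8s} total={total:6d} train/valid/test={fractions}")
```

Rounding to three places is only for display; the raw counts in `SPLIT_SIZES` remain the reference values.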
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
Emile Chapuis, Pierre Colombo, Ebenge Usip.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Unported License.
    @inproceedings{chapuis-etal-2020-hierarchical,
        title = "Hierarchical Pre-training for Sequence Labelling in Spoken Dialog",
        author = "Chapuis, Emile and Colombo, Pierre and Manica, Matteo and Labeau, Matthieu and Clavel, Chlo{\'e}",
        booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
        month = nov,
        year = "2020",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.findings-emnlp.239",
        doi = "10.18653/v1/2020.findings-emnlp.239",
        pages = "2636--2648",
        abstract = "Sequence labelling tasks like Dialog Act and Emotion/Sentiment identification are a key component of spoken dialog systems. In this work, we propose a new approach to learn generic representations adapted to spoken dialog, which we evaluate on a new benchmark we call Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE benchmark (SILICONE). SILICONE is model-agnostic and contains 10 different datasets of various sizes. We obtain our representations with a hierarchical encoder based on transformer architectures, for which we extend two well-known pre-training objectives. Pre-training is performed on OpenSubtitles: a large corpus of spoken dialog containing over 2.3 billion of tokens. We demonstrate how hierarchical encoders achieve competitive results with consistently fewer parameters compared to state-of-the-art models and we show their importance for both pre-training and fine-tuning.",
    }