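Dataset Card for SwDA

Dataset Summary

The Switchboard Dialog Act Corpus (SwDA) extends the Switchboard-1 Telephone Speech Corpus, Release 2, with turn/utterance-level dialog act tags. The tags summarize syntactic, semantic, and pragmatic information about the associated utterance. The SwDA project was carried out at UC Boulder in the late 1990s. The SwDA is not inherently linked to the Penn Treebank 3 parses of Switchboard, and aligning the two resources is far from straightforward. In addition, the SwDA is not distributed with the Switchboard metadata tables describing the conversations and their participants.

Supported Tasks and Leaderboards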

| Model | Accuracy (%) |
| --- | --- |
| H-Seq2seq (Colombo et al., 2020) | 85.0 |
| SGNN (Ravi et al., 2018) | 83.1 |
| CASA (Raheja et al., 2019) | 82.9 |
| DAH-CRF (Li et al., 2019) | 82.3 |
| ALDMN (Wan et al., 2018) | 81.5 |
| CRF-ASN (Chen et al., 2018) | 81.3 |
| Pretrained H-Transformer (Chapuis et al., 2020), "Hierarchical Pre-training for Sequence Labelling in Spoken Dialog" | 79.3 |
| Bi-LSTM-CRF (Kumar et al., 2017) | 79.2 |
| RNN with 3 utterances in context (Bothe et al., 2018) | 77.34 |

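Languages

The supported language is English.

Dataset Structure

Utterances are tagged with SWBD-DAMSL dialog act (DA) labels.

Data Instances

An example from the dataset: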

{"act_tag": 115, "caller": "A", "conversation_no": 4325, "damsl_act_tag": 26, "from_caller": 1632, "from_caller_birth_year": 1962, "from_caller_dialect_area": "WESTERN", "from_caller_education": 2, "from_caller_sex": "FEMALE", "length": 5, "pos": "Okay/UH ./.", "prompt": "FIND OUT WHAT CRITERIA THE OTHER CALLER WOULD USE IN SELECTING CHILD CARE SERVICES FOR A PRESCHOOLER. IS IT EASY OR DIFFICULT TO FIND SUCH CARE?", "ptb_basename": "4/sw4325", "ptb_treenumbers": "1", "subutterance_index": 1, "swda_filename": "sw00utt/sw_0001_4325.utt", "talk_day": "03/23/1992", "text": "Okay. /", "to_caller": 1519, "to_caller_birth_year": 1971, "to_caller_dialect_area": "SOUTH MIDLAND", "to_caller_education": 1, "to_caller_sex": "FEMALE", "topic_description": "CHILD CARE", "transcript_index": 0, "trees": "(INTJ (UH Okay) (. .) (-DFL- E_S))", "utterance_index": 1}

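Data Fields

  • swda_filename: (str) The filename: directory/basename.
  • ptb_basename: (str) The Treebank filename: add ".pos" for POS and ".mrg" for trees.
  • conversation_no: (int) The conversation ID, used to key into the metadata database.
  • transcript_index: (int) The line number of this item in the transcript (counting only utt lines).
  • act_tag: (list of str) The dialog act tags (separated by ||| in the file). See Dialog Act Annotations for more details.
  • damsl_act_tag: (list of str) The dialog act tags of the 217 variation tags.
  • caller: (str) A, B, @A, @B, @@A, @@B.
  • utterance_index: (int) The encoded index of the utterance (the number in A.49, B.27, etc.).
  • subutterance_index: (int) Utterances can be broken across lines. This gives the internal position.
  • text: (str) The text of the utterance.
  • pos: (str) The POS-tagged version of the utterance, from PtbBasename+.pos.
  • trees: (str) The tree(s) containing this utterance (separated by ||| in the file). Use [Tree.fromstring(t) for t in row_value.split("|||")] to convert to a list of nltk.tree.Tree.
  • ptb_treenumbers: (list of int) The tree numbers in PtbBasename+.mrg.
  • talk_day: (str) Date of the talk.
  • length: (int) Length of the talk in seconds.
  • topic_description: (str) Short description of the topic being discussed.
  • prompt: (str) Long description/query/instruction.
  • from_caller: (int) The numerical ID of the from (A) caller.
  • from_caller_sex: (str) MALE, FEMALE.
  • from_caller_education: (int) Caller education level: 0, 1, 2, 3, 9.
  • from_caller_birth_year: (int) Birth year of the from (A) caller (YYYY).
  • from_caller_dialect_area: (str) MIXED, NEW ENGLAND, NORTH MIDLAND, NORTHERN, NYC, SOUTH MIDLAND, SOUTHERN, UNK, WESTERN.
  • to_caller: (int) The numerical ID of the to (B) caller.
  • to_caller_sex: (str) MALE, FEMALE.
  • to_caller_education: (int) Caller education level: 0, 1, 2, 3, 9.
  • to_caller_birth_year: (int) Birth year of the to (B) caller (YYYY).
  • to_caller_dialect_area: (str) MIXED, NEW ENGLAND, NORTH MIDLAND, NORTHERN, NYC, SOUTH MIDLAND, SOUTHERN, UNK, WESTERN.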
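
As a minimal sketch of how the pos, trees, caller, and utterance_index fields might be consumed, the following uses only the Python standard library. The helper names (parse_pos, split_trees, utterance_label) are hypothetical and not part of any SwDA tooling, and the TOKEN/TAG layout of pos is an assumption inferred from the example instance shown earlier. Converting the bracketed strings into nltk.tree.Tree objects would additionally require NLTK, as noted for the trees field.

```python
from typing import List, Tuple

# Trimmed copy of the example instance shown earlier in this card.
row = {
    "caller": "A",
    "utterance_index": 1,
    "pos": "Okay/UH ./.",
    "trees": "(INTJ (UH Okay) (. .) (-DFL- E_S))",
}


def parse_pos(pos: str) -> List[Tuple[str, str]]:
    """Split a 'TOKEN/TAG TOKEN/TAG' string into (token, tag) pairs.

    Assumption: items are whitespace-separated and the tag follows the
    last '/' in each item, so a token such as '/' itself is still handled.
    """
    pairs = []
    for item in pos.split():
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def split_trees(trees: str) -> List[str]:
    """Split the '|||'-separated trees field into bracketed strings."""
    return [t.strip() for t in trees.split("|||") if t.strip()]


def utterance_label(caller: str, utterance_index: int) -> str:
    """Rebuild a label of the form 'A.49' from caller and utterance_index."""
    return f"{caller}.{utterance_index}"


print(parse_pos(row["pos"]))
print(split_trees(row["trees"]))
print(utterance_label(row["caller"], row["utterance_index"]))
```

Keeping the tree strings unparsed avoids a hard dependency on NLTK; each string returned by split_trees can be passed to Tree.fromstring when that library is available.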

Dialog Act Annotations

| # | Name | act_tag | Example | Train count | Full count |
| --- | --- | --- | --- | --- | --- |
| 1 | Statement-non-opinion | sd | Me, I'm in the legal department. | 72824 | 75145 |
| 2 | Acknowledge (Backchannel) | b | Uh-huh. | 37096 | 38298 |
| 3 | Statement-opinion | sv | I think it's great | 25197 | 26428 |
| 4 | Agree/Accept | aa | That's exactly it. | 10820 | 11133 |
| 5 | Abandoned or Turn-Exit | % | So, - | 10569 | 15550 |
| 6 | Appreciation | ba | I can imagine. | 4633 | 4765 |
| 7 | Yes-No-Question | qy | Do you have to have any special training? | 4624 | 4727 |
| 8 | Non-verbal | x | [Laughter], [Throat_clearing] | 3548 | 3630 |
| 9 | Yes answers | ny | Yes. | 2934 | 3034 |
| 10 | Conventional-closing | fc | Well, it's been nice talking to you. | 2486 | 2582 |
| 11 | Uninterpretable | % | But, uh, yeah | 2158 | 15550 |
| 12 | Wh-Question | qw | Well, how old are you? | 1911 | 1979 |
| 13 | No answers | nn | No. | 1340 | 1377 |
| 14 | Response Acknowledgement | bk | Oh, okay. | 1277 | 1306 |
| 15 | Hedge | h | I don't know if I'm making any sense or not. | 1182 | 1226 |
| 16 | Declarative Yes-No-Question | qy^d | So you can afford to get a house? | 1174 | 1219 |
| 17 | Other | fo_o_fw_by_bc | Well give me a break, you know. | 1074 | 883 |
| 18 | Backchannel in question form | bh | Is that right? | 1019 | 1053 |
| 19 | Quotation | ^q | You can't be pregnant and have cats | 934 | 983 |
| 20 | Summarize/reformulate | bf | Oh, you mean you switched schools for the kids. | 919 | 952 |
| 21 | Affirmative non-yes answers | na | It is. | 836 | 847 |
| 22 | Action-directive | ad | Why don't you go first | 719 | 746 |
| 23 | Collaborative Completion | ^2 | Who aren't contributing. | 699 | 723 |
| 24 | Repeat-phrase | b^m | Oh, fajitas | 660 | 688 |
| 25 | Open-Question | qo | How about you? | 632 | 656 |
| 26 | Rhetorical-Questions | qh | Who would steal a newspaper? | 557 | 575 |
| 27 | Hold before answer/agreement | ^h | I'm drawing a blank. | 540 | 556 |
| 28 | Reject | ar | Well, no | 338 | 346 |
| 29 | Negative non-no answers | ng | Uh, not a whole lot. | 292 | 302 |
| 30 | Signal-non-understanding | br | Excuse me? | 288 | 298 |
| 31 | Other answers | no | I don't know | 279 | 286 |
| 32 | Conventional-opening | fp | How are you? | 220 | 225 |
| 33 | Or-Clause | qrr | or is it more of a company? | 207 | 209 |
| 34 | Dispreferred answers | arp_nd | Well, not so much that. | 205 | 207 |
| 35 | 3rd-party-talk | t3 | My goodness, Diane, get down from there. | 115 | 117 |
| 36 | Offers, Options, Commits | oo_co_cc | I'll have to check that out | 109 | 110 |
| 37 | Self-talk | t1 | What's the word I'm looking for | 102 | 103 |
| 38 | Downplayer | bd | That's all right. | 100 | 103 |
| 39 | Maybe/Accept-part | aap_am | Something like that | 98 | 105 |
| 40 | Tag-Question | ^g | Right? | 93 | 92 |
| 41 | Declarative Wh-Question | qw^d | You are what kind of buff? | 80 | 80 |
| 42 | Apology | fa | I'm sorry. | 76 | 79 |
| 43 | Thanking | ft | Hey thanks a lot | 67 | 78 |
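
A small sketch of how the table above could be turned into a lookup, using a handful of rows copied from it. The names ROWS, names_by_tag, and train_by_tag are illustrative assumptions, not part of the dataset's API. Note that the % tag appears on two rows (Abandoned or Turn-Exit and Uninterpretable), so a tag can map to more than one name.

```python
from collections import defaultdict

# (name, act_tag, train_count, full_count), copied from a few table rows.
ROWS = [
    ("Statement-non-opinion", "sd", 72824, 75145),
    ("Acknowledge (Backchannel)", "b", 37096, 38298),
    ("Statement-opinion", "sv", 25197, 26428),
    ("Abandoned or Turn-Exit", "%", 10569, 15550),
    ("Uninterpretable", "%", 2158, 15550),
]

# A tag may be shared by several names, so collect names in a list
# and sum the per-row training counts for each tag.
names_by_tag = defaultdict(list)
train_by_tag = defaultdict(int)
for name, tag, train_count, _full_count in ROWS:
    names_by_tag[tag].append(name)
    train_by_tag[tag] += train_count

# Tag with the largest training count among the rows listed here.
most_frequent = max(train_by_tag, key=train_by_tag.get)

print(most_frequent, train_by_tag[most_frequent])
print(names_by_tag["%"], train_by_tag["%"])
```

Grouping by tag rather than by name makes the shared % tag explicit, which matters when the counts are used to build label statistics.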

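Data Splits

I used information from the Probabilistic-RNN-DA-Classifier repository: the same training and test splits as used by Stolcke et al. (2000). The development set is a subset of the training set, used to speed up development and testing in the paper Probabilistic Word Association for Dialogue Act Classification with Recurrent Neural Networks.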

| Dataset | # Transcripts | # Utterances |
| --- | --- | --- |
| Training | 1115 | 192,768 |
| Validation | 21 | 3,196 |
| Test | 19 | 4,088 |
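
As a quick sanity check on the split sizes, the sketch below computes the average number of utterances per transcript for each split, using only the figures from the table above. The names SPLITS and avg_utterances are illustrative.

```python
# (transcripts, utterances) per split, copied from the table above.
SPLITS = {
    "Training": (1115, 192_768),
    "Validation": (21, 3_196),
    "Test": (19, 4_088),
}

# Average utterances per transcript in each split.
avg_utterances = {
    name: utterances / transcripts
    for name, (transcripts, utterances) in SPLITS.items()
}

for name, avg in avg_utterances.items():
    print(f"{name}: {avg:.1f} utterances per transcript")
```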

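Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

The SwDA is not inherently linked to the Penn Treebank 3 parses of Switchboard, and aligning the two resources is far from straightforward; see Calhoun et al. 2010, §2.4. In addition, the SwDA is not distributed with the Switchboard metadata tables describing the conversations and their participants.

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Christopher Potts, Stanford Linguistics.

Licensing Information

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Citation Information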

@techreport{Jurafsky-etal:1997,
    Address = {Boulder, CO},
    Author = {Jurafsky, Daniel and Shriberg, Elizabeth and Biasca, Debra},
    Institution = {University of Colorado, Boulder Institute of Cognitive Science},
    Number = {97-02},
    Title = {Switchboard {SWBD}-{DAMSL} Shallow-Discourse-Function Annotation Coders Manual, Draft 13},
    Year = {1997}}

@article{Shriberg-etal:1998,
    Author = {Shriberg, Elizabeth and Bates, Rebecca and Taylor, Paul and Stolcke, Andreas and Jurafsky, Daniel and Ries, Klaus and Coccaro, Noah and Martin, Rachel and Meteer, Marie and Van Ess-Dykema, Carol},
    Journal = {Language and Speech},
    Number = {3--4},
    Pages = {439--487},
    Title = {Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech?},
    Volume = {41},
    Year = {1998}}

@article{Stolcke-etal:2000,
    Author = {Stolcke, Andreas and Ries, Klaus and Coccaro, Noah and Shriberg, Elizabeth and Bates, Rebecca and Jurafsky, Daniel and Taylor, Paul and Martin, Rachel and Meteer, Marie and Van Ess-Dykema, Carol},
    Journal = {Computational Linguistics},
    Number = {3},
    Pages = {339--371},
    Title = {Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech},
    Volume = {26},
    Year = {2000}}

Contributions

Thanks to @gmihaila for adding this dataset.