Datasets:

wino_bias

Languages:

en

Multilinguality:

monolingual

Size Categories:

1K<n<10K

Language Creators:

expert-generated

Annotations Creators:

expert-generated

Source Datasets:

original

ArXiv:

arxiv:1804.06876

License:

mit

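Dataset Card for Wino_Bias dataset

Dataset Summary

WinoBias is a Winograd-schema dataset for coreference resolution focused on gender bias. The corpus contains Winograd-schema style sentences in which the entities are people referred to by their occupation (e.g. the nurse, the doctor, the carpenter).

Supported Tasks and Leaderboards

The underlying task is coreference resolution.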

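Languages

English

Dataset Structure

Data Instances

The dataset has 4 subsets: type1_pro, type1_anti, type2_pro and type2_anti.

The *_pro subsets contain sentences that reinforce gender stereotypes (e.g. mechanics are male, nurses are female), whereas the *_anti subsets contain "anti-stereotypical" sentences (e.g. mechanics are female, nurses are male).

The type1 (WB-Knowledge) subsets contain sentences that require world knowledge to resolve the coreference, whereas the type2 (WB-Syntax) subsets can be resolved using only the syntactic information present in the sentence.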
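The naming scheme above can be sketched in Python as follows. This is a minimal illustration, not part of the dataset's own tooling: `SubsetInfo`, `describe_subset` and `load_subset` are hypothetical helper names, and the default split name `"validation"` is an assumption about how the dev split is exposed. The loader uses `load_dataset` from the Hugging Face `datasets` library with the `wino_bias` identifier listed in this card, and it needs network access.

```python
from typing import NamedTuple

# The four subset (configuration) names listed in this card.
SUBSETS = ("type1_pro", "type1_anti", "type2_pro", "type2_anti")


class SubsetInfo(NamedTuple):
    """Hypothetical container describing one WinoBias subset."""
    name: str
    category: str    # "WB-Knowledge" (type1) or "WB-Syntax" (type2)
    stereotype: str  # "pro" or "anti"


def describe_subset(name: str) -> SubsetInfo:
    """Split a subset name into its sentence category and stereotype direction."""
    if name not in SUBSETS:
        raise ValueError(f"unknown WinoBias subset: {name!r}")
    kind, stereotype = name.split("_")
    category = {"type1": "WB-Knowledge", "type2": "WB-Syntax"}[kind]
    return SubsetInfo(name, category, stereotype)


def load_subset(name: str, split: str = "validation"):
    """Load one subset with the `datasets` library (requires network access).

    The split name is an assumption; adjust it if the hub uses another label.
    """
    describe_subset(name)  # validate the name before downloading anything
    from datasets import load_dataset  # imported lazily so the helpers above work offline
    return load_dataset("wino_bias", name, split=split)
```

For example, `describe_subset("type2_anti")` reports the `WB-Syntax` category with the `anti` direction, which makes it easy to pair each *_pro subset with its *_anti counterpart when comparing a model's behaviour on the two.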

Data Fields

- document_id = This is a variation on the document filename.
- part_number = Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.
- word_num = This is the word index of the word in that sentence.
- tokens = This is the token as segmented/tokenized in the Treebank.
- pos_tags = This is the Penn Treebank style part of speech. When parse information is missing, all parts of speech except the one for which there is some sense or proposition annotation are marked with an XX tag. The verb is marked with just a VERB tag.
- parse_bit = This is the bracketed structure broken before the first open parenthesis in the parse, with the word/part-of-speech leaf replaced by a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column. When the parse information is missing, the first word of a sentence is tagged as "(TOP*", the last word is tagged as "*)", and all intermediate words are tagged with a "*".
- predicate_lemma = The predicate lemma is given for the rows for which we have semantic role information or word sense information. All other rows are marked with a "-".
- predicate_framenet_id = This is the PropBank frameset ID of the predicate in predicate_lemma.
- word_sense = This is the word sense of the word in column tokens.
- speaker = This is the speaker or author name where available.
- ner_tags = These columns identify the spans representing various named entities. For documents which do not have named entity annotation, each line is represented with an "*".
- verbal_predicates = There is one column each of predicate argument structure information for the predicate mentioned in predicate_lemma. If there are no predicates tagged in a sentence, this is a single column with all rows marked with an "*".
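The parse_bit description can be made concrete with a short sketch. The function below follows the rule stated in the field list: replace the asterisk in each row with a "([pos] [word])" leaf, then concatenate the rows. The name `rebuild_parse` and the example sentence are illustrative assumptions, and the sketch assumes the part-of-speech tags are plain strings; if they are stored as integer class ids, they would need to be decoded first.

```python
from typing import Sequence


def rebuild_parse(
    tokens: Sequence[str],
    pos_tags: Sequence[str],
    parse_bits: Sequence[str],
) -> str:
    """Reassemble a bracketed parse from per-token parse_bit fragments.

    Each fragment holds one "*" placeholder that stands for the
    "(pos word)" leaf of the token on that row.
    """
    if not (len(tokens) == len(pos_tags) == len(parse_bits)):
        raise ValueError("tokens, pos_tags and parse_bits must have equal length")
    pieces = []
    for token, pos, bit in zip(tokens, pos_tags, parse_bits):
        if bit.count("*") != 1:
            raise ValueError(f"expected exactly one '*' in parse bit {bit!r}")
        # Substitute the placeholder with the leaf for this token.
        pieces.append(bit.replace("*", f"({pos} {token})"))
    # Concatenating the rows yields the full bracketed parse.
    return "".join(pieces)


# Made-up three-token sentence, used only to illustrate the substitution.
example = rebuild_parse(
    ["The", "developer", "argued"],
    ["DT", "NN", "VBD"],
    ["(TOP(S(NP*", "*)", "(VP*)))"],
)
print(example)  # (TOP(S(NP(DT The)(NN developer))(VP(VBD argued))))
```

The same function also handles the missing-parse convention from the field list: fragments "(TOP*", "*", "*)" with XX and VERB tags produce a flat "(TOP ...)" bracket around the leaves.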

Data Splits

Dev and test splits are available.

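Dataset Creation

Curation Rationale

The WinoBias dataset was introduced in 2018 (see the paper for details). Its original task is coreference resolution, which aims to identify the mentions in a text that refer to the same entity or person.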

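Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

The dataset was created by researchers familiar with the WinoBias project, based on two prototypical templates provided by the authors, in which the entities interact in plausible ways.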

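Annotations

Annotation process

[More Information Needed]

Who are the annotators?

"Researchers familiar with the [WinoBias] project"

Personal and Sensitive Information

[More Information Needed]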

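Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

Recent work has shown that this dataset has limitations, including grammatical issues, incorrect or ambiguous labels, and stereotype conflation.

Other Known Limitations

[More Information Needed]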

Additional Information

Dataset Curators

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez and Kai-Wei Chang

Licensing Information

MIT License

Citation Information

@article{DBLP:journals/corr/abs-1804-06876,
  author        = {Jieyu Zhao and Tianlu Wang and Mark Yatskar and Vicente Ordonez and Kai{-}Wei Chang},
  title         = {Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods},
  journal       = {CoRR},
  volume        = {abs/1804.06876},
  year          = {2018},
  url           = {http://arxiv.org/abs/1804.06876},
  archivePrefix = {arXiv},
  eprint        = {1804.06876},
  timestamp     = {Mon, 13 Aug 2018 16:47:01 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1804-06876.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @akshayb7 for adding this dataset. Updated by @JieyuZhao.