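Dataset: wino_bias
Task: Token Classification
Language: en
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: expert-generated
Annotation creators: expert-generated
Source datasets: original
Preprint: arxiv:1804.06876
License: mit
WinoBias is a Winograd-schema dataset for coreference resolution that focuses on gender bias. The corpus contains Winograd-schema style sentences in which the entities are people referred to by their occupation (e.g. the nurse, the doctor, the carpenter).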
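The underlying task is coreference resolution.
English
The dataset has 4 subsets: type1_pro, type1_anti, type2_pro, and type2_anti.
The *_pro subsets contain sentences that reinforce gender stereotypes (e.g. mechanics are male, nurses are female), whereas the *_anti subsets contain "anti-stereotypical" sentences (e.g. mechanics are female, nurses are male).
The type1 (WB-Knowledge) subsets contain sentences that require world knowledge to resolve the coreference, whereas the type2 (WB-Syntax) subsets can be resolved using only the syntactic information present in the sentence.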
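The four subset names follow directly from the two sentence types and the two stereotype polarities described above. The sketch below enumerates them and shows one way a subset might be loaded. The helper names `subset_names` and `load_subset` are hypothetical, and the loader assumes the Hugging Face `datasets` library is installed, that the dataset is published under the identifier `wino_bias`, and that the development split is called `validation`; none of these is confirmed by this card.

```python
from itertools import product

# Taken from the card: two sentence types and two stereotype polarities.
SENTENCE_TYPES = ("type1", "type2")   # WB-Knowledge, WB-Syntax
POLARITIES = ("pro", "anti")          # stereotypical, anti-stereotypical


def subset_names():
    """Return the four subset names in type-major order."""
    return [f"{t}_{p}" for t, p in product(SENTENCE_TYPES, POLARITIES)]


def load_subset(name, split="validation"):
    """Load one subset (hypothetical helper).

    Assumptions: the `datasets` library is installed, the dataset
    identifier is "wino_bias", and `split` names an existing split.
    """
    if name not in subset_names():
        raise ValueError(f"unknown subset: {name!r}")
    from datasets import load_dataset  # imported lazily; needs network access
    return load_dataset("wino_bias", name, split=split)


if __name__ == "__main__":
    print(subset_names())
```

Comparing a model's score on a *_pro subset with its score on the matching *_anti subset is the natural use of this pairing, since the two differ only in whether the sentences agree with the stereotype.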
- document_id = This is a variation on the document filename.
- part_number = Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.
- word_num = This is the word index of the word in that sentence.
- tokens = This is the token as segmented/tokenized in the Treebank.
- pos_tags = This is the Penn Treebank style part of speech. When parse information is missing, all parts of speech except the one for which there is some sense or proposition annotation are marked with an XX tag. The verb is marked with just a VERB tag.
- parse_bit = This is the bracketed structure broken before the first open parenthesis in the parse, with the word/part-of-speech leaf replaced by a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column. When the parse information is missing, the first word of a sentence is tagged as "(TOP*", the last word is tagged as "*)", and all intermediate words are tagged with a "*".
- predicate_lemma = The predicate lemma is given for the rows that have semantic role information or word sense information. All other rows are marked with a "-".
- predicate_framenet_id = This is the PropBank frameset ID of the predicate in predicate_lemma.
- word_sense = This is the word sense of the word in the tokens column.
- speaker = This is the speaker or author name where available.
- ner_tags = These columns identify the spans representing various named entities. For documents which do not have named entity annotation, each line is represented with an "*".
- verbal_predicates = There is one column of predicate argument structure information for each predicate mentioned in predicate_lemma. If there are no predicates tagged in a sentence, this is a single column with all rows marked with an "*".
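The parse_bit description above amounts to a small reconstruction procedure: replace the asterisk in each row with that row's "([pos] [word])" leaf, then concatenate the rows. The following is a minimal sketch of that procedure. The function name `reconstruct_parse` and the example sentence are illustrative only, and the sketch assumes part-of-speech tags are available as strings (if they are stored as integer class labels, they would need to be mapped back to tag names first).

```python
def reconstruct_parse(tokens, pos_tags, parse_bits):
    """Rebuild a bracketed parse from per-word parse_bit fragments.

    Each fragment holds one '*' placeholder, which is replaced by the
    leaf "(POS word)"; the filled fragments are then concatenated.
    """
    if not (len(tokens) == len(pos_tags) == len(parse_bits)):
        raise ValueError("tokens, pos_tags and parse_bits must align")
    pieces = []
    for word, pos, bit in zip(tokens, pos_tags, parse_bits):
        leaf = f"({pos} {word})"
        # Replace only the first asterisk, i.e. the leaf placeholder.
        pieces.append(bit.replace("*", leaf, 1))
    return "".join(pieces)


if __name__ == "__main__":
    # Made-up example rows, not taken from the dataset.
    tokens = ["The", "nurse", "slept"]
    pos_tags = ["DT", "NN", "VBD"]
    parse_bits = ["(TOP(S(NP*", "*)", "(VP*)))"]
    print(reconstruct_parse(tokens, pos_tags, parse_bits))
    # (TOP(S(NP(DT The)(NN nurse))(VP(VBD slept))))
```

The same function handles the missing-parse case described above, where the fragments are "(TOP*", "*", and "*)": the result is then a flat sequence of leaves under TOP.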
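Development and test sets are available.
The WinoBias dataset was introduced in 2018 (see the paper for details). Its original task is coreference resolution, which aims to identify mentions that refer to the same entity or person.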
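[More Information Needed]
Who are the source language producers?
The dataset was created by researchers familiar with the WinoBias project, based on two prototypical templates provided by the authors, in which entities interact in plausible ways.
[More Information Needed]
Who are the annotators?
"Researchers familiar with the [WinoBias] project"
[More Information Needed]
[More Information Needed]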
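Recent work has shown that this dataset has limitations, including grammatical issues, incorrect or ambiguous labels, and stereotype conflation.
[More Information Needed]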
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang
MIT License
@article{DBLP:journals/corr/abs-1804-06876,
  author        = {Jieyu Zhao and Tianlu Wang and Mark Yatskar and Vicente Ordonez and Kai{-}Wei Chang},
  title         = {Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods},
  journal       = {CoRR},
  volume        = {abs/1804.06876},
  year          = {2018},
  url           = {http://arxiv.org/abs/1804.06876},
  archivePrefix = {arXiv},
  eprint        = {1804.06876},
  timestamp     = {Mon, 13 Aug 2018 16:47:01 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1804-06876.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
Thanks to @akshayb7 for adding this dataset. Updated by @JieyuZhao.