Dataset:
multi_nli
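The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.
The dataset contains English samples only.
An example of a data instance: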
{ "promptID": 31193, "pairID": "31193n", "premise": "Conceptually cream skimming has two basic dimensions - product and geography.", "premise_binary_parse": "( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) ( ( product and ) geography ) ) ) . ) )", "premise_parse": "(ROOT (S (NP (JJ Conceptually) (NN cream) (NN skimming)) (VP (VBZ has) (NP (NP (CD two) (JJ basic) (NNS dimensions)) (: -) (NP (NN product) (CC and) (NN geography)))) (. .)))", "hypothesis": "Product and geography are what make cream skimming work. ", "hypothesis_binary_parse": "( ( ( Product and ) geography ) ( ( are ( what ( make ( cream ( skimming work ) ) ) ) ) . ) )", "hypothesis_parse": "(ROOT (S (NP (NN Product) (CC and) (NN geography)) (VP (VBP are) (SBAR (WHNP (WP what)) (S (VP (VBP make) (NP (NP (NN cream)) (VP (VBG skimming) (NP (NN work)))))))) (. .)))", "genre": "government", "label": 1 }
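The fields of an instance like the one above can be inspected with a few lines of Python. The following is a minimal sketch: it recovers the tokens from a `*_binary_parse` field by dropping the bracket symbols, and maps the integer `label` to a name. The label order `entailment, neutral, contradiction` is an assumption not stated in this card, and `tokens_from_binary_parse` is a hypothetical helper introduced only for illustration.

```python
from typing import List

# ASSUMPTION: integer labels follow this order; the card itself does not state it.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def tokens_from_binary_parse(parse: str) -> List[str]:
    """Return the leaf tokens of a binary parse by dropping the bracket symbols."""
    return [tok for tok in parse.split() if tok not in {"(", ")"}]


# Abbreviated copy of the instance shown above (constituency parses omitted).
record = {
    "promptID": 31193,
    "pairID": "31193n",
    "premise": "Conceptually cream skimming has two basic dimensions - product and geography.",
    "premise_binary_parse": "( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) ( ( product and ) geography ) ) ) . ) )",
    "hypothesis": "Product and geography are what make cream skimming work. ",
    "hypothesis_binary_parse": "( ( ( Product and ) geography ) ( ( are ( what ( make ( cream ( skimming work ) ) ) ) ) . ) )",
    "genre": "government",
    "label": 1,
}

premise_tokens = tokens_from_binary_parse(record["premise_binary_parse"])
hypothesis_tokens = tokens_from_binary_parse(record["hypothesis_binary_parse"])
label_name = LABEL_NAMES[record["label"]]

print(premise_tokens)
print(hypothesis_tokens)
print(record["genre"], label_name)
```

Splitting on whitespace works here because the binary parse separates every bracket and token with a space; the full constituency parses (`premise_parse`, `hypothesis_parse`) would need a proper tree reader instead.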
The data fields are the same across all splits.
train | validation_matched | validation_mismatched |
---|---|---|
392702 | 9815 | 9832 |
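As a sanity check after downloading, the split sizes in the table can be compared against what was actually loaded. This is a rough sketch, assuming the corpus is fetched with the Hugging Face `datasets` library under the identifier `multi_nli` named at the top of this card; `load_multinli` and `size_mismatches` are hypothetical helper names, and the loader is only defined here, not called, because it needs network access.

```python
# Split sizes as listed in the table above.
EXPECTED_SIZES = {
    "train": 392702,
    "validation_matched": 9815,
    "validation_mismatched": 9832,
}


def load_multinli():
    """Load the corpus with the `datasets` library (requires network access).

    ASSUMPTION: the dataset is published under the identifier "multi_nli".
    """
    from datasets import load_dataset  # lazy import so the check below runs without it

    return load_dataset("multi_nli")


def size_mismatches(splits):
    """Return {split: (expected, actual)} for each split whose size differs from the table.

    `splits` is any mapping from split name to a sized object; a missing split
    is reported with an actual size of None.
    """
    mismatches = {}
    for name, expected in EXPECTED_SIZES.items():
        actual = len(splits[name]) if name in splits else None
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

Calling `size_mismatches(load_multinli())` should return an empty dictionary if the downloaded copy matches the table.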
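They constructed MultiNLI to make it possible to explicitly evaluate models both on the quality of their sentence representations within the training domain and on their ability to derive reasonable representations in unfamiliar domains.
They created each sentence pair by selecting a premise sentence from a preexisting text source and asking a human annotator to compose a new hypothesis sentence to pair with it.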
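In practice, this in-domain versus out-of-domain comparison is usually made by scoring a model separately on `validation_matched` (genres seen in training) and `validation_mismatched` (unseen genres). The sketch below illustrates the arithmetic only; the prediction and label lists are made-up toy values, not results from any model, and `accuracy` and `domain_gap` are hypothetical helper names.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that equal the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty split")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def domain_gap(matched_acc: float, mismatched_acc: float) -> float:
    """Accuracy lost when moving from seen genres to unseen genres."""
    return matched_acc - mismatched_acc


# Toy values for illustration only (hypothetical, not real model output).
matched_labels, matched_preds = [0, 1, 2, 1], [0, 1, 2, 2]
mismatched_labels, mismatched_preds = [2, 0, 1, 1], [2, 1, 1, 0]

matched_acc = accuracy(matched_preds, matched_labels)
mismatched_acc = accuracy(mismatched_preds, mismatched_labels)
print(matched_acc, mismatched_acc, domain_gap(matched_acc, mismatched_acc))
```

A small gap suggests the learned representations transfer to unfamiliar genres; a large one suggests the model relies on genre-specific cues.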
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
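The majority of the corpus is released under the OANC's license, which allows all content to be freely used, modified, and shared under permissive terms. The data in the fiction section falls under several permissive licenses: Seven Swords is available under a Creative Commons Share-Alike 3.0 Unported License and, with the explicit permission of the author, Living History and Password Incorrect are available under Creative Commons Attribution 3.0 Unported Licenses. The remaining works of fiction are in the public domain in the United States (but may be licensed differently elsewhere).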
@InProceedings{N18-1101,
  author = "Williams, Adina and Nangia, Nikita and Bowman, Samuel",
  title = "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference",
  booktitle = "Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  pages = "1112--1122",
  location = "New Orleans, Louisiana",
  url = "http://aclweb.org/anthology/N18-1101"
}
Thanks to @bhavitvyamalik, @patrickvonplaten, @thomwolf, and @mariamabarham for adding this dataset.