Dataset:

GroNLP/ik-nlp-22_transqe

Tasks:

Text Classification

Sub-tasks:

natural-language-inference

Languages:

Multilinguality:

translation

Size:

size_categories:unknown

Language creators:

expert-generated machine-generated

Annotations creators:

expert-generated

Source datasets:

extended|esnli

Other:

quality-estimation

License:

apache-2.0

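Dataset Card for IK-NLP-22 Project 3: Translation Quality-driven Data Selection for Natural Language Inference

Dataset Summary

This dataset contains the full e-SNLI dataset, automatically translated to Dutch using the Helsinki-NLP/opus-mt-en-nl neural machine translation model. The translation of each field is annotated with two quality estimation scores, produced by reference-free versions of Unbabel's COMET metric.

The intended use of this corpus is restricted to the final project of the 2022 edition of the Natural Language Processing course in the Information Science Master's Degree (IK) at the University of Groningen, taught by Arianna Bisazza and Gabriele Sarti with the assistance of Anjali Nair.

The e-SNLI corpus was made freely available by its authors on GitHub. The present dataset was created for educational purposes and is based on the original e-SNLI dataset by Camburu et al. All rights to its contents belong to the original authors.

Languages

The language data in this corpus is in English (BCP-47 en) and Dutch (BCP-47 nl).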

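Dataset Structure

Data Instances

By default, the dataset contains a single configuration named plain_text, with the three original splits train, validation and test. Each split contains the following fields: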

Field	Description
premise_en	The original English premise.
premise_nl	The premise automatically translated to Dutch.
hypothesis_en	The original English hypothesis.
hypothesis_nl	The hypothesis automatically translated to Dutch.
label	The label of the data instance (0 for entailment, 1 for neutral, 2 for contradiction).
explanation_1_en	The first explanation for the assigned label in English.
explanation_1_nl	The first explanation automatically translated to Dutch.
explanation_2_en	The second explanation for the assigned label in English.
explanation_2_nl	The second explanation automatically translated to Dutch.
explanation_3_en	The third explanation for the assigned label in English.
explanation_3_nl	The third explanation automatically translated to Dutch.
da_premise	The quality estimation produced by the wmt20-comet-qe-da model for the premise translation.
da_hypothesis	The quality estimation produced by the wmt20-comet-qe-da model for the hypothesis translation.
da_explanation_1	The quality estimation produced by the wmt20-comet-qe-da model for the first explanation translation.
da_explanation_2	The quality estimation produced by the wmt20-comet-qe-da model for the second explanation translation.
da_explanation_3	The quality estimation produced by the wmt20-comet-qe-da model for the third explanation translation.
mqm_premise	The quality estimation produced by the wmt21-comet-qe-mqm model for the premise translation.
mqm_hypothesis	The quality estimation produced by the wmt21-comet-qe-mqm model for the hypothesis translation.
mqm_explanation_1	The quality estimation produced by the wmt21-comet-qe-mqm model for the first explanation translation.
mqm_explanation_2	The quality estimation produced by the wmt21-comet-qe-mqm model for the second explanation translation.
mqm_explanation_3	The quality estimation produced by the wmt21-comet-qe-mqm model for the third explanation translation.

Explanations 2 and 3 and their associated quality estimation scores are present only in the validation and test splits.
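As a rough illustration of the schema described above, the sketch below rebuilds the documented column names per split and maps label ids to names. The helper names `fields_for_split` and `load_split` are hypothetical. The card does not say whether the train split omits the extra explanation columns or leaves them empty, so the split-dependent list reflects only what the card documents. `load_split` assumes the `datasets` package is installed and that the installed version can still load this repository.

```python
# Label ids as documented in the field table.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def fields_for_split(split):
    """Return the column names the card documents for a given split."""
    # Per the card, only validation and test carry explanations 2 and 3.
    n_explanations = 1 if split == "train" else 3
    fields = ["premise_en", "premise_nl", "hypothesis_en", "hypothesis_nl", "label"]
    for i in range(1, n_explanations + 1):
        fields += [f"explanation_{i}_en", f"explanation_{i}_nl"]
    for metric in ("da", "mqm"):
        fields += [f"{metric}_premise", f"{metric}_hypothesis"]
        fields += [f"{metric}_explanation_{i}" for i in range(1, n_explanations + 1)]
    return fields


def load_split(split="validation"):
    """Download one split from the Hugging Face Hub (needs network access)."""
    # Assumption: `datasets` is installed; newer releases may need
    # extra arguments to load this repository.
    from datasets import load_dataset

    return load_dataset("GroNLP/ik-nlp-22_transqe", "plain_text", split=split)


print(len(fields_for_split("train")), len(fields_for_split("test")))  # 13 21
```

Comparing `fields_for_split(split)` against the `column_names` of a loaded split is one way to check which of the two behaviours the train split actually has.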

Data Splits

config	train	validation	test
plain_text	549'367	9'842	9'824

For your analyses, use the amount of data that is most reasonable for your computational setup. The more, the better.
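One simple way to fit the data to a compute budget is a reproducible random subsample. The sketch below is only an illustration: `subsample_indices` and the 50,000-row budget are hypothetical choices, not something the card prescribes.

```python
import random


def subsample_indices(n_rows, max_rows, seed=42):
    """Pick at most `max_rows` distinct row indices, reproducibly."""
    if max_rows >= n_rows:
        return list(range(n_rows))
    rng = random.Random(seed)  # fixed seed so the subset can be regenerated
    return sorted(rng.sample(range(n_rows), max_rows))


TRAIN_ROWS = 549_367  # size of the train split listed in the splits table
indices = subsample_indices(TRAIN_ROWS, max_rows=50_000)
print(len(indices))  # 50000

# With a loaded `datasets.Dataset` bound to a (hypothetical) variable `train`:
# subset = train.select(indices)
```

Keeping the seed fixed makes results comparable across runs. A larger `max_rows` can be swapped in when more compute is available.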

Data Example

The following is entry 2000 from the test split:

{
  "premise_en": "A young woman wearing a yellow sweater and black pants is ice skating outdoors.",
  "premise_nl": "Een jonge vrouw met een gele trui en zwarte broek schaatst buiten.",
  "hypothesis_en": "a woman is practicing for the olympics",
  "hypothesis_nl": "een vrouw oefent voor de Olympische Spelen",
  "label": 1,
  "explanation_1_en": "You can not infer it's for the Olympics.",
  "explanation_1_nl": "Het is niet voor de Olympische Spelen.",
  "explanation_2_en": "Just because a girl is skating outdoors does not  mean she is practicing for the Olympics.",
  "explanation_2_nl": "Alleen omdat een meisje buiten schaatst betekent niet dat ze oefent voor de Olympische Spelen.",
  "explanation_3_en": "Ice skating doesn't imply practicing for the olympics.",
  "explanation_3_nl": "Schaatsen betekent niet oefenen voor de Olympische Spelen.",
  "da_premise": "0.6099",
  "mqm_premise": "0.1298",
  "da_hypothesis": "0.8504",
  "mqm_hypothesis": "0.1521",
  "da_explanation_1": "0.0001",
  "mqm_explanation_1": "0.1237",
  "da_explanation_2": "0.4017",
  "mqm_explanation_2": "0.1467",
  "da_explanation_3": "0.6069",
  "mqm_explanation_3": "0.1389"
}
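The scores in this example are stored as strings, so they need converting before any comparison. The sketch below parses them and reports the lowest-scoring segment for one metric. The helper names are hypothetical, and only the score fields of the example are copied. DA and MQM scores come from different models, so the sketch compares scores only within a single metric.

```python
SEGMENTS = ("premise", "hypothesis", "explanation_1", "explanation_2", "explanation_3")


def qe_scores(row, metric="da"):
    """Parse the string scores of one metric into floats, skipping absent ones."""
    scores = {}
    for segment in SEGMENTS:
        raw = row.get(f"{metric}_{segment}")
        if raw in (None, ""):  # e.g. explanations 2 and 3 outside validation/test
            continue
        scores[segment] = float(raw)
    return scores


def weakest_segment(row, metric="da"):
    """Return (segment, score) for the lowest-scoring translation in the row."""
    scores = qe_scores(row, metric)
    segment = min(scores, key=scores.get)
    return segment, scores[segment]


# Score fields copied from the example entry above.
example = {
    "da_premise": "0.6099", "mqm_premise": "0.1298",
    "da_hypothesis": "0.8504", "mqm_hypothesis": "0.1521",
    "da_explanation_1": "0.0001", "mqm_explanation_1": "0.1237",
    "da_explanation_2": "0.4017", "mqm_explanation_2": "0.1467",
    "da_explanation_3": "0.6069", "mqm_explanation_3": "0.1389",
}

print(weakest_segment(example, "da"))   # ('explanation_1', 0.0001)
print(weakest_segment(example, "mqm"))  # ('explanation_1', 0.1237)
```

Both metrics rank the first explanation lowest here. That matches its Dutch translation, which drops the "You can not infer" part of the English source.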

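Dataset Creation

The dataset was created through the following steps:

Translating every field of the original e-SNLI corpus to Dutch using the Helsinki-NLP/opus-mt-en-nl neural machine translation model.
Annotating the translations with quality estimation scores from two reference-free COMET models by Unbabel (wmt20-comet-qe-da and wmt21-comet-qe-mqm).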
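The two steps above could be approximated along the following lines. This is a sketch under assumptions, not the curators' actual script: the function names are hypothetical, batch size and device settings are arbitrary, and it assumes the `transformers` and `unbabel-comet` packages are installed. The exact COMET model identifier and the return type of `predict` vary between COMET releases, so the raw output is returned unchanged.

```python
def build_qe_inputs(sources, translations):
    """Pair each English source with its Dutch translation in the
    {"src", "mt"} format used by reference-free COMET models."""
    if len(sources) != len(translations):
        raise ValueError("sources and translations must have the same length")
    return [{"src": s, "mt": t} for s, t in zip(sources, translations)]


def translate_en_nl(texts):
    """Step 1: translate English texts to Dutch (downloads the model)."""
    from transformers import pipeline

    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-nl")
    return [out["translation_text"] for out in translator(texts)]


def score_translations(sources, translations, model_name="wmt20-comet-qe-da"):
    """Step 2: score translations without references (downloads the model)."""
    # Assumption: `model_name` is accepted by the installed COMET version;
    # newer releases may expect a different identifier.
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model(model_name))
    inputs = build_qe_inputs(sources, translations)
    # Returned as-is because its structure depends on the COMET release.
    return model.predict(inputs, batch_size=8, gpus=0)


pairs = build_qe_inputs(
    ["a woman is practicing for the olympics"],
    ["een vrouw oefent voor de Olympische Spelen"],
)
print(pairs[0]["mt"])  # een vrouw oefent voor de Olympische Spelen
```

The heavy imports sit inside the functions so that the pairing helper can be used and tested without downloading either model.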

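Additional Information

Dataset Curators

For problems with this version of the dataset, please contact us at ik-nlp-course@rug.nl.

Licensing Information

The dataset is licensed under the Apache 2.0 License.

Citation Information

If you use this corpus in your work, please cite the authors: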

@incollection{NIPS2018_8163,
    title = {e-SNLI: Natural Language Inference with Natural Language Explanations},
    author = {Camburu, Oana-Maria and Rockt\"{a}schel, Tim and Lukasiewicz, Thomas and Blunsom, Phil},
    booktitle = {Advances in Neural Information Processing Systems 31},
    editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
    pages = {9539--9549},
    year = {2018},
    publisher = {Curran Associates, Inc.},
    url = {http://papers.nips.cc/paper/8163-e-snli-natural-language-inference-with-natural-language-explanations.pdf}
}

Author:

GroNLP

Dataset size:

50.95 MB