Dataset: GroNLP/ik-nlp-22_transqe

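Dataset Card for IK-NLP-22 Project 3: Translation Quality-driven Data Selection for Natural Language Inference

Dataset Summary

This dataset contains the full e-SNLI dataset, automatically translated to Dutch using the Helsinki-NLP/opus-mt-en-nl neural machine translation model. The translation of each field has been annotated with two quality estimation scores, produced by the reference-free versions of Unbabel's COMET metric.

The intended use of this corpus is restricted to the final project of the 2022 edition of the Natural Language Processing course in the Information Science Master's Degree (IK) at the University of Groningen, taught by Arianna Bisazza and Gabriele Sarti with the assistance of Anjali Nair.

The e-SNLI corpus is made freely available by its authors on GitHub. The present dataset was created for educational purposes and is based on the original e-SNLI dataset by Camburu et al. All rights to the contents of this dataset belong to the original authors.

Languages

The language data in this corpus is in English (BCP-47 en) and Dutch (BCP-47 nl).

Dataset Structure

Data Instances

By default, the dataset contains a single configuration named plain_text, with the three original splits train, validation and test. Every split contains the following fields: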

Field              Description
premise_en         The original English premise.
premise_nl         The premise automatically translated to Dutch.
hypothesis_en      The original English hypothesis.
hypothesis_nl      The hypothesis automatically translated to Dutch.
label              The label of the data instance (0 for entailment, 1 for neutral, 2 for contradiction).
explanation_1_en   The first explanation for the assigned label in English.
explanation_1_nl   The first explanation automatically translated to Dutch.
explanation_2_en   The second explanation for the assigned label in English.
explanation_2_nl   The second explanation automatically translated to Dutch.
explanation_3_en   The third explanation for the assigned label in English.
explanation_3_nl   The third explanation automatically translated to Dutch.
da_premise         The quality estimation produced by the wmt20-comet-qe-da model for the premise translation.
da_hypothesis      The quality estimation produced by the wmt20-comet-qe-da model for the hypothesis translation.
da_explanation_1   The quality estimation produced by the wmt20-comet-qe-da model for the first explanation translation.
da_explanation_2   The quality estimation produced by the wmt20-comet-qe-da model for the second explanation translation.
da_explanation_3   The quality estimation produced by the wmt20-comet-qe-da model for the third explanation translation.
mqm_premise        The quality estimation produced by the wmt21-comet-qe-mqm model for the premise translation.
mqm_hypothesis     The quality estimation produced by the wmt21-comet-qe-mqm model for the hypothesis translation.
mqm_explanation_1  The quality estimation produced by the wmt21-comet-qe-mqm model for the first explanation translation.
mqm_explanation_2  The quality estimation produced by the wmt21-comet-qe-mqm model for the second explanation translation.
mqm_explanation_3  The quality estimation produced by the wmt21-comet-qe-mqm model for the third explanation translation.

Explanations 2 and 3 and their quality estimation scores are only present in the validation and test splits.
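The field layout can be restated as a small helper, which may be handy for checking that a loaded split has the columns you expect. This is a minimal sketch, not part of any dataset tooling: the names `LABEL_NAMES`, `TEXT_FIELDS`, `MISSING_IN_TRAIN` and `expected_fields` are hypothetical, and the lists only repeat what the field table and the note on explanations 2 and 3 say.

```python
from typing import List

# Hypothetical names; the values restate the field table of this card.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}
TEXT_FIELDS = ["premise", "hypothesis", "explanation_1", "explanation_2", "explanation_3"]
# Explanations 2 and 3 (and their scores) exist only in validation and test.
MISSING_IN_TRAIN = {"explanation_2", "explanation_3"}


def expected_fields(split: str) -> List[str]:
    """Return the column names expected in the given split."""
    if split not in {"train", "validation", "test"}:
        raise ValueError(f"unknown split: {split!r}")
    names = ["label"]
    for field in TEXT_FIELDS:
        if split == "train" and field in MISSING_IN_TRAIN:
            continue
        # Each text field has an English source, a Dutch translation,
        # and one score from each of the two QE models.
        names += [f"{field}_en", f"{field}_nl", f"da_{field}", f"mqm_{field}"]
    return names
```

Comparing `set(expected_fields("train"))` against the column names of a loaded split is one quick sanity check before starting an analysis.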

Data Splits

config      train    validation  test
plain_text  549,367  9,842       9,824

For your analyses, use the amount of data that is most reasonable for your computational setup. The more, the better.
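If the full training split does not fit your compute budget, one option in the spirit of this project is to keep the translations with the highest quality estimation scores. The following is a minimal sketch under stated assumptions: rows are plain dictionaries using the field names from the table, `select_top_k` is a hypothetical helper, the toy rows are made-up placeholders rather than dataset entries, and ranking by a single score field is only one of many possible selection criteria.

```python
from typing import Dict, List


def select_top_k(rows: List[Dict], score_field: str, k: int) -> List[Dict]:
    """Keep the k rows with the highest value in score_field."""
    # float() accepts scores stored either as numbers or as strings.
    ranked = sorted(rows, key=lambda row: float(row[score_field]), reverse=True)
    return ranked[:k]


# Made-up placeholder rows, only to show the call.
toy_rows = [
    {"premise_nl": "a", "da_premise": 0.61},
    {"premise_nl": "b", "da_premise": 0.12},
    {"premise_nl": "c", "da_premise": 0.85},
]
subset = select_top_k(toy_rows, "da_premise", k=2)
```

Sorting the whole list is simple but loads every row into memory; for the full training split a streaming approach or a score threshold may be more practical.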

Data Example

The following is entry 2000 of the test split:

{
  "premise_en": "A young woman wearing a yellow sweater and black pants is ice skating outdoors.",
  "premise_nl": "Een jonge vrouw met een gele trui en zwarte broek schaatst buiten.",
  "hypothesis_en": "a woman is practicing for the olympics",
  "hypothesis_nl": "een vrouw oefent voor de Olympische Spelen",
  "label": 1,
  "explanation_1_en": "You can not infer it's for the Olympics.",
  "explanation_1_nl": "Het is niet voor de Olympische Spelen.",
  "explanation_2_en": "Just because a girl is skating outdoors does not  mean she is practicing for the Olympics.",
  "explanation_2_nl": "Alleen omdat een meisje buiten schaatst betekent niet dat ze oefent voor de Olympische Spelen.",
  "explanation_3_en": "Ice skating doesn't imply practicing for the olympics.",
  "explanation_3_nl": "Schaatsen betekent niet oefenen voor de Olympische Spelen.",
  "da_premise": "0.6099",
  "mqm_premise": "0.1298",
  "da_hypothesis": "0.8504",
  "mqm_hypothesis": "0.1521",
  "da_explanation_1": "0.0001",
  "mqm_explanation_1": "0.1237",
  "da_explanation_2": "0.4017",
  "mqm_explanation_2": "0.1467",
  "da_explanation_3": "0.6069",
  "mqm_explanation_3": "0.1389"
}
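The scores in this example are shown as strings, so it is safer to convert them to floats before comparing them. The sketch below copies the five DA scores from the entry above and finds the field with the lowest one; the variable names are illustrative, and whether scores arrive as strings or numbers in your environment is an assumption worth checking.

```python
# DA scores copied from the example entry above, kept as strings as shown there.
example_da = {
    "premise": "0.6099",
    "hypothesis": "0.8504",
    "explanation_1": "0.0001",
    "explanation_2": "0.4017",
    "explanation_3": "0.6069",
}

# Convert first: comparing the raw strings would be lexicographic, not numeric.
da_scores = {field: float(score) for field, score in example_da.items()}
worst_field = min(da_scores, key=da_scores.get)
print(worst_field)  # explanation_1
```

The lowest score belongs to the first explanation, whose Dutch translation ("Het is niet voor de Olympische Spelen.") appears to drop the "You can not infer" part of the English source.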

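Dataset Creation

The dataset was created through the following steps:

  • Translating every field of the original e-SNLI corpus to Dutch using the Helsinki-NLP/opus-mt-en-nl neural machine translation model.

  • Annotating the quality estimation of the translations with two reference-free versions of Unbabel's COMET metric.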
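The two steps above can be sketched as a single function over one row. This is an illustration only, not the script used to build the dataset: `annotate` is a hypothetical name, and `translate`, `score_da` and `score_mqm` are placeholder callables standing in for the translation model and the two COMET quality estimation models. No translation or COMET library is called here, and only three of the text fields are listed to keep the sketch short.

```python
from typing import Callable, Dict

# Subset of the text fields, for brevity (assumption of this sketch).
TEXT_FIELDS = ["premise", "hypothesis", "explanation_1"]


def annotate(
    row: Dict[str, str],
    translate: Callable[[str], str],
    score_da: Callable[[str, str], float],
    score_mqm: Callable[[str, str], float],
) -> Dict[str, object]:
    """Translate each English field and attach two quality estimation scores."""
    out: Dict[str, object] = dict(row)  # keep the original English fields
    for field in TEXT_FIELDS:
        source = row[f"{field}_en"]
        # Step 1: English -> Dutch translation.
        translation = translate(source)
        out[f"{field}_nl"] = translation
        # Step 2: reference-free scoring uses only the source and the translation.
        out[f"da_{field}"] = score_da(source, translation)
        out[f"mqm_{field}"] = score_mqm(source, translation)
    return out
```

Passing the models in as callables keeps the sketch independent of any particular library, so it can be exercised with simple stand-in functions.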

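Additional Information

Dataset Curators

For problems with this version of the dataset, please contact us at ik-nlp-course@rug.nl.

Licensing Information

The dataset is licensed under the Apache 2.0 License.

Citation Information

Please cite the authors if you use these corpora in your work: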

@incollection{NIPS2018_8163,
    title = {e-SNLI: Natural Language Inference with Natural Language Explanations},
    author = {Camburu, Oana-Maria and Rockt\"{a}schel, Tim and Lukasiewicz, Thomas and Blunsom, Phil},
    booktitle = {Advances in Neural Information Processing Systems 31},
    editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
    pages = {9539--9549},
    year = {2018},
    publisher = {Curran Associates, Inc.},
    url = {http://papers.nips.cc/paper/8163-e-snli-natural-language-inference-with-natural-language-explanations.pdf}
}