Dataset:
GroNLP/ik-nlp-22_transqe
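This dataset contains the full e-SNLI dataset, automatically translated to Dutch using the Helsinki-NLP/opus-mt-en-nl neural machine translation model. The translation of each field is annotated with two quality estimation scores, produced by Unbabel's reference-free versions of the COMET metric.
The intended use of this corpus is restricted to final projects for the 2022 edition of the Natural Language Processing course in the Information Science Master's Degree (IK) at the University of Groningen, taught by Arianna Bisazza and Gabriele Sarti with the assistance of Anjali Nair.
The e-SNLI corpus was made freely available by its authors on GitHub. This dataset was created for educational purposes and is based on the original e-SNLI dataset by Camburu et al. All rights to the contents of this dataset belong to the original authors.
The language data in this corpus is in English (BCP-47 en) and Dutch (BCP-47 nl).
By default, the dataset contains a single configuration named plain_text, with the three original splits train, validation and test. Each split contains the following fields: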
Field | Description |
---|---|
premise_en | The original English premise. |
premise_nl | The premise automatically translated to Dutch. |
hypothesis_en | The original English hypothesis. |
hypothesis_nl | The hypothesis automatically translated to Dutch. |
label | The label of the data instance (0 for entailment, 1 for neutral, 2 for contradiction). |
explanation_1_en | The first explanation for the assigned label in English. |
explanation_1_nl | The first explanation automatically translated to Dutch. |
explanation_2_en | The second explanation for the assigned label in English. |
explanation_2_nl | The second explanation automatically translated to Dutch. |
explanation_3_en | The third explanation for the assigned label in English. |
explanation_3_nl | The third explanation automatically translated to Dutch. |
da_premise | The quality estimation produced by the wmt20-comet-qe-da model for the premise translation. |
da_hypothesis | The quality estimation produced by the wmt20-comet-qe-da model for the hypothesis translation. |
da_explanation_1 | The quality estimation produced by the wmt20-comet-qe-da model for the first explanation translation. |
da_explanation_2 | The quality estimation produced by the wmt20-comet-qe-da model for the second explanation translation. |
da_explanation_3 | The quality estimation produced by the wmt20-comet-qe-da model for the third explanation translation. |
mqm_premise | The quality estimation produced by the wmt21-comet-qe-mqm model for the premise translation. |
mqm_hypothesis | The quality estimation produced by the wmt21-comet-qe-mqm model for the hypothesis translation. |
mqm_explanation_1 | The quality estimation produced by the wmt21-comet-qe-mqm model for the first explanation translation. |
mqm_explanation_2 | The quality estimation produced by the wmt21-comet-qe-mqm model for the second explanation translation. |
mqm_explanation_3 | The quality estimation produced by the wmt21-comet-qe-mqm model for the third explanation translation. |
Explanations 2 and 3 and their related quality estimation scores are only present in the validation and test splits.
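The field layout above is regular enough to be generated programmatically, which can help when validating loaded records. The following is a minimal sketch, not part of the dataset tooling: `LABEL_NAMES` mirrors the label description in the table, and `expected_fields` is a hypothetical helper that assumes the fields for explanations 2 and 3 are simply not populated in train.

```python
from typing import Dict, List

# Label ids as documented in the field table.
LABEL_NAMES: Dict[int, str] = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Text units that come as an _en/_nl pair with da_ and mqm_ scores.
BASE_UNITS: List[str] = ["premise", "hypothesis", "explanation_1"]
# Units documented as present only in the validation and test splits.
EXTRA_UNITS: List[str] = ["explanation_2", "explanation_3"]


def expected_fields(split: str) -> List[str]:
    """Return the populated field names for a split (hypothetical helper)."""
    units = list(BASE_UNITS)
    if split in ("validation", "test"):
        units += EXTRA_UNITS
    fields = ["label"]
    for unit in units:
        fields += [f"{unit}_en", f"{unit}_nl", f"da_{unit}", f"mqm_{unit}"]
    return fields


if __name__ == "__main__":
    print(len(expected_fields("train")), len(expected_fields("test")))
```

For validation and test this yields the 21 fields listed in the table; for train it yields the 13 fields that do not involve explanations 2 and 3.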
config | train | validation | test |
---|---|---|---|
plain_text | 549,367 | 9,842 | 9,824 |
For your analyses, use the amount of data that is most reasonable for your computational setup. The more, the better.
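One way to cap the amount of data is to request only a slice of a split. The sketch below builds a split string in the slicing syntax accepted by the Hugging Face `datasets` library (for example `train[:50000]`). `split_spec` and `SPLIT_SIZES` are hypothetical names introduced here for illustration, and the sizes are copied from the table above.

```python
from typing import Optional

# Split sizes as listed in the table above.
SPLIT_SIZES = {"train": 549_367, "validation": 9_842, "test": 9_824}


def split_spec(split: str, max_examples: Optional[int] = None) -> str:
    """Build a split string such as 'train[:50000]' (hypothetical helper)."""
    if split not in SPLIT_SIZES:
        raise ValueError(f"unknown split: {split!r}")
    if max_examples is None:
        return split  # no cap requested: use the full split
    if max_examples <= 0:
        raise ValueError("max_examples must be positive")
    if max_examples >= SPLIT_SIZES[split]:
        return split  # the cap exceeds the split size: use the full split
    return f"{split}[:{max_examples}]"


# Usage, assuming the `datasets` library is installed and the Hub is reachable:
#   from datasets import load_dataset
#   train = load_dataset("GroNLP/ik-nlp-22_transqe", split=split_spec("train", 50_000))
```

Taking the first N rows is the simplest choice; if the order of the split matters for your analysis, shuffling after loading may be preferable.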
The following is an example of entry 2000 taken from the test split:
{
"premise_en": "A young woman wearing a yellow sweater and black pants is ice skating outdoors.",
"premise_nl": "Een jonge vrouw met een gele trui en zwarte broek schaatst buiten.",
"hypothesis_en": "a woman is practicing for the olympics",
"hypothesis_nl": "een vrouw oefent voor de Olympische Spelen",
"label": 1,
"explanation_1_en": "You can not infer it's for the Olympics.",
"explanation_1_nl": "Het is niet voor de Olympische Spelen.",
"explanation_2_en": "Just because a girl is skating outdoors does not mean she is practicing for the Olympics.",
"explanation_2_nl": "Alleen omdat een meisje buiten schaatst betekent niet dat ze oefent voor de Olympische Spelen.",
"explanation_3_en": "Ice skating doesn't imply practicing for the olympics.",
"explanation_3_nl": "Schaatsen betekent niet oefenen voor de Olympische Spelen.",
"da_premise": "0.6099",
"mqm_premise": "0.1298",
"da_hypothesis": "0.8504",
"mqm_hypothesis": "0.1521",
"da_explanation_1": "0.0001",
"mqm_explanation_1": "0.1237",
"da_explanation_2": "0.4017",
"mqm_explanation_2": "0.1467",
"da_explanation_3": "0.6069",
"mqm_explanation_3": "0.1389"
}
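Note that the scores in the example are stored as strings, so they need converting before any numeric comparison. The sketch below parses them and applies a simple filter. The helper names and the threshold of 0.1 are assumptions made for illustration only; the dataset does not prescribe a cutoff, and the two metrics are not on the same scale.

```python
from typing import Dict, Mapping

UNITS = ["premise", "hypothesis", "explanation_1", "explanation_2", "explanation_3"]


def qe_scores(record: Mapping[str, object], metric: str = "da") -> Dict[str, float]:
    """Collect the da_* or mqm_* scores of a record as floats."""
    scores: Dict[str, float] = {}
    for unit in UNITS:
        value = record.get(f"{metric}_{unit}")
        if value is None or value == "":
            continue  # e.g. explanations 2 and 3 outside validation/test
        scores[unit] = float(value)
    return scores


def passes_threshold(record: Mapping[str, object], metric: str = "da",
                     threshold: float = 0.1) -> bool:
    """True if the record has scores and every one reaches the threshold."""
    scores = qe_scores(record, metric)
    return bool(scores) and all(s >= threshold for s in scores.values())


# Scores copied from the example entry above.
example = {
    "da_premise": "0.6099", "mqm_premise": "0.1298",
    "da_hypothesis": "0.8504", "mqm_hypothesis": "0.1521",
    "da_explanation_1": "0.0001", "mqm_explanation_1": "0.1237",
    "da_explanation_2": "0.4017", "mqm_explanation_2": "0.1467",
    "da_explanation_3": "0.6069", "mqm_explanation_3": "0.1389",
}

da = qe_scores(example, "da")
worst_unit = min(da, key=da.get)  # the unit with the lowest da score
```

On this entry the lowest `da` score belongs to `explanation_1`, whose Dutch translation drops the "You can not infer" part of the English source.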
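The dataset was created through the following steps:
Translating every field of the original e-SNLI corpus to Dutch using the Helsinki-NLP/opus-mt-en-nl neural machine translation model.
Annotating the quality estimation of the translations with two reference-free versions of Unbabel's COMET metric.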
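The two steps can be pictured as a translate-then-score loop over the text fields of each record. The sketch below shows only that structure and is not the authors' actual script: `translate` and the entries of `qe_models` are placeholder callables standing in for the translation model and the two COMET models, and the `_en` input keys are assumed to match the field names of this dataset rather than those of the original corpus.

```python
from typing import Callable, Dict, Mapping

TEXT_UNITS = ["premise", "hypothesis", "explanation_1", "explanation_2", "explanation_3"]


def translate_and_annotate(
    record: Mapping[str, object],
    translate: Callable[[str], str],
    qe_models: Mapping[str, Callable[[str, str], float]],
) -> Dict[str, object]:
    """Add <unit>_nl and <prefix>_<unit> fields to a copy of the record."""
    out: Dict[str, object] = dict(record)
    for unit in TEXT_UNITS:
        source = record.get(f"{unit}_en")
        if not source:
            continue  # unit not available for this record
        target = translate(str(source))
        out[f"{unit}_nl"] = target
        # Reference-free estimation: each scorer sees only source and translation.
        for prefix, score in qe_models.items():
            out[f"{prefix}_{unit}"] = score(str(source), target)
    return out


# Stand-in callables so the sketch runs without downloading any model.
dummy_translate = lambda text: f"[nl] {text}"
dummy_qe = {"da": lambda src, mt: 0.5, "mqm": lambda src, mt: 0.1}

annotated = translate_and_annotate(
    {"premise_en": "A woman is ice skating.", "label": 1},
    dummy_translate,
    dummy_qe,
)
```

Passing the models in as callables keeps the loop independent of any particular library, so real translation and quality estimation models can be plugged in without changing it.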
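For problems with this dataset version, please contact us at ik-nlp-course@rug.nl.
The dataset is licensed under the Apache 2.0 License.
Please cite the authors if you use these corpora in your work: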
@incollection{NIPS2018_8163,
title = {e-SNLI: Natural Language Inference with Natural Language Explanations},
author = {Camburu, Oana-Maria and Rockt\"{a}schel, Tim and Lukasiewicz, Thomas and Blunsom, Phil},
booktitle = {Advances in Neural Information Processing Systems 31},
editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
pages = {9539--9549},
year = {2018},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/8163-e-snli-natural-language-inference-with-natural-language-explanations.pdf}
}