Dataset:
GroNLP/ik-nlp-22_transqe
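This dataset contains the full e-SNLI dataset, automatically translated to Dutch using the Helsinki-NLP/opus-mt-en-nl neural machine translation model. The translation of each field has been annotated with two quality estimation scores using Unbabel's referenceless versions of the COMET metric.
The intended use of this corpus is restricted to the final projects of the 2022 edition of the Natural Language Processing course in the Information Science Master's Degree (IK) at the University of Groningen, taught by Arianna Bisazza and Gabriele Sarti with the assistance of Anjali Nair.
The e-SNLI corpus was made freely available by its authors on GitHub. This dataset was created for educational purposes and is based on the original e-SNLI dataset by Camburu et al. All rights to the contents of this dataset belong to the original authors.
The language data of this corpus is in English (BCP-47 en) and Dutch (BCP-47 nl).
By default, the dataset contains a single configuration named plain_text, with the three original splits train, validation and test. Every split contains the following fields: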
Field | Description |
---|---|
premise_en | The original English premise. |
premise_nl | The premise automatically translated to Dutch. |
hypothesis_en | The original English hypothesis. |
hypothesis_nl | The hypothesis automatically translated to Dutch. |
label | The label of the data instance (0 for entailment, 1 for neutral, 2 for contradiction). |
explanation_1_en | The first explanation for the assigned label in English. |
explanation_1_nl | The first explanation automatically translated to Dutch. |
explanation_2_en | The second explanation for the assigned label in English. |
explanation_2_nl | The second explanation automatically translated to Dutch. |
explanation_3_en | The third explanation for the assigned label in English. |
explanation_3_nl | The third explanation automatically translated to Dutch. |
da_premise | The quality estimation produced by the wmt20-comet-qe-da model for the premise translation. |
da_hypothesis | The quality estimation produced by the wmt20-comet-qe-da model for the hypothesis translation. |
da_explanation_1 | The quality estimation produced by the wmt20-comet-qe-da model for the first explanation translation. |
da_explanation_2 | The quality estimation produced by the wmt20-comet-qe-da model for the second explanation translation. |
da_explanation_3 | The quality estimation produced by the wmt20-comet-qe-da model for the third explanation translation. |
mqm_premise | The quality estimation produced by the wmt21-comet-qe-mqm model for the premise translation. |
mqm_hypothesis | The quality estimation produced by the wmt21-comet-qe-mqm model for the hypothesis translation. |
mqm_explanation_1 | The quality estimation produced by the wmt21-comet-qe-mqm model for the first explanation translation. |
mqm_explanation_2 | The quality estimation produced by the wmt21-comet-qe-mqm model for the second explanation translation. |
mqm_explanation_3 | The quality estimation produced by the wmt21-comet-qe-mqm model for the third explanation translation. |
Explanations 2 and 3 and their related quality estimation scores are only present in the validation and test splits.
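The field layout above can be sketched in code as follows. This is a minimal illustration, assuming the Hugging Face `datasets` library is installed and that the dataset is reachable under the identifier given at the top of this card. The names `LABEL_NAMES`, `explanation_fields` and `load_split` are hypothetical helpers introduced here, not part of the dataset. Depending on the installed library version, the loading call may need additional arguments.

```python
# Mapping taken from the description of the `label` field.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def explanation_fields(split):
    """Return the explanation and score columns expected in a given split.

    Per the note above, explanations 2 and 3 (and their scores) are only
    expected in the validation and test splits.
    """
    indices = (1,) if split == "train" else (1, 2, 3)
    fields = []
    for i in indices:
        fields += [
            f"explanation_{i}_en",
            f"explanation_{i}_nl",
            f"da_explanation_{i}",
            f"mqm_explanation_{i}",
        ]
    return fields


def load_split(split="validation"):
    """Load one split (assumption: needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily, optional dependency

    return load_dataset("GroNLP/ik-nlp-22_transqe", "plain_text", split=split)
```

For instance, `explanation_fields("train")` lists only the four columns tied to the first explanation, while the other two splits yield twelve.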
config | train | validation | test |
---|---|---|---|
plain_text | 549,367 | 9,842 | 9,824 |
For your analyses, choose the amount of data that is most reasonable for your computational setup; more data is generally better.
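One way to scale the amount of data to the available compute is the split-slicing syntax of the Hugging Face `datasets` library, which accepts strings such as `train[:10%]`. The sketch below is an assumption about how one might use it here; `split_spec` and `load_fraction` are hypothetical helper names, and the percentages are arbitrary.

```python
def split_spec(split, percent=100):
    """Build a split string such as 'train[:10%]' for a leading subset."""
    if not isinstance(percent, int) or not 1 <= percent <= 100:
        raise ValueError("percent must be an integer between 1 and 100")
    return split if percent == 100 else f"{split}[:{percent}%]"


def load_fraction(split="train", percent=10):
    """Load the first `percent` of a split (needs `datasets` and network)."""
    from datasets import load_dataset  # imported lazily, optional dependency

    return load_dataset(
        "GroNLP/ik-nlp-22_transqe",
        "plain_text",
        split=split_spec(split, percent),
    )
```

Note that this takes the leading rows of the split rather than a random sample, so shuffling before analysis may be preferable.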
The following is an example of entry 2000 from the test split:
{
  "premise_en": "A young woman wearing a yellow sweater and black pants is ice skating outdoors.",
  "premise_nl": "Een jonge vrouw met een gele trui en zwarte broek schaatst buiten.",
  "hypothesis_en": "a woman is practicing for the olympics",
  "hypothesis_nl": "een vrouw oefent voor de Olympische Spelen",
  "label": 1,
  "explanation_1_en": "You can not infer it's for the Olympics.",
  "explanation_1_nl": "Het is niet voor de Olympische Spelen.",
  "explanation_2_en": "Just because a girl is skating outdoors does not mean she is practicing for the Olympics.",
  "explanation_2_nl": "Alleen omdat een meisje buiten schaatst betekent niet dat ze oefent voor de Olympische Spelen.",
  "explanation_3_en": "Ice skating doesn't imply practicing for the olympics.",
  "explanation_3_nl": "Schaatsen betekent niet oefenen voor de Olympische Spelen.",
  "da_premise": "0.6099",
  "mqm_premise": "0.1298",
  "da_hypothesis": "0.8504",
  "mqm_hypothesis": "0.1521",
  "da_explanation_1": "0.0001",
  "mqm_explanation_1": "0.1237",
  "da_explanation_2": "0.4017",
  "mqm_explanation_2": "0.1467",
  "da_explanation_3": "0.6069",
  "mqm_explanation_3": "0.1389"
}
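The quality estimation scores in the example are shown as quoted strings. The sketch below converts them to floats and keeps only records whose translations score above a cutoff. The helpers `parse_scores` and `passes_threshold` are hypothetical, and the cutoff of 0.5 is an arbitrary assumption for illustration, not a value recommended by the dataset authors.

```python
def parse_scores(record, prefix):
    """Collect all scores whose key starts with `prefix` ('da' or 'mqm')."""
    return {
        key: float(value)  # float() accepts both "0.6099" and 0.6099
        for key, value in record.items()
        if key.startswith(prefix + "_") and value not in (None, "")
    }


def passes_threshold(record, fields=("da_premise", "da_hypothesis"), min_score=0.5):
    """True if every listed score is at least `min_score` (assumed cutoff)."""
    return all(float(record[name]) >= min_score for name in fields)


# Scores copied from the test-split example above.
example = {
    "da_premise": "0.6099",
    "da_hypothesis": "0.8504",
    "da_explanation_1": "0.0001",
    "mqm_premise": "0.1298",
}

print(passes_threshold(example))                                # True
print(passes_threshold(example, fields=("da_explanation_1",)))  # False
```

In this example the first explanation has a very low DA score (0.0001), which matches the visibly lossy Dutch translation of that sentence.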
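The dataset was created through the following steps:
Translating every field of the original e-SNLI corpus to Dutch using the Helsinki-NLP/opus-mt-en-nl neural machine translation model.
Annotating the quality estimation of the translations with two referenceless versions of Unbabel's COMET metric.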
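The two steps above can be sketched roughly as follows. This is not the authors' actual script: it assumes the `transformers` translation pipeline and the `unbabel-comet` package, and the helper names `translate_en_nl`, `build_qe_inputs` and `score_with_comet` are hypothetical. The checkpoint identifier accepted by `download_model` and the shape of the value returned by `predict` differ between COMET releases, so both should be checked against the installed version.

```python
def translate_en_nl(texts, batch_size=16):
    """Translate English strings to Dutch (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily, optional dependency

    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-nl")
    outputs = translator(list(texts), batch_size=batch_size)
    return [item["translation_text"] for item in outputs]


def build_qe_inputs(sources, translations):
    """Pair sources with translations in the src/mt format used by COMET QE."""
    if len(sources) != len(translations):
        raise ValueError("sources and translations must have the same length")
    return [{"src": s, "mt": t} for s, t in zip(sources, translations)]


def score_with_comet(samples, model_name="wmt20-comet-qe-da", batch_size=8, gpus=0):
    """Score src/mt pairs with a referenceless COMET model.

    Assumption: `model_name` matches a checkpoint known to the installed
    COMET version; the raw prediction output is returned unchanged.
    """
    from comet import download_model, load_from_checkpoint  # optional dependency

    model = load_from_checkpoint(download_model(model_name))
    return model.predict(samples, batch_size=batch_size, gpus=gpus)
```

Under these assumptions, the `da_*` and `mqm_*` fields would correspond to running `score_with_comet` once with each of the two model names over the same source and translation pairs.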
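For problems with this version of the dataset, please contact us at ik-nlp-course@rug.nl.
The dataset is licensed under the Apache 2.0 License.
If you use this corpus in your work, please cite the authors: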
@incollection{NIPS2018_8163,
  title = {e-SNLI: Natural Language Inference with Natural Language Explanations},
  author = {Camburu, Oana-Maria and Rockt\"{a}schel, Tim and Lukasiewicz, Thomas and Blunsom, Phil},
  booktitle = {Advances in Neural Information Processing Systems 31},
  editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
  pages = {9539--9549},
  year = {2018},
  publisher = {Curran Associates, Inc.},
  url = {http://papers.nips.cc/paper/8163-e-snli-natural-language-inference-with-natural-language-explanations.pdf}
}