Dataset: hackathon-pln-es/nli-es
Preprint: arxiv:1809.05053
A Spanish Natural Language Inference dataset put together from the SNLI, MultiNLI and XNLI corpora (see the source documentation linked below).
A small percentage of the dataset contains original Spanish text by human speakers. The rest was generated by automatic translation.
Each line includes four values: sentence1 (the premise); sentence2 (the hypothesis); a label specifying the relationship between the two ("gold_label"); and the ID number of the sentence pair as given in the original dataset ("pairID").
Labels can be "entailment" if the premise entails the hypothesis, "contradiction" if it contradicts it or "neutral" if it neither implies it nor denies it.
{ "gold_label": "neutral", "pairID": 1, "sentence1": "A ver si nos tenemos que poner todos en huelga hasta cobrar lo que queramos.", "sentence2": "La huelga es el método de lucha más eficaz para conseguir mejoras en el salario." }
gold_label: A string defining the relation between the sentence pair. Labels can be "entailment" if the premise entails the hypothesis, "contradiction" if it contradicts it or "neutral" if it neither implies it nor denies it.
pairID: A string identifying a sentence pair, inherited from the original datasets. NOTE: For the moment we are having trouble loading this column, so every string has been replaced with the integer 0 as a placeholder. We hope to have the pairID values back up soon.
sentence1: A string containing one sentence in Spanish, the premise. (See gold_label.)
sentence2: A string containing one sentence in Spanish, the hypothesis. (See gold_label.)
The whole dataset was used for training. We did not hold out an evaluation split, since evaluation was performed on SemEval-2015 Task 2.
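As a purely illustrative sketch (not part of the original card), the corpus can be loaded with the Hugging Face datasets library and inspected as follows, assuming the single train split described above:

```python
# Minimal sketch: load the corpus and look at one premise/hypothesis pair.
# Assumes the dataset exposes a single "train" split, as described above.
from datasets import load_dataset

ds = load_dataset("hackathon-pln-es/nli-es")
train = ds["train"]

print(train.features)  # expected columns: gold_label, pairID, sentence1, sentence2
print(train[0])        # first sentence pair with its label
```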
This corpus was built to remedy the scarcity of annotated Spanish-language datasets for NLI. It was generated by translating the original SNLI dataset into Spanish using Argos. While machine translation is far from an ideal source for semantic classification, it is an aid to enlarging the data available.
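The exact translation pipeline is not published in this card; the snippet below is only an illustrative sketch of the Argos Translate Python API (the argostranslate package), showing how an English sentence could be machine-translated into Spanish:

```python
# Illustrative only: the actual script used to build the corpus is not published.
import argostranslate.package
import argostranslate.translate

from_code, to_code = "en", "es"

# Download and install the English -> Spanish translation package.
argostranslate.package.update_package_index()
available = argostranslate.package.get_available_packages()
pkg = next(p for p in available if p.from_code == from_code and p.to_code == to_code)
argostranslate.package.install_from_path(pkg.download())

# Translate a sample English premise into Spanish.
print(argostranslate.translate.translate(
    "A person on a horse jumps over a broken down airplane.", from_code, to_code))
```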
For information on the initial data collection, the source language producers, the annotation process and the annotators, please refer to the respective documentation of the original datasets: https://nlp.stanford.edu/projects/snli/ https://arxiv.org/pdf/1809.05053.pdf https://cims.nyu.edu/~sbowman/multinli/
In general, no sensitive information is conveyed in the sentences. Please refer to the respective documentation of the original datasets: https://nlp.stanford.edu/projects/snli/ https://arxiv.org/pdf/1809.05053.pdf https://cims.nyu.edu/~sbowman/multinli/
The purpose of this dataset is to offer new tools for semantic textual similarity analysis of Spanish sentences.
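The card does not prescribe any tooling for this; as a hypothetical illustration (assuming the sentence-transformers library and a multilingual model picked only as an example), the similarity of the two sentences from the instance above could be scored like so:

```python
# Hypothetical usage example, not part of the dataset card.
from sentence_transformers import SentenceTransformer, util

# Example multilingual model; any Spanish-capable sentence encoder would do.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

premise = "A ver si nos tenemos que poner todos en huelga hasta cobrar lo que queramos."
hypothesis = "La huelga es el método de lucha más eficaz para conseguir mejoras en el salario."

embeddings = model.encode([premise, hypothesis], convert_to_tensor=True)
print(f"Cosine similarity: {util.cos_sim(embeddings[0], embeddings[1]).item():.3f}")
```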
The translation of the sentences was mostly unsupervised and may introduce some noise into the corpus. Machine translation from an English-language corpus is likely to generate syntactic and lexical forms that differ from those a human Spanish speaker would produce. For discussion of the biases and limitations of the original datasets, please refer to their respective documentation: https://nlp.stanford.edu/projects/snli/ https://arxiv.org/pdf/1809.05053.pdf https://cims.nyu.edu/~sbowman/multinli/
The nli-es dataset was put together by Anibal Pérez, Lautaro Gesuelli, Mauricio Mazuecos and Emilio Tomás Ariza.
This corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please refer to the respective documentation of the original datasets for information on their licenses: https://nlp.stanford.edu/projects/snli/ https://arxiv.org/pdf/1809.05053.pdf https://cims.nyu.edu/~sbowman/multinli/
If you need to cite this dataset, you can link to this readme.