Dataset: hackathon-pln-es/nli-es
Preprint: arxiv:1809.05053
A Spanish Natural Language Inference dataset put together from the SNLI, MultiNLI and XNLI corpora (see the source documentation linked below).
A small percentage of the dataset contains original Spanish text by human speakers. The rest was generated by automatic translation.
Each line includes four values: sentence1 (the premise); sentence2 (the hypothesis); a label specifying the relationship between the two ("gold_label"); and the ID number of the sentence pair as given in the original dataset ("pairID").
Labels can be "entailment" if the premise entails the hypothesis, "contradiction" if it contradicts it or "neutral" if it neither implies it nor denies it.
{ "gold_label": "neutral", "pairID": 1, "sentence1": "A ver si nos tenemos que poner todos en huelga hasta cobrar lo que queramos.", "sentence2": "La huelga es el método de lucha más eficaz para conseguir mejoras en el salario." }
gold_label: A string defining the relation between the sentence pair. Labels can be "entailment" if the premise entails the hypothesis, "contradiction" if it contradicts it or "neutral" if it neither implies it nor denies it.
pairID: A string identifying a sentence pair, inherited from the original datasets. NOTE: For the moment we are having trouble loading this column, so every string has been replaced with the integer 0 as a placeholder. We hope to have the pairID values back up soon.
sentence1: A string containing one sentence in Spanish, the premise. (See gold_label.)
sentence2: A string containing one sentence in Spanish, the hypothesis. (See gold_label.)
The whole dataset was used for training. We did not hold out an evaluation split, since evaluation was performed on SemEval-2015 Task 2.
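As a purely illustrative sketch (not part of the original card), the corpus can be loaded with the Hugging Face datasets library and inspected as follows, assuming the single train split described above:

```python
# Minimal sketch: load the corpus and look at one premise/hypothesis pair.
# Assumes the dataset exposes a single "train" split, as described above.
from datasets import load_dataset

ds = load_dataset("hackathon-pln-es/nli-es")
train = ds["train"]

print(train.features)  # expected columns: gold_label, pairID, sentence1, sentence2
print(train[0])        # first sentence pair with its label
```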
This corpus was built to remedy the scarcity of annotated Spanish-language datasets for NLI. It was generated by translating the original SNLI dataset into Spanish using Argos. While machine translation is far from an ideal source for semantic classification, it is an aid to enlarging the data available.
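The exact translation pipeline is not published in this card; the snippet below is only an illustrative sketch of the Argos Translate Python API (the argostranslate package), showing how an English sentence could be machine-translated into Spanish:

```python
# Illustrative only: the actual script used to build the corpus is not published.
import argostranslate.package
import argostranslate.translate

from_code, to_code = "en", "es"

# Download and install the English -> Spanish translation package.
argostranslate.package.update_package_index()
available = argostranslate.package.get_available_packages()
pkg = next(p for p in available if p.from_code == from_code and p.to_code == to_code)
argostranslate.package.install_from_path(pkg.download())

# Translate a sample English premise into Spanish.
print(argostranslate.translate.translate(
    "A person on a horse jumps over a broken down airplane.", from_code, to_code))
```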
For information on the initial data collection, the source language producers, the annotation process and the annotators, please refer to the respective documentation of the original datasets: https://nlp.stanford.edu/projects/snli/ https://arxiv.org/pdf/1809.05053.pdf https://cims.nyu.edu/~sbowman/multinli/
In general, no sensitive information is conveyed in the sentences. Please refer to the respective documentation of the original datasets: https://nlp.stanford.edu/projects/snli/ https://arxiv.org/pdf/1809.05053.pdf https://cims.nyu.edu/~sbowman/multinli/
The purpose of this dataset is to offer new tools for semantic textual similarity analysis of Spanish sentences.
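The card does not prescribe any tooling for this; as a hypothetical illustration (assuming the sentence-transformers library and a multilingual model picked only as an example), the similarity of the two sentences from the instance above could be scored like so:

```python
# Hypothetical usage example, not part of the dataset card.
from sentence_transformers import SentenceTransformer, util

# Example multilingual model; any Spanish-capable sentence encoder would do.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

premise = "A ver si nos tenemos que poner todos en huelga hasta cobrar lo que queramos."
hypothesis = "La huelga es el método de lucha más eficaz para conseguir mejoras en el salario."

embeddings = model.encode([premise, hypothesis], convert_to_tensor=True)
print(f"Cosine similarity: {util.cos_sim(embeddings[0], embeddings[1]).item():.3f}")
```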
The translation of the sentences was mostly unsupervised and may introduce some noise into the corpus. Machine translation from an English-language corpus is likely to generate syntactic and lexical forms that differ from those a human Spanish speaker would produce. For discussion of the biases and limitations of the original datasets, please refer to their respective documentation: https://nlp.stanford.edu/projects/snli/ https://arxiv.org/pdf/1809.05053.pdf https://cims.nyu.edu/~sbowman/multinli/
The nli-es dataset was put together by Anibal Pérez, Lautaro Gesuelli, Mauricio Mazuecos and Emilio Tomás Ariza.
This corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please refer to the respective documentation of the original datasets for information on their licenses: https://nlp.stanford.edu/projects/snli/ https://arxiv.org/pdf/1809.05053.pdf https://cims.nyu.edu/~sbowman/multinli/
If you need to cite this dataset, you can link to this readme.