Dataset: hda_nli_hindi
Tasks:
Languages:
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: machine-generated
Source datasets: extended|hindi_discourse
License:
An example of 'train' looks as follows.
{'hypothesis': 'यह एक वर्णनात्मक कथन है।', 'label': 1, 'premise': 'जैसे उस का सारा चेहरा अपना हो और आँखें किसी दूसरे की जो चेहरे पर पपोटों के पीछे महसूर कर दी गईं।', 'topic': 1}
Each row contains 4 columns: premise, hypothesis, label, and topic.
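As a minimal sketch, the sample row above can be checked against the four keys it contains. The column types below are inferred from that single example only, and `EXPECTED_COLUMNS` and `validate_row` are hypothetical names introduced for illustration; they are not part of the dataset or of any library.

```python
# The sample 'train' row from this card, copied verbatim.
row = {
    "hypothesis": "यह एक वर्णनात्मक कथन है।",
    "label": 1,
    "premise": "जैसे उस का सारा चेहरा अपना हो और आँखें किसी दूसरे की जो चेहरे पर पपोटों के पीछे महसूर कर दी गईं।",
    "topic": 1,
}

# Assumption: column names and Python types are inferred from the sample row only.
EXPECTED_COLUMNS = {"premise": str, "hypothesis": str, "label": int, "topic": int}


def validate_row(row):
    """Return a list of problems found in a row; an empty list means it matches."""
    problems = []
    for name, expected_type in EXPECTED_COLUMNS.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # Report any keys that are not among the four expected columns.
    extra = set(row) - set(EXPECTED_COLUMNS)
    problems.extend(f"unexpected column: {name}" for name in sorted(extra))
    return problems


print(validate_row(row))
```

A check like this may be useful before training, since a row with a missing or mistyped field is easier to catch here than inside a model pipeline.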
The source dataset for the recasting process is the BBC Hindi Headlines Dataset (https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1).
Initial Data Collection and Normalization
Please refer to this paper for detailed information: https://www.aclweb.org/anthology/2020.lrec-1.149/
The annotation process is described in the Dataset Creation section.
Who are the annotators?
Annotation is done automatically by machine, using the corresponding recasting process.
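To make the idea of machine-generated annotation through recasting more concrete, here is a rough sketch of how a labeled text could be turned into premise-hypothesis pairs. It is not the authors' actual pipeline: the class names, the English template, the `recast` function, and the 1/0 label convention are all assumptions made for illustration. The cited papers describe the real procedure.

```python
# Hypothetical class names and hypothesis template, for illustration only.
CLASSES = ["descriptive", "narrative", "dialogic"]
TEMPLATE = "This is a {cls} statement."


def recast(text, gold_class):
    """Turn one labeled text into one premise-hypothesis pair per class."""
    pairs = []
    for cls in CLASSES:
        pairs.append({
            "premise": text,
            "hypothesis": TEMPLATE.format(cls=cls),
            # Assumed convention: 1 if the hypothesis holds for the text, else 0.
            "label": 1 if cls == gold_class else 0,
        })
    return pairs


for pair in recast("The valley was quiet and covered in snow.", "descriptive"):
    print(pair["label"], pair["hypothesis"])
```

Under these assumptions, every source text yields exactly one pair labeled 1 and the rest labeled 0, and no human annotator is involved.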
The dataset does not mention any personal or sensitive information.
Please refer to this paper: https://www.aclweb.org/anthology/2020.aacl-main.71
No known biases exist in the dataset. Please refer to this paper: https://www.aclweb.org/anthology/2020.aacl-main.71
No other known limitations. The dataset may be too small to train large models.
Please refer to this link: https://github.com/midas-research/hindi-nli-data
The repository (https://github.com/midas-research/hindi-nli-data) states:
Copyright (C) 2019 Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi (MIDAS, IIIT-Delhi). Please contact the authors for any information on the dataset.
@inproceedings{uppal-etal-2020-two,
title = "Two-Step Classification using Recasted Data for Low Resource Settings",
author = "Uppal, Shagun and
Gupta, Vivek and
Swaminathan, Avinash and
Zhang, Haimin and
Mahata, Debanjan and
Gosangi, Rakesh and
Shah, Rajiv Ratn and
Stent, Amanda",
booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
month = dec,
year = "2020",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.aacl-main.71",
pages = "706--719",
abstract = "An NLP model{'}s ability to reason should be independent of language. Previous works utilize Natural Language Inference (NLI) to understand the reasoning ability of models, mostly focusing on high resource languages like English. To address scarcity of data in low-resource languages such as Hindi, we use data recasting to create NLI datasets for four existing text classification datasets. Through experiments, we show that our recasted dataset is devoid of statistical irregularities and spurious patterns. We further study the consistency in predictions of the textual entailment models and propose a consistency regulariser to remove pairwise-inconsistencies in predictions. We propose a novel two-step classification method which uses textual-entailment predictions for classification task. We further improve the performance by using a joint-objective for classification and textual entailment. We therefore highlight the benefits of data recasting and improvements on classification performance using our approach with supporting experimental results.",
}
Thanks to @avinsit123 for adding this dataset.