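Dataset: hda_nli_hindi
Task: text classification
Language: hi
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: machine-generated
Source datasets: extended|hindi_discourse
License: mit

An example of 'train' looks as follows.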
{'hypothesis': 'यह एक वर्णनात्मक कथन है।', 'label': 1, 'premise': 'जैसे उस का सारा चेहरा अपना हो और आँखें किसी दूसरे की जो चेहरे पर पपोटों के पीछे महसूर कर दी गईं।', 'topic': 1}
Each row contains 4 columns, as seen in the example above: premise, hypothesis, label, and topic.
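The row layout can be checked with a small helper. This is a minimal sketch: the function name `validate_row` and the constant `EXPECTED_COLUMNS` are hypothetical, and the column types are inferred only from the single sample row shown above, not from an official schema.

```python
from typing import Any, Mapping

# Column names and Python types inferred from the sample 'train' row.
# Assumption: every row follows the same layout as that sample.
EXPECTED_COLUMNS = {
    "premise": str,
    "hypothesis": str,
    "label": int,
    "topic": int,
}


def validate_row(row: Mapping[str, Any]) -> bool:
    """Return True if the row has exactly the expected columns and types."""
    if set(row) != set(EXPECTED_COLUMNS):
        return False
    return all(isinstance(row[name], typ) for name, typ in EXPECTED_COLUMNS.items())


# The sample row from the dataset card.
sample = {
    "hypothesis": "यह एक वर्णनात्मक कथन है।",
    "label": 1,
    "premise": "जैसे उस का सारा चेहरा अपना हो और आँखें किसी दूसरे की जो चेहरे पर पपोटों के पीछे महसूर कर दी गईं।",
    "topic": 1,
}

print(validate_row(sample))
```

A check like this can be run over each row after loading the data, before any training code relies on the four fields being present.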
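The source dataset for the recasting process is the BBC Hindi Headlines Dataset ( https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1 ).
Initial data collection and normalization: for details, see the paper at https://www.aclweb.org/anthology/2020.lrec-1.149/ .
The annotation process is described in the Dataset Creation section.
Who are the annotators? The annotations were produced automatically by machine, through the corresponding recasting process.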
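The general idea of recasting a classification example into premise-hypothesis pairs can be sketched as follows. This is an illustration under stated assumptions, not the authors' exact procedure: the topic names, the English template, the function name `recast`, and the convention that label 1 marks the hypothesis matching the true topic are all hypothetical.

```python
from typing import Dict, List

# Hypothetical topic names and hypothesis template, for illustration only.
# The real dataset uses Hindi hypotheses and its own topic inventory.
TOPICS = ["argumentative", "descriptive", "narrative"]
TEMPLATE = "The statement is {topic}."


def recast(premise: str, topic_id: int) -> List[Dict[str, object]]:
    """Turn one classified sentence into one NLI row per candidate topic."""
    rows = []
    for candidate_id, name in enumerate(TOPICS):
        rows.append({
            "premise": premise,
            "hypothesis": TEMPLATE.format(topic=name),
            # Assumed convention: 1 if the hypothesis names the true topic, else 0.
            "label": 1 if candidate_id == topic_id else 0,
            "topic": topic_id,
        })
    return rows


for row in recast("Some sentence from the source corpus.", topic_id=1):
    print(row["hypothesis"], row["label"])
```

Because every label is derived from the existing class annotation, no additional human annotation is needed, which is consistent with the machine-generated annotations described above.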
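No personal or sensitive information is mentioned in the dataset.
See the paper at https://www.aclweb.org/anthology/2020.aacl-main.71 .
There are no known biases in the dataset. See the paper at https://www.aclweb.org/anthology/2020.aacl-main.71 .
There are no other known limitations. The size of the data may not be enough to train large models.
See the link https://github.com/midas-research/hindi-nli-data .
The repo https://github.com/midas-research/hindi-nli-data states:
Copyright (C) 2019 Multimodal Digital Media Analysis Lab, Indraprastha Institute of Information Technology, New Delhi (MIDAS, IIIT-Delhi). Please contact the authors for any information on the dataset.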
@inproceedings{uppal-etal-2020-two,
    title = "Two-Step Classification using Recasted Data for Low Resource Settings",
    author = "Uppal, Shagun and Gupta, Vivek and Swaminathan, Avinash and Zhang, Haimin and Mahata, Debanjan and Gosangi, Rakesh and Shah, Rajiv Ratn and Stent, Amanda",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.aacl-main.71",
    pages = "706--719",
    abstract = "An NLP model{'}s ability to reason should be independent of language. Previous works utilize Natural Language Inference (NLI) to understand the reasoning ability of models, mostly focusing on high resource languages like English. To address scarcity of data in low-resource languages such as Hindi, we use data recasting to create NLI datasets for four existing text classification datasets. Through experiments, we show that our recasted dataset is devoid of statistical irregularities and spurious patterns. We further study the consistency in predictions of the textual entailment models and propose a consistency regulariser to remove pairwise-inconsistencies in predictions. We propose a novel two-step classification method which uses textual-entailment predictions for classification task. We further improve the performance by using a joint-objective for classification and textual entailment. We therefore highlight the benefits of data recasting and improvements on classification performance using our approach with supporting experimental results.",
}
Thanks to @avinsit123 for adding this dataset.