Dataset:
indic_glue
IndicGLUE is a natural language understanding benchmark for Indian languages. It contains a variety of tasks and covers 11 major Indian languages: as, bn, gu, hi, kn, ml, mr, or, pa, ta, te.
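The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task in which a system must read a sentence containing a pronoun and select the referent of that pronoun from a list of choices. The examples are manually constructed to foil simple statistical methods: each one depends on contextual information provided by a single word or phrase in the sentence. To convert the problem into sentence-pair classification, we construct sentence pairs by replacing the ambiguous pronoun with each possible referent. The task is to predict whether the sentence with the pronoun substituted is entailed by the original sentence. We use a small evaluation set consisting of new examples derived from fiction books that was shared privately by the authors of the original corpus. While the included training set is balanced between the two classes, the test set is imbalanced (65% not entailment). Also, due to a data quirk, the development set is adversarial: hypotheses are sometimes shared between training and development examples, so a model that memorizes the training examples will predict the wrong label on the corresponding development set examples. As with QNLI, each example is evaluated separately, so there is no systematic correspondence between a model's score on this task and its score on the unconverted original task. We call the converted dataset WNLI (Winograd NLI). This dataset has been translated and publicly released for 3 Indian languages by AI4Bharat.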
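The conversion described above (one sentence pair per candidate referent) can be sketched as follows. This is a minimal illustration, not the procedure used to build WNLI: the function name `build_wnli_pairs`, the record keys, the label convention (1 = entailed, 0 = not entailed), and the English trophy/suitcase sentence are all assumptions chosen for the example.

```python
import re


def build_wnli_pairs(sentence, pronoun, candidates, correct):
    """Return one premise/hypothesis/label record per candidate referent."""
    # Match the pronoun only as a standalone word (not inside "fit" or "suitcase").
    pattern = re.compile(rf"\b{re.escape(pronoun)}\b")
    pairs = []
    for candidate in candidates:
        # Substitute only the first standalone occurrence of the pronoun.
        hypothesis = pattern.sub(lambda _match: candidate, sentence, count=1)
        pairs.append({
            "premise": sentence,
            "hypothesis": hypothesis,
            # Assumed label convention: 1 = entailed, 0 = not entailed.
            "label": 1 if candidate == correct else 0,
        })
    return pairs


# Hypothetical input, used only to illustrate the construction.
pairs = build_wnli_pairs(
    sentence="The trophy doesn't fit in the suitcase because it is too big.",
    pronoun="it",
    candidates=["the trophy", "the suitcase"],
    correct="the trophy",
)
for pair in pairs:
    print(pair["label"], pair["hypothesis"])
```

Because each pair is scored on its own, a model can get one of the two pairs for a sentence right and the other wrong, which is why scores on the converted task do not map directly onto scores on the original pronoun-selection task.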
actsa-sc.te
An example of 'validation' looks as follows.
This example was too long and was cropped: { "label": 0, "text": "\"ప్రయాణాల్లో ఉన్నవారికోసం బస్ స్టేషన్లు, రైల్వే స్టేషన్లలో పల్స్పోలియో బూతులను ఏర్పాటు చేసి చిన్నారులకు పోలియో చుక్కలు వేసేలా ఏర..." }
bbca.hi
An example of 'train' looks as follows.
This example was too long and was cropped: { "label": "pakistan", "text": "\"नेटिजन यानि इंटरनेट पर सक्रिय नागरिक अब ट्विटर पर सरकार द्वारा लगाए प्रतिबंधों के समर्थन या विरोध में अपने विचार व्यक्त करते है..." }
copa.en
An example of 'validation' looks as follows.
{ "choice1": "I swept the floor in the unoccupied room.", "choice2": "I shut off the light in the unoccupied room.", "label": 1, "premise": "I wanted to conserve energy.", "question": "effect" }
copa.gu
An example of 'train' looks as follows.
This example was too long and was cropped: { "choice1": "\"સ્ત્રી જાણતી હતી કે તેનો મિત્ર મુશ્કેલ સમયમાંથી પસાર થઈ રહ્યો છે.\"...", "choice2": "\"મહિલાને લાગ્યું કે તેના મિત્રએ તેની દયાળુ લાભ લીધો છે.\"...", "label": 0, "premise": "મહિલાએ તેના મિત્રની મુશ્કેલ વર્તન સહન કરી.", "question": "cause" }
copa.hi
An example of 'validation' looks as follows.
{ "choice1": "मैंने उसका प्रस्ताव ठुकरा दिया।", "choice2": "उन्होंने मुझे उत्पाद खरीदने के लिए राजी किया।", "label": 0, "premise": "मैंने सेल्समैन की पिच पर शक किया।", "question": "effect" }
The data fields are the same among all splits.
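As a rough illustration of how the COPA fields fit together, the sketch below resolves a record to its selected alternative. It assumes that `label` is a 0-based index into (`choice1`, `choice2`), which is consistent with the copa.en example shown above; the helper name `resolve_copa` and the output keys are hypothetical.

```python
def resolve_copa(example):
    """Pick the alternative that `label` points to in a COPA-style record."""
    # Assumption: label 0 selects choice1 and label 1 selects choice2.
    choices = (example["choice1"], example["choice2"])
    return {
        "premise": example["premise"],
        "relation": example["question"],  # "cause" or "effect"
        "answer": choices[example["label"]],
    }


# The copa.en validation example from this card.
example = {
    "choice1": "I swept the floor in the unoccupied room.",
    "choice2": "I shut off the light in the unoccupied room.",
    "label": 1,
    "premise": "I wanted to conserve energy.",
    "question": "effect",
}

resolved = resolve_copa(example)
print(f'{resolved["premise"]} -> ({resolved["relation"]}) {resolved["answer"]}')
```

The `question` field says whether the chosen alternative is the cause or the effect of the premise, so it is kept alongside the answer instead of being discarded.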
| name | train | validation | test |
|---|---|---|---|
| actsa-sc.te | 4328 | 541 | 541 |
| bbca.hi | 3467 | n/a | 866 |
| copa.en | 400 | 100 | 500 |
| copa.gu | 362 | 88 | 448 |
| copa.hi | 362 | 88 | 449 |
@inproceedings{kakwani-etal-2020-indicnlpsuite,
  title = "{I}ndic{NLPS}uite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for {I}ndian Languages",
  author = "Kakwani, Divyanshu and Kunchukuttan, Anoop and Golla, Satish and N.C., Gokul and Bhattacharyya, Avik and Khapra, Mitesh M. and Kumar, Pratyush",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2020.findings-emnlp.445",
  doi = "10.18653/v1/2020.findings-emnlp.445",
  pages = "4948--4961",
}

@inproceedings{Levesque2011TheWS,
  title = {The Winograd Schema Challenge},
  author = {H. Levesque and E. Davis and L. Morgenstern},
  booktitle = {KR},
  year = {2011}
}
Thanks to @sumanthd17 for adding this dataset.