Datasets:

csebuetnlp/xnli_bn

Languages:

bn

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

machine-generated

Source datasets:

extended

Dataset Card for xnli_bn

Dataset Summary

This is a Natural Language Inference (NLI) dataset for Bengali, curated from the subset of MNLI data used in XNLI and translated with the state-of-the-art English-to-Bengali translation model introduced here.

Supported Tasks and Leaderboards

More information needed

Languages

  • Bengali

Usage

from datasets import load_dataset
dataset = load_dataset("csebuetnlp/xnli_bn")

Dataset Structure

Data Instances

One example from the dataset is given below in JSON format.

{
  "sentence1": "আসলে, আমি এমনকি এই বিষয়ে চিন্তাও করিনি, কিন্তু আমি এত হতাশ হয়ে পড়েছিলাম যে, শেষ পর্যন্ত আমি আবার তার সঙ্গে কথা বলতে শুরু করেছিলাম",
  "sentence2": "আমি তার সাথে আবার কথা বলিনি।",
  "label": "contradiction"
}

Data Fields

The data fields are as follows:

  • sentence1: a string feature indicating the premise.
  • sentence2: a string feature indicating the hypothesis.
  • label: a classification label, where possible values are contradiction (0), entailment (1), neutral (2).
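
The sample instance shows the label as a string, while the field list gives integer ids. The following is a minimal sketch of a helper that accepts either form, assuming the id order stated in the field list. The names `ID2LABEL`, `LABEL2ID`, and `label_name` are illustrative and are not part of the dataset or of the `datasets` library.

```python
# Id-to-name mapping copied from the field description; helper names are hypothetical.
ID2LABEL = {0: "contradiction", 1: "entailment", 2: "neutral"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def label_name(label):
    """Return the label string for either an integer id or a label string."""
    if isinstance(label, int):
        return ID2LABEL[label]
    if label in LABEL2ID:
        return label
    raise ValueError(f"unknown label: {label!r}")


# Works whether a loader yields the string form or the integer form.
for raw in ("contradiction", 1, 2):
    name = label_name(raw)
    print(raw, "->", name, LABEL2ID[name])
```

Normalizing once at load time keeps later code (metrics, confusion matrices) independent of which representation a given loader returns.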

Data Splits

split count
train 381449
validation 2419
test 4895
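
As a quick sanity check, the split sizes can be compared against the counts in the table. This is a sketch only: `EXPECTED_COUNTS` copies the table, and `mismatched_splits` is a hypothetical helper, not something shipped with the dataset.

```python
# Split sizes copied from the table above.
EXPECTED_COUNTS = {"train": 381449, "validation": 2419, "test": 4895}


def mismatched_splits(actual_counts, expected=EXPECTED_COUNTS):
    """Return {split: (expected, actual)} for every split whose size differs."""
    return {
        split: (count, actual_counts.get(split))
        for split, count in expected.items()
        if actual_counts.get(split) != count
    }


total = sum(EXPECTED_COUNTS.values())
for split, count in EXPECTED_COUNTS.items():
    # Show each split's size and its share of all examples.
    print(f"{split:<10} {count:>7} {count / total:7.2%}")
```

With the dataset loaded as in the Usage section, the actual counts could be gathered with `{name: ds.num_rows for name, ds in dataset.items()}` and passed to `mismatched_splits`; an empty result means the sizes agree with the table.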

Dataset Creation

The dataset was curated with the same procedure as the XNLI dataset: we translated the MultiNLI training data using the English-to-Bangla translation model introduced here. Because automatic translation can introduce errors, we computed the similarity between each translation and its original sentence using their Language-Agnostic BERT Sentence Embeddings (LaBSE). All sentences whose similarity fell below a threshold of 0.70 were discarded.
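
The filtering step described here can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the text does not say which similarity measure was used, so cosine similarity is assumed, and the embedding model is passed in as a plain `embed` callable so that the sketch runs without downloading LaBSE. The names `cosine_similarity`, `filter_translations`, and the toy vectors are all hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.70  # threshold stated in the text


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_translations(pairs, embed, threshold=SIMILARITY_THRESHOLD):
    """Keep (source, translation) pairs whose similarity is at least `threshold`."""
    return [
        (source, translation)
        for source, translation in pairs
        if cosine_similarity(embed(source), embed(translation)) >= threshold
    ]


# Toy 2-d vectors stand in for real sentence embeddings.
toy_embeddings = {
    "a cat": [1.0, 0.0],
    "translation A": [0.9, 0.1],  # close to its source, so it is kept
    "a dog": [0.0, 1.0],
    "translation B": [1.0, 0.2],  # far from its source, so it is dropped
}
pairs = [("a cat", "translation A"), ("a dog", "translation B")]
kept = filter_translations(pairs, toy_embeddings.__getitem__)
print(kept)
```

In a real pipeline, `embed` would wrap a LaBSE encoder and would usually be batched for speed. How a premise and hypothesis are handled when only one of them falls below the threshold is not specified in the text.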

Curation Rationale

More information needed

Source Data

XNLI

Initial Data Collection and Normalization

More information needed

Who are the source language producers?

More information needed

Annotations

More information needed

Annotation process

More information needed

Who are the annotators?

More information needed

Personal and Sensitive Information

More information needed

Considerations for Using the Data

Social Impact of Dataset

More information needed

Discussion of Biases

More information needed

Other Known Limitations

More information needed

Additional Information

Dataset Curators

More information needed

Licensing Information

Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use the dataset, please cite the following paper:

@misc{bhattacharjee2021banglabert,
      title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
      author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
      year={2021},
      eprint={2101.00204},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @abhik1505040 and @Tahmid for adding this dataset.