Dataset:

csebuetnlp/BanglaParaphrase

Task:

Text-to-text generation

Language:

bn

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

found

Source datasets:

original

Preprint:

arxiv:2210.05109

Dataset Card for "BanglaParaphrase"

Dataset Summary

We present BanglaParaphrase, a high-quality synthetic Bangla paraphrase dataset containing about 466k paraphrase pairs. The paraphrases are high quality in that they are semantically coherent and syntactically diverse.

Supported Tasks and Leaderboards

More information needed

Languages

  • Bengali

Loading the dataset

from datasets import load_dataset

ds = load_dataset("csebuetnlp/BanglaParaphrase")

Dataset Structure

Data Instances

One example from the train split of the dataset is given below in JSON format.

{
"source": "বেশিরভাগ সময় প্রকৃতির দয়ার ওপরেই বেঁচে থাকতেন উপজাতিরা।", 
"target": "বেশিরভাগ সময়ই উপজাতিরা প্রকৃতির দয়ার উপর নির্ভরশীল ছিল।"
}

Data Fields

  • 'source': A string representing the source sentence.
  • 'target': A string representing the target sentence.
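As a minimal sketch of how these two fields might be consumed, the snippet below parses the sample record shown earlier with the standard `json` module and checks that both fields are strings. The helper name `parse_pair` is hypothetical; it is not part of the dataset or of the `datasets` library.

```python
import json
from typing import Tuple

# The sample record from the card, stored as one JSON line.
raw = (
    '{"source": "বেশিরভাগ সময় প্রকৃতির দয়ার ওপরেই বেঁচে থাকতেন উপজাতিরা।", '
    '"target": "বেশিরভাগ সময়ই উপজাতিরা প্রকৃতির দয়ার উপর নির্ভরশীল ছিল।"}'
)


def parse_pair(line: str) -> Tuple[str, str]:
    """Parse one JSON record and return (source, target).

    Hypothetical helper: it only assumes the two string fields
    described in the card.
    """
    record = json.loads(line)
    source, target = record["source"], record["target"]
    if not isinstance(source, str) or not isinstance(target, str):
        raise TypeError("'source' and 'target' must both be strings")
    # Strip surrounding whitespace so pairs compare cleanly.
    return source.strip(), target.strip()


source, target = parse_pair(raw)
print(source)
print(target)
```

The same function can be applied to each row when the data is exported as JSON lines.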

Data Splits

Example counts for the train, validation, and test splits are given below:

Language    ISO 639-1 Code    Train      Validation    Test
Bengali     bn                419,967    23,331        23,332
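Taking the split sizes as 419,967 (train), 23,331 (validation), and 23,332 (test), a quick arithmetic check shows that they sum to roughly the 466k pairs quoted in the summary and correspond to about a 90/5/5 split. The variable names below are illustrative only.

```python
# Split sizes, assuming the counts 419,967 / 23,331 / 23,332.
split_sizes = {"train": 419_967, "validation": 23_331, "test": 23_332}

# Total number of paraphrase pairs across all splits.
total = sum(split_sizes.values())

# Share of each split as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

print(total)   # 466630
print(shares)  # {'train': 90.0, 'validation': 5.0, 'test': 5.0}
```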

Dataset Creation

Curation Rationale

More information needed

Source Data

Roar Bangla

Initial Data Collection and Normalization

Detailed in the paper

Who are the source language producers?

Detailed in the paper

Annotations

Detailed in the paper

Annotation process

Detailed in the paper

Who are the annotators?

Detailed in the paper

Personal and Sensitive Information

More information needed

Considerations for Using the Data

Social Impact of Dataset

More information needed

Discussion of Biases

More information needed

Other Known Limitations

More information needed

Additional Information

Dataset Curators

More information needed

Licensing Information

Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

@article{akil2022banglaparaphrase,
  title={BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset},
  author={Akil, Ajwad and Sultana, Najrin and Bhattacharjee, Abhik and Shahriyar, Rifat},
  journal={arXiv preprint arXiv:2210.05109},
  year={2022}
}

Contributions