Datasets:

persiannlp/parsinlu_query_paraphrasing

Languages:

fa

Multilinguality:

monolingual

Size categories:

1K<n<10K

Language creators:

expert-generated

Annotations creators:

expert-generated

ArXiv:

arxiv:2012.06154

Dataset Card for ParsiNLU (Query Paraphrasing)

Dataset Summary

A Persian query paraphrasing task: deciding whether two questions are paraphrases of each other. Some of the questions were generated from Google auto-complete, and the rest were translated from the Quora paraphrasing dataset.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The text in the dataset is in Persian (fa).

Dataset Structure

Data Instances

Here is an example from the dataset:

{
  "q1": "اعمال حج تمتع از چه روزی شروع میشود؟",
  "q2": "ویار از چه روزی شروع میشود؟",
  "label": "0",
  "category": "natural"
}

Data Fields

  • q1: the first question.
  • q2: the second question.
  • category: the source of the question pair, either mined from Quora (qqp) or extracted from Google auto-complete (natural).
  • label: 1 if the questions are paraphrases of each other; 0 otherwise.
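
As a minimal sketch of how a record with these fields could be consumed, the snippet below parses the example instance into a typed structure. The names `QueryPair` and `parse_record` are hypothetical and not part of the dataset or any library. It also assumes, based only on the single example above, that `label` is stored as the string "0" or "1" (an integer is accepted as well, to be safe).

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPair:
    """Hypothetical container for one record; not part of the dataset itself."""
    q1: str
    q2: str
    category: str        # "qqp" or "natural"
    is_paraphrase: bool  # derived from the label field


def parse_record(raw: str) -> QueryPair:
    """Parse one JSON record and validate the two categorical fields."""
    record = json.loads(raw)
    label = str(record["label"])  # assumption: label may be "0"/"1" or 0/1
    if label not in {"0", "1"}:
        raise ValueError(f"unexpected label: {label!r}")
    if record["category"] not in {"qqp", "natural"}:
        raise ValueError(f"unexpected category: {record['category']!r}")
    return QueryPair(
        q1=record["q1"],
        q2=record["q2"],
        category=record["category"],
        is_paraphrase=(label == "1"),
    )


# The example instance shown under "Data Instances".
example = """{
  "q1": "اعمال حج تمتع از چه روزی شروع میشود؟",
  "q2": "ویار از چه روزی شروع میشود؟",
  "label": "0",
  "category": "natural"
}"""

pair = parse_record(example)
print(pair.category, pair.is_paraphrase)  # prints: natural False
```

Converting the label to a boolean up front avoids the common mistake of treating the non-empty string "0" as truthy.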

Data Splits

The train/dev/test splits contain 1830/898/1916 samples, respectively.
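
To make the split sizes easier to compare, the short sketch below computes the total and each split's share from the counts stated above. The dictionary name `splits` is just for illustration.

```python
# Split sizes as stated in this card.
splits = {"train": 1830, "dev": 898, "test": 1916}

total = sum(splits.values())
shares = {name: n / total for name, n in splits.items()}

for name, n in splits.items():
    print(f"{name}: {n} samples ({shares[name]:.1%})")
print(f"total: {total}")  # total: 4644
```

The total of 4644 samples is consistent with the 1K<n<10K size category, and the test split is slightly larger than the training split.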

Dataset Creation

Curation Rationale

For details, see the corresponding draft (arXiv:2012.06154).

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

CC BY-NC-SA 4.0 License

Citation Information

@article{huggingface:dataset,
    title = {ParsiNLU: A Suite of Language Understanding Challenges for Persian},
    author = {Khashabi, Daniel and Cohan, Arman and Shakeri, Siamak and Hosseini, Pedram and Pezeshkpour, Pouya and Alikhani, Malihe and Aminnaseri, Moin and Bitaab, Marzieh and Brahman, Faeze and Ghazarian, Sarik and others},
    year = {2020},
    journal = {arXiv e-prints},
    eprint = {2012.06154},
}

Contributions

Thanks to @danyaljj for adding this dataset.