Datasets:

sberquad

Tasks:

question-answering

Sub-tasks:

extractive-qa

Languages:

ru

Multilinguality:

monolingual

Size Categories:

10K<n<100K

Language Creators:

found crowdsourced

Annotations Creators:

crowdsourced

Source Datasets:

original

ArXiv:

arxiv:1912.09723

Dataset Card for sberquad

Dataset Summary

The Sber Question Answering Dataset (SberQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. It is a Russian analogue of SQuAD, originally presented at the Sberbank Data Science Journey 2017.

Supported Tasks and Leaderboards

[Needs More Information]

Languages

Russian

Dataset Structure

Data Instances

{
    "context": "Первые упоминания о строении человеческого тела встречаются в Древнем Египте...",
    "id": 14754,
    "qas": [
        {
            "id": 60544,
            "question": "Где встречаются первые упоминания о строении человеческого тела?",
            "answers": [{"answer_start": 60, "text": "в Древнем Египте"}]
        }
    ]
}
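The relationship between answer_start and text in the instance above can be checked with a short sketch. It assumes that answer_start is a 0-based character offset into context, which holds for the sample shown here; the helper name answer_span is hypothetical and not part of any library.

```python
# Sample record copied from this card (context truncated as in the original).
sample = {
    "context": "Первые упоминания о строении человеческого тела встречаются в Древнем Египте...",
    "id": 14754,
    "qas": [
        {
            "id": 60544,
            "question": "Где встречаются первые упоминания о строении человеческого тела?",
            "answers": [{"answer_start": 60, "text": "в Древнем Египте"}],
        }
    ],
}


def answer_span(context: str, answer: dict) -> str:
    """Slice the context using the character offset and the answer length.

    Assumption: answer_start counts characters (not bytes) from zero.
    """
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])]


for qa in sample["qas"]:
    for answer in qa["answers"]:
        span = answer_span(sample["context"], answer)
        # A mismatch would suggest the offset is not a plain character index.
        print(qa["id"], span == answer["text"])
```

A check like this is a cheap way to catch offset drift after any preprocessing that changes the context string, such as whitespace normalization.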

Data Fields

  • id: an int32 feature
  • title: a string feature
  • context: a string feature
  • question: a string feature
  • answers: a dictionary feature containing:
    • text: a string feature
    • answer_start: an int32 feature
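The sample under Data Instances nests questions inside a paragraph record, while the fields listed here are flat. One possible way to map the former onto the latter is sketched below. This is an assumption about how the two layouts relate, not a description of the loader itself: the function name flatten is hypothetical, title defaults to an empty string because the sample carries none, and answers is stored as parallel lists in the style commonly used for SQuAD-like data.

```python
from typing import Iterator


def flatten(paragraph: dict, title: str = "") -> Iterator[dict]:
    """Yield one flat row per question in a nested paragraph record."""
    for qa in paragraph["qas"]:
        yield {
            "id": qa["id"],
            "title": title,  # assumption: not present in the nested sample
            "context": paragraph["context"],
            "question": qa["question"],
            # Parallel lists: answers["text"][i] starts at answers["answer_start"][i].
            "answers": {
                "text": [a["text"] for a in qa["answers"]],
                "answer_start": [a["answer_start"] for a in qa["answers"]],
            },
        }


# Nested record in the shape shown under Data Instances.
paragraph = {
    "context": "Первые упоминания о строении человеческого тела встречаются в Древнем Египте...",
    "id": 14754,
    "qas": [
        {
            "id": 60544,
            "question": "Где встречаются первые упоминания о строении человеческого тела?",
            "answers": [{"answer_start": 60, "text": "в Древнем Египте"}],
        }
    ],
}

rows = list(flatten(paragraph))
print(rows[0]["id"], rows[0]["answers"])
```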

Data Splits

name         train   validation   test
plain_text   45328   5036         23936
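As a quick sanity check on these counts, the sketch below sums the three splits and reports each split's share of the total. The numbers are taken directly from the table; the variable names are illustrative only.

```python
# Split sizes as listed in the Data Splits table of this card.
splits = {"train": 45328, "validation": 5036, "test": 23936}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name:<10} {count:>6} {shares[name]:.1%}")
print(f"{'total':<10} {total:>6}")

# The card's size category is 10K<n<100K; the total should fall inside it.
print("within size category:", 10_000 < total < 100_000)
```

The three splits add up to 74,300 examples, which is consistent with the 10K<n<100K size category listed at the top of the card.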

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

[Needs More Information]

Citation Information

@InProceedings{sberquad,
doi       = {10.1007/978-3-030-58219-7_1},
author    = {Pavel Efimov and
             Andrey Chertok and
             Leonid Boytsov and
             Pavel Braslavski},
title     = {SberQuAD -- Russian Reading Comprehension Dataset: Description and Analysis},
booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction},
year      = {2020},
publisher = {Springer International Publishing},
pages     = {3--15}
}

Contributions

Thanks to @alenusch for adding this dataset.