Dataset:

fquad

Tasks:

question-answering

text-retrieval

Sub-tasks:

extractive-qa closed-domain-qa

Languages:

fr

Multilinguality:

monolingual

Size:

1K<n<10K

Language creators:

crowdsourced found

Annotation creators:

crowdsourced

Source datasets:

original

arXiv:

arxiv:2002.06071

License:

cc-by-nc-sa-3.0

Dataset Card for FQuAD

Dataset Summary

FQuAD: French Question Answering Dataset. We introduce FQuAD, a native French question answering dataset.

FQuAD contains 25,000+ question and answer pairs. Fine-tuning CamemBERT on FQuAD yields an F1 score of 88% and an exact match of 77.9%. FQuAD was developed to provide a SQuAD equivalent in the French language. Questions are original and based on high-quality Wikipedia articles.
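To make the two reported metrics concrete, here is a minimal sketch of SQuAD-style exact match and token-level F1 for a single prediction. The helper names (`normalize`, `exact_match`, `f1_score`) and the normalization rules are assumptions for illustration. The official FQuAD evaluation may normalize answers differently, for example in how it handles French articles and elisions.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace ASCII punctuation with spaces, split on whitespace.

    Simplified normalization (an assumption, not the official FQuAD rules).
    """
    text = text.lower()
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("La Vierge aux rochers", "la vierge aux rochers.")
f1 = f1_score("Vierge aux rochers", "La Vierge aux rochers")
```

Corpus-level scores such as the ones quoted above are typically the average of these per-question values, expressed as percentages.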

Please note that this dataset is licensed for non-commercial purposes, and users must agree to the following terms and conditions:

Use FQuAD only for internal research purposes.

Do not make any copy except a safety (backup) one.

Do not redistribute it (or part of it) in any way, even for free.

Do not sell it or use it for any commercial purpose. Contact us for a possible commercial licence.

Mention the corpus origin and Illuin Technology in all publications about experiments using FQuAD.

Redistribute to Illuin Technology any improved or enriched version you could make of that corpus.

Request a manual download of the data from: https://fquad.illuin.tech/

Supported Tasks and Leaderboards

closed-domain-qa, text-retrieval: This dataset is intended for closed-domain-qa, but it can also be used for information retrieval tasks.

Languages

This dataset is exclusively in French (fr), with context data from Wikipedia and questions written by French university students.

Dataset Structure

Data Instances

default

Size of downloaded dataset files: 3.29 MB
Size of the generated dataset: 6.94 MB
Total amount of disk used: 10.23 MB

An example of 'validation' looks as follows.

This example was too long and was cropped:

{
    "answers": {
        "answers_starts": [161, 46, 204],
        "texts": ["La Vierge aux rochers", "documents contemporains", "objets de spéculations"]
    },
    "context": "\"Les deux tableaux sont certes décrits par des documents contemporains à leur création mais ceux-ci ne le font qu'indirectement ...",
    "questions": ["Que concerne principalement les documents ?", "Par quoi sont décrit les deux tableaux ?", "Quels types d'objets sont les deux tableaux aux yeux des chercheurs ?"]
}

Data Fields

The data fields are the same among all splits.

default

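context: a string feature.
questions: a list of string features.
answers: a dictionary feature containing:
- texts: a list of string features, one answer text per question, as in the example above.
- answers_starts: a list of int32 features, one start offset per answer, as in the example above.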
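The fields above can be read together as parallel lists. The following minimal sketch pairs each question with its answer and checks that the answer text appears in the context at the given offset. It assumes that `answers_starts` holds character offsets into `context` (the SQuAD convention). The record and the helper name `iter_qa` are hypothetical and only serve as an illustration.

```python
# Hypothetical record in the layout described above; the values are made up.
record = {
    "context": "Paris est la capitale de la France.",
    "questions": [
        "Quelle est la capitale de la France ?",
        "De quel pays Paris est-il la capitale ?",
    ],
    "answers": {
        "texts": ["Paris", "la France"],
        "answers_starts": [0, 25],
    },
}


def iter_qa(record):
    """Yield (question, answer_text, start) triples for one record.

    Raises ValueError if an answer text is not found at its start offset,
    assuming offsets count characters from the start of the context.
    """
    answers = record["answers"]
    triples = zip(record["questions"], answers["texts"], answers["answers_starts"])
    for question, text, start in triples:
        span = record["context"][start:start + len(text)]
        if span != text:
            raise ValueError(f"answer {text!r} not found at offset {start}")
        yield question, text, start


pairs = list(iter_qa(record))
for question, text, start in pairs:
    print(f"{question} -> {text} (offset {start})")
```

Flattening records this way gives one row per question, which is the shape most extractive QA training code expects.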

Data Splits

The FQuAD dataset has 3 splits: train, validation, and test. The test split, however, is not publicly released at the moment. The splits contain disjoint sets of articles. The following table gives statistics for each split.

Dataset split	Number of articles	Number of paragraphs	Number of questions
Train	117	4921	20731
Validation	18	768	3188
Test	10	532	2189
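As a quick consistency check, the question counts in the table can be summed and compared with the "25,000+" figure in the summary. The sketch below uses only the question counts and the paragraph counts of the train and test rows, copied from the table. The variable names are illustrative.

```python
# Question counts per split, copied from the table above.
questions = {"train": 20731, "validation": 3188, "test": 2189}
# Paragraph counts for the train and test rows of the table.
paragraphs = {"train": 4921, "test": 532}

total_questions = sum(questions.values())
# The test split is not public, so only train and validation are available.
public_questions = questions["train"] + questions["validation"]
print(total_questions)   # 26108
print(public_questions)  # 23919

# Average number of questions per paragraph in each split.
per_paragraph = {
    split: round(questions[split] / count, 2)
    for split, count in paragraphs.items()
}
print(per_paragraph)  # {'train': 4.21, 'test': 4.11}
```

The total of 26,108 questions matches the "25,000+" claim, and roughly four questions per paragraph matches the annotation guideline described under Annotations.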

Dataset Creation

Curation Rationale

The FQuAD dataset was created by Illuin Technology. It was developed to provide a SQuAD equivalent in the French language. Questions are original and based on high-quality Wikipedia articles.

Source Data

The texts used for the contexts come from the curated list of French high-quality Wikipedia articles.

Annotations

Annotations (spans and questions) were written by students of the CentraleSupélec school of engineering. Wikipedia articles were scraped, and Illuin used an internally developed tool to help annotators ask questions and indicate the answer spans. Annotators were given paragraph-sized contexts and asked to write 4 to 5 non-trivial questions about information in the context.

Personal and Sensitive Information

No personal or sensitive information is included in this dataset. This has been manually verified by the dataset curators.

Considerations for Using the Data

Users should keep in mind that this dataset is sampled from Wikipedia data, which might not be representative of all QA use cases.

Social Impact of Dataset

The social biases of this dataset have not yet been investigated.

Discussion of Biases

The social biases of this dataset have not yet been investigated, though articles have been selected by their quality and objectivity.

Other Known Limitations

The limitations of the FQuAD dataset have not yet been investigated.

Additional Information

Dataset Curators

Illuin Technology: https://fquad.illuin.tech/

Licensing Information

The FQuAD dataset is licensed under the CC BY-NC-SA 3.0 license.

It allows personal and academic research uses of the dataset, but not commercial uses. So concretely, the dataset cannot be used to train a model that is then put into production within a business or a company. For this type of commercial use, we invite FQuAD users to contact the authors to discuss possible partnerships.

Citation Information

@ARTICLE{2020arXiv200206071,
       author = {d'Hoffschmidt, Martin and Vidal, Maxime and
         Belblidia, Wacim and Brendlé, Tom},
        title = "{FQuAD: French Question Answering Dataset}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computation and Language},
         year = "2020",
        month = "Feb",
          eid = {arXiv:2002.06071},
        pages = {arXiv:2002.06071},
archivePrefix = {arXiv},
       eprint = {2002.06071},
 primaryClass = {cs.CL}
}

Contributions

Thanks to @thomwolf, @mariamabarham, @patrickvonplaten, @lewtun, @albertvillanova for adding this dataset. Thanks to @ManuelFay for providing information on the dataset creation process.

Author:

Anonymous

Dataset size:

14.9 KB