Datasets:

fquad

Languages:

fr

Multilinguality:

monolingual

Size categories:

1K<n<10K

Language creators:

crowdsourced, found

Annotations creators:

crowdsourced

Source datasets:

original

ArXiv:

arxiv:2002.06071

Dataset Card for FQuAD

Dataset Summary

FQuAD: French Question Answering Dataset. We introduce FQuAD, a native French Question Answering Dataset.

FQuAD contains 25,000+ question and answer pairs. Fine-tuning CamemBERT on FQuAD yields an F1 score of 88% and an exact match of 77.9%. The dataset was developed to provide a SQuAD equivalent in the French language. Questions are original and based on high-quality Wikipedia articles.

Please note that this dataset is licensed for non-commercial purposes and that users must agree to the following terms and conditions:

  • Use FQuAD only for internal research purposes.
  • Not make any copy except a safety one.
  • Not redistribute it (or part of it) in any way, even for free.
  • Not sell it or use it for any commercial purpose. Contact us for a possible commercial licence.
  • Mention the corpus origin and Illuin Technology in all publications about experiments using FQuAD.
  • Redistribute to Illuin Technology any improved or enriched version you could make of that corpus.
  • Manually request the download of the data from: https://fquad.illuin.tech/

    Supported Tasks and Leaderboards

    • closed-domain-qa, text-retrieval: This dataset is intended for closed-domain question answering, but can also be used for information retrieval tasks.
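The F1 and exact-match figures quoted in the summary are the usual span-level metrics for extractive question answering. The following is a minimal sketch of how such scores are typically computed for one prediction, assuming a simplified normalization (lowercasing and removing ASCII punctuation). The function names are hypothetical and this is not the official FQuAD evaluation script, which may normalize text differently.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace ASCII punctuation with spaces, split on whitespace."""
    text = text.lower()
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if both answers have identical normalized tokens, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that misses the leading article gets partial F1 credit
# but no exact-match credit.
em = exact_match("Vierge aux rochers", "La Vierge aux rochers")
f1 = f1_score("Vierge aux rochers", "La Vierge aux rochers")
```

Dataset-level scores are then averages of these per-question values, usually taking the best score over all reference answers when a question has several.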

    Languages

    This dataset is exclusively in French (fr), with context data from Wikipedia and questions written by French university students.

    Dataset Structure

    Data Instances

    default
    • Size of downloaded dataset files: 3.29 MB
    • Size of the generated dataset: 6.94 MB
    • Total amount of disk used: 10.23 MB

    An example of 'validation' looks as follows.

    This example was too long and was cropped:
    
    {
        "answers": {
            "answers_starts": [161, 46, 204],
            "texts": ["La Vierge aux rochers", "documents contemporains", "objets de spéculations"]
        },
        "context": "\"Les deux tableaux sont certes décrits par des documents contemporains à leur création mais ceux-ci ne le font qu'indirectement ...",
        "questions": ["Que concerne principalement les documents ?", "Par quoi sont décrit les deux tableaux ?", "Quels types d'objets sont les deux tableaux aux yeux des chercheurs ?"]
    }
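Each instance groups all questions asked about one paragraph. The sketch below shows one way to build records of this shape, assuming the manually downloaded files follow the SQuAD v1.1 JSON layout (`data` → `paragraphs` → `qas` → `answers`). That layout is an assumption, not something this card states, and the function name and toy input are hypothetical.

```python
def group_by_paragraph(squad_json: dict) -> list[dict]:
    """Turn a SQuAD v1.1-style dict into one record per paragraph."""
    records = []
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            questions, texts, starts = [], [], []
            for qa in paragraph["qas"]:
                # Keep only the first reference answer so the three
                # lists stay aligned index by index.
                answer = qa["answers"][0]
                questions.append(qa["question"])
                texts.append(answer["text"])
                starts.append(answer["answer_start"])
            records.append({
                "context": paragraph["context"],
                "questions": questions,
                "answers": {"texts": texts, "answers_starts": starts},
            })
    return records


# Hypothetical miniature input written for this sketch, not FQuAD data.
toy = {
    "data": [{
        "paragraphs": [{
            "context": "Paris est la capitale de la France.",
            "qas": [
                {"question": "Quelle est la capitale de la France ?",
                 "answers": [{"text": "Paris", "answer_start": 0}]},
                {"question": "De quel pays Paris est-elle la capitale ?",
                 "answers": [{"text": "la France", "answer_start": 25}]},
            ],
        }],
    }],
}
records = group_by_paragraph(toy)
```

With a real file, `squad_json` would come from `json.load` on the downloaded JSON rather than from an inline dictionary.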
    

    Data Fields

    The data fields are the same among all splits.

    default
    • context: a string feature.
    • questions: a list of string features.
    • answers: a dictionary feature containing:
      • texts: a list of string features.
      • answers_starts: a list of int32 features.
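Because `texts` and `answers_starts` are parallel lists, a quick sanity check is to confirm that each answer text appears in `context` at its stated offset. The sketch below assumes the offsets are character indices into the context string, which this card does not state explicitly. The helper name and the sample record are hypothetical.

```python
def misaligned_spans(record: dict) -> list[int]:
    """Return indices of answers whose text is not found at the stated offset."""
    context = record["context"]
    texts = record["answers"]["texts"]
    starts = record["answers"]["answers_starts"]
    bad = []
    for i, (text, start) in enumerate(zip(texts, starts)):
        # Slice the context at the claimed position and compare.
        if context[start:start + len(text)] != text:
            bad.append(i)
    return bad


# Hypothetical record written for this sketch, not taken from FQuAD.
record = {
    "context": "Paris est la capitale de la France.",
    "questions": [
        "Quelle est la capitale de la France ?",
        "De quel pays Paris est-elle la capitale ?",
    ],
    "answers": {"texts": ["Paris", "la France"], "answers_starts": [0, 25]},
}
aligned = misaligned_spans(record)

# Pointing the second answer at the earlier "la" (offset 10) breaks alignment.
shifted = {**record, "answers": {"texts": ["Paris", "la France"],
                                 "answers_starts": [0, 10]}}
broken = misaligned_spans(shifted)
```

A non-empty result usually points to an encoding or whitespace difference between the stored context and the text the offsets were computed on.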

    Data Splits

    The FQuAD dataset has 3 splits: train, validation, and test. The test split, however, is not publicly released at the moment. The splits contain disjoint sets of articles. The following table contains statistics about each split.

    Dataset split    Number of articles    Number of paragraphs    Number of questions
    Train            117                   4921                    20731
    Validation       18                    768                     3188
    Test             10                    532                     2189
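As a quick arithmetic check on the table, the question counts can be summed and turned into per-split shares. This is only a sketch. The variable names are arbitrary and the counts are copied from the table.

```python
# Question counts per split, copied from the table above.
questions = {"train": 20731, "validation": 3188, "test": 2189}

total = sum(questions.values())
shares = {split: count / total for split, count in questions.items()}

print(f"total questions: {total}")
for split, share in shares.items():
    print(f"{split}: {share:.1%}")
```

The total of 26,108 questions is consistent with the "25,000+" figure in the summary, and roughly four out of five questions fall in the train split.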

    Dataset Creation

    Curation Rationale

    The FQuAD dataset was created by Illuin Technology. It was developed to provide a SQuAD equivalent in the French language. Questions are original and based on high-quality Wikipedia articles.

    Source Data

    The text used for the contexts comes from the curated list of French high-quality Wikipedia articles.

    Annotations

    Annotations (spans and questions) were written by students of the CentraleSupélec school of engineering. Wikipedia articles were scraped, and Illuin used an internally developed tool to help annotators ask questions and indicate the answer spans. Annotators were given paragraph-sized contexts and asked to write 4 to 5 non-trivial questions about information in each context.

    Personal and Sensitive Information

    No personal or sensitive information is included in this dataset. This has been manually verified by the dataset curators.

    Considerations for Using the Data

    Users should keep in mind that this dataset is sampled from Wikipedia, which might not be representative of all QA use cases.

    Social Impact of Dataset

    The social biases of this dataset have not yet been investigated.

    Discussion of Biases

    The social biases of this dataset have not yet been investigated, though articles have been selected by their quality and objectivity.

    Other Known Limitations

    The limitations of the FQuAD dataset have not yet been investigated.

    Additional Information

    Dataset Curators

    Illuin Technology: https://fquad.illuin.tech/

    Licensing Information

    The FQuAD dataset is licensed under the CC BY-NC-SA 3.0 license.

    It allows personal and academic research uses of the dataset, but not commercial uses. Concretely, the dataset cannot be used to train a model that is then put into production within a business or a company. For this type of commercial use, we invite FQuAD users to contact the authors to discuss possible partnerships.

    Citation Information

    @ARTICLE{2020arXiv200206071,
           author = {d'Hoffschmidt, Martin and Vidal, Maxime and
             Belblidia, Wacim and Brendlé, Tom},
            title = "{FQuAD: French Question Answering Dataset}",
          journal = {arXiv e-prints},
         keywords = {Computer Science - Computation and Language},
             year = "2020",
            month = "Feb",
              eid = {arXiv:2002.06071},
            pages = {arXiv:2002.06071},
    archivePrefix = {arXiv},
           eprint = {2002.06071},
     primaryClass = {cs.CL}
    }
    

    Contributions

    Thanks to @thomwolf, @mariamabarham, @patrickvonplaten, @lewtun, @albertvillanova for adding this dataset. Thanks to @ManuelFay for providing information on the dataset creation process.