projecte-aina/xquad-ca

Dataset Card for XQuAD-Ca

Dataset Summary

Professional translation into Catalan of the XQuAD dataset.

XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar, Pranav et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Romanian was added later. We added Catalan as the 13th language of the corpus, also using professional native Catalan translators.

The XQuAD and XQuAD-Ca datasets are released under a CC BY-SA licence.

Supported Tasks and Leaderboards

Cross-lingual-QA, Extractive-QA, Language Model

Languages

The dataset is in Catalan (ca-CA).

Dataset Structure

Data Instances

One JSON file.

1189 examples.

{
  "data": [
    {
      "paragraphs": [
        {
          "context": "Al llarg de la seva existència, Varsòvia ha estat una ciutat multicultural. Segons el cens del 1901, de 711.988 habitants, el 56,2 % eren catòlics, el 35,7 % jueus, el 5 % cristians ortodoxos grecs i el 2,8 % protestants. Vuit anys després, el 1909, hi havia 281.754 jueus (36,9 %), 18.189 protestants (2,4 %) i 2.818 mariavites (0,4 %). Això va provocar que es construïssin centenars de llocs de culte religiós a totes les parts de la ciutat. La majoria d’ells es van destruir després de la insurrecció de Varsòvia del 1944. Després de la guerra, les noves autoritats comunistes de Polònia van apocar la construcció d’esglésies i només se’n va construir un petit nombre.",
          "qas": [
            {
              "answers": [
                {
                  "text": "711.988",
                  "answer_start": 104
                }
              ],
              "id": "57338007d058e614000b5bdb",
              "question": "Quina era la població de Varsòvia l’any 1901?"
            },
            {
              "answers": [
                {
                  "text": "56,2 %",
                  "answer_start": 126
                }
              ],
              "id": "57338007d058e614000b5bdc",
              "question": "Dels habitants de Varsòvia l’any 1901, quin percentatge era catòlic?"
            },
            ...
          ]
        }
      ]
    },
    ...
  ]
}
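The nested layout above can be flattened into one record per question. The following is a minimal sketch, assuming the SQuAD v1.1 nesting (data, paragraphs, qas) that the card says the file follows. The function name iter_examples and the inline sample, whose context is truncated from the example above, are illustrative and not part of any dataset tooling.

```python
import json
from typing import Any, Dict, Iterator


def iter_examples(squad: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per question from a SQuAD v1.1-style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    # The excerpt in the card shows no title, so default to "".
                    "title": article.get("title", ""),
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answers": qa["answers"],
                }


# Illustrative sample: the context is truncated from the card's example.
SAMPLE_JSON = """
{
  "data": [
    {
      "paragraphs": [
        {
          "context": "Al llarg de la seva existència, Varsòvia ha estat una ciutat multicultural. Segons el cens del 1901, de 711.988 habitants, el 56,2 % eren catòlics",
          "qas": [
            {
              "answers": [{"text": "711.988", "answer_start": 104}],
              "id": "57338007d058e614000b5bdb",
              "question": "Quina era la població de Varsòvia l’any 1901?"
            },
            {
              "answers": [{"text": "56,2 %", "answer_start": 126}],
              "id": "57338007d058e614000b5bdc",
              "question": "Dels habitants de Varsòvia l’any 1901, quin percentatge era catòlic?"
            }
          ]
        }
      ]
    }
  ]
}
"""

examples = list(iter_examples(json.loads(SAMPLE_JSON)))
for example in examples:
    print(example["id"], example["question"])
```

For the real data, the same function could be applied to the result of json.load on the JSON file, opened with encoding="utf-8".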

Data Fields

The fields follow the SQuAD v1 format (Rajpurkar, Pranav et al., 2016).

  • id (str): Unique ID assigned to the question.
  • title (str): Title of the Wikipedia article.
  • context (str): Wikipedia section text.
  • question (str): Question.
  • answers (list): List of answers to the question, each containing:
    • text (str): Span text answering the question.
    • answer_start (int): Starting character offset of the answer span within the context.
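The relationship between text and answer_start can be checked directly. The sketch below assumes that answer_start is a character offset into context, as in SQuAD v1.1. The helper name misaligned_answers is hypothetical, and the context is truncated from the example instance in this card.

```python
from typing import Any, Dict, List


def misaligned_answers(context: str, answers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the answers whose text does not appear at answer_start."""
    bad = []
    for answer in answers:
        start = answer["answer_start"]
        end = start + len(answer["text"])
        # Slice the context at the declared offset and compare to the span.
        if context[start:end] != answer["text"]:
            bad.append(answer)
    return bad


# Context truncated from the card's example instance.
context = (
    "Al llarg de la seva existència, Varsòvia ha estat una ciutat "
    "multicultural. Segons el cens del 1901, de 711.988 habitants, "
    "el 56,2 % eren catòlics"
)
answers = [
    {"text": "711.988", "answer_start": 104},
    {"text": "56,2 %", "answer_start": 126},
]

print(misaligned_answers(context, answers))

# A deliberately shifted offset is reported as misaligned.
shifted = [{"text": "711.988", "answer_start": 105}]
print(misaligned_answers(context, shifted))
```

A check like this is useful after any preprocessing that changes the context string (for example Unicode normalization or whitespace stripping), since such steps can shift the offsets.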

Data Splits

  • test.json: 1189 examples.

Dataset Creation

Curation Rationale

We created this dataset to contribute to the development of language models in Catalan, a low-resource language, to ensure compatibility with similar datasets in other languages, and to allow cross-lingual comparisons.

Source Data

Initial Data Collection and Normalization

This dataset is a professional translation of XQuAD into Catalan, commissioned by BSC TeMU within Projecte AINA.

For more information on how XQuAD was created, refer to the paper On the Cross-lingual Transferability of Monolingual Representations, or visit the XQuAD webpage.

Who are the source language producers?

For more information on how XQuAD was created, refer to the paper On the Cross-lingual Transferability of Monolingual Representations, or visit the XQuAD webpage.

Annotations

This is a professional translation of the XQuAD corpus and its annotations.

Annotation process

[N/A]

Who are the annotators?

The translation was commissioned from a professional translation company.

Personal and Sensitive Information

No personal or sensitive information is included.

Considerations for Using the Data

Social Impact of Dataset

This dataset contributes to the development of language models in Catalan, a low-resource language.

Discussion of Biases

[N/A]

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Carlos Rodríguez-Penagos (carlos.rodriguez1@bsc.es) and Carme Armentano-Oller (carme.armentano@bsc.es) from BSC-CNS.

This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

Licensing Information

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

Citation Information

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

DOI

Contributions

[N/A]