Dataset:

projecte-aina/vilaquad

Tasks:

Question answering

Sub-tasks:

extractive-qa

Languages:

Multilinguality:

monolingual

Size:

1K<n<10K

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

original

Preprints:

arxiv:2107.07903 arxiv:1606.05250

License:

cc-by-sa-4.0

Dataset Card for VilaQuAD

Dataset Summary

VilaQuAD is an extractive QA dataset for Catalan, built from VilaWeb newswire text.

This dataset contains 2095 Catalan-language news articles, each with 1 to 5 questions referring to the fragment (or context).

VilaQuAD articles are extracted from the daily VilaWeb and used under a CC-by-nc-sa-nd licence.

This dataset can be used to build extractive QA systems and language models.

Supported Tasks and Leaderboards

Extractive QA, language modeling.

Languages

The dataset is in Catalan (ca-CA).

Dataset Structure

Data Instances

{
  'id': 'P_556_C_556_Q1',
  'title': "El Macba posa en qüestió l'eufòria amnèsica dels anys vuitanta a l'estat espanyol",
  'context': "El Macba ha obert una nova exposició, 'Gelatina dura. Històries escamotejades dels 80', dedicada a revisar el discurs hegemònic que es va instaurar en aquella dècada a l'estat espanyol, concretament des del començament de la transició, el 1977, fins a la fita de Barcelona 92. És una mirada en clau espanyola, però també centralista, perquè més enllà dels esdeveniments ocorreguts a Catalunya i els artistes que els van combatre, pràcticament només s'hi mostren fets polítics i culturals generats des de Madrid. No es parla del País Basc, per exemple. Però, dit això, l'exposició revisa aquesta dècada de la història recent tot qüestionant un triomfalisme homogeneïtzador, que ja se sap que va arrasar una gran quantitat de sectors crítics i radicals de l'àmbit social, polític i cultural. Com diu la comissària, Teresa Grandas, de l'equip del Macba: 'El relat oficial dels anys vuitanta a l'estat espanyol va prioritzar la necessitat per damunt de la raó i va consolidar una mirada que privilegiava el futur abans que l'anàlisi del passat recent, obviant qualsevol consideració crítica respecte de la filiació amb el poder franquista.",
  'question': 'Com es diu la nova exposició que ha obert el Macba?',
  'answers': [
    {
      'text': "'Gelatina dura. Històries escamotejades dels 80'",
      'answer_start': 38
    }
  ]
}

Data Fields

The fields follow Rajpurkar et al. (2016) for SQuAD v1 datasets.

id (str): Unique ID assigned to the question.
title (str): Title of the VilaWeb article.
context (str): VilaWeb section text.
question (str): Question.
answers (list): List of answers to the question, each containing:
- text (str): Span text answering the question.
- answer_start (int): Starting character offset of the span text answering the question.
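
The relationship between context, text and answer_start can be checked mechanically. The following is a minimal Python sketch, assuming answer_start is a 0-based character offset into context, which is consistent with the instance shown under Data Instances. The helper name answer_span_matches is hypothetical, and the context is truncated here for brevity.

```python
def answer_span_matches(instance: dict) -> bool:
    """Return True if every answer text appears in the context at its answer_start offset."""
    context = instance["context"]
    for answer in instance["answers"]:
        start = answer["answer_start"]
        text = answer["text"]
        # Slice the context at the stated offset and compare with the answer text.
        if context[start:start + len(text)] != text:
            return False
    return True


# Abbreviated copy of the instance from Data Instances (context truncated).
example = {
    "id": "P_556_C_556_Q1",
    "context": (
        "El Macba ha obert una nova exposició, 'Gelatina dura. "
        "Històries escamotejades dels 80', dedicada a revisar el discurs hegemònic"
    ),
    "question": "Com es diu la nova exposició que ha obert el Macba?",
    "answers": [
        {
            "text": "'Gelatina dura. Històries escamotejades dels 80'",
            "answer_start": 38,
        }
    ],
}

print(answer_span_matches(example))
```

A check like this is useful before training, because a mismatch between offset and text usually points to an encoding or whitespace normalization problem in preprocessing.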

Data Splits

train.json: 1295 contexts, 3882 questions
dev.json: 400 contexts, 1200 questions
test.json: 400 contexts, 1200 questions
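
As a quick consistency check, the split sizes above can be summed and compared with the totals stated elsewhere in this card (2095 articles and 6282 question-answer pairs). This is a small arithmetic sketch in Python using only the figures listed here; the dictionary layout is an illustrative choice, not the format of the data files.

```python
# Split sizes as listed in this card.
splits = {
    "train": {"contexts": 1295, "questions": 3882},
    "dev": {"contexts": 400, "questions": 1200},
    "test": {"contexts": 400, "questions": 1200},
}

total_contexts = sum(split["contexts"] for split in splits.values())
total_questions = sum(split["questions"] for split in splits.values())

print(total_contexts, total_questions)  # 2095 6282
```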

Dataset Creation

Curation Rationale

We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

Source Data

VilaWeb site

Initial Data Collection and Normalization

The source data are scraped articles from the archives of the Catalan newspaper website VilaWeb.

From the online edition of the newspaper VilaWeb, 2095 articles were randomly selected. These headlines were also used to create a Textual Entailment dataset. For the extractive QA dataset, creation of between 1 and 5 questions for each news context was commissioned, following an adaptation of the guidelines from SQuAD 1.0 (Rajpurkar, Pranav et al. (2016)). In total, 6282 pairs of a question and an extracted fragment that contains the answer were created.

For compatibility with similar datasets in other languages, we followed existing curation guidelines as closely as possible. We also created another QA dataset from Wikipedia to ensure thematic and stylistic variety.

Who are the source language producers?

Professional journalists from the Catalan newspaper VilaWeb.

Annotations

Annotation process

We commissioned the creation of 1 to 5 questions for each context, following an adaptation of the guidelines from SQuAD 1.0 (Rajpurkar, Pranav et al. (2016)).

Who are the annotators?

Annotation was commissioned from a specialized company that hired a team of native language speakers.

Personal and Sensitive Information

No personal or sensitive information is included.

Considerations for Using the Data

Social Impact of Dataset

We hope this dataset contributes to the development of language models in Catalan, a low-resource language.

Discussion of Biases

[N/A]

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

Licensing Information

This work is licensed under an Attribution-ShareAlike 4.0 International License.

Citation Information

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

DOI

Contributions

[N/A]

Author:

projecte-aina

Dataset size:

3.65 MB