Dataset:

projecte-aina/vilasum

Language:

ca

Multilinguality:

monolingual

Language creators:

expert-generated

Annotation creators:

machine-generated

ArXiv:

arxiv:2202.06871

Dataset Card for VilaSum

Dataset Summary

VilaSum is a summarization dataset for evaluation. It is extracted from a newswire corpus crawled from the Catalan news portal VilaWeb. The corpus consists of 13,843 instances, each composed of a headline and a body.

Supported Tasks and Leaderboards

The dataset can be used to evaluate models for abstractive summarization. Performance on this task is typically measured with ROUGE scores, where higher is better. The mbart-base-ca-casum model currently achieves a score of 35.04.
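As a rough illustration of what a ROUGE-style score measures, the sketch below computes a simplified ROUGE-1 F1 (unigram overlap) between a reference headline and a candidate summary. It assumes lowercase whitespace tokenization and is not the official ROUGE implementation; the card does not state which ROUGE variant the reported 35.04 refers to, so numbers from this sketch are not comparable with it. The function name `rouge1_f1` and the candidate sentence are made up for the example.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with whitespace tokenization."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Reference taken from the start of the example headline; candidate is invented.
reference = "Un vídeo corrobora les agressions a dues animalistes"
candidate = "Un vídeo mostra les agressions a dues activistes"
score = rouge1_f1(reference, candidate)
```

For reported results, an established ROUGE package is preferable to a hand-rolled metric like this one, since tokenization and stemming choices change the score.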

Languages

The dataset is in Catalan (ca-CA).

Dataset Structure

Data Instances

{
  "summary": "Un vídeo corrobora les agressions a dues animalistes en un correbou del Mas de Barberans",
  "text": "Noves imatges, a les quals ha tingut accés l'ACN, certifiquen les agressions i la destrucció del material d'enregistrament que han denunciat dues activistes d'AnimaNaturalis en la celebració d'un acte de bous a la plaça al Mas de Barberans (Montsià). En el vídeo es veu com unes quantes persones s'abalancen sobre les noies que reben estirades i cops mentre els intenten prendre les càmeres. Membres de la comissió taurina intervenen per aturar els presumptes agressors però es pot escoltar com part del públic victoreja la situació. Els Mossos d'Esquadra presentaran aquest dilluns al migdia l'atestat dels fets al Jutjat d'Amposta. Dissabte ja es van detenir quatre persones que van quedar en llibertat a l'espera de ser cridats pel jutge. Es tracta de tres homes i una dona de Sant Carles de la Ràpita, tots ells membres de la mateixa família."
}

Data Fields

  • summary (str): Summary of the piece of news
  • text (str): The text of the piece of news

Data Splits

Because the dataset is small, we use it only as a test set for evaluation.

  • test: 13,843 examples
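A minimal sketch of how the single test split might be loaded and sanity-checked. It assumes the Hugging Face `datasets` library is installed and that the Hub repository `projecte-aina/vilasum` can be loaded with `split="test"`; the helper names `load_vilasum_test` and `is_valid_instance` are hypothetical and not part of any library.

```python
EXPECTED_FIELDS = ("summary", "text")  # field names from the Data Fields section


def load_vilasum_test():
    """Load the test split (needs the `datasets` package and network access)."""
    # Imported lazily so the validation helper below also works offline.
    from datasets import load_dataset

    return load_dataset("projecte-aina/vilasum", split="test")


def is_valid_instance(instance: dict) -> bool:
    """Return True if both documented fields are non-empty strings."""
    return all(
        isinstance(instance.get(field), str) and bool(instance[field].strip())
        for field in EXPECTED_FIELDS
    )


# Usage (not run here because it downloads data):
# dataset = load_vilasum_test()
# invalid = [row for row in dataset if not is_valid_instance(row)]
```

Since there is no train or validation split, any model evaluated here has to be trained on other data.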

Dataset Creation

Curation Rationale

We created this corpus to contribute to the development of language models in Catalan, a low-resource language. There exist few resources for summarization in Catalan.

Source Data

Initial Data Collection and Normalization

We collected the headline and the corresponding body of each news piece on VilaWeb and applied the following cleaning pipeline: deduplicating the documents, removing documents with empty attributes, and deleting some boilerplate sentences.
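The three cleaning steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the card does not list the boilerplate sentences that were removed, so the pattern below is a hypothetical placeholder, deduplication is assumed to mean exact matches on the (summary, text) pair, and the order of the steps is chosen for convenience.

```python
import re

# Hypothetical boilerplate pattern; the actual removed sentences are not documented.
BOILERPLATE_PATTERNS = [re.compile(r"\s*Subscribe to our newsletter\.")]


def clean_corpus(records):
    """Strip boilerplate, drop records with empty attributes, remove exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        summary = (record.get("summary") or "").strip()
        text = (record.get("text") or "").strip()
        for pattern in BOILERPLATE_PATTERNS:
            text = pattern.sub("", text).strip()
        if not summary or not text:
            continue  # skip records with an empty attribute
        key = (summary, text)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append({"summary": summary, "text": text})
    return cleaned


# Invented sample records, for illustration only.
sample = [
    {"summary": "A", "text": "Body one. Subscribe to our newsletter."},
    {"summary": "A", "text": "Body one."},
    {"summary": "", "text": "Body two."},
    {"summary": "B", "text": None},
]
cleaned = clean_corpus(sample)
```

Removing boilerplate before deduplicating means two copies of an article that differ only in a boilerplate sentence collapse into one record.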

Who are the source language producers?

The news portal VilaWeb.

Annotations

The dataset is unannotated.

Annotation process

[N/A]

Who are the annotators?

[N/A]

Personal and Sensitive Information

Since all data comes from public websites, no anonymization process was performed.

Considerations for Using the Data

Social Impact of Dataset

We hope this corpus contributes to the development of summarization models in Catalan, a low-resource language.

Discussion of Biases

We are aware that since the data comes from unreliable web pages, some biases may be present in the dataset. Nonetheless, we have not applied any steps to reduce their impact.

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

This work was funded by the MT4All CEF project and the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

Licensing information

Creative Commons Attribution 4.0 International.

Citation Information

If you use any of these resources (datasets or models) in your work, please cite our latest preprint:

@misc{degibert2022sequencetosequence,
      title={Sequence-to-Sequence Resources for Catalan}, 
      author={Ona de Gibert and Ksenia Kharitonova and Blanca Calvo Figueras and Jordi Armengol-Estapé and Maite Melero},
      year={2022},
      eprint={2202.06871},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

[N/A]