Dataset:

projecte-aina/ancora-ca-ner

Language:

ca

Multilinguality:

monolingual

Language creators:

found

Annotations creators:

expert-generated

ArXiv:

arxiv:2107.07903

License:

cc-by-4.0

Dataset Card for AnCora-Ca-NER

Dataset Summary

This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts the AnCora corpus for machine learning and language model evaluation purposes.

The AnCora corpus is used under a CC-BY licence.

This dataset was developed by BSC TeMU as part of Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).

Supported Tasks and Leaderboards

Named Entity Recognition, Language Modeling

Languages

The dataset is in Catalan (ca-CA).

Dataset Structure

Data Instances

Three two-column files, one for each split.

    Fundació B-ORG
    Privada I-ORG
    Fira I-ORG
    de I-ORG
    Manresa I-ORG
    ha O
    fet O
    un O
    balanç O
    de O
    l' O
    activitat O
    del O
    Palau B-LOC
    Firal I-LOC
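To make the tagging scheme in the sample concrete, the following is a minimal sketch of how the IOB tags could be grouped back into entity mentions. The function name `iob_to_spans` is hypothetical and not part of the dataset or any library. Treating a stray `I-` tag as the start of a new entity is an assumption made here for robustness.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # Close any open entity, then open a new one.
            # Assumption: an I- tag with no matching open entity starts a new one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# The sample shown above, one "token tag" pair per line.
sample = """Fundació B-ORG
Privada I-ORG
Fira I-ORG
de I-ORG
Manresa I-ORG
ha O
fet O
un O
balanç O
de O
l' O
activitat O
del O
Palau B-LOC
Firal I-LOC"""

tokens, tags = zip(*(line.split() for line in sample.splitlines()))
spans = iob_to_spans(tokens, tags)
print(spans)
```

On this sample the sketch yields two mentions: one ORG ("Fundació Privada Fira de Manresa") and one LOC ("Palau Firal").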

Data Fields

Every file has two columns, with the word form or punctuation symbol in the first one and the corresponding IOB tag in the second one.
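A two-column file of this kind could be read with a few lines of Python. This is a sketch under assumptions, not the curators' loader: it assumes columns are separated by whitespace and that sentences are separated by blank lines (a common CoNLL-style convention that this card does not state explicitly). The name `read_iob` and the small `example` input are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Parse 'token tag' lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Assumption: a blank line marks a sentence boundary.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace: first column is the token,
        # second column is the IOB tag.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical input built from tokens in the sample above.
example = ["Palau B-LOC\n", "Firal I-LOC\n", "\n", "ha O\n", "fet O\n"]
parsed = read_iob(example)
print(parsed)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `read_iob(open(path, encoding="utf-8"))` with `path` pointing at one of the split files.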

Data Splits

We took the original train, dev and test splits from the Universal Dependencies (UD) version of the corpus:

  • train: 10,630 examples
  • validation: 1,429 examples
  • test: 1,528 examples

Dataset Creation

Curation Rationale

We created this corpus to contribute to the development of language models in Catalan, a low-resource language.

Source Data

Initial Data Collection and Normalization

AnCora consists of a Catalan corpus (AnCora-CA) and a Spanish corpus (AnCora-ES), each of 500,000 tokens (some of them multi-word). The corpora are annotated for linguistic phenomena at different levels. The AnCora corpus is mainly based on newswire texts. For more information, refer to Taulé, M., M.A. Martí, M. Recasens (2009): "AnCora: Multilevel Annotated Corpora for Catalan and Spanish", Proceedings of the 6th International Conference on Language Resources and Evaluation.

Who are the source language producers?

The Catalan AnCora corpus is compiled from articles from the following news outlets: EFE, ACN, El Periodico.

Annotations

Annotation process

We adapted the NER labels from the AnCora corpus to a token-per-line, multi-column format.
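As an illustration of what a token-per-line adaptation can look like, the sketch below turns span-level entity annotations into IOB-tagged lines. It is not the curators' conversion script. The input representation (token indices with an exclusive end) and the name `spans_to_iob` are assumptions chosen for the example, since AnCora's native annotation format is not described here.

```python
def spans_to_iob(tokens, entities):
    """Emit 'token tag' lines from (start, end, label) spans.

    Assumption: start/end are token indices, end is exclusive,
    and spans do not overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the entity
    return [f"{token} {tag}" for token, tag in zip(tokens, tags)]


# Tokens taken from the sample in "Data Instances".
tokens = ["activitat", "del", "Palau", "Firal"]
lines = spans_to_iob(tokens, [(2, 4, "LOC")])
print("\n".join(lines))
```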

Who are the annotators?

The original annotators of the AnCora corpus.

Personal and Sensitive Information

No personal or sensitive information is included.

Considerations for Using the Data

Social Impact of Dataset

We hope this corpus contributes to the development of language models in Catalan, a low-resource language.

Discussion of Biases

[N/A]

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

Licensing information

This work is licensed under a Creative Commons Attribution 4.0 International License.

Citation Information

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

DOI

Contributions

[N/A]