Model:
castorini/afriberta_base
Languages:
AfriBERTa base is a pretrained multilingual language model with around 111 million parameters. The model has 8 layers, 6 attention heads, 768 hidden units, and a feed-forward size of 3072. It was pretrained on 11 African languages, namely Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya, and Yorùbá. The model shows competitive downstream performance on several African languages, including languages it was not pretrained on.
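To check these hyperparameters yourself, you can inspect the checkpoint's configuration. This is a minimal sketch; it assumes the checkpoint exposes a standard RoBERTa/XLM-R-style config through Transformers, so the attribute names shown here are the usual ones for that config class:

>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained("castorini/afriberta_base")
# these fields should correspond to the sizes described above:
# 8 layers, 6 attention heads, 768 hidden units, 3072 feed-forward size
>>> config.num_hidden_layers, config.num_attention_heads
>>> config.hidden_size, config.intermediate_size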
You can use this model with the Transformers library for any downstream task. For example, assuming we want to fine-tune this model on a token classification task, we would do the following:
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_base")
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_base")
# we have to manually set the model max length because it is an imported sentencepiece model, which huggingface does not properly support right now
>>> tokenizer.model_max_length = 512
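Before fine-tuning, a quick forward pass can confirm that the tokenizer and model load and run together. This is a minimal sketch continuing the snippet above; the Swahili sentence is an arbitrary placeholder, and since the token-classification head is randomly initialized until fine-tuning, the logits are not meaningful yet:

>>> import torch
>>> inputs = tokenizer("Habari za asubuhi", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
# logits have shape (batch_size, sequence_length, num_labels)
>>> outputs.logits.shape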
Limitations and bias
The model was trained on an aggregation of datasets from the BBC news website and Common Crawl, so its behaviour may reflect the domain coverage and biases of this largely news-based corpus.
For information on the training procedure, please refer to the AfriBERTa paper or repository.
@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}