Model:
castorini/afriberta_base
Languages:
AfriBERTa base is a pretrained multilingual language model with around 111 million parameters. The model has 8 layers, 6 attention heads, 768 hidden units, and a feed-forward size of 3072. It was pretrained on 11 African languages, namely Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya, and Yorùbá. The model shows competitive downstream performance on several African languages, including languages it was not pretrained on.
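To check these hyperparameters yourself, you can inspect the checkpoint's configuration. This is a minimal sketch; it assumes the checkpoint exposes a standard RoBERTa/XLM-R-style config through Transformers, so the attribute names shown here are the usual ones for that config class:

>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained("castorini/afriberta_base")
# these fields should correspond to the sizes described above:
# 8 layers, 6 attention heads, 768 hidden units, 3072 feed-forward size
>>> config.num_hidden_layers, config.num_attention_heads
>>> config.hidden_size, config.intermediate_size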
You can use this model with the Transformers library for any downstream task. For example, assuming we want to fine-tune this model on a token classification task, we would do the following:
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_base")
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_base")
# we have to manually set the model max length because it is an imported sentencepiece model, which huggingface does not properly support right now
>>> tokenizer.model_max_length = 512
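Before fine-tuning, a quick forward pass can confirm that the tokenizer and model load and run together. This is a minimal sketch continuing the snippet above; the Swahili sentence is an arbitrary placeholder, and since the token-classification head is randomly initialized until fine-tuning, the logits are not meaningful yet:

>>> import torch
>>> inputs = tokenizer("Habari za asubuhi", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
# logits have shape (batch_size, sequence_length, num_labels)
>>> outputs.logits.shape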
Limitations and bias
The model was trained on an aggregation of datasets from the BBC news website and Common Crawl, so its behaviour may reflect the domain coverage and biases of this largely news-based corpus.
For information on the training procedure, please refer to the AfriBERTa paper or repository.
@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}