Model: Davlan/xlm-roberta-large-masakhaner
Task: Token Classification
Paper: arXiv:2103.11811
Languages: amh, hau, ibo, kin, lug, luo, pcm, swa, wol, yor
Dataset: MasakhaNER
xlm-roberta-large-masakhaner is the first named entity recognition model for 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá) based on a fine-tuned XLM-RoBERTa large model. It achieves state-of-the-art performance on the NER task. The model has been trained to recognize four types of entities: dates & times (DATE), locations (LOC), organizations (ORG), and persons (PER). Specifically, it is an xlm-roberta-large model fine-tuned on an aggregation of African-language datasets obtained from the Masakhane MasakhaNER dataset.
You can use this model with the Transformers pipeline for NER.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("Davlan/xlm-roberta-large-masakhaner")
model = AutoModelForTokenClassification.from_pretrained("Davlan/xlm-roberta-large-masakhaner")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Emir of Kano turban Zhang wey don spend 18 years for Nigeria"
ner_results = nlp(example)
print(ner_results)
```
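The pipeline above returns one prediction per sub-word token, each carrying a B-/I- prefixed tag. As an optional follow-up (a sketch, assuming a transformers release recent enough to support the `aggregation_strategy` argument), you can have the pipeline merge sub-words and B-/I- tags into whole entities:

```python
# Optional: merge sub-word tokens and B-/I- tags into whole entities.
# Assumes a transformers version that supports aggregation_strategy.
nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer,
                       aggregation_strategy="simple")
print(nlp_grouped(example))  # each result has an 'entity_group' key instead of per-token tags
```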
Limitations and bias

This model is limited by its training dataset of entity-annotated news articles from a specific span of time. It may not generalize well to all use cases in different domains.
This model was fine-tuned on 10 African NER datasets (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá) from the Masakhane MasakhaNER dataset.
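If you want to inspect the underlying data, the sketch below is one way to do so, assuming the MasakhaNER data is available on the Hugging Face Hub under the `masakhaner` dataset id with one config per language code (e.g. `"yor"` for Yorùbá):

```python
# Assumption: MasakhaNER is published as the "masakhaner" dataset on the Hub,
# with per-language configs named by language code.
from datasets import load_dataset

yor = load_dataset("masakhaner", "yor")
print(yor["train"][0]["tokens"])                        # word-level tokens of one sentence
print(yor["train"].features["ner_tags"].feature.names)  # the tag inventory described below
```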
The training dataset distinguishes between the beginning and continuation of an entity so that, if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token is classified as one of the following classes:
| Abbreviation | Description |
|---|---|
| O | Outside of a named entity |
| B-DATE | Beginning of a DATE entity right after another DATE entity |
| I-DATE | DATE entity |
| B-PER | Beginning of a person's name right after another person's name |
| I-PER | Person's name |
| B-ORG | Beginning of an organisation right after another organisation |
| I-ORG | Organisation |
| B-LOC | Beginning of a location right after another location |
| I-LOC | Location |
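To make the tag scheme concrete, here is a small illustrative helper (not from the model card or the MasakhaNER codebase) that groups a word-level tag sequence into entity spans; the tags in the usage example are hypothetical, not actual model output:

```python
# Illustrative only: group word-level B-/I-/O tags into (entity_type, text) spans.
def iob_to_spans(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current:
                spans.append((current[0], " ".join(current[1])))
            current = None
        elif tag.startswith("B-") or current is None or current[0] != tag[2:]:
            # A new entity starts: an explicit B- tag, or an I- tag with no
            # open entity of the same type.
            if current:
                spans.append((current[0], " ".join(current[1])))
            current = (tag[2:], [token])
        else:
            current[1].append(token)  # continuation of the open entity
    if current:
        spans.append((current[0], " ".join(current[1])))
    return spans

# Hypothetical tags for a shortened version of the example sentence above.
print(iob_to_spans(
    ["Emir", "of", "Kano", "don", "spend", "18", "years", "for", "Nigeria"],
    ["O", "O", "I-LOC", "O", "O", "I-DATE", "I-DATE", "O", "I-LOC"]))
# -> [('LOC', 'Kano'), ('DATE', '18 years'), ('LOC', 'Nigeria')]
```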
This model was trained on a single NVIDIA V100 GPU with the recommended hyperparameters from the original MasakhaNER paper, which trained and evaluated the model on the MasakhaNER corpus. Per-language test-set F-scores are listed in the table below.
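For orientation, the sketch below shows what such a fine-tuning run can look like with the Hugging Face Trainer. It is not the authors' training script; the dataset id, the single-language config, and all hyperparameter values are illustrative assumptions rather than the paper's recommended settings.

```python
# A minimal fine-tuning sketch, NOT the authors' script. Dataset id/config and
# all hyperparameters are placeholders; see the MasakhaNER paper for the
# recommended values.
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

dataset = load_dataset("masakhaner", "swa")  # one language config, for brevity
label_names = dataset["train"].features["ner_tags"].feature.names
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def tokenize_and_align_labels(batch):
    # Re-align word-level NER tags to sub-word tokens; special tokens and
    # non-initial sub-words get the ignore index -100.
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                labels.append(-100)
            else:
                labels.append(tags[word_id])
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(label_names))

training_args = TrainingArguments(
    output_dir="xlmr-large-masakhaner",
    learning_rate=5e-5,                  # placeholder value
    per_device_train_batch_size=16,      # placeholder value
    num_train_epochs=5,                  # placeholder value
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```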
| language | F1-score |
|---|---|
| amh | 75.76 |
| hau | 91.75 |
| ibo | 86.26 |
| kin | 76.38 |
| lug | 84.64 |
| luo | 80.65 |
| pcm | 89.55 |
| swa | 89.48 |
| wol | 70.70 |
| yor | 82.05 |
```bibtex
@article{adelani21tacl,
  title   = {Masakha{NER}: Named Entity Recognition for African Languages},
  author  = {David Ifeoluwa Adelani and Jade Abbott and Graham Neubig and Daniel D'souza and Julia Kreutzer and Constantine Lignos and Chester Palen-Michel and Happy Buzaaba and Shruti Rijhwani and Sebastian Ruder and Stephen Mayhew and Israel Abebe Azime and Shamsuddeen Muhammad and Chris Chinenye Emezue and Joyce Nakatumba-Nabende and Perez Ogayo and Anuoluwapo Aremu and Catherine Gitau and Derguene Mbaye and Jesujoba Alabi and Seid Muhie Yimam and Tajuddeen Gwadabe and Ignatius Ezeani and Rubungo Andre Niyongabo and Jonathan Mukiibi and Verrah Otiende and Iroro Orife and Davis David and Samba Ngom and Tosin Adewumi and Paul Rayson and Mofetoluwa Adeyemi and Gerald Muriuki and Emmanuel Anebi and Chiamaka Chukwuneke and Nkiruka Odu and Eric Peter Wairagala and Samuel Oyerinde and Clemencia Siro and Tobius Saul Bateesa and Temilola Oloyede and Yvonne Wambui and Victor Akinode and Deborah Nabagereka and Maurice Katusiime and Ayodele Awokoya and Mouhamadane MBOUP and Dibora Gebreyohannes and Henok Tilaye and Kelechi Nwaike and Degaga Wolde and Abdoulaye Faye and Blessing Sibanda and Orevaoghene Ahia and Bonaventure F. P. Dossou and Kelechi Ogueji and Thierno Ibrahima DIOP and Abdoulaye Diallo and Adewale Akinfaderin and Tendai Marengereke and Salomey Osei},
  journal = {Transactions of the Association for Computational Linguistics (TACL)},
  url     = {https://arxiv.org/abs/2103.11811},
  year    = {2021}
}
```