阿拉伯语命名实体识别（NER）模型使用Flair嵌入

训练进行了94个时期，使用了线性衰减学习速率为2e-05，从0.225开始，并且使用了32批次的GloVe和Flair前向和后向嵌入。

原始数据集：

结果：

F1得分（总体）0.8666
F1得分（宏）0.8488

Named Entity Type	True Posititves	False Positives	False Negatives	Precision	Recall	class-F1
LOC	Location	539	51	68	0.9136	0.8880	0.9006
MISC	Miscellaneous	408	57	89	0.8774	0.8209	0.8482
ORG	Organisation	167	43	64	0.7952	0.7229	0.7574
PER	Person (no title)	501	65	60	0.8852	0.8930	0.8891

用法

from flair.data import Sentence
from flair.models import SequenceTagger
import pyarabic.araby as araby
from icecream import ic

tagger = SequenceTagger.load("julien-c/flair-ner")
arTagger = SequenceTagger.load('megantosh/flair-arabic-multi-ner')

sentence = Sentence('George Washington went to Washington .')
arSentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .')


# predict NER tags
tagger.predict(sentence)
arTagger.predict(arSentence)

# print sentence with predicted tags
ic(sentence.to_tagged_string)
ic(arSentence.to_tagged_string)

示例

2021-07-07 14:30:59,649 loading file /Users/mega/.flair/models/flair-ner/f22eb997f66ae2eacad974121069abaefca5fe85fce71b49e527420ff45b9283.941c7c30b38aef8d8a4eb5c1b6dd7fe8583ff723fef457382589ad6a4e859cfc
2021-07-07 14:31:04,654 loading file /Users/mega/.flair/models/flair-arabic-multi-ner/c7af7ddef4fdcc681fcbe1f37719348afd2862b12aa1cfd4f3b93bd2d77282c7.242d030cb106124f7f9f6a88fb9af8e390f581d42eeca013367a86d585ee6dd6
ic| sentence.to_tagged_string: <bound method Sentence.to_tagged_string of Sentence: "George Washington went to Washington ."   [− Tokens: 6  − Token-Labels: "George <B-PER> Washington <E-PER> went to Washington <S-LOC> ."]>
ic| arSentence.to_tagged_string: <bound method Sentence.to_tagged_string of Sentence: "عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة ."   [− Tokens: 11  − Token-Labels: "عمرو <B-PER> عادلي <I-PER> أستاذ للاقتصاد السياسي المساعد في الجامعة <B-ORG> الأمريكية <I-ORG> بالقاهرة <B-LOC> ."]>
ic| entity: <PER-span (1,2): "George Washington">
ic| entity: <LOC-span (5): "Washington">
ic| entity: <PER-span (1,2): "عمرو عادلي">
ic| entity: <ORG-span (8,9): "الجامعة الأمريكية">
ic| entity: <LOC-span (10): "بالقاهرة">
ic| sentence.to_dict(tag_type='ner'): 
{"text":"عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .",
"labels":[],
{"entities":[{{{
               "text":"عمرو عادلي",
               "start_pos":0,
               "end_pos":10,
               "labels":[PER (0.9826)]},
            {"text":"الجامعة الأمريكية",
               "start_pos":45,
               "end_pos":62,
               "labels":[ORG (0.7679)]},
            {"text":"بالقاهرة",
               "start_pos":64,
               "end_pos":72,
               "labels":[LOC (0.8079)]}]}
"text":"George Washington went to Washington .",
"labels":[],
"entities":[{
           {"text":"George Washington",
            "start_pos":0,
            "end_pos":17,
            "labels":[PER (0.9968)]},
           {"text":"Washington""start_pos":26,
            "end_pos":36,
            "labels":[LOC (0.9994)]}}]}

模型配置

SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4196, out_features=4196, bias=True)
  (rnn): LSTM(4196, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=15, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None

由于从右到左在从左到右的上下文中，可能会出现一些格式错误。您的代码可能会出现 this 等错误（2020-10-27访问的链接）

引用

如果您使用了这个模型，请考虑引用 this work ：

@unpublished{MMHU21
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
year = {2021},
doi = "10.13140/RG.2.2.34961.10084"
url = {https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects}
}

作者:

Mohamed Megahed

数据集大小:

524.88 MB