Model:
CAMeL-Lab/bert-base-arabic-camelbert-ca
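CAMeLBERT is a collection of BERT models pre-trained on Arabic texts with different sizes and variants. We release pre-trained language models for Modern Standard Arabic (MSA), dialectal Arabic (DA), and classical Arabic (CA), in addition to a model pre-trained on a mix of the three. We also provide additional models that are pre-trained on a scaled-down set of the MSA variant (half, quarter, eighth, and sixteenth). The details are described in the paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models."
This model card describes CAMeLBERT-CA (bert-base-arabic-camelbert-ca), a model pre-trained on the CA (classical Arabic) dataset.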
| | Model | Variant | Size | #Word |
|---|---|---|---|---|
| | bert-base-arabic-camelbert-mix | CA,DA,MSA | 167GB | 17.3B |
| ✔ | bert-base-arabic-camelbert-ca | CA | 6GB | 847M |
| | bert-base-arabic-camelbert-da | DA | 54GB | 5.8B |
| | bert-base-arabic-camelbert-msa | MSA | 107GB | 12.6B |
| | bert-base-arabic-camelbert-msa-half | MSA | 53GB | 6.3B |
| | bert-base-arabic-camelbert-msa-quarter | MSA | 27GB | 3.1B |
| | bert-base-arabic-camelbert-msa-eighth | MSA | 14GB | 1.6B |
| | bert-base-arabic-camelbert-msa-sixteenth | MSA | 6GB | 746M |
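You can use the released model for either masked language modeling or next sentence prediction. However, it is mostly intended to be fine-tuned on an NLP task, such as NER, POS tagging, sentiment analysis, dialect identification, and poem classification. We release our fine-tuning code here.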
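For token-level tasks such as NER and POS tagging, fine-tuning a WordPiece model usually requires mapping word-level labels onto sub-tokens. The sketch below shows one common convention (label the first sub-token of each word, mask the rest). It is an illustration only and is not taken from the released fine-tuning code; `align_labels`, the toy `word_ids`, and the tag ids are hypothetical. The `word_ids` list has the shape returned by the `word_ids()` method of a fast tokenizer's encoding.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids, word_labels, ignore_index=IGNORE_INDEX):
    """Give each word's label to its first sub-token and mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:          # special tokens such as [CLS] and [SEP]
            aligned.append(ignore_index)
        elif word_id != previous:    # first sub-token of a new word
            aligned.append(word_labels[word_id])
        else:                        # continuation sub-token of the same word
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# Hypothetical example: three words, the second one split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 3, 0]  # hypothetical tag ids, e.g. O, B-LOC, O
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Masked positions do not contribute to the loss, so each word is scored exactly once regardless of how many sub-tokens it is split into.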
How to use
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-arabic-camelbert-ca')
>>> unmasker("الهدف من الحياة هو [MASK] .")
[{'sequence': '[CLS] الهدف من الحياة هو الحياة. [SEP]',
  'score': 0.11048116534948349,
  'token': 3696,
  'token_str': 'الحياة'},
 {'sequence': '[CLS] الهدف من الحياة هو الإسلام. [SEP]',
  'score': 0.03481195122003555,
  'token': 4677,
  'token_str': 'الإسلام'},
 {'sequence': '[CLS] الهدف من الحياة هو الموت. [SEP]',
  'score': 0.03402028977870941,
  'token': 4295,
  'token_str': 'الموت'},
 {'sequence': '[CLS] الهدف من الحياة هو العلم. [SEP]',
  'score': 0.027655426412820816,
  'token': 2789,
  'token_str': 'العلم'},
 {'sequence': '[CLS] الهدف من الحياة هو هذا. [SEP]',
  'score': 0.023059621453285217,
  'token': 2085,
  'token_str': 'هذا'}]
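The pipeline returns a plain list of dictionaries, so the predictions can be post-processed with ordinary Python. The following sketch filters candidates by score and sums the probability mass of the returned candidates. The `predictions` list is copied from the output above (the `sequence` field is left out for brevity); `top_tokens` and the `min_score` threshold are hypothetical names chosen for this example.

```python
# Predictions copied from the fill-mask output above ('sequence' omitted).
predictions = [
    {"score": 0.11048116534948349, "token": 3696, "token_str": "الحياة"},
    {"score": 0.03481195122003555, "token": 4677, "token_str": "الإسلام"},
    {"score": 0.03402028977870941, "token": 4295, "token_str": "الموت"},
    {"score": 0.027655426412820816, "token": 2789, "token_str": "العلم"},
    {"score": 0.023059621453285217, "token": 2085, "token_str": "هذا"},
]


def top_tokens(preds, min_score=0.0):
    """Return (token_str, score) pairs with score >= min_score, best first."""
    kept = [(p["token_str"], p["score"]) for p in preds if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Keep only candidates the model assigns at least 3% probability.
confident = top_tokens(predictions, min_score=0.03)
for token, score in confident:
    print(f"{token}\t{score:.3f}")

# Total probability assigned to the five returned candidates.
covered = sum(p["score"] for p in predictions)
print(f"probability mass of the top 5: {covered:.3f}")
```

For this input the five candidates together cover only about 23% of the probability mass, which indicates that the model's prediction for the masked position is spread over many tokens.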
Note: to download our models, you need transformers>=3.5.0. Otherwise, you can download the models manually.
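One way to check this requirement before loading the model is to compare the installed version with the minimum. This is a minimal sketch: `version_tuple` and `meets_minimum` are hypothetical helpers, and the parsing is deliberately simple (it keeps only the leading digits of each component, so a pre-release such as `3.5.0rc1` is treated like `3.5.0`).

```python
import re
from importlib import metadata

MINIMUM = (3, 5, 0)  # minimum transformers version stated in the note above


def version_tuple(version):
    """Turn '4.41.2' or '4.0.0rc1' into a tuple of three integers."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '4.1'
        parts.append(0)
    return tuple(parts)


def meets_minimum(version, minimum=MINIMUM):
    """Compare numerically, so that '3.10.0' counts as newer than '3.5.0'."""
    return version_tuple(version) >= minimum


try:
    installed = metadata.version("transformers")
    if meets_minimum(installed):
        print(f"transformers {installed} is recent enough")
    else:
        print(f"transformers {installed} is too old; upgrade or download the model manually")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```

Comparing tuples of integers rather than raw strings avoids the classic mistake of ranking `3.10.0` below `3.5.0`.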
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = TFAutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
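In both snippets, `output.last_hidden_state` holds one vector per token. If a single vector per text is needed, one common choice (not something this model card prescribes) is to average the token vectors while skipping padding positions, as indicated by `encoded_input['attention_mask']`. The sketch below shows that arithmetic on plain Python lists so that it runs without downloading the model; `masked_mean` and the toy values are hypothetical, and with real outputs the same computation would normally be written with the framework's tensor operations.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1.

    token_vectors: list of equal-length lists of floats, one per token
    attention_mask: list of 0/1 ints, same length as token_vectors
    """
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is needed per token")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy stand-in for one sequence: three real tokens, one padding position, hidden size 2.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
toy_mask = [1, 1, 1, 0]
sentence_vector = masked_mean(toy_hidden, toy_mask)
print(sentence_vector)
```

For a single unpadded text, as in the examples above, the mask is all ones and this reduces to a plain mean over tokens.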
We use the original implementation released by Google for pre-training. We follow the original English BERT model's hyperparameters for pre-training, unless otherwise specified.
| Task | Dataset | Variant | Mix | CA | DA | MSA | MSA-1/2 | MSA-1/4 | MSA-1/8 | MSA-1/16 |
|---|---|---|---|---|---|---|---|---|---|---|
| NER | ANERcorp | MSA | 80.8% | 67.9% | 74.1% | 82.4% | 82.0% | 82.1% | 82.6% | 80.8% |
| POS | PATB (MSA) | MSA | 98.1% | 97.8% | 97.7% | 98.3% | 98.2% | 98.3% | 98.2% | 98.2% |
| POS | ARZTB (EGY) | DA | 93.6% | 92.3% | 92.7% | 93.6% | 93.6% | 93.7% | 93.6% | 93.6% |
| POS | Gumar (GLF) | DA | 97.3% | 97.7% | 97.9% | 97.9% | 97.9% | 97.9% | 97.9% | 97.9% |
| SA | ASTD | MSA | 76.3% | 69.4% | 74.6% | 76.9% | 76.0% | 76.8% | 76.7% | 75.3% |
| SA | ArSAS | MSA | 92.7% | 89.4% | 91.8% | 93.0% | 92.6% | 92.5% | 92.5% | 92.3% |
| SA | SemEval | MSA | 69.0% | 58.5% | 68.4% | 72.1% | 70.7% | 72.8% | 71.6% | 71.2% |
| DID | MADAR-26 | DA | 62.9% | 61.9% | 61.8% | 62.6% | 62.0% | 62.8% | 62.0% | 62.2% |
| DID | MADAR-6 | DA | 92.5% | 91.5% | 92.2% | 91.9% | 91.8% | 92.2% | 92.1% | 92.0% |
| DID | MADAR-Twitter-5 | MSA | 75.7% | 71.4% | 74.2% | 77.6% | 78.5% | 77.3% | 77.7% | 76.2% |
| DID | NADI | DA | 24.7% | 17.3% | 20.1% | 24.9% | 24.6% | 24.6% | 24.9% | 23.8% |
| Poetry | APCD | CA | 79.8% | 80.9% | 79.6% | 79.7% | 79.9% | 80.0% | 79.7% | 79.8% |
| | Variant | Mix | CA | DA | MSA | MSA-1/2 | MSA-1/4 | MSA-1/8 | MSA-1/16 |
|---|---|---|---|---|---|---|---|---|---|
| Variant-wise-average [1] | MSA | 82.1% | 75.7% | 80.1% | 83.4% | 83.0% | 83.3% | 83.2% | 82.3% |
| Variant-wise-average [1] | DA | 74.4% | 72.1% | 72.9% | 74.2% | 74.0% | 74.3% | 74.1% | 73.9% |
| Variant-wise-average [1] | CA | 79.8% | 80.9% | 79.6% | 79.7% | 79.9% | 80.0% | 79.7% | 79.8% |
| Macro-Average | ALL | 78.7% | 74.7% | 77.1% | 79.2% | 79.0% | 79.2% | 79.1% | 78.6% |
[1]: Variant-wise-average refers to the average over a group of tasks in the same language variant.
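As a worked example, the averages reported for CAMeLBERT-CA can be recomputed from its per-dataset scores in the results table. The grouping by dataset variant follows footnote [1]. Treating the macro-average as the plain mean over all 12 datasets is an assumption inferred from the numbers, not something stated in this card; `variant_wise_average` is a hypothetical helper name.

```python
from statistics import mean

# CA column of the results table: (dataset, variant of the dataset, score in %).
ca_scores = [
    ("ANERcorp", "MSA", 67.9),
    ("PATB (MSA)", "MSA", 97.8),
    ("ARZTB (EGY)", "DA", 92.3),
    ("Gumar (GLF)", "DA", 97.7),
    ("ASTD", "MSA", 69.4),
    ("ArSAS", "MSA", 89.4),
    ("SemEval", "MSA", 58.5),
    ("MADAR-26", "DA", 61.9),
    ("MADAR-6", "DA", 91.5),
    ("MADAR-Twitter-5", "MSA", 71.4),
    ("NADI", "DA", 17.3),
    ("APCD", "CA", 80.9),
]


def variant_wise_average(rows):
    """Average the scores of all datasets that share a language variant."""
    groups = {}
    for _, variant, score in rows:
        groups.setdefault(variant, []).append(score)
    return {variant: round(mean(scores), 1) for variant, scores in groups.items()}


variant_avg = variant_wise_average(ca_scores)
# Assumption: the macro-average is the unweighted mean over all 12 datasets.
macro_avg = round(mean(score for _, _, score in ca_scores), 1)
print(variant_avg)
print(macro_avg)
```

This reproduces the CA column of the summary table (75.7 for MSA, 72.1 for DA, 80.9 for CA, and 74.7 overall). For some other columns, recomputing from the rounded table values differs from the reported averages by a tenth or two, presumably because the published averages were computed from unrounded scores.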
This research was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
@inproceedings{inoue-etal-2021-interplay,
    title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
    author = "Inoue, Go  and
      Alhafni, Bashar  and
      Baimukan, Nurpeiis  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Online)",
    publisher = "Association for Computational Linguistics",
    abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}