Model:

CAMeL-Lab/bert-base-arabic-camelbert-msa

Task:

Fill-Mask

Libraries:

PyTorch TensorFlow JAX Transformers

Language:

Other:

bert AutoTrain Compatible

Preprint:

arxiv:2103.06678

License:

apache-2.0

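CAMeLBERT: A collection of pre-trained models for Arabic NLP tasks

Model description

CAMeLBERT is a collection of BERT models pre-trained on Arabic texts with different sizes and variants. We release pre-trained language models for Modern Standard Arabic (MSA), dialectal Arabic (DA), and classical Arabic (CA), in addition to a model pre-trained on a mix of the three. We also provide additional models that are pre-trained on a scaled-down set of the MSA variant (half, quarter, eighth, and sixteenth). The details are described in the paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models".

This model card describes CAMeLBERT-MSA (bert-base-arabic-camelbert-msa), a model pre-trained on the entire MSA dataset.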

This card	Model	Variant	Size	#Words
-	bert-base-arabic-camelbert-mix	CA,DA,MSA	167GB	17.3B
-	bert-base-arabic-camelbert-ca	CA	6GB	847M
-	bert-base-arabic-camelbert-da	DA	54GB	5.8B
✔	bert-base-arabic-camelbert-msa	MSA	107GB	12.6B
-	bert-base-arabic-camelbert-msa-half	MSA	53GB	6.3B
-	bert-base-arabic-camelbert-msa-quarter	MSA	27GB	3.1B
-	bert-base-arabic-camelbert-msa-eighth	MSA	14GB	1.6B
-	bert-base-arabic-camelbert-msa-sixteenth	MSA	6GB	746M

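Intended uses

You can use the released model for either masked language modeling or next sentence prediction. However, it is mostly intended to be fine-tuned on an NLP task, such as NER, POS tagging, sentiment analysis, dialect identification, and poem classification. We release our fine-tuning code here.

How to use

You can use this model directly with a pipeline for masked language modeling: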

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-arabic-camelbert-msa')
>>> unmasker("الهدف من الحياة هو [MASK] .")
[{'sequence': '[CLS] الهدف من الحياة هو العمل. [SEP]',
  'score': 0.08507660031318665,
  'token': 2854,
  'token_str': 'العمل'},
 {'sequence': '[CLS] الهدف من الحياة هو الحياة. [SEP]',
  'score': 0.058905381709337234,
  'token': 3696, 'token_str': 'الحياة'},
 {'sequence': '[CLS] الهدف من الحياة هو النجاح. [SEP]',
  'score': 0.04660581797361374, 'token': 6232,
  'token_str': 'النجاح'},
 {'sequence': '[CLS] الهدف من الحياة هو الربح. [SEP]',
  'score': 0.04156001657247543,
  'token': 12413, 'token_str': 'الربح'},
 {'sequence': '[CLS] الهدف من الحياة هو الحب. [SEP]',
  'score': 0.03534102067351341,
  'token': 3088,
  'token_str': 'الحب'}]

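Note: to download our models, you need transformers>=3.5.0. Otherwise, you can download the models manually.

Here is how to use this model to get the features of a given text in PyTorch: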

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

and in TensorFlow:

from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa')
model = TFAutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

Training data

MSA (Modern Standard Arabic)
- The Arabic Gigaword Fifth Edition
- Abu El-Khair Corpus
- OSIAN corpus
- Arabic Wikipedia
- The unshuffled version of the Arabic OSCAR corpus

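Training procedure

We use the original implementation released by Google for pre-training. We follow the original English BERT model's hyperparameters for pre-training, unless otherwise specified.

Preprocessing

After extracting the raw text from each corpus, we apply the following pre-processing.
We first remove invalid characters and normalize white spaces using the utilities provided by the original BERT implementation.
We also remove lines without any Arabic characters.
We then remove diacritics and kashida using CAMeL Tools.
Finally, we split each line into sentences with a heuristics-based sentence segmenter.
We train a WordPiece tokenizer on the entire dataset (167 GB of text) with a vocabulary size of 30,000 using HuggingFace's tokenizers.
We do not lowercase letters nor strip accents.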
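
The preprocessing steps above can be approximated with the standard library alone. The following is only a minimal sketch and not the CAMeL Tools or BERT utilities themselves: the Unicode ranges chosen for "Arabic character" and "diacritics", the function names `clean_line` and `split_sentences`, and the punctuation-based sentence splitter are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumption: "Arabic character" means a letter in the basic Arabic block
# (U+0621..U+063A and U+0641..U+064A); the kashida U+0640 is excluded.
ARABIC_LETTER = re.compile(r"[\u0621-\u063A\u0641-\u064A]")
# Assumption: diacritics are the marks U+064B..U+0652 plus superscript alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
KASHIDA = "\u0640"  # tatweel, the elongation character
# Hypothetical heuristic: split after ., !, ? or the Arabic question mark.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")


def clean_line(line):
    """Return the cleaned line, or None if it contains no Arabic letter."""
    # Drop control/format characters and U+FFFD, keeping ordinary whitespace.
    line = "".join(
        ch for ch in line
        if ch.isspace()
        or (ch != "\ufffd" and not unicodedata.category(ch).startswith("C"))
    )
    # Collapse every run of whitespace into a single space.
    line = " ".join(line.split())
    if not ARABIC_LETTER.search(line):
        return None
    line = DIACRITICS.sub("", line)
    return line.replace(KASHIDA, "")


def split_sentences(line):
    """Split one cleaned line into sentences with the heuristic above."""
    return [s for s in SENTENCE_END.split(line) if s]


if __name__ == "__main__":
    raw_lines = ["مَرْحَبًا   يا عـــالم. كيف الحال؟", "no Arabic here"]
    for raw in raw_lines:
        cleaned = clean_line(raw)
        if cleaned is not None:
            print(split_sentences(cleaned))
```

The order of operations follows the text: lines without Arabic letters are discarded before diacritics and kashida are stripped, and sentence splitting comes last.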

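Pretraining

The model was trained on a single cloud TPU (v3-8) for one million steps in total.
The first 90,000 steps were trained with a batch size of 1,024 and the rest were trained with a batch size of 256.
The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%.
We use whole word masking and a duplicate factor of 10.
We set max predictions per sequence to 20 for the dataset with a max sequence length of 128 tokens and to 80 for the dataset with a max sequence length of 512 tokens.
We use a random seed of 12345, a masked language model probability of 0.15, and a short sequence probability of 0.1.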
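
To make the masking settings concrete, here is a simplified sketch of whole word masking over WordPiece tokens. It is an illustration under stated assumptions, not the pre-training code: the function name `whole_word_mask` is hypothetical, the example tokens are English placeholders, and every selected token is replaced by [MASK] (the original BERT recipe also substitutes random or unchanged tokens some of the time). Note that 0.15 × 128 ≈ 19 and 0.15 × 512 ≈ 77, so the caps of 20 and 80 sit just above the expected number of masked tokens.

```python
import random


def whole_word_mask(tokens, masked_lm_prob=0.15, max_predictions=20, seed=12345):
    """Mask whole words: a word is a token plus its following '##' pieces."""
    rng = random.Random(seed)
    # Group token positions into words, skipping the special tokens.
    words = []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    rng.shuffle(words)
    # Number of tokens to predict, capped by max_predictions.
    num_to_predict = min(
        max_predictions, max(1, int(round(len(tokens) * masked_lm_prob)))
    )
    masked = list(tokens)
    labels = {}  # position -> original token
    for word in words:
        if len(labels) >= num_to_predict:
            break
        # Skip a word if masking all of its pieces would exceed the budget.
        if len(labels) + len(word) > num_to_predict:
            continue
        for i in word:
            labels[i] = tokens[i]
            masked[i] = "[MASK]"
    return masked, labels


# Placeholder tokens; masked_lm_prob is raised so the short example masks something.
tokens = ["[CLS]", "play", "##ing", "foot", "##ball", "is", "fun", "[SEP]"]
masked, labels = whole_word_mask(tokens, masked_lm_prob=0.5)
print(masked)
print(labels)
```

Because words are selected as units, the pieces "play" and "##ing" are always masked together or not at all, which is the property that distinguishes whole word masking from per-token masking.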
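The optimizer is Adam with a learning rate of 1e-4, β1 = 0.9 and β2 = 0.999, a weight decay of 0.01, learning rate warmup for 10,000 steps, and linear decay of the learning rate afterwards.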
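
The batch-size and learning-rate schedule can be written down as two small functions. This is a sketch under one assumption worth stating: the linear decay is measured from step 0 and the warmup overrides the first 10,000 steps, which is how the original BERT optimizer is commonly understood to behave; if the decay instead starts at the end of warmup, the values differ slightly. The names `batch_size` and `learning_rate` are illustrative. The text does not say how the 128-token and 512-token phases line up with the two batch sizes, so the sketch counts sequences rather than tokens.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 1_000_000
LARGE_BATCH_STEPS = 90_000


def batch_size(step):
    """Batch size at a 0-indexed training step."""
    return 1024 if step < LARGE_BATCH_STEPS else 256


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero at TOTAL_STEPS.

    Assumption: decay is measured from step 0, so the rate right after
    warmup is slightly below PEAK_LR.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, 1.0 - step / TOTAL_STEPS)


# Total number of training sequences seen over the whole run.
total_sequences = (
    LARGE_BATCH_STEPS * 1024 + (TOTAL_STEPS - LARGE_BATCH_STEPS) * 256
)
print(total_sequences)  # 325120000
print(learning_rate(5_000), learning_rate(500_000))
```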

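Evaluation results

We evaluate our pre-trained language models on five NLP tasks: NER, POS tagging, sentiment analysis, dialect identification, and poem classification.
We fine-tune and evaluate the models using 12 datasets.
We use Hugging Face's transformers library to fine-tune our CAMeLBERT models.
We use transformers v3.1.0 along with PyTorch v1.5.1.
Fine-tuning is done by adding a fully connected linear layer to the last hidden state.
We use the F1 score as the metric for all tasks.
The code used for fine-tuning is available here.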
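
As an illustration of "a fully connected linear layer on the last hidden state", the sketch below defines a small PyTorch head and applies it to a random tensor that stands in for the encoder output. It is not the released fine-tuning code: the class name `TokenClassificationHead`, the label count of 9, and the dropout rate are assumptions, and only the hidden size of 768 follows from the model being BERT-base.

```python
import torch
from torch import nn


class TokenClassificationHead(nn.Module):
    """Linear layer mapping each token's hidden state to label logits."""

    def __init__(self, hidden_size=768, num_labels=9, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # assumption: dropout before the layer
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state):
        # last_hidden_state: (batch, sequence_length, hidden_size)
        return self.classifier(self.dropout(last_hidden_state))


torch.manual_seed(0)
# Stand-in for the encoder's last hidden state: 2 sentences of 16 tokens each.
last_hidden_state = torch.randn(2, 16, 768)
head = TokenClassificationHead()
logits = head(last_hidden_state)
print(tuple(logits.shape))  # (2, 16, 9)
```

For sentence-level tasks such as sentiment analysis, the same kind of layer would typically be applied to a single pooled vector per sentence instead of every token; which pooling the authors used is not stated here.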

Results

Task	Dataset	Variant	Mix	CA	DA	MSA	MSA-1/2	MSA-1/4	MSA-1/8	MSA-1/16
NER	ANERcorp	MSA	80.8%	67.9%	74.1%	82.4%	82.0%	82.1%	82.6%	80.8%
POS	PATB (MSA)	MSA	98.1%	97.8%	97.7%	98.3%	98.2%	98.3%	98.2%	98.2%
POS	ARZTB (EGY)	DA	93.6%	92.3%	92.7%	93.6%	93.6%	93.7%	93.6%	93.6%
POS	Gumar (GLF)	DA	97.3%	97.7%	97.9%	97.9%	97.9%	97.9%	97.9%	97.9%
SA	ASTD	MSA	76.3%	69.4%	74.6%	76.9%	76.0%	76.8%	76.7%	75.3%
SA	ArSAS	MSA	92.7%	89.4%	91.8%	93.0%	92.6%	92.5%	92.5%	92.3%
SA	SemEval	MSA	69.0%	58.5%	68.4%	72.1%	70.7%	72.8%	71.6%	71.2%
DID	MADAR-26	DA	62.9%	61.9%	61.8%	62.6%	62.0%	62.8%	62.0%	62.2%
DID	MADAR-6	DA	92.5%	91.5%	92.2%	91.9%	91.8%	92.2%	92.1%	92.0%
DID	MADAR-Twitter-5	MSA	75.7%	71.4%	74.2%	77.6%	78.5%	77.3%	77.7%	76.2%
DID	NADI	DA	24.7%	17.3%	20.1%	24.9%	24.6%	24.6%	24.9%	23.8%
Poetry	APCD	CA	79.8%	80.9%	79.6%	79.7%	79.9%	80.0%	79.7%	79.8%

Results (Average)

Average	Variant	Mix	CA	DA	MSA	MSA-1/2	MSA-1/4	MSA-1/8	MSA-1/16
Variant-wise-average [1]	MSA	82.1%	75.7%	80.1%	83.4%	83.0%	83.3%	83.2%	82.3%
Variant-wise-average [1]	DA	74.4%	72.1%	72.9%	74.2%	74.0%	74.3%	74.1%	73.9%
Variant-wise-average [1]	CA	79.8%	80.9%	79.6%	79.7%	79.9%	80.0%	79.7%	79.8%
Macro-Average	ALL	78.7%	74.7%	77.1%	79.2%	79.0%	79.2%	79.1%	78.6%

[1]: Variant-wise-average refers to the average over a group of tasks in the same language variant.
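
As a worked check of how these averages relate to the per-dataset results, the sketch below recomputes them for the MSA model (the "MSA" column). It treats the macro-average as the mean over all 12 datasets; this reading is an assumption inferred from the numbers, since it reproduces the reported 79.2%. The inputs are the rounded percentages from the results table, so other columns may not match to the last digit.

```python
# F1 scores (%) of the MSA model, grouped by the variant of each dataset.
scores = {
    "MSA": {
        "ANERcorp": 82.4, "PATB (MSA)": 98.3, "ASTD": 76.9,
        "ArSAS": 93.0, "SemEval": 72.1, "MADAR-Twitter-5": 77.6,
    },
    "DA": {
        "ARZTB (EGY)": 93.6, "Gumar (GLF)": 97.9, "MADAR-26": 62.6,
        "MADAR-6": 91.9, "NADI": 24.9,
    },
    "CA": {"APCD": 79.7},
}

# Variant-wise average: mean over the datasets of one language variant.
variant_avg = {
    variant: round(sum(d.values()) / len(d), 1) for variant, d in scores.items()
}

# Macro-average (assumed): mean over all 12 datasets.
all_scores = [s for d in scores.values() for s in d.values()]
macro_avg = round(sum(all_scores) / len(all_scores), 1)

print(variant_avg)  # {'MSA': 83.4, 'DA': 74.2, 'CA': 79.7}
print(macro_avg)    # 79.2
```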

Acknowledgements

This research was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).

Citation

@inproceedings{inoue-etal-2021-interplay,
    title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
    author = "Inoue, Go  and
      Alhafni, Bashar  and
      Baimukan, Nurpeiis  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Online)",
    publisher = "Association for Computational Linguistics",
    abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}

Author:

CAMeL Lab

Dataset size:

1.22 GB