Model:
aubmindlab/araelectra-base-discriminator
ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. AraELECTRA achieves state-of-the-art results on Arabic question-answering datasets.
For a detailed description, please refer to the AraELECTRA paper.
```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
import torch

discriminator = ElectraForPreTraining.from_pretrained("aubmindlab/araelectra-base-discriminator")
tokenizer = ElectraTokenizerFast.from_pretrained("aubmindlab/araelectra-base-discriminator")

# Fill in an original sentence and a copy with some tokens replaced ("fake")
sentence = ""
fake_sentence = ""

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)
# Round the logits to a 0/1 decision per token (1 = predicted as replaced)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

[print("%7s" % token, end="") for token in fake_tokens]
# Note: the predictions also cover the [CLS]/[SEP] tokens added by encode()
[print("%7s" % int(prediction), end="") for prediction in predictions[0].tolist()]
```
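In the printed output, a 1 marks a token the discriminator predicts was replaced, and a 0 marks a token it predicts is original.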
Model | HuggingFace Model Name | Size (MB/Params) |
---|---|---|
AraELECTRA-base-generator | aubmindlab/araelectra-base-generator | 227MB/60M |
AraELECTRA-base-discriminator | aubmindlab/araelectra-base-discriminator | 516MB/135M |
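As a quick way to try the generator checkpoint listed in the table above, the sketch below (not part of the original card) queries it through the fill-mask pipeline; the Arabic prompt is only an illustrative example.

```python
from transformers import pipeline

# Illustrative sketch: the fill-mask pipeline loads ElectraForMaskedLM
# for the generator checkpoint from the table above
fill_mask = pipeline("fill-mask", model="aubmindlab/araelectra-base-generator")

# Example prompt (not from the card): "The capital of Lebanon is [MASK]."
for pred in fill_mask("عاصمة لبنان هي [MASK].")[:3]:
    print(pred["token_str"], pred["score"])
```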
Model | Hardware | num of examples (seq len = 512) | Batch Size | Num of Steps | Time (in days) |
---|---|---|---|---|---|
AraELECTRA-base | TPUv3-8 | - | 256 | 2M | 24 |
The pretraining data used for the new AraELECTRA model is also the data used for AraGPT2 and AraBERTv2.
The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation).
For the new dataset, we added the unshuffled OSCAR corpus to the previous dataset used in AraBERTv1, but excluding the websites we had previously crawled.
It is recommended to apply our preprocessing function before training/testing on any dataset.
Install the `arabert` Python package to segment text for AraBERT v1 & v2 or to clean your data: `pip install arabert`
```python
from arabert.preprocess import ArabertPreprocessor

model_name = "araelectra-base"
arabert_prep = ArabertPreprocessor(model_name=model_name)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
print(arabert_prep.preprocess(text))
# output: ولن نبالغ إذا قلنا : إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري
```
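As a rough illustration of how the two snippets fit together, the sketch below (not from the original card) feeds the cleaned text into the AraELECTRA tokenizer loaded earlier; the variable names are ours.

```python
from transformers import ElectraTokenizerFast
from arabert.preprocess import ArabertPreprocessor

# Illustrative only: preprocess first, then tokenize with the AraELECTRA tokenizer
tokenizer = ElectraTokenizerFast.from_pretrained("aubmindlab/araelectra-base-discriminator")
arabert_prep = ArabertPreprocessor(model_name="araelectra-base")

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
inputs = tokenizer(arabert_prep.preprocess(text), return_tensors="pt")
print(inputs["input_ids"].shape)
```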
The PyTorch, TF2, and TF1 models can be found in HuggingFace's Transformers library under the aubmindlab username.
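For example, assuming the TF2 weights are indeed hosted for this checkpoint as the line above states, they could be loaded with the TF model classes (a sketch, not from the original card):

```python
from transformers import TFElectraForPreTraining

# Sketch: load the TF2 version of the discriminator
# (assumes TF weights are available for this checkpoint)
tf_discriminator = TFElectraForPreTraining.from_pretrained("aubmindlab/araelectra-base-discriminator")
```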
```
@inproceedings{antoun-etal-2021-araelectra,
    title = "{A}ra{ELECTRA}: Pre-Training Text Discriminators for {A}rabic Language Understanding",
    author = "Antoun, Wissam and Baly, Fady and Hajj, Hazem",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Virtual)",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.wanlp-1.20",
    pages = "191--195",
}
```
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; this work could not have been done without this program. Thanks also to the AUB MIND Lab members for their continuous support, to Yakshof and Assafir for the data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
Wissam Antoun : Linkedin | Twitter | Github | wfa07@mail.aub.edu | wissam.antoun@gmail.com
Fady Baly : Linkedin | Twitter | Github | fgb06@mail.aub.edu | baly.fady@gmail.com