Model:
aubmindlab/aragpt2-base
You can find more information in our paper AraGPT2
The code in this repository was used to train all GPT2 variants. It supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.
GPT2-base and GPT2-medium use the code in the gpt2 folder and can train models from the minimaxir/gpt-2-simple repository. These models were trained with the LAMB optimizer, follow the same architecture as GPT2, and are fully compatible with the transformers library.
GPT2-large and GPT2-mega were trained using the imcaspar/gpt2-ml library and follow the Grover architecture. You can use the PyTorch classes found in grover/modeling_gpt2.py as a direct replacement for the corresponding classes in the transformers library (it should support v4.x of transformers). Both models were trained with the Adafactor optimizer, since the Adam and LAMB optimizers use too much memory, causing the model to not fit even a single batch on a TPU core.
AraGPT2 is trained on the same large Arabic Dataset as AraBERTv2.
```python
from transformers import GPT2TokenizerFast, pipeline

# for base and medium:
from transformers import GPT2LMHeadModel

# for large and mega, use the Grover-based class instead (pip install arabert):
# from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel

from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = 'aubmindlab/aragpt2-base'
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME)

text = ""
text_clean = arabert_prep.preprocess(text)

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# feel free to try different decoding settings
generation_pipeline(text_clean,  # generate from the preprocessed text
                    pad_token_id=tokenizer.eos_token_id,
                    num_beams=10,
                    max_length=200,
                    top_p=0.9,
                    repetition_penalty=3.0,
                    no_repeat_ngram_size=3)[0]['generated_text']
```
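The settings above use beam search; the same pipeline also accepts sampling-based decoding. The snippet below is a minimal sketch that reuses the `generation_pipeline`, `tokenizer`, and `text_clean` objects defined above; the decoding values are illustrative, not tuned.

```python
# Nucleus (top-p) sampling instead of beam search; values are illustrative only.
generation_pipeline(text_clean,
                    pad_token_id=tokenizer.eos_token_id,
                    do_sample=True,           # sample from the distribution instead of beam search
                    max_length=200,
                    top_p=0.9,
                    top_k=50,
                    repetition_penalty=3.0,
                    no_repeat_ngram_size=3)[0]['generated_text']
```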
Please follow the guide linked here.
Create the training TFRecords:
```bash
python create_pretraining_data.py \
    --input_file=<RAW TEXT FILE with documents/articles separated by an empty line> \
    --output_file=<OUTPUT TFRecord> \
    --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
```
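As noted above, the `--input_file` is expected to be a plain-text file with one document or article per block and a blank line between blocks. The sketch below (hypothetical file name and documents) shows one way to produce such a file:

```python
# Write raw documents to a plain-text file, one document per block,
# separated by an empty line, as expected by create_pretraining_data.py.
documents = [
    "نص المقال الأول ...",   # hypothetical article 1
    "نص المقال الثاني ...",  # hypothetical article 2
]

with open("raw_corpus.txt", "w", encoding="utf-8") as f:  # hypothetical output path
    for doc in documents:
        f.write(doc.strip() + "\n\n")  # blank line marks a document boundary
```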
Fine-tuning:
```bash
python3 run_pretraining.py \
    --input_file="gs://<GS_BUCKET>/pretraining_data/*" \
    --output_dir="gs://<GS_BUCKET>/pretraining_model/" \
    --config_file="config/small_hparams.json" \
    --batch_size=128 \
    --eval_batch_size=8 \
    --num_train_steps= \
    --num_warmup_steps= \
    --learning_rate= \
    --save_checkpoints_steps= \
    --max_seq_length=1024 \
    --max_eval_steps= \
    --optimizer="lamb" \
    --iterations_per_loop=5000 \
    --keep_checkpoint_max=10 \
    --use_tpu=True \
    --tpu_name=<TPU NAME> \
    --do_train=True \
    --do_eval=False
```
Model | Optimizer | Context Size | Embedding Size | Num of Heads | Num of Layers | Model Size / Num of Params |
---|---|---|---|---|---|---|
AraGPT2-base | lamb | 1024 | 768 | 12 | 12 | 527MB / 135M |
AraGPT2-medium | lamb | 1024 | 1024 | 16 | 24 | 1.38GB / 370M |
AraGPT2-large | adafactor | 1024 | 1280 | 20 | 36 | 2.98GB / 792M |
AraGPT2-mega | adafactor | 1024 | 1536 | 25 | 48 | 5.5GB / 1.46B |
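As a rough sanity check on the parameter counts in the table above, a standard GPT-2-style estimate adds the token and position embeddings to roughly 12 × d_model² weights per transformer layer. The sketch below is an approximation only and assumes a 64K-token vocabulary (an assumption, not stated in the table).

```python
# Rough GPT-2-style parameter estimate: token + position embeddings
# plus ~12 * d_model^2 per transformer layer (attention + MLP weights).
def approx_params(d_model, n_layer, vocab_size=64_000, context=1024):
    embeddings = vocab_size * d_model + context * d_model
    per_layer = 12 * d_model * d_model
    return embeddings + n_layer * per_layer

for name, d_model, n_layer in [("base", 768, 12), ("medium", 1024, 24),
                               ("large", 1280, 36), ("mega", 1536, 48)]:
    print(name, f"~{approx_params(d_model, n_layer) / 1e6:.0f}M params")
```

Under these assumptions the estimates land close to the table's 135M / 370M / 792M / 1.46B figures.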
All models are available on the HuggingFace model hub under the aubmindlab name. Checkpoints are provided in PyTorch, TF2, and TF1 formats.
Model | Hardware | Num of Examples (seq len = 1024) | Batch Size | Num of Steps | Time (in days) |
---|---|---|---|---|---|
AraGPT2-base | TPUv3-128 | 9.7M | 1792 | 125K | 1.5 |
AraGPT2-medium | TPUv3-8 | 9.7M | 1152 | 85K | 1.5 |
AraGPT2-large | TPUv3-128 | 9.7M | 256 | 220K | 3 |
AraGPT2-mega | TPUv3-128 | 9.7M | 256 | 780K | 9 |
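For a rough sense of training scale, the batch sizes and step counts above imply how many sequences each model consumed relative to the 9.7M-example corpus. The arithmetic below is illustrative only and assumes every step processes exactly one full batch of 1024-token sequences.

```python
# Approximate pretraining scale implied by the table:
# sequences seen = batch size * training steps; epochs = sequences / corpus size.
corpus_examples = 9_700_000  # examples of length 1024 tokens

runs = {
    "base":   (1792, 125_000),
    "medium": (1152,  85_000),
    "large":  (256,  220_000),
    "mega":   (256,  780_000),
}

for name, (batch, steps) in runs.items():
    seqs = batch * steps
    print(f"{name}: ~{seqs / 1e6:.0f}M sequences, ~{seqs / corpus_examples:.1f} epochs")
```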
The pretraining data for the new AraGPT2 model is also the data used for AraBERTv2 and AraELECTRA.
The dataset consists of 77GB, or 200,095,961 lines, or 8,655,948,860 words, or 82,232,988,358 characters (before applying Farasa segmentation).
For the new dataset, we added the unshuffled OSCAR corpus to the dataset used in AraBERTv1, but without the websites we previously crawled:
The text generated by AraGPT2 is produced automatically by a neural network model trained on a large amount of text, and does not represent the official views or preferences of the authors or their institutions. Text generated by AraGPT2 should be used for research and scientific purposes only. If it infringes on your rights or violates social norms, please do not propagate it.
```bibtex
@inproceedings{antoun-etal-2021-aragpt2,
    title = "{A}ra{GPT}2: Pre-Trained Transformer for {A}rabic Language Generation",
    author = "Antoun, Wissam and Baly, Fady and Hajj, Hazem",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Virtual)",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.wanlp-1.21",
    pages = "196--207",
}
```
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we could not have completed this project without this program. Thanks also to the AUB MIND Lab members for their continuous support, to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
Wissam Antoun : Linkedin | Twitter | Github | wfa07@mail.aub.edu | wissam.antoun@gmail.com
Fady Baly : Linkedin | Twitter | Github | fgb06@mail.aub.edu | baly.fady@gmail.com