Model:
cjvt/sloberta-sleng
Task:
Fill-Mask

SloBERTa-SlEng is a masked language model based on the SloBERTa Slovene model.
SloBERTa-SlEng replaces the tokenizer, vocabulary, and embeddings layer of the SloBERTa model. The tokenizer and vocabulary are bilingual (Slovene-English) and are built from the conversational, non-standard, and slang language the model was trained on; they are the same as in the SlEng-bert model. The new embedding weights were initialized from the SloBERTa embeddings.
SloBERTa-SlEng is the SloBERTa model further pre-trained for two epochs on conversational English and Slovene corpora, the same corpora used for the SlEng-bert model.
The model was trained on English and Slovene tweets, the Slovene corpora MaCoCu and Frenk, and a small subset of the English Oscar corpus. We tried to keep the sizes of the English and Slovene corpora as equal as possible. The training corpora contained about 2.7 billion words in total.
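A minimal usage sketch for the fill-mask task, assuming the model is published on the Hugging Face Hub under the id `cjvt/sloberta-sleng` (as stated above) and is loadable with the standard `transformers` fill-mask pipeline; the exact mask token is taken from the tokenizer rather than assumed:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Model id from this card; loading it downloads the weights from the Hub.
model_name = "cjvt/sloberta-sleng"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# The vocabulary is bilingual, so both Slovene and English conversational
# text can be used; tokenizer.mask_token avoids hardcoding the mask symbol.
for pred in fill(f"I don't {tokenizer.mask_token} about that."):
    print(pred["token_str"], round(pred["score"], 3))
```

Each prediction returned by the pipeline is a dict with the filled-in token (`token_str`) and its probability (`score`).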