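This is a version of the ScandiBERT model trained without Faroese data and with a different subword tokenizer.
The model was trained on the data shown in the table below. The batch size was 8.8k, and the model was trained for 72 epochs on 24 V100 cards, which took about 2 weeks.
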
| Language | Data | Size |
|---|---|---|
| Icelandic | See IceBERT paper | 16 GB |
| Danish | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus | 42 GB |
| Swedish | Swedish Gigaword Corpus | 3.4 GB |
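
To get a feel for how the training data is balanced across languages, the sizes in the table can be summed and turned into per-language shares. This is only a rough sketch: it assumes the listed sizes are directly comparable (for example, all measured as uncompressed text), which the table does not state, and the names `corpus_gb` and `language_shares` are made up for this illustration.

```python
# Corpus sizes in GB, copied from the table above.
# Assumption: the sizes are measured the same way for every language.
corpus_gb = {
    "Icelandic": 16.0,
    "Danish": 4.7,
    "Norwegian": 42.0,
    "Swedish": 3.4,
}


def language_shares(sizes):
    """Return each language's fraction of the total corpus size."""
    total = sum(sizes.values())
    return {lang: size / total for lang, size in sizes.items()}


total_gb = sum(corpus_gb.values())
shares = language_shares(corpus_gb)

print(f"Total: {total_gb:.1f} GB")
# List languages from the largest share to the smallest.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10} {share:6.1%}")
```

Under these assumptions, Norwegian makes up well over half of the raw data, so the mix is far from uniform across the four languages.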

If you find this model useful, please cite

@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn and Simonsen, Annika and Glavaš, Goran and Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
}