ScandiBERT-no-faroese

This is a version of the ScandiBERT model trained without Faroese data and with a different subword tokenizer.

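The model was trained on the data shown in the table below. The batch size was 8.8k, and the model was trained for 72 epochs on 24 V100 GPUs, which took about 2 weeks.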

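| Language  | Data                                  | Size   |
|-----------|---------------------------------------|--------|
| Icelandic | See IceBERT paper                     | 16 GB  |
| Danish    | Danish Gigaword Corpus (incl Twitter) | 4.7 GB |
| Norwegian | NCC corpus                            | 42 GB  |
| Swedish   | Swedish Gigaword Corpus               | 3.4 GB |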
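
To get a feel for how the training data is distributed across languages, the sizes in the table above can be summed and turned into shares. This is only a rough sketch: it assumes the listed sizes are raw text sizes and that no per-language resampling was applied during training, neither of which is stated here. The variable names are illustrative.

```python
# Corpus sizes in GB, copied from the table above.
corpus_gb = {
    "Icelandic": 16.0,
    "Danish": 4.7,
    "Norwegian": 42.0,
    "Swedish": 3.4,
}

total_gb = sum(corpus_gb.values())

# Share of the listed text contributed by each language
# (assumption: no up- or down-sampling of any corpus).
shares = {lang: size / total_gb for lang, size in corpus_gb.items()}

# Print the languages from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10} {corpus_gb[lang]:>5.1f} GB  {share:6.1%}")
print(f"{'Total':<10} {total_gb:>5.1f} GB")
```

Under these assumptions the Norwegian corpus accounts for well over half of the listed data, which is worth keeping in mind when comparing per-language results.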

If you find this model useful, please cite

@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn  and
      Simonsen, Annika  and
      Glavaš, Goran  and
      Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
}