Model: vesteinn/ScandiBERT
Task: Fill-Mask
Datasets: vesteinn/FC3, vesteinn/IC3, mideind/icelandic-common-crawl-corpus-IC3, NbAiLab/NCC, DDSC/partial-danish-gigaword-no-twitter
Language: is
DOI: 10.57967/hf/0382
License: agpl-3.0

Note: The model was updated on 27/9/2022.
The model was trained on the data shown in the table below, with a batch size of 8.8k for 72 epochs. Training took roughly two weeks on 24 V100 cards.
| Language  | Data                                   | Size   |
|-----------|----------------------------------------|--------|
| Icelandic | See IceBERT paper                      | 16 GB  |
| Danish    | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus                             | 42 GB  |
| Swedish   | Swedish Gigaword Corpus                | 3.4 GB |
| Faroese   | FC3 + Sosialurinn + Bible              | 69 MB  |
Note: A half-trained model was uploaded here at an earlier date; it has been removed and replaced with the updated model.
This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian, and Swedish text. It is currently the highest-ranked model on the ScandEval leaderboard: https://scandeval.github.io/pretrained/
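For a quick check of the fill-mask task, the model can be queried through the standard `transformers` pipeline. The snippet below is a minimal sketch; the example sentence is an illustrative assumption and not part of the model card, and the mask token is read from the tokenizer rather than hardcoded.

```python
from transformers import pipeline

# Load the fill-mask pipeline with this model (downloads weights on first use).
fill_mask = pipeline("fill-mask", model="vesteinn/ScandiBERT")

# Use the tokenizer's own mask token so the example works regardless of
# whether the model expects [MASK] or <mask>.
mask = fill_mask.tokenizer.mask_token

# Illustrative Icelandic example: "Reykjavík is the capital of ___."
for pred in fill_mask(f"Reykjavík er höfuðborg {mask}."):
    print(pred["token_str"], round(pred["score"], 3))
```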
If you find this model useful, please cite:
```bibtex
@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn and
      Simonsen, Annika and
      Glavaš, Goran and
      Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
}
```