Model: vesteinn/ScandiBERT
Task: Fill-Mask
Datasets: vesteinn/FC3, vesteinn/IC3, mideind/icelandic-common-crawl-corpus-IC3, NbAiLab/NCC, DDSC/partial-danish-gigaword-no-twitter
Language: is
DOI: 10.57967/hf/0382
License: agpl-3.0

Note: The model was updated on 27/9/2022.
The model was trained on the data shown in the table below, with a batch size of 8.8k for 72 epochs. Training took roughly two weeks on 24 V100 cards.
| Language  | Data                                   | Size   |
|-----------|----------------------------------------|--------|
| Icelandic | See IceBERT paper                      | 16 GB  |
| Danish    | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus                             | 42 GB  |
| Swedish   | Swedish Gigaword Corpus                | 3.4 GB |
| Faroese   | FC3 + Sosialurinn + Bible              | 69 MB  |
Note: A half-trained model was uploaded here at an earlier date; it has been removed and replaced with the updated model.
This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian, and Swedish text. It is currently the highest-ranked model on the ScandEval leaderboard: https://scandeval.github.io/pretrained/
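For a quick check of the fill-mask task, the model can be queried through the standard `transformers` pipeline. The snippet below is a minimal sketch; the example sentence is an illustrative assumption and not part of the model card, and the mask token is read from the tokenizer rather than hardcoded.

```python
from transformers import pipeline

# Load the fill-mask pipeline with this model (downloads weights on first use).
fill_mask = pipeline("fill-mask", model="vesteinn/ScandiBERT")

# Use the tokenizer's own mask token so the example works regardless of
# whether the model expects [MASK] or <mask>.
mask = fill_mask.tokenizer.mask_token

# Illustrative Icelandic example: "Reykjavík is the capital of ___."
for pred in fill_mask(f"Reykjavík er höfuðborg {mask}."):
    print(pred["token_str"], round(pred["score"], 3))
```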
If you find this model useful, please cite:
```bibtex
@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn and
      Simonsen, Annika and
      Glavaš, Goran and
      Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
}
```