This is the bert-large-finnish-cased-v1 model, fine-tuned on the Finnish jigsaw_toxicity_pred_fi dataset. The model was trained to predict probabilities for the 6 toxicity labels defined in that dataset.

Language model: bert-large-finnish-cased-v1
Language: Finnish
Downstream task: multi-label toxicity detection (multi-label text classification)
Training data: jigsaw_toxicity_pred_fi
Evaluation data: jigsaw_toxicity_pred_fi
If you use this model, please cite us with the following BibTeX entry.
```bibtex
@inproceedings{eskelinen-etal-2023-toxicity,
    title = "Toxicity Detection in {F}innish Using Machine Translation",
    author = "Eskelinen, Anni and Silvala, Laura and Ginter, Filip and Pyysalo, Sampo and Laippala, Veronika",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.68",
    pages = "685--697",
    abstract = "Due to the popularity of social media platforms and the sheer amount of user-generated content online, the automatic detection of toxic language has become crucial in the creation of a friendly and safe digital space. Previous work has been mostly focusing on English leaving many lower-resource languages behind. In this paper, we present novel resources for toxicity detection in Finnish by introducing two new datasets, a machine translated toxicity dataset for Finnish based on the widely used English Jigsaw dataset and a smaller test set of Suomi24 discussion forum comments originally written in Finnish and manually annotated following the definitions of the labels that were used to annotate the Jigsaw dataset. We show that machine translating the training data to Finnish provides better toxicity detection results than using the original English training data and zero-shot cross-lingual transfer with XLM-R, even with our newly annotated dataset from Suomi24.",
}
```
The model can be used through the Hugging Face pipeline:
```python
import transformers

# The classification head was fine-tuned on top of bert-large-finnish-cased-v1,
# so the tokenizer is loaded from the base language model.
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "TurkuNLP/bert-large-finnish-cased-toxicity"
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "TurkuNLP/bert-large-finnish-cased-v1"
)

# function_to_apply="sigmoid" yields an independent probability per label;
# top_k=None returns the scores for all labels rather than only the top one.
pipe = transformers.pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    function_to_apply="sigmoid",
    top_k=None,
)
```
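Because this is a multi-label task, the pipeline applies a sigmoid to each label's logit independently instead of a softmax over all labels, so several labels can be active for the same comment. A minimal sketch of that post-processing, using the 6 Jigsaw label names and made-up logits (not real model output):

```python
import math

# The 6 Jigsaw toxicity labels (scored independently in multi-label setup).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def to_probabilities(logits):
    """Map raw per-label logits to independent probabilities.

    Unlike softmax, sigmoid does not force the scores to sum to 1,
    so any number of labels can exceed the decision threshold at once.
    """
    return {label: sigmoid(z) for label, z in zip(LABELS, logits)}

def active_labels(probs, threshold=0.5):
    """Return the labels whose probability meets the threshold."""
    return [label for label, p in probs.items() if p >= threshold]

# Hypothetical logits for one comment.
probs = to_probabilities([2.1, -3.0, 1.4, -4.2, 0.9, -3.8])
print(active_labels(probs))  # ['toxic', 'obscene', 'insult']
```

Thresholding at 0.5 on the sigmoid scores is the conventional multi-label decision rule; the pipeline itself only returns the scores.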
Hyperparameters:

batch_size = 12
epochs = 10 (trained for 4)
base_LM_model = "bert-large-finnish-cased-v1"
max_seq_len = 512
learning_rate = 2e-5
Performance:

F1-micro = 0.66
F1-macro = 0.57
Precision (micro) = 0.58
Recall (micro) = 0.76
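As a sanity check, the micro-averaged F1 is the harmonic mean of the micro precision and recall reported above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.58, 0.76), 2))  # 0.66
```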