Danish Offensive Text Detection based on XLM-Roberta-Base

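This model is a version of XLM-Roberta-Base fine-tuned on a dataset of approximately 5 million comments from DR's public Facebook pages. The labels were generated automatically using weak supervision, based on the Snorkel framework.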

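The model achieves state-of-the-art (SOTA) performance on a test set of 600 Facebook comments, annotated by majority vote among three annotators, of which 35.8% were labelled offensive. It has the highest recall, F1-score and F2-score of the models compared: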

Model	Precision	Recall	F1-score	F2-score
alexandrainst/da-offensive-detection-base (this)	74.81%	89.77%	81.61%	86.32%
1235321	74.13%	89.30%	81.01%	85.79%
1236321	97.32%	50.70%	66.67%	56.07%
1237321	86.43%	56.28%	68.17%	60.50%
1238321	75.41%	42.79%	54.60%	46.84%
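
The F1 and F2 columns are not defined in this card. Assuming they follow the standard F-beta definition (the weighted harmonic mean of precision and recall, with beta = 2 weighting recall more heavily), the minimal sketch below recomputes them from the precision and recall of the first row. The helper name `f_beta` is hypothetical, and because the table values are rounded, other rows may differ in the last digit.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall of this model, copied from the first table row.
precision, recall = 0.7481, 0.8977

f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)

print(f"F1 = {f1:.2%}, F2 = {f2:.2%}")  # F1 = 81.61%, F2 = 86.32%
```

A model with high precision but low recall scores noticeably lower on F2 than on F1, which matches the pattern in the third and fourth rows.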

Using the model

You can use the model by running the following:

>>> from transformers import pipeline
>>> offensive_text_pipeline = pipeline(model="alexandrainst/da-offensive-detection-base")
>>> offensive_text_pipeline("Din store idiot")
[{'label': 'Offensive', 'score': 0.9997463822364807}]

Multiple documents can be processed at once as follows:

>>> offensive_text_pipeline(["Din store idiot", "ej hvor godt :)"])
[{'label': 'Offensive', 'score': 0.9997463822364807}, {'label': 'Not offensive', 'score': 0.9996451139450073}]
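
In a moderation setting, the list of predictions usually needs to be turned into a decision per document. The sketch below is one possible way to do that, not part of the model's API: the helper `flag_offensive` and the `threshold` value of 0.9 are illustrative assumptions. It works on the output format shown above, so it is demonstrated here on the example predictions rather than on a live pipeline call.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def flag_offensive(
    texts: List[str],
    predictions: List[Prediction],
    threshold: float = 0.9,  # assumed cut-off, tune for your use case
) -> List[str]:
    """Return the texts predicted 'Offensive' with a score of at least `threshold`."""
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == "Offensive" and pred["score"] >= threshold
    ]


# Inputs and outputs copied from the batch example above.
texts = ["Din store idiot", "ej hvor godt :)"]
predictions = [
    {"label": "Offensive", "score": 0.9997463822364807},
    {"label": "Not offensive", "score": 0.9996451139450073},
]

print(flag_offensive(texts, predictions))  # ['Din store idiot']
```

With a real pipeline, `predictions` would be the return value of `offensive_text_pipeline(texts)`.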

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 32
eval_batch_size: 32
gradient_accumulation_steps: 1
total_train_batch_size: 32
seed: 4242
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
max_steps: 500000
fp16: True
eval_steps: 1000
early_stopping_patience: 100
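
To make the relationship between these values concrete, the sketch below derives a few quantities from them. It rests on assumptions that the list does not state: a single training device, no learning-rate warmup, and early-stopping patience counted in evaluation rounds (as in the Transformers `EarlyStoppingCallback`). The constant and function names are illustrative only.

```python
# Values copied from the hyperparameter list above.
LEARNING_RATE = 2e-05
TRAIN_BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = 1
MAX_STEPS = 500_000
EVAL_STEPS = 1_000
EARLY_STOPPING_PATIENCE = 100

# Assumption: training ran on a single device.
NUM_DEVICES = 1

# Effective batch size per optimizer step.
total_train_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES

# Upper bound on training examples processed if training runs to max_steps.
max_examples_seen = total_train_batch_size * MAX_STEPS

# Assumption: patience is counted in evaluations, so this is the number of
# optimizer steps without improvement before training stops early.
patience_in_steps = EARLY_STOPPING_PATIENCE * EVAL_STEPS


def linear_lr(step: int) -> float:
    """Linearly decayed learning rate at `step`, assuming zero warmup steps."""
    remaining = max(0, MAX_STEPS - step)
    return LEARNING_RATE * remaining / MAX_STEPS


print(total_train_batch_size)  # 32
print(max_examples_seen)       # 16000000
print(patience_in_steps)       # 100000
```

Under these assumptions, a full run would process at most 16 million examples, which is roughly three passes over the approximately 5 million comments; early stopping may end training sooner.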

Framework versions

Transformers 4.20.1
PyTorch 1.11.0+cu113
Datasets 2.3.2
Tokenizers 0.12.1

Author:

Alexandra Institute

Dataset size:

2.09 GB