Policy-DistilBERT-7d

Model description

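This model was trained on 129,669 manually annotated sentences to classify text into one of seven political categories: 'Economy', 'External Relations', 'Fabric of Society', 'Freedom and Democracy', 'Political System', 'Welfare and Quality of Life' or 'Social Groups'.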

Intended uses & limitations

How to use the model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "MoritzLaurer/policy-distilbert-7d"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "The new variant first detected in southern England in September is blamed for sharp rises in levels of positive tests in recent weeks in London, south-east England and the east of England"

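# tokenize the text; the name `inputs` avoids shadowing Python's built-in `input`
inputs = tokenizer(text, truncation=True, return_tensors="pt")
# pass input_ids and attention_mask together; gradients are not needed for inference
with torch.no_grad():
    output = model(**inputs)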
# the output corresponds to the following labels:
# 0: external relations, 1: freedom and democracy, 2: political system, 3: economy, 4: welfare and quality of life, 5: fabric of society, 6: social groups

# output to dictionary
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["external relations", "freedom and democracy", "political system", "economy", "welfare and quality of life", "fabric of society", "social groups"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)
#{'external relations': 0.0, 'freedom and democracy': 0.0, 'political system': 0.9, 'economy': 0.4, 
# 'welfare and quality of life': 98.3, 'fabric of society': 0.3, 'social groups': 0.0}
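The snippet above prints a probability for every label. When only a single category is needed, one option is to take the most probable label and leave low-confidence inputs unclassified. The following is a minimal sketch in plain Python, not part of the original model card: the `classify` helper, the `threshold` value and the example logits are all hypothetical, and only the label order is taken from the usage snippet.

```python
import math

# Label order as listed in the usage snippet (index -> label).
LABEL_NAMES = [
    "external relations", "freedom and democracy", "political system",
    "economy", "welfare and quality of life", "fabric of society",
    "social groups",
]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, threshold=0.5):
    """Return (label, probability) for the most probable class.

    The label is None when the top probability is below `threshold`
    (a hypothetical cut-off, not a value from the model card).
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = LABEL_NAMES[best] if probs[best] >= threshold else None
    return label, probs[best]

# Hypothetical logits for illustration only, not real model output.
example_logits = [-2.1, -2.5, 0.3, -0.4, 5.0, -0.8, -2.3]
label, prob = classify(example_logits)
print(label, round(prob, 3))
```

With the real model, the list passed to `classify` would be `output["logits"][0].tolist()`.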

Training data

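Policy-DistilBERT-7d was trained on the English-language subset of the Manifesto Project Dataset (MPDS2020a). The model was trained on 129,669 sentences from 164 political manifestos of 55 political parties in 8 English-speaking countries (Australia, Canada, Ireland, Israel, New Zealand, South Africa, United Kingdom, United States). The manifestos were published between 1992 and 2019.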

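The Manifesto Project manually annotates each sentence of a party manifesto with one of 7 main political domains: 'Economy', 'External Relations', 'Fabric of Society', 'Freedom and Democracy', 'Political System', 'Welfare and Quality of Life' or 'Social Groups'. For the exact definition of each domain, see the codebook.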

Training procedure

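A distilbert-base-uncased model was trained with the Hugging Face Trainer using the following hyperparameters. The hyperparameters were determined through a hyperparameter search on a 15% validation set.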

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # placeholder path, not from the original card
    num_train_epochs=5,              # total number of training epochs
    learning_rate=4e-05,             # peak learning rate reached after warmup
    per_device_train_batch_size=4,   # batch size per device during training
    per_device_eval_batch_size=4,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.02,               # strength of weight decay
    fp16=True                        # mixed precision training
)
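To put these hyperparameters in proportion, the sketch below estimates the number of optimizer steps and the learning rate at a given step. It rests on several assumptions that the model card does not confirm: that 129,669 is the dataset size before the 85-15 split, that training ran on a single device without gradient accumulation, and that the Trainer's default linear schedule (linear warmup, then linear decay to zero) was left unchanged. The function `lr_at` is a hypothetical helper, not a library call.

```python
import math

N_SENTENCES = 129_669   # from the model card
TEST_FRACTION = 0.15    # from the model card (85-15 split)
BATCH_SIZE = 4          # per_device_train_batch_size
EPOCHS = 5              # num_train_epochs
WARMUP_STEPS = 500      # warmup_steps
PEAK_LR = 4e-05         # learning_rate

# Assumption: the split is taken from the full 129,669 sentences,
# with one device and no gradient accumulation.
n_train = N_SENTENCES - round(N_SENTENCES * TEST_FRACTION)
steps_per_epoch = math.ceil(n_train / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

def lr_at(step):
    """Learning rate at `step` under linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)

print("training sentences:", n_train)
print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("warmup share of training:", WARMUP_STEPS / total_steps)
print("lr halfway through warmup:", lr_at(WARMUP_STEPS // 2))
```

Under these assumptions the 500 warmup steps cover well under 1% of training, so the model spends almost all of its updates on the decaying part of the schedule.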

Eval results

The model was evaluated on 15% of the sentences (85-15 train-test split).

accuracy (balanced)	F1 (weighted)	precision	recall	accuracy (not balanced)
0.745	0.773	0.772	0.771	0.771

Please note that the label distribution in the dataset is imbalanced:

Welfare and Quality of Life    0.327225
Economy                        0.259191
Fabric of Society              0.111800
Political System               0.095081
Social Groups                  0.094371
External Relations             0.063724
Freedom and Democracy          0.048608

Balanced accuracy and weighted F1 were therefore used to evaluate model performance.
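To illustrate why the choice of metric matters with an imbalanced label distribution, here is a minimal sketch in plain Python. It is not the evaluation code used for the model: the helper functions and the toy labels are illustrative assumptions, and only the label shares are taken from the distribution reported above. Balanced accuracy is computed as the unweighted mean of per-class recall, and weighted F1 as the per-class F1 averaged by class support.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[label] / n for label, n in support.items()) / len(support)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with class support as weights."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for label, n in support.items():
        precision = hits[label] / predicted[label] if predicted[label] else 0.0
        recall = hits[label] / n
        denom = precision + recall
        total += n * (2 * precision * recall / denom if denom else 0.0)
    return total / len(y_true)

# Toy example (hypothetical labels): a classifier that always predicts
# the frequent class looks good on plain accuracy only.
y_true = ["Economy"] * 8 + ["Freedom and Democracy"] * 2
y_pred = ["Economy"] * 10
print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))

# Label shares reported in the model card.
LABEL_SHARES = {
    "Welfare and Quality of Life": 0.327225, "Economy": 0.259191,
    "Fabric of Society": 0.111800, "Political System": 0.095081,
    "Social Groups": 0.094371, "External Relations": 0.063724,
    "Freedom and Democracy": 0.048608,
}
# Always predicting the largest class: recall is 1 for that class, 0 elsewhere.
majority_label = max(LABEL_SHARES, key=LABEL_SHARES.get)
baseline_accuracy = LABEL_SHARES[majority_label]
baseline_balanced_accuracy = 1 / len(LABEL_SHARES)
print(majority_label, baseline_accuracy, baseline_balanced_accuracy)
```

A majority-class baseline on this distribution would reach a plain accuracy of about 0.33 but a balanced accuracy of only 1/7, which gives a reference point for the reported scores of 0.771 and 0.745.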

Limitations and bias

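The model was trained on sentences from the political manifestos of parties in the 8 countries listed above (1992 to 2019), manually annotated by the Manifesto Project. As with any supervised machine learning model, its output therefore reflects the limitations of the dataset in terms of country coverage, time span, domain definitions and the potential biases of the annotators. Applying the model to other types of data (other types of texts, other countries etc.) will reduce performance.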

BibTeX entry and citation info

@unpublished{laurer2020policydistilbert,
  title={Policy-DistilBERT},
  author={Moritz Laurer},
  year={2020},
  note={Unpublished paper}
}

Author:

Moritz Laurer

Dataset size:

255.74 MB