Datasets:

clarin-pl/aspectemo

Languages:

pl

Multilinguality:

monolingual

Language Creators:

other

Annotations Creators:

expert-generated

Source Datasets:

original

License:

mit

AspectEmo

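Description

AspectEmo is an extended version of PolEmo 2.0, a publicly available corpus of Polish customer reviews that has been used in many projects applying different sentiment analysis methods. The AspectEmo corpus consists of four subcorpora containing online customer reviews from the following domains: school, medicine, hotel, and products. All documents are annotated at the aspect level with six sentiment categories: strong negative (minus_m), weak negative (minus_s), neutral (zero), weak positive (plus_s), strong positive (plus_m), and ambiguous (amb).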

Versions

| version | config name | description | default | notes |
|---|---|---|---|---|
| 1.0 | "1.0" | The version used in the paper. | YES | |
| 2.0 | - | Some bugs fixed. | NO | work in progress |

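Tasks (input, output, and metrics)

Aspect-based sentiment analysis (ABSA) is a text analysis method that categorizes data by aspect and identifies the sentiment attributed to each aspect. It is a sequence tagging task.

Input ('tokens' column): sequence of tokens

Output ('labels' column): sequence of predicted token classes ("O" plus 6 possible classes: strong negative (a_minus_m), weak negative (a_minus_s), neutral (a_zero), weak positive (a_plus_s), strong positive (a_plus_m), ambiguous (a_amb))

Domain: school, medicine, hotel, and products

Measurements: F1-score (seqeval)

Example: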

Input: ['Dużo', 'wymaga', ',', 'ale', 'bardzo', 'uczciwy', 'i', 'przyjazny', 'studentom', '.', 'Warto', 'chodzić', 'na', 'konsultacje', '.', 'Docenia', 'postępy', 'i', 'zaangażowanie', '.', 'Polecam', '.']

Input (translated by DeepL): 'It requires a lot, but very honest and student friendly. Worth going to consultations. Appreciates progress and commitment. I recommend.'

Output: ['O', 'a_plus_s', 'O', 'O', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'a_zero', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O', 'O']
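
To make the tagging concrete, the following minimal sketch pairs each token of the example with its label and keeps only the tokens marked as aspects. The helper name `extract_aspects` is hypothetical; it is not part of the dataset or of any library.

```python
def extract_aspects(tokens, labels, outside="O"):
    """Return (token, label) pairs for every token not tagged as `outside`."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [(tok, lab) for tok, lab in zip(tokens, labels) if lab != outside]


# Tokens and labels copied from the example in this card.
tokens = ['Dużo', 'wymaga', ',', 'ale', 'bardzo', 'uczciwy', 'i', 'przyjazny',
          'studentom', '.', 'Warto', 'chodzić', 'na', 'konsultacje', '.',
          'Docenia', 'postępy', 'i', 'zaangażowanie', '.', 'Polecam', '.']
labels = ['O', 'a_plus_s', 'O', 'O', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O',
          'O', 'O', 'a_zero', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O', 'O']

aspects = extract_aspects(tokens, labels)
for token, label in aspects:
    print(f"{token}: {label}")
```

In this example four tokens carry an aspect label; every other token is tagged "O".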

Data splits

| Subset | Cardinality (sentences) |
|---|---|
| train | 1173 |
| val | 0 |
| test | 292 |
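
Since the validation subset is empty, one option is to hold out part of the training sentences for model selection. The sketch below does this in plain Python on a list of examples. The function name `hold_out_validation`, the 10% fraction, and the seed are illustrative assumptions and are not prescribed by the dataset authors.

```python
import random


def hold_out_validation(examples, fraction=0.1, seed=0):
    """Shuffle a copy of `examples` and split off `fraction` of them for validation."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves global state untouched
    n_val = max(1, round(len(shuffled) * fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in for the 1173 training sentences; real use would pass the actual examples.
examples = list(range(1173))
train_part, val_part = hold_out_validation(examples)
print(len(train_part), len(val_part))
```

When working with the `datasets` library directly, `dataset["train"].train_test_split(test_size=0.1, seed=0)` serves the same purpose.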

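Class distribution (without "O")

| Class | train | validation | test |
|---|---|---|---|
| a_plus_m | 0.359 | - | 0.369 |
| a_minus_m | 0.305 | - | 0.377 |
| a_zero | 0.234 | - | 0.182 |
| a_minus_s | 0.037 | - | 0.024 |
| a_plus_s | 0.037 | - | 0.015 |
| a_amb | 0.027 | - | 0.033 |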
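
The shares above appear to be fractions of all non-"O" tokens in a subset. As a minimal sketch of that computation, the hypothetical helper `class_distribution` below counts labels across sequences and normalizes the counts; the toy input is made up for illustration and is not taken from the corpus.

```python
from collections import Counter


def class_distribution(label_sequences, outside="O"):
    """Return each class's share among all labels other than `outside`."""
    counts = Counter(
        label
        for sequence in label_sequences
        for label in sequence
        if label != outside
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy label sequences (illustrative only).
toy = [
    ["O", "a_plus_m", "O", "a_minus_m"],
    ["a_plus_m", "O", "O", "a_zero"],
]
dist = class_distribution(toy)
print(dist)
```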

Citation

@misc{11321/849,	
 title = {{AspectEmo} 1.0: Multi-Domain Corpus of Consumer Reviews for Aspect-Based Sentiment Analysis},	
 author = {Koco{\'n}, Jan and Radom, Jarema and Kaczmarz-Wawryk, Ewa and Wabnic, Kamil and Zaj{\c a}czkowska, Ada and Za{\'s}ko-Zieli{\'n}ska, Monika},	
 url = {http://hdl.handle.net/11321/849},	
 note = {{CLARIN}-{PL} digital repository},	
 copyright = {The {MIT} License},	
 year = {2021}	
}

License

The MIT License

Links

HuggingFace

Source

Paper

Examples

Loading

from pprint import pprint

from datasets import load_dataset

dataset = load_dataset("clarin-pl/aspectemo")
pprint(dataset['train'][20])

# {'labels': [0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 3, 0, 5, 0, 0, 0, 0, 0, 0],
#  'tokens': ['Dużo',
#             'wymaga',
#             ',',
#             'ale',
#             'bardzo',
#             'uczciwy',
#             'i',
#             'przyjazny',
#             'studentom',
#             '.',
#             'Warto',
#             'chodzić',
#             'na',
#             'konsultacje',
#             '.',
#             'Docenia',
#             'postępy',
#             'i',
#             'zaangażowanie',
#             '.',
#             'Polecam',
#             '.']}

Evaluation

import random
from pprint import pprint

import evaluate  # replaces the deprecated `datasets.load_metric`
from datasets import load_dataset

dataset = load_dataset("clarin-pl/aspectemo")
references = dataset["test"]["labels"]

# generate random predictions (one random class id per token)
predictions = [
    [
        random.randrange(dataset["train"].features["labels"].feature.num_classes)
        for _ in range(len(labels))
    ]
    for labels in references
]

# transform to original names of labels
references_named = [
    [dataset["train"].features["labels"].feature.names[label] for label in labels]
    for labels in references
]
predictions_named = [
    [dataset["train"].features["labels"].feature.names[label] for label in labels]
    for labels in predictions
]

# transform to BILOU scheme: every labelled token becomes a unit-length (U-) span
references_named = [
    [f"U-{label}" if label != "O" else label for label in labels]
    for labels in references_named
]
predictions_named = [
    [f"U-{label}" if label != "O" else label for label in labels]
    for labels in predictions_named
]

# utilise seqeval to evaluate
seqeval = evaluate.load("seqeval")
seqeval_score = seqeval.compute(
    predictions=predictions_named,
    references=references_named,
    scheme="BILOU",
    mode="strict",
)

pprint(seqeval_score)

# {'a_amb': {'f1': 0.00597237775289287,
#            'number': 91,
#            'precision': 0.003037782418834251,
#            'recall': 0.17582417582417584},
#  'a_minus_m': {'f1': 0.048306148055207034,
#                'number': 1039,
#                'precision': 0.0288551620760727,
#                'recall': 0.1482194417709336},
#  'a_minus_s': {'f1': 0.004682997118155619,
#                'number': 67,
#                'precision': 0.0023701002734731083,
#                'recall': 0.19402985074626866},
#  'a_plus_m': {'f1': 0.045933014354066985,
#               'number': 1015,
#               'precision': 0.027402473834443386,
#               'recall': 0.14187192118226602},
#  'a_plus_s': {'f1': 0.0021750951604132683,
#               'number': 41,
#               'precision': 0.001095690284879474,
#               'recall': 0.14634146341463414},
#  'a_zero': {'f1': 0.025159400310184387,
#             'number': 501,
#             'precision': 0.013768389287061486,
#             'recall': 0.14570858283433133},
#  'overall_accuracy': 0.13970115681233933,
#  'overall_f1': 0.02328248652368391,
#  'overall_precision': 0.012639312620633834,
#  'overall_recall': 0.14742193173565724}
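
Because every non-"O" label receives the `U-` prefix, each tagged token is treated as a single-token entity, so strict entity-level scoring reduces to comparing labels position by position. The sketch below computes micro-averaged precision, recall, and F1 under that assumption. `unit_span_scores` is a hypothetical helper, not a seqeval function, and is meant to illustrate the arithmetic rather than replace seqeval.

```python
def unit_span_scores(references, predictions, outside="O"):
    """Micro-averaged precision/recall/F1 when every labelled token is its own span."""
    true_positives = n_predicted = n_reference = 0
    for ref_seq, pred_seq in zip(references, predictions):
        if len(ref_seq) != len(pred_seq):
            raise ValueError("reference and prediction lengths differ")
        for ref, pred in zip(ref_seq, pred_seq):
            if pred != outside:
                n_predicted += 1
            if ref != outside:
                n_reference += 1
                if ref == pred:
                    true_positives += 1
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_reference if n_reference else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy sequences (illustrative only): one correct aspect, two spurious predictions.
refs = [["O", "a_plus_m", "O", "a_zero"]]
preds = [["a_plus_m", "a_plus_m", "O", "a_minus_m"]]
scores = unit_span_scores(refs, preds)
print(scores)
```

This also suggests why the random baseline above scores as it does: with seven classes drawn uniformly, recall lands near 1/7, while precision is far lower because most reference tokens are "O" and nearly every random non-"O" prediction on them counts as a false positive.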