Model:
pysentimiento/bertabaporu-pt-hate-speech
Repository: https://github.com/pysentimiento/pysentimiento/
Model trained for hate speech detection in Portuguese. The base model is BERTabaporu, a BERT model trained on Portuguese tweets.
It can be used directly with pysentimiento.
from pysentimiento import create_analyzer

analyzer = create_analyzer(task="hate_speech", lang="pt")

analyzer.predict("você tem que matar todos os malditos negros")
# Returns AnalyzerOutput(output=['Racism'], probas={Sexism: 0.027, Body: 0.016, Racism: 0.698, Ideology: 0.025, Homophobia: 0.017})
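The checkpoint can also be loaded directly with the Hugging Face transformers library. The sketch below is a minimal example, assuming a standard multi-label sequence-classification head where a per-class sigmoid plus a 0.5 cutoff approximates the analyzer's output; the threshold and post-processing here are assumptions, not the documented pysentimiento behaviour.

# Minimal sketch: load the checkpoint with transformers and score the labels ourselves.
# Assumption: multi-label head, so we apply a per-class sigmoid and an arbitrary 0.5 threshold.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "pysentimiento/bertabaporu-pt-hate-speech"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "você tem que matar todos os malditos negros"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

probas = torch.sigmoid(logits)[0]                      # one probability per hate-speech category
labels = [model.config.id2label[i] for i, p in enumerate(probas) if p > 0.5]
print(labels)  # e.g. ['Racism'] for the example above

For the labelled categories (Racism, Sexism, Body, Ideology, Homophobia) and their probabilities, the pysentimiento AnalyzerOutput shown above remains the reference output.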
If you use this model in your research, please cite the pysentimiento, BERTabaporu, and Portuguese hate speech dataset papers:
@misc{perez2021pysentimiento,
  title={pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks},
  author={Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque},
  year={2021},
  eprint={2106.09462},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{pablo_botton_da_costa_2022,
  author = { {pablo botton da costa} },
  title = { bertabaporu-base-uncased (Revision 1982d0f) },
  year = 2022,
  url = { https://huggingface.co/pablocosta/bertabaporu-base-uncased },
  doi = { 10.57967/hf/0019 },
  publisher = { Hugging Face }
}

@inproceedings{fortuna-etal-2019-hierarchically,
  title = "A Hierarchically-Labeled {P}ortuguese Hate Speech Dataset",
  author = "Fortuna, Paula and Rocha da Silva, Jo{\~a}o and Soler-Company, Juan and Wanner, Leo and Nunes, S{\'e}rgio",
  booktitle = "Proceedings of the Third Workshop on Abusive Language Online",
  month = aug,
  year = "2019",
  address = "Florence, Italy",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/W19-3510",
  doi = "10.18653/v1/W19-3510",
  pages = "94--104",
  abstract = "Over the past years, the amount of online offensive speech has been growing steadily. To successfully cope with it, machine learning are applied. However, ML-based techniques require sufficiently large annotated datasets. In the last years, different datasets were published, mainly for English. In this paper, we present a new dataset for Portuguese, which has not been in focus so far. The dataset is composed of 5,668 tweets. For its annotation, we defined two different schemes used by annotators with different levels of expertise. Firstly, non-experts annotated the tweets with binary labels ({`}hate{'} vs. {`}no-hate{'}). Secondly, expert annotators classified the tweets following a fine-grained hierarchical multiple label scheme with 81 hate speech categories in total. The inter-annotator agreement varied from category to category, which reflects the insight that some types of hate speech are more subtle than others and that their detection depends on personal perception. This hierarchical annotation scheme is the main contribution of the presented work, as it facilitates the identification of different types of hate speech and their intersections. To demonstrate the usefulness of our dataset, we carried a baseline classification experiment with pre-trained word embeddings and LSTM on the binary classified data, with a state-of-the-art outcome.",
}