Dataset: kor_hate
Size: 1K<n<10K
Language creators: found
Source datasets: original
Preprint repository: arxiv:2005.12503
License: cc-by-sa-4.0
Multilinguality: monolingual
Language: ko
Task: text classification

The Korean HateSpeech Dataset consists of 8,367 human-annotated entertainment news comments collected from a widely used Korean news aggregation platform. Each comment is labeled for social bias (labels: gender, others, none), hate speech (labels: hate, offensive, none), and gender bias (labels: True, False). The dataset was created to support the identification of toxic comments posted by anonymous users on online platforms.
The text in the dataset is in Korean; the associated BCP-47 code is ko-KR.
A sample data instance contains a comment, with the text of the news comment and a label for each of the following fields: contain_gender_bias, bias, and hate.
{'comments': '설마 ㅈ 현정 작가 아니지??', 'contain_gender_bias': 'True', 'bias': 'gender', 'hate': 'hate'}
The data is split into a training set and a development (test) set: the training set contains 7,896 annotated comments and the test set contains 471.
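To make the layout above concrete, here is a minimal sketch of loading the corpus with the Hugging Face datasets library and inspecting one annotated comment. The "kor_hate" Hub identifier and the field names follow the description above; the "train"/"test" split names are assumptions about the hosted version.

from datasets import load_dataset

# Load the corpus; roughly 7,896 training and 471 test comments are expected.
dataset = load_dataset("kor_hate")
print(dataset)

# Inspect the label sets of the three annotation fields.
print(dataset["train"].features)

# Look at a single annotated comment.
example = dataset["train"][0]
print(example["comments"], example["contain_gender_bias"], example["bias"], example["hate"])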
The dataset was created to provide the first human-annotated Korean corpus for toxic speech detection, drawn from a Korean online entertainment news aggregator. Recently, a series of tragedies involving two young Korean celebrities led two major Korean web portals to shut down the comment sections on their platforms. That, however, is only a temporary fix that leaves the underlying problem unaddressed. This dataset aims to improve hate speech detection for Korean.
A total of 10.4 million comments were collected from a Korean online entertainment news aggregator between January 1, 2018 and February 29, 2020. 1,580 articles were drawn using stratified sampling, and the top 20 comments of each article, ranked by Wilson score, were extracted. Duplicate comments, single-word comments, and comments longer than 100 characters were removed (as the latter can convey several opinions at once). 10K comments were then randomly selected for annotation.
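The top-20 selection relies on the Wilson score, i.e. the lower bound of the Wilson confidence interval on the fraction of positive reactions. The sketch below illustrates that ranking; the exact vote signal (upvotes/downvotes) and the 95% confidence level are assumptions, since the card only states that comments were ranked by Wilson score.

import math

def wilson_lower_bound(upvotes: int, downvotes: int, z: float = 1.96) -> float:
    # Lower bound of the Wilson score interval at confidence level z (1.96 ~ 95%).
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p_hat = upvotes / n
    denom = 1 + z * z / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt((p_hat * (1 - p_hat) + z * z / (4 * n)) / n)
    return (centre - margin) / denom

# Rank the comments of one article and keep the top 20 (toy data).
comments = [("c1", 120, 4), ("c2", 30, 25), ("c3", 3, 0)]  # (text, upvotes, downvotes)
top20 = sorted(comments, key=lambda c: wilson_lower_bound(c[1], c[2]), reverse=True)[:20]
print(top20)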
Who are the source language producers? The language producers are the users of the Korean online news platform between 2018 and 2020.
Each comment was assigned to three random annotators for a majority decision. Annotators were allowed to skip comments they found too ambiguous. See Appendix A of the paper for the detailed guidelines.
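As a rough illustration of that aggregation step, the sketch below takes the three annotator votes for a comment, drops skipped annotations, and keeps the most frequent label only when it is a strict majority. The tie and skip handling here is an assumption; the paper's exact adjudication rules are in its Appendix A.

from collections import Counter
from typing import Optional

def majority_label(votes: list[Optional[str]]) -> Optional[str]:
    cast = [v for v in votes if v is not None]       # drop skipped annotations
    if not cast:
        return None
    label, count = Counter(cast).most_common(1)[0]
    return label if count > len(cast) / 2 else None  # require a strict majority

print(majority_label(["hate", "offensive", "hate"]))  # -> "hate"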
Who are the annotators? Annotation was carried out by 32 annotators: 29 from the crowdsourcing platform DeepNatural AI and three NLP researchers.
[N/A]
The dataset was created to address the social problem of users posting toxic comments on online platforms, and aims to improve the detection of such comments.
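In that spirit, and echoing the BERT baseline mentioned in the paper's abstract, here is a minimal fine-tuning sketch on the hate label. The checkpoint "beomi/kcbert-base", the hyperparameters, and the assumption that hate is stored as an integer class label are illustrative choices, not the authors' setup.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("kor_hate")
tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")

def tokenize(batch):
    # Pad/truncate comments to a fixed length for batching.
    return tokenizer(batch["comments"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)
encoded = encoded.rename_column("hate", "labels")  # hate / offensive / none -> class ids

model = AutoModelForSequenceClassification.from_pretrained("beomi/kcbert-base", num_labels=3)
args = TrainingArguments(output_dir="kor-hate-bert",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["test"])

trainer.train()
print(trainer.evaluate())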
[More Information Needed]
[More Information Needed]
The dataset was curated by Jihyung Moon, Won Ik Cho, and Junbum Lee.
[N/A]
@inproceedings{moon-et-al-2020-beep,
    title = "{BEEP}! {K}orean Corpus of Online News Comments for Toxic Speech Detection",
    author = "Moon, Jihyung and Cho, Won Ik and Lee, Junbum",
    booktitle = "Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.socialnlp-1.4",
    pages = "25--31",
    abstract = "Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff{'}s alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since the hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.",
}
Thanks to @stevhliu for adding this dataset.