Datasets:

hebrew_sentiment

Tasks:

Text Classification

Sub-tasks:

sentiment-classification

Languages:

Hebrew

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotations creators:

expert-generated

Source datasets:

original

License:

mit

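Dataset Card for HebrewSentiment

Dataset Summary

HebrewSentiment is a dataset of 12,804 user comments on posts from the official Facebook page of Israel's president, Mr. Reuven Rivlin. In October 2015, we used the open-source software Netvizz (Rieder, 2013) to scrape all the comments on all of the president's posts from June to August 2014, the first three months of Rivlin's presidency. While the president's posts aimed to reconcile tensions and called for tolerance and empathy, the sentiment expressed in the comments was sharply polarized: some citizens warmly thanked the president, while others fiercely criticized his policies. Of the 12,804 comments, 370 are neutral, 8,512 are positive, and 3,922 are negative.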
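The label counts in the summary imply a skewed class distribution, which matters when interpreting accuracy figures. The following is a rough sketch that uses only the three counts quoted above; the variable names are illustrative and not part of the dataset or any library.

```python
# Label counts as reported in the dataset summary.
counts = {"positive": 8512, "negative": 3922, "neutral": 370}

# Total number of comments covered by the three counts.
total = sum(counts.values())

# Share of each class as a fraction of all comments.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority baseline ({majority_label}): {majority_baseline:.1%}")
```

Any reported accuracy is best read against this majority baseline rather than against chance.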

Supported Tasks and Leaderboards

Sentiment analysis

Languages

Hebrew

Dataset Structure

TSV format: {Hebrew sentence}\t{sentiment label}

Data Instances

רובי הייתי רוצה לראות ערביה נישאת ליהודי	1
תמונה יפיפיה-שפו	0
חייבים לעשות סוג של חרם כשכתבים שונאי ישראל עולים לשידור צריכים להעביר לערוץ אחר ואז תראו מה יעשה כוחו של הרייטינג ( בהקשר לדבריה של רינה מצליח )	2

Data Fields

text: the Modern Hebrew input text.
label: the sentiment label. 0 = positive, 1 = negative, 2 = off-topic.
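As an illustration of the TSV layout and the label ids described above, here is a minimal parsing sketch. It assumes one example per line with the label as the last tab-separated field; `Example`, `LABEL_NAMES`, and `parse_tsv_lines` are hypothetical names introduced here, not part of the dataset's tooling.

```python
from typing import Iterable, Iterator, NamedTuple

# Label ids as described in the data fields.
LABEL_NAMES = {0: "positive", 1: "negative", 2: "off-topic"}


class Example(NamedTuple):
    text: str
    label: int


def parse_tsv_lines(lines: Iterable[str]) -> Iterator[Example]:
    """Yield one Example per non-empty line of the form text<TAB>label."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so that a tab inside the text is preserved.
        text, _, raw_label = line.rpartition("\t")
        label = int(raw_label)
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label {label!r}")
        yield Example(text=text, label=label)


# Usage with one of the instances shown in this card.
sample = ["תמונה יפיפיה-שפו\t0"]
for example in parse_tsv_lines(sample):
    print(example.text, "->", LABEL_NAMES[example.label])
```

Splitting on the last tab is a defensive choice: it keeps the parser working even if a comment happens to contain a tab character.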

Data Splits

Configuration	train	test
HebrewSentiment (token)	10243	2559
HebrewSentiment (morph)	10243	2559

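Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

The data consists of user comments on posts from the official Facebook page of Israel's president, Mr. Reuven Rivlin. In October 2015, we used the open-source software Netvizz (Rieder, 2013) to scrape all the comments on all of the president's posts from June to August 2014, the first three months of Rivlin's presidency.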

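Who are the source language producers?

More Information Needed

Annotations

Annotation process

A trained researcher examined each comment and determined its sentiment value: 0 for comments that are positive overall, 1 for comments that are negative overall, and 2 for comments that are off-topic with respect to the post's content. We validated the coding scheme by asking a second trained researcher to code the same data. There was substantial agreement between the two raters (agreements: 10,623; disagreements: 2,105; Cohen's Kappa = 0.697, p = 0).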
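For readers unfamiliar with the agreement statistic, the sketch below shows how Cohen's Kappa is computed from two raters' label sequences, and derives the raw observed agreement from the counts quoted above. The function name and the toy labels are illustrative assumptions. The reported value of 0.697 cannot be reproduced from the agreement counts alone, because the per-label distribution of each rater is not given here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


# Raw observed agreement from the counts reported in this card.
agreements, disagreements = 10623, 2105
observed_agreement = agreements / (agreements + disagreements)
print(f"observed agreement: {observed_agreement:.4f}")

# Toy example with made-up labels (not taken from the dataset).
toy_a = [0, 0, 1, 1]
toy_b = [0, 0, 1, 0]
print(f"toy kappa: {cohens_kappa(toy_a, toy_b):.2f}")
```

Kappa is lower than raw agreement because it discounts the agreement expected by chance, which is substantial when one label dominates.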

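Who are the annotators?

Researchers

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

OMIlab, The Open University of Israel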

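Licensing Information

MIT License

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Citation Information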

@inproceedings{amram-etal-2018-representations,
    title = "Representations and Architectures in Neural Sentiment Analysis for Morphologically Rich Languages: A Case Study from {M}odern {H}ebrew",
    author = "Amram, Adam and Ben David, Anat and Tsarfaty, Reut",
    booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
    month = aug,
    year = "2018",
    address = "Santa Fe, New Mexico, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/C18-1190",
    pages = "2242--2252",
    abstract = "This paper empirically studies the effects of representation choices on neural sentiment analysis for Modern Hebrew, a morphologically rich language (MRL) for which no sentiment analyzer currently exists. We study two dimensions of representational choices: (i) the granularity of the input signal (token-based vs. morpheme-based), and (ii) the level of encoding of vocabulary items (string-based vs. character-based). We hypothesise that for MRLs, languages where multiple meaning-bearing elements may be carried by a single space-delimited token, these choices will have measurable effects on task performance, and that these effects may vary for different architectural designs {---} fully-connected, convolutional or recurrent. Specifically, we hypothesize that morpheme-based representations will have advantages in terms of their generalization capacity and task accuracy, due to their better OOV coverage. To empirically study these effects, we develop a new sentiment analysis benchmark for Hebrew, based on 12K social media comments, and provide two instances of these data: in token-based and morpheme-based settings. Our experiments show that representation choices empirical effects vary with architecture type. While fully-connected and convolutional networks slightly prefer token-based settings, RNNs benefit from a morpheme-based representation, in accord with the hypothesis that explicit morphological information may help generalize. Our endeavour also delivers the first state-of-the-art broad-coverage sentiment analyzer for Hebrew, with over 89{\%} accuracy, alongside an established benchmark to further study the effects of linguistic representation choices on neural networks{'} task performance.",
}

Contributions

Thanks to @elronbandel for adding this dataset.

Author:

Anonymous

Dataset size:

26.82 KB