Dataset:
yelp_polarity
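Large Yelp Review Dataset. This is a dataset for binary sentiment classification. It provides 560,000 highly polar Yelp reviews for training and 38,000 for testing.
Origin
The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data. For more information, see http://www.yelp.com/dataset_challenge .
The Yelp reviews polarity dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the above dataset. It was first used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).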
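Description
The Yelp reviews polarity dataset is constructed by treating 1-star and 2-star reviews as negative and 3-star and 4-star reviews as positive. For each polarity, 280,000 training samples and 19,000 test samples are randomly selected, giving 560,000 training samples and 38,000 test samples in total. Negative polarity is class 1 and positive polarity is class 2.
The files train.csv and test.csv contain the training and test samples, respectively, as comma-separated values. Each file has 2 columns: the class index (1 or 2) and the review text. The review text is enclosed in double quotes ("), and any internal double quote is escaped by doubling it (""). Newlines are escaped as a backslash followed by an "n" character, that is, "\n".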
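The CSV conventions described above can be read back with Python's standard `csv` module. The following is a minimal sketch, not an official loader: the function name `parse_rows` and the two sample rows are made up for illustration, and it assumes the only backslash escape in the files is the `\n` sequence mentioned above.

```python
import csv
import io


def parse_rows(csv_text):
    """Yield (class_index, review_text) pairs from text in the described CSV format."""
    # The default csv dialect already treats "" inside a quoted field as one quote.
    reader = csv.reader(io.StringIO(csv_text))
    for class_index, text in reader:
        # Turn the two-character sequence backslash + "n" back into a real newline.
        # Assumption: no other backslash escapes need handling.
        yield int(class_index), text.replace("\\n", "\n")


# Hypothetical sample rows, written to mimic the format (not taken from the dataset).
sample = (
    '"1","The food was ""fine"", nothing more.\\nWould not return."\n'
    '"2","Great service!"\n'
)
rows = list(parse_rows(sample))
for class_index, text in rows:
    print(class_index, repr(text))
```

To read the real files, the same reader could be wrapped around `open("train.csv", newline="", encoding="utf-8")`; the encoding is an assumption, since the description does not state one.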
An example from 'train' looks as follows.
This example was too long and was cropped: { "label": 0, "text": "\"Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctor..." }
The data fields are the same across all splits.
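Note that the example record carries a zero-based `label` (0), while the CSV description uses class indices 1 and 2. A small sketch of how the two might be reconciled follows. It assumes that `label` is simply the class index minus one, which the text here does not state explicitly; the names `CLASS_INDEX_TO_LABEL`, `LABEL_NAMES`, and `describe` are hypothetical.

```python
# Assumption: the zero-based "label" field equals the CSV class index minus one.
CLASS_INDEX_TO_LABEL = {1: 0, 2: 1}

# Polarity per the description: class 1 is negative, class 2 is positive.
LABEL_NAMES = {0: "negative", 1: "positive"}


def describe(class_index):
    """Return (label, polarity name) for a CSV class index of 1 or 2."""
    label = CLASS_INDEX_TO_LABEL[class_index]
    return label, LABEL_NAMES[label]


print(describe(1))
print(describe(2))
```

Under this assumption, the cropped example with `"label": 0` would be a negative review.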
plain_text

name | train | test |
---|---|---|
plain_text | 560000 | 38000 |
@article{zhangCharacterlevelConvolutionalNetworks2015,
  archivePrefix = {arXiv},
  eprinttype = {arxiv},
  eprint = {1509.01626},
  primaryClass = {cs},
  title = {Character-Level {{Convolutional Networks}} for {{Text Classification}}},
  abstract = {This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.},
  journal = {arXiv:1509.01626 [cs]},
  author = {Zhang, Xiang and Zhao, Junbo and LeCun, Yann},
  month = sep,
  year = {2015},
}
Thanks to @patrickvonplaten, @lewtun, @mariamabarham, @thomwolf, @julien-c for adding this dataset.