Datasets:

norec

License:

cc-by-nc-4.0

Source datasets:

original

Annotations creators:

expert-generated

Language creators:

found

Size categories:

100K<n<1M

Multilinguality:

monolingual

Languages:

Task IDs:

named-entity-recognition

Task categories:

token-classification

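Dataset Card for NoReC

Dataset Summary

This dataset contains the Norwegian Review Corpus (NoReC), created for training and evaluating document-level sentiment analysis models. More than 43,000 full-text reviews were collected from major Norwegian news sources. They cover a range of domains, including literature, movies, video games, restaurants, music, and theater, as well as product reviews across a range of categories. Each review is labeled with a manually assigned score of 1 to 6, taken from the rating given by the original author.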

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The sentences in the dataset are in Norwegian (nb, nn, no).

Dataset Structure

Data Instances

An example from the training set:

{'deprel': ['det',
  'amod',
  'cc',
  'conj',
  'nsubj',
  'case',
  'nmod',
  'cop',
  'case',
  'case',
  'root',
  'flat:name',
  'flat:name',
  'punct'],
 'deps': ['None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None'],
 'feats': ["{'Gender': 'Masc', 'Number': 'Sing', 'PronType': 'Dem'}",
  "{'Definite': 'Def', 'Degree': 'Pos', 'Number': 'Sing'}",
  'None',
  "{'Definite': 'Def', 'Degree': 'Pos', 'Number': 'Sing'}",
  "{'Definite': 'Def', 'Gender': 'Masc', 'Number': 'Sing'}",
  'None',
  'None',
  "{'Mood': 'Ind', 'Tense': 'Pres', 'VerbForm': 'Fin'}",
  'None',
  'None',
  'None',
  'None',
  'None',
  'None'],
 'head': ['5',
  '5',
  '4',
  '2',
  '11',
  '7',
  '5',
  '11',
  '11',
  '11',
  '0',
  '11',
  '11',
  '11'],
 'idx': '000000-02-01',
 'lemmas': ['den',
  'andre',
  'og',
  'sist',
  'sesong',
  'av',
  'Rome',
  'være',
  'ute',
  'på',
  'DVD',
  'i',
  'Norge',
  '$.'],
 'misc': ['None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  "{'SpaceAfter': 'No'}",
  'None'],
 'pos_tags': [5, 0, 4, 0, 7, 1, 11, 3, 1, 1, 11, 1, 11, 12],
 'text': 'Den andre og siste sesongen av Rome er ute på DVD i Norge.',
 'tokens': ['Den',
  'andre',
  'og',
  'siste',
  'sesongen',
  'av',
  'Rome',
  'er',
  'ute',
  'på',
  'DVD',
  'i',
  'Norge',
  '.'],
 'xpos_tags': ['None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None',
  'None']}
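The `head` and `deprel` fields of the example can be read together as dependency arcs. The following is a minimal sketch, not part of the original card. It assumes that `head` holds 1-based token positions stored as strings, with '0' marking the root, as the example instance suggests. The helper name `dependency_arcs` is hypothetical.

```python
# Fields copied from the training example above (tokens, head, deprel).
tokens = ['Den', 'andre', 'og', 'siste', 'sesongen', 'av', 'Rome',
          'er', 'ute', 'på', 'DVD', 'i', 'Norge', '.']
head = ['5', '5', '4', '2', '11', '7', '5', '11', '11', '11', '0',
        '11', '11', '11']
deprel = ['det', 'amod', 'cc', 'conj', 'nsubj', 'case', 'nmod', 'cop',
          'case', 'case', 'root', 'flat:name', 'flat:name', 'punct']


def dependency_arcs(tokens, head, deprel):
    """Return (dependent, relation, governor) triples.

    Assumption: head values are 1-based token positions stored as
    strings, and '0' marks the root of the sentence.
    """
    arcs = []
    for token, h, rel in zip(tokens, head, deprel):
        h = int(h)
        governor = 'ROOT' if h == 0 else tokens[h - 1]
        arcs.append((token, rel, governor))
    return arcs


arcs = dependency_arcs(tokens, head, deprel)
for dependent, rel, governor in arcs:
    print(f'{dependent} --{rel}--> {governor}')
```

Under this reading, 'DVD' is the root of the sentence and 'Den' attaches to 'sesongen' as a determiner.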

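Data Fields

Each data instance has the following fields:

deprel: [More Information Needed]
deps: [More Information Needed]
feats: [More Information Needed]
head: [More Information Needed]
idx: index
lemmas: lemmas of all tokens
misc: [More Information Needed]
pos_tags: part-of-speech tags
text: text string
tokens: tokens
xpos_tags: [More Information Needed]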
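In the example instance, the `feats` and `misc` values are strings that hold either a Python-style dict literal or the text 'None'. The sketch below shows one possible way to turn them into Python objects with the standard-library `ast.literal_eval`. The helper name `parse_annotation` is hypothetical, and the assumption that every value follows this encoding is based only on the single example in this card.

```python
import ast

# A few feats values and one misc value copied from the example instance.
feats = ["{'Gender': 'Masc', 'Number': 'Sing', 'PronType': 'Dem'}",
         "{'Definite': 'Def', 'Degree': 'Pos', 'Number': 'Sing'}",
         'None']
misc_value = "{'SpaceAfter': 'No'}"


def parse_annotation(value):
    """Parse a string-encoded annotation.

    Returns a dict for dict literals and None for the string 'None'.
    Assumption: values are always valid Python literals.
    """
    return ast.literal_eval(value)


parsed_feats = [parse_annotation(v) for v in feats]
parsed_misc = parse_annotation(misc_value)
print(parsed_feats[0]['PronType'])
print(parsed_misc)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice here than `eval`.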

The part-of-speech tags correspond to the following labels: "ADJ" (0), "ADP" (1), "ADV" (2), "AUX" (3), "CCONJ" (4), "DET" (5), "INTJ" (6), "NOUN" (7), "NUM" (8), "PART" (9), "PRON" (10), "PROPN" (11), "PUNCT" (12), "SCONJ" (13), "SYM" (14), "VERB" (15), "X" (16).
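The label mapping can be applied to the integer `pos_tags` of an instance with a simple list lookup. This is an illustrative sketch using the example instance from this card. The name `POS_LABELS` is an assumption introduced here, not an identifier from the dataset.

```python
# Label names in index order, as listed in this card (hypothetical constant name).
POS_LABELS = ['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN',
              'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM',
              'VERB', 'X']

# tokens and pos_tags copied from the training example in this card.
tokens = ['Den', 'andre', 'og', 'siste', 'sesongen', 'av', 'Rome',
          'er', 'ute', 'på', 'DVD', 'i', 'Norge', '.']
pos_tags = [5, 0, 4, 0, 7, 1, 11, 3, 1, 1, 11, 1, 11, 12]

# Pair each token with the label name for its integer tag.
tagged = [(token, POS_LABELS[tag]) for token, tag in zip(tokens, pos_tags)]
for token, label in tagged:
    print(f'{token}\t{label}')
```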

Data Splits

The training, validation, and test sets contain 680792, 101106, and 101594 sentences, respectively.
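For a quick sense of the split proportions, the sentence counts above can be turned into shares of the total. This is only arithmetic on the numbers stated in this card. The variable names are illustrative.

```python
# Sentence counts per split, as stated in this card.
split_sizes = {'train': 680792, 'validation': 101106, 'test': 101594}

total = sum(split_sizes.values())
shares = {name: count / total for name, count in split_sizes.items()}

print(f'total: {total} sentences')
for name, share in shares.items():
    print(f'{name}: {split_sizes[name]} sentences ({share:.1%})')
```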

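Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]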

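Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]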

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@InProceedings{VelOvrBer18,
  author = {Erik Velldal and Lilja {\O}vrelid and 
            Eivind Alexander Bergem and  Cathrine Stadsnes and 
            Samia Touileb and Fredrik J{\o}rgensen},
  title = {{NoReC}: The {N}orwegian {R}eview {C}orpus},
  booktitle = {Proceedings of the 11th edition of the 
               Language Resources and Evaluation Conference},
  year = {2018},
  address = {Miyazaki, Japan},
  pages = {4186--4191}
}

Contributions

Thanks to @abhishekkrthakur for adding this dataset.

Author:

Anonymous

Dataset size:

17.89 KB