Dataset: trec
Tasks:
Languages:
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: expert-generated
Annotation creators: expert-generated
Source datasets: original
License:
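The Text REtrieval Conference (TREC) Question Classification dataset contains about 5,500 labeled questions in the training set and another 500 in the test set.
The dataset has 6 coarse-grained class labels and 50 fine-grained class labels. The average sentence length is 10 words, and the vocabulary size is 8,700 words.
The data come from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 questions from TREC 8 and TREC 9, and 500 questions from TREC 10, which serve as the test set. All questions were manually labeled.
The language of the dataset is English (en).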
An example from the "train" split looks as follows.
{
'text': 'How did serfdom develop in and then leave Russia ?',
'coarse_label': 2,
'fine_label': 26
}
The data fields are the same across all splits: text, coarse_label, and fine_label.
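The record layout can be illustrated with a minimal validation sketch. It is not part of the dataset's tooling: the names `TrecExample` and `validate_example` are hypothetical, and the check assumes that label ids are zero-based integers, which this card does not state explicitly. The label counts (6 coarse, 50 fine) come from the description above.

```python
from dataclasses import dataclass

NUM_COARSE_LABELS = 6   # from the dataset description
NUM_FINE_LABELS = 50    # from the dataset description


@dataclass(frozen=True)
class TrecExample:
    """One question record (hypothetical class; fields mirror the sample instance)."""
    text: str
    coarse_label: int
    fine_label: int


def validate_example(record: dict) -> TrecExample:
    """Check that a raw record has the three expected fields with plausible values.

    Assumption: label ids are zero-based integers.
    """
    example = TrecExample(
        text=record["text"],
        coarse_label=record["coarse_label"],
        fine_label=record["fine_label"],
    )
    if not isinstance(example.text, str) or not example.text.strip():
        raise ValueError("text must be a non-empty string")
    if not 0 <= example.coarse_label < NUM_COARSE_LABELS:
        raise ValueError(f"coarse_label out of range: {example.coarse_label}")
    if not 0 <= example.fine_label < NUM_FINE_LABELS:
        raise ValueError(f"fine_label out of range: {example.fine_label}")
    return example


# The sample instance shown in this card.
sample = {
    "text": "How did serfdom develop in and then leave Russia ?",
    "coarse_label": 2,
    "fine_label": 26,
}
example = validate_example(sample)
print(example.coarse_label, example.fine_label)
```

A check like this may be useful before training, since a record with a missing field raises `KeyError` and an out-of-range label raises `ValueError` rather than silently passing through.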
name | train | test |
---|---|---|
default | 5452 | 500 |
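
As a quick sanity check on the split table, the following sketch computes each split's share of the data and confirms that the total falls in the 1K<n<10K size category listed in the metadata. The dictionary name `SPLIT_SIZES` is an illustrative choice; the counts are copied from the table above.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {"train": 5452, "test": 500}

total = sum(SPLIT_SIZES.values())

# Share of each split relative to the whole dataset.
shares = {name: size / total for name, size in SPLIT_SIZES.items()}
for name, size in SPLIT_SIZES.items():
    print(f"{name}: {size} examples ({shares[name]:.1%})")

# The metadata lists the size category as 1K<n<10K.
in_size_category = 1_000 < total < 10_000
print(f"total: {total}, within 1K<n<10K: {in_size_category}")
```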
@inproceedings{li-roth-2002-learning,
title = "Learning Question Classifiers",
author = "Li, Xin and
Roth, Dan",
booktitle = "{COLING} 2002: The 19th International Conference on Computational Linguistics",
year = "2002",
url = "https://www.aclweb.org/anthology/C02-1150",
}
@inproceedings{hovy-etal-2001-toward,
title = "Toward Semantics-Based Answer Pinpointing",
author = "Hovy, Eduard and
Gerber, Laurie and
Hermjakob, Ulf and
Lin, Chin-Yew and
Ravichandran, Deepak",
booktitle = "Proceedings of the First International Conference on Human Language Technology Research",
year = "2001",
url = "https://www.aclweb.org/anthology/H01-1069",
}