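Dataset: mac_morpho
Tasks: token-classification
Sub-tasks: part-of-speech
Languages: pt
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: expert-generated
Source datasets: original
License: cc-by-4.0

Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003 [1], and two revisions have since been made to improve the quality of the resource [2, 3]. The corpus is available for download split into train, development and test sections. These make up 76%, 4% and 20% of the corpus, respectively (the unusual numbers arise because the corpus was first split 80%/20% into train/test, and then 5% of the train section was set aside for development). This split was used in [3], and new POS tagging research with Mac-Morpho is encouraged to follow it so that results can be compared consistently.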
[1] Aluísio, S., Pelizzoni, J., Marchi, A.R., de Oliveira, L., Manenti, R., Marquiafável, V. 2003. An account of the challenge of tagging a reference corpus for Brazilian Portuguese. In: Proceedings of the 6th International Conference on Computational Processing of the Portuguese Language. PROPOR 2003.
[2] Fonseca, E.R., Rosa, J.L.G. 2013. Mac-morpho revisited: Towards robust part-of-speech. In: Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology – STIL.
[3] Fonseca, E.R., Rosa, J.L.G., Aluísio, S.M. 2015. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. Journal of the Brazilian Computer Society.
[More Information Needed]
Portuguese
An example from the Mac-Morpho dataset looks as follows:
{ "id": "0", "pos_tags": [14, 19, 14, 15, 22, 7, 14, 9, 14, 9, 3, 15, 3, 3, 24], "tokens": ["Jersei", "atinge", "média", "de", "Cr$", "1,4", "milhão", "na", "venda", "da", "Pinhal", "em", "São", "Paulo", "."] }
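The integers in `pos_tags` can be read as positions in the tag list that this card gives right after the example. The sketch below decodes the example record under the assumption that the IDs are zero-based indices into that list, which is consistent with the example (`"."` maps to `PU`, `"Cr$"` to `CUR`). The helper name `decode_tags` is hypothetical and not part of any library.

```python
# Tag names copied from this card, in the order they are listed.
POS_LABELS = [
    "PREP+PROADJ", "IN", "PREP+PRO-KS", "NPROP", "PREP+PROSUB", "KC",
    "PROPESS", "NUM", "PROADJ", "PREP+ART", "KS", "PRO-KS", "ADJ",
    "ADV-KS", "N", "PREP", "PROSUB", "PREP+PROPESS", "PDEN", "V",
    "PREP+ADV", "PCP", "CUR", "ADV", "PU", "ART",
]

# The example record shown in this card.
example = {
    "id": "0",
    "pos_tags": [14, 19, 14, 15, 22, 7, 14, 9, 14, 9, 3, 15, 3, 3, 24],
    "tokens": ["Jersei", "atinge", "média", "de", "Cr$", "1,4", "milhão",
               "na", "venda", "da", "Pinhal", "em", "São", "Paulo", "."],
}


def decode_tags(tag_ids, labels=POS_LABELS):
    """Map integer tag IDs to tag names (assumes zero-based indexing)."""
    return [labels[i] for i in tag_ids]


decoded = decode_tags(example["pos_tags"])

# Print one token per line next to its decoded tag.
for token, tag in zip(example["tokens"], decoded):
    print(f"{token}\t{tag}")
```

When the dataset is loaded through a library that exposes label metadata, that metadata is likely a safer source for the mapping than a hard-coded list.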
The part-of-speech tags correspond to the following list:
"PREP+PROADJ", "IN", "PREP+PRO-KS", "NPROP", "PREP+PROSUB", "KC", "PROPESS", "NUM", "PROADJ", "PREP+ART", "KS", "PRO-KS", "ADJ", "ADV-KS", "N", "PREP", "PROSUB", "PREP+PROPESS", "PDEN", "V", "PREP+ADV", "PCP", "CUR", "ADV", "PU", "ART"
The dataset is split into train, validation and test sets. The sizes of the splits are as follows:
Train | Val | Test |
---|---|---|
37948 | 1997 | 9987 |
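As a quick sanity check, the split sizes in the table can be compared with the 76%/4%/20% proportions described earlier in this card. The sketch below uses only the three counts from the table. The dictionary keys are illustrative names, not necessarily the split names used by any loader.

```python
# Split sizes taken from the table in this card.
split_sizes = {"train": 37948, "validation": 1997, "test": 9987}

total = sum(split_sizes.values())

# Share of each split in the whole corpus, in percent, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

# Share of the development set within the original 80% train portion
# (train + validation), which the card describes as 5%.
train_portion = split_sizes["train"] + split_sizes["validation"]
dev_share_of_train = round(100 * split_sizes["validation"] / train_portion, 1)

print(total)
print(shares)
print(dev_share_of_train)
```

Rounded to one decimal place, the counts give 76.0%, 4.0% and 20.0% of the total, and the validation set is 5.0% of the combined train and validation portion, matching the description of the split.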
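[More Information Needed]
[More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]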
@article{fonseca2015evaluating,
  title={Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese},
  author={Fonseca, Erick R and Rosa, Jo{\~a}o Lu{\'\i}s G and Alu{\'\i}sio, Sandra Maria},
  journal={Journal of the Brazilian Computer Society},
  volume={21},
  number={1},
  pages={2},
  year={2015},
  publisher={Springer}
}
Thanks to @jonatasgrosman for adding this dataset.