Datasets:

persian_ner

Languages:

fa

Multilinguality:

monolingual

Size Categories:

1K<n<10K

Language Creators:

expert-generated

Annotations Creators:

expert-generated

Source Datasets:

original

License:

cc-by-4.0

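Dataset Card for [Persian NER]

Dataset Summary

The dataset contains 7,682 Persian sentences, split into 250,015 tokens with their NER labels. It is distributed as 3 folds, to be used in turn as training and test sets. The NER tags are in IOB format.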

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

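Dataset Structure

Data Instances

Data Fields

  • id: the id of the example
  • tokens: the tokens of the example text
  • ner_tags: the NER tag of each token

The NER tags correspond to the following list:

"O", "I-event", "I-fac", "I-loc", "I-org", "I-pers", "I-pro", "B-event", "B-fac", "B-loc", "B-org", "B-pers", "B-pro"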
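As a minimal sketch of how the tag list can be used, the snippet below maps integer tag ids to label strings and groups IOB labels into entity spans. It assumes that `ner_tags` holds integer indices into the list in the order given here; the card does not state this explicitly. The helper names `decode_tags` and `iob_to_spans` are hypothetical and not part of the dataset.

```python
from typing import List, Tuple

# Label list copied from the card. ASSUMPTION: integer tag ids index into
# this list in exactly this order.
NER_LABELS = [
    "O", "I-event", "I-fac", "I-loc", "I-org", "I-pers", "I-pro",
    "B-event", "B-fac", "B-loc", "B-org", "B-pers", "B-pro",
]


def decode_tags(tag_ids: List[int]) -> List[str]:
    """Convert integer tag ids into their label strings."""
    return [NER_LABELS[i] for i in tag_ids]


def iob_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB labels into (entity_type, start, end) spans, end exclusive.

    An I- tag with no open span of the same type starts a new span, so the
    function tolerates both B-initial and I-initial entities.
    """
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")
        # An I- tag of the same type continues the open span.
        if prefix == "I" and current == etype:
            continue
        # Any other tag closes the open span, if there is one.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # B- tags (and stray I- tags) open a new span.
        if prefix in ("B", "I"):
            start, current = i, etype
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


# Hypothetical tag ids for a five-token sentence.
labels = decode_tags([11, 5, 0, 9, 0])
spans = iob_to_spans(labels)
```

Here `labels` is `["B-pers", "I-pers", "O", "B-loc", "O"]`, and `spans` contains one `pers` entity covering tokens 0 to 1 and one `loc` entity at token 3.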

Data Splits

Train and test splits
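Because the corpus comes as three folds that are used in turn as training and test sets, one plausible way to report a result is to score each fold separately and average. The sketch below illustrates that loop with a trivial majority-tag baseline on toy data. The fold layout (a dict with "train" and "test" lists of examples) and the function names are assumptions made for illustration; the card does not specify how the folds are exposed.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List

Example = Dict[str, list]       # {"tokens": [...], "ner_tags": [...]}
Fold = Dict[str, List[Example]]  # ASSUMED layout: {"train": [...], "test": [...]}


def majority_tag_accuracy(train: List[Example], test: List[Example]) -> float:
    """Token accuracy of always predicting the most frequent training tag."""
    counts = Counter(tag for ex in train for tag in ex["ner_tags"])
    majority = counts.most_common(1)[0][0]
    gold = [tag for ex in test for tag in ex["ner_tags"]]
    return sum(tag == majority for tag in gold) / len(gold)


def evaluate_over_folds(
    folds: List[Fold],
    train_and_score: Callable[[List[Example], List[Example]], float],
) -> Dict[str, object]:
    """Score every fold with its own train/test pair and average the scores."""
    scores = [train_and_score(fold["train"], fold["test"]) for fold in folds]
    return {"per_fold": scores, "mean": mean(scores)}


# Toy data standing in for the three folds (not real corpus content).
toy_train = [{"tokens": ["a", "b"], "ner_tags": [0, 0]}]
toy_folds = [
    {"train": toy_train, "test": [{"tokens": ["c", "d"], "ner_tags": [0, 11]}]},
    {"train": toy_train, "test": [{"tokens": ["e", "f"], "ner_tags": [0, 0]}]},
    {"train": toy_train, "test": [{"tokens": ["g", "h"], "ner_tags": [11, 11]}]},
]
result = evaluate_over_folds(toy_folds, majority_tag_accuracy)
```

Any real model can replace `majority_tag_accuracy`, as long as it takes a training list and a test list and returns a single score.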

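Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

Hanieh Poostchi, Ehsan Zare Borzeshi, Mohammad Abdous, Massimo Piccardi

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

Hanieh Poostchi, Ehsan Zare Borzeshi, Mohammad Abdous, Massimo Piccardi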

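Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

The dataset is published for academic use only.

Dataset Curators

[More Information Needed]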

Licensing Information

Creative Commons Attribution 4.0 International License.

Citation Information

@inproceedings{poostchi-etal-2016-personer,
    title = "{P}erso{NER}: {P}ersian Named-Entity Recognition",
    author = "Poostchi, Hanieh and Zare Borzeshi, Ehsan and Abdous, Mohammad and Piccardi, Massimo",
    booktitle = "Proceedings of {COLING} 2016, the 26th International Conference on Computational Linguistics: Technical Papers",
    month = dec,
    year = "2016",
    address = "Osaka, Japan",
    publisher = "The COLING 2016 Organizing Committee",
    url = "https://www.aclweb.org/anthology/C16-1319",
    pages = "3381--3389",
    abstract = "Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To abridge this gap, in this paper we target the Persian language that is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPerosNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNNL scores while outperforming two alternatives based on a CRF and a recurrent neural network.",
}

Contributions

Thanks to @KMFODA for adding this dataset.