Dataset:

masakhane/masakhaner2

Dataset Card for MasakhaNER 2.0

Dataset Summary

MasakhaNER 2.0 is the largest publicly available high-quality dataset for named entity recognition (NER) in 20 African languages, created by the Masakhane community.

Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:

[PER Wolff] , currently a journalist in [LOC Argentina] , played with [PER Del Bosque] in the final years of the seventies in [ORG Real Madrid] .

MasakhaNER 2.0 is a named entity dataset consisting of PER, ORG, LOC, and DATE entities annotated by the Masakhane community for 20 African languages.

Train/validation/test splits are available for all 20 languages.

For more details see https://arxiv.org/abs/2210.12391

Supported Tasks and Leaderboards


  • named-entity-recognition: performance on this task is measured with F1 (higher is better). A named entity counts as correct only if it is an exact match (both type and boundaries) of the corresponding entity in the gold data.
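As a sketch of what "exact match" means here, the snippet below extracts entity spans from BIO tag sequences and scores predictions at the span level. The helper names are illustrative (published results typically use a standard library such as seqeval):

```python
def extract_entities(tags):
    """Collect (type, start, end_exclusive) spans from a BIO tag sequence."""
    entities = []
    start = etype = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != etype):
            if etype is not None:
                entities.append((etype, start, i))
            etype, start = tag[2:], i
        elif tag == "O":
            if etype is not None:
                entities.append((etype, start, i))
            etype = start = None
    if etype is not None:
        entities.append((etype, start, len(tags)))
    return entities

def entity_f1(gold, pred):
    """Micro F1 over exact span matches: type and boundaries must both agree."""
    g, p = set(), set()
    for sid, (gt, pt) in enumerate(zip(gold, pred)):
        g |= {(sid,) + e for e in extract_entities(gt)}
        p |= {(sid,) + e for e in extract_entities(pt)}
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For example, predicting `["B-PER", "I-PER", "O", "O"]` against gold `["B-PER", "I-PER", "O", "B-LOC"]` yields precision 1.0 and recall 0.5, so F1 ≈ 0.67; a prediction with a correct type but wrong boundaries would score zero for that entity.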

Languages

There are 20 languages available :

  • Bambara (bam)
  • Ghomala (bbj)
  • Ewe (ewe)
  • Fon (fon)
  • Hausa (hau)
  • Igbo (ibo)
  • Kinyarwanda (kin)
  • Luganda (lug)
  • Dholuo (luo)
  • Mossi (mos)
  • Chichewa (nya)
  • Nigerian Pidgin (pcm)
  • chiShona (sna)
  • Kiswahili (swa)
  • Setswana (tsn)
  • Twi (twi)
  • Wolof (wol)
  • isiXhosa (xho)
  • Yorùbá (yor)
  • isiZulu (zul)

Dataset Structure

Data Instances

The examples look like this for Yorùbá:

from datasets import load_dataset

# Please specify the language code as the second argument.
data = load_dataset('masakhane/masakhaner2', 'yor')

# In the source files, sentences are separated by an empty line and each line
# holds a tab-separated token and tag. A loaded example looks like this (the
# integers in ner_tags index the label list given under "Data Fields"):
{'id': '0',
 'ner_tags': [7, 8, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0],
 'tokens': ['Wákàtí', 'méje', 'ti', 'ré', 'kọjá', 'lọ', 'tí', 'Luis', 'Carlos', 'Díaz', 'ti', 'di', 'awati', '.']
}

Data Fields

  • id : id of the sample
  • tokens : the tokens of the example text
  • ner_tags : the NER tags of each token

The NER tags correspond to this list:

"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE"

In the NER tags, a B denotes the first item of a phrase and an I any non-initial word. There are four types of phrases: person names (PER), organizations (ORG), locations (LOC) and dates & time (DATE).

It is assumed that named entities are non-recursive and non-overlapping. If a named entity is embedded in another named entity, usually only the top-level entity is marked.
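As a minimal sketch, the integer ids in a loaded example can be decoded with the label list above and grouped into entity spans. The tokens and tags below repeat the Yorùbá example from Data Instances; the helper functions are illustrative, not part of the dataset's API:

```python
# Label list from the "Data Fields" section above.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-DATE", "I-DATE"]

def decode_tags(ner_tags):
    """Map integer class ids back to their BIO label strings."""
    return [LABELS[t] for t in ner_tags]

def group_entities(tokens, ner_tags):
    """Group tokens into (entity_text, entity_type) pairs from BIO labels."""
    entities, current, etype = [], [], None
    for token, label in zip(tokens, decode_tags(ner_tags)):
        if label.startswith("B-"):
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [token], label[2:]
        elif label.startswith("I-") and current:
            current.append(token)
        else:  # "O" (or a stray I- tag) closes any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

tokens = ['Wákàtí', 'méje', 'ti', 'ré', 'kọjá', 'lọ', 'tí',
          'Luis', 'Carlos', 'Díaz', 'ti', 'di', 'awati', '.']
ner_tags = [7, 8, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0]
print(group_entities(tokens, ner_tags))
# → [('Wákàtí méje', 'DATE'), ('Luis Carlos Díaz', 'PER')]
```

When the data is loaded with 🤗 Datasets, the same label list is carried in the features, so hard-coding it as above is only for illustration.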

Data Splits

For all languages, there are three splits.

The original splits were named train , dev and test and they correspond to the train , validation and test splits.

The splits have the following sizes :

| Language | train | validation | test |
| --- | --- | --- | --- |
| Bambara | 4463 | 638 | 1274 |
| Ghomala | 3384 | 483 | 966 |
| Ewe | 3505 | 501 | 1001 |
| Fon | 4343 | 621 | 1240 |
| Hausa | 5716 | 816 | 1633 |
| Igbo | 7634 | 1090 | 2181 |
| Kinyarwanda | 7825 | 1118 | 2235 |
| Luganda | 4942 | 706 | 1412 |
| Luo | 5161 | 737 | 1474 |
| Mossi | 4532 | 648 | 1613 |
| Nigerian Pidgin | 5646 | 806 | 1294 |
| Chichewa | 6250 | 893 | 1785 |
| chiShona | 6207 | 887 | 1773 |
| Kiswahili | 6593 | 942 | 1883 |
| Setswana | 3289 | 499 | 996 |
| Akan/Twi | 4240 | 605 | 1211 |
| Wolof | 4593 | 656 | 1312 |
| isiXhosa | 5718 | 817 | 1633 |
| Yorùbá | 6877 | 983 | 1964 |
| isiZulu | 5848 | 836 | 1670 |

Dataset Creation

Curation Rationale

The dataset was created to provide NER resources for 20 African languages that were under-served in natural language processing.


Source Data

The data comes from the news domain; details can be found at https://arxiv.org/abs/2210.12391

Initial Data Collection and Normalization

The articles were word-tokenized; information on the exact pre-processing pipeline is unavailable.

Who are the source language producers?

The source text was produced by journalists and writers employed by the news agencies and newspapers from which the articles were collected.

Annotations

Annotation process

Details can be found here https://arxiv.org/abs/2103.11811

Who are the annotators?

Annotators were recruited from the Masakhane community.

Personal and Sensitive Information

The data is sourced from news articles and only contains mentions of public figures or other individuals appearing in the news.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

Users should keep in mind that the dataset only contains news text, which might limit the applicability of the developed systems to other domains.

Additional Information

Dataset Curators

Licensing Information

The data is licensed under a Creative Commons 4.0 Non-Commercial license (CC BY-NC 4.0).

Citation Information


@article{Adelani2022MasakhaNER2A,
  title={MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition},
  author={David Ifeoluwa Adelani and Graham Neubig and Sebastian Ruder and Shruti Rijhwani and Michael Beukman and Chester Palen-Michel and Constantine Lignos and Jesujoba Oluwadara Alabi and Shamsuddeen Hassan Muhammad and Peter Nabende and Cheikh M. Bamba Dione and Andiswa Bukula and Rooweither Mabuya and Bonaventure F. P. Dossou and Blessing K. Sibanda and Happy Buzaaba and Jonathan Mukiibi and Godson Kalipe and Derguene Mbaye and Amelia Taylor and Fatoumata Kabore and Chris C. Emezue and Anuoluwapo Aremu and Perez Ogayo and Catherine W. Gitau and Edwin Munkoh-Buabeng and Victoire Memdjokam Koagne and Allahsera Auguste Tapo and Tebogo Macucwa and Vukosi Marivate and Elvis Mboning and Tajuddeen R. Gwadabe and Tosin P. Adewumi and Orevaoghene Ahia and Joyce Nakatumba-Nabende and Neo L. Mokono and Ignatius M Ezeani and Chiamaka Ijeoma Chukwuneke and Mofetoluwa Adeyemi and Gilles Hacheme and Idris Abdulmumin and Odunayo Ogundepo and Oreen Yousuf and Tatiana Moteu Ngoli and Dietrich Klakow},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.12391}
}

Contributions

Thanks to @dadelani for adding this dataset.