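Dataset:
masakhane/masakhanews
Tasks:
Text Classification
Sub-tasks:
topic-classification
Multilinguality:
multilingual
Size:
1K<n<10K
Language creators:
expert-generated
Annotations creators:
expert-generated
Source datasets:
original
License:
afl-3.0

MasakhaNEWS is the largest publicly available dataset for news topic classification in 16 languages widely spoken in Africa.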
The train/validation/test sets are available for all 16 languages.
[More Information Needed]
There are 16 languages available:
An example for Yoruba looks like this:

    from datasets import load_dataset

    # Specify the language code as the second argument ('yor' is Yoruba).
    data = load_dataset('masakhane/masakhanews', 'yor')

    # A data point example is below:
    {
        'label': 0,
        'headline': "'The barriers to entry have gone - go for it now'",
        'text': "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw",
        'headline_text': "'The barriers to entry have gone - go for it now' j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw",
        'url': '/news/business-61880859'
    }
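
To make the relationship between the fields of the record above concrete, here is a minimal sketch. It assumes, based only on this single example, that `headline_text` is the headline and the article text joined by one space. The helper `build_headline_text` is hypothetical and is not part of the dataset or of the `datasets` library.

```python
# Fields copied from the example data point above ('url' omitted).
record = {
    "label": 0,
    "headline": "'The barriers to entry have gone - go for it now'",
    "text": "j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw",
    "headline_text": "'The barriers to entry have gone - go for it now' j Lalvani, CEO of Vitabiotics and former Dragons' Den star, shares his business advice for our CEO Secrets series.\nProduced, filmed and edited by Dougal Shaw",
}


def build_headline_text(headline: str, text: str) -> str:
    """Hypothetical helper: join headline and body with a single space.

    Assumption: this mirrors how 'headline_text' was built; it is inferred
    from the one example record, not from documentation.
    """
    return f"{headline} {text}"


# Check the assumption against the example record.
matches = build_headline_text(record["headline"], record["text"]) == record["headline_text"]
print("headline_text == headline + ' ' + text:", matches)
```

If the assumption holds across the data, a classifier can be trained on `headline`, `text`, or `headline_text` without rebuilding the combined field by hand.
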
The news topics correspond to this list:
"business", "entertainment", "health", "politics", "religion", "sports", "technology"
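
The integer `label` field can be turned back into a topic name with a small lookup. The sketch below assumes that label ids follow the order of the list above, which is consistent with the example record (label `0`, URL under `/news/business-...`) but is not stated explicitly in this card. The names `LABEL_NAMES`, `label_to_topic`, and `topic_to_label` are hypothetical.

```python
# Topic names in the order listed above.
# Assumption: label id i corresponds to LABEL_NAMES[i].
LABEL_NAMES = [
    "business",
    "entertainment",
    "health",
    "politics",
    "religion",
    "sports",
    "technology",
]


def label_to_topic(label: int) -> str:
    """Map an integer label to its topic name, rejecting out-of-range ids."""
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError(f"label must be in 0..{len(LABEL_NAMES) - 1}, got {label}")
    return LABEL_NAMES[label]


def topic_to_label(topic: str) -> int:
    """Map a topic name back to its integer label."""
    return LABEL_NAMES.index(topic)


print(label_to_topic(0))
print(topic_to_label("sports"))
```

When the dataset is loaded with `datasets`, the label column may carry its own name list (for a `ClassLabel` feature, `features['label'].names`); if so, that list should take precedence over the hard-coded one here.
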
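For all languages, there are three splits.
The original splits were named `train`, `dev` and `test`, and they correspond to the `train`, `validation` and `test` splits.
The splits have the following sizes: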
Language | train | validation | test |
---|---|---|---|
Amharic | 1311 | 188 | 376 |
English | 3309 | 472 | 948 |
French | 1476 | 211 | 422 |
Hausa | 2219 | 317 | 637 |
Igbo | 1356 | 194 | 390 |
Lingala | 608 | 87 | 175 |
Luganda | 771 | 110 | 223 |
Oromo | 1015 | 145 | 292 |
Nigerian-Pidgin | 1060 | 152 | 305 |
Rundi | 1117 | 159 | 322 |
chiShona | 1288 | 185 | 369 |
Somali | 1021 | 148 | 294 |
Kiswahili | 1658 | 237 | 476 |
Tigrinya | 947 | 137 | 272 |
isiXhosa | 1032 | 147 | 297 |
Yoruba | 1433 | 206 | 411 |
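
The split sizes in the table can be checked with a few lines of plain Python. The sketch below copies the numbers from the table and computes per-language totals and split proportions; the names `SPLIT_SIZES`, `total`, and `fractions` are hypothetical.

```python
# (train, validation, test) sizes copied from the table above.
SPLIT_SIZES = {
    "Amharic": (1311, 188, 376),
    "English": (3309, 472, 948),
    "French": (1476, 211, 422),
    "Hausa": (2219, 317, 637),
    "Igbo": (1356, 194, 390),
    "Lingala": (608, 87, 175),
    "Luganda": (771, 110, 223),
    "Oromo": (1015, 145, 292),
    "Nigerian-Pidgin": (1060, 152, 305),
    "Rundi": (1117, 159, 322),
    "chiShona": (1288, 185, 369),
    "Somali": (1021, 148, 294),
    "Kiswahili": (1658, 237, 476),
    "Tigrinya": (947, 137, 272),
    "isiXhosa": (1032, 147, 297),
    "Yoruba": (1433, 206, 411),
}


def total(language: str) -> int:
    """Total number of examples for one language across all three splits."""
    return sum(SPLIT_SIZES[language])


def fractions(language: str) -> tuple:
    """Share of train, validation and test examples for one language."""
    n = total(language)
    return tuple(size / n for size in SPLIT_SIZES[language])


for language in SPLIT_SIZES:
    train, validation, test = fractions(language)
    print(f"{language:16s} total={total(language):5d} "
          f"train={train:.3f} validation={validation:.3f} test={test:.3f}")
```

With the numbers above, every language comes out at roughly 70% train, 10% validation and 20% test; English has the most examples and Lingala the fewest.
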
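The dataset was introduced to provide new resources for 20 languages that are under-served in natural language processing.
[More Information Needed]
The data comes from the news domain; details can be found here ****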
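Initial data collection and normalization
The articles were word-tokenized; information on the exact pre-processing pipeline is unavailable.
Who are the source language producers?
The source language was produced by journalists and writers employed by the news agencies and newspapers mentioned above.
Details can be found here **
Who are the annotators?
Annotators were recruited from Masakhane.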
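The data is sourced from newspapers and only contains mentions of public figures or individuals.
[More Information Needed]
[More Information Needed]
Users should keep in mind that the dataset only contains news text, which might limit the applicability of systems developed on it to other domains.
The licensing status of the data is CC 4.0 Non-Commercial.
Formatted citation for the dataset: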

    @article{Adelani2023MasakhaNEWS,
      title={MasakhaNEWS: News Topic Classification for African languages},
      author={David Ifeoluwa Adelani and Marek Masiak and Israel Abebe Azime and Jesujoba Oluwadara Alabi and Atnafu Lambebo Tonja and Christine Mwase and Odunayo Ogundepo and Bonaventure F. P. Dossou and Akintunde Oladipo and Doreen Nixdorf and Chris Chinenye Emezue and Sana Sabah al-azzawi and Blessing K. Sibanda and Davis David and Lolwethu Ndolela and Jonathan Mukiibi and Tunde Oluwaseyi Ajayi and Tatiana Moteu Ngoli and Brian Odhiambo and Abraham Toluwase Owodunni and Nnaemeka C. Obiefuna and Shamsuddeen Hassan Muhammad and Saheed Salahudeen Abdullahi and Mesay Gemeda Yigezu and Tajuddeen Gwadabe and Idris Abdulmumin and Mahlet Taye Bame and Oluwabusayo Olufunke Awoyomi and Iyanuoluwa Shode and Tolulope Anu Adelani and Habiba Abdulganiy Kailani and Abdul-Hakeem Omotayo and Adetola Adeeko and Afolabi Abeeb and Anuoluwapo Aremu and Olanrewaju Samuel and Clemencia Siro and Wangari Kimotho and Onyekachi Raphael Ogbu and Chinedu E. Mbonu and Chiamaka I. Chukwuneke and Samuel Fanijo and Jessica Ojo and Oyinkansola F. Awosan and Tadesse Kebede Guge and Sakayo Toadoum Sari and Pamela Nyatsine and Freedmore Sidume and Oreen Yousuf and Mardiyyah Oduwole and Ussen Kimanuka and Kanda Patrick Tshinu and Thina Diko and Siyanda Nxakama and Abdulmejid Tuni Johar and Sinodos Gebre and Muhidin Mohamed and Shafie Abdi Mohamed and Fuad Mire Hassan and Moges Ahmed Mehamed and Evrard Ngabire and Pontus Stenetorp},
      journal={ArXiv},
      year={2023},
      volume={}
    }

Thanks to @dadelani for adding this dataset.