Dataset:

ai4bharat/IndicWikiBio

Languages:

Multilinguality:

multilingual

Size:

size_categories:1960<n<11,502

Language creators:

found

Annotations creators:

no-annotation

Source datasets:

none. Originally generated from www.wikimedia.org.

Preprint:

arxiv:2203.05437

License:

cc-by-nc-4.0

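Dataset Card for "IndicWikiBio"

Dataset Summary

This is the WikiBio dataset released as part of the "IndicNLG Suite". Each example has four fields: id, infobox, serialized infobox, and summary. The dataset covers nine languages: as, bn, hi, kn, ml, or, pa, ta, and te. It contains 57,426 examples in total.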

Supported Tasks and Leaderboards

Tasks: WikiBio

Leaderboards: There is currently no leaderboard for this dataset.

Languages

Assamese (as)
Bengali (bn)
Kannada (kn)
Hindi (hi)
Malayalam (ml)
Oriya (or)
Punjabi (pa)
Tamil (ta)
Telugu (te)

Dataset Structure

Data Instances

Below is a randomly chosen example from the hi dataset, in JSON format.

{
"id": 26, 
"infobox": "name_1:सी॰\tname_2:एल॰\tname_3:रुआला\toffice_1:सांसद\toffice_2:-\toffice_3:मिजोरम\toffice_4:लोक\toffice_5:सभा\toffice_6:निर्वाचन\toffice_7:क्षेत्र\toffice_8:।\toffice_9:मिजोरम\tterm_1:2014\tterm_2:से\tterm_3:2019\tnationality_1:भारतीय", 
"serialized_infobox": "<TAG> name </TAG> सी॰ एल॰ रुआला <TAG> office </TAG> सांसद - मिजोरम लोक सभा निर्वाचन क्षेत्र । मिजोरम <TAG> term </TAG> 2014 से 2019 <TAG> nationality </TAG> भारतीय", 
"summary": "सी॰ एल॰ रुआला भारत की सोलहवीं लोक सभा के सांसद हैं।"
}
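
The sample above shows `id` as a JSON number, while the field descriptions in this card list `id` as a string. A consumer that wants one consistent type could normalize records on load. The following is a minimal sketch under that assumption; `normalize_record`, `REQUIRED_FIELDS`, and the short ASCII record are hypothetical and not part of the dataset's tooling.

```python
import json

# The four fields this card lists for every example.
REQUIRED_FIELDS = ("id", "infobox", "serialized_infobox", "summary")


def normalize_record(line: str) -> dict[str, str]:
    """Parse one JSON record and return its four fields as strings."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    # Cast every field to str so a numeric id and a string id look the same.
    return {name: str(record[name]) for name in REQUIRED_FIELDS}


# Hypothetical record in the same layout as the sample (values shortened).
raw = (
    '{"id": 26, "infobox": "name_1:A\\tname_2:B", '
    '"serialized_infobox": "<TAG> name </TAG> A B", "summary": "A B."}'
)
record = normalize_record(raw)
print(record["id"], type(record["id"]).__name__)
```

Casting at the boundary keeps downstream code independent of how a particular file or loader happens to encode the identifier.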

Data Fields

id (string) : Unique identifier.
infobox (string) : Raw infobox.
serialized_infobox (string) : Serialized infobox, used as the input.
summary (string) : Summary of the infobox, taken from the first line of the Wikipedia page.
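
Judging from the sample record, `serialized_infobox` looks like it can be derived from `infobox` by grouping the tab-separated `field_N:token` items by field name and wrapping each field name in `<TAG> ... </TAG>`. The sketch below illustrates that reading. It is inferred from the single example in this card, not from the authors' code, and the helper names `parse_infobox` and `serialize_infobox` are hypothetical.

```python
import re


def parse_infobox(infobox: str) -> dict[str, list[str]]:
    """Group tab-separated `field_N:token` items by field name, keeping order."""
    fields: dict[str, list[str]] = {}
    for item in infobox.split("\t"):
        if not item:
            continue  # skip empty pieces, e.g. from an empty infobox
        key, _, token = item.partition(":")  # split on the first colon only
        name = re.sub(r"_\d+$", "", key)     # "office_3" -> "office"
        fields.setdefault(name, []).append(token)
    return fields


def serialize_infobox(fields: dict[str, list[str]]) -> str:
    """Assumed layout: '<TAG> field </TAG> tokens', all joined by spaces."""
    return " ".join(
        f"<TAG> {name} </TAG> {' '.join(tokens)}" for name, tokens in fields.items()
    )


# Shortened, hypothetical infobox in the same layout as the sample record.
infobox = "name_1:C.\tname_2:L.\tname_3:Ruala\tterm_1:2014\tterm_2:to\tterm_3:2019"
fields = parse_infobox(infobox)
print(fields["name"])
print(serialize_infobox(fields))
```

Keeping the parsed form as a mapping from field name to token list makes it easy to drop or reorder fields before serializing, if a model's input budget requires it.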

Data Splits

The number of examples in each split, for every language, is given below.

| Language | ISO 639-1 Code | Train | Test | Validation |
| ---------- | ---------- | ---------- | ---------- | ---------- |
| Assamese | as | 1,300 | 391 | 381 |
| Bengali | bn | 4,615 | 1,521 | 1,567 |
| Hindi | hi | 5,684 | 1,919 | 1,853 |
| Kannada | kn | 1,188 | 389 | 383 |
| Malayalam | ml | 5,620 | 1,835 | 1,896 |
| Oriya | or | 1,687 | 558 | 515 |
| Punjabi | pa | 3,796 | 1,227 | 1,331 |
| Tamil | ta | 8,169 | 2,701 | 2,632 |
| Telugu | te | 2,594 | 854 | 820 |
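
As a quick sanity check, the split counts can be summed per split and per language and compared with the 57,426 total stated in the summary. The numbers below are copied from the table in this card; the variable names are only illustrative.

```python
# (train, test, validation) counts per language, copied from the splits table.
SPLITS = {
    "as": (1300, 391, 381),
    "bn": (4615, 1521, 1567),
    "hi": (5684, 1919, 1853),
    "kn": (1188, 389, 383),
    "ml": (5620, 1835, 1896),
    "or": (1687, 558, 515),
    "pa": (3796, 1227, 1331),
    "ta": (8169, 2701, 2632),
    "te": (2594, 854, 820),
}

# Column-wise sums: total train, test, and validation examples.
train_total, test_total, validation_total = (
    sum(counts[i] for counts in SPLITS.values()) for i in range(3)
)
# Row-wise sums: total examples per language.
per_language = {lang: sum(counts) for lang, counts in SPLITS.items()}
grand_total = sum(per_language.values())

print(train_total, test_total, validation_total)
print(per_language)
assert grand_total == 57_426  # agrees with the total given in the summary
```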

Dataset Creation

Curation Rationale

Detailed in the paper

Source Data

None

Initial Data Collection and Normalization

Detailed in the paper

Who are the source language producers?

Detailed in the paper

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

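Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]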

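Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Citation Information

If you use any of the datasets, models, or code modules, please cite the following paper: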

@inproceedings{Kumar2022IndicNLGSM,
  title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages},
  author={Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar},
  year={2022},
  url={https://arxiv.org/abs/2203.05437}
}

Contributions

Detailed in the paper

Author:

ai4bharat

Dataset size:

42.81 MB