Dataset:

ai4bharat/Aksharantar

Multilinguality:

multilingual

Language creators:

crowdsourced, expert-generated, machine-generated

Source datasets:

original

arXiv:

arxiv:2205.03018

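Dataset Card for Aksharantar

Dataset Summary

Aksharantar is the largest publicly available transliteration dataset for 20 Indic languages. The corpus contains 26 million Indic language-English transliteration pairs.

Supported Tasks and Leaderboards

[More Information Needed]

Languages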

Assamese (asm)	Hindi (hin)	Maithili (mai)	Marathi (mar)	Punjabi (pan)	Tamil (tam)
Bengali (ben)	Kannada (kan)	Malayalam (mal)	Nepali (nep)	Sanskrit (san)	Telugu (tel)
Bodo (brx)	Kashmiri (kas)	Manipuri (mni)	Oriya (ori)	Sindhi (snd)	Urdu (urd)
Gujarati (guj)	Konkani (kok)

Dataset Structure

Data Instances

A random sample from the Hindi (hin) training set.

{
'unique_identifier': 'hin1241393', 
'native word': 'स्वाभिमानिक', 
'english word': 'swabhimanik', 
'source': 'IndicCorp', 
'score': -0.1028788579
}

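Data Fields

unique_identifier (string): A 3-letter language code followed by a number that is unique within each set (Train, Test, Val).
native word (string): A word in the Indic language.
english word (string): The transliteration of the native word in English (the romanized word).
source (string): The source of the data.
score (num): The character-level log probability of the Indic word given the romanized word, as computed by IndicXlit (model). Pairs with an average threshold of 0.35 are considered valid.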
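
The fields listed here map naturally onto a small typed record. The following is a minimal sketch, assuming records arrive as dictionaries shaped like the Hindi sample under Data Instances. The `Pair` class and `parse_pair` helper are hypothetical names introduced for illustration and are not part of the dataset or of any library. The identifier pattern (three lowercase letters followed by digits) is an assumption inferred from the sample `hin1241393`.

```python
import re
from dataclasses import dataclass

# Assumed identifier shape, inferred from the sample 'hin1241393':
# three lowercase letters followed by one or more digits.
_ID_PATTERN = re.compile(r"^([a-z]{3})(\d+)$")


@dataclass(frozen=True)
class Pair:
    """One transliteration pair (hypothetical container, not a dataset API)."""
    lang: str          # 3-letter language code taken from unique_identifier
    number: int        # numeric part of unique_identifier
    native_word: str   # word in the Indic language
    english_word: str  # romanized transliteration
    source: str        # origin of the pair
    score: float       # character-level log probability from IndicXlit


def parse_pair(record: dict) -> Pair:
    """Convert a raw record dict into a Pair, validating the identifier."""
    identifier = record["unique_identifier"]
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unexpected identifier: {identifier!r}")
    return Pair(
        lang=match.group(1),
        number=int(match.group(2)),
        native_word=record["native word"],
        english_word=record["english word"],
        source=record["source"],
        score=float(record["score"]),
    )


# The sample record from the Data Instances section.
sample = {
    "unique_identifier": "hin1241393",
    "native word": "स्वाभिमानिक",
    "english word": "swabhimanik",
    "source": "IndicCorp",
    "score": -0.1028788579,
}

pair = parse_pair(sample)
print(pair.lang, pair.number, pair.english_word)
```

Note that the field names `native word` and `english word` contain a space, so they have to be accessed with dictionary-style lookups rather than attribute access.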

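For created data sources, depending on the destination/sampling method of a pair in a language, the source is one of:
- Dakshina Dataset
- IndicCorp
- Samanantar
- Wikidata
- Existing sources
- Named Entities Indian (AK-NEI)
- Named Entities Foreign (AK-NEF)
- Data from the uniform sampling method (Ak-Uni)
- Data from the most-frequent-words sampling method (Ak-Freq)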
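
Because `source` records where each pair came from, a per-source tally is a quick way to see how a split is composed. The sketch below makes several assumptions: only the string `'IndicCorp'` is confirmed by the sample record, the other `source` spelling and every row in the toy list are made up for illustration, and `count_by_source` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_source(records: Iterable[Mapping[str, object]]) -> Counter:
    """Count how many records carry each value of the 'source' field."""
    return Counter(str(record["source"]) for record in records)


# Toy rows for illustration only; these are not real dataset entries.
toy_records = [
    {"unique_identifier": "hin1", "source": "IndicCorp"},
    {"unique_identifier": "hin2", "source": "IndicCorp"},
    {"unique_identifier": "hin3", "source": "Dakshina"},  # assumed label spelling
]

counts = count_by_source(toy_records)
for source, n in counts.most_common():
    print(source, n)
```

Running the same function over a real split would reveal the exact `source` strings used in the data, which may differ from the names in the list.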

Data Splits

Subset	asm-en	ben-en	brx-en	guj-en	hin-en	kan-en	kas-en	kok-en	mai-en	mal-en	mni-en	mar-en	nep-en	ori-en	pan-en	san-en	sid-en	tam-en	tel-en	urd-en
Training	179K	1231K	36K	1143K	1299K	2907K	47K	613K	283K	4101K	10K	1453K	2397K	346K	515K	1813K	60K	3231K	2430K	699K
Validation	4K	11K	3K	12K	6K	7K	4K	4K	4K	8K	3K	8K	3K	3K	9K	3K	8K	9K	8K	12K
Test	5531	5009	4136	7768	5693	6396	7707	5093	5512	6911	4925	6573	4133	4256	4316	5334	-	4682	4567	4463
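
The per-language training sizes can be totaled to get a sense of overall scale. This is a minimal sketch with the header and `Training` row copied from the table (values in thousands); the names `parse_thousands`, `training_sizes`, and `total_training` are illustrative.

```python
# Column headers and the Training row, copied from the table.
LANG_PAIRS = (
    "asm-en ben-en brx-en guj-en hin-en kan-en kas-en kok-en mai-en mal-en "
    "mni-en mar-en nep-en ori-en pan-en san-en sid-en tam-en tel-en urd-en"
).split()
TRAINING = (
    "179K 1231K 36K 1143K 1299K 2907K 47K 613K 283K 4101K "
    "10K 1453K 2397K 346K 515K 1813K 60K 3231K 2430K 699K"
).split()


def parse_thousands(cell: str) -> int:
    """Turn a table cell such as '179K' into 179000."""
    if not cell.endswith("K"):
        raise ValueError(f"expected a value ending in 'K', got {cell!r}")
    return int(cell[:-1]) * 1000


training_sizes = dict(zip(LANG_PAIRS, map(parse_thousands, TRAINING)))
total_training = sum(training_sizes.values())
largest = max(training_sizes, key=training_sizes.get)

print(f"total training pairs: about {total_training / 1e6:.1f}M")
print(f"largest training split: {largest}")
```

Since the table rounds each training count to the nearest thousand, the total is approximate.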

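Dataset Creation

Information is provided in the paper, Aksharantar: Towards building open transliteration tools for the next billion users.

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Information is provided in the paper, Aksharantar: Towards building open transliteration tools for the next billion users.

Who are the source language producers?

[More Information Needed]

Annotations

Information is provided in the paper, Aksharantar: Towards building open transliteration tools for the next billion users.

Annotation process

Information is provided in the paper, Aksharantar: Towards building open transliteration tools for the next billion users.

Who are the annotators?

Information is provided in the paper, Aksharantar: Towards building open transliteration tools for the next billion users.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]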

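Licensing Information

This data is released under the following licensing scheme:

Manually collected data: Released under the CC-BY license.
Mined dataset (from Samanantar and IndicCorp): Released under the CC0 license.
Existing sources: Released under the CC0 license.

CC-BY License

CC0 License Statement

We do not own any of the text from which this data has been extracted.
We license the actual packaging of the mined data under the Creative Commons CC0 license (“no rights reserved”).
To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to Aksharantar manually collected data and existing sources.
This work is published from: India.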

Citation Information

@misc{madhani2022aksharantar,
      title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users}, 
      author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2022},
      eprint={2205.03018},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Author:

ai4bharat

Dataset size:

695.36 MB