Dataset:

ai4bharat/samanantar

Tasks:

Text Generation

Translation

Languages:

Multilinguality:

translation

Size:

size_categories:unknown

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original

Preprint:

arxiv:2104.05596

Other:

conditional-text-generation

License:

cc-by-nc-4.0

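Dataset Card for Samanantar

Dataset Summary

Samanantar is the largest publicly available collection of parallel corpora for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

The corpus contains 49.6 million sentence pairs between English and the Indic languages.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Samanantar contains parallel sentences between English (en) and 11 Indic languages:

Assamese (as)
Bengali (bn)
Gujarati (gu)
Hindi (hi)
Kannada (kn)
Malayalam (ml)
Marathi (mr)
Oriya (or)
Punjabi (pa)
Tamil (ta)
Telugu (te)

Dataset Structure

Data Instances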
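
The language codes above can be kept in a small lookup table when selecting a language pair. The following is a minimal sketch, not the maintainers' loading code: the helper `load_pair` is hypothetical, and it assumes that each language pair is exposed as a configuration named after its language code and that the Hugging Face `datasets` library is installed.

```python
# Language codes listed in the card, mapped to their English names.
INDIC_LANGUAGES = {
    "as": "Assamese",
    "bn": "Bengali",
    "gu": "Gujarati",
    "hi": "Hindi",
    "kn": "Kannada",
    "ml": "Malayalam",
    "mr": "Marathi",
    "or": "Oriya",
    "pa": "Punjabi",
    "ta": "Tamil",
    "te": "Telugu",
}


def load_pair(lang_code: str):
    """Load the English/<lang_code> pairs.

    Assumption: the configuration name equals the language code.
    """
    if lang_code not in INDIC_LANGUAGES:
        raise ValueError(f"unsupported language code: {lang_code!r}")
    # Imported lazily so the table above is usable without the library.
    from datasets import load_dataset

    return load_dataset("ai4bharat/samanantar", lang_code)
```

Calling `load_pair("bn")` would then request the English/Bengali pairs, if the assumption about configuration names holds.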

{
  'idx': 0,
  'src': 'Prime Minister Narendra Modi met Her Majesty Queen Maxima of the Kingdom of the Netherlands today.',
  'tgt': 'নতুন দিল্লিতে সোমবার প্রধানমন্ত্রী শ্রী নরেন্দ্র মোদীর সঙ্গে নেদারন্যান্ডসের মহারানী ম্যাক্সিমা সাক্ষাৎ করেন।',
  'data_source': 'pmi'
}
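
To make the shape of an instance explicit, the record above can be described with a typed dictionary. This is an illustrative sketch only: `SamanantarRecord` and `is_valid_record` are hypothetical names, not part of the dataset or of any library.

```python
from typing import TypedDict


class SamanantarRecord(TypedDict):
    """Shape of one instance, following the example in the card."""
    idx: int
    src: str
    tgt: str
    data_source: str


def is_valid_record(record: dict) -> bool:
    """Return True if the record has exactly the four fields with matching types."""
    expected = {"idx": int, "src": str, "tgt": str, "data_source": str}
    return set(record) == set(expected) and all(
        isinstance(record[key], kind) for key, kind in expected.items()
    )


# The instance shown in the card, reproduced as a Python dict.
example: SamanantarRecord = {
    "idx": 0,
    "src": "Prime Minister Narendra Modi met Her Majesty Queen Maxima of the Kingdom of the Netherlands today.",
    "tgt": "নতুন দিল্লিতে সোমবার প্রধানমন্ত্রী শ্রী নরেন্দ্র মোদীর সঙ্গে নেদারন্যান্ডসের মহারানী ম্যাক্সিমা সাক্ষাৎ করেন।",
    "data_source": "pmi",
}
```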

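Data Fields

idx (int): ID.
src (string): Sentence in the source language (English).
tgt (string): Sentence in the target language (one of the 11 Indic languages).
data_source (string): Source of the data. For created data sources, depending on the target language, it may be one of: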
- anuvaad_catchnews
- anuvaad_DD_National
- anuvaad_DD_sports
- anuvaad_drivespark
- anuvaad_dw
- anuvaad_financialexpress
- anuvaad-general_corpus
- anuvaad_goodreturns
- anuvaad_indianexpress
- anuvaad_mykhel
- anuvaad_nativeplanet
- anuvaad_newsonair
- anuvaad_nouns_dictionary
- anuvaad_ocr
- anuvaad_oneindia
- anuvaad_pib
- anuvaad_pib_archives
- anuvaad_prothomalo
- anuvaad_timesofindia
- asianetnews
- betterindia
- bridge
- business_standard
- catchnews
- coursera
- dd_national
- dd_sports
- dwnews
- drivespark
- fin_express
- goodreturns
- gu_govt
- jagran-business
- jagran-education
- jagran-sports
- ie_business
- ie_education
- ie_entertainment
- ie_general
- ie_lifestyle
- ie_news
- ie_sports
- ie_tech
- indiccorp
- jagran-entertainment
- jagran-lifestyle
- jagran-news
- jagran-tech
- khan_academy
- Kurzgesagt
- marketfeed
- mykhel
- nativeplanet
- nptel
- ocr
- oneindia
- pa_govt
- pmi
- pranabmukherjee
- sakshi
- sentinel
- thewire
- toi
- tribune
- vsauce
- wikipedia
- zeebiz
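
Because every record carries a `data_source` value, records can be counted or filtered by origin. The sketch below works on plain dictionaries with the fields described above; the helper names and the sample records (with placeholder sentences) are hypothetical and used for illustration only.

```python
from collections import Counter
from typing import Iterable


def count_by_source(records: Iterable[dict]) -> Counter:
    """Count how many records come from each data_source value."""
    return Counter(record["data_source"] for record in records)


def filter_by_source(records: Iterable[dict], sources: set) -> list:
    """Keep only the records whose data_source is in the given set."""
    return [record for record in records if record["data_source"] in sources]


# Hypothetical sample records; the sentences are placeholders, not real data.
sample = [
    {"idx": 0, "src": "<english sentence>", "tgt": "<target sentence>", "data_source": "pmi"},
    {"idx": 1, "src": "<english sentence>", "tgt": "<target sentence>", "data_source": "wikipedia"},
    {"idx": 2, "src": "<english sentence>", "tgt": "<target sentence>", "data_source": "pmi"},
]

counts = count_by_source(sample)
pmi_only = filter_by_source(sample, {"pmi"})
```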

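Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Creative Commons Attribution-NonCommercial 4.0 International.

Citation Information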

@misc{ramesh2021samanantar,
      title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
      author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2021},
      eprint={2104.05596},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @albertvillanova for adding this dataset.

Author:

ai4bharat

Dataset size:

10.79 KB