Dataset:

ai4bharat/IndicSentenceSummarization

Languages:

Multilinguality:

multilingual

Size:

size_categories:5K<n<112K

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original for Hindi, and modified [IndicGLUE](https

ArXiv:

arxiv:2203.05437

License:

cc-by-nc-4.0

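Dataset Card for "IndicSentenceSummarization"

Dataset Summary

IndicSentenceSummarization is the sentence summarization dataset released as part of the IndicNLG Suite. Each input sentence is paired with an output that serves as its summary. The dataset covers eleven languages: as, bn, gu, hi, kn, ml, mr, or, pa, ta, and te. The total size of the dataset is about 431K examples.

Supported Tasks and Leaderboards

Tasks: Sentence Summarization

Leaderboards: Currently there is no leaderboard for this dataset.

Languages

Assamese (as)
Bengali (bn)
Gujarati (gu)
Kannada (kn)
Hindi (hi)
Malayalam (ml)
Marathi (mr)
Oriya (or)
Punjabi (pa)
Tamil (ta)
Telugu (te)

Dataset Structure

Data Instances

One random example from the hi dataset is given below in JSON format.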

{'id': '5',
 'input': 'जम्मू एवं कश्मीर के अनंतनाग जिले में शनिवार को सुरक्षाबलों के साथ मुठभेड़ में दो आतंकवादियों को मार गिराया गया।',
 'target': 'जम्मू-कश्मीर : सुरक्षाबलों के साथ मुठभेड़ में 2 आतंकवादी ढेर',
 'url': 'https://www.indiatv.in/india/national-jammu-kashmir-two-millitant-killed-in-encounter-with-security-forces-574529'
}

Data Fields

id (string) : Unique identifier.
input (string) : Input sentence.
target (string) : Output summary.
url (string) : Source web link of the sentence.
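
As a minimal sketch of the record layout described above, the four fields can be modeled as a small typed record and checked before use. The names `SummaryRecord` and `parse_record` are hypothetical helpers introduced here for illustration; they are not part of the dataset or of any library. The sample values are copied from the Hindi instance shown on this card, and the whitespace token count is only a rough length proxy, not the tokenization used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryRecord:
    """One row of the dataset, mirroring the four documented fields."""
    id: str
    input: str
    target: str
    url: str


def parse_record(raw: dict) -> SummaryRecord:
    """Check that a raw dict carries the documented string fields."""
    expected = ("id", "input", "target", "url")
    missing = [key for key in expected if key not in raw]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    for key in expected:
        if not isinstance(raw[key], str):
            raise TypeError(f"field {key!r} must be a string")
    # Extra keys, if any, are ignored rather than rejected.
    return SummaryRecord(**{key: raw[key] for key in expected})


# The Hindi example from this card.
example = {
    "id": "5",
    "input": "जम्मू एवं कश्मीर के अनंतनाग जिले में शनिवार को सुरक्षाबलों के साथ मुठभेड़ में दो आतंकवादियों को मार गिराया गया।",
    "target": "जम्मू-कश्मीर : सुरक्षाबलों के साथ मुठभेड़ में 2 आतंकवादी ढेर",
    "url": "https://www.indiatv.in/india/national-jammu-kashmir-two-millitant-killed-in-encounter-with-security-forces-574529",
}

record = parse_record(example)

# Whitespace-separated token counts as a rough measure of compression.
n_input = len(record.input.split())
n_target = len(record.target.split())
print(f"{n_input} input tokens -> {n_target} summary tokens")
```

Note that `id` is a string, not an integer, so comparisons such as `record.id == 5` would be false.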

Data Splits

The number of examples in each split, per language, is given below.

Language | ISO 639-1 Code | Train | Validation | Test |
---------- | ---------- | ---------- | ---------- | ---------- |
Assamese | as | 10,812 | 5,232 | 5,452 |
Bengali | bn | 17,035 | 2,355 | 2,384 |
Gujarati | gu | 54,788 | 8,720 | 8,460 |
Hindi | hi | 78,876 | 16,935 | 16,835 |
Kannada | kn | 61,220 | 9,024 | 1,485 |
Malayalam | ml | 2,855 | 1,520 | 1,580 |
Marathi | mr | 27,066 | 3,249 | 3,309 |
Oriya | or | 12,065 | 1,539 | 1,440 |
Punjabi | pa | 31,630 | 4,004 | 3,967 |
Tamil | ta | 23,098 | 2,874 | 2,948 |
Telugu | te | 7,119 | 878 | 862 |
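
The per-language counts above can be totaled to cross-check the overall dataset size. The sketch below simply hard-codes the numbers from the table; the names `SPLITS` and `split_totals` are hypothetical and exist only for this illustration.

```python
# (train, validation, test) counts copied from the splits table.
SPLITS = {
    "as": (10_812, 5_232, 5_452),
    "bn": (17_035, 2_355, 2_384),
    "gu": (54_788, 8_720, 8_460),
    "hi": (78_876, 16_935, 16_835),
    "kn": (61_220, 9_024, 1_485),
    "ml": (2_855, 1_520, 1_580),
    "mr": (27_066, 3_249, 3_309),
    "or": (12_065, 1_539, 1_440),
    "pa": (31_630, 4_004, 3_967),
    "ta": (23_098, 2_874, 2_948),
    "te": (7_119, 878, 862),
}


def split_totals(splits):
    """Sum each split column across all languages."""
    train = sum(counts[0] for counts in splits.values())
    validation = sum(counts[1] for counts in splits.values())
    test = sum(counts[2] for counts in splits.values())
    return train, validation, test


train, validation, test = split_totals(SPLITS)
total = train + validation + test

# Total examples per language, across all three splits.
per_language = {code: sum(counts) for code, counts in SPLITS.items()}

print(f"train={train:,} validation={validation:,} test={test:,} total={total:,}")
print(f"smallest: {min(per_language, key=per_language.get)}, "
      f"largest: {max(per_language, key=per_language.get)}")
```

The grand total comes to 431,616 examples, which lines up with the 431K figure in the summary. Malayalam is the smallest language (5,955 examples) and Hindi the largest (112,646), roughly matching the 5K<n<112K size category in the card metadata.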

Dataset Creation

Curation Rationale

Detailed in the paper

Source Data

This is a modified subset of the IndicHeadlineGeneration dataset.

Initial Data Collection and Normalization

Detailed in the paper

Who are the source language producers?

Detailed in the paper

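Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]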

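Licensing Information

Contents of this repository are restricted to non-commercial research purposes under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use any of the datasets, models or code modules, please cite the following paper: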

@inproceedings{Kumar2022IndicNLGSM,
  title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages},
  author={Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar},
  year={2022},
  url = "https://arxiv.org/abs/2203.05437"
}

Contributions

Detailed in the paper

Author:

ai4bharat

Dataset size:

175.85 MB