Dataset:

mlsum

Tasks:

Summarization

Translation

Text Classification

Sub-tasks:

news-articles-summarization multi-class-classification multi-label-classification

Languages:

Multilinguality:

multilingual

Size:

100K<n<1M 10K<n<100K

Language creators:

found

Annotation creators:

found

Source datasets:

extended|cnn_dailymail original

License:

other

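Dataset Card for MLSUM

Dataset Summary

We present MLSUM, the first large-scale multilingual summarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases, which motivate the use of a multilingual dataset.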

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

de

Size of downloaded dataset files: 346.58 MB
Size of the generated dataset: 940.93 MB
Total amount of disk used: 1.29 GB

An example of 'validation' looks as follows.

{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}

es

Size of downloaded dataset files: 513.31 MB
Size of the generated dataset: 1.34 GB
Total amount of disk used: 1.85 GB

An example of 'validation' looks as follows.

{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}

fr

Size of downloaded dataset files: 619.99 MB
Size of the generated dataset: 1.61 GB
Total amount of disk used: 2.23 GB

An example of 'validation' looks as follows.

{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}

ru

Size of downloaded dataset files: 106.22 MB
Size of the generated dataset: 276.17 MB
Total amount of disk used: 382.39 MB

An example of 'train' looks as follows.

{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}

tu

Size of downloaded dataset files: 247.50 MB
Size of the generated dataset: 694.99 MB
Total amount of disk used: 942.48 MB

An example of 'train' looks as follows.

{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
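
Each language is a separate configuration, so a single language can be loaded on its own with the Hugging Face `datasets` library. The following is a minimal sketch rather than a documented procedure: the helper name `load_mlsum` is hypothetical, the identifier `mlsum` is taken from the tag list at the top of this card, and the configuration names are assumed to be the language codes from the Data Splits table. Depending on the installed library version, the identifier or additional loading arguments may differ.

```python
# Language codes used as configuration names (assumed from the Data Splits table).
MLSUM_CONFIGS = ("de", "es", "fr", "ru", "tu")
MLSUM_SPLITS = ("train", "validation", "test")


def load_mlsum(config: str, split: str = "validation"):
    """Load one split of one MLSUM language configuration.

    Hypothetical convenience wrapper around `datasets.load_dataset`.
    """
    if config not in MLSUM_CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {MLSUM_CONFIGS}")
    if split not in MLSUM_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {MLSUM_SPLITS}")

    # Imported lazily so that argument validation works without the library installed.
    from datasets import load_dataset

    # Assumption: the hub identifier is "mlsum", as in this card's tag list.
    return load_dataset("mlsum", config, split=split)
```

For instance, `load_mlsum("ru")` would request the validation split of the smallest configuration; the first call downloads the corresponding files.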

Data Fields

The data fields are the same among all splits and all five configurations.

text: a string feature.
summary: a string feature.
topic: a string feature.
url: a string feature.
title: a string feature.
date: a string feature.
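
Because every record is a flat mapping of six string fields, a quick schema check can catch malformed rows before training. The sketch below is illustrative only: the function name `schema_problems` is hypothetical, and the sample is the placeholder record shown under Data Instances.

```python
from typing import Any, List, Mapping

# The six string fields listed in this card.
MLSUM_FIELDS = ("text", "summary", "topic", "url", "title", "date")


def schema_problems(record: Mapping[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for field in MLSUM_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            found = type(record[field]).__name__
            problems.append(f"field {field} is {found}, expected str")
    # Report keys that are not part of the documented schema.
    for extra in sorted(set(record) - set(MLSUM_FIELDS)):
        problems.append(f"unexpected field: {extra}")
    return problems


# The placeholder record shown under Data Instances.
sample = {
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com",
}

print(schema_problems(sample))  # []
print(schema_problems({"text": "only a text", "date": 2001}))
```

The check deliberately leaves the `date` string unparsed, since this card does not state its format.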

Data Splits

name	train	validation	test
de	220887	11394	10701
es	266367	10358	13920
fr	392902	16059	15828
ru	25556	750	757
tu	249277	11565	12775
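
To compare the configurations at a glance, the counts in the table can be summed per language. This is a small sketch that only restates the table; the variable names are illustrative.

```python
# Example counts copied from the Data Splits table: (train, validation, test).
SPLITS = {
    "de": (220887, 11394, 10701),
    "es": (266367, 10358, 13920),
    "fr": (392902, 16059, 15828),
    "ru": (25556, 750, 757),
    "tu": (249277, 11565, 12775),
}

# Total number of examples per language and across the five configurations.
totals = {lang: sum(counts) for lang, counts in SPLITS.items()}
grand_total = sum(totals.values())

for lang, (train, validation, test) in SPLITS.items():
    share = train / totals[lang]
    print(f"{lang}: {totals[lang]:>7} examples, {share:.1%} in train")
print(f"all: {grand_total} examples")
```

The Russian configuration is roughly an order of magnitude smaller than the other four, which is worth keeping in mind when comparing results across languages.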

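Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed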

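Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Usage of the dataset is restricted to non-commercial research purposes only. Copyright belongs to the original copyright holders. See https://github.com/recitalAI/MLSUM#mlsum

Citation Information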

@article{scialom2020mlsum,
  title={MLSUM: The Multilingual Summarization Corpus},
  author={Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
  journal={arXiv preprint arXiv:2004.14900},
  year={2020}
}

Contributions

Thanks to @RachelKer, @albertvillanova, and @thomwolf for adding this dataset.
