Dataset:

csebuetnlp/xlsum

Tasks:

Summarization

Text Generation

Multilinguality:

multilingual

Size:

1M<n<10M

Language Creators:

found

Annotations Creators:

found

Source Datasets:

original

ArXiv:

arxiv:1607.01759

Other:

conditional-text-generation

License:

cc-by-nc-sa-4.0

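Dataset Card for "XL-Sum"

Dataset Summary

We present XL-Sum, a dataset of 1.35 million professionally annotated article-summary pairs from the BBC, extracted using a set of carefully designed heuristics. The dataset covers 45 languages ranging from low-resource to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.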

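Supported Tasks and Leaderboards

More information needed

Languages

Amharic
Arabic
Azerbaijani
Bengali
Burmese
Chinese (Simplified)
Chinese (Traditional)
English
French
Gujarati
Hausa
Hindi
Igbo
Indonesian
Japanese
Kirundi
Korean
Kyrgyz
Marathi
Nepali
Oromo
Pashto
Persian
Pidgin
Portuguese
Punjabi
Russian
Scottish Gaelic
Serbian (Cyrillic)
Serbian (Latin)
Sinhala
Somali
Spanish
Swahili
Tamil
Telugu
Thai
Tigrinya
Turkish
Ukrainian
Urdu
Uzbek
Vietnamese
Welsh
Yoruba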

Dataset Structure

Data Instances

An example from the English dataset is given below in JSON format.

{
  "id": "technology-17657859",
  "url": "https://www.bbc.com/news/technology-17657859",
  "title": "Yahoo files e-book advert system patent applications",
  "summary": "Yahoo has signalled it is investigating e-book adverts as a way to stimulate its earnings.",
  "text": "Yahoo's patents suggest users could weigh the type of ads against the sizes of discount before purchase. It says in two US patent applications that ads for digital book readers have been \"less than optimal\" to date. The filings suggest that users could be offered titles at a variety of prices depending on the ads' prominence They add that the products shown could be determined by the type of book being read, or even the contents of a specific chapter, phrase or word. The paperwork was published by the US Patent and Trademark Office late last week and relates to work carried out at the firm's headquarters in Sunnyvale, California. \"Greater levels of advertising, which may be more valuable to an advertiser and potentially more distracting to an e-book reader, may warrant higher discounts,\" it states. Free books It suggests users could be offered ads as hyperlinks based within the book's text, in-laid text or even \"dynamic content\" such as video. Another idea suggests boxes at the bottom of a page could trail later chapters or quotes saying \"brought to you by Company A\". It adds that the more willing the customer is to see the ads, the greater the potential discount. \"Higher frequencies... may even be great enough to allow the e-book to be obtained for free,\" it states. The authors write that the type of ad could influence the value of the discount, with \"lower class advertising... such as teeth whitener advertisements\" offering a cheaper price than \"high\" or \"middle class\" adverts, for things like pizza. The inventors also suggest that ads could be linked to the mood or emotional state the reader is in as a they progress through a title. For example, they say if characters fall in love or show affection during a chapter, then ads for flowers or entertainment could be triggered. The patents also suggest this could applied to children's books - giving the Tom Hanks animated film Polar Express as an example. It says a scene showing a waiter giving the protagonists hot drinks \"may be an excellent opportunity to show an advertisement for hot cocoa, or a branded chocolate bar\". Another example states: \"If the setting includes young characters, a Coke advertisement could be provided, inviting the reader to enjoy a glass of Coke with his book, and providing a graphic of a cool glass.\" It adds that such targeting could be further enhanced by taking account of previous titles the owner has bought. 'Advertising-free zone' At present, several Amazon and Kobo e-book readers offer full-screen adverts when the device is switched off and show smaller ads on their menu screens, but the main text of the titles remains free of marketing. Yahoo does not currently provide ads to these devices, and a move into the area could boost its shrinking revenues. However, Philip Jones, deputy editor of the Bookseller magazine, said that the internet firm might struggle to get some of its ideas adopted. \"This has been mooted before and was fairly well decried,\" he said. \"Perhaps in a limited context it could work if the merchandise was strongly related to the title and was kept away from the text. \"But readers - particularly parents - like the fact that reading is an advertising-free zone. Authors would also want something to say about ads interrupting their narrative flow.\""
}

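Data Fields

'id': A string representing the article ID.
'url': A string representing the article URL.
'title': A string containing the article title.
'summary': A string containing the article summary.
'text': A string containing the article text.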
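
The five fields can be mirrored in a small typed container, which makes it easy to reject malformed rows before training. The following is only a minimal sketch: the names `XLSumRecord` and `parse_record` are hypothetical helpers introduced for illustration and are not part of the dataset or of any library. It assumes each row arrives as an already decoded JSON object.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class XLSumRecord:
    """One article-summary pair with the five string fields of the card."""
    id: str
    url: str
    title: str
    summary: str
    text: str


def parse_record(raw: dict) -> XLSumRecord:
    """Build a record from a decoded JSON object, checking names and types."""
    expected = [f.name for f in fields(XLSumRecord)]
    missing = [name for name in expected if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for name in expected:
        if not isinstance(raw[name], str):
            raise TypeError(f"field {name!r} must be a string")
    # Extra keys are ignored; only the five documented fields are kept.
    return XLSumRecord(**{name: raw[name] for name in expected})


# Abbreviated copy of the English instance ("text" is shortened here).
example = parse_record({
    "id": "technology-17657859",
    "url": "https://www.bbc.com/news/technology-17657859",
    "title": "Yahoo files e-book advert system patent applications",
    "summary": "Yahoo has signalled it is investigating e-book adverts "
               "as a way to stimulate its earnings.",
    "text": "Yahoo's patents suggest users could weigh the type of ads "
            "against the sizes of discount before purchase.",
})
print(example.id, "|", example.title)
```

Making the dataclass frozen is a design choice for this sketch: records are treated as read-only once validated.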

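Data Splits

We used an 80%-10%-10% split for all languages, with a few exceptions. English was split 93%-3.5%-3.5% so that its evaluation sets resemble those of CNN/DM and XSum in size. Scottish Gaelic, Kyrgyz and Sinhala had relatively few samples, so their evaluation sets were increased to 500 samples each for more reliable evaluation. To prevent data leakage in multilingual training, the same articles were used for evaluation in the two variants of Chinese and in the two variants of Serbian. The train-dev-test example counts for each language are given below: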

Language	ISO 639-1 Code	Train	Dev	Test	Total
Amharic	am	5761	719	719	7199
Arabic	ar	37519	4689	4689	46897
Azerbaijani	az	6478	809	809	8096
Bengali	bn	8102	1012	1012	10126
Burmese	my	4569	570	570	5709
Chinese (Simplified)	zh-CN	37362	4670	4670	46702
Chinese (Traditional)	zh-TW	37373	4670	4670	46713
English *	en	306522	11535	11535	329592
French	fr	8697	1086	1086	10869
Gujarati	gu	9119	1139	1139	11397
Hausa	ha	6418	802	802	8022
Hindi	hi	70778	8847	8847	88472
Igbo	ig	4183	522	522	5227
Indonesian	id	38242	4780	4780	47802
Japanese	ja	7113	889	889	8891
Kirundi	rn	5746	718	718	7182
Korean	ko	4407	550	550	5507
Kyrgyz	ky	2266	500	500	3266
Marathi	mr	10903	1362	1362	13627
Nepali	ne	5808	725	725	7258
Oromo	om	6063	757	757	7577
Pashto	ps	14353	1794	1794	17941
Persian	fa	47251	5906	5906	59063
Pidgin **	n/a	9208	1151	1151	11510
Portuguese	pt	57402	7175	7175	71752
Punjabi	pa	8215	1026	1026	10267
Russian *	ru	62243	7780	7780	77803
Scottish Gaelic	gd	1313	500	500	2313
Serbian (Cyrillic)	sr	7275	909	909	9093
Serbian (Latin)	sr	7276	909	909	9094
Sinhala	si	3249	500	500	4249
Somali	so	5962	745	745	7452
Spanish	es	38110	4763	4763	47636
Swahili	sw	7898	987	987	9872
Tamil	ta	16222	2027	2027	20276
Telugu	te	10421	1302	1302	13025
Thai	th	6616	826	826	8268
Tigrinya	ti	5451	681	681	6813
Turkish	tr	27176	3397	3397	33970
Ukrainian	uk	43201	5399	5399	53999
Urdu	ur	67665	8458	8458	84581
Uzbek	uz	4728	590	590	5908
Vietnamese	vi	32111	4013	4013	40137
Welsh	cy	9732	1216	1216	12164
Yoruba	yo	6350	793	793	7936

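* Many articles on BBC Sinhala and BBC Ukrainian were written in English and Russian respectively. They were identified using Fasttext and moved accordingly.

** West African Pidgin English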
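
The split ratios described in this section can be checked directly against the counts in the table. The snippet below is a small sketch for that arithmetic only: `split_shares` is a hypothetical helper name, and the four rows are copied from the table above as a sample rather than loaded from the dataset.

```python
from typing import Dict, Tuple


def split_shares(train: int, dev: int, test: int) -> Tuple[float, ...]:
    """Return the train/dev/test shares of one language as percentages."""
    total = train + dev + test
    return tuple(round(100 * n / total, 1) for n in (train, dev, test))


# (train, dev, test) counts copied from the table above.
counts: Dict[str, Tuple[int, int, int]] = {
    "Amharic": (5761, 719, 719),          # default 80-10-10 split
    "English": (306522, 11535, 11535),    # 93-3.5-3.5 exception
    "Kyrgyz": (2266, 500, 500),           # evaluation sets raised to 500
    "Scottish Gaelic": (1313, 500, 500),  # evaluation sets raised to 500
}

shares = {lang: split_shares(*c) for lang, c in counts.items()}
for lang, (train, dev, test) in shares.items():
    print(f"{lang}: train {train}%, dev {dev}%, test {test}%")
```

For the low-resource languages, fixing the evaluation sets at 500 samples leaves a visibly smaller share for training (roughly 69% for Kyrgyz and 57% for Scottish Gaelic), which is the trade-off the card describes for more reliable evaluation.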

Dataset Creation

Curation Rationale

More information needed

Source Data

BBC News

Initial Data Collection and Normalization

Detailed in the paper

Who are the source language producers?

Detailed in the paper

Annotations

Detailed in the paper

Annotation process

Detailed in the paper

Who are the annotators?

Detailed in the paper

Personal and Sensitive Information

More information needed

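Considerations for Using the Data

Additional Information

Dataset Curators

More information needed

Licensing Information

The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use any of the datasets, models or code modules, please cite the following paper: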

@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid  and
      Bhattacharjee, Abhik  and
      Islam, Md. Saiful  and
      Mubasshir, Kazi  and
      Li, Yuan-Fang  and
      Kang, Yong-Bin  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.413",
    pages = "4693--4703",
}

Contributions

Thanks to @abhik1505040 and @Tahmid for adding this dataset.

Author:

csebuetnlp

Dataset size:

1.28 GB