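Datasets: csebuetnlp/xlsum
Multilinguality: multilingual
Size: 1M<n<10M
Language creators: found
Annotation creators: found
Source datasets: original
ArXiv: arxiv:1607.01759
License: cc-by-nc-sa-4.0

We present XL-Sum, a dataset of 1.35 million professionally annotated article-summary pairs from the BBC, extracted using a set of carefully designed heuristics. The dataset covers 45 languages ranging from low- to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.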
One example from the English dataset is given below in JSON format.
{
  "id": "technology-17657859",
  "url": "https://www.bbc.com/news/technology-17657859",
  "title": "Yahoo files e-book advert system patent applications",
  "summary": "Yahoo has signalled it is investigating e-book adverts as a way to stimulate its earnings.",
  "text": "Yahoo's patents suggest users could weigh the type of ads against the sizes of discount before purchase. It says in two US patent applications that ads for digital book readers have been \"less than optimal\" to date. The filings suggest that users could be offered titles at a variety of prices depending on the ads' prominence They add that the products shown could be determined by the type of book being read, or even the contents of a specific chapter, phrase or word. The paperwork was published by the US Patent and Trademark Office late last week and relates to work carried out at the firm's headquarters in Sunnyvale, California. \"Greater levels of advertising, which may be more valuable to an advertiser and potentially more distracting to an e-book reader, may warrant higher discounts,\" it states. Free books It suggests users could be offered ads as hyperlinks based within the book's text, in-laid text or even \"dynamic content\" such as video. Another idea suggests boxes at the bottom of a page could trail later chapters or quotes saying \"brought to you by Company A\". It adds that the more willing the customer is to see the ads, the greater the potential discount. \"Higher frequencies... may even be great enough to allow the e-book to be obtained for free,\" it states. The authors write that the type of ad could influence the value of the discount, with \"lower class advertising... such as teeth whitener advertisements\" offering a cheaper price than \"high\" or \"middle class\" adverts, for things like pizza. The inventors also suggest that ads could be linked to the mood or emotional state the reader is in as a they progress through a title. For example, they say if characters fall in love or show affection during a chapter, then ads for flowers or entertainment could be triggered. The patents also suggest this could applied to children's books - giving the Tom Hanks animated film Polar Express as an example. It says a scene showing a waiter giving the protagonists hot drinks \"may be an excellent opportunity to show an advertisement for hot cocoa, or a branded chocolate bar\". Another example states: \"If the setting includes young characters, a Coke advertisement could be provided, inviting the reader to enjoy a glass of Coke with his book, and providing a graphic of a cool glass.\" It adds that such targeting could be further enhanced by taking account of previous titles the owner has bought. 'Advertising-free zone' At present, several Amazon and Kobo e-book readers offer full-screen adverts when the device is switched off and show smaller ads on their menu screens, but the main text of the titles remains free of marketing. Yahoo does not currently provide ads to these devices, and a move into the area could boost its shrinking revenues. However, Philip Jones, deputy editor of the Bookseller magazine, said that the internet firm might struggle to get some of its ideas adopted. \"This has been mooted before and was fairly well decried,\" he said. \"Perhaps in a limited context it could work if the merchandise was strongly related to the title and was kept away from the text. \"But readers - particularly parents - like the fact that reading is an advertising-free zone. Authors would also want something to say about ads interrupting their narrative flow.\""
}
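A record with these five fields can be read using only the Python standard library. The sketch below is illustrative: `XLSumRecord` and `parse_record` are hypothetical names, not part of any released XL-Sum code, and the sample string is a shortened stand-in for a real record (the article body is truncated).

```python
import json
from dataclasses import dataclass

# Field names taken from the JSON example above.
REQUIRED_FIELDS = ("id", "url", "title", "summary", "text")


@dataclass(frozen=True)
class XLSumRecord:
    """Hypothetical container for one article-summary pair."""
    id: str
    url: str
    title: str
    summary: str
    text: str


def parse_record(line: str) -> XLSumRecord:
    """Parse one JSON object and check that all five fields are present."""
    obj = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in obj]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return XLSumRecord(**{name: obj[name] for name in REQUIRED_FIELDS})


# Shortened stand-in record: the "text" value is truncated for illustration.
sample = json.dumps({
    "id": "technology-17657859",
    "url": "https://www.bbc.com/news/technology-17657859",
    "title": "Yahoo files e-book advert system patent applications",
    "summary": "Yahoo has signalled it is investigating e-book adverts "
               "as a way to stimulate its earnings.",
    "text": "Yahoo's patents suggest users could weigh the type of ads "
            "against the sizes of discount before purchase.",
})

record = parse_record(sample)
print(record.id, "->", len(record.summary.split()), "summary words")
```

Failing early on a missing field is a design choice for this sketch; a more lenient reader could fill absent fields with empty strings instead.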
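We used an 80%-10%-10% split for all languages, with a few exceptions. English was split 93%-3.5%-3.5% so that its evaluation sets are similar in size to those of CNN/DM and XSum. Scottish Gaelic, Kyrgyz and Sinhala have relatively few samples, so their evaluation sets were increased to 500 samples each for more reliable evaluation. To prevent data leakage in multilingual training, the same articles were used for evaluation in the two variants of Chinese and of Serbian. The BBC subdomain(s) and train-dev-test example counts for each language are given below: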
Language | ISO 639-1 Code | BBC subdomain(s) | Train | Dev | Test | Total |
---|---|---|---|---|---|---|
Amharic | am | 1233321 | 5761 | 719 | 719 | 7199 |
Arabic | ar | 1234321 | 37519 | 4689 | 4689 | 46897 |
Azerbaijani | az | 1235321 | 6478 | 809 | 809 | 8096 |
Bengali | bn | 1236321 | 8102 | 1012 | 1012 | 10126 |
Burmese | my | 1237321 | 4569 | 570 | 570 | 5709 |
Chinese (Simplified) | zh-CN | 1238321 , 1239321 | 37362 | 4670 | 4670 | 46702 |
Chinese (Traditional) | zh-TW | 12310321 , 12311321 | 37373 | 4670 | 4670 | 46713 |
English | en | 12312321 , 12313321 * | 306522 | 11535 | 11535 | 329592 |
French | fr | 12314321 | 8697 | 1086 | 1086 | 10869 |
Gujarati | gu | 12315321 | 9119 | 1139 | 1139 | 11397 |
Hausa | ha | 12316321 | 6418 | 802 | 802 | 8022 |
Hindi | hi | 12317321 | 70778 | 8847 | 8847 | 88472 |
Igbo | ig | 12318321 | 4183 | 522 | 522 | 5227 |
Indonesian | id | 12319321 | 38242 | 4780 | 4780 | 47802 |
Japanese | ja | 12320321 | 7113 | 889 | 889 | 8891 |
Kirundi | rn | 12321321 | 5746 | 718 | 718 | 7182 |
Korean | ko | 12322321 | 4407 | 550 | 550 | 5507 |
Kyrgyz | ky | 12323321 | 2266 | 500 | 500 | 3266 |
Marathi | mr | 12324321 | 10903 | 1362 | 1362 | 13627 |
Nepali | ne | 12325321 | 5808 | 725 | 725 | 7258 |
Oromo | om | 12326321 | 6063 | 757 | 757 | 7577 |
Pashto | ps | 12327321 | 14353 | 1794 | 1794 | 17941 |
Persian | fa | 12328321 | 47251 | 5906 | 5906 | 59063 |
Pidgin ** | n/a | 12329321 | 9208 | 1151 | 1151 | 11510 |
Portuguese | pt | 12330321 | 57402 | 7175 | 7175 | 71752 |
Punjabi | pa | 12331321 | 8215 | 1026 | 1026 | 10267 |
Russian | ru | 12332321 , 12333321 * | 62243 | 7780 | 7780 | 77803 |
Scottish Gaelic | gd | 12334321 | 1313 | 500 | 500 | 2313 |
Serbian (Cyrillic) | sr | 12335321 | 7275 | 909 | 909 | 9093 |
Serbian (Latin) | sr | 12336321 | 7276 | 909 | 909 | 9094 |
Sinhala | si | 12313321 | 3249 | 500 | 500 | 4249 |
Somali | so | 12338321 | 5962 | 745 | 745 | 7452 |
Spanish | es | 12339321 | 38110 | 4763 | 4763 | 47636 |
Swahili | sw | 12340321 | 7898 | 987 | 987 | 9872 |
Tamil | ta | 12341321 | 16222 | 2027 | 2027 | 20276 |
Telugu | te | 12342321 | 10421 | 1302 | 1302 | 13025 |
Thai | th | 12343321 | 6616 | 826 | 826 | 8268 |
Tigrinya | ti | 12344321 | 5451 | 681 | 681 | 6813 |
Turkish | tr | 12345321 | 27176 | 3397 | 3397 | 33970 |
Ukrainian | uk | 12333321 | 43201 | 5399 | 5399 | 53999 |
Urdu | ur | 12347321 | 67665 | 8458 | 8458 | 84581 |
Uzbek | uz | 12348321 | 4728 | 590 | 590 | 5908 |
Vietnamese | vi | 12349321 | 32111 | 4013 | 4013 | 40137 |
Welsh | cy | 12350321 | 9732 | 1216 | 1216 | 12164 |
Yoruba | yo | 12351321 | 6350 | 793 | 793 | 7936 |
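* Many articles on BBC Sinhala and BBC Ukrainian were written in English and Russian, respectively. They were identified using Fasttext and moved accordingly.
** West African Pidgin English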
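The counts in the table are consistent with a simple sizing rule: the dev and test sets each take a fixed fraction of a language's total, rounded down, with a floor of 500 samples, and the remainder goes to training. The sketch below is an inference from the published counts, not the authors' released code; `split_sizes` and its parameters are hypothetical names. It does not model the shared evaluation articles for the Chinese and Serbian variants.

```python
def split_sizes(total: int, eval_fraction: float = 0.10, min_eval: int = 500):
    """Return (train, dev, test) sizes for a language with `total` samples.

    Assumed rule (inferred from the table, not taken from the authors' code):
    dev and test each get floor(total * eval_fraction), but never fewer than
    `min_eval` samples; everything else is training data.
    """
    eval_size = max(min_eval, int(total * eval_fraction))
    return total - 2 * eval_size, eval_size, eval_size


# Totals copied from the table above.
print(split_sizes(7199))                         # Amharic, default 10% rule
print(split_sizes(2313))                         # Scottish Gaelic, 500 floor
print(split_sizes(329592, eval_fraction=0.035))  # English, 3.5% evaluation sets
```

With these inputs the function reproduces the Amharic, Scottish Gaelic and English rows of the table.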
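The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.
If you use any of the datasets, models or code modules, please cite the following paper: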
@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.413",
    pages = "4693--4703",
}
Thanks to @abhik1505040 and @Tahmid for adding this dataset.