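Dataset: GEM/xlsum
Tasks: summarization
Languages: und
Multilinguality: unknown
Language creators: unknown
Annotations creators: none
Source datasets: original
arXiv: arxiv:1607.01759
License: cc-by-nc-sa-4.0

You can find the main data card on the GEM Website.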
XLSum is a highly multilingual summarization dataset supporting 44 languages. The data stems from BBC news articles.
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/xlsum')
The data loader can be found here.
Website
Paper
@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.413",
    pages = "4693--4703",
}
Contact name: Tahmid Hasan
Contact email: tahmidhasan@cse.buet.ac.bd
Has a leaderboard? Yes
Leaderboard link
Leaderboard details: The leaderboard ranks models by the ROUGE scores (R1/R2/RL) of the generated summaries.
Yes
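Covered languages: Amharic, Arabic, Azerbaijani, Bengali, Burmese, Chinese (Simplified), Chinese (Traditional), English, French, Gujarati, Hausa, Hindi, Igbo, Indonesian, Japanese, Kirundi, Korean, Kyrgyz, Marathi, Nepali, Oromo, Pashto, Persian, Pidgin, Portuguese, Punjabi, Russian, Scottish Gaelic, Serbian, Romanized Serbian, Sinhala, Somali, Spanish, Swahili, Tamil, Telugu, Thai, Tigrinya, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yoruba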
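License: cc-by-nc-sa-4.0: Creative Commons Attribution Non Commercial Share Alike 4.0 International
Intended use: Abstractive summarization has centered around English, because most large abstractive summarization datasets are available in English only. Although there have been recent efforts to curate multilingual abstractive summarization datasets, they are limited in the number of languages covered, the number of training samples, or both. To this end, XL-Sum presents a large-scale abstractive summarization dataset of 1.35 million news articles from the British Broadcasting Corporation (BBC) website, covering 45 languages. It is intended for both multilingual and per-language summarization tasks.
Primary task: Summarization
Communicative goal: Summarize news-like text in one of 45 languages.
Academic
Curation organization: Bangladesh University of Engineering and Technology
Who added the dataset to GEM? Tahmid Hasan (Bangladesh University of Engineering and Technology), Abhik Bhattacharjee (Bangladesh University of Engineering and Technology)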
{
  "gem_id": "GEM-xlsum_english-train-1589",
  "url": "https://www.bbc.com/news/technology-17657859",
  "title": "Yahoo files e-book advert system patent applications",
  "summary": "Yahoo has signalled it is investigating e-book adverts as a way to stimulate its earnings.",
  "text": "Yahoo's patents suggest users could weigh the type of ads against the sizes of discount before purchase. It says in two US patent applications that ads for digital book readers have been \"less than optimal\" to date. The filings suggest that users could be offered titles at a variety of prices depending on the ads' prominence They add that the products shown could be determined by the type of book being read, or even the contents of a specific chapter, phrase or word. The paperwork was published by the US Patent and Trademark Office late last week and relates to work carried out at the firm's headquarters in Sunnyvale, California. \"Greater levels of advertising, which may be more valuable to an advertiser and potentially more distracting to an e-book reader, may warrant higher discounts,\" it states. Free books It suggests users could be offered ads as hyperlinks based within the book's text, in-laid text or even \"dynamic content\" such as video. Another idea suggests boxes at the bottom of a page could trail later chapters or quotes saying \"brought to you by Company A\". It adds that the more willing the customer is to see the ads, the greater the potential discount. \"Higher frequencies... may even be great enough to allow the e-book to be obtained for free,\" it states. The authors write that the type of ad could influence the value of the discount, with \"lower class advertising... such as teeth whitener advertisements\" offering a cheaper price than \"high\" or \"middle class\" adverts, for things like pizza. The inventors also suggest that ads could be linked to the mood or emotional state the reader is in as a they progress through a title. For example, they say if characters fall in love or show affection during a chapter, then ads for flowers or entertainment could be triggered. The patents also suggest this could applied to children's books - giving the Tom Hanks animated film Polar Express as an example. It says a scene showing a waiter giving the protagonists hot drinks \"may be an excellent opportunity to show an advertisement for hot cocoa, or a branded chocolate bar\". Another example states: \"If the setting includes young characters, a Coke advertisement could be provided, inviting the reader to enjoy a glass of Coke with his book, and providing a graphic of a cool glass.\" It adds that such targeting could be further enhanced by taking account of previous titles the owner has bought. 'Advertising-free zone' At present, several Amazon and Kobo e-book readers offer full-screen adverts when the device is switched off and show smaller ads on their menu screens, but the main text of the titles remains free of marketing. Yahoo does not currently provide ads to these devices, and a move into the area could boost its shrinking revenues. However, Philip Jones, deputy editor of the Bookseller magazine, said that the internet firm might struggle to get some of its ideas adopted. \"This has been mooted before and was fairly well decried,\" he said. \"Perhaps in a limited context it could work if the merchandise was strongly related to the title and was kept away from the text. \"But readers - particularly parents - like the fact that reading is an advertising-free zone. Authors would also want something to say about ads interrupting their narrative flow.\""
}
Data splits
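The splits in the dataset are specified by the language names.
We used an 80%-10%-10% split for all languages, with a few exceptions. English was split 93%-3.5%-3.5% so that its evaluation set size resembles that of CNN/DM and XSum. Scottish Gaelic, Kyrgyz, and Sinhala had relatively few samples, so their evaluation sets were increased to 500 samples each for more reliable evaluation. The two variants of Chinese and of Serbian use the same articles for evaluation to prevent data leakage in multilingual training. Individual dataset download links with train-dev-test example counts are given below: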
Language | ISO 639-1 Code | BBC subdomain(s) | Train | Dev | Test | Total |
---|---|---|---|---|---|---|
Amharic | am | 12311321 | 5761 | 719 | 719 | 7199 |
Arabic | ar | 12312321 | 37519 | 4689 | 4689 | 46897 |
Azerbaijani | az | 12313321 | 6478 | 809 | 809 | 8096 |
Bengali | bn | 12314321 | 8102 | 1012 | 1012 | 10126 |
Burmese | my | 12315321 | 4569 | 570 | 570 | 5709 |
Chinese (Simplified) | zh-CN | 12316321 /simp, 12317321 /simp | 37362 | 4670 | 4670 | 46702 |
Chinese (Traditional) | zh-TW | 12316321 /trad, 12317321 /trad | 37373 | 4670 | 4670 | 46713 |
English | en | 12320321 , 12321321 * | 306522 | 11535 | 11535 | 329592 |
French | fr | 12322321 | 8697 | 1086 | 1086 | 10869 |
Gujarati | gu | 12323321 | 9119 | 1139 | 1139 | 11397 |
Hausa | ha | 12324321 | 6418 | 802 | 802 | 8022 |
Hindi | hi | 12325321 | 70778 | 8847 | 8847 | 88472 |
Igbo | ig | 12326321 | 4183 | 522 | 522 | 5227 |
Indonesian | id | 12327321 | 38242 | 4780 | 4780 | 47802 |
Japanese | ja | 12328321 | 7113 | 889 | 889 | 8891 |
Kirundi | rn | 12329321 | 5746 | 718 | 718 | 7182 |
Korean | ko | 12330321 | 4407 | 550 | 550 | 5507 |
Kyrgyz | ky | 12331321 | 2266 | 500 | 500 | 3266 |
Marathi | mr | 12332321 | 10903 | 1362 | 1362 | 13627 |
Nepali | ne | 12333321 | 5808 | 725 | 725 | 7258 |
Oromo | om | 12334321 | 6063 | 757 | 757 | 7577 |
Pashto | ps | 12335321 | 14353 | 1794 | 1794 | 17941 |
Persian | fa | 12336321 | 47251 | 5906 | 5906 | 59063 |
Pidgin ** | pcm | 12337321 | 9208 | 1151 | 1151 | 11510 |
Portuguese | pt | 12338321 | 57402 | 7175 | 7175 | 71752 |
Punjabi | pa | 12339321 | 8215 | 1026 | 1026 | 10267 |
Russian | ru | 12340321 , 12341321 * | 62243 | 7780 | 7780 | 77803 |
Scottish Gaelic | gd | 12342321 | 1313 | 500 | 500 | 2313 |
Serbian (Cyrillic) | sr | 12343321 /cyr | 7275 | 909 | 909 | 9093 |
Serbian (Latin) | sr | 12343321 /lat | 7276 | 909 | 909 | 9094 |
Sinhala | si | 12321321 | 3249 | 500 | 500 | 4249 |
Somali | so | 12346321 | 5962 | 745 | 745 | 7452 |
Spanish | es | 12347321 | 38110 | 4763 | 4763 | 47636 |
Swahili | sw | 12348321 | 7898 | 987 | 987 | 9872 |
Tamil | ta | 12349321 | 16222 | 2027 | 2027 | 20276 |
Telugu | te | 12350321 | 10421 | 1302 | 1302 | 13025 |
Thai | th | 12351321 | 6616 | 826 | 826 | 8268 |
Tigrinya | ti | 12352321 | 5451 | 681 | 681 | 6813 |
Turkish | tr | 12353321 | 27176 | 3397 | 3397 | 33970 |
Ukrainian | uk | 12341321 | 43201 | 5399 | 5399 | 53999 |
Urdu | ur | 12355321 | 67665 | 8458 | 8458 | 84581 |
Uzbek | uz | 12356321 | 4728 | 590 | 590 | 5908 |
Vietnamese | vi | 12357321 | 32111 | 4013 | 4013 | 40137 |
Welsh | cy | 12358321 | 9732 | 1216 | 1216 | 12164 |
Yoruba | yo | 12359321 | 6350 | 793 | 793 | 7936 |
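* Many articles on BBC Sinhala and BBC Ukrainian were written in English and Russian, respectively. They were identified using Fasttext and moved accordingly.
** West African Pidgin English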
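The split ratios described above can be checked against the table. The following is a minimal sketch, with counts copied by hand from a few rows; the helper name `split_percentages` is an illustrative assumption, not part of the dataset tooling.

```python
# Counts copied from the table above: (train, dev, test).
counts = {
    "Amharic": (5761, 719, 719),
    "English": (306522, 11535, 11535),
    "Kyrgyz": (2266, 500, 500),
    "Scottish Gaelic": (1313, 500, 500),
    "Sinhala": (3249, 500, 500),
}


def split_percentages(train, dev, test):
    """Return the train/dev/test shares as percentages rounded to one decimal."""
    total = train + dev + test
    return tuple(round(100 * n / total, 1) for n in (train, dev, test))


for language, (train, dev, test) in counts.items():
    print(language, split_percentages(train, dev, test))
```

Amharic comes out at the default 80/10/10 and English at 93/3.5/3.5, while the three languages with evaluation sets fixed at 500 samples end up with a noticeably smaller training share.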
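Traditional abstractive text summarization has centered around English and other high-resource languages. XL-Sum provides a large collection of high-quality article-summary pairs for 45 languages, ranging from high-resource to extremely low-resource. This enables the research community to explore the summarization capabilities of different models for multiple languages and for individual languages. We believe that adding XL-Sum to GEM makes the domain of abstractive text summarization more diverse and inclusive for the research community. We hope our efforts in this work will encourage the community to push beyond English, in particular for low- and mid-resource languages, and bring technological advances to these traditionally under-served language communities.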
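Similar datasets: Yes
Unique language coverage: Yes
Difference from other GEM datasets: The summaries are highly concise and abstractive.
Ability that the dataset measures: Conciseness, abstractiveness, and overall summarization capability.
No
Additional splits? No
Conciseness, abstractiveness, and overall summarization capability.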
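Metrics: ROUGE
Proposed evaluation: ROUGE is the de facto evaluation metric for text summarization. However, it was designed specifically for evaluating English texts. Because the metric relies heavily on tokenization, stemming, and removal of unwanted characters, its scores depend strongly on these choices. Some modifications were made to the original ROUGE evaluation, such as removing punctuation only and applying language-specific tokenization/stemming, to enable reliable comparison between source and target summaries across languages.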
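To illustrate why tokenization matters so much, here is a minimal sketch of an n-gram overlap F1 score with a pluggable tokenizer. It is not the modified ROUGE used by the authors; `rouge_n_f1`, `char_tokenize`, and the two Chinese example strings are assumptions made up for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=1, tokenize=str.split):
    """Simplified ROUGE-N F1: clipped n-gram overlap between two strings."""
    ref = ngrams(tokenize(reference), n)
    cand = ngrams(tokenize(candidate), n)
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def char_tokenize(text):
    """Hypothetical tokenizer for scripts without spaces: one token per character."""
    return [ch for ch in text if not ch.isspace()]


# Made-up example pair; the candidate drops two characters from the reference.
reference = "雅虎申请电子书广告专利"
candidate = "雅虎申请电子书专利"

# Whitespace tokenization treats each string as a single token, so nothing matches.
print(rouge_n_f1(reference, candidate))
# Character tokenization recovers the large overlap between the two strings.
print(rouge_n_f1(reference, candidate, tokenize=char_tokenize))
```

The same pair of summaries scores zero with whitespace splitting and close to one with character-level tokens, which is the kind of gap that language-specific tokenization is meant to close.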
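Previous results available? No
State-of-the-art text summarization models are heavily data-driven, i.e., a large number of article-summary pairs are required to train them effectively. As a result, abstractive summarization has centered around English, because most large abstractive summarization datasets are available in English only. Although there have been recent efforts to curate multilingual abstractive summarization datasets, they are limited in the number of languages covered, the number of training samples, or both. To this end, we curated XL-Sum, a large-scale abstractive summarization dataset of 1.35 million news articles from the British Broadcasting Corporation (BBC) website.
Communicative goal: Introduce new languages to the English-centric domain of abstractive text summarization and enable both multilingual and per-language summarization.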
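Sourced from different sources: Yes
Source details: British Broadcasting Corporation (BBC) news websites.
Found
Where was it found? Multiple websites
Language producers: The language content was written by professional news editors hired by the BBC.
Topics covered: News
Data validation: Not validated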
Data preprocessing: We applied 'NFKC' normalization to all text instances.
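As a minimal sketch of what this preprocessing step does, the snippet below applies NFKC normalization with Python's standard `unicodedata` module. The wrapper name `normalize_text` and the sample strings are illustrative assumptions; the authors' actual preprocessing code is not shown in this card.

```python
import unicodedata


def normalize_text(text):
    """Apply Unicode NFKC normalization (compatibility decomposition, then composition)."""
    return unicodedata.normalize("NFKC", text)


samples = [
    "\uff22\uff22\uff23\u3000\uff2e\uff45\uff57\uff53",  # fullwidth "BBC News" with an ideographic space
    "\ufb01le",                                          # "file" written with the "fi" ligature
]
for sample in samples:
    print(repr(sample), "->", repr(normalize_text(sample)))
```

NFKC folds compatibility variants such as fullwidth letters and ligatures into their canonical forms, so visually equivalent strings compare equal after normalization.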
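Was the data filtered? Algorithmically
Filter criteria: We designed a crawler that starts from the homepage and recursively crawls pages by visiting the article links present on each visited page. Because all BBC sites share a similar structure, we were able to scrape articles from all of them. Before further processing, we discarded pages with no textual content (mostly pages consisting of multimedia content). After carefully examining the HTML structure of the crawled pages, we designed a number of heuristics to make the extraction effective.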
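The crawl-and-filter procedure described above can be sketched roughly as follows. This is not the authors' crawler: it uses an iterative queue instead of recursion, and `SITE` is a hypothetical in-memory stand-in for a website so that the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory stand-in for a site: URL -> (page text, outgoing links).
SITE = {
    "/": ("", ["/news/a", "/news/b", "/video/c"]),
    "/news/a": ("Article A body text.", ["/news/b", "/"]),
    "/news/b": ("Article B body text.", ["/news/a"]),
    "/video/c": ("", ["/news/a"]),  # multimedia page without textual content
}


def crawl(start, fetch):
    """Visit every page reachable from `start` and keep only pages that have text."""
    seen = {start}
    queue = deque([start])
    articles = {}
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        if text.strip():  # discard pages with no textual content
            articles[url] = text
        for link in links:
            if link not in seen:  # visit each page only once
                seen.add(link)
                queue.append(link)
    return articles


articles = crawl("/", SITE.__getitem__)
print(sorted(articles))
```

In a real crawler, `fetch` would download a page and extract its text and links from the HTML, which is where the site-specific heuristics would apply.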
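None
Annotation service? No
Yes
Consent policy details: The BBC's policy specifies that the text content on its websites may be used for non-commercial research only.
Likely
Categories of PII: Generic PII
Was PII identification performed? No identification
No
No
Yes
How the dataset addresses needs: This dataset introduces summarization corpora for many languages for which no such datasets had previously been curated.
No
Are the language producers representative of the language? Yes
Research use only, non-commercial use only
Copyright restrictions on the language data: Research use only, non-commercial use only.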
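Human evaluation showed that most languages had a high percentage of good summaries, that almost none of the summaries contained conflicting information, and that about one third of the summaries contained information that could not be directly inferred from the source article. Because there are usually multiple articles about an important event, there may be overlap between the training and evaluation sets.
Unsuited applications: The dataset is limited to the news domain. Models trained on it are therefore not recommended for summarizing texts from other domains (e.g., literature or scientific text). Another concern is possible hallucination in model-generated summaries.
Discouraged use cases: ROUGE evaluates the overall quality of a summary based on n-gram overlap of up to 4-grams. Therefore, in an article about India, if the word "India" in the generated summary is replaced by "Pakistan" due to model hallucination, the overall score would not drop significantly, even though the entire meaning could change.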
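The India/Pakistan case can be made concrete with a small sketch. The unigram-overlap score below is a simplified stand-in for ROUGE-1, and the three sentences are invented for illustration; none of them come from the dataset.

```python
from collections import Counter


def rouge_1_f1(reference, candidate):
    """Simplified ROUGE-1 F1 over lowercased, whitespace-separated tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented sentences: the hallucinated summary swaps a single entity.
reference = "india announces new policy to expand rural internet access"
faithful = "india announces new policy to expand rural internet access"
hallucinated = "pakistan announces new policy to expand rural internet access"

print(round(rouge_1_f1(reference, faithful), 3))      # 1.0
print(round(rouge_1_f1(reference, hallucinated), 3))  # 0.889
```

Swapping one of nine tokens lowers the score only from 1.0 to about 0.889, although the summary now reports on the wrong country, so overlap-based scores should not be read as a measure of factual correctness.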