Dataset:

JulesBelveze/tldr_news

Tasks:

Summarization

Text2Text Generation

Text Generation

Sub-tasks:

news-articles-headline-generation text-simplification language-modeling

Languages:

English

Multilinguality:

monolingual

Size:

1K<n<10K

Language Creators:

other

Annotations Creators:

other

Source Datasets:

original

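Dataset Card for tldr_news

Dataset Summary

The tldr_news dataset was constructed by collecting a daily tech newsletter (available at https://tldr.tech/newsletter). For each piece of news, the headline and its corresponding content were extracted. The newsletter is also divided into sections, and each piece of news is annotated with the section it appeared in.

Such a dataset can be used to train a model to generate a headline from an input piece of text.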

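Supported Tasks and Leaderboards

There is no officially supported task or leaderboard for this dataset. However, it can be used for the following tasks:

Summarization
Headline generation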
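
As a rough illustration of the headline-generation use, one record can be turned into an input/target pair as sketched below. The helper name `to_headline_pair` and the whitespace stripping are assumptions made for this example, not part of the dataset, and the sample headline is shortened for readability.

```python
from typing import Dict, Tuple


def to_headline_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Map one record to (model input, expected output).

    Hypothetical helper: the news content is the input text and the
    headline is the target the model should learn to produce.
    """
    source = record["content"].strip()
    # Stripping is an assumption; headlines may carry trailing spaces.
    target = record["headline"].strip()
    return source, target


# Shortened record, shaped like the data instances in this card.
example = {
    "headline": "Cana Unveils Molecular Beverage Printer ",
    "content": "Cana has unveiled a drink machine that can synthesize almost any drink.",
    "category": "Science and Futuristic Technology",
}

source, target = to_headline_pair(example)
```

For summarization, the same pairing could be reused with a different choice of input and target fields.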

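Languages

English

Dataset Structure

Data Instances

A data point consists of a "headline", its corresponding "content", and the newsletter section as "category". For example: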

{
  "headline": "Cana Unveils Molecular Beverage Printer, a ‘Netflix for Drinks’ That Can Make Nearly Any Type of Beverage ",
  "content": "Cana has unveiled a drink machine that can synthesize almost any drink. The machine uses a cartridge that contains flavor compounds that can be combined to create the flavor of nearly any type of drink. It is about the size of a toaster and could potentially save people from throwing hundreds of containers away every month by allowing people to create whatever drinks they want at home. Around $30 million was spent building Cana’s proprietary hardware platform and chemistry system. Cana plans to start full production of the device and will release pricing by the end of February.",
  "category": "Science and Futuristic Technology"
}

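Data Fields

headline (str): the headline of the news item
content (str): the content of the news item
category (str): the newsletter section the news item appeared in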
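
The three fields above can be mirrored in a small typed record. This is a minimal sketch: the `NewsItem` class and the `parse_news_item` function are hypothetical names introduced here, and the check for missing fields is an assumption about how one might validate a record.

```python
import json
from dataclasses import dataclass

# Field names as listed in this card.
REQUIRED_FIELDS = ("headline", "content", "category")


@dataclass(frozen=True)
class NewsItem:
    headline: str  # headline of the news item
    content: str   # content of the news item
    category: str  # newsletter section


def parse_news_item(raw: str) -> NewsItem:
    """Parse one JSON-encoded record and check that all fields are present."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return NewsItem(
        headline=str(data["headline"]),
        content=str(data["content"]),
        category=str(data["category"]),
    )
```

A call such as `parse_news_item('{"headline": "H", "content": "C", "category": "Miscellaneous"}')` returns a `NewsItem`, while a record lacking any of the three fields raises `ValueError`.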

Data Splits

all: all existing daily newsletters (available at https://tldr.tech/newsletter)

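Dataset Creation

Curation Rationale

This dataset was obtained by collecting all existing daily newsletters (available at https://tldr.tech/newsletter).

Each newsletter was then processed to extract its individual news items, and for each item the headline and the content were extracted.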

Source Data

Initial Data Collection and Normalization

The dataset was obtained from https://tldr.tech/newsletter.

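To clean up the samples and build a dataset better suited to headline generation, we applied a few normalization steps:

Headlines originally contained an estimated reading time in parentheses; we removed this information from the headline.

Some news items are sponsored content and therefore do not belong to any newsletter section. For these samples, we created an additional category, "Sponsor".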
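
The two normalization steps could look roughly like the sketch below. It is not the authors' actual code: the exact reading-time format (assumed here to be a trailing "(N minute read)") is not given in this card, and the names `READ_TIME`, `clean_headline`, and `resolve_category` are hypothetical.

```python
import re
from typing import Optional

# Assumed suffix format, e.g. "(3 minute read)"; the card only says the
# estimated reading time appeared in parentheses.
READ_TIME = re.compile(r"\s*\(\d+\s+minute\s+read\)\s*$", re.IGNORECASE)

# Extra category for sponsored items, as described in this card.
SPONSOR = "Sponsor"


def clean_headline(headline: str) -> str:
    """Drop a trailing reading-time marker and surrounding whitespace."""
    return READ_TIME.sub("", headline).strip()


def resolve_category(section: Optional[str]) -> str:
    """Use the newsletter section, or the sponsor category when there is none."""
    return section if section else SPONSOR
```

Anchoring the pattern at the end of the string keeps other parenthesized text inside a headline untouched.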

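Who are the source language producers?

The person (or team) behind TLDR tech.

Annotations

Annotation process

Disclaimer: the dataset was generated from a daily newsletter. The authors did not intend those newsletters to be used for this purpose.

Who are the annotators?

The newsletters were written by the people at TLDR tech.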

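Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

This dataset contains only tech news. A model trained on it may not generalize to other domains.

Other Known Limitations

[More Information Needed]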

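Additional Information

Dataset Curators

The dataset was obtained by collecting newsletters from this website: https://tldr.tech/newsletter

Contributions

Thanks to @JulesBelveze for adding this dataset.

Author:

JulesBelveze

Dataset size:

57.46 KB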