I-Dataset Summary

english_historical_quotes is a dataset of historical quotes. It can be used for multi-label text classification and text generation. Every quote is in English.

II-Supported Tasks and Leaderboards

Multi-label text classification : The dataset can be used to train a text-classification model that classifies quotes by author as well as by topic (using tags). Success on this task is typically measured by achieving a high accuracy.
Text-generation : The dataset can be used to train a model to generate quotes by fine-tuning an existing pretrained model on the corpus composed of all quotes (or quotes by author).
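For the multi-label task, each quote's tags usually have to be turned into a fixed-length 0/1 target vector before training. The sketch below shows one common way to do that in plain Python. It assumes the tags have already been parsed into Python lists; the helper names `build_vocab` and `multi_hot` and the sample tag lists are hypothetical, not part of the dataset.

```python
from typing import Dict, List


def build_vocab(tag_lists: List[List[str]]) -> Dict[str, int]:
    """Map each distinct tag to a column index, in sorted order."""
    vocab = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: i for i, tag in enumerate(vocab)}


def multi_hot(tags: List[str], vocab: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 in the column of every tag present."""
    vec = [0] * len(vocab)
    for tag in tags:
        vec[vocab[tag]] = 1
    return vec


# Made-up tag lists for illustration, not rows taken from the dataset.
tag_lists = [["business", "money"], ["money", "time"]]
vocab = build_vocab(tag_lists)
targets = [multi_hot(tags, vocab) for tags in tag_lists]
```

Sorting the vocabulary keeps the column order stable across runs, so the same tag always maps to the same position in the target vector.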

III-Languages

The texts in the dataset are in English (en).

IV-Dataset Structure

Data Instances

A typical instance in the dataset, in JSON format:

{"quote":"Almost anyone can be an author; the business is to collect money and fame from this state of being.","author":"A. A. Milne","categories": "['business', 'money']"}

Data Fields

author : The author of the quote.
quote : The text of the quote.
categories : The tags attached to the quote, which can be read as the topics it covers.
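In the example instance shown earlier, the value of `categories` is a string that looks like a Python list rather than a JSON array, so it needs a second parsing step before the tags can be used individually. A minimal sketch, assuming every record follows that same shape (the record built here is a cut-down stand-in, not a full row):

```python
import ast
import json

# Cut-down stand-in for one JSON record; only the fields needed here.
record_line = json.dumps(
    {"author": "A. A. Milne", "categories": "['business', 'money']"}
)

# Step 1: decode the JSON object itself.
record = json.loads(record_line)

# Step 2: the tag list is a Python-literal string, so evaluate it safely.
# ast.literal_eval only accepts literals and never executes code.
tags = ast.literal_eval(record["categories"])
```

`ast.literal_eval` is used instead of `json.loads` for the second step because the list uses single quotes, which is not valid JSON.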

Data Splits

The dataset ships as a single block with no predefined splits. It can be divided further with the `.train_test_split()` method from Hugging Face datasets.

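V-Dataset Creation

Curation Rationale

The goal is to share a high-quality dataset with the HuggingFace community so that it can be used in natural language processing tasks and help advance artificial intelligence.

Source Data

The data was aggregated from several open-access internet archives. It was then refined by hand to remove duplicate and fake quotes.

It is the basis of my website dixit.app, which lets you find historical quotes through semantic search.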

VI-Additional Information

Dataset Curators

Aymeric Roucher

Licensing Information

This work is licensed under the MIT License.

Author:

A-Roucher

Dataset size:

4.24 MB