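Dataset: big_patent
Task: Summarization
Language: en
Multilinguality: monolingual
Language creators: found
Annotation creators: no-annotation
Source datasets: original
Preprint: arxiv:1906.03741
License: cc-by-4.0

The BIGPATENT dataset consists of 1.3 million records of U.S. patent documents together with human-written summaries. Each U.S. patent application is filed under a Cooperative Patent Classification (CPC) code. There are nine such classification categories, available as the configuration codes `a`, `b`, `c`, `d`, `e`, `f`, `g`, `h`, and `y`.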
The current defaults are version 2.1.2 (cased raw strings, with a casing fix) and the "all" CPC codes:
```python
from datasets import load_dataset

ds = load_dataset("big_patent")  # default is 'all' CPC codes
ds = load_dataset("big_patent", "all")  # the same as above
ds = load_dataset("big_patent", "a")  # only 'a' CPC codes
ds = load_dataset("big_patent", codes=["a", "b"])  # only 'a' and 'b' CPC codes
```
To use version 1.0.0 (lower-cased, tokenized words), pass both the `codes` and `version` parameters:
```python
ds = load_dataset("big_patent", codes="all", version="1.0.0")
ds = load_dataset("big_patent", codes="a", version="1.0.0")
ds = load_dataset("big_patent", codes=["a", "b"], version="1.0.0")
```
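
The calls above differ only in their `codes` and `version` arguments, so those arguments can be validated before anything is downloaded. The following is a minimal sketch, not part of the `datasets` library: `VALID_CPC_CODES`, `big_patent_kwargs`, and `load_big_patent` are hypothetical helper names, and the set of valid codes is assumed from the configuration names listed in this card.

```python
# Assumption: the valid CPC codes are the nine configuration names in this card.
VALID_CPC_CODES = ("a", "b", "c", "d", "e", "f", "g", "h", "y")


def big_patent_kwargs(codes="all", version=None):
    """Build keyword arguments for load_dataset("big_patent", ...)."""
    if codes != "all":
        # Accept either a single code ("a") or a list of codes (["a", "b"]).
        requested = [codes] if isinstance(codes, str) else list(codes)
        unknown = [code for code in requested if code not in VALID_CPC_CODES]
        if unknown:
            raise ValueError(f"Unknown CPC code(s): {unknown}")
    kwargs = {"codes": codes}
    if version is not None:
        # Only pass a version when one was requested, so the default applies otherwise.
        kwargs["version"] = version
    return kwargs


def load_big_patent(codes="all", version=None):
    """Hypothetical wrapper: validate the arguments, then call load_dataset."""
    from datasets import load_dataset  # imported lazily; triggers a download when called

    return load_dataset("big_patent", **big_patent_kwargs(codes, version))
```

For example, `big_patent_kwargs(["a", "b"], version="1.0.0")` returns `{"codes": ["a", "b"], "version": "1.0.0"}`, while an unrecognized code such as `"z"` raises a `ValueError` before any network access.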
The dataset text is in English (`en`).
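Each instance contains a description paired with an abstract. The description is extracted from the Description section of the patent, and the abstract is extracted from the Abstract section.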
```python
{
    'description': 'FIELD OF THE INVENTION \n [0001] This invention relates to novel calcium phosphate-coated implantable medical devices and processes of making same. The unique calcium-phosphate coated implantable medical devices minimize...',
    'abstract': 'This invention relates to novel calcium phosphate-coated implantable medical devices...'
}
```
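
Because every instance is just a `description`/`abstract` pair, simple per-record statistics are easy to compute. The sketch below measures how much longer a description is than its abstract, counted in whitespace-separated tokens. The function name `compression_ratio` and the shortened toy record are illustrative assumptions, not taken from the dataset itself.

```python
def compression_ratio(example):
    """Return description length divided by abstract length, in whitespace tokens."""
    description_tokens = example["description"].split()
    abstract_tokens = example["abstract"].split()
    if not abstract_tokens:
        raise ValueError("abstract is empty")
    return len(description_tokens) / len(abstract_tokens)


# Hypothetical toy record with the same two fields as a real instance.
example = {
    "description": (
        "FIELD OF THE INVENTION \n [0001] This invention relates to coated "
        "implantable medical devices and processes of making same."
    ),
    "abstract": "This invention relates to coated implantable medical devices.",
}

print(f"{compression_ratio(example):.2f}")  # 18 tokens / 8 tokens -> 2.25
```

Real descriptions are far longer than this toy string, so ratios on actual records will be much larger.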
| CPC code | train | validation | test |
|---|---|---|---|
| all | 1207222 | 67068 | 67072 |
| a | 174134 | 9674 | 9675 |
| b | 161520 | 8973 | 8974 |
| c | 101042 | 5613 | 5614 |
| d | 10164 | 565 | 565 |
| e | 34443 | 1914 | 1914 |
| f | 85568 | 4754 | 4754 |
| g | 258935 | 14385 | 14386 |
| h | 257019 | 14279 | 14279 |
| y | 124397 | 6911 | 6911 |
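
As a quick sanity check on the split sizes above, the sketch below copies the counts into a dictionary, confirms that the per-code rows add up to the `all` row, and computes the share of records in the training split (roughly 90%). The names `SPLIT_SIZES`, `per_code_totals`, and `train_share` are illustrative, not part of any library.

```python
# (train, validation, test) counts copied from the table in this card.
SPLIT_SIZES = {
    "all": (1207222, 67068, 67072),
    "a": (174134, 9674, 9675),
    "b": (161520, 8973, 8974),
    "c": (101042, 5613, 5614),
    "d": (10164, 565, 565),
    "e": (34443, 1914, 1914),
    "f": (85568, 4754, 4754),
    "g": (258935, 14385, 14386),
    "h": (257019, 14279, 14279),
    "y": (124397, 6911, 6911),
}


def per_code_totals():
    """Sum each split over the individual CPC codes (every row except 'all')."""
    rows = [sizes for code, sizes in SPLIT_SIZES.items() if code != "all"]
    return tuple(sum(column) for column in zip(*rows))


def train_share(code):
    """Fraction of a configuration's records that fall in the training split."""
    train, validation, test = SPLIT_SIZES[code]
    return train / (train + validation + test)


print(per_code_totals() == SPLIT_SIZES["all"])  # True
print(round(train_share("all"), 3))  # 0.9
```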
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
```bibtex
@article{DBLP:journals/corr/abs-1906-03741,
  author     = {Eva Sharma and Chen Li and Lu Wang},
  title      = {{BIGPATENT:} {A} Large-Scale Dataset for Abstractive and Coherent Summarization},
  journal    = {CoRR},
  volume     = {abs/1906.03741},
  year       = {2019},
  url        = {http://arxiv.org/abs/1906.03741},
  eprinttype = {arXiv},
  eprint     = {1906.03741},
  timestamp  = {Wed, 26 Jun 2019 07:14:58 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-1906-03741.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
Thanks to @mattbui for adding this dataset.