Model:
allenai/specter2
SPECTER 2.0 is the successor to SPECTER and is capable of generating task-specific embeddings for scientific tasks when paired with adapters. Given the combined title and abstract of a scientific paper, or a short text query, the model can be used to generate effective embeddings for downstream applications.
Note: For general embedding purposes, please use allenai/specter2_proximity.
To get the best performance on a given downstream task type, load the adapter associated with the base model as in the example below.
SPECTER 2.0 was trained on over 6M triplets of scientific paper citations, which are available here. It was then trained further, with task-format-specific adapter modules attached, on all the training tasks of SciRepEval.
The task formats trained on are Retrieval (Proximity), Adhoc Query, Classification, and Regression, as listed in the table below.
It builds on the work done in SciRepEval: A Multi-Format Benchmark for Scientific Document Representations, and the trained models are also evaluated on that benchmark.
Model | Name and HF link | Description |
---|---|---|
Retrieval* | allenai/specter2_proximity | Encode papers as queries and candidates, e.g. Link Prediction, Nearest Neighbor Search |
Adhoc Query | allenai/specter2_adhoc_query | Encode short raw-text queries for search tasks (candidate papers can be encoded with proximity) |
Classification | allenai/specter2_classification | Encode papers to feed into linear classifiers as features |
Regression | allenai/specter2_regression | Encode papers to feed into linear regressors as features |
*The Retrieval model should suffice for downstream task types not mentioned above.
```python
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2')

# load base model
model = AutoModel.from_pretrained('allenai/specter2')

# load the adapter(s) for the required task, provide an identifier for the
# adapter in the load_as argument and activate it
model.load_adapter("allenai/specter2_proximity", source="hf", load_as="proximity", set_active=True)
# other possibilities: allenai/specter2_<classification|regression|adhoc_query>

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with the separator token
text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

# preprocess the input
inputs = tokenizer(text_batch, padding=True, truncation=True,
                   return_tensors="pt", return_token_type_ids=False, max_length=512)
output = model(**inputs)

# take the first token ([CLS]) of each sequence as the embedding
embeddings = output.last_hidden_state[:, 0, :]
```
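The resulting embeddings can be compared directly for retrieval-style tasks such as nearest-neighbor search. Here is a minimal sketch, assuming the `embeddings` tensor from the snippet above and treating the first paper as the query; the variable names are illustrative, not part of the model's API:

```python
import torch
import torch.nn.functional as F

# treat the first paper's embedding as the query, the rest as candidates
query_emb = embeddings[0].unsqueeze(0)   # shape: (1, hidden_size)
candidate_embs = embeddings[1:]          # shape: (num_candidates, hidden_size)

# cosine similarity between the query and each candidate
scores = F.cosine_similarity(query_emb, candidate_embs, dim=1)

# candidates sorted from most to least similar to the query
ranking = torch.argsort(scores, descending=True)
print(ranking.tolist(), scores[ranking].tolist())
```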
For evaluation and downstream usage, please refer to https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md .
The base model is trained on citation links between papers, while the adapters are trained on 8 large-scale tasks across the four formats. All the data is part of the SciRepEval benchmark and is available here.
The citation links are triplets of the form
{"query": {"title": ..., "abstract": ...}, "pos": {"title": ..., "abstract": ...}, "neg": {"title": ..., "abstract": ...}}
consisting of a query paper, a positive citation, and a negative citation; the negative can come from the same or a different field of study as the query paper or its citations.
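For intuition, training on such triplets typically uses a margin-based loss that pulls the query embedding toward the positive citation and pushes it away from the negative one. The PyTorch sketch below is illustrative only: the L2 distance, the margin of 1.0, and the `encode` helper (standing in for the [CLS]-embedding step shown earlier) are assumptions, not the exact training code:

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(query, pos, neg, margin=1.0):
    """Margin loss over a batch of (query, positive, negative) embeddings."""
    # L2 distance between the query and the positive/negative embeddings
    d_pos = F.pairwise_distance(query, pos)
    d_neg = F.pairwise_distance(query, neg)
    # penalized whenever the positive is not closer than the negative by `margin`
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# usage (encode() is a hypothetical helper returning [CLS] embeddings as above):
# loss = triplet_margin_loss(encode(query_batch), encode(pos_batch), encode(neg_batch))
```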
Please refer to the SPECTER paper.
The model is trained in two stages using SciRepEval:
- Base model, on the citation triplets above: batch size = 1024, max input length = 512, learning rate = 2e-5, epochs = 2, warmup steps = 10%, fp16
- Adapters, on the task data: batch size = 256, max input length = 512, learning rate = 1e-4, epochs = 6, warmup = 1000 steps, fp16
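As a rough illustration, the two stages map onto Hugging Face `TrainingArguments` as below. This is a hedged sketch, not the authors' actual training script: the output directory names are made up, the batch sizes are the stated global figures (which would be split across devices or gradient accumulation in practice), and the dataset/trainer wiring is omitted:

```python
from transformers import TrainingArguments

# stage 1: base model on the citation triplets
base_args = TrainingArguments(
    output_dir="specter2_base",        # hypothetical path
    per_device_train_batch_size=1024,  # stated global batch size
    learning_rate=2e-5,
    num_train_epochs=2,
    warmup_ratio=0.1,                  # warmup steps = 10%
    fp16=True,
)

# stage 2: task-format adapters on the SciRepEval training tasks
adapter_args = TrainingArguments(
    output_dir="specter2_adapters",    # hypothetical path
    per_device_train_batch_size=256,
    learning_rate=1e-4,
    num_train_epochs=6,
    warmup_steps=1000,
    fp16=True,
)
```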
We evaluate the model on SciRepEval, a large-scale evaluation benchmark for scientific embedding tasks, which includes SciDocs as a subset. We also evaluate on MDCR and establish a new SoTA (state of the art) there.
Model | SciRepEval In-Train | SciRepEval Out-of-Train | SciRepEval Avg | MDCR (MAP, Recall@5) |
---|---|---|---|---|
BM-25 | n/a | n/a | n/a | (33.7, 28.5) |
SPECTER | 54.7 | 57.4 | 68.0 | (30.6, 25.5) |
SciNCL | 55.6 | 57.8 | 69.0 | (32.6, 27.3) |
SciRepEval-Adapters | 61.9 | 59.0 | 70.9 | (35.3, 29.6) |
SPECTER 2.0-Adapters | 62.3 | 59.2 | 71.2 | (38.4, 33.0) |
Please cite the following works if you use SPECTER 2.0:
```bibtex
@inproceedings{specter2020cohan,
  title={{SPECTER: Document-level Representation Learning using Citation-informed Transformers}},
  author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
  booktitle={ACL},
  year={2020}
}
```

```bibtex
@article{Singh2022SciRepEvalAM,
  title={SciRepEval: A Multi-Format Benchmark for Scientific Document Representations},
  author={Amanpreet Singh and Mike D'Arcy and Arman Cohan and Doug Downey and Sergey Feldman},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.13308}
}
```