Model:
allenai/specter2
SPECTER 2.0 is the successor to SPECTER and is capable of generating task-specific embeddings for scientific tasks when paired with adapters. Given the combined title and abstract of a scientific paper, or a short text query, the model can be used to generate effective embeddings for downstream applications.
Note: For general embedding purposes, please use allenai/specter2_proximity.
To get the best performance on a given downstream task type, load the adapter associated with the base model as in the example below.
SPECTER 2.0 was trained on over 6M triplets of scientific paper citations, which are available here. It was then trained further, with task-format-specific adapter modules attached, on all the training tasks of SciRepEval.
The task formats trained on are Retrieval (Proximity), Adhoc Query, Classification, and Regression, as listed in the table below.
It builds on the work done in SciRepEval: A Multi-Format Benchmark for Scientific Document Representations, and the trained models are also evaluated on that benchmark.
Model | Name and HF link | Description |
---|---|---|
Retrieval* | allenai/specter2_proximity | Encode papers as queries and candidates, e.g. Link Prediction, Nearest Neighbor Search |
Adhoc Query | allenai/specter2_adhoc_query | Encode short raw-text queries for search tasks (candidate papers can be encoded with proximity) |
Classification | allenai/specter2_classification | Encode papers to feed into linear classifiers as features |
Regression | allenai/specter2_regression | Encode papers to feed into linear regressors as features |
*The Retrieval model should suffice for downstream task types not mentioned above.
```python
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('allenai/specter2')

# load base model
model = AutoModel.from_pretrained('allenai/specter2')

# load the adapter(s) for the required task, provide an identifier for the
# adapter in the load_as argument and activate it
model.load_adapter("allenai/specter2_proximity", source="hf", load_as="proximity", set_active=True)
# other possibilities: allenai/specter2_<classification|regression|adhoc_query>

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with the separator token
text_batch = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

# preprocess the input
inputs = tokenizer(text_batch, padding=True, truncation=True,
                   return_tensors="pt", return_token_type_ids=False, max_length=512)
output = model(**inputs)

# take the first token ([CLS]) of each sequence as the embedding
embeddings = output.last_hidden_state[:, 0, :]
```
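The resulting embeddings can be compared directly for retrieval-style tasks such as nearest-neighbor search. Here is a minimal sketch, assuming the `embeddings` tensor from the snippet above and treating the first paper as the query; the variable names are illustrative, not part of the model's API:

```python
import torch
import torch.nn.functional as F

# treat the first paper's embedding as the query, the rest as candidates
query_emb = embeddings[0].unsqueeze(0)   # shape: (1, hidden_size)
candidate_embs = embeddings[1:]          # shape: (num_candidates, hidden_size)

# cosine similarity between the query and each candidate
scores = F.cosine_similarity(query_emb, candidate_embs, dim=1)

# candidates sorted from most to least similar to the query
ranking = torch.argsort(scores, descending=True)
print(ranking.tolist(), scores[ranking].tolist())
```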
For evaluation and downstream usage, please refer to https://github.com/allenai/scirepeval/blob/main/evaluation/INFERENCE.md .
The base model is trained on citation links between papers, while the adapters are trained on 8 large-scale tasks across the four formats. All the data is part of the SciRepEval benchmark and is available here.
The citation links are triplets of the form
{"query": {"title": ..., "abstract": ...}, "pos": {"title": ..., "abstract": ...}, "neg": {"title": ..., "abstract": ...}}
consisting of a query paper, a positive citation, and a negative citation; the negative can come from the same or a different field of study as the query paper or its citations.
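For intuition, training on such triplets typically uses a margin-based loss that pulls the query embedding toward the positive citation and pushes it away from the negative one. The PyTorch sketch below is illustrative only: the L2 distance, the margin of 1.0, and the `encode` helper (standing in for the [CLS]-embedding step shown earlier) are assumptions, not the exact training code:

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(query, pos, neg, margin=1.0):
    """Margin loss over a batch of (query, positive, negative) embeddings."""
    # L2 distance between the query and the positive/negative embeddings
    d_pos = F.pairwise_distance(query, pos)
    d_neg = F.pairwise_distance(query, neg)
    # penalized whenever the positive is not closer than the negative by `margin`
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# usage (encode() is a hypothetical helper returning [CLS] embeddings as above):
# loss = triplet_margin_loss(encode(query_batch), encode(pos_batch), encode(neg_batch))
```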
Please refer to the SPECTER paper.
The model is trained in two stages using SciRepEval:
- Base model, on the citation triplets above: batch size = 1024, max input length = 512, learning rate = 2e-5, epochs = 2, warmup steps = 10%, fp16
- Adapters, on the task data: batch size = 256, max input length = 512, learning rate = 1e-4, epochs = 6, warmup = 1000 steps, fp16
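As a rough illustration, the two stages map onto Hugging Face `TrainingArguments` as below. This is a hedged sketch, not the authors' actual training script: the output directory names are made up, the batch sizes are the stated global figures (which would be split across devices or gradient accumulation in practice), and the dataset/trainer wiring is omitted:

```python
from transformers import TrainingArguments

# stage 1: base model on the citation triplets
base_args = TrainingArguments(
    output_dir="specter2_base",        # hypothetical path
    per_device_train_batch_size=1024,  # stated global batch size
    learning_rate=2e-5,
    num_train_epochs=2,
    warmup_ratio=0.1,                  # warmup steps = 10%
    fp16=True,
)

# stage 2: task-format adapters on the SciRepEval training tasks
adapter_args = TrainingArguments(
    output_dir="specter2_adapters",    # hypothetical path
    per_device_train_batch_size=256,
    learning_rate=1e-4,
    num_train_epochs=6,
    warmup_steps=1000,
    fp16=True,
)
```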
We evaluate the model on SciRepEval, a large-scale evaluation benchmark for scientific embedding tasks, which includes SciDocs as a subset. We also evaluate on MDCR and establish a new SoTA (state of the art) there.
Model | SciRepEval In-Train | SciRepEval Out-of-Train | SciRepEval Avg | MDCR (MAP, Recall@5) |
---|---|---|---|---|
BM-25 | n/a | n/a | n/a | (33.7, 28.5) |
SPECTER | 54.7 | 57.4 | 68.0 | (30.6, 25.5) |
SciNCL | 55.6 | 57.8 | 69.0 | (32.6, 27.3) |
SciRepEval-Adapters | 61.9 | 59.0 | 70.9 | (35.3, 29.6) |
SPECTER 2.0-Adapters | 62.3 | 59.2 | 71.2 | (38.4, 33.0) |
Please cite the following works if you use SPECTER 2.0:
```bibtex
@inproceedings{specter2020cohan,
  title={{SPECTER: Document-level Representation Learning using Citation-informed Transformers}},
  author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
  booktitle={ACL},
  year={2020}
}
```

```bibtex
@article{Singh2022SciRepEvalAM,
  title={SciRepEval: A Multi-Format Benchmark for Scientific Document Representations},
  author={Amanpreet Singh and Mike D'Arcy and Arman Cohan and Doug Downey and Sergey Feldman},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.13308}
}
```