Model:
entropy/roberta_zinc_480m
This is a RoBERTa-style masked language model trained on approximately 480m SMILES strings from the ZINC database. The model has roughly 102m parameters and was trained for 150,000 iterations with a batch size of 4096, reaching a validation loss of approximately 0.122. The model can be used to generate embeddings of SMILES strings.
```python
import torch
from transformers import RobertaTokenizerFast, RobertaForMaskedLM, DataCollatorWithPadding

tokenizer = RobertaTokenizerFast.from_pretrained("entropy/roberta_zinc_480m", max_len=128)
model = RobertaForMaskedLM.from_pretrained("entropy/roberta_zinc_480m")

collator = DataCollatorWithPadding(tokenizer, padding=True, return_tensors="pt")

smiles = [
    "Brc1cc2c(NCc3ccccc3)ncnc2s1",
    "Brc1cc2c(NCc3ccccn3)ncnc2s1",
    "Brc1cc2c(NCc3cccs3)ncnc2s1",
    "Brc1cc2c(NCc3ccncc3)ncnc2s1",
    "Brc1cc2c(Nc3ccccc3)ncnc2s1",
]

# Tokenize, then pad the batch to a common length.
inputs = collator(tokenizer(smiles))

with torch.no_grad():  # no gradients needed for embedding extraction
    outputs = model(**inputs, output_hidden_states=True)

# Last-layer hidden states, shape (batch, seq_len, hidden_dim).
full_embeddings = outputs.hidden_states[-1]
mask = inputs["attention_mask"]

# Mean-pool over real tokens only, masking out padding positions.
embeddings = (full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1)
```
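One common use of the pooled embeddings is comparing molecules by cosine similarity. The snippet below is a minimal sketch, not part of the original model card; it assumes the `embeddings` tensor produced by the example above.

```python
import torch.nn.functional as F

# L2-normalize so the dot product of two rows equals their cosine similarity.
normed = F.normalize(embeddings, p=2, dim=-1)

# Pairwise cosine similarity matrix, shape (5, 5) for the 5 example SMILES.
similarity = normed @ normed.T
print(similarity)
```

Structurally similar molecules in the example batch (e.g. those differing only in one ring heteroatom) should show noticeably higher pairwise similarity than unrelated pairs.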