This repository contains HoogBERTa_base, a pretrained language representation model for Thai, which can be used for feature extraction and masked language modeling tasks.
Since we use subword-nmt BPE encoding, the input needs to be pre-tokenized with the BEST standard before being fed into HoogBERTa:
```
pip install attacut
```
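The examples below all repeat the same pre-tokenization step: word-tokenize each space-separated chunk, escape literal underscores as `[!und:]`, and re-join the chunks with the `" _ "` separator. As a sketch, that step can be factored into a helper (the function name `preprocess` is ours, not part of HoogBERTa or attacut; the tokenizer is passed in as a callable such as `attacut.tokenize`):

```python
def preprocess(text, word_tokenize):
    """Pre-tokenize `text` for HoogBERTa.

    `word_tokenize` is a callable returning a list of words,
    e.g. `attacut.tokenize`.
    """
    chunks = []
    for chunk in text.split(" "):
        # Join the words with spaces and escape literal underscores,
        # which HoogBERTa reserves as a separator token.
        chunks.append(" ".join(word_tokenize(chunk)).replace("_", "[!und:]"))
    # Re-join the original space-separated chunks with " _ ".
    return " _ ".join(chunks)

# With attacut installed:
#   from attacut import tokenize
#   preprocess("วันที่ 12 มีนาคมนี้", tokenize)
```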
To initialize the model from the hub, use the following commands:
```python
from transformers import AutoTokenizer, AutoModel
from attacut import tokenize
import torch

tokenizer = AutoTokenizer.from_pretrained("new5558/HoogBERTa")
model = AutoModel.from_pretrained("new5558/HoogBERTa")
```
To extract token features, based on the RoBERTa architecture, use the following commands:
```python
model.eval()
sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"

all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

sentence = " _ ".join(all_sent)
tokenized_text = tokenizer(sentence, return_tensors='pt')
token_ids = tokenized_text['input_ids']

with torch.no_grad():
    features = model(**tokenized_text, output_hidden_states=True).hidden_states[-1]
```
For batch processing, use:
```python
model.eval()
sentenceL = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]

inputList = []
for sentX in sentenceL:
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))
    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

tokenized_text = tokenizer(inputList, padding=True, return_tensors='pt')
token_ids = tokenized_text['input_ids']

with torch.no_grad():
    features = model(**tokenized_text, output_hidden_states=True).hidden_states[-1]
```
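The batched `features` tensor has shape `(batch, sequence_length, hidden_size)`, with padding positions included. If you want one vector per sentence, a common (not HoogBERTa-specific) approach is masked mean pooling over the token features; the `mean_pool` helper below is our own sketch:

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Average token features per sentence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1), same dtype as the features
    mask = attention_mask.unsqueeze(-1).type_as(hidden_states)
    summed = (hidden_states * mask).sum(dim=1)   # (batch, hidden_size)
    counts = mask.sum(dim=1).clamp(min=1e-9)     # (batch, 1), avoid div by zero
    return summed / counts

# e.g.:
# sentence_embeddings = mean_pool(features, tokenized_text['attention_mask'])
```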
To use HoogBERTa as an embedding layer, use:
```python
with torch.no_grad():
    # token_ids is a tensor of dtype long
    features = model(token_ids, output_hidden_states=True).hidden_states[-1]
```
Please cite as:
```bibtex
@inproceedings{porkaew2021hoogberta,
  title     = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author    = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year      = {2021},
  address   = {Online}
}
```