Model:
new5558/HoogBERTa-NER-lst20
This repository contains a Thai pretrained language representation (HoogBERTa_base) fine-tuned for the Named Entity Recognition (NER) task.
Since we use subword-nmt BPE encoding, the input must be pre-tokenized (following the BEST standard) before being fed to HoogBERTa. To pre-tokenize automatically, install attacut:

```
pip install attacut
```
To initialize the model from the hub, use the following commands:
```python
from transformers import RobertaTokenizerFast, RobertaForTokenClassification
from attacut import tokenize
import torch

tokenizer = RobertaTokenizerFast.from_pretrained("new5558/HoogBERTa-NER-lst20")
model = RobertaForTokenClassification.from_pretrained("new5558/HoogBERTa-NER-lst20")
```
To perform NER tagging, use the following commands:
```python
from transformers import pipeline

nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"

# Pre-tokenize: escape literal underscores, then re-encode spaces as " _ "
all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

sentence = " _ ".join(all_sent)

print(nlp(sentence))
```
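The pre-processing loop above can be factored into a standalone helper to make the encoding convention explicit: each space-separated chunk is word-tokenized, the words are joined by spaces, literal underscores are escaped as `[!und:]`, and the original spaces are re-encoded as a `" _ "` marker. This is a minimal sketch; `pretokenize` is an illustrative name (not part of the library), and the character-level toy tokenizer merely stands in for attacut's `tokenize`:

```python
def pretokenize(sentence, word_tokenize):
    # Split on literal spaces, word-tokenize each chunk, escape "_",
    # then rejoin chunks with the " _ " space marker.
    chunks = sentence.split(" ")
    encoded = []
    for chunk in chunks:
        encoded.append(" ".join(word_tokenize(chunk)).replace("_", "[!und:]"))
    return " _ ".join(encoded)

# Toy character-level tokenizer, for illustration only.
print(pretokenize("ab c_d", lambda s: list(s)))  # → a b _ c [!und:] d
```

With attacut installed, passing its `tokenize` function as `word_tokenize` reproduces the pre-processing shown above.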
For batch processing, use the following commands:
```python
from transformers import pipeline

nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentenceL = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]

# Pre-tokenize each sentence before passing the batch to the pipeline
inputList = []
for sentX in sentenceL:
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))
    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

print(nlp(inputList))
```
Please cite as:
```
@inproceedings{porkaew2021hoogberta,
  title     = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author    = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year      = {2021},
  address   = {Online}
}
```