Model:

new5558/HoogBERTa-NER-lst20

HoogBERTa

This repository contains the Thai pretrained language representation (HoogBERTa_base) fine-tuned for the Named Entity Recognition (NER) task.

Documentation

Prerequisites

Since we use subword-nmt BPE encoding, the input must be pre-tokenized into words (following the BEST standard) before being fed to HoogBERTa. We use AttaCut for this step:

```bash
pip install attacut
```

Getting Started

To initialize the model from the hub, use the following commands:

```python
from transformers import RobertaTokenizerFast, RobertaForTokenClassification
from attacut import tokenize
import torch

tokenizer = RobertaTokenizerFast.from_pretrained("new5558/HoogBERTa-NER-lst20")
model = RobertaForTokenClassification.from_pretrained("new5558/HoogBERTa-NER-lst20")
```

To do NER tagging, use the following commands:

```python
from transformers import pipeline

nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"

# Word-tokenize each space-separated chunk, escape literal underscores,
# then rejoin the chunks with " _ " to mark the original spaces.
all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

sentence = " _ ".join(all_sent)

print(nlp(sentence))
```

For batch processing, use the following commands:

```python
from transformers import pipeline

nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentenceL = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
inputList = []
for sentX in sentenceL:
    sentences = sentX.split(" ")
    all_sent = []
    # Apply the same pre-tokenization as in the single-sentence case.
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

print(nlp(inputList))
```
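The same preprocessing appears in both snippets above, so it can be factored into one helper. This is just a sketch: `prepare_input` and the `tokenize_fn` parameter are illustrative names, not part of the model card; in practice you would pass AttaCut's `tokenize` as `tokenize_fn`.

```python
def prepare_input(sentence, tokenize_fn):
    """Pre-tokenize a space-separated sentence for HoogBERTa.

    Splits on spaces, word-tokenizes each chunk with `tokenize_fn`,
    escapes literal underscores as "[!und:]", and rejoins the chunks
    with the " _ " separator that stands in for the original spaces.
    """
    chunks = sentence.split(" ")
    processed = [" ".join(tokenize_fn(chunk)).replace("_", "[!und:]")
                 for chunk in chunks]
    return " _ ".join(processed)
```

With this helper, the batch loop above reduces to `inputList = [prepare_input(s, tokenize) for s in sentenceL]`.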

Hugging Face Models

  • HoogBERTaEncoder
  • HoogBERTaMuliTaskTagger

Citation

Please cite as:

```bibtex
@inproceedings{porkaew2021hoogberta,
  title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year = {2021},
  address = {Online}
}
```

Download full-text PDF · View code on GitHub