Model:
new5558/HoogBERTa-NER-lst20
This repository contains a Thai pretrained language representation (HoogBERTa_base) fine-tuned for the Named Entity Recognition (NER) task.
Since we use subword-nmt BPE encoding, the input must be pre-tokenized (following the BEST standard) before being fed to HoogBERTa. To pre-tokenize automatically, install attacut:

```
pip install attacut
```
To initialize the model from the hub, use the following commands:
```python
from transformers import RobertaTokenizerFast, RobertaForTokenClassification
from attacut import tokenize
import torch

tokenizer = RobertaTokenizerFast.from_pretrained("new5558/HoogBERTa-NER-lst20")
model = RobertaForTokenClassification.from_pretrained("new5558/HoogBERTa-NER-lst20")
```
To perform NER tagging, use the following commands:
```python
from transformers import pipeline

nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"

# Pre-tokenize: escape literal underscores, then re-encode spaces as " _ "
all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

sentence = " _ ".join(all_sent)

print(nlp(sentence))
```
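The pre-processing loop above can be factored into a standalone helper to make the encoding convention explicit: each space-separated chunk is word-tokenized, the words are joined by spaces, literal underscores are escaped as `[!und:]`, and the original spaces are re-encoded as a `" _ "` marker. This is a minimal sketch; `pretokenize` is an illustrative name (not part of the library), and the character-level toy tokenizer merely stands in for attacut's `tokenize`:

```python
def pretokenize(sentence, word_tokenize):
    # Split on literal spaces, word-tokenize each chunk, escape "_",
    # then rejoin chunks with the " _ " space marker.
    chunks = sentence.split(" ")
    encoded = []
    for chunk in chunks:
        encoded.append(" ".join(word_tokenize(chunk)).replace("_", "[!und:]"))
    return " _ ".join(encoded)

# Toy character-level tokenizer, for illustration only.
print(pretokenize("ab c_d", lambda s: list(s)))  # → a b _ c [!und:] d
```

With attacut installed, passing its `tokenize` function as `word_tokenize` reproduces the pre-processing shown above.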
For batch processing, use the following commands:
```python
from transformers import pipeline

nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentenceL = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]

# Pre-tokenize each sentence before passing the batch to the pipeline
inputList = []
for sentX in sentenceL:
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))
    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

print(nlp(inputList))
```
Please cite as:
```
@inproceedings{porkaew2021hoogberta,
  title     = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author    = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year      = {2021},
  address   = {Online}
}
```