vi-word-segmentation

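This model is a fine-tuned version of NlpHUST/electra-base-vn, trained on the VLSP 2013 Vietnamese word segmentation dataset. It achieves the following results on the evaluation set: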

Loss: 0.0501
Precision: 0.9833
Recall: 0.9838
F1: 0.9835
Accuracy: 0.9911
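
As a quick sanity check, the reported F1 can be recomputed from the reported precision and recall. The sketch below assumes F1 is the usual harmonic mean of the two; the card does not say which metric implementation or averaging mode produced these numbers.

```python
# Reported evaluation metrics, copied from the list above.
precision = 0.9833
recall = 0.9838

# Assumption: F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Agrees with the reported F1 of 0.9835 to within rounding.
print(f"{f1:.4f}")
```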

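Model description

More information needed

Intended uses & limitations

You can use this model with the Transformers token-classification pipeline to segment Vietnamese text into words. Syllables that belong to the same word are joined with an underscore.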

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("NlpHUST/vi-word-segmentation")
model = AutoModelForTokenClassification.from_pretrained("NlpHUST/vi-word-segmentation")

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer)
example = "Phát biểu tại phiên thảo luận về tình hình kinh tế xã hội của Quốc hội sáng 28/10 , Bộ trưởng Bộ LĐ-TB&XH Đào Ngọc Dung khái quát , tại phiên khai mạc kỳ họp , lãnh đạo chính phủ đã báo cáo , đề cập tương đối rõ ràng về việc thực hiện các chính sách an sinh xã hội"

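# Each result is one WordPiece token with its predicted label.
results = nlp(example)
example_tok = ""
for e in results:
    if e["word"].startswith("##"):
        # WordPiece continuation: glue it to the previous piece.
        example_tok += e["word"][2:]
    elif e["entity"] == "I":
        # "I" marks a syllable inside a word: join it with an underscore.
        example_tok += "_" + e["word"]
    else:
        # Any other label starts a new word.
        example_tok += " " + e["word"]
# Strip the leading space added before the first word.
print(example_tok.strip())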

Phát_biểu tại phiên thảo_luận về tình_hình kinh_tế xã_hội của Quốc_hội sáng 28 / 10 , Bộ_trưởng Bộ LĐ - TB [UNK] XH Đào_Ngọc_Dung khái_quát , tại phiên khai_mạc kỳ họp , lãnh_đạo chính_phủ đã báo_cáo , đề_cập tương_đối rõ_ràng về việc thực_hiện các chính_sách an_sinh xã_hội
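
The output above is rebuilt from tokenizer strings, so "28/10" comes back as "28 / 10" and "&" comes back as "[UNK]". One possible workaround, sketched below, is to slice the original text with the character offsets that the pipeline returns. This is an illustration, not part of the original card: `segment_with_offsets` is a hypothetical helper, and it assumes each result carries integer "start" and "end" keys, which the pipeline fills in only when a fast tokenizer is loaded. The mock results stand in for real pipeline output.

```python
def segment_with_offsets(text, results):
    """Rebuild segmented text from character offsets (hypothetical helper)."""
    pieces = []
    prev_end = None
    for r in results:
        start, end = r["start"], r["end"]
        surface = text[start:end]  # original characters, never "[UNK]"
        if prev_end is not None and start == prev_end:
            # No whitespace between the tokens in the source: glue them back.
            pieces.append(surface)
        elif r["entity"] == "I" and pieces:
            # Syllable inside a word: join with an underscore.
            pieces.append("_" + surface)
        else:
            # New word: separate with a space (except at the very start).
            pieces.append((" " if pieces else "") + surface)
        prev_end = end
    return "".join(pieces)


# Mock results in the shape of the pipeline output (labels are made up).
text = "thao luan 28/10 A&B"
mock = [
    {"word": "thao", "entity": "B", "start": 0, "end": 4},
    {"word": "luan", "entity": "I", "start": 5, "end": 9},
    {"word": "28", "entity": "B", "start": 10, "end": 12},
    {"word": "/", "entity": "B", "start": 12, "end": 13},
    {"word": "10", "entity": "B", "start": 13, "end": 15},
    {"word": "A", "entity": "B", "start": 16, "end": 17},
    {"word": "[UNK]", "entity": "B", "start": 17, "end": 18},
    {"word": "B", "entity": "B", "start": 18, "end": 19},
]
print(segment_with_offsets(text, mock))  # thao_luan 28/10 A&B
```

With a real input this would be called as `segment_with_offsets(example, nlp(example))`. Because adjacent tokens are glued back exactly as they appear in the source, punctuation stays attached to a word unless the input is already space-separated, as it is in the example sentence.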

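Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training: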

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 4
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 5.0

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.0168	1.0	4712	0.0284	0.9813	0.9825	0.9819	0.9904
0.0107	2.0	9424	0.0350	0.9789	0.9814	0.9802	0.9895
0.005	3.0	14136	0.0364	0.9826	0.9843	0.9835	0.9909
0.0033	4.0	18848	0.0434	0.9830	0.9831	0.9830	0.9908
0.0017	5.0	23560	0.0501	0.9833	0.9838	0.9835	0.9911
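
The step counts in the table can be cross-checked against the hyperparameters. The sketch below assumes training ran on a single device (the card does not say) and that every optimizer step consumed a full effective batch, so the example count is only an estimate.

```python
# Values copied from the hyperparameter list and the results table.
train_batch_size = 8
gradient_accumulation_steps = 2
num_epochs = 5
steps_per_epoch = 4712  # Step column at epoch 1.0

# Assumption: one device, which the card does not state explicitly.
num_devices = 1

# Effective batch size per optimizer step.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices

# Total optimizer steps over the whole run.
total_steps = steps_per_epoch * num_epochs

# Rough number of training examples seen per epoch (an estimate only).
approx_examples_per_epoch = steps_per_epoch * total_train_batch_size

print(total_train_batch_size)     # 16, matches total_train_batch_size
print(total_steps)                # 23560, matches the final Step value
print(approx_examples_per_epoch)  # 75392
```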

Framework versions

Transformers 4.22.2
Pytorch 1.12.1+cu113
Datasets 2.4.0
Tokenizers 0.12.1

Author:

NLP HUST

Dataset size:

509.41 MB