Model:
savasy/bert-turkish-text-classification
This model is a version of https://github.com/stefan-it/turkish-bert fine-tuned on a text classification dataset with the following 7 categories:
code_to_label={
'LABEL_0': 'dunya ',
'LABEL_1': 'ekonomi ',
'LABEL_2': 'kultur ',
'LABEL_3': 'saglik ',
'LABEL_4': 'siyaset ',
'LABEL_5': 'spor ',
'LABEL_6': 'teknoloji '}
The following Turkish benchmark dataset was used for fine-tuning:
https://www.kaggle.com/savasy/ttc4900
Before you start, install transformers as follows:
pip install transformers
# Code:
# import libraries
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
tokenizer= AutoTokenizer.from_pretrained("savasy/bert-turkish-text-classification")
# build and load the model; this takes time depending on your internet connection
model= AutoModelForSequenceClassification.from_pretrained("savasy/bert-turkish-text-classification")
# make pipeline
nlp=pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
# apply model
nlp("bla bla")
# [{'label': 'LABEL_2', 'score': 0.4753005802631378}]
code_to_label={
'LABEL_0': 'dunya ',
'LABEL_1': 'ekonomi ',
'LABEL_2': 'kultur ',
'LABEL_3': 'saglik ',
'LABEL_4': 'siyaset ',
'LABEL_5': 'spor ',
'LABEL_6': 'teknoloji '}
code_to_label[nlp("bla bla")[0]['label']]
# > 'kultur '
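The lookup above can be wrapped in a small helper. This is a minimal sketch and not part of the original model card: `decode_predictions` is a hypothetical name, and it assumes the pipeline returns a list of dicts with `label` and `score` keys, as in the sample output above. It also strips the trailing space that the category names carry.

```python
# Hypothetical helper (not from the model card): turn raw pipeline output
# into readable category names.
CODE_TO_LABEL = {
    "LABEL_0": "dunya ",
    "LABEL_1": "ekonomi ",
    "LABEL_2": "kultur ",
    "LABEL_3": "saglik ",
    "LABEL_4": "siyaset ",
    "LABEL_5": "spor ",
    "LABEL_6": "teknoloji ",
}


def decode_predictions(predictions, code_to_label=CODE_TO_LABEL):
    """Return (category, score) pairs for a list of pipeline predictions."""
    return [
        (code_to_label[p["label"]].strip(), p["score"])
        for p in predictions
    ]


# Sample output copied from the example above, so no model download is needed.
sample = [{"label": "LABEL_2", "score": 0.4753005802631378}]
decoded = decode_predictions(sample)
print(decoded)
```

With a loaded pipeline, the same helper could presumably be called as `decode_predictions(nlp("bla bla"))`.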
## loading data for Turkish text classification
import pandas as pd
# https://www.kaggle.com/savasy/ttc4900
df=pd.read_csv("7allV03.csv")
df.columns=["labels","text"]
df.labels=pd.Categorical(df.labels)
train_df=...
eval_df=...
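Simple Transformers generally expects integer class ids in the `labels` column, so the category names need a numeric encoding at some point. The sketch below shows one way this could work in plain Python. It assumes ids are assigned in alphabetical order of the category names, which is consistent with the LABEL_0 to LABEL_6 table shown earlier; `build_label_ids` and the toy rows are hypothetical and are not the author's preprocessing.

```python
# Hypothetical sketch: map category names to integer ids in alphabetical order.
def build_label_ids(labels):
    """Return a {category_name: integer_id} mapping, sorted alphabetically."""
    names = sorted({name.strip() for name in labels})
    return {name: idx for idx, name in enumerate(names)}


# Toy label column (invented for illustration, not taken from the dataset).
toy_labels = [
    "spor ", "dunya ", "ekonomi ", "spor ",
    "kultur ", "saglik ", "siyaset ", "teknoloji ",
]

label_to_id = build_label_ids(toy_labels)
encoded = [label_to_id[name.strip()] for name in toy_labels]
print(label_to_id)
print(encoded)
```

Under this assumption, id 2 corresponds to `kultur`, matching `LABEL_2` in the mapping used for inference.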
# model
from simpletransformers.classification import ClassificationModel
import torch
import sklearn.metrics
# use the GPU when one is available
cuda_available = torch.cuda.is_available()
model_args = {
"use_early_stopping": True,
"early_stopping_delta": 0.01,
"early_stopping_metric": "mcc",
"early_stopping_metric_minimize": False,
"early_stopping_patience": 5,
"evaluate_during_training_steps": 1000,
"fp16": False,
"num_train_epochs":3
}
model = ClassificationModel(
"bert",
"dbmdz/bert-base-turkish-cased",
use_cuda=cuda_available,
args=model_args,
num_labels=7
)
model.train_model(train_df, acc=sklearn.metrics.accuracy_score)
For other models to train, see https://simpletransformers.ai/
For detailed usage of Turkish text classification, see the Python notebook.