模型:
savasy/bert-turkish-text-classification
此模型是使用文本分类数据对 https://github.com/stefan-it/turkish-bert 进行微调,其中包含7个类别,如下所示
code_to_label={ 'LABEL_0': 'dunya ', 'LABEL_1': 'ekonomi ', 'LABEL_2': 'kultur ', 'LABEL_3': 'saglik ', 'LABEL_4': 'siyaset ', 'LABEL_5': 'spor ', 'LABEL_6': 'teknoloji '}
用于微调的是以下土耳其基准数据集
https://www.kaggle.com/savasy/ttc4900
开始之前,请按照以下操作安装transformers
pip install transformers
# Code: # import libraries from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer, AutoModelForSequenceClassification tokenizer= AutoTokenizer.from_pretrained("savasy/bert-turkish-text-classification") # build and load model, it take time depending on your internet connection model= AutoModelForSequenceClassification.from_pretrained("savasy/bert-turkish-text-classification") # make pipeline nlp=pipeline("sentiment-analysis", model=model, tokenizer=tokenizer) # apply model nlp("bla bla") # [{'label': 'LABEL_2', 'score': 0.4753005802631378}] code_to_label={ 'LABEL_0': 'dunya ', 'LABEL_1': 'ekonomi ', 'LABEL_2': 'kultur ', 'LABEL_3': 'saglik ', 'LABEL_4': 'siyaset ', 'LABEL_5': 'spor ', 'LABEL_6': 'teknoloji '} code_to_label[nlp("bla bla")[0]['label']] # > 'kultur '
## loading data for Turkish text classification import pandas as pd # https://www.kaggle.com/savasy/ttc4900 df=pd.read_csv("7allV03.csv") df.columns=["labels","text"] df.labels=pd.Categorical(df.labels) traind_df=... eval_df=... # model from simpletransformers.classification import ClassificationModel import torch,sklearn model_args = { "use_early_stopping": True, "early_stopping_delta": 0.01, "early_stopping_metric": "mcc", "early_stopping_metric_minimize": False, "early_stopping_patience": 5, "evaluate_during_training_steps": 1000, "fp16": False, "num_train_epochs":3 } model = ClassificationModel( "bert", "dbmdz/bert-base-turkish-cased", use_cuda=cuda_available, args=model_args, num_labels=7 ) model.train_model(train_df, acc=sklearn.metrics.accuracy_score)
要查看其他训练模型,请检查 https://simpletransformers.ai/
有关土耳其文本分类的详细用法,请参阅 python notebook