Model:

classla/xlm-roberta-base-multilingual-text-genre-classifier

X-GENRE classifier - multilingual text genre classifier

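A text classification model based on xlm-roberta-base and fine-tuned on a combination of three genre datasets: the Slovene GINCO dataset, the English CORE dataset, and the English FTD dataset. The model can be used for automatic genre identification of texts in any language supported by xlm-roberta-base.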

Model description

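The model was fine-tuned on the "X-GENRE" dataset, which consists of the CORE, FTD and GINCO datasets. Each dataset has its own genre schema, so the datasets were merged into a joint schema (the "X-GENRE" schema) based on a comparison of their labels and on cross-dataset experiments, described in detail here.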

Fine-tuning hyperparameters

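Fine-tuning was performed with simpletransformers. A brief hyperparameter optimization was carried out beforehand, and the presumed optimal hyperparameters are: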

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

Intended use and limitations

Usage

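An example of preparing data for genre identification and of post-processing the results can be found here, where we applied the X-GENRE classifier to the English part of the MaCoCu parallel corpora.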

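For reliable results, the genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words). Predictions made with a confidence lower than 0.9 should not be used. The label "Other" can serve as a further indicator of low confidence, because it often means that the text does not have enough features of any genre; these predictions can be discarded as well.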

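After the proposed post-processing (removing low-confidence predictions, the label "Other" and, in this specific case, also the label "Forum"), the performance on the MaCoCu data, based on manual inspection, reached macro and micro F1 of 0.92.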
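The post-processing rules above can be sketched as a small filter. This is a minimal sketch, not the authors' script: the function name `keep_reliable`, the whitespace-based word count, and the reading of "confidence" as the highest softmax probability over the model's raw outputs are all assumptions made for illustration.

```python
import math

MIN_WORDS = 75                # rule of thumb from the text above
MIN_CONFIDENCE = 0.9          # threshold advised in the text above
DISCARDED_LABELS = {"Other"}  # add "Forum" if it proves unreliable for your data


def softmax(logits):
    """Turn raw scores into probabilities (assumed confidence measure)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keep_reliable(texts, raw_outputs, id2label):
    """Return (text, label, confidence) for predictions that pass all filters."""
    kept = []
    for text, logits in zip(texts, raw_outputs):
        if len(text.split()) < MIN_WORDS:
            continue  # too short for a reliable genre prediction
        probs = softmax(list(logits))
        confidence = max(probs)
        label = id2label[probs.index(confidence)]
        if confidence < MIN_CONFIDENCE or label in DISCARDED_LABELS:
            continue  # low confidence or a label marked for removal
        kept.append((text, label, confidence))
    return kept
```

With the usage example in this document, `raw_outputs` would presumably correspond to `logit_output` and `id2label` to `model.config.id2label`.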

Usage example

from simpletransformers.classification import ClassificationModel

# The training hyperparameters are repeated here; "silent" suppresses progress output.
model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "silent": True,
}

# use_cuda=True requires a CUDA-capable GPU; set it to False to run on CPU.
model = ClassificationModel(
    "xlmroberta",
    "classla/xlm-roberta-base-multilingual-text-genre-classifier",
    use_cuda=True,
    args=model_args,
)

# predict() returns the predicted label indices and the raw model outputs.
predictions, logit_output = model.predict([
    "How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
    "On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can fastly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!",
])
predictions
# Output: array([3, 8])

# Map the label indices to label names.
[model.config.id2label[i] for i in predictions]
# Output: ['Instruction', 'Promotion']

A usage example for predicting on a dataset in batches is available via Google Colab.
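For larger collections, prediction is usually run in batches instead of in a single call. The following is a minimal sketch of that idea, not the code of the notebook mentioned above: `predict_in_batches`, `predict_fn` and `batch_size` are hypothetical names, and `predict_fn` is assumed to behave like `model.predict`, returning a pair of predictions and raw outputs for a list of texts.

```python
def predict_in_batches(texts, predict_fn, batch_size=8):
    """Call predict_fn on consecutive slices of texts and merge the results."""
    all_predictions, all_raw_outputs = [], []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        predictions, raw_outputs = predict_fn(batch)
        # extend() keeps the results in the same order as the input texts.
        all_predictions.extend(predictions)
        all_raw_outputs.extend(raw_outputs)
    return all_predictions, all_raw_outputs
```

Under these assumptions it would be called as `predict_in_batches(texts, model.predict)`; a suitable batch size depends on the available memory.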

X-GENRE categories

List of labels:

labels_list = ['Other', 'Information/Explanation', 'News', 'Instruction', 'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion']

labels_map = {'Other': 0, 'Information/Explanation': 1, 'News': 2, 'Instruction': 3, 'Opinion/Argumentation': 4, 'Forum': 5, 'Prose/Lyrical': 6, 'Legal': 7, 'Promotion': 8}
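The two structures above describe the same mapping, so the reverse lookup from a predicted index to a label name can be derived from either one. A short sketch (the name `id2label` mirrors `model.config.id2label` from the usage example, but building it by hand like this is only an illustration):

```python
labels_list = ['Other', 'Information/Explanation', 'News', 'Instruction',
               'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal',
               'Promotion']

# Label -> index, derived from the position of each label in the list.
labels_map = {label: index for index, label in enumerate(labels_list)}

# Index -> label, for decoding numeric predictions.
id2label = {index: label for label, index in labels_map.items()}

decoded = [id2label[i] for i in [3, 8]]
print(decoded)  # ['Instruction', 'Promotion']
```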

Description of labels:

| Label | Description | Examples |
|---|---|---|
| Information/Explanation | An objective text that describes or presents an event, a person, a thing, a concept etc. Its main purpose is to inform the reader about something. Common features: objective/factual, explanation/definition of a concept (x is …), enumeration. | research article, encyclopedia article, informational blog, product specification, course materials, general information, job description, manual, horoscope, travel guide, glossaries, historical article, biographical story/history |
| Instruction | An objective text which instructs the readers on how to do something. Common features: multiple steps/actions, chronological order, 1st person plural or 2nd person, modality (must, have to, need to, can, etc.), adverbial clauses of manner (in a way that), of condition (if), of time (after …). | how-to texts, recipes, technical support |
| Legal | An objective formal text that contains legal terms and is clearly structured. The name of the text type is often included in the headline (contract, rules, amendment, general terms and conditions, etc.). Common features: objective/factual, legal terms, 3rd person. | small print, software license, proclamation, terms and conditions, contracts, law, copyright notices, university regulation |
| News | An objective or subjective text which reports on an event recent at the time of writing or coming in the near future. Common features: adverbs/adverbial clauses of time and/or place (dates, places), many proper nouns, direct or reported speech, past tense. | news report, sports report, travel blog, reportage, police report, announcement |
| Opinion/Argumentation | A subjective text in which the authors convey their opinion or narrate their experience. It includes promotion of an ideology and other non-commercial causes. This genre includes a subjective narration of a personal experience as well. Common features: adjectives/adverbs that convey opinion, words that convey (un)certainty (certainly, surely), 1st person, exclamation marks. | review, blog (personal blog, travel blog), editorial, advice, letter to editor, persuasive article or essay, formal speech, pamphlet, political propaganda, columns, political manifesto |
| Promotion | A subjective text intended to sell or promote an event, product, or service. It addresses the readers, often trying to convince them to participate in something or buy something. Common features: contains adjectives/adverbs that promote something (high-quality, perfect, amazing), comparative and superlative forms of adjectives and adverbs (the best, the greatest, the cheapest), addressing the reader (usage of 2nd person), exclamation marks. | advertisement, promotion of a product (e-shops), promotion of an accommodation, promotion of company's services, invitation to an event |
| Forum | A text in which people discuss a certain topic in form of comments. Common features: multiple authors, informal language, subjective (the writers express their opinions), written in 1st person. | discussion forum, reader/viewer responses, QA forum |
| Prose/Lyrical | A literary text that consists of paragraphs or verses. A literary text is deemed to have no other practical purpose than to give pleasure to the reader. Often the author pays attention to the aesthetic appearance of the text. It can be considered as art. | lyrics, poem, prayer, joke, novel, short story |
| Other | A text that does not fall under any of the other genre categories. | |

Performance

Comparison with other models in in-dataset and cross-dataset experiments

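The X-GENRE model was compared with xlm-roberta-base classifiers that were fine-tuned on each of the genre datasets separately, using the X-GENRE schema (see the experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).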

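In the in-dataset experiments (training and testing on splits of the same dataset), it outperforms the classifiers trained on the individual datasets, except the one trained on the FTD dataset, which has a smaller number of X-GENRE labels.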

| Trained on | Micro F1 | Macro F1 |
|---|---|---|
| FTD | 0.843 | 0.851 |
| X-GENRE | 0.797 | 0.794 |
| CORE | 0.778 | 0.627 |
| GINCO | 0.754 | 0.75 |

When applied to the test splits of each of the datasets, the classifier performs well:

| Trained on | Tested on | Micro F1 | Macro F1 |
|---|---|---|---|
| X-GENRE | CORE | 0.837 | 0.859 |
| X-GENRE | FTD | 0.804 | 0.809 |
| X-GENRE | X-GENRE | 0.797 | 0.794 |
| X-GENRE | X-GENRE-dev | 0.784 | 0.784 |
| X-GENRE | GINCO | 0.749 | 0.758 |

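The classifier was also compared with the other classifiers on 2 additional genre datasets (mapped to the X-GENRE schema):

  • EN-GINCO: a sample of the English enTenTen20 corpus
  • FinCORE: the Finnish CORE corpus

| Trained on | Tested on | Micro F1 | Macro F1 |
|---|---|---|---|
| X-GENRE | EN-GINCO | 0.688 | 0.691 |
| X-GENRE | FinCORE | 0.674 | 0.581 |
| GINCO | EN-GINCO | 0.632 | 0.502 |
| FTD | EN-GINCO | 0.574 | 0.475 |
| CORE | EN-GINCO | 0.485 | 0.422 |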

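The cross-dataset and cross-lingual experiments showed that the X-GENRE classifier, trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.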
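The results in this section are reported as both micro and macro F1. As a reminder of how the two differ, here is a small self-contained sketch with made-up labels (the data and the function name `f1_scores` are illustrative only and unrelated to the reported results): micro F1 pools all decisions, while macro F1 averages the per-label F1 scores, so a rare label that is predicted poorly pulls macro F1 down more.

```python
def f1_scores(gold, predicted):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(gold) | set(predicted))
    per_label_f1 = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        per_label_f1.append(2 * tp / denom if denom else 0.0)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    macro = sum(per_label_f1) / len(per_label_f1)
    return micro, macro


# Made-up example: a frequent label and a rare one that is missed half the time.
gold = ["News"] * 8 + ["Legal"] * 2
predicted = ["News"] * 8 + ["News", "Legal"]
micro, macro = f1_scores(gold, predicted)
print(round(micro, 3), round(macro, 3))  # 0.9 0.804
```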

Citation

If you use the model, please cite the GitHub repository in which the fine-tuning experiments are explained:

@misc{Kuzman2022,
  author = {Kuzman, Taja},
  title = {{Comparison of genre datasets: CORE, GINCO and FTD}},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/TajaKuzman/Genre-Datasets-Comparison}}
}

and the following paper, on which the original model is based:

@article{DBLP:journals/corr/abs-1911-02116,
  author    = {Alexis Conneau and
               Kartikay Khandelwal and
               Naman Goyal and
               Vishrav Chaudhary and
               Guillaume Wenzek and
               Francisco Guzm{\'{a}}n and
               Edouard Grave and
               Myle Ott and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal   = {CoRR},
  volume    = {abs/1911.02116},
  year      = {2019},
  url       = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint    = {1911.02116},
  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

To cite the datasets that were used for fine-tuning:

CORE dataset:

@article{egbert2015developing,
  title={Developing a bottom-up, user-based method of web register classification},
  author={Egbert, Jesse and Biber, Douglas and Davies, Mark},
  journal={Journal of the Association for Information Science and Technology},
  volume={66},
  number={9},
  pages={1817--1831},
  year={2015},
  publisher={Wiley Online Library}
}

GINCO dataset:

@InProceedings{kuzman-rupnik-ljubei:2022:LREC,
  author    = {Kuzman, Taja  and  Rupnik, Peter  and  Ljube{\v{s}}i{\'c}, Nikola},
  title     = {{The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild}},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {1584--1594},
  url       = {https://aclanthology.org/2022.lrec-1.170}
}

FTD dataset:

@article{sharoff2018functional,
  title={Functional text dimensions for the annotation of web corpora},
  author={Sharoff, Serge},
  journal={Corpora},
  volume={13},
  number={1},
  pages={65--95},
  year={2018},
  publisher={Edinburgh University Press}
}

The datasets are available at:

  • http://hdl.handle.net/11356/1467 (GINCO)
  • https://github.com/TurkuNLP/CORE-corpus (CORE)
  • https://github.com/ssharoff/genre-keras (FTD)