Model:

MoritzLaurer/ernie-m-large-mnli-xnli

Task:

Zero-shot classification

Libraries:

PyTorch Safetensors Transformers

Datasets:

multi_nli xnli

Languages:

multilingual

Other:

ernie_m text-classification nli

Preprints:

arxiv:2012.15674 arxiv:1809.05053 arxiv:2111.09543 arxiv:1911.02116

License:

apache-2.0

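Multilingual ernie-m-large-mnli-xnli

Model description

This multilingual model can perform natural language inference (NLI) on 100 languages and is therefore also suitable for multilingual zero-shot classification. The underlying ERNIE-M model was pre-trained by Baidu, building on Meta's RoBERTa, on the CC100 multilingual dataset. It was then fine-tuned on the XNLI dataset, which contains hypothesis-premise pairs from 15 languages, as well as the English MNLI dataset. Baidu introduced the model in the ERNIE-M paper (arXiv:2012.15674). It outperforms RoBERTa models of equal size.

If you are looking for a faster (but less performant) model, you can try multilingual-MiniLMv2-L6-mnli-xnli. If you are looking for a base-size model with a good trade-off between performance and speed, you can try mDeBERTa-v3-base-mnli-xnli.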

How to use the model

Simple zero-shot classification pipeline

from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/ernie-m-large-mnli-xnli")

sequence_to_classify = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
candidate_labels = ["politics", "economy", "entertainment", "environment"]
output = classifier(sequence_to_classify, candidate_labels, multi_label=False)
print(output)
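
For readers wondering how an NLI model ends up doing zero-shot classification, here is a minimal sketch of the idea; it is not the pipeline's actual code. Each candidate label is turned into a hypothesis, the text serves as the premise, and the entailment logits are compared across labels. The template string mirrors the default used by the `transformers` zero-shot pipeline; the helper names and the logits in the usage lines are hypothetical.

```python
import math

def build_nli_pairs(text, candidate_labels, template="This example is {}."):
    """Pair the text (premise) with one hypothesis per candidate label."""
    return [(text, template.format(label)) for label in candidate_labels]

def rank_labels(candidate_labels, entailment_logits):
    """Softmax over the per-label entailment logits (single-label case)."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    return sorted(zip(candidate_labels, scores), key=lambda p: p[1], reverse=True)

labels = ["politics", "economy", "entertainment", "environment"]
pairs = build_nli_pairs("Angela Merkel ist eine Politikerin in Deutschland", labels)
# Made-up entailment logits, one per pair, standing in for real model output.
ranking = rank_labels(labels, [3.1, 0.2, -1.5, -0.7])
print(pairs[0])
print(ranking[0][0])
```

With `multi_label=False`, as in the pipeline call above, the scores sum to 1 across the candidate labels, so they are relative to the label set you supply.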

NLI use-case

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/ernie-m-large-mnli-xnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

premise = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
hypothesis = "Emmanuel Macron is the President of France"

inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")  # avoid shadowing the built-in input()
output = model(inputs["input_ids"].to(device))  # device is "cuda" or "cpu"
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)
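
The dictionary printed above maps each NLI class to a percentage. If a single decision is needed, one option is to take the highest-scoring class and reject it when it falls below a cut-off. The sketch below assumes that shape of dictionary; the `top_label` helper, the 50% threshold, and the example numbers are illustrative and not part of the model card.

```python
def top_label(prediction, min_confidence=50.0):
    """Return the highest-scoring label, or None if it is below the threshold."""
    label = max(prediction, key=prediction.get)
    return label if prediction[label] >= min_confidence else None

# Illustrative percentages only, not real model output.
example = {"entailment": 1.2, "neutral": 96.5, "contradiction": 2.3}
print(top_label(example))
print(top_label({"entailment": 40.0, "neutral": 35.0, "contradiction": 25.0}))
```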

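Training data

This model was trained on the XNLI development set and the MNLI training set. The XNLI development set consists of 2490 texts professionally translated from English into 14 other languages (37350 texts in total across the 15 languages; see the XNLI paper). Note that XNLI also contains a training set with machine-translated versions for the 15 languages, but because of quality issues with these machine translations, this model was trained only on the professional translations from the XNLI development set and the original English MNLI training set (392702 texts). Not using machine-translated texts can avoid overfitting the model to the 15 languages, avoids catastrophic forgetting of the other 85 languages ERNIE-M was pre-trained on, and significantly reduces training costs.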
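
The dataset sizes quoted above can be checked with a few lines of arithmetic. The combined total is a derived figure and assumes the two sets were simply concatenated without filtering or deduplication, which the model card does not state explicitly.

```python
XNLI_DEV_TEXTS_PER_LANGUAGE = 2490
XNLI_LANGUAGES = 15  # English plus the 14 professional translations
MNLI_TRAIN_TEXTS = 392702

xnli_dev_total = XNLI_DEV_TEXTS_PER_LANGUAGE * XNLI_LANGUAGES
# Assumption: plain concatenation of both sets.
combined_total = xnli_dev_total + MNLI_TRAIN_TEXTS

print(xnli_dev_total)   # 37350, the figure quoted for the XNLI development set
print(combined_total)   # 430052
```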

Training procedure

ernie-m-large-mnli-xnli was trained using the Hugging Face trainer with the following hyperparameters.

from transformers import TrainingArguments

training_args = TrainingArguments(
    num_train_epochs=3,              # total number of training epochs
    learning_rate=3e-05,
    per_device_train_batch_size=16,   # batch size per device during training
    gradient_accumulation_steps=2,   # gradients are accumulated over 2 batches before each update
    per_device_eval_batch_size=16,    # batch size for evaluation
    warmup_ratio=0.1,                # fraction of total training steps used for learning rate warmup
    weight_decay=0.01,               # strength of weight decay
    fp16=True,                       # mixed-precision training
)
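
To make the interaction between batch size, gradient accumulation, and `warmup_ratio` concrete, here is a rough sketch of the resulting schedule. It assumes a single device and a training set of 430052 texts (392702 MNLI plus 37350 XNLI development texts, simply concatenated); neither assumption is confirmed by the model card, and the Trainer's own rounding may differ by a step or so. The `schedule_summary` helper is hypothetical.

```python
import math

def schedule_summary(num_texts, per_device_batch=16, grad_accum=2,
                     num_devices=1, epochs=3, warmup_ratio=0.1):
    """Approximate effective batch size, optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_texts / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps

# Assumed training-set size and a single GPU; both are illustrative.
effective_batch, total_steps, warmup_steps = schedule_summary(430052)
print(effective_batch, total_steps, warmup_steps)
```

Under these assumptions each optimizer update sees 32 texts, and roughly the first tenth of the updates are spent ramping the learning rate up to 3e-05.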

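Evaluation results

The model was evaluated on the XNLI test set in 15 languages (5010 texts per language, 75150 in total). Note that multilingual NLI models are capable of classifying NLI texts without receiving NLI training data in the specific language (cross-lingual transfer). This means that the model is also able to do NLI on the other 85 languages ERNIE-M was pre-trained on, but performance is most likely lower than for the languages available in XNLI.

Also note that if other multilingual models on the model hub claim performance of around 90% on languages other than English, the authors have most likely made a mistake during testing, since no recent paper reports a multilingual average on XNLI more than a few points above 80%.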

Datasets	avg_xnli	mnli_m	mnli_mm	ar	bg	de	el	en	es	fr	hi	ru	sw	th	tr	ur	vi	zh
Accuracy	0.822	0.881	0.878	0.818	0.853	0.84	0.837	0.882	0.855	0.849	0.799	0.83	0.751	0.809	0.818	0.76	0.826	0.799
Inference text/sec (A100, batch=120)	1415.0	783.0	774.0	1487.0	1396.0	1430.0	1206.0	1623.0	1482.0	1291.0	1302.0	1366.0	1484.0	1500.0	1609.0	1344.0	1403.0	1302.0
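
As a sanity check on the table, the `avg_xnli` column can be recomputed from the 15 per-language accuracies. The values below are copied from the table; nothing else is assumed.

```python
# Per-language XNLI accuracy, copied from the table above.
xnli_accuracy = {
    "ar": 0.818, "bg": 0.853, "de": 0.84, "el": 0.837, "en": 0.882,
    "es": 0.855, "fr": 0.849, "hi": 0.799, "ru": 0.83, "sw": 0.751,
    "th": 0.809, "tr": 0.818, "ur": 0.76, "vi": 0.826, "zh": 0.799,
}

average = sum(xnli_accuracy.values()) / len(xnli_accuracy)
best = max(xnli_accuracy, key=xnli_accuracy.get)
worst = min(xnli_accuracy, key=xnli_accuracy.get)

print(round(average, 3))  # 0.822, matching avg_xnli
print(best, worst)        # en sw
```

The spread between the strongest language (English) and the weakest (Swahili) is about 13 points, which is worth keeping in mind when applying the model to lower-resource languages.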

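Limitations and bias

Please consult the original ERNIE-M paper and the literature on the different NLI datasets for potential biases.

Citation

If you use this model, please cite: Laurer, Moritz, Wouter van Atteveldt, Andreu Salleras Casas, and Kasper Welbers. 2022. "Less Annotating, More Classifying – Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI." Preprint, June. Open Science Framework. https://osf.io/74b8k.

Ideas for cooperation or questions?

If you have questions or ideas for cooperation, contact me at m{dot}laurer{at}vu{dot}nl or on LinkedIn.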

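Debugging and issues

The ERNIE-M architecture is only supported in transformers==4.27 or higher (which is not yet released and causes an error in the inference widget as of 03.03.23). Until version 4.27 is released, you need to install transformers from source with: pip install git+https://github.com/huggingface/transformers, and install the sentencepiece tokenizer with: pip install sentencepiece. After the release, you can run: pip install "transformers[sentencepiece]>=4.27" (the quotes keep the shell from treating >= as an output redirect).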
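
A quick way to tell whether an installed version meets the 4.27 requirement is to compare the major and minor components of its version string. The helper below is a hypothetical sketch, not part of `transformers`; it only looks at the first two components, so a source build reporting something like `4.27.0.dev0` passes as well.

```python
def supports_ernie_m(version_string, minimum=(4, 27)):
    """Return True if a transformers version string is at least 4.27."""
    parts = version_string.split(".")
    major_minor = tuple(int(part) for part in parts[:2])
    return major_minor >= minimum

print(supports_ernie_m("4.26.1"))
print(supports_ernie_m("4.27.0.dev0"))
```

In practice the string would come from `transformers.__version__` in the environment you want to check.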

Author:

Moritz Laurer

Dataset size:

4.18 GB