Description:

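Sentence-CamemBERT-Large is a French embedding model developed by La Javaness. It represents the content and semantics of a French sentence as a mathematical vector. This captures the meaning of the text beyond the individual words in queries and documents, which enables strong semantic search.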

This pre-trained sentence embedding model is the state of the art for French sentence embeddings.

The model is fine-tuned from the pre-trained facebook/camembert-large using Siamese BERT-Networks with 'sentence-transformers' on the stsb dataset.
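The exact training objective is not stated here. As a minimal sketch only, assuming the cosine-similarity regression objective that the Sentence-BERT paper describes for STS data, and assuming mean pooling over token vectors, a Siamese setup passes both sentences through the same encoder and penalizes the gap between their cosine similarity and the gold score. The names `mean_pool`, `cosine`, and `siamese_regression_loss` are hypothetical, and the token vectors are toy values rather than CamemBERT outputs.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors into one fixed-size sentence vector (assumed pooling)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(tok[d] for tok in token_vectors) / count for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def siamese_regression_loss(tokens_a, tokens_b, gold_score):
    """Squared error between cosine(u, v) and a gold similarity in [0, 1].

    Both inputs go through the same pooling step; in a real Siamese network
    they would also share the same encoder weights.
    """
    u = mean_pool(tokens_a)
    v = mean_pool(tokens_b)
    return (cosine(u, v) - gold_score) ** 2

# Toy token vectors (hypothetical values, 2 dimensions for readability).
tokens_a = [[1.0, 0.0], [1.0, 0.0]]
tokens_b = [[0.0, 1.0]]

loss_when_gold_is_dissimilar = siamese_regression_loss(tokens_a, tokens_b, 0.0)
loss_when_gold_is_similar = siamese_regression_loss(tokens_a, tokens_b, 1.0)
print(loss_when_gold_is_dissimilar, loss_when_gold_is_similar)
```

The two toy sentences point in orthogonal directions, so the loss is small when the gold label says they are unrelated and large when it says they are similar. Fine-tuning would adjust the shared encoder to reduce this loss over the training pairs.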

Usage

The model can be used directly (without a language model) as follows:

from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("dangvantuan/sentence-camembert-large")

sentences = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]

# One embedding vector per input sentence
embeddings = model.encode(sentences)
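Once sentences are encoded, semantic search usually comes down to comparing vectors. The following is a minimal sketch of ranking documents by cosine similarity to a query. It uses small hand-written vectors as stand-ins for real embeddings so that it runs without downloading the model. The helper names `cosine_similarity` and `rank_by_similarity` and all toy values are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional stand-ins for real embeddings (hypothetical values).
toy_query = [1.0, 0.0, 1.0]
toy_docs = [
    [0.9, 0.1, 1.1],    # nearly the same direction as the query
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [-1.0, 0.0, -1.0],  # opposite direction
]

ranking = rank_by_similarity(toy_query, toy_docs)
for index, score in ranking:
    print(index, round(score, 4))
```

With the real model, `embeddings[0]` could serve as the query and `embeddings[1:]` as the documents. The sentence-transformers package also provides `util.cos_sim` for the same computation on full embedding matrices.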

Evaluation

The model can be evaluated on the French stsb dev and test data as follows.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
from datasets import load_dataset

# Load the model to evaluate
model = SentenceTransformer("dangvantuan/sentence-camembert-large")

def convert_dataset(dataset):
    """Turn stsb rows into InputExample pairs with scores scaled to 0 ... 1."""
    dataset_samples = []
    for df in dataset:
        score = float(df['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[df['sentence1'], df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Loading the dataset for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert the dataset for evaluation

# For Dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For Test set:
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")

Test results: performance is measured using Pearson and Spearman correlation:

  • On dev
Model      Pearson correlation   Spearman correlation   #params
1238321    88.2                  88.02                  336M
1239321    86.73                 86.54                  110M
12310321   79.22                 79.16                  135M
12311321   85                    NaN                    175B
12312321   79.75                 80.44                  NaN
  • On test
Model      Pearson correlation   Spearman correlation
1238321    85.9                  85.8
1239321    82.36                 81.64
12310321   78.62                 77.48
12311321   82                    NaN
12312321   79.05                 77.56
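For readers less familiar with the two metrics, here is a minimal sketch of how Pearson and Spearman correlation can be computed between gold similarity labels and predicted cosine scores. The values in `gold` and `predicted` are made up for illustration and are not taken from the results above. The functions return values in the range -1 to 1, whereas the tables appear to report them multiplied by 100.

```python
import math

def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical gold labels (already scaled to 0 ... 1) and model scores.
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.10, 0.15, 0.40, 0.95, 0.90]

pearson_score = pearson(gold, predicted)
spearman_score = spearman(gold, predicted)
print(round(pearson_score, 4), round(spearman_score, 4))
```

Pearson rewards scores that track the labels linearly, while Spearman only looks at whether the ordering of the pairs is preserved, which is why the two numbers can differ for the same model.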

Citation

@article{reimers2019sentence,
   title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
   author={Reimers, Nils and Gurevych, Iryna},
   journal={https://arxiv.org/abs/1908.10084},
   year={2019}
}


@article{martin2020camembert,
   title={CamemBERT: a Tasty French Language Model},
   author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
   journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
   year={2020}
}