Explained: Two Methods for Optimizing Embedding Models

Published October 9, 2024 by alex

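What are embeddings, and why do we need them?

Embeddings are an essential tool in machine learning: they turn complex objects (text, images, audio) into numeric vectors. These vectors make it possible to measure the similarity between objects, which matters for recommender systems, search, anomaly detection, and other tasks.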

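Why embeddings need to be optimized

Many state-of-the-art (SOTA) models output vector representations with 1024 dimensions, each encoded as a float32, so every dimension takes 4 bytes. Searching over 250 million such vectors requires about 1 TB of memory!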
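The 1 TB figure follows directly from the numbers above. A minimal sketch of the arithmetic, where the function name is only illustrative and index overhead is ignored:

```python
def embedding_memory_bytes(num_vectors: int, dims: int, bytes_per_value: int) -> int:
    """Raw storage for dense embeddings, ignoring any index overhead."""
    return num_vectors * dims * bytes_per_value


# 250 million vectors, 1024 dimensions, float32 = 4 bytes per value
total = embedding_memory_bytes(250_000_000, 1024, 4)
print(f"{total / 1e12:.3f} TB")  # 1.024 TB
```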

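Matryoshka Representation Learning (MRL)

We want an embedding model that can produce vectors of different sizes, for example from 1024 dimensions down to 64, while still retaining meaningful information.

The authors of the Matryoshka Representation Learning paper showed that a model can be trained to store the most important information at the beginning of the embedding vector. The tail can then be discarded, leaving a smaller and more efficient vector.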
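Discarding the tail amounts to keeping a prefix of the vector. A minimal pure-Python sketch follows; the helper name and the toy vector are made up for illustration. Re-normalizing after truncation is an extra step that matters when similarity is computed with a plain dot product rather than cosine similarity.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Toy 4-dimensional "embedding", truncated to 2 dimensions
small = truncate_and_normalize([3.0, 4.0, 12.0, 1.0], 2)
print(small)
```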

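During training, the loss is computed not only on the full-size embedding but also on its truncated prefixes. This adds little overhead in terms of training speed and memory.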
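The idea of summing a loss over several prefix lengths can be sketched in a few lines. This is an illustration only: the squared-error pair loss below is a stand-in chosen for brevity, not the loss used by any particular library, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_loss(a, b, target):
    """Stand-in base loss: squared error between cosine similarity and a target score."""
    return (cosine(a, b) - target) ** 2


def matryoshka_loss(a, b, target, dims, weights=None):
    """Weighted sum of the base loss over each truncated prefix length in `dims`."""
    if weights is None:
        weights = [1.0] * len(dims)
    return sum(w * pair_loss(a[:d], b[:d], target) for d, w in zip(dims, weights))


# Toy pair: identical in the first two dimensions, different in the tail
loss_value = matryoshka_loss([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0], 1.0, dims=[4, 2])
print(loss_value)
```

Because every prefix contributes to the total, the model is pushed to make the leading dimensions useful on their own.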

For example, nomic-ai/nomic-embed-text-v1.5 retains 95.8% of its performance when the embedding size is reduced 3x, and 90% when it is reduced 6x.

MRL in code

Training:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss, MatryoshkaLoss
model = SentenceTransformer("microsoft/mpnet-base")
# Base loss that MatryoshkaLoss applies at every truncated dimension
base_loss = CoSENTLoss(model=model)
loss = MatryoshkaLoss(
    model=model,
    loss=base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],  # one weight per entry in matryoshka_dims
)
# train_dataset is assumed to be prepared elsewhere
model.fit(
    train_objectives=[(train_dataset, loss)],
    ...,
)

Usage:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer("tomaarsen/mpnet-base-nli-matryoshka")
matryoshka_dim = 64
embeddings = model.encode(
    [
        "The weather is so nice!",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]
)
embeddings = embeddings[..., :matryoshka_dim]  # Shrink the embedding dimensions
print(embeddings.shape)
# => (3, 64)
# Similarity of the first sentence to the other two:
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
# => tensor([[0.8910, 0.1337]])

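Binary and scalar embedding quantization

As an optimization, embedding quantization is orthogonal to MRL. Instead of truncating the embedding, we reduce the precision of each dimension, which shrinks the vectors and speeds up computation.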

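Binary quantization

A simple threshold function maps each value of the normalized embedding to 0 or 1: values greater than 0 become 1, and all other values become 0.

To compute similarity we use the Hamming distance, which counts the number of positions at which two bit strings differ and is very cheap to compute.

Each dimension of the embedding is thus encoded with a single bit. In practice we work with bytes, so the bits are packed into bytes: an embedding with 1024 dimensions goes from 1024 bits to 128 bytes.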
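The three steps described here (thresholding, packing bits into bytes, and comparing with the Hamming distance) can be sketched in plain Python. The function names are illustrative, and the most-significant-bit-first packing order is an assumption of this sketch.

```python
def binarize(vec):
    """Threshold at 0: positive values become 1, everything else 0."""
    return [1 if x > 0 else 0 for x in vec]


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte, most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count differing bits between two equally long packed codes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


code_a = pack_bits(binarize([0.5, -0.1, 0.2, 0.0, -0.3, 0.9, 0.1, -0.7]))
code_b = pack_bits(binarize([0.4, 0.3, 0.1, -0.2, -0.1, 0.8, -0.5, -0.6]))
print(len(code_a), hamming_distance(code_a, code_b))
```

XOR followed by a bit count is what makes the comparison cheap: it works on whole bytes at a time and needs no floating-point arithmetic.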

Many vector search libraries and databases (faiss, for example) support binary vectors.

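Scalar quantization

Scalar quantization converts embeddings from float32 to int8.

This maps the continuous range of float32 values onto the discrete set of int8 values, which can represent 256 distinct values (from -128 to 127). To do so, we use a large calibration dataset of embeddings. We compute the range of these embeddings, that is, the minimum and maximum of each embedding dimension. From these ranges we derive the steps (buckets) into which each value is sorted.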
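The calibration and bucketing steps can be sketched as follows. This is a simplified illustration under stated assumptions: 255 equal-width steps per dimension, rounding to the nearest bucket, and clamping of out-of-range values. A real library may differ in these details, and the function names are hypothetical.

```python
def calibrate(calibration_embeddings):
    """Return the per-dimension minimum and maximum over the calibration set."""
    dims = len(calibration_embeddings[0])
    mins = [min(vec[d] for vec in calibration_embeddings) for d in range(dims)]
    maxs = [max(vec[d] for vec in calibration_embeddings) for d in range(dims)]
    return mins, maxs


def quantize_int8(vec, mins, maxs):
    """Map each float to one of 256 buckets, expressed as an integer in [-128, 127]."""
    quantized = []
    for x, lo, hi in zip(vec, mins, maxs):
        step = (hi - lo) / 255
        if step == 0:
            quantized.append(0)  # constant dimension: no range to divide into buckets
            continue
        bucket = round((x - lo) / step) - 128
        quantized.append(max(-128, min(127, bucket)))  # clamp values outside the calibrated range
    return quantized


mins, maxs = calibrate([[-1.0, 0.0], [1.0, 2.0], [0.2, 0.7]])
print(quantize_int8([-1.0, 2.0], mins, maxs))
```

Values that fall outside the calibrated range get clamped to the first or last bucket, which is one reason the calibration set should be representative of the data.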

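To further improve retrieval performance, the same rescoring step used with binary embeddings can be applied. Note that the calibration dataset has a large impact on performance because it defines the quantization buckets, so a large dataset is preferable.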

Quantizing an embedding of size 1024 to uint8 or int8 yields 1024 bytes.
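To put the per-vector sizes mentioned in this article side by side, here is a small sketch. The dictionary and function names are illustrative only.

```python
BITS_PER_VALUE = {"float32": 32, "int8": 8, "binary": 1}


def bytes_per_vector(dims: int, precision: str) -> int:
    """Storage for one embedding of `dims` dimensions at the given precision."""
    return dims * BITS_PER_VALUE[precision] // 8


for precision in BITS_PER_VALUE:
    print(precision, bytes_per_vector(1024, precision), "bytes")
```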

Quantization in code

Binary:

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings
# 1. Load an embedding model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
# 2. Encode some text without quantization, then apply quantization afterwards
embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."])
binary_embeddings = quantize_embeddings(embeddings, precision="binary")

Scalar:

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings
from datasets import load_dataset
# 1. Load an embedding model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
# 2. Prepare an example calibration dataset
corpus = load_dataset("nq_open", split="train[:1000]")["question"]
calibration_embeddings = model.encode(corpus)
# 3. Encode some text without quantization & apply quantization afterwards
embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."])
int8_embeddings = quantize_embeddings(
    embeddings,
    precision="int8",
    calibration_embeddings=calibration_embeddings,
)

Binary and scalar quantization can be combined: use binary embeddings for pre-filtering, then re-rank the candidates with scalar embeddings.
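A toy sketch of this two-stage search is shown below. Everything here is an assumption for illustration: the tiny hand-written corpus, the function names, and the simplified int8 conversion, which scales values in [-1, 1] by 127 instead of using a calibration dataset.

```python
def binarize(vec):
    """Threshold at 0: positive values become 1, everything else 0."""
    return [1 if x > 0 else 0 for x in vec]


def to_int8(vec):
    """Simplified int8 conversion: assumes values already lie in [-1, 1]."""
    return [max(-128, min(127, round(x * 127))) for x in vec]


def hamming(bits_a, bits_b):
    """Number of positions where two bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


def two_stage_search(query, binary_index, int8_index, prefilter_k, top_k):
    """Pre-filter by Hamming distance on binary codes, then re-rank by dot product on int8 codes."""
    query_bits = binarize(query)
    # Stage 1: cheap candidate selection with binary codes
    by_distance = sorted(range(len(binary_index)),
                         key=lambda i: (hamming(query_bits, binary_index[i]), i))
    candidates = by_distance[:prefilter_k]
    # Stage 2: re-rank the candidates with the more precise int8 vectors
    def score(i):
        return sum(q * d for q, d in zip(query, int8_index[i]))
    return sorted(candidates, key=score, reverse=True)[:top_k]


docs = [
    [0.9, 0.1, -0.4, 0.2],
    [0.8, 0.3, -0.1, -0.5],
    [-0.7, -0.6, 0.2, 0.1],
    [0.1, -0.9, 0.3, 0.4],
]
binary_index = [binarize(d) for d in docs]
int8_index = [to_int8(d) for d in docs]
result = two_stage_search([1.0, 0.2, -0.3, 0.1], binary_index, int8_index, prefilter_k=2, top_k=2)
print(result)
```

The binary stage touches every document but costs very little per comparison, while the more expensive scoring runs only on the few candidates that survive.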

Source: https://medium.com/@abletobetable/two-methods-for-optimizing-embedding-models-1d9f43d8a7a6
