How to Build an Image-to-Image Search Tool Using CLIP and Pinecone

Published November 28, 2023 by alex

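In this article, we walk you through building an image-to-image search tool from scratch.

Image-to-image search

What is image-to-image search?

In a traditional image search engine, you typically look for images with a text query, and the engine returns results based on the keywords associated with those images. In image-to-image search, you start from an image as the query, and the system retrieves images that are visually similar to it.

Imagine you have a painting, say a beautiful sunset. You want to find other paintings that look similar, but you cannot describe it in words. Instead, you show your painting to the computer, which goes through all the paintings it knows and picks out the ones that look very similar, even if they have different names or descriptions.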

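What can I do with this search tool?

An image-to-image search engine opens up some exciting possibilities:

Finding specific data - Search for images that contain specific objects you want to train a model to recognize.
Error analysis - When a model misclassifies an object, search for visually similar images on which it also fails.
Model debugging - Surface other images that contain the attributes or defects causing the model to misbehave.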

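CLIP and Pinecone: a brief introduction

The figure above shows the steps for indexing an image dataset in a vector database.

Step 1: Collect an image dataset (raw, unlabeled images are fine).
Step 2: Use CLIP, an embedding model, to extract a high-dimensional vector representation of each image that captures its semantic and perceptual features.
Step 3: The images are thereby encoded into an embedding space, and their embeddings (vector representations) are indexed in a vector database such as Pinecone.

At query time, as the figure above shows, a sample image is passed through the same CLIP encoder to obtain its embedding. A vector similarity search then efficiently finds the k nearest image vectors in the database. The images with the highest cosine similarity to the query embedding are returned as the most visually similar results.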
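The query step amounts to ranking stored vectors by cosine similarity to the query vector. As a rough illustration only, here is a brute-force version in plain Python. A vector database such as Pinecone relies on dedicated index structures instead, and the names `cosine_similarity`, `top_k_matches` and the toy 3-dimensional vectors are made up for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_matches(query, index, k=3):
    """Return the k (image_id, score) pairs most similar to the query."""
    scored = [(image_id, cosine_similarity(query, vec)) for image_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "index": image id -> embedding (illustrative values, not real CLIP output)
toy_index = {
    "sunset_a": [0.9, 0.1, 0.0],
    "sunset_b": [0.8, 0.2, 0.1],
    "forest": [0.0, 0.2, 0.9],
}
matches = top_k_matches([1.0, 0.1, 0.0], toy_index, k=2)
print(matches)
```

Scanning every vector like this is linear in the size of the collection, which is fine for a hundred images but is the cost a vector database is designed to avoid.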

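Building an image-to-image search engine

Dataset: The Lord of the Rings

We use Google Search to find images for the query "the lord of the rings film scenes". Building on the code below, we create a function that retrieves 100 URLs for a given query.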

import requests, lxml, re, json  # lxml is the parser BeautifulSoup uses below
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36"
}
params = {
    "q": "the lord of the rings film scenes", # search query
    "tbm": "isch",                # image results
    "hl": "en",                   # language of the search
    "gl": "us",                   # country where search comes from
    "ijn": "0"                    # page number
}
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
def get_images():
    """
    https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    if you try to json.loads() without json.dumps() it will throw an error:
    "Expecting property name enclosed in double quotes"
    """
    google_images = []
    all_script_tags = soup.select("script")
    # # https://regex101.com/r/48UZhY/4
    matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)
    # https://regex101.com/r/VPz7f2/1
    matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)
    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ", ".join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(", ")
    thumbnails = [
        bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails
    ]
    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))
    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)
    full_res_images = [
        bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
    ]
    return full_res_images
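
The listing above hard-codes the query and requests a single results page. One possible way to parameterize it by query and target count is sketched below. `build_search_params`, `collect_image_urls` and the injected `fetch_page` callable are hypothetical names, not part of the original code; `fetch_page` stands in for a function that requests one results page with the given parameters and returns the URLs parsed from it.

```python
def build_search_params(query, page=0):
    """Google Images query parameters, mirroring the params dict above."""
    return {
        "q": query,        # search query
        "tbm": "isch",     # image results
        "hl": "en",        # language of the search
        "gl": "us",        # country where search comes from
        "ijn": str(page),  # page number
    }

def collect_image_urls(query, fetch_page, limit=100, max_pages=5):
    """Gather up to `limit` unique URLs across result pages.

    fetch_page(params) is a caller-supplied (hypothetical) function that
    returns the list of image URLs found on one results page.
    """
    urls = []
    seen = set()
    for page in range(max_pages):
        page_urls = fetch_page(build_search_params(query, page))
        if not page_urls:
            break  # no more results
        for url in page_urls:
            if url not in seen:
                seen.add(url)
                urls.append(url)
                if len(urls) == limit:
                    return urls
    return urls

# Offline demonstration with a fake fetcher instead of a real HTTP request
def fake_fetch_page(params):
    page = int(params["ijn"])
    slug = params["q"].replace(" ", "-")
    return [f"https://example.com/{slug}/{page}-{i}.jpg" for i in range(60)]

demo_urls = collect_image_urls("the lord of the rings film scenes", fake_fetch_page, limit=100)
print(len(demo_urls))
```

Injecting the page fetcher keeps the paging and de-duplication logic testable without touching the network.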

Getting embeddings with CLIP

Extract the embeddings for every image in our collection.

from io import BytesIO

import requests
from PIL import Image

# check_valid_URL and get_single_image_embedding are helper functions
# assumed to be defined elsewhere.
def get_all_image_embeddings_from_urls(dataset, processor, model, device, num_images=100):
    """Download up to num_images images and return (embeddings, working_urls)."""
    embeddings = []
    working_urls = []
    # Limit the number of images to process
    for image_url in dataset[:num_images]:
        if not check_valid_URL(image_url):
            print(f"Invalid or inaccessible image URL: {image_url}")
            continue
        try:
            # Download the image (the timeout avoids hanging on a slow host)
            response = requests.get(image_url, timeout=30)
            image = Image.open(BytesIO(response.content)).convert("RGB")
            # Get the embedding for the image
            embedding = get_single_image_embedding(image, processor, model, device)
            embeddings.append(embedding)
            working_urls.append(image_url)
        except Exception as e:
            print(f"Error processing image from {image_url}: {e}")
    return embeddings, working_urls
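
The function above depends on a CLIP `model`, a `processor`, a `device` and the helper `get_single_image_embedding`. A minimal sketch of how these might be set up with Hugging Face Transformers follows. The checkpoint `openai/clip-vit-base-patch32` is an assumption, chosen because its 512-dimensional output matches the index dimension reported later, and `load_clip` is a hypothetical name. The sketch also assumes a Transformers version in which `get_image_features` returns a tensor.

```python
def load_clip(model_id="openai/clip-vit-base-patch32"):
    """Load a CLIP model and its processor (model_id is an assumed checkpoint)."""
    # Imported lazily: torch and transformers are heavy, optional dependencies
    import torch
    from transformers import CLIPModel, CLIPProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = CLIPModel.from_pretrained(model_id).to(device)
    model.eval()
    processor = CLIPProcessor.from_pretrained(model_id)
    return model, processor, device

def get_single_image_embedding(image, processor, model, device):
    """Embed one PIL image; returns a NumPy array of shape (1, embedding_dim)."""
    import torch

    # Resize/normalize the image into the tensor layout CLIP expects
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"].to(device)
    with torch.no_grad():
        # Assumes get_image_features returns a tensor of shape (1, embedding_dim)
        features = model.get_image_features(pixel_values=pixel_values)
    return features.cpu().numpy()

# Usage (downloads the checkpoint on first run):
# model, processor, device = load_clip()
# embedding = get_single_image_embedding(some_pil_image, processor, model, device)
```

Keeping the leading batch dimension means `embedding.shape[1]` gives the vector size, which is how the dimension is read further down.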

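# Assumes list_image_urls holds the URLs returned by get_images(), and that
# processor, model and device come from loading CLIP beforehand.
LOR_embeddings, valid_urls = get_all_image_embeddings_from_urls(list_image_urls, processor, model, device, num_images=100)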
Invalid or inaccessible image URL: https://blog.frame.io/wp-content/uploads/2021/12/lotr-forced-perspective-cart-bilbo-gandalf.jpg
Invalid or inaccessible image URL: https://www.cineworld.co.uk/static/dam/jcr:9389da12-c1ea-4ef6-9861-d55723e4270e/Screenshot%202020-08-07%20at%2008.48.49.png
Invalid or inaccessible image URL: https://upload.wikimedia.org/wikipedia/en/3/30/Ringwraithpic.JPG

Of the 100 URLs, 97 contained valid images.
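
The helper `check_valid_URL` is called during embedding extraction, but its definition is not part of the listing. A minimal stand-in might look like the following, assuming a URL counts as valid when it uses http(s) and the server reports an image content type. It uses only the standard library, and the original helper may well work differently.

```python
from http.client import HTTPException
from urllib.parse import urlparse
from urllib.request import Request, urlopen

def check_valid_URL(url, timeout=10):
    """Return True if url is http(s) and the server reports an image content type.

    Hypothetical stand-in for the helper used in this article; any network
    or protocol error is treated as "not valid".
    """
    if urlparse(url).scheme not in ("http", "https"):
        return False
    try:
        # A HEAD request avoids downloading the image body
        request = Request(url, method="HEAD", headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request, timeout=timeout) as response:
            return response.headers.get_content_type().startswith("image/")
    except (OSError, ValueError, HTTPException):
        return False
```

Some servers reject HEAD requests, so a stricter variant could fall back to a GET and try to open the bytes as an image.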

Storing our embeddings in Pinecone

To store our embeddings in Pinecone, you first need to create a Pinecone account. After that, create an index named "image-to-image".

import pinecone  # pinecone-client 2.x API; later client versions replace pinecone.init with a Pinecone class

pinecone.init(
    api_key="YOUR-API-KEY",
    environment="gcp-starter"  # find next to API key in console
)
my_index_name = "image-to-image"
# Dimension to use when creating the index
vector_dim = LOR_embeddings[0].shape[1]
if my_index_name not in pinecone.list_indexes():
    print("Index not present")
# Connect to the index
my_index = pinecone.Index(index_name=my_index_name)

Create a function that prepares your data for upserting into the Pinecone index.

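def create_data_to_upsert_from_urls(dataset, embeddings, num_images):
  """Build (id, vector, metadata) tuples for upserting; num_images is accepted but not used."""
  metadata = []
  image_IDs = []
  for index in range(len(dataset)):
    # Keep the source URL alongside each vector so results can be displayed
    metadata.append({
        'ID': index,
        'image': dataset[index]
    })
    image_IDs.append(str(index))  # IDs are passed as strings
  # Convert NumPy arrays to plain lists
  image_embeddings = [arr.tolist() for arr in embeddings]
  data_to_upsert = list(zip(image_IDs, image_embeddings, metadata))
  return data_to_upsert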

Run the function above and upsert the result:

LOR_data_to_upsert = create_data_to_upsert_from_urls(valid_urls, 
                                LOR_embeddings, len(valid_urls))
my_index.upsert(vectors = LOR_data_to_upsert)
# {'upserted_count': 97}
my_index.describe_index_stats()
# {'dimension': 512,
# 'index_fullness': 0.00097,
# 'namespaces': {'': {'vector_count': 97}},
# 'total_vector_count': 97}
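
With 97 vectors, a single upsert call is enough. For larger collections it is common to send the tuples in smaller batches. The sketch below shows one way to chunk the list; `batched` and the batch size of 100 are illustrative choices rather than values from the article.

```python
def batched(items, batch_size=100):
    """Yield successive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for LOR_data_to_upsert: (id, vector, metadata) tuples
demo_data = [(str(i), [0.0] * 4, {"ID": i}) for i in range(250)]
batch_sizes = [len(batch) for batch in batched(demo_data)]
print(batch_sizes)

# With a real index, each batch would be sent separately, e.g.:
# for batch in batched(LOR_data_to_upsert):
#     my_index.upsert(vectors=batch)
```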

Testing our image-to-image search tool

import random

# For a random image
n = random.randint(0, len(valid_urls) - 1)
print(f"Sample image with index {n} in {valid_urls[n]}")

Sample image with index 47 in 
https://www.intofilm.org/intofilm-production/scaledcropped/870x489https%3A/s3-eu-west-1.amazonaws.com/images.cdn.filmclub.org/film__3930-the-lord-of-the-rings-the-fellowship-of-the-ring--hi_res-a207bd11.jpg/film__3930-the-lord-of-the-rings-the-fellowship-of-the-ring--hi_res-a207bd11.jpg

# 1. Get the image from url (get_image is a helper assumed to be defined elsewhere)
LOR_image_query = get_image(valid_urls[n])
# 2. Obtain embeddings (via CLIP) for the given image
LOR_query_embedding = get_single_image_embedding(LOR_image_query, processor, model, device).tolist()
# 3. Search on Vector DB index for similar images to "LOR_query_embedding"
LOR_results = my_index.query(LOR_query_embedding, top_k=3, include_metadata=True)
# 4. See the results (plot_top_matches_seaborn is a plotting helper assumed to be defined elsewhere)
plot_top_matches_seaborn(LOR_results)

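As the figure above shows, our image-to-image search tool found images similar to the given sample. As expected, ID 47 has the highest similarity score, because the query image is itself in the index.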
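
To inspect the same results as text rather than as a plot, something like the following could be used. It assumes the query response exposes a `matches` list whose entries carry `id`, `score` and `metadata` and support dictionary-style access. `summarize_matches` is a hypothetical name, and the sample response is made up for illustration.

```python
def summarize_matches(results):
    """Return (id, score, image_url) tuples from a query response, best first."""
    rows = [
        (match["id"], match["score"], match["metadata"]["image"])
        for match in results["matches"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Made-up response in the shape described above (not real output)
sample_results = {
    "matches": [
        {"id": "47", "score": 1.0, "metadata": {"ID": 47, "image": "https://example.com/a.jpg"}},
        {"id": "12", "score": 0.83, "metadata": {"ID": 12, "image": "https://example.com/b.jpg"}},
        {"id": "5", "score": 0.79, "metadata": {"ID": 5, "image": "https://example.com/c.jpg"}},
    ]
}
for image_id, score, url in summarize_matches(sample_results):
    print(f"{image_id}\t{score:.2f}\t{url}")
```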

Source: https://medium.com/@tenyks_blogger/how-to-build-an-image-to-image-search-tool-using-clip-pinecone-b7b70c44faac

