Model:

OFA-Sys/chinese-clip-vit-huge-patch14


Chinese-CLIP-ViT-Huge-Patch14

Introduction

This is the huge version of Chinese-CLIP, with ViT-H/14 as the image encoder and RoBERTa-wwm-large as the text encoder. Chinese-CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repository https://github.com/OFA-Sys/Chinese-CLIP (Welcome to star! 🔥🔥)
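
As a quick sanity check of the two encoders described above, the checkpoint's configuration can be inspected without downloading the weights. A minimal sketch using the ChineseCLIPConfig class from transformers (the printed fields are standard config attributes, nothing model-specific is assumed):

from transformers import ChineseCLIPConfig

# load only the configuration of the released checkpoint
config = ChineseCLIPConfig.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
print(config.vision_config.image_size, config.vision_config.patch_size)  # input resolution and patch size of ViT-H/14
print(config.vision_config.hidden_size)  # width of the ViT-H/14 image encoder
print(config.text_config.hidden_size)    # width of the RoBERTa-wwm-large text encoder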

Use with the official API

We provide a simple code snippet below showing how to use the Chinese-CLIP API to compute image and text embeddings and their similarities.

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[1.1419e-02, 1.0478e-02, 5.2018e-04, 9.7758e-01]]
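
Since the image and text features computed above are already L2-normalized, the same similarity scores can also be recovered directly from them instead of re-running the joint forward pass. A minimal follow-up sketch, assuming model.logit_scale follows the usual CLIP-style learned temperature convention in transformers:

import torch

with torch.no_grad():
    # scale cosine similarities by the learned temperature to recover the logits
    logits_from_features = model.logit_scale.exp() * image_features @ text_features.t()
    probs_from_features = logits_from_features.softmax(dim=-1)
print(probs_from_features)  # should match the probs computed above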

However, if you are not satisfied with only using the API, feel free to check our GitHub repository https://github.com/OFA-Sys/Chinese-CLIP for more details about training and inference.

Results

MUGE Text-to-Image Retrieval:

Setup      Zero-shot                      Finetune
Metric     R@1    R@5    R@10   MR        R@1    R@5    R@10   MR
Wukong     42.7   69.0   78.0   63.2      52.7   77.9   85.6   72.1
R2D2       49.5   75.7   83.2   69.5      60.1   82.9   89.4   77.5
CN-CLIP    63.0   84.1   89.2   78.8      68.9   88.7   93.1   83.6
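
For reference, R@K is the fraction of queries whose ground-truth item appears among the top-K retrieved candidates, and MR is the mean of R@1, R@5, and R@10. A minimal, self-contained sketch of how these retrieval metrics could be computed from a query-by-gallery similarity matrix (variable names and the random scores are illustrative, not the released evaluation code):

import torch

def recall_at_k(similarity: torch.Tensor, gt_index: torch.Tensor, k: int) -> float:
    # similarity: [num_queries, num_gallery]; gt_index: ground-truth gallery column per query
    topk = similarity.topk(k, dim=-1).indices                # [num_queries, k]
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)      # True if the ground truth is in the top-k
    return hits.float().mean().item()

# illustrative usage on random scores
sim = torch.randn(100, 1000)
gt = torch.arange(100)  # assume query i's ground-truth gallery item is column i
r1, r5, r10 = (recall_at_k(sim, gt, k) for k in (1, 5, 10))
mr = (r1 + r5 + r10) / 3  # "MR" = mean recall over the three cutoffs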

Flickr30K-CN Retrieval:

Task       Text-to-Image                              Image-to-Text
Setup      Zero-shot              Finetune            Zero-shot              Finetune
Metric     R@1    R@5    R@10     R@1    R@5    R@10   R@1    R@5    R@10    R@1    R@5    R@10
Wukong     51.7   78.9   86.3     77.4   94.5   97.0   76.1   94.8   97.5    92.7   99.1   99.6
R2D2       60.9   86.8   92.7     84.4   96.7   98.4   77.6   96.7   98.9    95.6   99.8   100.0
CN-CLIP    71.2   91.4   95.5     83.8   96.9   98.6   81.6   97.5   98.8    95.3   99.7   100.0

COCO-CN Retrieval:

Task       Text-to-Image                              Image-to-Text
Setup      Zero-shot              Finetune            Zero-shot              Finetune
Metric     R@1    R@5    R@10     R@1    R@5    R@10   R@1    R@5    R@10    R@1    R@5    R@10
Wukong     53.4   80.2   90.1     74.0   94.4   98.1   55.2   81.0   90.6    73.3   94.0   98.0
R2D2       56.4   85.0   93.1     79.1   96.5   98.9   63.3   89.3   95.7    79.3   97.1   98.7
CN-CLIP    69.2   89.9   96.1     81.5   96.9   99.1   63.0   86.6   92.9    83.5   97.3   99.2

Zero-shot Image Classification:

Task       CIFAR10  CIFAR100  DTD    EuroSAT  FER    FGVC   KITTI  MNIST  PC     VOC
GIT        88.5     61.1      42.9   43.4     41.4   6.7    22.1   68.9   50.0   80.2
ALIGN      94.9     76.8      66.1   52.1     50.8   25.0   41.2   74.0   55.2   83.0
CLIP       94.9     77.0      56.0   63.0     48.3   33.3   11.5   79.0   62.3   84.0
Wukong     95.4     77.1      40.9   50.3     -      -      -      -      -      -
CN-CLIP    96.0     79.7      51.2   52.0     55.1   26.2   49.9   79.4   63.5   84.9
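
The zero-shot classification results follow the usual CLIP-style protocol: embed every class name with a text prompt, embed the image, and predict the class whose text embedding is most similar to the image embedding. A minimal sketch reusing the model, processor, and image loaded in the API snippet above (the Chinese prompt template here is illustrative, not the official evaluation prompt):

import torch

# illustrative prompt template wrapped around the class names; not the official prompts
class_names = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
prompts = [f"一张{name}的照片。" for name in class_names]  # "a photo of a {name}."

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits_per_image.softmax(dim=-1).argmax(dim=-1).item()
print(class_names[pred])  # predicted class for the image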

Citation

If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}