Model:

OFA-Sys/chinese-clip-vit-large-patch14

Chinese-CLIP-ViT-Large-Patch14

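Introduction

This is the large version of Chinese CLIP, with ViT-L/14 as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP trained on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official Github repo https://github.com/OFA-Sys/Chinese-CLIP (stars are welcome!).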

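Use with the official API

We provide a simple code snippet showing how to use the Chinese-CLIP API to compute image and text embeddings and their similarities.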

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[0.0066, 0.0211, 0.0031, 0.9692]]
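
The snippet above normalizes `image_features` and `text_features` but takes the similarity scores from the model's forward pass. As a minimal sketch of how the two relate, the function below recomputes the probabilities from feature vectors with NumPy. The helper name `similarity_probs` and the toy vectors are illustrative and not part of the Chinese-CLIP API. It also assumes the logits are cosine similarities multiplied by a temperature scale, as in CLIP-style models.

```python
import numpy as np

def similarity_probs(image_features, text_features, logit_scale=1.0):
    """Softmax over scaled cosine similarities, one row per image.

    Hypothetical helper for illustration; logit_scale is assumed to be
    the already-exponentiated temperature of a CLIP-style model.
    """
    image_features = np.asarray(image_features, dtype=np.float64)
    text_features = np.asarray(text_features, dtype=np.float64)
    # L2-normalize each vector (no effect if it is already unit length)
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    # cosine similarity between every image and every text, then scale
    logits = logit_scale * image_features @ text_features.T
    # numerically stable softmax over the text dimension
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy 2-d vectors standing in for real embeddings
image = [[1.0, 0.0]]
texts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = similarity_probs(image, texts, logit_scale=10.0)
best = int(probs.argmax(axis=-1)[0])  # index of the best-matching text
```

With the real model, one would presumably pass `image_features.detach().numpy()`, `text_features.detach().numpy()`, and `model.logit_scale.exp().item()`. Under the assumption stated above, the result should be close to `probs` from the forward pass.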

However, if the API alone does not meet your needs, feel free to check our Github repo https://github.com/OFA-Sys/Chinese-CLIP for more details about training and inference.

Results

MUGE Text-to-Image Retrieval:

Setup    Zero-shot                   Finetune
Metric   R@1    R@5    R@10   MR     R@1    R@5    R@10   MR
Wukong   42.7   69.0   78.0   63.2   52.7   77.9   85.6   72.1
R2D2     49.5   75.7   83.2   69.5   60.1   82.9   89.4   77.5
CN-CLIP  63.0   84.1   89.2   78.8   68.9   88.7   93.1   83.6

Flickr30K-CN Retrieval:

Task     Text-to-Image                             Image-to-Text
Setup    Zero-shot            Finetune             Zero-shot            Finetune
Metric   R@1    R@5    R@10   R@1    R@5    R@10   R@1    R@5    R@10   R@1    R@5    R@10
Wukong   51.7   78.9   86.3   77.4   94.5   97.0   76.1   94.8   97.5   92.7   99.1   99.6
R2D2     60.9   86.8   92.7   84.4   96.7   98.4   77.6   96.7   98.9   95.6   99.8   100.0
CN-CLIP  71.2   91.4   95.5   83.8   96.9   98.6   81.6   97.5   98.8   95.3   99.7   100.0

COCO-CN Retrieval:

Task     Text-to-Image                             Image-to-Text
Setup    Zero-shot            Finetune             Zero-shot            Finetune
Metric   R@1    R@5    R@10   R@1    R@5    R@10   R@1    R@5    R@10   R@1    R@5    R@10
Wukong   53.4   80.2   90.1   74.0   94.4   98.1   55.2   81.0   90.6   73.3   94.0   98.0
R2D2     56.4   85.0   93.1   79.1   96.5   98.9   63.3   89.3   95.7   79.3   97.1   98.7
CN-CLIP  69.2   89.9   96.1   81.5   96.9   99.1   63.0   86.6   92.9   83.5   97.3   99.2
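
The retrieval tables above report R@K (recall at K). The MR column in the MUGE table is consistent with the mean of R@1, R@5 and R@10: for example, (63.0 + 84.1 + 89.2) / 3 ≈ 78.8. The following is a rough sketch of how such metrics can be computed from a query-by-candidate similarity matrix. The function names and toy data are illustrative. As a simplification, it assumes exactly one correct candidate per query, which may not match the official evaluation scripts.

```python
import numpy as np

def recall_at_k(similarity, ground_truth, k):
    """Percentage of queries whose correct candidate is ranked in the top k.

    similarity: (num_queries, num_candidates) scores, higher is better.
    ground_truth: index of the single correct candidate for each query
                  (a simplifying assumption for this sketch).
    """
    similarity = np.asarray(similarity)
    # indices of the k highest-scoring candidates for every query
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = [gt in row for gt, row in zip(ground_truth, top_k)]
    return 100.0 * float(np.mean(hits))

def mean_recall(similarity, ground_truth, ks=(1, 5, 10)):
    """Average of recall@k over the given cut-offs."""
    return sum(recall_at_k(similarity, ground_truth, k) for k in ks) / len(ks)

# Toy example: 3 text queries scored against 4 candidate images
sim = [[0.9, 0.1, 0.0, 0.2],
       [0.3, 0.2, 0.8, 0.1],
       [0.5, 0.4, 0.1, 0.3]]
gt = [0, 1, 1]
r1 = recall_at_k(sim, gt, 1)
r2 = recall_at_k(sim, gt, 2)
mr = mean_recall(sim, gt, ks=(1, 2))
```

In this toy case only the first query ranks its correct image first, and the third query finds it within the top two, so recall grows as K increases.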

Zero-shot Image Classification:

Task     CIFAR10   CIFAR100  DTD       EuroSAT   FER       FGVC      KITTI     MNIST     PC        VOC
GIT      88.5      61.1      42.9      43.4      41.4      6.7       22.1      68.9      50.0      80.2
ALIGN    94.9      76.8      66.1      52.1      50.8      25.0      41.2      74.0      55.2      83.0
CLIP     94.9      77.0      56.0      63.0      48.3      33.3      11.5      79.0      62.3      84.0
Wukong   95.4      77.1      40.9      50.3      -         -         -         -         -         -
CN-CLIP  96.0      79.7      51.2      52.0      55.1      26.2      49.9      79.4      63.5      84.9

Citation

If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}