Model:

OFA-Sys/chinese-clip-vit-large-patch14-336px


Chinese-CLIP-ViT-Large-Patch14-336px

Introduction

This is the large version of Chinese CLIP, with ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of roughly 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repository https://github.com/OFA-Sys/Chinese-CLIP (stars are welcome!)
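
For a quick sanity check of the encoder setup described above, here is a minimal illustrative sketch (not part of the official documentation; it assumes a transformers release with Chinese-CLIP support). It downloads only the configuration and prints the vision-encoder geometry and text-encoder width:

from transformers import ChineseCLIPConfig

# Minimal sketch: load only the configuration of the checkpoint and inspect
# the ViT-L/14@336px vision tower and the RoBERTa-wwm-base text tower.
config = ChineseCLIPConfig.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
print(config.vision_config.image_size, config.vision_config.patch_size)  # expect 336 and 14
print(config.text_config.hidden_size)  # width of the RoBERTa-wwm-base text encoder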

Using the Official API

We provide a simple code snippet below showing how to use the Chinese-CLIP API to compute image and text embeddings and their similarities.

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[0.0219, 0.0316, 0.0043, 0.9423]]
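
As a small follow-up (not part of the official snippet), the normalized features computed above can also be used directly: the dot product of unit-norm vectors is the cosine similarity, which ranks the candidate texts the same way as logits_per_image (the logits are the cosine similarities scaled by the model's learned temperature).

# Follow-up sketch: rank the candidate texts with the normalized features from above.
cosine_sim = image_features @ text_features.T  # shape [1, 4]
best_label = texts[cosine_sim.argmax(dim=-1).item()]
print(best_label)  # expected: "皮卡丘" (Pikachu), matching the probabilities above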

However, if you are not satisfied with only using the API, feel free to check our GitHub repository https://github.com/OFA-Sys/Chinese-CLIP for more details about training and inference.

Results

MUGE Text-to-Image Retrieval:

Setup     Zero-shot                    Finetune
Metric    R@1    R@5    R@10   MR      R@1    R@5    R@10   MR
Wukong    42.7   69.0   78.0   63.2    52.7   77.9   85.6   72.1
R2D2      49.5   75.7   83.2   69.5    60.1   82.9   89.4   77.5
CN-CLIP   63.0   84.1   89.2   78.8    68.9   88.7   93.1   83.6

Flickr30K-CN Retrieval:

Task      Text-to-Image                             Image-to-Text
Setup     Zero-shot           Finetune              Zero-shot           Finetune
Metric    R@1   R@5   R@10    R@1   R@5   R@10      R@1   R@5   R@10    R@1   R@5   R@10
Wukong    51.7  78.9  86.3    77.4  94.5  97.0      76.1  94.8  97.5    92.7  99.1  99.6
R2D2      60.9  86.8  92.7    84.4  96.7  98.4      77.6  96.7  98.9    95.6  99.8  100.0
CN-CLIP   71.2  91.4  95.5    83.8  96.9  98.6      81.6  97.5  98.8    95.3  99.7  100.0

COCO-CN Retrieval:

Task      Text-to-Image                             Image-to-Text
Setup     Zero-shot           Finetune              Zero-shot           Finetune
Metric    R@1   R@5   R@10    R@1   R@5   R@10      R@1   R@5   R@10    R@1   R@5   R@10
Wukong    53.4  80.2  90.1    74.0  94.4  98.1      55.2  81.0  90.6    73.3  94.0  98.0
R2D2      56.4  85.0  93.1    79.1  96.5  98.9      63.3  89.3  95.7    79.3  97.1  98.7
CN-CLIP   69.2  89.9  96.1    81.5  96.9  99.1      63.0  86.6  92.9    83.5  97.3  99.2

Zero-shot Image Classification:

Task      CIFAR10  CIFAR100  DTD    EuroSAT  FER    FGVC   KITTI  MNIST  PC     VOC
GIT       88.5     61.1      42.9   43.4     41.4   6.7    22.1   68.9   50.0   80.2
ALIGN     94.9     76.8      66.1   52.1     50.8   25.0   41.2   74.0   55.2   83.0
CLIP      94.9     77.0      56.0   63.0     48.3   33.3   11.5   79.0   62.3   84.0
Wukong    95.4     77.1      40.9   50.3     -      -      -      -      -      -
CN-CLIP   96.0     79.7      51.2   52.0     55.1   26.2   49.9   79.4   63.5   84.9

Citation

If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}