
ZH-CLIP: A Chinese CLIP Model

Models

You can download the ZH-CLIP model from 🤗 thu-ml/zh-clip-vit-roberta-large-patch14. As the name suggests, the model pairs a CLIP ViT-L/14 vision encoder with a Chinese RoBERTa-large text encoder.

Results

COCO-CN retrieval (official test set):

| Model | Text-to-Image R@1 | R@5 | R@10 | Mean | Image-to-Text R@1 | R@5 | R@10 | Mean |
|---|---|---|---|---|---|---|---|---|
| Clip-Chinese | 22.60 | 50.04 | 65.24 | 45.96 | 22.8 | 49.8 | 64.1 | 45.57 |
| mclip | 56.51 | 83.57 | 90.79 | 76.95 | 59.9 | 87.3 | 94.1 | 80.43 |
| Taiyi-CLIP | 52.52 | 81.10 | 89.93 | 74.52 | 45.80 | 75.80 | 88.10 | 69.90 |
| CN-CLIP | 64.10 | 88.79 | 94.40 | 82.43 | 61.00 | 84.40 | 93.10 | 79.5 |
| AltCLIP-XLMR-L | 62.87 | 87.18 | 94.01 | 81.35 | 63.3 | 88.3 | 95.3 | 82.3 |
| ZH-CLIP | 68.00 | 89.46 | 95.44 | 84.30 | 68.50 | 90.10 | 96.50 | 85.03 |

Flickr30K-CN retrieval (official test set):

| Model | Text-to-Image R@1 | R@5 | R@10 | Mean | Image-to-Text R@1 | R@5 | R@10 | Mean |
|---|---|---|---|---|---|---|---|---|
| Clip-Chinese | 17.76 | 40.34 | 51.88 | 36.66 | 30.4 | 55.30 | 67.10 | 50.93 |
| mclip | 62.3 | 86.42 | 92.58 | 80.43 | 84.4 | 97.3 | 98.9 | 93.53 |
| Taiyi-CLIP | 53.5 | 80.5 | 87.24 | 73.75 | 65.4 | 90.6 | 95.7 | 83.9 |
| CN-CLIP | 67.98 | 89.54 | 94.46 | 83.99 | 81.2 | 96.6 | 98.2 | 92.0 |
| AltCLIP-XLMR-L | 69.16 | 89.94 | 94.5 | 84.53 | 85.1 | 97.7 | 99.2 | 94.0 |
| ZH-CLIP | 69.64 | 90.14 | 94.3 | 84.69 | 86.6 | 97.6 | 98.8 | 94.33 |

MUGE text-to-image retrieval (official validation set):

| Model | Text-to-Image R@1 | R@5 | R@10 | Mean |
|---|---|---|---|---|
| Clip-Chinese | 15.06 | 34.96 | 46.21 | 32.08 |
| mclip | 22.34 | 41.15 | 50.26 | 37.92 |
| Taiyi-CLIP | 42.09 | 67.75 | 77.21 | 62.35 |
| CN-CLIP | 56.25 | 79.87 | 86.50 | 74.21 |
| AltCLIP-XLMR-L | 29.69 | 49.92 | 58.87 | 46.16 |
| ZH-CLIP | 56.75 | 79.75 | 86.66 | 74.38 |
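
In the retrieval tables above, R@K is the percentage of queries whose ground-truth match appears among the top-K retrieved candidates, and Mean is the average of R@1, R@5 and R@10. Below is a minimal sketch of that computation from a query-by-candidate similarity matrix, assuming a single ground-truth candidate per query (variable names are illustrative, not taken from the evaluation code):

import torch

def recall_at_k(similarity: torch.Tensor, gt_index: torch.Tensor, k: int) -> float:
    # similarity: [num_queries, num_candidates], e.g. text-to-image scores
    # gt_index:   [num_queries], index of the correct candidate for each query
    topk = similarity.topk(k, dim=-1).indices            # [num_queries, k]
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)  # ground truth in top-k?
    return hits.float().mean().item() * 100              # percentage, as in the tables

# Mean is then simply the average of the three recall values:
# scores = [recall_at_k(sim, gt, k) for k in (1, 5, 10)]
# mean_recall = sum(scores) / len(scores)
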
Zero-shot image classification (top-1 accuracy, ACC1):

| Model | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC | ImageNet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Clip-Chinese | 86.85 | 44.21 | 18.40 | 34.86 | 14.21 | 3.87 | 32.63 | 14.37 | 52.49 | 67.73 | 22.22 |
| mclip | 92.88 | 65.54 | 29.57 | 46.76 | 41.18 | 7.20 | 23.21 | 52.80 | 51.64 | 77.56 | 42.99 |
| Taiyi-CLIP | 95.62 | 73.30 | 40.69 | 61.62 | 36.22 | 13.98 | 41.21 | 73.91 | 50.02 | 75.28 | 49.82 |
| CN-CLIP | 94.75 | 75.04 | 44.73 | 52.34 | 48.57 | 20.55 | 20.11 | 61.99 | 62.59 | 79.12 | 53.40 |
| AltCLIP-XLMR-L | 95.49 | 77.29 | 42.07 | 56.96 | 51.52 | 26.85 | 24.89 | 65.68 | 50.02 | 77.99 | 59.21 |
| ZH-CLIP | 97.08 | 80.73 | 47.66 | 51.58 | 48.48 | 20.73 | 20.11 | 61.94 | 62.31 | 78.07 | 56.87 |

Getting Started

Requirements

  • python >= 3.9
  • pip install -r requirements.txt

Inference

You can clone the code from https://github.com/thu-ml/zh-clip:

from PIL import Image
import requests
from models.zhclip import ZhCLIPProcessor, ZhCLIPModel  # Code in https://github.com/thu-ml/zh-clip

version = 'thu-ml/zh-clip-vit-roberta-large-patch14'
model = ZhCLIPModel.from_pretrained(version)
processor = ZhCLIPProcessor.from_pretrained(version)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["一只猫", "一只狗"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
image_features = outputs.image_features
text_features = outputs.text_features
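# per-image probability over the candidate texts (dot-product similarity + softmax)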
text_probs = (image_features @ text_features.T).softmax(dim=-1)
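
The zero-shot classification numbers above follow the same pattern: encode one prompt per class and pick the class whose text embedding is most similar to the image. Here is a rough sketch that builds on the example above (the Chinese class names and prompt template are illustrative assumptions; the exact prompt templates behind the reported ACC1 numbers are not specified in this card):

# Hypothetical class names and prompt template; real evaluations typically
# average several templates per class.
class_names = ["猫", "狗", "汽车"]
prompts = [f"一张{name}的照片" for name in class_names]  # "a photo of a <class>"

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits = outputs.image_features @ outputs.text_features.T   # [1, num_classes]
pred = logits.softmax(dim=-1).argmax(dim=-1).item()
print("predicted class:", class_names[pred])

ACC1 in the results table is then the fraction of test images whose top-1 prediction matches the label.
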

Other Chinese CLIP Models

In addition, to make it easier to compare the effectiveness of different approaches, the inference methods of other Chinese CLIP models have been integrated, and this inference code has also been released for convenience; if any of it infringes on your rights, please contact us. The code currently only covers models at the same scale as clip-vit-large-patch14, but it may be extended to more model variants in the future.

| # | model alias |
|---|---|
| 0 | zhclip |
| 1 | altclip |
| 2 | cnclip |
| 3 | taiyiclip |
| 4 | mclip |
| 5 | clip-chinese |
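
The actual entry point for comparing these models lives in the repository's inference.py; the sketch below is only a hypothetical illustration of dispatching on the aliases above, not that script's real interface. Only the zhclip checkpoint documented in this card is filled in; the remaining aliases would map to the corresponding checkpoints of the other projects:

# Hypothetical alias dispatch -- NOT the interface of inference.py.
from models.zhclip import ZhCLIPModel, ZhCLIPProcessor

ALIAS_TO_CHECKPOINT = {
    "zhclip": "thu-ml/zh-clip-vit-roberta-large-patch14",
    # "altclip", "cnclip", "taiyiclip", "mclip", "clip-chinese": fill in the
    # corresponding checkpoints (and loading code) for the other models.
}

def load_by_alias(alias: str):
    if alias not in ALIAS_TO_CHECKPOINT:
        raise NotImplementedError(f"no checkpoint configured for alias '{alias}'")
    name = ALIAS_TO_CHECKPOINT[alias]
    return ZhCLIPModel.from_pretrained(name), ZhCLIPProcessor.from_pretrained(name)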

Usage in inference.py