
ZH-CLIP: A Chinese CLIP Model

Models

You can download the ZH-CLIP model from 🤗 thu-ml/zh-clip-vit-roberta-large-patch14. As the name suggests, the model pairs a CLIP ViT-L/14 vision encoder with a Chinese RoBERTa-large text encoder.

Results

COCO-CN retrieval (official test set):

| Model | Text-to-Image R@1 | R@5 | R@10 | Mean | Image-to-Text R@1 | R@5 | R@10 | Mean |
|---|---|---|---|---|---|---|---|---|
| Clip-Chinese | 22.60 | 50.04 | 65.24 | 45.96 | 22.8 | 49.8 | 64.1 | 45.57 |
| mclip | 56.51 | 83.57 | 90.79 | 76.95 | 59.9 | 87.3 | 94.1 | 80.43 |
| Taiyi-CLIP | 52.52 | 81.10 | 89.93 | 74.52 | 45.80 | 75.80 | 88.10 | 69.90 |
| CN-CLIP | 64.10 | 88.79 | 94.40 | 82.43 | 61.00 | 84.40 | 93.10 | 79.5 |
| AltCLIP-XLMR-L | 62.87 | 87.18 | 94.01 | 81.35 | 63.3 | 88.3 | 95.3 | 82.3 |
| ZH-CLIP | 68.00 | 89.46 | 95.44 | 84.30 | 68.50 | 90.10 | 96.50 | 85.03 |

Flickr30K-CN retrieval (official test set):

| Model | Text-to-Image R@1 | R@5 | R@10 | Mean | Image-to-Text R@1 | R@5 | R@10 | Mean |
|---|---|---|---|---|---|---|---|---|
| Clip-Chinese | 17.76 | 40.34 | 51.88 | 36.66 | 30.4 | 55.30 | 67.10 | 50.93 |
| mclip | 62.3 | 86.42 | 92.58 | 80.43 | 84.4 | 97.3 | 98.9 | 93.53 |
| Taiyi-CLIP | 53.5 | 80.5 | 87.24 | 73.75 | 65.4 | 90.6 | 95.7 | 83.9 |
| CN-CLIP | 67.98 | 89.54 | 94.46 | 83.99 | 81.2 | 96.6 | 98.2 | 92.0 |
| AltCLIP-XLMR-L | 69.16 | 89.94 | 94.5 | 84.53 | 85.1 | 97.7 | 99.2 | 94.0 |
| ZH-CLIP | 69.64 | 90.14 | 94.3 | 84.69 | 86.6 | 97.6 | 98.8 | 94.33 |

MUGE text-to-image retrieval (official validation set):

| Model | Text-to-Image R@1 | R@5 | R@10 | Mean |
|---|---|---|---|---|
| Clip-Chinese | 15.06 | 34.96 | 46.21 | 32.08 |
| mclip | 22.34 | 41.15 | 50.26 | 37.92 |
| Taiyi-CLIP | 42.09 | 67.75 | 77.21 | 62.35 |
| CN-CLIP | 56.25 | 79.87 | 86.50 | 74.21 |
| AltCLIP-XLMR-L | 29.69 | 49.92 | 58.87 | 46.16 |
| ZH-CLIP | 56.75 | 79.75 | 86.66 | 74.38 |
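
In the retrieval tables above, R@K is the percentage of queries whose ground-truth match appears among the top-K retrieved candidates, and Mean is the average of R@1, R@5 and R@10. Below is a minimal sketch of that computation from a query-by-candidate similarity matrix, assuming a single ground-truth candidate per query (variable names are illustrative, not taken from the evaluation code):

import torch

def recall_at_k(similarity: torch.Tensor, gt_index: torch.Tensor, k: int) -> float:
    # similarity: [num_queries, num_candidates], e.g. text-to-image scores
    # gt_index:   [num_queries], index of the correct candidate for each query
    topk = similarity.topk(k, dim=-1).indices            # [num_queries, k]
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)  # ground truth in top-k?
    return hits.float().mean().item() * 100              # percentage, as in the tables

# Mean is then simply the average of the three recall values:
# scores = [recall_at_k(sim, gt, k) for k in (1, 5, 10)]
# mean_recall = sum(scores) / len(scores)
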
Zero-shot image classification (top-1 accuracy, ACC1):

| Model | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC | ImageNet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Clip-Chinese | 86.85 | 44.21 | 18.40 | 34.86 | 14.21 | 3.87 | 32.63 | 14.37 | 52.49 | 67.73 | 22.22 |
| mclip | 92.88 | 65.54 | 29.57 | 46.76 | 41.18 | 7.20 | 23.21 | 52.80 | 51.64 | 77.56 | 42.99 |
| Taiyi-CLIP | 95.62 | 73.30 | 40.69 | 61.62 | 36.22 | 13.98 | 41.21 | 73.91 | 50.02 | 75.28 | 49.82 |
| CN-CLIP | 94.75 | 75.04 | 44.73 | 52.34 | 48.57 | 20.55 | 20.11 | 61.99 | 62.59 | 79.12 | 53.40 |
| AltCLIP-XLMR-L | 95.49 | 77.29 | 42.07 | 56.96 | 51.52 | 26.85 | 24.89 | 65.68 | 50.02 | 77.99 | 59.21 |
| ZH-CLIP | 97.08 | 80.73 | 47.66 | 51.58 | 48.48 | 20.73 | 20.11 | 61.94 | 62.31 | 78.07 | 56.87 |

Getting Started

Requirements

  • python >= 3.9
  • pip install -r requirements.txt

Inference

You can clone the code from https://github.com/thu-ml/zh-clip:

from PIL import Image
import requests
from models.zhclip import ZhCLIPProcessor, ZhCLIPModel  # Code in https://github.com/thu-ml/zh-clip

version = 'thu-ml/zh-clip-vit-roberta-large-patch14'
model = ZhCLIPModel.from_pretrained(version)
processor = ZhCLIPProcessor.from_pretrained(version)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["一只猫", "一只狗"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
image_features = outputs.image_features
text_features = outputs.text_features
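# per-image probability over the candidate texts (dot-product similarity + softmax)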
text_probs = (image_features @ text_features.T).softmax(dim=-1)
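
The zero-shot classification numbers above follow the same pattern: encode one prompt per class and pick the class whose text embedding is most similar to the image. Here is a rough sketch that builds on the example above (the Chinese class names and prompt template are illustrative assumptions; the exact prompt templates behind the reported ACC1 numbers are not specified in this card):

# Hypothetical class names and prompt template; real evaluations typically
# average several templates per class.
class_names = ["猫", "狗", "汽车"]
prompts = [f"一张{name}的照片" for name in class_names]  # "a photo of a <class>"

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits = outputs.image_features @ outputs.text_features.T   # [1, num_classes]
pred = logits.softmax(dim=-1).argmax(dim=-1).item()
print("predicted class:", class_names[pred])

ACC1 in the results table is then the fraction of test images whose top-1 prediction matches the label.
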

Other Chinese CLIP Models

In addition, to make it easier to compare the effectiveness of different approaches, the inference methods of other Chinese CLIP models have been integrated, and this inference code has also been released for convenience; if any of it infringes on your rights, please contact us. The code currently only covers models at the same scale as clip-vit-large-patch14, but it may be extended to more model variants in the future.

| # | model alias |
|---|---|
| 0 | zhclip |
| 1 | altclip |
| 2 | cnclip |
| 3 | taiyiclip |
| 4 | mclip |
| 5 | clip-chinese |
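
The actual entry point for comparing these models lives in the repository's inference.py; the sketch below is only a hypothetical illustration of dispatching on the aliases above, not that script's real interface. Only the zhclip checkpoint documented in this card is filled in; the remaining aliases would map to the corresponding checkpoints of the other projects:

# Hypothetical alias dispatch -- NOT the interface of inference.py.
from models.zhclip import ZhCLIPModel, ZhCLIPProcessor

ALIAS_TO_CHECKPOINT = {
    "zhclip": "thu-ml/zh-clip-vit-roberta-large-patch14",
    # "altclip", "cnclip", "taiyiclip", "mclip", "clip-chinese": fill in the
    # corresponding checkpoints (and loading code) for the other models.
}

def load_by_alias(alias: str):
    if alias not in ALIAS_TO_CHECKPOINT:
        raise NotImplementedError(f"no checkpoint configured for alias '{alias}'")
    name = ALIAS_TO_CHECKPOINT[alias]
    return ZhCLIPModel.from_pretrained(name), ZhCLIPProcessor.from_pretrained(name)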

Usage in inference.py