Model:
OFA-Sys/chinese-clip-vit-huge-patch14
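This is the huge version of Chinese CLIP, with ViT-H/14 as the image encoder and RoBERTa-wwm-large as the text encoder. Chinese CLIP is a simple implementation of CLIP trained on a large-scale dataset of around 200 million Chinese image-text pairs. For details, see our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repository https://github.com/OFA-Sys/Chinese-CLIP (stars are welcome!).
The code snippet below shows how to use the Chinese CLIP API to compute image and text embeddings and their similarity.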
```python
from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[1.1419e-02, 1.0478e-02, 5.2018e-04, 9.7758e-01]]
```
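In CLIP-style models, the image-text similarity score is a scaled cosine similarity between the normalized image and text features. The dependency-free sketch below spells out that arithmetic on plain Python lists. The helper names `l2_normalize` and `similarity_probs` are illustrative and not part of the Chinese CLIP API, and `logit_scale=100.0` is a placeholder assumption rather than the model's learned temperature.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_probs(image_vec, text_vecs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts.

    logit_scale is a placeholder assumption; a trained model uses its own
    learned value.
    """
    img = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        txt = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Toy 2-D features: the first text points the same way as the image.
probs = similarity_probs([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
print(probs)
```

With the real model, the same computation would run on the normalized `image_features` and `text_features` tensors, with the model's learned scale in place of the placeholder.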
If the API alone is not enough for your needs, see our GitHub repository https://github.com/OFA-Sys/Chinese-CLIP for more details on training and inference.
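MUGE Text-to-Image Retrieval (MR = mean recall, the average of R@1, R@5, and R@10):
Method | Zero-shot R@1 | Zero-shot R@5 | Zero-shot R@10 | Zero-shot MR | Finetune R@1 | Finetune R@5 | Finetune R@10 | Finetune MR |
---|---|---|---|---|---|---|---|---|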
Wukong | 42.7 | 69.0 | 78.0 | 63.2 | 52.7 | 77.9 | 85.6 | 72.1 |
R2D2 | 49.5 | 75.7 | 83.2 | 69.5 | 60.1 | 82.9 | 89.4 | 77.5 |
CN-CLIP | 63.0 | 84.1 | 89.2 | 78.8 | 68.9 | 88.7 | 93.1 | 83.6 |
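The retrieval metrics in the table above can be sketched as follows. This is a minimal illustration that assumes the common definition of R@K (the percentage of queries with at least one relevant item among the top K results) and treats MR as the mean of R@1, R@5, and R@10, which matches the table rows (for example, (42.7 + 69.0 + 78.0) / 3 ≈ 63.2). The function names are hypothetical, and the exact evaluation protocol for each benchmark is defined in the technical report and the GitHub repository, not here.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids: one ranked list of candidate ids per query (best first).
    relevant_ids: one set of ground-truth ids per query.
    """
    hits = sum(
        1
        for ranking, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranking[:k])
    )
    return 100.0 * hits / len(ranked_ids)


def mean_recall(ranked_ids, relevant_ids, ks=(1, 5, 10)):
    """Average of recall@k over the given cutoffs."""
    return sum(recall_at_k(ranked_ids, relevant_ids, k) for k in ks) / len(ks)


# Toy example: two text queries, three candidate images each.
rankings = [[3, 1, 2], [2, 3, 1]]
ground_truth = [{1}, {1}]
for k in (1, 2, 3):
    print(k, recall_at_k(rankings, ground_truth, k))
```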
Flickr30K-CN Retrieval (T2I = text-to-image, I2T = image-to-text, ZS = zero-shot, FT = finetune):
Method | T2I ZS R@1 | T2I ZS R@5 | T2I ZS R@10 | T2I FT R@1 | T2I FT R@5 | T2I FT R@10 | I2T ZS R@1 | I2T ZS R@5 | I2T ZS R@10 | I2T FT R@1 | I2T FT R@5 | I2T FT R@10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Wukong | 51.7 | 78.9 | 86.3 | 77.4 | 94.5 | 97.0 | 76.1 | 94.8 | 97.5 | 92.7 | 99.1 | 99.6 |
R2D2 | 60.9 | 86.8 | 92.7 | 84.4 | 96.7 | 98.4 | 77.6 | 96.7 | 98.9 | 95.6 | 99.8 | 100.0 |
CN-CLIP | 71.2 | 91.4 | 95.5 | 83.8 | 96.9 | 98.6 | 81.6 | 97.5 | 98.8 | 95.3 | 99.7 | 100.0 |
COCO-CN Retrieval (T2I = text-to-image, I2T = image-to-text, ZS = zero-shot, FT = finetune):
Method | T2I ZS R@1 | T2I ZS R@5 | T2I ZS R@10 | T2I FT R@1 | T2I FT R@5 | T2I FT R@10 | I2T ZS R@1 | I2T ZS R@5 | I2T ZS R@10 | I2T FT R@1 | I2T FT R@5 | I2T FT R@10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Wukong | 53.4 | 80.2 | 90.1 | 74.0 | 94.4 | 98.1 | 55.2 | 81.0 | 90.6 | 73.3 | 94.0 | 98.0 |
R2D2 | 56.4 | 85.0 | 93.1 | 79.1 | 96.5 | 98.9 | 63.3 | 89.3 | 95.7 | 79.3 | 97.1 | 98.7 |
CN-CLIP | 69.2 | 89.9 | 96.1 | 81.5 | 96.9 | 99.1 | 63.0 | 86.6 | 92.9 | 83.5 | 97.3 | 99.2 |
Zero-shot Image Classification:
Task | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC |
---|---|---|---|---|---|---|---|---|---|---|
GIT | 88.5 | 61.1 | 42.9 | 43.4 | 41.4 | 6.7 | 22.1 | 68.9 | 50.0 | 80.2 |
ALIGN | 94.9 | 76.8 | 66.1 | 52.1 | 50.8 | 25.0 | 41.2 | 74.0 | 55.2 | 83.0 |
CLIP | 94.9 | 77.0 | 56.0 | 63.0 | 48.3 | 33.3 | 11.5 | 79.0 | 62.3 | 84.0 |
Wukong | 95.4 | 77.1 | 40.9 | 50.3 | - | - | - | - | - | - |
CN-CLIP | 96.0 | 79.7 | 51.2 | 52.0 | 55.1 | 26.2 | 49.9 | 79.4 | 63.5 | 84.9 |
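Zero-shot classification with a CLIP-style model usually works by writing one text prompt per class, scoring the image against every prompt, and picking the class with the highest score. The sketch below illustrates only that selection logic. The prompt template, the helper names, and the reading of the table values as top-1 accuracy in percent are all assumptions made for illustration; the actual templates and evaluation protocol are in the GitHub repository.

```python
def build_prompts(class_names, template="一张{}的照片"):
    """Wrap each class name in a prompt.

    The template (roughly "a photo of {}") is a hypothetical example,
    not necessarily one used by the authors.
    """
    return [template.format(name) for name in class_names]


def predict_class(scores, class_names):
    """Return the class whose prompt received the highest similarity score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return class_names[best]


def top1_accuracy(predicted, expected):
    """Percentage of predictions that match the expected labels."""
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return 100.0 * correct / len(expected)


# Probabilities reported in the usage snippet for the Pokemon image.
class_names = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
probs = [1.1419e-02, 1.0478e-02, 5.2018e-04, 9.7758e-01]
print(build_prompts(class_names))
print(predict_class(probs, class_names))
```

In practice the scores would come from the model, computed between one image and the prompts returned by `build_prompts`.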
If you find Chinese CLIP useful, please cite our paper. Thank you for your support!
```
@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}
```