Model:

laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup

Task:

Zero-shot image classification

Libraries:

TensorBoard OpenCLIP

Other:

clip

Preprints:

arxiv:2201.03545 arxiv:2210.08402 arxiv:1910.04867

License:

mit

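Model card for CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup

Model Details

Model Description

A series of CLIP models trained on the LAION-2B (English) subset. They use the ConvNeXt-Large model (convnext_large) as the image tower, with an MLP (fc-gelu-drop-fc) head in the vision tower instead of the single projection used by other CLIP models. The text tower has the same width as the ViT-L / RN50x16 models but is 4 layers deeper (depth 16, embedding dimension 768).

This 320x320 resolution model is a soup (weight average) of 3 fine-tunes at a higher resolution. Each fine-tune started from the final checkpoint of the original 256x256 training run and saw roughly 2-3 billion additional samples at a lower learning rate. The fine-tunes differ in learning rate (1e-4, 6e-5, 5e-5) and in number of samples (3.2 billion, 2 billion, 2.5 billion).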
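
The weight averaging described above can be sketched as follows. This is a minimal illustration, not the script used to build this checkpoint: it assumes a uniform (unweighted) average, represents each checkpoint as a plain dict of flattened float lists rather than real tensors, and the function name `average_checkpoints` and the toy parameter values are hypothetical.

```python
from typing import Dict, List, Sequence

# Illustrative stand-in for a real state dict: parameter name -> flattened weights.
StateDict = Dict[str, List[float]]


def average_checkpoints(state_dicts: Sequence[StateDict]) -> StateDict:
    """Uniformly average parameters that share the same name and length."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    reference = state_dicts[0]
    for sd in state_dicts[1:]:
        if sd.keys() != reference.keys():
            raise ValueError("checkpoints must have identical parameter names")
    souped: StateDict = {}
    for name, ref_values in reference.items():
        columns = [sd[name] for sd in state_dicts]
        if any(len(col) != len(ref_values) for col in columns):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise mean across all checkpoints.
        souped[name] = [sum(vals) / len(vals) for vals in zip(*columns)]
    return souped


# Toy stand-ins for the three fine-tunes (lr 1e-4, 6e-5, 5e-5); the values are made up.
ft_a = {"head.weight": [0.9, 0.3], "head.bias": [0.0]}
ft_b = {"head.weight": [0.6, 0.6], "head.bias": [0.3]}
ft_c = {"head.weight": [0.3, 0.9], "head.bias": [0.6]}
soup = average_checkpoints([ft_a, ft_b, ft_c])
print(soup)
```

Because the averaged model has the same shape as each fine-tune, inference cost is unchanged, which is the point made in the model soups paper cited at the end of this card.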

At 320x320, the ConvNext-Large-D is significantly more efficient than the L/14 model at 336x336 that OpenAI fine-tuned. The L/14-336 model needs 2.5x more GMACs, 2.8x more activations, and 1.22x more parameters.

Model	Dataset	Resolution	AugReg	Top-1 ImageNet Zero-Shot (%)
12312321	LAION-2B	256x256	RRC (0.33, 1.0), RE (0.35), SD (0.1), D(0.1)	75.9
12313321	LAION-2B	320x320	RRC (0.5, 1.0), RE (0.4), SD (0.1), D(0.0)	76.6
12314321	LAION-2B	320x320	RRC (0.5, 1.0), RE (0.4), SD (0.1), D(0.0)	76.9

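RRC = Random Resize Crop (crop percentages); RE = Random Erasing (probability); SD = Stochastic Depth (probability), image tower only; D = Dropout (probability), image tower head only.

LAION-A = LAION Aesthetic, a subset of LAION-2B with roughly 900 million samples, deduplicated with pHash and filtered by aesthetic score.

Model training was done by Ross Wightman on the stability.ai cluster.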

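Uses

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that it will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such a model.

The OpenAI CLIP paper includes a discussion of potential downstream impacts as an example of this kind of analysis. The LAION-5B blog ( https://laion.ai/blog/laion-5b/ ) and the upcoming paper add further discussion that relates specifically to the training dataset.

Direct Use

Zero-shot image classification, image and text retrieval, among others.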
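
To make the zero-shot classification step concrete, here is a minimal sketch of the scoring logic in plain Python: embeddings are L2-normalized, cosine similarities between the image and each text prompt are scaled, and a softmax turns them into class probabilities. The toy 3-dimensional embeddings, the prompt strings, and the `logit_scale` value of 100.0 are assumptions for illustration only; in practice the embeddings come from the model's image and text towers, and the trained model carries its own learned scale.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_probs(
    image_emb: Sequence[float],
    text_embs: Sequence[Sequence[float]],
    logit_scale: float = 100.0,  # assumed value, not read from the checkpoint
) -> List[float]:
    """Softmax over scaled cosine similarities between one image and N prompts."""
    img = l2_normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical prompts and made-up embeddings standing in for real tower outputs.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.0]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[max(range(len(labels)), key=probs.__getitem__)]
print(best, probs)
```

The same similarity scores, ranked across a gallery instead of passed through a softmax over class prompts, are what image and text retrieval rely on.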

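Downstream Use

Image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.

Out-of-Scope Use

As stated for the OpenAI models, any deployed use of the model, whether commercial or not, is currently out of scope. Non-deployed uses such as image search in a constrained environment are also not recommended unless the model has been thoroughly tested in-domain. This is because our safety assessment demonstrated a strong need for task-specific testing, especially given how much CLIP's performance varies across class taxonomies. Untested and unconstrained deployment of the model in any use case is therefore potentially harmful at present.

Use cases that fall under surveillance and facial recognition are always out of scope, regardless of how well the model performs. Using artificial intelligence for such tasks may currently be premature, given the lack of testing norms and of checks to ensure its fair use.

Since the model has not been purposefully trained on or evaluated in any language other than English, its use should be limited to English-language use cases.

Beyond the notice above, the LAION-5B dataset used to train this model has additional considerations; see below.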

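Training Details

Training Data

This model was trained with LAION-2B, a 2 billion sample English subset of LAION-5B ( https://laion.ai/blog/laion-5b/ ).

IMPORTANT NOTE: The motivation behind the dataset's creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. We therefore recommend using the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Its uncurated nature means that collected links may lead to content that is strongly discomforting and disturbing for a human viewer. Please use the demo links with caution and at your own risk. A "safe" subset can be extracted by filtering on safety tags produced by a custom NSFW classifier that we built. While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility that harmful content remains in safe mode, so the warning still holds. We think that providing the dataset openly to broad research and other interested communities will allow transparent investigation of the benefits that come with training large-scale models, as well as of pitfalls and dangers that might otherwise go unreported or unnoticed. However, we do not recommend using the dataset to create ready-to-go industrial products, because the basic research on the general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

Training Procedure

All 320x320 fine-tunes were trained with a global batch size of 131072 for 10-16 checkpoint intervals, for a total of roughly 2-3 billion samples seen per fine-tune.

For the 320x320 models, the slurm script below, launched with srun, was used on 64 8-GPU (A100 40GB) nodes (Stability).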

/opt/slurm/sbin/srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \
    --name "convnext_large_320" \
    --pretrained "/runs/convnext_large_256/epoch_128.pt" \
    --resume 'latest' \
    --train-data="pipe:aws s3 cp s3://mybucket/path/laion{00000..xxxxx}.tar -" \
    --train-num-samples 203666042 \
    --dataset-type webdataset \
    --precision amp_bfloat16 \
    --beta2 0.98 \
    --warmup 2000 \
    --batch-size=256 \
    --epochs=12 \
    --dataset-resampled \
    --aug-cfg use_timm=True scale='(0.5, 1.0)' re_prob=0.4 \
    --clip-grad-norm 5.0 \
    --lr 5e-5 \
    --workers=6 \
    --model "convnext_large_d_320" \
    --seed 0 \
    --ddp-static-graph \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing
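
As a sanity check, the numbers in the script can be tied back to the figures quoted in the text. The sketch below is plain arithmetic under three assumptions that are not stated in this card: that `--batch-size` is the per-GPU batch size, that `--train-num-samples` is the number of samples per checkpoint interval ("epoch"), and that `--warmup` counts optimizer steps. The variable names are illustrative.

```python
# Values taken from the text and the srun command above.
NODES = 64                           # "64 8-GPU (A100 40GB) nodes"
GPUS_PER_NODE = 8
PER_GPU_BATCH = 256                  # --batch-size (assumed to be per GPU)
SAMPLES_PER_INTERVAL = 203_666_042   # --train-num-samples (assumed per interval)
EPOCHS = 12                          # --epochs
WARMUP_STEPS = 2000                  # --warmup (assumed to be optimizer steps)

world_size = NODES * GPUS_PER_NODE
global_batch = world_size * PER_GPU_BATCH
total_samples = SAMPLES_PER_INTERVAL * EPOCHS
warmup_samples = WARMUP_STEPS * global_batch

print(f"GPUs:              {world_size}")
print(f"global batch size: {global_batch}")
print(f"samples seen:      {total_samples:,} (~{total_samples / 1e9:.2f}B)")
print(f"warmup samples:    {warmup_samples:,}")
```

Under these assumptions the global batch size comes out to 131072, matching the text, and 12 intervals give about 2.44 billion samples, consistent with the roughly 2.5 billion sample fine-tune listed for the 5e-5 learning rate.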

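Evaluation

Evaluation was done with code from the LAION CLIP Benchmark suite.

Testing Data, Factors & Metrics

Testing Data

Testing used VTAB+ (a combination of VTAB with additional robustness datasets) for classification, and COCO and Flickr for retrieval.

Results

The models achieve between 75.9% and 76.9% zero-shot top-1 accuracy on ImageNet-1k.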
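
For readers unfamiliar with the metric, top-1 accuracy is the share of images whose highest-scoring class matches the ground-truth label. The following is a minimal sketch with made-up scores; it is not the CLIP Benchmark implementation, and the function name `top1_accuracy` is hypothetical.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Percentage of rows whose arg-max class index equals the target index."""
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if not targets:
        raise ValueError("need at least one example")
    correct = 0
    for row, target in zip(scores, targets):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == target)
    return 100.0 * correct / len(targets)


# Four toy examples over three classes; the last one is misclassified.
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
toy_targets = [0, 1, 2, 1]
accuracy = top1_accuracy(toy_scores, toy_targets)
print(f"top-1 accuracy: {accuracy:.1f}%")
```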

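[Figure: zero-shot curve of the original from-scratch 256x256 training run]

An initial round of benchmarks has been run on a wider range of datasets; the results can be viewed at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

Acknowledgements

Thanks to stability.ai for the compute used to train this model.

Citation

BibTeX: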

LAION-5B

@inproceedings{schuhmann2022laionb,
  title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
  author={Christoph Schuhmann and
          Romain Beaumont and
          Richard Vencu and
          Cade W Gordon and
          Ross Wightman and
          Mehdi Cherti and
          Theo Coombes and
          Aarush Katta and
          Clayton Mullis and
          Mitchell Wortsman and
          Patrick Schramowski and
          Srivatsa R Kundurthy and
          Katherine Crowson and
          Ludwig Schmidt and
          Robert Kaczmarczyk and
          Jenia Jitsev},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2022},
  url={https://openreview.net/forum?id=M3Y74vmsMcY}
}

OpenCLIP software

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

@InProceedings{pmlr-v162-wortsman22a,
  title     = {Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time},
  author    = {Wortsman, Mitchell and Ilharco, Gabriel and Gadre, Samir Ya and Roelofs, Rebecca and Gontijo-Lopes, Raphael and Morcos, Ari S and Namkoong, Hongseok and Farhadi, Ali and Carmon, Yair and Kornblith, Simon and Schmidt, Ludwig},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {23965--23998},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/wortsman22a/wortsman22a.pdf},
  url       = {https://proceedings.mlr.press/v162/wortsman22a.html}
}

OpenAI CLIP paper

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}

@Article{liu2022convnet,
  author  = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
  title   = {A ConvNet for the 2020s},
  journal = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2022},
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}

Author:

LAION eV

Dataset size:

1.33 GB