Model:

laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg

Task:

Zero-shot image classification

Library:

OpenCLIP

Other:

clip

Preprint:

arxiv:2210.08402 arxiv:1910.04867

License:

mit

Model card for CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg

Model Details

Model Description

A series of CLIP ConvNeXt-XXLarge (a custom timm ConvNeXt size) models trained on LAION-2B (English), a subset of LAION-5B, using OpenCLIP.

Model	Dataset	Resolution	AugReg	Top-1 ImageNet Zero-Shot (%)
12310321	LAION-2B	256x256	RRC (0.33, 1.0), RE (0.35), SD (0.1)	79.1
12311321	LAION-2B	256x256	RRC (0.3, 1.0), RE (0.4), SD (0.1)	79.3
12312321	LAION-2B	256x256	N/A	79.4
RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only

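The core training run was done in stages over a period of roughly 2 months. The global batch size for the core run was 81920. The last 10% of training was then re-done at a global batch size of 95744, with a higher LR and stronger augmentation than the original finish. The two results were averaged together. See Training Details for more.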
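
The averaging step can be illustrated with a minimal sketch. It assumes a uniform, parameter-wise average of two checkpoints, in the spirit of the model soups paper listed in the citations; the exact procedure used for this model is not described here. The tiny dictionaries stand in for real state dicts, and the names `average_checkpoints`, `original`, and `rewind` are hypothetical.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]


def average_checkpoints(a: Checkpoint, b: Checkpoint) -> Checkpoint:
    """Uniformly average two checkpoints with identical parameter names and sizes."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    averaged: Checkpoint = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"size mismatch for parameter {name!r}")
        averaged[name] = [(x + y) / 2.0 for x, y in zip(a[name], b[name])]
    return averaged


# Toy stand-ins for the original finish and the re-done final 10% (made-up values).
original = {"proj.weight": [0.0, 1.0, -2.0], "proj.bias": [0.5]}
rewind = {"proj.weight": [1.0, 3.0, 2.0], "proj.bias": [1.5]}

soup = average_checkpoints(original, rewind)
print(soup)
```

With real checkpoints the same loop would run over tensors rather than Python lists, but the arithmetic per parameter is unchanged.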

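Goals:

Push the size of the largest convolutional CLIP image tower into the performance range of ViT-g to ViT-G, with improved image size scaling.

Firsts:

Largest released pretrained ConvNeXt model (847M params, 198 GMACs and 125 MActs at 256x256)
A non-ViT image tower CLIP model (with no previous image tower pretraining) achieving > 79% ImageNet top-1 zero-shot accuracy

The models utilize:

The ConvNeXt-XXLarge model (convnext_xxlarge) as the image tower
A standard projection at the end of the image tower
A text tower of the same size (width 1024, 16 heads, 24 layers) as the ViT-H-14 and ViT-g-14 models

The models are trained at 256x256 image resolution. The combined image + text CLIP model has 1.2B params, 222 GMACs, and 146 MActs. At 256x256, ConvNeXt-XXLarge sits just above a ViT-H-14 CLIP configuration in FLOPs and params while being lower in activation counts. It is well under both g-14 and G-14 in cost while falling between them in capabilities.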

model	image_size	embed_dim	gmacs	macts	mparams	image_gmacs	image_macts	image_mparams	text_gmacs	text_macts	text_mparams
ViT-H-16	224	1024	150.96	122.01	986.26	127.4	100.81	632.23	23.57	21.2	354.03
ViT-H-14	224	1024	190.97	160.61	986.11	167.4	139.41	632.08	23.57	21.2	354.03
ViT-L-14-336	336	768	197.76	278.19	427.94	191.1	270.24	304.29	6.66	7.95	123.65
convnext_xxlarge	256	1024	221.66	145.66	1200.58	198.09	124.45	846.54	23.57	21.2	354.03
RN50x64	448	1024	276.8	249.73	623.26	265.02	239.13	420.38	11.78	10.6	202.88
ViT-g-14	224	1024	290.74	213.84	1366.68	267.18	192.64	1012.65	23.57	21.2	354.03
convnext_xxlarge_320	320	1024	333.08	215.66	1200.58	309.52	194.46	846.54	23.57	21.2	354.03
ViT-H-14-336	336	1024	414.53	428.74	986.52	390.97	407.54	632.49	23.57	21.2	354.03
ViT-bigG-14	224	1280	532.92	310.71	2539.57	483.96	275.37	1844.91	48.96	35.34	694.66
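
To make the comparison in the table concrete, the following sketch computes how the combined convnext_xxlarge model at 256x256 relates to three of the ViT configurations. The numbers are copied from the gmacs, macts, and mparams columns above; the helper name `ratio_to` is hypothetical.

```python
# (gmacs, macts, mparams) for the combined image + text model, copied from the table.
STATS = {
    "ViT-H-14": (190.97, 160.61, 986.11),
    "convnext_xxlarge": (221.66, 145.66, 1200.58),
    "ViT-g-14": (290.74, 213.84, 1366.68),
    "ViT-bigG-14": (532.92, 310.71, 2539.57),
}


def ratio_to(reference: str, model: str = "convnext_xxlarge") -> dict:
    """Return model/reference ratios for GMACs, MActs, and parameter count."""
    ref, cur = STATS[reference], STATS[model]
    return {
        "gmacs": cur[0] / ref[0],
        "macts": cur[1] / ref[1],
        "mparams": cur[2] / ref[2],
    }


for reference in ("ViT-H-14", "ViT-g-14", "ViT-bigG-14"):
    ratios = ratio_to(reference)
    print(reference, {key: round(value, 2) for key, value in ratios.items()})
```

A ratio above 1 means convnext_xxlarge is costlier than the reference on that axis, and a ratio below 1 means it is cheaper.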

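Model training was done by Ross Wightman on both the stability.ai cluster and the JUWELS Booster supercomputer. See the acknowledgements below.

Uses

As stated by the original project, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models.

The OpenAI CLIP paper includes a discussion of potential downstream impacts as an example of this sort of analysis. Additionally, the LAION-5B blog ( https://laion.ai/blog/laion-5b/ ) and the upcoming paper include further discussion of the training dataset.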

Direct Use

Zero-shot image classification, image and text retrieval, among others.
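
Zero-shot classification with a CLIP-style model amounts to comparing an image embedding against the text embeddings of candidate labels. The sketch below shows only that scoring step, using made-up 3-dimensional vectors in place of real model outputs. The function names, the toy embeddings, the prompts, and the logit scale of 100 are illustrative assumptions, not values taken from this model.

```python
import math
from typing import Dict, List


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_probs(
    image_emb: List[float],
    text_embs: Dict[str, List[float]],
    logit_scale: float = 100.0,  # assumed temperature, for illustration only
) -> Dict[str, float]:
    """Softmax over scaled cosine similarities between one image and each label."""
    image = normalize(image_emb)
    logits = {
        label: logit_scale * sum(i * t for i, t in zip(image, normalize(emb)))
        for label, emb in text_embs.items()
    }
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - max_logit) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical embeddings; a real pipeline would obtain these from the two towers.
text_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
probs = zero_shot_probs([0.9, 0.1, 0.0], text_embeddings)
print(max(probs, key=probs.get))
```

Image and text retrieval use the same similarity scores, ranked across many images or captions instead of passed through a softmax over labels.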

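Downstream Use

Image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.

Out-of-Scope Use

As with the OpenAI models:

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless the model has been thoroughly tested in the specific domain. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance across different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful.

Use cases that fall under surveillance and facial recognition are always out of scope, regardless of the model's performance. This is because using artificial intelligence for such tasks may currently be premature, given the lack of testing norms and checks to ensure its fair use.

Since the model has not been purposefully trained or evaluated in any language other than English, its use should be limited to English-language use cases.

In addition to the above, the LAION-5B dataset used to train these models has further considerations; see below.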

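Training Details

Training Data

This model was trained with LAION-2B, a 2 billion sample English subset of LAION-5B ( https://laion.ai/blog/laion-5b/ ).

IMPORTANT NOTE: The motivation behind the dataset's creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that the uncurated nature of the dataset means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset using a custom-trained NSFW classifier. While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in "safe" mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as of the pitfalls and dangers of working with closed large datasets that remain restricted to a small community. However, we do not recommend using the dataset to create ready-to-go industrial products, as the basic research about the general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.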

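Training Procedure

The main training run was done at a global batch size of 81920 for 256 checkpoint intervals of 135.6M samples each, for a total of ~34B samples seen.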
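
The sample counts can be checked with a few lines of arithmetic. The inputs come from this section and from the `--train-num-samples` and `--epochs` flags in the training command further down; the count of full global batches per interval is plain integer division and may differ from how the training code rounds in practice.

```python
GLOBAL_BATCH_SIZE = 81920            # stated global batch size of the main run
SAMPLES_PER_INTERVAL = 135_646_078   # --train-num-samples in the training command
NUM_INTERVALS = 256                  # --epochs in the training command

total_samples = SAMPLES_PER_INTERVAL * NUM_INTERVALS
full_batches_per_interval = SAMPLES_PER_INTERVAL // GLOBAL_BATCH_SIZE

print(f"samples per interval: {SAMPLES_PER_INTERVAL / 1e6:.1f}M")
print(f"total samples seen:   {total_samples / 1e9:.1f}B")
print(f"full global batches per interval: {full_batches_per_interval}")
```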

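Many difficulties with model numerical stability, and with cluster stability and performance, were encountered while training this model. Initial attempts to train with float16 AMP and the default adam beta2 resulted in loss spikes and eventually NaN blow-ups. Dropping beta2 to 0.97 helped, but the loss / zero-shot curves were not tracking as expected. After switching to a newer PyTorch version, it became possible to train with bfloat16 + AMP (as with the recent H/14, g/14, and G/14 models); beta2 was returned to 0.98 and metrics improved.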
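
As general background on why a float16 to bfloat16 switch can matter: float16 has a much narrower dynamic range than bfloat16, so large intermediate values overflow sooner. The sketch below illustrates only that property and is not a diagnosis of the instability described here. The helper names are hypothetical, and bfloat16 is emulated by truncating a float32 encoding, which ignores rounding.

```python
import struct


def fits_in_float16(x: float) -> bool:
    """True if x can be packed as an IEEE half-precision value without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (1.0e3, 7.0e4, 1.0e30):
    print(value, "fits float16:", fits_in_float16(value), "bfloat16:", to_bfloat16(value))
```

The trade-off is precision: bfloat16 keeps the float32 exponent range but has fewer mantissa bits than float16, so values are represented more coarsely.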

Checkpoint Interval	Cluster	# GPUs	# Nodes	GPU	local BS	sample/s	sample/s/gpu	precision	adam beta2
1 - 2	Stability	1024	128	A100 40GB	80	37-40k	36-39	amp + fp16	0.97
3 - 32	Stability	512	64	A100 80GB	160	27-32k	52-62	amp + fp16	0.97
33 - 75	Booster	1024	256	A100 40GB	80	48k	47	amp + fp16	0.97
76 - 165	Booster	1024	256	A100 40GB	80	51k	50	amp + bf16	0.98
166 - 232	Stability	320	40	A100 80GB	256	18-19k	56-59	amp + bf16	0.98
233 - 249	Booster	1024	256	A100 40GB	80	51k	50	amp + bf16	0.98
250 - 256	Stability	1024	128	A100 40GB	80	27-31k	26-30	amp + bf16	0.98

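JUWELS Booster has 4x A100 GPUs per node with 4x HDR-200 IB adapters per node (200 Gbit/sec per GPU). The Stability setup used 8x A100 GPUs per node with 400 Gbit/sec EFA networking per node (50 Gbit/sec per GPU). Training efficiency (throughput per GPU) varied significantly across the configurations. The 1024 GPU configurations on both clusters were particularly prone to crashing (or very difficult to get running with a "good" set of GPUs).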
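
Although the hardware changed several times, every row of the table corresponds to the same global batch size. The sketch below checks this, along with the GPUs-per-node figures given in this paragraph. The rows are copied from the table above; `global_batch_size` is a hypothetical helper that assumes the global batch is simply the number of GPUs times the local batch size.

```python
# (checkpoint interval, cluster, num_gpus, num_nodes, local_batch_size), from the table.
CONFIGS = [
    ("1 - 2", "Stability", 1024, 128, 80),
    ("3 - 32", "Stability", 512, 64, 160),
    ("33 - 75", "Booster", 1024, 256, 80),
    ("76 - 165", "Booster", 1024, 256, 80),
    ("166 - 232", "Stability", 320, 40, 256),
    ("233 - 249", "Booster", 1024, 256, 80),
    ("250 - 256", "Stability", 1024, 128, 80),
]
GPUS_PER_NODE = {"Stability": 8, "Booster": 4}


def global_batch_size(num_gpus: int, local_batch_size: int) -> int:
    """Global batch size, assuming one local batch per GPU per step."""
    return num_gpus * local_batch_size


for interval, cluster, gpus, nodes, local_bs in CONFIGS:
    # Node counts should agree with the per-node GPU counts stated in the text.
    assert gpus == nodes * GPUS_PER_NODE[cluster]
    print(f"{interval:>9}  {cluster:<9}  global batch = {global_batch_size(gpus, local_bs)}")
```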

A slurm srun command line for the configuration with 128 8-GPU (40GB A100) nodes:

srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \
    --name "xxlarge-2b-81920-bf16" \
    --resume "latest" \
    --logs "/runs" \
    --log-every-n-steps 50 \
    --train-data="pipe:aws s3 cp s3://laion5b/laion2B-data/{000000..231349}.tar -" \
    --train-num-samples 135646078 \
    --dataset-type webdataset \
    --warmup 10000 \
    --batch-size=80 \
    --epochs=256 \
    --dataset-resampled \
    --aug-cfg use_timm=True scale='(0.33, 1.0)' re_prob=0.35 \
    --precision amp_bfloat16 \
    --grad-clip-norm 5.0 \
    --lr 1e-3 \
    --workers=6 \
    --beta2 0.98 \
    --model "convnext_xxlarge" \
    --seed 0 \
    --ddp-static-graph \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing \
    --report-to "tensorboard"

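For the rewind of the last 10%, a higher global batch size of 95744 was used, along with a higher LR and slightly increased augmentation strength.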

Checkpoint Interval	Cluster	# GPUs	# Nodes	GPU	local BS	sample/s	sample/s/gpu	precision	adam beta2
231 - 256	Stability	1088	136	A100 40GB	88	32-35k	29-32	amp + bf16	0.98

The slurm srun command line for 136 8-GPU (40GB A100) nodes:

srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \
    --name "xxlarge-2b-81920-r-bf16" \
    --resume "latest" \
    --logs "/runs" \
    --log-every-n-steps 50 \
    --train-data="pipe:aws s3 cp s3://laion5b/laion2B-data/{000000..231349}.tar -" \
    --train-num-samples 135646078 \
    --dataset-type webdataset \
    --warmup 10000 \
    --batch-size=88 \
    --epochs=256 \
    --dataset-resampled \
    --aug-cfg use_timm=True scale='(0.3, 1.0)' re_prob=0.4 \
    --precision amp_bfloat16 \
    --grad-clip-norm 5.0 \
    --lr 2e-3 \
    --workers=6 \
    --beta2 0.98 \
    --model "convnext_xxlarge" \
    --seed 0 \
    --ddp-static-graph \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing \
    --report-to "tensorboard"
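
The two command lines differ in only a few settings. The sketch below collects those settings into dictionaries and lists what changed; the values are read from the two commands and the node counts above, while the dictionary keys and the `changed` helper are hypothetical names chosen for illustration.

```python
# Settings read from the two srun commands; num_gpus is nodes x 8 GPUs per node.
main_run = {
    "num_gpus": 128 * 8,
    "batch_size": 80,          # --batch-size (per GPU)
    "lr": 1e-3,                # --lr
    "rrc_scale": (0.33, 1.0),  # --aug-cfg scale
    "re_prob": 0.35,           # --aug-cfg re_prob
}
rewind_run = {
    "num_gpus": 136 * 8,
    "batch_size": 88,
    "lr": 2e-3,
    "rrc_scale": (0.3, 1.0),
    "re_prob": 0.4,
}


def changed(a: dict, b: dict) -> dict:
    """Map each differing key to its (old, new) pair."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


for key, (old, new) in changed(main_run, rewind_run).items():
    print(f"{key}: {old} -> {new}")

print("main global batch:  ", main_run["num_gpus"] * main_run["batch_size"])
print("rewind global batch:", rewind_run["num_gpus"] * rewind_run["batch_size"])
```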

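Evaluation

Evaluation was done with code in the LAION CLIP Benchmark suite.

Testing Data, Factors & Metrics

Testing Data

Testing is performed with VTAB+ (a combination of VTAB ( https://arxiv.org/abs/1910.04867 ) with additional robustness datasets) for classification, and with COCO and Flickr for retrieval.

Results

These models achieve between 79.1% and 79.4% top-1 zero-shot accuracy on ImageNet-1k.

[Figure: zoom-in on the final 10% of training, with the rewind run]

An initial round of benchmarks has been performed on a wider range of datasets, viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb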

Acknowledgements

Acknowledging stability.ai and the Gauss Centre for Supercomputing e.V. ( http://gauss-centre.eu ) for funding this work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC).

Citation

BibTeX:

LAION-5B

@inproceedings{schuhmann2022laionb,
  title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
  author={Christoph Schuhmann and
          Romain Beaumont and
          Richard Vencu and
          Cade W Gordon and
          Ross Wightman and
          Mehdi Cherti and
          Theo Coombes and
          Aarush Katta and
          Clayton Mullis and
          Mitchell Wortsman and
          Patrick Schramowski and
          Srivatsa R Kundurthy and
          Katherine Crowson and
          Ludwig Schmidt and
          Robert Kaczmarczyk and
          Jenia Jitsev},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2022},
  url={https://openreview.net/forum?id=M3Y74vmsMcY}
}

OpenCLIP software

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

OpenAI CLIP paper

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}

@Article{liu2022convnet,
  author  = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
  title   = {A ConvNet for the 2020s},
  journal = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2022},
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}

@InProceedings{pmlr-v162-wortsman22a,
  title = 	 {Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time},
  author =       {Wortsman, Mitchell and Ilharco, Gabriel and Gadre, Samir Ya and Roelofs, Rebecca and Gontijo-Lopes, Raphael and Morcos, Ari S and Namkoong, Hongseok and Farhadi, Ali and Carmon, Yair and Kornblith, Simon and Schmidt, Ludwig},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {23965--23998},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/wortsman22a/wortsman22a.pdf},
  url = 	 {https://proceedings.mlr.press/v162/wortsman22a.html}
}

Author:

LAION eV

Dataset size:

4.48 GB