Model:

laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind

Task:

Zero-shot image classification

Library:

OpenCLIP

Other:

clip

Preprint:

arxiv:2210.08402 arxiv:1910.04867

License:

mit

Model card for CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind

Model Details

Model Description

A series of CLIP ConvNeXt-XXLarge (a custom timm ConvNeXt size) models trained on LAION-2B (English), a subset of LAION-5B, using OpenCLIP.

Model	Dataset	Resolution	AugReg	Top-1 ImageNet Zero-Shot (%)
12310321	LAION-2B	256x256	RRC (0.33, 1.0), RE (0.35), SD (0.1)	79.1
12311321	LAION-2B	256x256	RRC (0.3, 1.0), RE (0.4), SD (0.1)	79.3
12312321	LAION-2B	256x256	N/A	79.4
RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only

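The core training run was done in pieces over a period of about 2 months. The global batch size for the core run was 81920. The last ~10% of training was redone at a global batch size of 95744 with a higher learning rate and stronger augmentation than the original finish, and the two were then averaged together in a "soup". See Training Details for more.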
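The averaging step can be illustrated with a minimal sketch of a two-checkpoint weight average. It assumes each checkpoint is a flat mapping from parameter name to a list of floats. The `average_checkpoints` helper and the toy parameter names are hypothetical, and the equal weighting of the two runs is an assumption, not something this card states.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]


def average_checkpoints(a: Checkpoint, b: Checkpoint, weight_a: float = 0.5) -> Checkpoint:
    """Interpolate two checkpoints parameter by parameter.

    weight_a=0.5 gives a uniform average (assumed here, not confirmed by the card).
    """
    if a.keys() != b.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    averaged: Checkpoint = {}
    for name, values_a in a.items():
        values_b = b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        averaged[name] = [
            weight_a * x + (1.0 - weight_a) * y for x, y in zip(values_a, values_b)
        ]
    return averaged


# Toy stand-ins for the original finish and the rewind finish.
original_finish = {"proj.weight": [0.25, -1.0], "proj.bias": [0.5]}
rewind_finish = {"proj.weight": [0.75, 1.0], "proj.bias": [1.5]}

soup = average_checkpoints(original_finish, rewind_finish)
print(soup)
```

With real checkpoints the same loop would run over tensors instead of lists; the structure of the average is unchanged.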

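Goals:

Push the size of the largest convolutional CLIP image tower into the performance range of ViT-g to ViT-G, with improved image size scaling for downstream use.

Firsts:

The largest released pretrained ConvNeXt model (847M params, with 198 GMAC and 125 MActs at 256x256 for the image tower)
A non-ViT image tower CLIP model (with no previous image tower pretraining) achieving > 79% ImageNet top-1 zero-shot accuracy

The models use:

The ConvNeXt-XXLarge model (convnext_xxlarge) as the image tower
A standard projection at the end of the image tower
A text tower of the same size (width 1024, 16 heads, depth 24) as the ViT-H-14 and ViT-g-14 models

The models are trained at 256x256 image resolution. The combined image + text CLIP model has 1.2B params, with 222 GMAC and 146 MActs. At 256x256, ConvNeXt-XXLarge sits just above a ViT-H-14 CLIP configuration in FLOPS and params while being lower in activation count. It is below both g-14 and G-14 in compute cost while falling between them in capability.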

model	image_size	embed_dim	gmacs	macts	mparams	image_gmacs	image_macts	image_mparams	text_gmacs	text_macts	text_mparams
ViT-H-16	224	1024	150.96	122.01	986.26	127.4	100.81	632.23	23.57	21.2	354.03
ViT-H-14	224	1024	190.97	160.61	986.11	167.4	139.41	632.08	23.57	21.2	354.03
ViT-L-14-336	336	768	197.76	278.19	427.94	191.1	270.24	304.29	6.66	7.95	123.65
convnext_xxlarge	256	1024	221.66	145.66	1200.58	198.09	124.45	846.54	23.57	21.2	354.03
RN50x64	448	1024	276.8	249.73	623.26	265.02	239.13	420.38	11.78	10.6	202.88
ViT-g-14	224	1024	290.74	213.84	1366.68	267.18	192.64	1012.65	23.57	21.2	354.03
convnext_xxlarge_320	320	1024	333.08	215.66	1200.58	309.52	194.46	846.54	23.57	21.2	354.03
ViT-H-14-336	336	1024	414.53	428.74	986.52	390.97	407.54	632.49	23.57	21.2	354.03
ViT-bigG-14	224	1280	532.92	310.71	2539.57	483.96	275.37	1844.91	48.96	35.34	694.66
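As a quick sanity check on the profile table above, the sketch below recomputes the combined figures from the image and text tower columns and compares convnext_xxlarge against ViT-H-14. The numbers are copied from the table; the 0.02 tolerance for rounding in the published figures is an assumption.

```python
# (gmacs, macts, mparams) per tower, copied from the table above.
profiles = {
    "ViT-H-14": {"image": (167.4, 139.41, 632.08), "text": (23.57, 21.2, 354.03)},
    "convnext_xxlarge": {"image": (198.09, 124.45, 846.54), "text": (23.57, 21.2, 354.03)},
    "ViT-g-14": {"image": (267.18, 192.64, 1012.65), "text": (23.57, 21.2, 354.03)},
}

# Combined (gmacs, macts, mparams) as reported in the same table.
reported = {
    "ViT-H-14": (190.97, 160.61, 986.11),
    "convnext_xxlarge": (221.66, 145.66, 1200.58),
    "ViT-g-14": (290.74, 213.84, 1366.68),
}


def combined(name):
    """Sum the image and text tower figures for one model."""
    image, text = profiles[name]["image"], profiles[name]["text"]
    return tuple(round(i + t, 2) for i, t in zip(image, text))


for name in profiles:
    total = combined(name)
    # The published totals are rounded, so allow a small difference.
    assert all(abs(t - r) <= 0.02 for t, r in zip(total, reported[name])), name
    print(name, total)

# Ratios of convnext_xxlarge to ViT-H-14: (gmacs, macts, mparams).
ratios = tuple(
    c / h for c, h in zip(reported["convnext_xxlarge"], reported["ViT-H-14"])
)
print("convnext_xxlarge / ViT-H-14:", tuple(round(r, 2) for r in ratios))
```

The ratios come out above 1 for GMACs and parameters and below 1 for activations, which matches the comparison made in the text.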

The models were trained by Ross Wightman on both the stability.ai cluster and the JUWELS Booster supercomputer. See the acknowledgements below.

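Uses

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models.

The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog ( https://laion.ai/blog/laion-5b/ ) and the upcoming paper include further discussion as it relates specifically to the training dataset.

Direct Use

Zero-shot image classification, image and text retrieval, among others.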
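To make the zero-shot classification step concrete, here is a minimal sketch of the scoring stage only: L2-normalise the image and prompt embeddings, take scaled cosine similarities, and apply a softmax. The 3-dimensional embeddings are toy values standing in for the outputs of the image and text towers, the helper names are hypothetical, and the logit scale of 100 is an assumption (CLIP-style models learn this temperature during training).

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(
    image_emb: Sequence[float],
    text_embs: Dict[str, Sequence[float]],
    logit_scale: float = 100.0,  # assumed value, not taken from the card
) -> Dict[str, float]:
    """Softmax over scaled cosine similarities between one image and each prompt."""
    image = l2_normalize(image_emb)
    logits = {}
    for label, emb in text_embs.items():
        text = l2_normalize(emb)
        logits[label] = logit_scale * sum(i * t for i, t in zip(image, text))
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy embeddings; real ones would come from the model's encoders.
image_embedding = [0.9, 0.1, 0.0]
prompt_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

probs = zero_shot_probs(image_embedding, prompt_embeddings)
prediction = max(probs, key=probs.get)
print(prediction, round(probs[prediction], 3))
```

Retrieval uses the same similarity matrix, ranking candidates by score instead of applying a softmax over class prompts.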

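Downstream Use

Image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.

Out-of-Scope Use

As per the OpenAI models,

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless the model has been tested thoroughly in-domain with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful.

Certain use cases that fall under the domain of surveillance and facial recognition are always out of scope, regardless of the performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature at present, given the lack of testing norms and checks to ensure its fair use.

Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases.

Beyond the above notice, the LAION-5B dataset used in training these models has additional considerations; see below.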

Training Details

Training Data

This model was trained with LAION-2B, the English subset of the LAION-5B dataset ( https://laion.ai/blog/laion-5b/ ).

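IMPORTANT NOTE: The motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that the uncurated nature of the dataset means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a customized trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as the pitfalls and dangers that may stay unreported or unnoticed when working with closed large datasets. Although we provide the dataset openly, we do not recommend using it for creating ready-to-go industrial products, as the basic research about the general properties and safety of such large-scale models, which this release is meant to support, is still in progress.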

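Training Procedure

The main training run was done at a global batch size of 81920 for 256 checkpoint intervals of 135.6M samples each, for a total of ~34B samples seen over training.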
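The sample budget can be checked with a few lines of arithmetic. The per-interval sample count and the interval count are taken from the `--train-num-samples` and `--epochs` flags in the srun command in this section; the floor division used for steps per interval is an assumption about how a partial final batch is handled.

```python
GLOBAL_BATCH = 81920
SAMPLES_PER_INTERVAL = 135_646_078  # --train-num-samples
INTERVALS = 256                     # --epochs (one checkpoint per interval)

total_samples = SAMPLES_PER_INTERVAL * INTERVALS

# Assumption: only full global batches count as optimizer steps.
steps_per_interval = SAMPLES_PER_INTERVAL // GLOBAL_BATCH
total_steps = steps_per_interval * INTERVALS

print(f"samples per interval: {SAMPLES_PER_INTERVAL / 1e6:.1f}M")
print(f"total samples seen:   {total_samples / 1e9:.1f}B")
print(f"steps per interval:   {steps_per_interval}")
print(f"total steps:          {total_steps}")
```

The total works out to roughly 34.7B samples, consistent with the "s34B" in the model name.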

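Many difficulties with both model numerical stability and cluster stability and performance were encountered while training this model. Initial attempts to train with float16 AMP and the default Adam beta2 resulted in loss spikes and, eventually, NaN blow-ups. Lowering beta2 to 0.97 helped, but the loss / zero-shot curves were not tracking as expected. After switching to PyTorch nightlies, it was possible to train with bfloat16 + AMP (as with the recent H/14, g/14, and G/14 models); beta2 was returned to 0.98 and metrics improved.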
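One common intuition for why beta2 matters here: Adam's second-moment estimate is an exponential moving average, and its effective averaging window is roughly 1 / (1 - beta2) steps. A smaller beta2 lets the estimate react faster to sudden changes in gradient scale. The sketch below only computes that window; the `ema_horizon` helper is hypothetical, and 0.999 is PyTorch's Adam default, assumed here to be the "default" the text refers to.

```python
def ema_horizon(beta2: float) -> float:
    """Approximate number of recent steps that dominate an EMA with decay beta2."""
    return 1.0 / (1.0 - beta2)


# 0.999: PyTorch Adam default (assumed); 0.98 and 0.97: values named in the text.
for beta2 in (0.999, 0.98, 0.97):
    print(f"beta2={beta2}: ~{ema_horizon(beta2):.0f} steps")
```

This is a rule of thumb, not an explanation the card itself gives for the observed instability.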

Checkpoint Interval	Cluster	# GPUs	# Nodes	GPU	local BS	sample/s	sample/s/gpu	precision	adam beta2
1 - 2	Stability	1024	128	A100 40GB	80	37-40k	36-39	amp + fp16	0.97
3 - 32	Stability	512	64	A100 80GB	160	27-32k	52-62	amp + fp16	0.97
33 - 75	Booster	1024	256	A100 40GB	80	48k	47	amp + fp16	0.97
76 - 165	Booster	1024	256	A100 40GB	80	51k	50	amp + bf16	0.98
166 - 232	Stability	320	40	A100 80GB	256	18-19k	56-59	amp + bf16	0.98
233 - 249	Booster	1024	256	A100 40GB	80	51k	50	amp + bf16	0.98
250 - 256	Stability	1024	128	A100 40GB	80	27-31k	26-30	amp + bf16	0.98
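The rows of the table above can be cross-checked against the stated global batch size. The sketch below copies the GPU count, node count, and local batch size from each row and confirms that every phase multiplies out to 81920; the `Phase` class is a hypothetical container used only for this check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    intervals: str
    cluster: str
    gpus: int
    nodes: int
    local_batch: int

    @property
    def global_batch(self) -> int:
        # Global batch = number of GPUs x per-GPU (local) batch size.
        return self.gpus * self.local_batch

    @property
    def gpus_per_node(self) -> int:
        return self.gpus // self.nodes


# Values copied from the table above.
phases = [
    Phase("1 - 2", "Stability", 1024, 128, 80),
    Phase("3 - 32", "Stability", 512, 64, 160),
    Phase("33 - 75", "Booster", 1024, 256, 80),
    Phase("76 - 165", "Booster", 1024, 256, 80),
    Phase("166 - 232", "Stability", 320, 40, 256),
    Phase("233 - 249", "Booster", 1024, 256, 80),
    Phase("250 - 256", "Stability", 1024, 128, 80),
]

for p in phases:
    print(f"{p.intervals:>9}  {p.cluster:<9}  "
          f"global batch {p.global_batch}  GPUs/node {p.gpus_per_node}")

gpus_per_node = {p.cluster: p.gpus_per_node for p in phases}
```

Holding the global batch fixed while the GPU count changes is what allows the run to move between clusters without changing the optimization setup.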

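JUWELS Booster has 4x A100 GPUs per node with 4x HDR-200 IB adapters per node (200 Gbit/sec per GPU). The Stability setup used 8x A100 GPUs per node with 400 Gbit/sec EFA networking per node (50 Gbit/sec per GPU). Significant variation in training efficiency (throughput per GPU) was observed across the various configurations. The 1024 GPU configurations on both clusters were particularly prone to crashing (or very difficult to get running with a "good" set of GPUs).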

A slurm srun command line for a 128-node, 8-GPU-per-node (40GB A100) configuration:

srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \
    --name "xxlarge-2b-81920-bf16" \
    --resume "latest" \
    --logs "/runs" \
    --log-every-n-steps 50 \
    --train-data="pipe:aws s3 cp s3://laion5b/laion2B-data/{000000..231349}.tar -" \
    --train-num-samples 135646078 \
    --dataset-type webdataset \
    --warmup 10000 \
    --batch-size=80 \
    --epochs=256 \
    --dataset-resampled \
    --aug-cfg use_timm=True scale='(0.33, 1.0)' re_prob=0.35 \
    --precision amp_bfloat16 \
    --grad-clip-norm 5.0 \
    --lr 1e-3 \
    --workers=6 \
    --beta2 0.98 \
    --model "convnext_xxlarge" \
    --seed 0 \
    --ddp-static-graph \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing \
    --report-to "tensorboard"

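For the rewind of the last 10%, a higher global batch size of 95744 was used, together with a higher learning rate and slightly increased augmentation strength.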

Checkpoint Interval	Cluster	# GPUs	# Nodes	GPU	local BS	sample/s	sample/s/gpu	precision	adam beta2
231 - 256	Stability	1088	136	A100 40GB	88	32-35k	29-32	amp + bf16	0.98

A slurm srun command line for a 136-node, 8-GPU-per-node (40GB A100) configuration:

srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \
    --name "xxlarge-2b-81920-r-bf16" \
    --resume "latest" \
    --logs "/runs" \
    --log-every-n-steps 50 \
    --train-data="pipe:aws s3 cp s3://laion5b/laion2B-data/{000000..231349}.tar -" \
    --train-num-samples 135646078 \
    --dataset-type webdataset \
    --warmup 10000 \
    --batch-size=88 \
    --epochs=256 \
    --dataset-resampled \
    --aug-cfg use_timm=True scale='(0.3, 1.0)' re_prob=0.4 \
    --precision amp_bfloat16 \
    --grad-clip-norm 5.0 \
    --lr 2e-3 \
    --workers=6 \
    --beta2 0.98 \
    --model "convnext_xxlarge" \
    --seed 0 \
    --ddp-static-graph \
    --local-loss \
    --gather-with-grad \
    --grad-checkpointing \
    --report-to "tensorboard"
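The two srun commands differ in only a handful of flags. The sketch below records a subset of the settings from each command as plain dictionaries and prints the ones that changed, along with the resulting global batch sizes. The dictionary keys are informal labels, not OpenCLIP argument names, and the GPU counts come from the node counts stated in the text (128 and 136 nodes of 8 GPUs).

```python
core = {
    "name": "xxlarge-2b-81920-bf16",
    "batch-size": 80,
    "lr": 1e-3,
    "aug scale": (0.33, 1.0),
    "re_prob": 0.35,
    "beta2": 0.98,
    "warmup": 10000,
    "precision": "amp_bfloat16",
}
rewind = {
    "name": "xxlarge-2b-81920-r-bf16",
    "batch-size": 88,
    "lr": 2e-3,
    "aug scale": (0.3, 1.0),
    "re_prob": 0.4,
    "beta2": 0.98,
    "warmup": 10000,
    "precision": "amp_bfloat16",
}
gpus = {"core": 128 * 8, "rewind": 136 * 8}

# Keep only the settings whose values differ between the two runs.
changed = {key: (core[key], rewind[key]) for key in core if core[key] != rewind[key]}
for key, (before, after) in changed.items():
    print(f"{key}: {before} -> {after}")

global_batch = {
    "core": gpus["core"] * core["batch-size"],
    "rewind": gpus["rewind"] * rewind["batch-size"],
}
print(global_batch)
```

The rewind therefore doubles the learning rate, widens the random-resize-crop range, raises the random-erasing probability, and grows the global batch from 81920 to 95744.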

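Evaluation

Evaluation was done with code in the LAION CLIP Benchmark suite.

Testing Data, Factors & Metrics

Testing Data

The testing is performed with VTAB+ (a combination of VTAB ( https://arxiv.org/abs/1910.04867 ) with additional robustness datasets) for classification, and with COCO and Flickr for retrieval.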
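This card does not name the retrieval metric. A common choice for COCO and Flickr is recall@k, and the following sketch assumes it. It also assumes a simplified one-to-one pairing in which the correct candidate for query i sits at index i; the real datasets pair several captions with each image. The `recall_at_k` helper and the similarity values are illustrative only.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose matching candidate is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy query-by-candidate similarity matrix.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

print("recall@1:", recall_at_k(sim, 1))
print("recall@2:", recall_at_k(sim, 2))
```

In this toy matrix the second query ranks its match second, so it counts as a miss at k=1 and a hit at k=2.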

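Results

These models achieve between 79.1% and 79.4% zero-shot top-1 accuracy on ImageNet-1k.

Figure: zoom-in on the final 10% of training, with the rewind.

An initial round of benchmarks has been performed on a wider range of datasets; the results can be viewed at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb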

Acknowledgements

Thanks to stability.ai and the Gauss Centre for Supercomputing e.V. ( http://gauss-centre.eu ) for funding this work and for providing computing resources on the GCS Supercomputer JUWELS Booster at the Jülich Supercomputing Centre (JSC).

Citation

BibTeX:

LAION-5B

@inproceedings{schuhmann2022laionb,
  title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
  author={Christoph Schuhmann and
          Romain Beaumont and
          Richard Vencu and
          Cade W Gordon and
          Ross Wightman and
          Mehdi Cherti and
          Theo Coombes and
          Aarush Katta and
          Clayton Mullis and
          Mitchell Wortsman and
          Patrick Schramowski and
          Srivatsa R Kundurthy and
          Katherine Crowson and
          Ludwig Schmidt and
          Robert Kaczmarczyk and
          Jenia Jitsev},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2022},
  url={https://openreview.net/forum?id=M3Y74vmsMcY}
}

OpenCLIP software

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

OpenAI CLIP paper

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}

@Article{liu2022convnet,
  author  = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
  title   = {A ConvNet for the 2020s},
  journal = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2022},
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}

@InProceedings{pmlr-v162-wortsman22a,
  title = 	 {Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time},
  author =       {Wortsman, Mitchell and Ilharco, Gabriel and Gadre, Samir Ya and Roelofs, Rebecca and Gontijo-Lopes, Raphael and Morcos, Ari S and Namkoong, Hongseok and Farhadi, Ali and Carmon, Yair and Kornblith, Simon and Schmidt, Ludwig},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {23965--23998},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/wortsman22a/wortsman22a.pdf},
  url = 	 {https://proceedings.mlr.press/v162/wortsman22a.html}
}

Author:

LAION eV

Dataset size:

4.48 GB