
Distilled Data-efficient Image Transformer (small-sized model)

Distilled data-efficient Image Transformer (DeiT) model pre-trained and fine-tuned on ImageNet-1k (1 million images, 1,000 classes) at resolution 224x224. It was first introduced in the paper Training data-efficient image transformers & distillation through attention by Touvron et al. and first released in this repository. The weights used here, however, were converted from the timm repository by Ross Wightman.

Disclaimer: The team releasing DeiT did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

This model is a distilled Vision Transformer (ViT). It uses a distillation token, besides the class token, to effectively learn from a teacher (CNN) during both pre-training and fine-tuning. The distillation token is learned through backpropagation, by interacting with the class ([CLS]) and patch tokens through the self-attention layers.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded.
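As a rough illustration of what this means for the input sequence, the sketch below counts the tokens the encoder sees for a square image. The function name and its parameters are hypothetical helpers for this example, not part of the transformers API; it assumes non-overlapping patches plus one class token and one distillation token, as described above.

```python
def deit_sequence_length(image_size: int = 224, patch_size: int = 16, distilled: bool = True) -> int:
    """Number of tokens the encoder sees for a square image (illustrative helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # One [CLS] token, plus one distillation token for the distilled variant.
    special_tokens = 2 if distilled else 1
    return num_patches + special_tokens


print(deit_sequence_length())  # 14 * 14 patches + 2 special tokens = 198
```

For this model (224x224 input, 16x16 patches), that is 196 patch tokens plus the two special tokens.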

Intended uses & limitations

You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.

How to use

Since this model is a distilled ViT model, you can plug it into DeiTModel, DeiTForImageClassification or DeiTForImageClassificationWithTeacher. Note that the model expects the data to be prepared using DeiTFeatureExtractor. Here we use AutoFeatureExtractor, which will automatically use the appropriate feature extractor given the model name.

Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes:

from transformers import AutoFeatureExtractor, DeiTForImageClassificationWithTeacher
from PIL import Image
import requests

# Download an example image from the COCO 2017 validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the matching feature extractor and the distilled classification model
feature_extractor = AutoFeatureExtractor.from_pretrained('facebook/deit-small-distilled-patch16-224')
model = DeiTForImageClassificationWithTeacher.from_pretrained('facebook/deit-small-distilled-patch16-224')

# Preprocess the image into PyTorch tensors and run a forward pass
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
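If the single most likely class is not enough, the logits can be turned into probabilities and ranked. The following is a minimal sketch using NumPy only; `top_k_predictions` is a hypothetical helper written for this example, and the toy logits and labels are made up so that the snippet runs without downloading the model.

```python
import numpy as np


def top_k_predictions(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs for one image."""
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable softmax: subtract the max before exponentiating.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    # Indices of the k largest probabilities, highest first.
    top = np.argsort(probs)[::-1][:k]
    return [(id2label[int(i)], float(probs[i])) for i in top]


# Toy example with three made-up classes (not real model output)
toy_logits = [2.0, 0.5, 1.0]
toy_labels = {0: "cat", 1: "dog", 2: "remote"}
for label, prob in top_k_predictions(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the model loaded as in the snippet above, one way to call it would be `top_k_predictions(outputs.logits[0].detach().numpy(), model.config.id2label)`.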

Currently, both the feature extractor and model support PyTorch. TensorFlow and JAX/Flax are coming soon.

Training data

This model was pretrained and fine-tuned with distillation on ImageNet-1k, a dataset consisting of 1 million images and 1k classes.

Training procedure

Preprocessing

The exact details of preprocessing of images during training/validation can be found here.

At inference time, images are resized/rescaled to the same resolution (256x256), center-cropped at 224x224 and normalized across the RGB channels with the ImageNet mean and standard deviation.
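The crop and normalization steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the feature extractor's actual implementation: the helper names are hypothetical, the image is assumed to be already resized to 256x256, and the mean and standard deviation are the commonly used ImageNet RGB statistics for pixel values scaled to [0, 1].

```python
import numpy as np

# Commonly used ImageNet channel statistics (RGB), for pixels scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def center_crop(image, size=224):
    """Crop the central size x size window from an (H, W, C) array."""
    height, width = image.shape[:2]
    if height < size or width < size:
        raise ValueError("image is smaller than the crop size")
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]


def normalize(image):
    """Scale uint8 pixels to [0, 1], then standardize each RGB channel."""
    scaled = image.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD


# Random stand-in for an image that has already been resized to 256x256.
resized = np.random.default_rng(0).integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
pixel_values = normalize(center_crop(resized)).transpose(2, 0, 1)  # channels first
print(pixel_values.shape)  # (3, 224, 224)
```

In practice the feature extractor performs these steps for you; the sketch only makes the arithmetic explicit.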

Pretraining

The model was trained on a single 8-GPU node for 3 days. Training resolution is 224. For all hyperparameters (such as batch size and learning rate), see Table 9 of the original paper.

Evaluation results

| Model | ImageNet top-1 accuracy (%) | ImageNet top-5 accuracy (%) | # params |
|---|---|---|---|
| DeiT-tiny | 72.2 | 91.1 | 5M |
| DeiT-small | 79.9 | 95.0 | 22M |
| DeiT-base | 81.8 | 95.6 | 86M |
| DeiT-tiny distilled | 74.5 | 91.9 | 6M |
| DeiT-small distilled | 81.2 | 95.4 | 22M |
| DeiT-base distilled | 83.4 | 96.5 | 87M |
| DeiT-base 384 | 82.9 | 96.2 | 87M |
| DeiT-base distilled 384 (1000 epochs) | 85.2 | 97.2 | 88M |

Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Larger models also reach higher accuracy in the results above, at the cost of more parameters.
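To read the effect of distillation off these results, the short sketch below computes the top-1 gain of each distilled model over its non-distilled counterpart at resolution 224. The numbers are copied from the evaluation results in this card; the dictionary layout and the `distillation_gain` helper are just illustrative choices.

```python
# Top-1 accuracies (%) copied from the evaluation results, at resolution 224.
top1 = {
    "tiny": {"baseline": 72.2, "distilled": 74.5},
    "small": {"baseline": 79.9, "distilled": 81.2},
    "base": {"baseline": 81.8, "distilled": 83.4},
}


def distillation_gain(scores):
    """Top-1 improvement in percentage points, rounded to one decimal."""
    return round(scores["distilled"] - scores["baseline"], 1)


for size, scores in top1.items():
    print(f"DeiT-{size}: +{distillation_gain(scores)} points top-1")
```

For the small model described in this card, the gain is 1.3 points (79.9 to 81.2).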

BibTeX entry and citation info

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention.

Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual Transformers: Token-based Image Representation and Processing for Computer Vision.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition.