Journal article: Transactions on Machine Learning Research Journal, 2024

DINOv2: Learning Robust Visual Features without Supervision

Authors: Maxime Oquab, Huy Vo, Marc Szafraniec, Vasil Khalidov, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Patrick Labatut, Armand Joulin, Piotr Bojanowski

Abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.
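
One common way to check a claim like "features that work without finetuning" is to keep the feature extractor fixed and fit only a lightweight classifier on its outputs. The sketch below illustrates that protocol on synthetic data. It is not the authors' evaluation code: `frozen_features` is a hypothetical stand-in (a fixed random projection, not DINOv2), the data are random vectors rather than images, and the nearest-centroid classifier is chosen only to keep the example short.

```python
import math
import random

random.seed(0)
DIM_IN, DIM_FEAT = 64, 32

# Hypothetical stand-in for a pretrained backbone: a fixed random projection.
# These weights are created once and never updated (no finetuning).
WEIGHTS = [[random.gauss(0.0, 1.0) for _ in range(DIM_FEAT)] for _ in range(DIM_IN)]

def frozen_features(image):
    """Project, apply ReLU, and L2-normalise; the backbone itself stays fixed."""
    z = [max(0.0, sum(image[i] * WEIGHTS[i][j] for i in range(DIM_IN)))
         for j in range(DIM_FEAT)]
    norm = math.sqrt(sum(v * v for v in z)) or 1.0
    return [v / norm for v in z]

def make_split(n_per_class):
    """Synthetic 'images': class 0 centred at -1, class 1 centred at +1."""
    data = []
    for label, mean in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            data.append(([random.gauss(mean, 0.5) for _ in range(DIM_IN)], label))
    return data

def fit_centroids(samples):
    """The only 'training' step: average the frozen features of each class."""
    sums, counts = {}, {}
    for image, label in samples:
        acc = sums.setdefault(label, [0.0] * DIM_FEAT)
        for j, v in enumerate(frozen_features(image)):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(image, centroids):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    feats = frozen_features(image)
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(feats, centroids[c])))

train, test = make_split(100), make_split(50)
centroids = fit_centroids(train)
accuracy = sum(predict(x, centroids) == y for x, y in test) / len(test)
print(f"accuracy with a frozen feature extractor: {accuracy:.2f}")
```

In this setup only `fit_centroids` sees labels; swapping the task means recomputing centroids, while `WEIGHTS` stays untouched.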
Main file
2304.07193.pdf (7.58 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04376640, version 1 (06-01-2024)
hal-04376640, version 2 (02-02-2024)

Licence

Attribution

Cite

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, et al. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research Journal, 2024, pp.1-31. ⟨10.48550/arxiv.2304.07193⟩. ⟨hal-04376640v1⟩

Collections

MIAI ANR