Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

Shashanka Venkataramanan; Mamshad Nayeem Rizve; João Carreira; Yuki Asano; Yannis Avrithis

Conference paper, Year: 2024

Shashanka Venkataramanan

Role: Author
PersonId: 1126157

Creating and exploiting explicit links between multimedia fragments

Mamshad Nayeem Rizve

Role: Author

University of Central Florida [Orlando]

João Carreira

Role: Author

Google DeepMind

Yuki Asano

Role: Author

University of Amsterdam [Amsterdam] = Universiteit van Amsterdam

Yannis Avrithis

Role: Author
PersonId: 20705
IdHAL: yannis-avrithis
ORCID: 0000-0001-7476-4482
IdRef: 253126193

Institute of Advanced Research in Artificial Intelligence [Vienna]

Abstract

Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How much more economical can we be? In this work, we attempt to answer this question by making two contributions. First, we investigate first-person videos and introduce a "Walking Tours" dataset. These videos are high-resolution, hours-long, and captured in a single uninterrupted take, depicting a large number of objects and actions with natural scene transitions. They are unlabeled and uncurated, thus realistic for self-supervision and comparable with human learning. Second, we introduce a novel self-supervised image pretraining method tailored for learning from continuous videos. Existing methods typically adapt image-based pretraining approaches to incorporate more frames. Instead, we advocate a "tracking to learn to recognize" approach. Our method, called DORA, leads to attention maps that Discover and tRAck objects over time in an end-to-end manner, using transformer cross-attention. We derive multiple views from the tracks and use them in a classical self-supervised distillation loss. Using our novel approach, a single Walking Tours video remarkably becomes a strong competitor to ImageNet for several image and video downstream tasks.
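
The abstract does not say which distillation loss is used, so the following is only a minimal sketch of one common form: a teacher-student cross-entropy computed across different views of the same content. The function names, the temperature values, and the choice to skip same-view pairs are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax, computed without taking log(0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_norm for x in scaled]


def distillation_loss(student_views, teacher_views,
                      t_student=0.1, t_teacher=0.04):
    """Mean cross-entropy H(teacher, student) over pairs of distinct views.

    student_views / teacher_views: lists of logit vectors, one per view.
    The temperatures are illustrative defaults (an assumption).
    """
    total, pairs = 0.0, 0
    for i, t_logits in enumerate(teacher_views):
        p_teacher = softmax(t_logits, t_teacher)
        for j, s_logits in enumerate(student_views):
            if i == j:
                continue  # compare only different views
            log_p_student = log_softmax(s_logits, t_student)
            total -= sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))
            pairs += 1
    if pairs == 0:
        raise ValueError("need at least two views")
    return total / pairs


# Hypothetical logits for two views of the same tracked object.
teacher = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
agreeing_student = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
disagreeing_student = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
print(distillation_loss(agreeing_student, teacher))
print(distillation_loss(disagreeing_student, teacher))
```

In this sketch the loss is small when the student assigns high probability to the same output dimension as the teacher on a different view, and large when the two disagree. In methods of this family the teacher is typically a slowly updated copy of the student and receives no gradient; that detail is omitted here.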

Domains

Computer Vision and Pattern Recognition [cs.CV]

Main file

dora_iclr.pdf (15.29 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-04407117

Submitted on: Monday, 22 January 2024, 09:36:42

Last modified on: Friday, 2 February 2024, 16:51:37

Dates and versions

hal-04407117 , version 1 (20-01-2024)

hal-04407117 , version 2 (22-01-2024)

License

Attribution

Identifiers

HAL Id : hal-04407117 , version 2

Cite

Shashanka Venkataramanan, Mamshad Nayeem Rizve, João Carreira, Yuki Asano, Yannis Avrithis. Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video. ICLR 2024 - Twelfth International Conference on Learning Representations, May 2024, Vienna, Austria. pp.1-21. ⟨hal-04407117v2⟩

Collections

UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA CENTRALESUPELEC INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

74 views

29 downloads
