VirTex is a pretraining approach which uses semantically dense captions to learn visual representations. We train CNN + Transformers from scratch on COCO Captions, and transfer the CNN to downstream ...
Abstract: Convolutional neural networks (CNNs) have been successfully applied to the single target tracking task in recent years. Generally, training a deep CNN model requires numerous labeled ...
25 years ago, Jianbo Shi introduced Normalized Cuts (spectral clustering), a graph-theoretic approach to perceptual grouping that became a staple in unsupervised image segmentation. While the original ...
Abstract: Compositional Zero-Shot Learning (CZSL) aims to learn visual concepts (i.e., attributes and objects) from seen compositions and combine them to predict unseen compositions. Existing visual ...