There was an electric feeling in the air at CVPR 2023 in Vancouver. Three members of the Roboflow team were in attendance, and Roboflow hosted a panel on the Roboflow 100 dataset.
The world's premier computer vision conference was packed with researchers and practitioners sharing ideas on the past year's breakthroughs and looking toward the future of the field and of AI at large.
In this post, we will dissect the themes and highlights of CVPR 2023. It is both a reflection on the conference and a prediction of the major themes that will dominate the computer vision landscape in the coming year.
The Rise of the Vision Transformer
In the world of AI research, the transformer architecture has made major strides in pushing the state of the art, and it has recently landed in the computer vision world. The vision transformer, built on the transformer architecture, treats a patch of pixels like a token in a sequence of text, allowing the same architecture to be used for vision tasks.
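To make the patch-as-token idea concrete, here is a minimal sketch (not any particular model's implementation) of how a vision transformer turns an image into a sequence of embeddings: the image is cut into fixed-size patches, each patch is flattened, and a linear projection maps it to an embedding, just like a token embedding in NLP. The patch size, embedding dimension, and random projection weights below are illustrative stand-ins for learned parameters.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, embed_dim=768, rng=None):
    """Split an (H, W, C) image into patches and project each to embed_dim."""
    rng = rng or np.random.default_rng(0)
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size
    # (ph, patch, pw, patch, c) -> (num_patches, patch_dim)
    patches = (image[: ph * patch_size, : pw * patch_size]
               .reshape(ph, patch_size, pw, patch_size, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(ph * pw, patch_size * patch_size * c))
    # Stand-in for the learned linear projection of a real ViT
    projection = rng.standard_normal((patches.shape[1], embed_dim))
    return patches @ projection  # (num_patches, embed_dim) "token" sequence

tokens = image_to_patch_embeddings(np.zeros((224, 224, 3)))
print(tokens.shape)  # a 224x224 image yields a 14x14 grid of patch tokens
```

From here, the resulting sequence is fed through the same self-attention blocks used for text, which is what lets one architecture serve both modalities.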
At CVPR 2023, we saw a slew of new techniques related to the vision transformer with researchers working on analyzing its biases, pruning it, pretraining it, distilling it, reverse distilling it, and applying it to new tasks.
Some of our favorite papers in this category:
- OneFormer: One Transformer To Rule Universal Image Segmentation
- Q-DETR: An Efficient Low-Bit Quantized Detection Transformer
- SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
Computer Vision Yearns for a Foundational Model
General pre-trained models have been shown to be wide multi-task learners, obviating the need for many of the more tedious fine-tuning approaches to machine learning problems. In NLP, language models that predict the next token in text have proven to be a foundational model that scales in efficacy with model size. In the computer vision research community, no such model and loss objective have emerged to serve as the de facto foundational model for CV tasks.
In Artificial Intelligence academia, there is often an attitude of "doing more with less" as we heard at the Tuesday and Wednesday keynotes. The academic research community recognizes that they will not be able to compete with industrial research labs that have access to vast computing resources to create their general models.
With that said, we saw numerous instances of research labs working on foundation models at CVPR, mostly at the intersection of language and images.
General pre-trained computer vision models that were heavily discussed at CVPR varied in both approach and modality:
- Grounding DINO: Zero shot object detection, multi-modal
- SAM: Zero shot segmentation, image only
- Multi-modal GPT-4 (not as much as we expected)
- Florence: General task, multi-modal
- OWL-ViT: Zero shot object detection, multi-modal
CLIP also boasted a long lineage of research papers at CVPR. Some exciting research working on foundational embedding models for computer vision at CVPR included:
- Learning Visual Representations via Language-Guided Sampling
- DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
- RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
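The common thread in this line of work is CLIP's symmetric contrastive objective: matched image/text embedding pairs are pulled together while mismatched pairs are pushed apart. The sketch below illustrates that loss in plain NumPy; the random embeddings stand in for the outputs of real image and text encoders, and the temperature value is illustrative.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(logits))  # the i-th image matches the i-th caption

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image->text and text->image directions
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.standard_normal((8, 512)),
                             rng.standard_normal((8, 512)))
print(loss)
```

Papers like DisCo-CLIP and RA-CLIP keep this core objective and instead change how it is computed (distributed, memory-efficient) or what it is computed over (retrieval-augmented batches).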
Next year, there will inevitably be significant progress and focus on this front, and we can expect some exciting new foundational CV models to be released.
Machine Learning Techniques, Tactics, and Tasks
While the conference hall was full of discussion about general models, the core body of CVPR research in 2023 involved more traditional work in techniques and tasks in computer vision.
Research advanced in tasks like NeRFs, pose estimation, and tracking, with new approaches and training routines.
General machine learning techniques advanced as well, as researchers worked on both the theory of machine learning and empirical results to improve training routines.
We were particularly excited about the following practical machine learning research:
- Soft Augmentation for Image Classification
- FFCV: Accelerating Training by Removing Data Bottlenecks
- The Role of Pre-training Data in Transfer Learning
Industry vs Research: A Notable Divide
A physical divide between the research poster sessions and company booths underpinned an intellectual divide between the future of the field and what is practical today.
While the research posters and workshop sessions focused primarily on vision transformers, the industry booths sported Python snippets wrapping YOLO models.
We were really excited to see significant industrial progress being made by companies working on data annotation services, cloud compute, and model acceleration.
It has never been a more exciting time to be working in computer vision. CVPR 2023 showcased many important moments from the year in our field. Multi-modal models promise a new foundation, and practical progress is taking computer vision into a new phase of adoption in industry.
You can view the full list of CVPR 2023 research papers here: https://openaccess.thecvf.com/CVPR2023?day=all