Pre-trained vision-language models have demonstrated remarkable zero-shot performance on several tasks, such as image classification and visual question answering. In this talk, we describe methods for improving visual reasoning performance with little or no human annotation and no additional training. In the first part, we describe a method that combines CLIP with simple spatial heuristics to form a strong baseline for referring expression comprehension. In the second part, we describe a method for multi-hop visual question answering that generates a program from the question; the program calls visual functions and combines their outputs to predict the answer.
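As a rough illustration of the first method (not the paper's exact procedure), the sketch below scores cropped region proposals against the referring expression with an off-the-shelf CLIP model and breaks ties with a toy position prior for expressions mentioning "left" or "right"; the model checkpoint, proposal format, and heuristic weights are assumptions for illustration only.

```python
# Hedged sketch: zero-shot referring expression comprehension with CLIP.
# Assumptions: region proposals come from an external source; the spatial
# heuristic is a toy position prior, not the paper's full relation handling.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_box(image: Image.Image, boxes, expression: str):
    """Return the proposal box that CLIP scores highest for the expression."""
    crops = [image.crop(box) for box in boxes]  # boxes: (x0, y0, x1, y1) tuples
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(1)  # one score per crop

    # Toy spatial heuristic: nudge scores by horizontal box center when the
    # expression mentions "left" or "right".
    centers = torch.tensor([(b[0] + b[2]) / 2.0 for b in boxes])
    if "left" in expression.lower():
        scores = scores - 0.01 * centers
    elif "right" in expression.lower():
        scores = scores + 0.01 * centers
    return boxes[int(scores.argmax())]
```

For the second method, the idea is that a language model translates the question into a short program over visual primitives, and running the program yields the answer. The primitive signature (query), the generated snippet, and the helper names below are hypothetical stand-ins for illustration, not the paper's exact interface.

```python
# Hedged sketch of modular VQA via code generation. The primitive and the
# example generated program are illustrative assumptions.
from PIL import Image

def query(image: Image.Image, question: str) -> str:
    """Primitive: answer a simple single-hop question about an image
    (stubbed here; would wrap a pretrained VQA model)."""
    raise NotImplementedError

def generate_program(question: str) -> str:
    """Ask a code LLM to turn the question into Python over the primitives
    (stubbed here). A multi-hop question might come back as the string below."""
    return (
        "left = query(image.crop((0, 0, image.width // 2, image.height)), 'What animal is this?')\n"
        "right = query(image.crop((image.width // 2, 0, image.width, image.height)), 'What animal is this?')\n"
        "answer = 'yes' if left == right else 'no'\n"
    )

def answer_question(image: Image.Image, question: str) -> str:
    program = generate_program(question)
    scope = {"image": image, "query": query}
    exec(program, scope)        # run the generated program
    return scope["answer"]      # the program stores its result in `answer`
```

The appeal of this decomposition is that each primitive call is a single-hop query a pretrained model can handle zero-shot, while the multi-hop logic lives in ordinary Python control flow.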
Papers:
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension (https://arxiv.org/abs/2204.05991)
Modular Visual Question Answering via Code Generation (https://arxiv.org/abs/2306.05392)
Sanjay is a PhD student in computer science at UC Berkeley advised by Trevor Darrell and Dan Klein. His research interests lie at the intersection of natural language processing and computer vision. Previously, he was a Predoctoral Young Investigator on the AllenNLP team at the Allen Institute for Artificial Intelligence.