Bio: Yang Chen is a Ph.D. student at Georgia Tech, supervised by Professors Alan Ritter and Wei Xu. His research interests lie in grounding the visual knowledge of multimodal large language models through retrieval-augmented generation. He has interned at Google DeepMind, working with Hexiang Hu and Ming-Wei Chang on visual entity reasoning using multimodal LLMs. Prior to that, he received an M.S. in Computer Science from the University of Chicago and a Bachelor’s degree from the University of Melbourne.

Description: Multimodal Large Language Models (MLLMs) have demonstrated state-of-the-art capabilities on a variety of tasks involving both images and text, including visual question answering. However, it remains unclear whether these MLLMs can answer information-seeking questions about an image, such as "When was this church built?".

In this talk, I will first introduce InfoSeek, a dataset tailored for visual information-seeking questions that cannot be answered using common sense knowledge alone. I will then present insights into the generalization and instruction tuning of MLLMs using InfoSeek. Finally, I will discuss what the future holds for multimodal retrieval models and how MLLM-powered generative search engines could transform the existing search experience.

Project page at https://open-vision-language.github.io/infoseek/
