Description: Multimodal Large Language Models (MLLMs) have demonstrated state-of-the-art capabilities in various tasks involving both images and text, including visual question answering. However, it remains unclear whether these MLLMs can answer information-seeking questions about an image, such as "When was this church built?".
In this talk, I will first introduce InfoSeek, a dataset tailored for visual information-seeking questions that cannot be answered using common sense knowledge alone. I will then present insights into the generalization and instruction tuning of MLLMs using InfoSeek. Finally, I will discuss what the future holds for multimodal retrieval models and how MLLM-powered generative search engines could transform existing search experiences.
Project page: https://open-vision-language.github.io/infoseek/