The Cohere For AI community's Interactive Reading Group was pleased to welcome Michael Tschannen to present his work on "Image-and-Language Understanding from Pixels Only."

Abstract: Multimodal models often consist of many task- and modality-specific pieces and training procedures. For example, CLIP trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO performs well on natural language understanding tasks without any word-level loss, outperforming prior pixel-based work.
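
To make the core idea concrete, here is a minimal sketch (not the authors' code) of CLIPPO-style contrastive training: a single shared encoder embeds both regular images and text rendered as images, and a symmetric InfoNCE loss pulls matching pairs together. The `SharedEncoder` class and the way text images are produced are stand-in assumptions for illustration only; the actual model uses a vision transformer and a proper text-rendering pipeline.

```python
# Hedged sketch of CLIPPO-style training, NOT the paper's implementation.
# Assumptions: `SharedEncoder` is a toy stand-in for the single ViT, and the
# "text images" below are random tensors standing in for rendered sentences.
import torch
import torch.nn.functional as F

class SharedEncoder(torch.nn.Module):
    """Stand-in for the single encoder CLIPPO applies to all pixel inputs."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 224 * 224, dim)  # toy, patch-free encoder

    def forward(self, pixels):                # pixels: (B, 3, 224, 224)
        return self.proj(pixels.flatten(1))   # (B, dim) embeddings

def clippo_contrastive_loss(encoder, images, text_images, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss using one shared encoder for both modalities."""
    z_img = F.normalize(encoder(images), dim=-1)
    z_txt = F.normalize(encoder(text_images), dim=-1)  # same weights as for images
    logits = z_img @ z_txt.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(images.size(0))
    # Matching image/text pairs lie on the diagonal; classify in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors in place of real batches.
enc = SharedEncoder()
imgs = torch.randn(8, 3, 224, 224)
txt_imgs = torch.randn(8, 3, 224, 224)  # in practice: sentences rasterized onto a canvas
loss = clippo_contrastive_loss(enc, imgs, txt_imgs)
loss.backward()
```

The key design point the sketch highlights is that no text-specific tower or token embedding table appears anywhere: both modalities pass through the same pixel encoder, which is what halves the parameter count relative to a two-tower CLIP setup.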

Bio: Michael Tschannen is a Research Scientist at Google Research Zurich (Brain Team), broadly interested in multimodal representation learning. Before that, he worked on computer vision R&D at Apple Zurich for two years and spent a year as a postdoc at Google Research Zurich exploring topics in unsupervised representation learning, generative models, and neural compression. He completed his PhD at ETH Zurich in late 2018. Prior to that, he obtained an MSc from ETH Zurich and a BSc from EPFL, both in Electrical Engineering and Information Technology.

Website: https://mitscha.github.io/

Paper: https://arxiv.org/abs/2212.08045
