FLAIR: VLM with Fine-grained Language-informed Image Representations

1 Technical University of Munich    2 Munich Center for Machine Learning (MCML)
3 Helmholtz Munich    4 Munich Data Science Institute (MDSI)
CVPR 2025

Abstract

CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained LAnguage-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance both on existing multimodal retrieval benchmarks and on our newly introduced fine-grained retrieval task, which evaluates vision-language models' ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR, trained on 30M image-text pairs, in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs.

Methodology

Figure 1 compares FLAIR with prior approaches. FLAIR is a CLIP-style vision-language model that introduces text-conditioned attention pooling. Unlike CLIP, which directly uses a global image token, or SigLIP, which employs a learnable global query, FLAIR leverages global text tokens as queries to pool local image features. It further aligns the resulting global image token with its corresponding text.

Figure 1. Comparison with previous methods.
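To make the pooling mechanism concrete, below is a minimal PyTorch sketch of text-conditioned attention pooling as described above: the global text embedding serves as the query of a cross-attention layer over local image tokens, yielding a text-specific image embedding. The module name, head count, and shapes are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn


class TextConditionedAttentionPool(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention: text query attends over local image tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_global: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_global: (B, D) global text embedding used as the query
        # image_tokens: (B, N, D) local image tokens used as keys and values
        query = text_global.unsqueeze(1)                       # (B, 1, D)
        pooled, _ = self.attn(query, image_tokens, image_tokens)
        return self.norm(pooled.squeeze(1))                    # (B, D) text-specific image embedding


if __name__ == "__main__":
    pool = TextConditionedAttentionPool(dim=512)
    img_tokens = torch.randn(4, 196, 512)   # e.g. 14x14 patch tokens from a ViT
    txt_global = torch.randn(4, 512)
    print(pool(txt_global, img_tokens).shape)  # torch.Size([4, 512])

In contrast to CLIP's single global image token or SigLIP's learnable pooling query, the query here changes with the caption, so the same image yields different pooled embeddings for different texts.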

Figure 2 illustrates the architecture of FLAIR. It is trained on datasets with synthetic captions generated by MLLMs. By sampling diverse sub-captions per image, FLAIR creates multiple positive and negative pairs. Each sub-caption guides attention pooling over local image tokens to produce fine-grained, text-specific image embeddings. The model is optimized using a text-conditioned sigmoid loss and a multi-positive sigmoid loss.

Figure 2. An overview of FLAIR.
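As a hedged illustration of the multi-positive sigmoid objective mentioned above, the sketch below implements a simplified SigLIP-style loss in which several sampled sub-captions of the same image count as positives. The function name, the scale/bias values, and the use of plain (rather than text-conditioned) image embeddings are simplifying assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F


def multi_positive_sigmoid_loss(img_emb, txt_emb, image_ids, scale=10.0, bias=-10.0):
    # img_emb: (B, D) one embedding per image
    # txt_emb: (M, D) one embedding per sampled sub-caption
    # image_ids: (M,) index of the image each sub-caption describes
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = scale * img_emb @ txt_emb.t() + bias              # (B, M) pairwise logits
    # +1 for every matching image-caption pair (multiple positives per image), -1 otherwise
    targets = (torch.arange(img_emb.size(0)).unsqueeze(1)
               == image_ids.unsqueeze(0)).float() * 2 - 1
    return -F.logsigmoid(targets * logits).mean()


img = torch.randn(4, 512)
txt = torch.randn(8, 512)                        # e.g. 2 sub-captions sampled per image
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(multi_positive_sigmoid_loss(img, txt, ids))

In FLAIR, the image embedding entering such a loss is additionally conditioned on the paired text via the attention pooling shown earlier, which is what makes the alignment text-specific.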

Quantitative Results

FLAIR is evaluated on a range of tasks, including zero-shot standard, fine-grained, and long image-text retrieval, as well as zero-shot semantic segmentation and image classification. Trained on just 30M image-text pairs, FLAIR delivers strong performance across all three retrieval settings and in semantic segmentation, outperforming models trained on billion-scale datasets. For zero-shot image classification, it matches the performance of prior methods trained on similar-scale synthetic data, showing that fine-grained alignment does not come at the cost of global recognition ability.

Qualitative Results

We visualize the attention maps from our text-conditioned attention pooling layer to illustrate how FLAIR localizes semantic regions in the image based on different captions. As shown below, FLAIR attends to both large and small objects (e.g., "truck" and "worker") and distinguishes between similar objects based on properties like color or position (e.g., different horses). We also visualize token-wise image-text similarities (between local image tokens and the global text token), showing that FLAIR produces fine-grained alignment and successfully highlights the relevant local regions for each caption. These results underscore FLAIR's sensitivity to semantic details and its capacity for localized understanding. A minimal sketch of how such a token-wise similarity map can be computed follows this paragraph.
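The sketch below assumes a 14x14 patch grid and a 224x224 input; it computes the cosine similarity between each local image token and the global text embedding and reshapes the result into a heatmap for overlay. Shapes and the helper name are illustrative assumptions.

import torch
import torch.nn.functional as F


def tokenwise_similarity_map(image_tokens, text_global, grid_size=14):
    # image_tokens: (N, D) local image tokens, N = grid_size * grid_size
    # text_global: (D,) global text embedding for one caption
    sim = F.cosine_similarity(image_tokens, text_global.unsqueeze(0), dim=-1)  # (N,)
    heatmap = sim.view(grid_size, grid_size)
    # Upsample to image resolution for overlaying on the input image.
    return F.interpolate(heatmap[None, None], size=(224, 224), mode="bilinear",
                         align_corners=False)[0, 0]


tokens = torch.randn(196, 512)
text = torch.randn(512)
print(tokenwise_similarity_map(tokens, text).shape)  # torch.Size([224, 224])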

BibTeX

@inproceedings{xiao2025flair,
  title     = {FLAIR: VLM with Fine-grained Language-informed Image Representations},
  author    = {Xiao, Rui and Kim, Sanghwan and Georgescu, Mariana-Iuliana and Akata, Zeynep and Alaniz, Stephan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}