FINER: MLLMs Hallucinate under Fine-grained Negative Queries
Abstract
Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and “what” questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Fine-tuning four frontier MLLMs with FINER-Tuning yields gains of up to 24.2% on our hallucination benchmarks, while simultaneously improving performance on eight existing hallucination suites and enhancing general multimodal capabilities across six benchmarks.
Motivational Study
We observe an underexplored failure mode of frontier MLLMs: like careless humans, they overlook incorrect details hidden within long queries. We term such cases Fine-grained Negative Queries (FINER). As a motivational study, we construct FINER with seven levels of increasing granularity and probe InternVL3.5 with yes-or-no questions whose correct answer is always “No.” We then compare the base model against its counterpart fine-tuned with FINER-Tuning.
The trend is striking. As the query becomes more detailed, the base model becomes much more likely to answer “Yes” to claims that should be rejected (false-positive hallucinations). For InternVL3.5-14B, accuracy drops from around 80% to around 20% on FINER-CompreCap, and to around 15% on FINER-DOCCI. In other words, subtle mistakes hidden inside otherwise correct descriptions are much harder for MLLMs to detect. FINER-Tuning noticeably improves this behavior, especially at the finest levels, motivating the need for benchmarks and training data that specifically target fine-grained hallucinations.
FINER-Benchmarks
FINER is built to test whether an MLLM can spot a small mistake hidden inside a detailed query. We start from a scene graph containing objects, attributes, and relations. For each element, we generate several plausible but incorrect alternatives, such as replacing an object, changing a color, or modifying a relation. We then compose both positive and negative questions from these scene graphs.
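The single-edit negative generation described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the scene-graph layout, the `make_negative` helper, and the fixed substitution tables are all hypothetical (the paper generates plausible alternatives with an LLM rather than lookup tables).

```python
import copy
import random

# Hypothetical scene-graph format: a list of objects, attributes keyed
# by object, and (subject, predicate, object) relation triples.
scene = {
    "objects": ["dog", "frisbee", "grass"],
    "attributes": {"dog": ["brown"], "frisbee": ["red"]},
    "relations": [("dog", "catching", "frisbee")],
}

# Illustrative pools of plausible-but-wrong substitutes.
OBJECT_SWAPS = {"dog": "cat", "frisbee": "ball", "grass": "sand"}
ATTR_SWAPS = {"brown": "black", "red": "blue"}
REL_SWAPS = {"catching": "chewing"}

def make_negative(graph, kind):
    """Return a copy of `graph` with exactly one semantic component edited:
    a replaced object, a changed attribute, or a modified relation."""
    neg = copy.deepcopy(graph)
    if kind == "object":
        i = random.randrange(len(neg["objects"]))
        neg["objects"][i] = OBJECT_SWAPS[neg["objects"][i]]
    elif kind == "attribute":
        obj = random.choice(list(neg["attributes"]))
        attrs = neg["attributes"][obj]
        j = random.randrange(len(attrs))
        attrs[j] = ATTR_SWAPS[attrs[j]]
    elif kind == "relation":
        k = random.randrange(len(neg["relations"]))
        s, p, o = neg["relations"][k]
        neg["relations"][k] = (s, REL_SWAPS[p], o)
    return neg
```

Because exactly one component is edited, every negative stays maximally close to the positive, which is what makes the resulting queries hard to reject.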
Both FINER-CompreCap and FINER-DOCCI contain four settings. Multi-obj checks whether the model can detect one wrong object among several correct ones. Multi-attr does the same for attributes. Multi-rel focuses on relations. Wh asks “what” questions with one incorrect attribute embedded in the query. Instead of simple yes/no evaluation, FINER uses multiple-choice questions. This design has two motivations: first, we want the model to reason beyond a binary yes-or-no; second, embedding the ground truth among the choices lets the model compare options and pick the best answer, reducing ambiguity.
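The multiple-choice composition can be illustrated with a small sketch. The `build_mcq` helper and its prompt template are hypothetical; the point is only that the ground-truth phrase is shuffled in among the fine-grained negatives, so the model must compare options rather than judge a single claim.

```python
import random

def build_mcq(image_id, positive, negatives, seed=0):
    """Compose a multiple-choice query: the ground-truth phrase is mixed
    with fine-grained negatives; returns (prompt, answer letter)."""
    rng = random.Random(seed)          # seeded for reproducible option order
    options = [positive] + list(negatives)
    rng.shuffle(options)
    letters = "ABCD"[: len(options)]
    lines = [f"({l}) {o}" for l, o in zip(letters, options)]
    answer = letters[options.index(positive)]
    prompt = f"Which description matches image {image_id}?\n" + "\n".join(lines)
    return prompt, answer
```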
We perform human-in-the-loop filtering for the two benchmarks, since full human inspection is infeasible at this scale. For each positive entity, we generate four negatives. We then apply Qwen2.5-VL-72B to identify the positive among the negatives. If it succeeds, we keep all negatives. If it fails, we inspect the decision entropy: low entropy (the model is highly confident in certain negatives, likely indicating that they actually appear in the image) triggers a rewrite of the negatives with Gemini. Human verification on a small subset of samples is used to set the filtering threshold.
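The entropy-based decision rule can be sketched as below. The function names and the threshold value `tau` are hypothetical stand-ins for the paper's calibrated threshold; the logic mirrors the description: a confident (low-entropy) wrong pick suggests a negative is actually depicted and should be rewritten, while a high-entropy failure is treated as benign ambiguity.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a choice distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def filter_decision(probs, positive_idx, tau=0.5):
    """Hypothetical filtering rule. `probs` is the checker model's
    distribution over the answer options; `positive_idx` marks the
    ground-truth option. Returns "keep" or "rewrite"."""
    pick = max(range(len(probs)), key=probs.__getitem__)
    if pick == positive_idx:
        return "keep"                      # checker found the positive
    # Wrong pick: only a *confident* wrong pick (low entropy) suggests
    # the negative is actually in the image and must be rewritten.
    return "rewrite" if entropy(probs) < tau else "keep"
```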
FINER-Tuning
FINER-Tuning is a data-driven training pipeline designed to make MLLMs better at rejecting fine-grained false queries. We start from long, dense captions from Pixmo, avoiding overlap with COCO and the DOCCI training split. Using only Phi-4-14B (rather than costly proprietary models such as Gemini or GPT), we extract four kinds of positive phrases that mirror our benchmark settings: object summaries, attribute summaries, relation summaries, and composed phrases for “what” questions.
We then generate minimally edited negative counterparts by changing exactly one semantic component, such as an object, an attribute, or a relation. From these positive and negative phrases, we build both positive and negative query-answer pairs. The accepted response always states the correct visual fact, while the rejected response gives the wrong one. The key difference from previous work lies in the queries: FINER-Tuning features long queries that may contain a small anomaly requiring a negative response from the model.
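A minimal sketch of turning a positive/negative phrase pair into a preference record is shown below. The `build_preference_pair` helper and its prompt/response templates are illustrative assumptions, not the paper's exact wording; the record layout (prompt / chosen / rejected) matches the common format expected by DPO training code.

```python
def build_preference_pair(image_path, pos_phrase, neg_phrase):
    """Illustrative FINER-style preference pair: the query embeds a
    fine-grained anomaly; the chosen response rejects it and states the
    correct visual fact, while the rejected response confirms the false
    claim."""
    query = f"Does this image show {neg_phrase}?"
    chosen = f"No. The image actually shows {pos_phrase}."
    rejected = f"Yes, the image shows {neg_phrase}."
    return {"image": image_path, "prompt": query,
            "chosen": chosen, "rejected": rejected}
```

Positive queries (where the embedded phrase is correct and the chosen answer is “Yes”) would be built symmetrically, so the model learns both to confirm and to reject.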
Finally, we train the model with Direct Preference Optimization (DPO), so that it prefers grounded answers over hallucinated ones. This trains the model not only to answer correctly, but also to explicitly reject subtle false claims embedded in otherwise plausible queries.
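For reference, the standard DPO objective on a single preference pair can be computed from the policy's and the frozen reference model's sequence log-probabilities; the helper below is a plain restatement of that loss, with `beta` as the usual KL-trade-off hyperparameter (the value 0.1 is only a common default, not the paper's setting).

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probs:
    -log sigmoid(beta * ((log pi_w - log ref_w) - (log pi_l - log ref_l))).
    The loss shrinks as the policy assigns relatively more probability
    (vs. the reference) to the grounded answer than to the hallucinated one."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization the policy equals the reference, so the margin is zero and the loss is log 2; training pushes the margin positive on every pair.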
Results on FINER-Benchmarks
FINER is challenging even for strong frontier MLLMs. Performance drops sharply when models must reject subtle mistakes involving multiple objects, attributes, or relations, and the Wh setting remains particularly difficult. Prior hallucination-reduction methods that work on earlier benchmarks transfer poorly to FINER, showing that coarse hallucination benchmarks do not fully capture this problem.
FINER-Tuning consistently improves all four base models on both FINER-CompreCap and FINER-DOCCI. The gains are especially strong on the more fine-grained settings. For example, InternVL3.5-14B improves by up to 24.2% on FINER-CompreCap, and the tuned 14B model becomes competitive with much larger or closed models in several settings. We also find a clear trend: performance declines as the number of queried attributes or relations increases, but FINER-Tuning reduces this drop and brings larger gains precisely where the questions are hardest.
Results on General Hallucination Benchmarks
A key question is whether training on FINER only helps on FINER itself, or whether it generalizes to broader hallucination evaluation. Encouragingly, FINER-Tuning transfers well. Across eight existing hallucination benchmarks, it consistently improves Qwen2.5-VL and InternVL3.5 on both discriminative and generative settings. On DASH, for instance, it improves the two InternVL3.5 variants by 6.2% and 5.5%, and it also lowers hallucination on MMHal-Bench and improves scores on HaloQuest.
This matters because FINER-Tuning is not aimed at a single narrow benchmark. Instead, it teaches models to detect subtle contradictions in queries, and this stronger discrimination ability carries over to other hallucination suites as well.
Results on General Multimodal Capabilities
Hallucination reduction often comes with an alignment tax, a concern we had for FINER, since training with fine-grained negatives might intuitively lead to over-rejection behaviours. However, FINER-Tuning avoids this trade-off. On six general-purpose multimodal benchmarks, it maintains or improves the performance of strong base models, including gains on MMStar, MMVP, NaturalBench, and V* Bench. For InternVL3.5-14B, the average score improves by 1.4%. This suggests that FINER provides a useful training signal that complements, rather than damages, a model’s broader multimodal capabilities.
Ablation Studies
We conduct two ablation studies to understand what drives FINER-Tuning. First, we ablate the training strategy by comparing DPO against SFT, and by training with only negative queries versus both positive and negative queries. Interestingly, the results show that SFT can even hurt performance, while DPO is consistently stronger. In particular, training with both positive and negative queries gives the best overall results, showing that FINER-Tuning benefits from learning both to confirm correct statements and to reject subtle false ones. Second, we ablate training subset selection by training on only one subset at a time: Multi-obj, Multi-attr, Multi-rel, or Wh. Models trained on a single subset perform best on their matching test setting, but still transfer somewhat to other settings. However, training on all subsets gives the most balanced performance overall, suggesting that FINER-Tuning learns a broader fine-grained rejection capability rather than overfitting to one query type. We have also included a series of extra ablation studies in the supplementary material.
Qualitative Results on FINER-Benchmarks
Limitations and Future Work
FINER still has several limitations. Although we include human-in-the-loop filtering, the scale of the benchmarks, especially FINER-DOCCI, makes full manual curation impractical. As a result, some samples may still contain annotation errors or ambiguous cases (we will provide more detailed visualizations later). In addition, our current multi-relation setting is limited to at most three relations, which does not yet capture richer relational compositions that appear in real-world scenes.
We recommend using FINER-CompreCap as the primary benchmark for evaluating fine-grained hallucination, as it has more controlled annotations and lower noise. FINER-DOCCI is provided as an exploratory large-scale benchmark that extends FINER to open-set scenarios; with it, we mainly aim to test whether our findings still hold at scale. Due to its significantly larger scale and the open-set nature of its annotations, its labels may contain more noise. We are still working to improve the FINER benchmarks, and we believe a benchmark with full human validation is the right future direction.
A natural next step is to build a larger and more challenging version of FINER with more objects, attributes, and relations per image, while increasing the level of human validation. We view FINER as a starting point for studying hallucinations hidden inside fine-grained queries, and hope it encourages broader work in this direction. Looking ahead, we believe the same idea can be extended beyond general-domain vision-language benchmarks to high-stakes settings such as medicine, finance, and law, where even subtle fine-grained errors can be costly.
BibTeX
@inproceedings{xiao2026finer,
  title={FINER: MLLMs Hallucinate under Fine-grained Negative Queries},
  author={Xiao, Rui and Kim, Sanghwan and Xian, Yongqin and Akata, Zeynep and Alaniz, Stephan},
  booktitle={CVPR},
  year={2026}
}