A Visual Reasoning Dataset for Natural Language Explanations

1 University of Tübingen
2 MPI for Informatics
3 MPI for Intelligent Systems


To obtain detailed insights into the process of generating natural language explanations for VQA, we introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with natural language explanations.
For each image-question pair in the CLEVR dataset, CLEVR-X contains multiple structured textual explanations which are derived from the original scene graphs. By construction, the CLEVR-X explanations are correct and describe the reasoning and visual information that is necessary to answer a given question.
We conducted a user study to confirm that the ground-truth explanations in our proposed dataset are indeed complete and relevant. We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art frameworks on the CLEVR-X dataset. Furthermore, we provide a detailed analysis of the explanation generation quality for different question and answer types. Additionally, we study the influence of using different numbers of ground-truth explanations on the convergence of natural language generation (NLG) metrics.

CLEVR-X example.
Question: There is a purple metallic ball; what number of cyan objects are right of it?
Answer: 1
Explanation: There is a cyan cylinder which is on the right side of the purple metallic ball.
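
The example above can be reproduced with a small sketch showing how an answer and a structured explanation might both be derived from a scene graph. The scene layout (`objects`, `right_of`), the helper names, and the sentence templates are illustrative assumptions, not the CLEVR-X generation code or the actual CLEVR scene file format.

```python
# Toy scene graph (hypothetical, simplified format).
scene = {
    "objects": [
        {"color": "purple", "material": "metallic", "shape": "ball"},
        {"color": "cyan", "material": "rubber", "shape": "cylinder"},
        {"color": "red", "material": "rubber", "shape": "cube"},
    ],
    # right_of[i] lists the indices of the objects to the right of object i.
    "right_of": [[1, 2], [2], []],
}


def describe(obj, keys):
    """Join the selected attributes of an object into a short phrase."""
    return " ".join(obj[k] for k in keys)


def count_right_of(scene, anchor_attrs, target_color):
    """Find the unique anchor object and the target-colored objects right of it."""
    anchors = [
        i for i, o in enumerate(scene["objects"])
        if all(o[k] == v for k, v in anchor_attrs.items())
    ]
    if len(anchors) != 1:
        raise ValueError("the anchor description must match exactly one object")
    anchor = anchors[0]
    hits = [
        i for i in scene["right_of"][anchor]
        if scene["objects"][i]["color"] == target_color
    ]
    return anchor, hits


def explain(scene, anchor, hits):
    """Fill a fixed template (an assumption) with the objects that justify the answer."""
    anchor_text = "the " + describe(
        scene["objects"][anchor], ("color", "material", "shape")
    )
    if not hits:
        return f"There are no such objects on the right side of {anchor_text}."
    names = ["a " + describe(scene["objects"][i], ("color", "shape")) for i in hits]
    verb = "is" if len(hits) == 1 else "are"
    return (
        f"There {verb} {' and '.join(names)} which {verb} "
        f"on the right side of {anchor_text}."
    )


anchor, hits = count_right_of(
    scene,
    {"color": "purple", "material": "metallic", "shape": "ball"},
    "cyan",
)
answer = len(hits)
explanation = explain(scene, anchor, hits)
print(answer)
print(explanation)
```

Because the explanation is filled in from the same objects that determine the answer, it mentions every relevant object and cannot contradict the scene, which is the sense in which such explanations are correct by construction.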

Key Properties of CLEVR-X

The CLEVR-X dataset provides 10 semantically identical (but grammatically varied) textual explanations for each CLEVR sample. By design, all explanations in CLEVR-X are:
  • correct,
  • and complete (i.e. all relevant objects are described).

More details are provided in our paper.
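
Having several reference explanations per sample matters when scoring generated text, because a prediction only has to agree with one of the valid phrasings. The sketch below illustrates this with a clipped unigram precision, a deliberately simplified stand-in for n-gram based NLG metrics and not one of the metrics reported in the paper. The second reference and the candidate sentence are made-up paraphrases for illustration.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Fraction of candidate tokens that are covered by the references.

    Each token's count is clipped to the maximum count it has in any
    single reference, as in BLEU's modified unigram precision.
    """
    cand_counts = Counter(candidate.lower().split())
    max_ref_counts = Counter()
    for ref in references:
        for token, n in Counter(ref.lower().split()).items():
            max_ref_counts[token] = max(max_ref_counts[token], n)
    matched = sum(min(n, max_ref_counts[t]) for t, n in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


# First reference is taken from the example above; the second is a hypothetical paraphrase.
references = [
    "there is a cyan cylinder which is on the right side of the purple metallic ball",
    "to the right of the purple metallic ball there is a cyan cylinder",
]
candidate = "a cyan cylinder is to the right of the purple metallic ball"

# Score the same candidate against the first k references, for k = 1, 2.
scores = [
    clipped_unigram_precision(candidate, references[:k])
    for k in range(1, len(references) + 1)
]
print(scores)
```

With this metric, adding references can only keep the score the same or raise it, since the clipping limits never shrink. Other metrics may behave differently as the number of references grows.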

CLEVR-X Dataset Download

The CLEVR-X dataset consists of:
  • A training set of 2,401,275 natural language explanations for 70,000 images.
  • A validation set of 599,711 natural language explanations for 14,000 images.
  • A test set of 644,151 natural language explanations for 15,000 images.
We provide more information about the structure of the dataset and how to use it in our GitHub repository.


If you use the CLEVR-X dataset, please consider citing:
	title     = {CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations},
	author    = {Leonard Salewski and A. Sophia Koepke and Hendrik P. A. Lensch and Zeynep Akata},
	booktitle = {xxAI - Beyond explainable Artificial Intelligence},
	pages     = {85--104},
	year      = {2022},
	publisher = {Springer},


The authors thank the Amazon Mechanical Turk workers who participated in the user study. This work was supported by the DFG – EXC number 2064/1 – project number 390727645, by the DFG: SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms - project number: 276693517, by the ERC (853489 - DEXIM), and by the BMBF (FKZ: 01IS18039A). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting L. Salewski.