CLEVR-X

A Visual Reasoning Dataset for Natural Language Explanations

Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
University of Tübingen · MPI for Informatics · MPI for Intelligent Systems

Abstract

To obtain detailed insights into the process of generating natural language explanations for VQA, we introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with natural language explanations.
For each image-question pair in the CLEVR dataset, CLEVR-X contains multiple structured textual explanations which are derived from the original scene graphs. By construction, the CLEVR-X explanations are correct and describe the reasoning and visual information that is necessary to answer a given question.
We conducted a user study to confirm that the ground-truth explanations in our proposed dataset are indeed complete and relevant. We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art frameworks on the CLEVR-X dataset. Furthermore, we provide a detailed analysis of the explanation generation quality for different question and answer types. Additionally, we study the influence of using different numbers of ground-truth explanations on the convergence of natural language generation (NLG) metrics.

CLEVR-X example.
Question: There is a purple metallic ball; what number of cyan objects are right of it?
Answer: 1
Explanation: There is a cyan cylinder which is on the right side of the purple metallic ball.

Key Properties of CLEVR-X

The CLEVR-X dataset provides 10 semantically identical (but grammatically varied) textual explanations for each CLEVR sample. By design, all explanations in CLEVR-X are:
  • correct, and
  • complete (i.e. all relevant objects are described).

More details are provided in our paper.
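To make this concrete, here is a minimal sketch of what a single CLEVR-X sample could look like once paired with its explanations. The field names used here (image_filename, question, answer, explanations) are illustrative assumptions rather than the dataset's exact schema; the GitHub repository documents the actual file format.

# Hypothetical shape of one CLEVR-X sample. The field names are assumptions
# made for illustration, not the official schema (see the GitHub repository).
sample = {
    "image_filename": "CLEVR_train_000123.png",  # placeholder file name
    "question": "There is a purple metallic ball; what number of cyan objects are right of it?",
    "answer": "1",
    # Up to 10 semantically identical but grammatically varied explanations.
    "explanations": [
        "There is a cyan cylinder which is on the right side of the purple metallic ball.",
        # ... further paraphrases of the same explanation
    ],
}

# Every explanation describes all objects that are relevant to the answer.
for text in sample["explanations"]:
    print(text)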

CLEVR-X Dataset Download

The CLEVR-X dataset consists of:
  • A training set of 2,401,275 natural language explanations for 70,000 images.
  • A validation set of 599,711 natural language explanations for 14,000 images.
  • A test set of 644,151 natural language explanations for 15,000 images.
We provide more information about the structure of the dataset and how to use it in our GitHub repository.
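As a quick usage illustration, the sketch below loads one split and tallies images, questions, and explanations, which can be compared against the split sizes listed above. The file name and JSON keys used here (questions, image_filename, explanations) are assumptions made for this sketch; the GitHub repository documents the actual file names and layout.

import json

# Load one CLEVR-X split. The file name and keys are assumptions made for
# this sketch; consult the GitHub repository for the real file layout.
with open("CLEVR_train_explanations.json") as f:
    data = json.load(f)

questions = data["questions"]  # assumed: one entry per image-question pair

# Tally images and explanations to compare against the split sizes above.
images = {q["image_filename"] for q in questions}
num_explanations = sum(len(q["explanations"]) for q in questions)

print(f"{len(images)} images, {len(questions)} questions, {num_explanations} explanations")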

Citation

If you use the CLEVR-X dataset, please consider citing:
@inproceedings{salewski2022clevrx,
	title     = {CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations},
	author    = {Leonard Salewski and A. Sophia Koepke and Hendrik P. A. Lensch and Zeynep Akata},
	booktitle = {xxAI - Beyond explainable Artificial Intelligence},
	pages     = {85--104},
	year      = {2022},
	publisher = {Springer},
}

Acknowledgements

The authors thank the Amazon Mechanical Turk workers who participated in the user study. This work was supported by the DFG – EXC number 2064/1 – project number 390727645, by the DFG – SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms – project number 276693517, by the ERC (853489 – DEXIM), and by the BMBF (FKZ: 01IS18039A). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting L. Salewski.