Abstract
To obtain detailed insights into the process of generating natural language explanations for Visual Question Answering (VQA), we introduce the large-scale CLEVR-X dataset, which extends the CLEVR dataset with natural language explanations.
For each image-question pair in the CLEVR dataset, CLEVR-X contains multiple
structured textual explanations which are derived from the original scene graphs. By construction,
the CLEVR-X explanations are correct and describe the reasoning and visual information that is
necessary to answer a given question.
We conduct a user study to confirm that the ground-truth explanations in our proposed dataset are indeed complete and relevant.
We present baseline results for generating natural language explanations in the context of VQA using
two state-of-the-art frameworks on the CLEVR-X dataset. Furthermore, we provide a detailed analysis
of the explanation generation quality for different question and answer types.
Additionally, we study the influence of using different numbers of ground-truth explanations on the
convergence of natural language generation (NLG) metrics.
Key Properties of CLEVR-X
The CLEVR-X explanations are:
- correct (by construction, as they are derived from the ground-truth scene graphs),
- and complete (i.e. all relevant objects are described).
More details are provided in our paper.
CLEVR-X Dataset Download
- A training set of 2,401,275 natural language explanations for 70,000 images.
- A validation set of 599,711 natural language explanations for 14,000 images.
- A test set of 644,151 natural language explanations for 15,000 images.
Citation
If you use the CLEVR-X dataset, please consider citing:

@inproceedings{salewski2022clevrx,
  title     = {CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations},
  author    = {Leonard Salewski and A. Sophia Koepke and Hendrik P. A. Lensch and Zeynep Akata},
  booktitle = {xxAI - Beyond explainable Artificial Intelligence},
  pages     = {85--104},
  year      = {2022},
  publisher = {Springer},
}