Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems

Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, but the concepts they represent are identifiable from high-level image features, which reduces task complexity. In contrast, the recently released Bongard-RWR dataset aimed to represent the abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5\,400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle to discern fine-grained concepts, highlighting limitations in their reasoning capabilities.
