The potential of Vision-Language Models (\textsc{vlm}s) often remains underutilized in handling complex text-based problems, particularly when these problems could benefit from visual representation. Inspired by the way humans solve complex text-based problems by (1) drawing a diagram of the problem and (2) deducing the steps needed to solve it, we propose \textsc{Self-Imagine}. We leverage a single \textsc{vlm} to generate a structured representation of the question in HTML, render the HTML as an image, and then use the same \textsc{vlm} to answer the question given both the question and the image. Our approach requires no additional training or training data. We evaluate it on three mathematics tasks and nine general-purpose reasoning tasks using a state-of-the-art \textsc{vlm}. It improves \textsc{vlm} performance on all math tasks (\gsm: +4.62\%; \asdiv: +4.49\%; \svamp: +9.30\%) and on the majority of the general-purpose reasoning tasks, by 0.4\% to 13.20\%, while achieving comparable performance on the remaining tasks. Code and data are available at https://github.com/snat1505027/self-imagine .
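The three-step pipeline described above (generate HTML, render it, answer with question plus image) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name \texttt{self\_imagine}, the prompt wording, and the \texttt{vlm} and \texttt{render} callables are all hypothetical placeholders that the caller would supply (for instance, a wrapper around a model API and a headless-browser screenshot routine).

```python
from typing import Callable, Optional

# Hypothetical interfaces, assumed for illustration only:
#   vlm(prompt, image) -> generated text; image is None for a text-only call
#   render(html)       -> image bytes (e.g. a PNG screenshot of the HTML)
VLM = Callable[[str, Optional[bytes]], str]
Renderer = Callable[[str], bytes]


def self_imagine(question: str, vlm: VLM, render: Renderer) -> str:
    """Answer `question` with one VLM used twice: once to draw, once to solve."""
    # Step 1: text-only call asking the model for a structured HTML
    # representation of the question. The prompt wording is an assumption.
    html_prompt = (
        "Write HTML that lays out the key information in this question "
        "as a structured diagram.\n\nQuestion: " + question
    )
    html = vlm(html_prompt, None)

    # Step 2: turn the generated HTML into an image.
    image = render(html)

    # Step 3: call the same model again, now with both the original
    # question and the rendered image.
    answer_prompt = (
        "Answer the question using the question text and the attached "
        "image.\n\nQuestion: " + question
    )
    return vlm(answer_prompt, image)
```

Passing the model and the renderer in as callables keeps the sketch independent of any particular \textsc{vlm} or rendering library, and makes explicit that no training step is involved: the method consists only of two inference calls and one rendering call.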