Current foundation models exhibit impressive capabilities when prompted either with text only or with both image and text inputs. But do their capabilities change depending on the input modality? In this work, we propose $\textbf{IsoBench}$, a benchmark dataset containing problems from four major areas: math, science, algorithms, and games. Each example is presented with multiple $\textbf{isomorphic representations}$ of the input, such as visual, textual, and mathematical forms. IsoBench provides fine-grained feedback to diagnose performance gaps caused by the form of the representation. Across various foundation models, we observe that on the same problem, models show a consistent preference for textual representations. Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when provided with images instead of text; similarly, GPT-4 Turbo is 18.7 points worse and Gemini Pro is 14.9 points worse. Finally, we present two prompting techniques, $\textit{IsoCombination}$ and $\textit{IsoScratchPad}$, which improve model performance by considering combinations of, and translations between, different input representations.
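To make the two techniques concrete, the sketch below illustrates one way they might be implemented; this is an illustrative reading of the abstract, not the authors' released code, and the `query(prompt, image=None)` wrapper is a hypothetical stand-in for any multimodal model API.

```python
from typing import Optional

def query(prompt: str, image: Optional[bytes] = None) -> str:
    """Hypothetical wrapper around a multimodal foundation model API."""
    raise NotImplementedError("Wire this to your model provider of choice.")

def iso_combination(question: str, text_repr: str, image_repr: bytes) -> str:
    # IsoCombination: present several isomorphic representations of the same
    # problem together, letting the model cross-reference them before answering.
    prompt = (
        f"{question}\n\n"
        f"Textual representation of the problem:\n{text_repr}\n\n"
        "An image of the same problem is attached."
    )
    return query(prompt, image=image_repr)

def iso_scratchpad(question: str, image_repr: bytes) -> str:
    # IsoScratchPad: first translate the visual representation into a textual
    # one, then solve the problem using the translated text alone.
    transcription = query(
        "Describe the attached figure as a precise textual representation.",
        image=image_repr,
    )
    return query(f"{question}\n\nTextual representation:\n{transcription}")
```

Under this reading, IsoScratchPad deliberately routes the final answer through the textual representation, since that is the modality the evaluated models consistently favor.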