The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery, but that celebration rests on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., they exploit strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to unmask and quantify the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics, the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing, which culminate in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data and thereby achieve genuine multimodal reasoning.
