Recent work has shown that Multimodal Large Language Models (MLLMs) are highly vulnerable to hidden-pattern visual illusions, in which the hidden content is obvious to humans but imperceptible to models. This deficiency reveals a perceptual misalignment between current MLLMs and humans and raises potential safety concerns. To investigate this failure systematically, we introduce IlluChar, a comprehensive and challenging illusion dataset, and identify a key mechanism behind the failure: high-frequency attention bias, in which models are distracted by high-frequency background textures in illusion images and therefore overlook the hidden patterns. To address this issue, we propose the Strategy of Multi-Scale Perception (SMSP), a plug-and-play framework that aligns with human visual perceptual strategies. By suppressing distracting high-frequency backgrounds, SMSP generates images that are closer to what humans perceive. Our experiments demonstrate that SMSP significantly improves the performance of all evaluated MLLMs on illusion images; for instance, it increases the accuracy of Qwen3-VL-8B-Instruct from 13.0% to 84.0%. Our work provides novel insights into the visual perception of MLLMs and offers a practical and robust way to enhance it. Our code is publicly available at https://github.com/Tujz2023/SMSP.
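To make the idea of suppressing high-frequency backgrounds concrete, the following is a minimal sketch and not the SMSP implementation from the repository. It assumes that block averaging (a crude low-pass filter) is an acceptable stand-in for whatever filtering SMSP actually uses, and it works on a toy grayscale image stored as a list of rows. The names `average_pool`, `multi_scale_views`, and `make_toy_illusion` are hypothetical and introduced only for illustration.

```python
def average_pool(image, factor):
    """Downsample a 2-D grayscale image (list of rows) by averaging
    non-overlapping factor x factor blocks. Averaging acts as a crude
    low-pass filter (an assumption, not the authors' filter)."""
    rows = len(image) - len(image) % factor      # crop to a multiple of factor
    cols = len(image[0]) - len(image[0]) % factor
    pooled = []
    for r in range(0, rows, factor):
        pooled_row = []
        for c in range(0, cols, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            pooled_row.append(sum(block) / len(block))
        pooled.append(pooled_row)
    return pooled


def multi_scale_views(image, factors=(1, 2, 4)):
    """Return one pooled view of the image per downsampling factor."""
    return [average_pool(image, f) for f in factors]


def make_toy_illusion(size=16):
    """Build a toy 'illusion': a low-contrast hidden square (value 50)
    overlaid with a high-contrast one-pixel checkerboard (0 or 200)."""
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            inside = size // 4 <= y < 3 * size // 4 and size // 4 <= x < 3 * size // 4
            hidden = 50 if inside else 0
            texture = 200 * ((x + y) % 2)
            row.append(hidden + texture)
        image.append(row)
    return image


if __name__ == "__main__":
    views = multi_scale_views(make_toy_illusion(16))
    for factor, view in zip((1, 2, 4), views):
        distinct = sorted({value for row in view for value in row})
        print(f"factor {factor}: {len(view)}x{len(view[0])}, distinct values {distinct}")
```

In this toy case the full-resolution view mixes four intensity levels dominated by the checkerboard, whereas the pooled views contain only two levels, one for the background and one for the hidden square. A real pipeline would presumably pass such coarser views to the MLLM alongside or instead of the original image; how SMSP selects and combines scales is not specified in the abstract.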