Large Vision-Language Models (LVLMs) exhibit strong performance on single-image tasks. However, their performance degrades significantly when handling multi-image inputs. While this degradation has been observed in prior work, its nature remains poorly understood. We empirically observe that visual elements from different images become entangled in the model's representations and responses. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic method. FOCUS masks all but one image with random noise, guiding the model to focus on the single clean image. This process is repeated for each target image to obtain logits under partially masked contexts. These logits are aggregated and then refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance on diverse multi-image benchmarks. We further show that FOCUS generalizes to video understanding, extending its applicability beyond static multi-image inputs. These results demonstrate that FOCUS offers a general solution for enhancing multi-image reasoning without additional training or architectural modifications.
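
The masking, aggregation, and refinement steps described above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the noise distribution, the aggregation rule, or the form of the refinement, so Gaussian noise, mean aggregation, and subtraction of the noise-only logits scaled by a weight `alpha` are all assumptions. The names `focus_logits`, `model_fn`, and `toy_model` are hypothetical stand-ins, with `model_fn` playing the role of an LVLM forward pass that returns next-token logits.

```python
import numpy as np


def focus_logits(images, model_fn, rng, alpha=1.0):
    """Hypothetical sketch of a FOCUS-style decoding step.

    images:   list of arrays, one per input image
    model_fn: callable mapping a list of images to a 1-D logit vector
              (stand-in for an LVLM forward pass)
    rng:      numpy random Generator used to draw the masking noise
    alpha:    assumed weight on the noise-only reference logits
    """
    # Assumption: Gaussian noise with the same shape as each image.
    noise = [rng.standard_normal(img.shape) for img in images]

    # One pass per target image: keep image i clean, mask all others.
    per_image = []
    for i in range(len(images)):
        context = [images[j] if j == i else noise[j]
                   for j in range(len(images))]
        per_image.append(model_fn(context))

    # Assumption: aggregate the partially masked logits by averaging.
    aggregated = np.mean(per_image, axis=0)

    # Noise-only reference input: every image replaced by noise.
    reference = model_fn(noise)

    # Assumption: refine by subtracting the scaled reference logits.
    return aggregated - alpha * reference


def toy_model(context):
    # Toy stand-in for an LVLM: one score per image slot (mean intensity).
    return np.array([float(np.mean(img)) for img in context])


rng = np.random.default_rng(0)
images = [np.full((4, 4), 1.0), np.full((4, 4), 5.0)]
logits = focus_logits(images, toy_model, rng)
```

Under these assumptions the procedure costs one forward pass per image plus one for the noise-only reference, which is the price of staying training-free.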