Working memory is a central component of intelligent behavior, providing a dynamic workspace for maintaining and updating task-relevant information. Recent work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations when information is carried in a visual rather than a textual code in vision-language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids. Across conditions, models show reliably higher accuracy and d' with text than with vision. To interpret these differences at the process level, we use trial-wise log-probability evidence and find that nominal 2-back and 3-back behavior often fails to reflect the instructed lag, aligning instead with a recency-locked comparison. We further show that grid size alters the recent-repeat structure of the stimulus stream, thereby changing interference and error patterns. These results motivate computation-sensitive interpretations of multimodal working memory.
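To make the two scoring notions above concrete, the following is a minimal sketch (with hypothetical function names and made-up data, not the paper's analysis code) of how an instructed-lag n-back rule differs from a recency-locked (1-back) rule on a spatial stream, and how d' is computed from hits and false alarms:

```python
from statistics import NormalDist

def targets(stream, lag):
    """Mark each trial as a target if it matches the item `lag` steps back."""
    return [i >= lag and stream[i] == stream[i - lag] for i in range(len(stream))]

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Signal-detection d' = z(hit rate) - z(false-alarm rate), with a
    log-linear correction so rates of 0 or 1 do not yield infinite z-scores."""
    hr = (hits + 0.5) / (hits + misses + 1.0)
    far = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf
    return z(hr) - z(far)

# Hypothetical 2-back stream of grid positions.
stream = [(0, 1), (2, 2), (0, 1), (2, 2), (2, 2), (0, 1)]
instructed = targets(stream, lag=2)  # ground-truth 2-back targets
recency    = targets(stream, lag=1)  # what a recency-locked comparator flags

# A model whose nominal 2-back responses track `recency` rather than
# `instructed` will false-alarm on immediate repeats and miss true 2-back
# matches; smaller grids make such incidental immediate repeats more likely.
```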