Large language models often fail to follow formatting instructions while simultaneously performing demanding tasks. We study this behaviour through a lens inspired by prospective memory research in cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (which require an action at the response boundary) degrade most, with drops of up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring it to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results are obtained on publicly available datasets using deterministic programmatic checkers, with no LLM-as-judge component.
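To make the constraint types concrete, the following is a minimal Python sketch of what deterministic checkers for a terminal constraint and an avoidance constraint might look like, together with a joint-compliance check for stacked constraints and a salience-enhanced prompt wrapper. All function names, the `[END]` marker, the forbidden word, and the prompt wording are illustrative assumptions, not the checkers or templates used in the study.

```python
import re
from typing import Callable, List

# A checker maps a model response to pass/fail, with no LLM judge involved.
Checker = Callable[[str], bool]


def ends_with_marker(marker: str) -> Checker:
    """Terminal constraint (hypothetical): response must end with `marker`."""
    return lambda response: response.rstrip().endswith(marker)


def avoids_word(word: str) -> Checker:
    """Avoidance constraint (hypothetical): `word` must not appear as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return lambda response: pattern.search(response) is None


def joint_compliance(response: str, checkers: List[Checker]) -> bool:
    """Stacked constraints: the response complies only if every checker passes."""
    return all(check(response) for check in checkers)


def salient_prompt(task: str, constraint: str) -> str:
    """Assumed salience-enhanced layout: explicit framing plus a trailing reminder."""
    return (
        f"IMPORTANT FORMATTING INSTRUCTION: {constraint}\n\n"
        f"{task}\n\n"
        f"Reminder: {constraint}"
    )


checks = [ends_with_marker("[END]"), avoids_word("therefore")]
print(joint_compliance("The answer is 42. [END]", checks))
print(joint_compliance("Therefore, the answer is 42.", checks))
```

Because each checker is a pure function of the response string, the same response always receives the same verdict, which is the property that lets compliance be measured without a judge model.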