State space models (SSMs) have been shown to possess the theoretical capacity to model both star-free sequential tasks and bounded hierarchical structures (Sarrof et al., 2024). However, formal expressivity results do not guarantee that gradient-based optimisation will reliably discover the corresponding solutions. Existing benchmarks probe either monotonic state tracking, as in the standard Flip-Flop task, or structural nesting, as in the Dyck languages, but neither isolates reversible semantic state retrieval. We introduce the UNDO Flip-Flop task to fill this gap. By extending the standard Flip-Flop with an UNDO operation, the task requires a model to maintain an implicit bounded stack and recover historical states under non-monotonic update sequences. We evaluate one-layer and two-layer Mamba-2 models under this framework. Both variants fail to acquire the provably expressible stack-based rollback mechanism, converging instead on a local toggle heuristic that inverts the current state rather than retrieving stored history. Under an adversarial retraction pressure test held within the training length distribution, the two-layer model collapses to 41.10% accuracy, which is below random chance. Accuracy below chance indicates that the failure is systematic rather than incidental. Causal ablation shows that the bottleneck lies in retrieval, not storage. These results draw a clear line between what an architecture can in principle represent and what gradient descent reliably learns, a distinction that theoretical expressivity analyses alone cannot capture.
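To make the contrast between stack-based rollback and the toggle heuristic concrete, the following is a minimal Python sketch, not the benchmark's actual implementation. The token format (`"W"` for write, `"U"` for UNDO, `"R"` for read), the function names, the `max_depth` parameter, and the rule that an UNDO with nothing left to retract is a no-op are all assumptions made for illustration.

```python
from collections import deque

def stack_reference(ops, max_depth=4):
    """Reads under stack-based rollback: UNDO restores the previously written bit.

    Assumed semantics (hypothetical, for illustration only):
    - ("W", b) pushes bit b; the stack is bounded, so the oldest entry is
      dropped once max_depth entries are stored.
    - ("U", None) pops the latest write if an earlier one remains.
    - ("R", None) emits the bit on top of the stack.
    """
    history = deque(maxlen=max_depth)
    reads = []
    for op, bit in ops:
        if op == "W":
            history.append(bit)
        elif op == "U":
            if len(history) > 1:  # assumption: nothing to retract means no-op
                history.pop()
        elif op == "R":
            reads.append(history[-1] if history else None)
    return reads

def toggle_heuristic(ops):
    """Reads under the local shortcut: UNDO simply inverts the current bit."""
    state = None
    reads = []
    for op, bit in ops:
        if op == "W":
            state = bit
        elif op == "U":
            if state is not None:
                state = 1 - state  # flips the bit, no stored history consulted
        elif op == "R":
            reads.append(state)
    return reads

# Two writes of the same bit followed by UNDO separate the two strategies;
# a write of a different bit followed by UNDO does not.
ops = [("W", 1), ("W", 1), ("U", None), ("R", None),
       ("W", 0), ("U", None), ("R", None)]

print(stack_reference(ops))   # [1, 1]
print(toggle_heuristic(ops))  # [0, 1]
```

In this sketch the two strategies agree whenever the retracted write differs from its predecessor and disagree whenever the same bit was written twice in a row. This suggests one way in which a toggle shortcut could score well on typical sequences yet fall below chance on sequences chosen to stress retraction.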