Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, applying this paradigm to Transformers is hindered by representation entanglement (e.g., superposition): features encoded in overlapping directions obstruct the recovery of symbolic expressions. We propose the Discrete Transformer, an architecture explicitly designed to bridge the gap between continuous representations and discrete symbolic logic. By injecting discreteness through temperature-annealed sampling, our framework makes it possible to apply hypothesis testing and symbolic regression to extract human-readable programs. Empirically, the Discrete Transformer achieves performance comparable to RNN-based methods while extending interpretability to continuous variable domains, and its annealing dynamics exhibit a clear exploration-to-exploitation transition. Finally, we show that architectural inductive biases provide fine-grained control over the synthesized programs, establishing the Discrete Transformer as a robust framework for demonstration-free algorithm discovery and Transformer interpretability.
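To make the idea of temperature-annealed sampling concrete, the following is a minimal sketch rather than the method used in this work. The abstract does not specify the sampling estimator or the schedule, so the sketch assumes a Gumbel-softmax relaxation and an exponential temperature decay; the function names `gumbel_softmax`, `annealed_temperature`, and `mean_peak_probability`, as well as the values of `t_start` and `t_end`, are hypothetical choices made for illustration.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample (assumed estimator, not from the paper).

    Lower temperatures push the returned vector toward a one-hot vector.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise via inverse transform sampling: -log(-log(u)).
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the noisy, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def annealed_temperature(step, total_steps, t_start=5.0, t_end=0.05):
    """Exponential decay from t_start to t_end (hypothetical schedule)."""
    frac = step / max(total_steps - 1, 1)
    return t_start * (t_end / t_start) ** frac


def mean_peak_probability(logits, temperature, n_samples=2000, seed=0):
    """Average largest entry of the sample; values near 1 mean near-discrete."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(gumbel_softmax(logits, temperature, rng))
    return total / n_samples


if __name__ == "__main__":
    logits = [1.0, 0.5, 0.0]
    total_steps = 5
    for step in range(total_steps):
        t = annealed_temperature(step, total_steps)
        print(f"step={step} temperature={t:.3f} "
              f"mean peak={mean_peak_probability(logits, t):.3f}")
```

Under these assumptions, samples at high temperature stay spread across the options (exploration), while samples at low temperature concentrate almost all mass on a single option (exploitation), which is the qualitative transition the abstract describes.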