A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
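The dispersion effect described above can be seen in a minimal numerical sketch (illustrative only, not the paper's construction): when logits are confined to a bounded range, the largest softmax weight inevitably shrinks as the number of items grows, and lowering the temperature at inference time re-sharpens the distribution.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature knob."""
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# With logits bounded in [0, 1], the winner's weight dilutes as n grows:
# the maximal exponentiated logit is at most e, while the normaliser
# grows linearly in n.
sharpness = {}
for n in (10, 100, 1000):
    logits = rng.uniform(0.0, 1.0, size=n)
    sharpness[n] = softmax(logits).max()
    print(f"n={n:4d}  max softmax weight = {sharpness[n]:.4f}")

# Dividing logits by a temperature below 1 (the idea behind adaptive
# temperature) sharpens the distribution again at inference time.
logits = rng.uniform(0.0, 1.0, size=1000)
print("T=1.0 :", softmax(logits, temperature=1.0).max())
print("T=0.1 :", softmax(logits, temperature=0.1).max())
```

The decay of the maximal weight here is roughly proportional to 1/n, which is why no fixed set of learned logits can stay sharp for arbitrarily long inputs.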