A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its ability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental inability of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad hoc technique for improving the sharpness of softmax at inference time.
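The dispersion effect and the temperature-based correction can both be seen numerically. Below is a minimal sketch (not the paper's exact procedure): it assumes a single "winning" logit that exceeds all others by a fixed margin, shows how the probability assigned to it decays as the number of items grows, and how dividing the logits by a smaller, hand-picked temperature re-sharpens the distribution. The fixed temperature of 0.25 used here is an illustrative choice, not the adaptive schedule proposed in the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Dispersion: one "winning" key whose logit exceeds all others by a fixed
# margin. As the number of items n grows, the probability of the winner
# decays like 1 / (1 + (n - 1) * exp(-margin)), i.e. it tends to zero.
margin = 2.0
for n in [8, 64, 512, 4096]:
    logits = np.zeros(n)
    logits[0] = margin                       # index 0 is the true maximum
    p_max = softmax(logits)[0]
    # Lowering the temperature at inference time counteracts the dispersion.
    p_max_sharp = softmax(logits, temperature=0.25)[0]
    print(f"n={n:5d}  p(max)={p_max:.3f}  p(max, T=0.25)={p_max_sharp:.3f}")
```

Running the sketch, p(max) drops from roughly 0.51 at n=8 to below 0.002 at n=4096, while the temperature-scaled version stays substantially sharper, which is the qualitative behaviour the abstract describes.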