A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function: it cannot robustly approximate sharp functions as the problem size increases. We prove this phenomenon theoretically, and propose adaptive temperature as an ad hoc technique for improving the sharpness of softmax at inference time.
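The dispersion phenomenon described above can be illustrated with a minimal sketch (the variable names and the specific logit values are illustrative, not taken from the paper): with one "maximum" key at logit b and n - 1 distractors at logit 0, softmax assigns the maximum a weight of e^b / (e^b + n - 1), which decays towards 0 as n grows. Lowering the temperature at inference time is shown as a crude stand-in for the adaptive-temperature idea.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with an optional temperature.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

# One "maximum" key with logit b, and n - 1 distractors with logit 0.
# The weight on the maximum is e^b / (e^b + n - 1), which decays to 0
# as n grows: any fixed logit gap disperses at large enough n.
b = 5.0
for n in [10, 1_000, 100_000]:
    logits = np.zeros(n)
    logits[0] = b
    print(f"n={n:>7}  weight on max = {softmax(logits)[0]:.4f}")

# Sharpening at inference time by lowering the temperature
# (an ad hoc stand-in for adaptive temperature):
print(f"temperature 0.25: weight on max = {softmax(logits, temperature=0.25)[0]:.4f}")
```

Note that for any fixed temperature the same decay argument applies again at larger n; this is why the sharpening is described as an inference-time patch rather than a cure.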