It is typically understood that training a modern neural network is a process of fitting the probability distribution of the desired output. However, recent paradoxical observations in a number of language generation tasks raise the question of whether this canonical probability-based explanation can really account for the empirical success of deep learning. To resolve this issue, we propose an alternative utility-based explanation of the standard supervised learning procedure in deep learning. The basic idea is to interpret the learned neural network not as a probability model but as an ordinal utility function that encodes the preferences revealed in the training data. From this perspective, training the neural network corresponds to a utility learning process. Specifically, we show that for all neural networks with softmax outputs, the SGD learning dynamics of maximum likelihood estimation (MLE) can be seen as an iterative process that optimizes the neural network toward an optimal utility function. This utility-based interpretation can explain several otherwise paradoxical observations about neural networks trained in this way. Moreover, our utility-based theory also entails an equation that transforms the learned utility values back into a new kind of probability estimate, with which probability-compatible decision rules enjoy dramatic (double-digit) performance improvements. Collectively, this evidence reveals a utility-probability duality in what modern neural networks are (truly) modeling: we thought they were one thing (probabilities) until the unexplainable showed up; changing mindset and treating them as another thing (utility values) largely reconciles the theory, despite remaining subtleties regarding their original (probabilistic) identity.
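As a point of reference for the softmax/MLE setting mentioned above, the following is a minimal sketch of a single SGD step on the log-likelihood of one observed output. It is not the paper's derivation or its utility-to-probability equation; it only illustrates the standard softmax gradient, under the simplifying assumption that the logits themselves are the trainable parameters. The function names `softmax` and `mle_sgd_step` and the learning rate are hypothetical, chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mle_sgd_step(logits, observed, lr=0.5):
    """One gradient-ascent step on log softmax(logits)[observed].

    Assumption: the logits are treated directly as parameters. In a real
    network this gradient would be backpropagated into the weights.
    """
    probs = softmax(logits)
    # d/dz_k log softmax(z)[observed] = 1[k == observed] - probs[k]
    return [
        z + lr * ((1.0 if k == observed else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0]
updated = mle_sgd_step(logits, observed=1)
print(updated)
```

Each step raises the score of the observed output and lowers the scores of the alternatives, so repeated steps push the observed output up in the ranking. That is at least consistent with reading the scores as ordinal preferences, though whether it matches the paper's formal argument is not something this sketch establishes.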