It is typically understood that training a modern neural network amounts to fitting the probability distribution of the desired output. However, recent paradoxical observations in a number of language generation tasks raise the question of whether this canonical probability-based explanation can really account for the empirical success of deep learning. To resolve this issue, we propose an alternative, utility-based explanation of the standard supervised learning procedure in deep learning. The basic idea is to interpret the learned neural network not as a probability model but as an ordinal utility function that encodes the preferences revealed in the training data. From this perspective, training the neural network corresponds to a utility learning process. Specifically, we show that for all neural networks with softmax outputs, the SGD learning dynamics of maximum likelihood estimation (MLE) can be seen as an iterative process that optimizes the neural network toward an optimal utility function. This utility-based interpretation can explain several otherwise paradoxical observations about neural networks trained in this way. Moreover, our utility-based theory also entails an equation that transforms the learned utility values back into a new kind of probability estimate, with which probability-compatible decision rules enjoy dramatic (double-digit) performance improvements. Taken together, this evidence reveals a utility-probability duality in what modern neural networks are (truly) modeling: we thought they were one thing (probabilities) until the unexplainable showed up; changing mindset and treating them as another thing (utility values) largely reconciles the theory, despite remaining subtleties regarding their original (probabilistic) identity.
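As a rough illustration of the softmax/MLE claim, the sketch below applies plain SGD to the negative log-likelihood of a softmax, taking the gradient directly with respect to the pre-softmax scores. It relies only on the standard identity that this gradient equals the softmax probabilities minus the one-hot vector of the observed output. The function names, the three-option toy setup, and the learning rate are assumptions made for illustration; this is not the paper's derivation, and it does not reproduce the utility-to-probability equation mentioned above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(scores, observed, lr=0.5):
    """One SGD step on -log softmax(scores)[observed], taken w.r.t. the scores.

    The gradient is softmax(scores) - onehot(observed), so the step raises the
    score of the observed option and lowers the scores of all others.
    """
    probs = softmax(scores)
    grad = [p - (1.0 if i == observed else 0.0) for i, p in enumerate(probs)]
    return [s - lr * g for s, g in zip(scores, grad)]

# Hypothetical toy case: three candidate outputs, option 1 is the one observed.
scores = [0.0, 0.0, 0.0]
for _ in range(3):
    scores = sgd_step(scores, observed=1)

# Reading the scores ordinally: sort the options from most to least preferred.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Two properties are worth noting under this reading. Each update sums to zero across options, so it only moves scores relative to one another, and adding a constant to every score leaves the softmax output unchanged, so only score differences are identified by the likelihood.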