Numerous models for supervised and reinforcement learning benefit from combining discrete and continuous model components. End-to-end learnable discrete-continuous models are compositional, tend to generalize better, and are more interpretable. A popular approach to building discrete-continuous computation graphs is to integrate discrete probability distributions into neural networks using stochastic softmax tricks. Prior work has mainly focused on computation graphs with a single discrete component on each of the graph's execution paths. We analyze the behavior of more complex stochastic computation graphs with multiple sequential discrete components. We show that the parameters of these models are challenging to optimize, mainly because of small gradients and local minima. We then propose two new strategies to overcome these challenges. First, we show that increasing the scale parameter of the Gumbel noise perturbations during training improves the learning behavior. Second, we propose dropout residual connections specifically tailored to stochastic, discrete-continuous computation graphs. With an extensive set of experiments, we show that we can train complex discrete-continuous models that cannot be trained with standard stochastic softmax tricks. We also show that complex discrete-stochastic models generalize better than their continuous counterparts on several benchmark datasets.
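As a rough illustration of the first strategy, the sketch below draws a Gumbel-softmax sample, one common stochastic softmax trick, with an explicit scale parameter on the Gumbel noise. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the `noise_scale` and `temperature` parameters, and the use of plain Python instead of a deep learning framework are all choices made here for illustration.

```python
import math
import random


def sample_gumbel(scale, rng):
    """Draw one sample from Gumbel(0, scale) by inverse transform sampling."""
    # Clamp u away from 0 so that both logarithms stay finite.
    u = max(rng.random(), 1e-12)
    return -scale * math.log(-math.log(u))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature=1.0, noise_scale=1.0, rng=None):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / temperature).

    noise_scale is the scale of the Gumbel perturbation (illustrative name);
    a value of 1.0 corresponds to the standard Gumbel-softmax trick.
    """
    rng = rng or random.Random()
    perturbed = [logit + sample_gumbel(noise_scale, rng) for logit in logits]
    return softmax([p / temperature for p in perturbed])


def argmax_switch_rate(logits, noise_scale, n_samples, seed):
    """Fraction of samples whose largest entry is not the largest logit."""
    rng = random.Random(seed)
    top = logits.index(max(logits))
    switched = 0
    for _ in range(n_samples):
        sample = gumbel_softmax(logits, noise_scale=noise_scale, rng=rng)
        if sample.index(max(sample)) != top:
            switched += 1
    return switched / n_samples
```

In this sketch, calling `argmax_switch_rate([2.0, 0.0, -1.0], noise_scale=3.0, n_samples=2000, seed=0)` and comparing it with the same call at `noise_scale=0.5` shows that a larger noise scale makes the relaxed sample select non-maximal categories more often, that is, the discrete component explores more during training.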