Attention models are typically learned by optimizing one of three standard loss functions, commonly referred to as soft attention, hard attention, and latent variable marginal likelihood (LVML) attention. All three paradigms share the same goal of finding two models: a `focus' model that `selects' the right \textit{segment} of the input, and a `classification' model that processes the selected segment into the target label. However, they differ significantly in how the selected segments are aggregated, which leads to distinct training dynamics and final results. We observe that models learned under each paradigm carry a distinct signature, and we explain this as a consequence of how the classification model evolves under gradient descent when the focus model is fixed. We also analyze these paradigms in a simple setting and derive closed-form expressions for the parameter trajectory under gradient flow. With the soft attention loss, the focus model improves quickly at initialization and stalls later on. The hard attention loss, on the other hand, behaves in the opposite fashion. Based on these observations, we propose a simple hybrid approach that combines the advantages of the different loss functions, and we demonstrate it on a collection of semi-synthetic and real-world datasets.
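To make the difference in aggregation concrete, the following is a minimal sketch of the three losses for a single example. It is not taken from the paper: it assumes the formulations commonly used for these paradigms (soft attention classifies the attention-weighted average of the segments, hard attention takes the expected per-segment loss under the attention distribution, and LVML takes the negative log of the attention-weighted per-segment likelihood). The linear focus and classification models and the names `attention_losses`, `focus_w`, and `class_W` are hypothetical, chosen only for illustration.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(z - np.max(z))
    return e / e.sum()


def log_softmax(z):
    # Numerically stable log-softmax over a 1-D array.
    z = z - np.max(z)
    return z - np.log(np.exp(z).sum())


def attention_losses(segments, label, focus_w, class_W):
    """Return (soft, hard, lvml) losses for one example.

    segments: (m, d) array, one row per input segment
    label:    integer class index
    focus_w:  (d,) weights of a linear focus model (assumption)
    class_W:  (d, k) weights of a linear classifier (assumption)
    """
    # Focus model: a distribution over the m segments.
    alpha = softmax(segments @ focus_w)

    # Soft attention: average the segments first, then classify once.
    pooled = alpha @ segments
    soft = -log_softmax(pooled @ class_W)[label]

    # Per-segment log-likelihood of the true label.
    logp = np.array([log_softmax(s @ class_W)[label] for s in segments])

    # Hard attention: expected loss when one segment is drawn from alpha.
    hard = float(alpha @ (-logp))

    # LVML: marginalize the segment choice inside the logarithm.
    lvml = float(-np.log(alpha @ np.exp(logp)))

    return float(soft), hard, lvml


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    segs = rng.normal(size=(5, 4))
    print(attention_losses(segs, 1, rng.normal(size=4), rng.normal(size=(4, 3))))
```

Under these assumed definitions, the LVML loss never exceeds the hard attention loss for the same parameters (Jensen's inequality applied to the convex function $-\log$), and all three losses coincide when the input has a single segment.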