Attention models are typically learned by optimizing one of three standard loss functions, variously called soft attention, hard attention, and latent variable marginal likelihood (LVML) attention. All three paradigms are motivated by the same goal of finding two models: a `focus' model that `selects' the right \textit{segment} of the input, and a `classification' model that processes the selected segment into the target label. However, they differ significantly in the way the selected segments are aggregated, resulting in distinct dynamics and final results. We observe a unique signature of models learned using these paradigms and explain this as a consequence of the evolution of the classification model under gradient descent when the focus model is fixed. We also analyze these paradigms in a simple setting and derive closed-form expressions for the parameter trajectory under gradient flow. With the soft attention loss, the focus model improves quickly at initialization and splutters later on. The hard attention loss, on the other hand, behaves in the opposite fashion. Based on our observations, we propose a simple hybrid approach that combines the advantages of the different loss functions, and we demonstrate it on a collection of semi-synthetic and real-world datasets.
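To make the difference in aggregation concrete, the following is a minimal sketch of the three losses for a single example, written in the form these objectives commonly take. It is an illustration under stated assumptions, not the implementation used in this work: the linear focus and classification models, the function name `attention_losses`, and the parameters `focus_w` and `clf_W` are all hypothetical. The hard attention loss is written as its expectation over segments; in practice it is often estimated by sampling a segment.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def attention_losses(segments, label, focus_w, clf_W):
    """Return (soft, hard, lvml) losses for one example.

    segments: (m, d) array, one row per input segment
    label:    integer class index
    focus_w:  (d,) weights of an assumed linear focus model
    clf_W:    (d, k) weights of an assumed linear classification model
    """
    # Focus model: a distribution over the m segments.
    alpha = softmax(segments @ focus_w)

    # Soft attention: average the segments first, then classify once.
    mixed = alpha @ segments
    soft = -np.log(softmax(mixed @ clf_W)[label])

    # Probability of the correct label when classifying each segment alone.
    p = np.array([softmax(s @ clf_W)[label] for s in segments])

    # Hard attention (in expectation): average the per-segment losses.
    hard = np.sum(alpha * -np.log(p))

    # LVML: average the per-segment likelihoods, then take the log.
    lvml = -np.log(np.sum(alpha * p))

    return soft, hard, lvml


# Usage on random data (shapes chosen only for illustration).
rng = np.random.default_rng(0)
segments = rng.normal(size=(5, 4))
soft, hard, lvml = attention_losses(
    segments, label=1, focus_w=rng.normal(size=4), clf_W=rng.normal(size=(4, 3))
)
print(soft, hard, lvml)
```

In this form, the three losses differ only in where the focus weights enter: before the classifier (soft), outside the log (hard), or inside the log (LVML). By Jensen's inequality the LVML loss never exceeds the expected hard attention loss, and all three coincide when there is a single segment.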