We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several variants of linear temporal logic, whose formulas have previously been shown to be computable using hard attention transformers. We demonstrate how soft attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate a large subclass of average-hard attention transformers, namely those with what we call the uniform-tieless property.
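To make the temperature-scaling idea concrete, the following is a minimal illustrative sketch (not code from the paper): it shows how dividing attention scores by a small temperature before the softmax concentrates nearly all weight on the highest-scoring positions, approximating hard attention. The function name and the example scores are hypothetical.

```python
# Illustrative sketch: lowering the softmax temperature pushes soft
# attention weights toward hard (argmax) attention.
import numpy as np

def soft_attention_weights(scores: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over attention scores at a given temperature."""
    z = scores / temperature
    z -= z.max()                      # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

scores = np.array([2.0, 1.5, 1.5, -1.0])   # hypothetical attention scores

for t in [1.0, 0.1, 0.01]:
    print(t, np.round(soft_attention_weights(scores, t), 4))
# As the temperature shrinks, the weight on the maximal position
# approaches 1, so the soft attention behaves like hard attention.
```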