Dual encoder models are ubiquitous in modern classification and retrieval. Crucial for training such dual encoders is an accurate estimate of the gradient of the softmax partition function over the large output space; this requires finding the negative targets that contribute most to that gradient ("hard negatives"). Because dual encoder parameters change during training, traditional static nearest-neighbor indexes can be sub-optimal. These static indexes (1) must be periodically rebuilt at high cost, which in turn requires (2) expensive re-encoding of all targets with the updated model parameters. This paper addresses both challenges. First, we introduce an algorithm that uses a tree structure to approximate the softmax with provable bounds and that maintains the tree dynamically. Second, we approximate the effect of a gradient update on target encodings with an efficient Nyström low-rank approximation. In our empirical study on datasets with over twenty million targets, our approach cuts error by half in relation to oracle brute-force negative mining. Furthermore, our method surpasses the prior state of the art while using 150x less accelerator memory.
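To make the role of hard negatives concrete, the following minimal sketch compares the exact softmax log-partition function with two truncated estimates: one that keeps only the highest-scoring targets, and one that keeps a uniform random subset of the same size. It is an illustration under stated assumptions, not the tree-based algorithm described above: the hard negatives are found by brute-force sorting, the encodings are random Gaussian vectors, and the names `log_partition` and `topk_log_partition` are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_partition(scores):
    """Numerically stable log-sum-exp over a list of scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def topk_log_partition(scores, k):
    """Truncated estimate that keeps only the k highest-scoring targets.

    Assumption: brute-force sorting stands in for an index lookup.
    """
    hardest = sorted(scores, reverse=True)[:k]
    return log_partition(hardest)


random.seed(0)
dim, num_targets, k = 16, 5000, 50

# Synthetic query and target encodings (illustrative data only).
query = [random.gauss(0, 1) for _ in range(dim)]
targets = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_targets)]
scores = [dot(query, t) for t in targets]

exact = log_partition(scores)                        # sum over all targets
hard = topk_log_partition(scores, k)                 # k hardest negatives
uniform = log_partition(random.sample(scores, k))    # k uniform negatives

print(f"exact={exact:.3f} hard={hard:.3f} uniform={uniform:.3f}")
```

Both truncated estimates sum a subset of positive terms, so neither can exceed the exact value, and the top-k subset always captures at least as much of the partition function as any other subset of the same size. The brute-force sort here touches every target on every query, which is the cost that an index over the targets is meant to avoid.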