DDCL-Attention is a prototype-based readout layer for transformer encoders that replaces simple pooling methods, such as mean pooling or class tokens, with a learned compression mechanism. It uses a small set of global prototype vectors and assigns tokens to them through soft probabilistic matching, producing compact token summaries at linear complexity in sequence length. The method offers three main advantages. First, it avoids prototype collapse through an exact decomposition of the training loss into a reconstruction term and a diversity term, ensuring that prototypes remain distinct. Second, its joint training with the encoder is shown to be stable under a practical timescale condition, using Tikhonov's singular perturbation theory and explicit learning-rate constraints. Third, the same framework supports three uses: a final readout layer, a differentiable codebook extending VQ-VAE, and a hierarchical document compressor. Experiments on four datasets confirm the theoretical predictions: the loss decomposition holds exactly, prototype separation grows as expected when the stability condition is met, and the codebook reaches full utilization, outperforming standard hard vector quantization. An additional study on orbital debris classification shows that the method also applies beyond standard NLP and vision tasks, including scientific tabular data.
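To make the soft prototype matching and the loss decomposition concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and it rests on several assumptions the abstract does not confirm: assignments are taken as a softmax over negative squared distances with a hypothetical `temperature` parameter, the summaries are assignment-weighted token means, and the decomposition is the standard identity $\sum_k q_k \lVert x - p_k \rVert^2 = \lVert x - \hat{x} \rVert^2 + \sum_k q_k \lVert p_k - \hat{x} \rVert^2$ with $\hat{x} = \sum_k q_k p_k$. The function names `soft_assign` and `readout` are illustrative only.

```python
import math
import random


def soft_assign(token, prototypes, temperature=1.0):
    """Softmax over negative squared distances (assumed matching rule)."""
    d2 = [sum((t - p) ** 2 for t, p in zip(token, proto)) for proto in prototypes]
    shift = min(d2)  # subtract the smallest distance for numerical stability
    weights = [math.exp(-(d - shift) / temperature) for d in d2]
    z = sum(weights)
    return [w / z for w in weights], d2


def readout(tokens, prototypes, temperature=1.0):
    """Return K summary vectors plus the three loss terms.

    One pass over the tokens, so the cost is O(N * K * D): linear in the
    sequence length N for a fixed number of prototypes K.
    """
    K, D = len(prototypes), len(prototypes[0])
    mass = [0.0] * K                       # total assignment mass per prototype
    sums = [[0.0] * D for _ in range(K)]   # assignment-weighted token sums
    total = recon = diversity = 0.0
    for x in tokens:
        q, d2 = soft_assign(x, prototypes, temperature)
        # Soft reconstruction: convex combination of the prototypes.
        x_hat = [sum(q[k] * prototypes[k][j] for k in range(K)) for j in range(D)]
        total += sum(q[k] * d2[k] for k in range(K))
        recon += sum((x[j] - x_hat[j]) ** 2 for j in range(D))
        diversity += sum(
            q[k] * sum((prototypes[k][j] - x_hat[j]) ** 2 for j in range(D))
            for k in range(K)
        )
        for k in range(K):
            mass[k] += q[k]
            for j in range(D):
                sums[k][j] += q[k] * x[j]
    # Assumed summary form: assignment-weighted mean of the tokens.
    summaries = [[s / max(mass[k], 1e-12) for s in sums[k]] for k in range(K)]
    return summaries, total, recon, diversity


# Toy data: 32 tokens of dimension 4, compressed onto 3 prototypes.
rng = random.Random(0)
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(32)]
prototypes = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
summaries, total, recon, diversity = readout(tokens, prototypes)

# The weighted distance loss splits into reconstruction plus diversity.
print(abs(total - (recon + diversity)) < 1e-9)
```

Under these assumptions the identity holds for any assignment weights that sum to one, because the cross term vanishes when $\hat{x}$ is the assignment-weighted mean of the prototypes. The diversity term is the assignment-weighted spread of the prototypes around $\hat{x}$, so it drops to zero if all prototypes coincide; this illustrates how such a decomposition relates to prototype collapse.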