Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and therefore rely on caching techniques to accelerate decoding. However, caching mechanisms often introduce undesirable repetitive text generation, a phenomenon we term the \textbf{Repeat Curse}. To investigate the underlying mechanism behind this issue, we analyze repetitive generation through the lens of information flow. Our work reveals three key findings: (1) context tokens aggregate semantic information as anchors and guide the final predictions; (2) as information propagates across layers, the entropy of context tokens converges in deeper layers, reflecting the model's growing prediction certainty; (3) repetition is typically linked to disruptions in the information flow of context tokens and to the failure of their entropy to converge in deeper layers. Based on these insights, we present \textbf{CoTA}, a plug-and-play method for mitigating repetition. CoTA enhances the attention to context tokens to preserve intrinsic information-flow patterns, and introduces a penalty term into the decoding confidence score to avoid outputs driven by uncertain context tokens. Extensive experiments show that CoTA substantially alleviates repetition and achieves consistent performance improvements on general tasks. Code is available at https://github.com/ErikZ719/CoTA.
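The entropy-aware confidence penalty described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `penalized_confidence` and the parameter `lambda_pen` are hypothetical names, and the sketch assumes the penalty is a simple linear subtraction of the context-token entropy from the per-position confidence used to rank unmasking candidates.

```python
import numpy as np

def penalized_confidence(logits, context_entropy, lambda_pen=0.5):
    """Score one masked position for unmasking (hypothetical sketch).

    logits: (vocab,) unnormalized scores for this position.
    context_entropy: scalar entropy of the attended context tokens;
        high entropy means the context has not yet converged.
    lambda_pen: assumed penalty weight (not from the paper).
    """
    # Softmax with the usual max-shift for numerical stability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    confidence = probs.max()  # base confidence: top-1 probability
    # Penalize positions whose context tokens remain uncertain,
    # so decoding avoids committing to outputs driven by them
    return confidence - lambda_pen * context_entropy
```

Under this sketch, two positions with identical logits are ranked differently when their context entropies differ: the one whose context tokens have converged (low entropy) is unmasked first.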