Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduces substantial inference-time cost. Diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding each candidate to pixel space and re-encoding it into the visual embedding space, which is redundant and costly. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on the intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while matching or improving on the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling: relative to a standard MLLM verifier, it reduces joint generation-and-verification time by 63.3%, FLOPs by 51%, and VRAM usage by 14.5%, while achieving a +2.7% improvement on GenEval at the same inference-time budget.
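The verifier-guided selection described in the abstract follows a best-of-N pattern: generate several candidates, score each with a verifier, and keep the highest-scoring one. The sketch below illustrates only that selection loop. The names `toy_generator` and `toy_hidden_state_verifier` are hypothetical placeholders introduced here, not the VHS architecture or any real library API; in the abstract's setting the scored objects would be DiT hidden states, whereas an MLLM verifier would first need each candidate decoded to pixel space.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score_fn: Callable[[T], float]) -> Tuple[int, float]:
    """Return (index, score) of the highest-scoring candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [score_fn(c) for c in candidates]
    # On ties, max() keeps the first index with the top score.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def toy_generator(seed: int, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for a single-step generator's hidden features."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def toy_hidden_state_verifier(hidden: Sequence[float]) -> float:
    """Hypothetical stand-in for a learned verifier: here, just the feature mean."""
    return sum(hidden) / len(hidden)


# Small budget: four candidates for one prompt, scored without any pixel decoding.
hidden_states = [toy_generator(seed) for seed in range(4)]
best_index, best_score = best_of_n(hidden_states, toy_hidden_state_verifier)
print(f"selected candidate {best_index} with score {best_score:.3f}")
```

Only the selected candidate would then need to be decoded to an image, which is where scoring in the generator's feature space could save work compared with decoding and re-encoding every candidate.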