Generative listwise ranking with Large Multimodal Models (LMMs) aims to capture global list context in a single forward pass, but its effectiveness degrades in long-context multimodal scenarios. We identify a recurring failure mode, parse collapse, in which the autoregressive decoder produces fluent yet incomplete rankings by silently omitting candidates and terminating early. This failure stems from limited context utilization rather than from simple formatting mistakes, so prompt engineering and constrained decoding alone are insufficient. We propose PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking), a framework that replaces transient in-context list processing with parametric structural conditioning. PRISMR uses a lightweight hypernetwork to encode multimodal candidates in parallel and generate item-specific LoRA weights, which are synthesized into an instance-specific adapter for an LMM. This paradigm enables more robust internalization of list structure while leaving the base model intact. We further introduce a large-scale multimodal review-ranking benchmark for evaluation. Experiments demonstrate that PRISMR substantially reduces parse collapse, improves listwise ranking performance, and transfers effectively across domains and instruction-tuned backbones.
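To make the parse-collapse failure mode concrete, the following minimal sketch checks whether a generated ranking covers every candidate. It is an illustration under stated assumptions, not the detection procedure used in this work: the names `ParseReport` and `check_ranking` are hypothetical, the ranking is assumed to have already been parsed into a list of candidate IDs, and treating any omitted candidate as a collapse is an assumed criterion.

```python
from dataclasses import dataclass


@dataclass
class ParseReport:
    """Hypothetical summary of how a generated ranking deviates from a full permutation."""
    missing: list      # candidate IDs that never appear in the ranking
    duplicated: list   # candidate IDs that appear more than once
    unknown: list      # IDs in the ranking that were not among the candidates

    @property
    def collapsed(self) -> bool:
        # Assumed criterion: any silently omitted candidate counts as parse collapse.
        return bool(self.missing)


def check_ranking(candidates, ranking) -> ParseReport:
    """Compare a parsed ranking against the candidate list it was supposed to order."""
    candidate_set = set(candidates)
    seen = set()
    duplicated, unknown = [], []
    for item in ranking:
        if item not in candidate_set:
            unknown.append(item)
        elif item in seen:
            if item not in duplicated:
                duplicated.append(item)
        else:
            seen.add(item)
    # Preserve the original candidate order when reporting omissions.
    missing = [c for c in candidates if c not in seen]
    return ParseReport(missing=missing, duplicated=duplicated, unknown=unknown)


if __name__ == "__main__":
    # A fluent but truncated output: two of four candidates are silently dropped.
    report = check_ranking(["a", "b", "c", "d"], ["c", "a"])
    print(report.collapsed, report.missing)
```

A check of this kind only measures the symptom (incomplete output); it does not distinguish limited context utilization from a plain formatting error, which is the distinction the abstract draws.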