RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings perform near chance on this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits more from retrieval-augmented geometric grounding than from end-to-end generation alone.
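
The retrieve-then-select stage described above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: cosine similarity for retrieval, a quality-weighted similarity kernel, and greedy maximization of the log-determinant (in the style of determinantal point processes). The function names `retrieve_top_k` and `greedy_logdet_select` are hypothetical, and the input arrays stand in for embeddings that would come from the learned viewpoint encoder and a structural feature extractor.

```python
import numpy as np


def retrieve_top_k(anchor_emb, cand_embs, k):
    """Return indices and cosine scores of the k candidates closest to the anchor."""
    a = anchor_emb / np.linalg.norm(anchor_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    scores = c @ a                      # cosine similarity to the anchor viewpoint
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]


def greedy_logdet_select(feats, relevance, m, eps=1e-6):
    """Greedily pick up to m items that maximize log det of a weighted kernel.

    Assumed kernel: L[i, j] = q_i * cos(f_i, f_j) * q_j, where q is a positive
    relevance weight (e.g. derived from viewpoint similarity). A large log det
    favors items that are individually relevant and mutually dissimilar.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    q = np.asarray(relevance, dtype=float)
    n = len(q)
    L = q[:, None] * (f @ f.T) * q[None, :] + eps * np.eye(n)

    selected = []
    for _ in range(min(m, n)):
        best, best_val = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = i, logdet
        if best is None:                # no candidate keeps the kernel positive definite
            break
        selected.append(best)
    return selected
```

In this sketch, one would first call `retrieve_top_k` on viewpoint embeddings, then pass the structural features of the retrieved candidates to `greedy_logdet_select` with positive weights such as `np.exp(scores)`. Near-duplicate references contribute little to the determinant, so the selected set tends to stay compact yet complementary.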
