Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive ``best-view'' supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best-view pseudo-labels. Those pseudo-labels are then used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep. On two challenging datasets comprising diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both on quantitative metrics and in human evaluation.
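To make the pseudo-labeling step concrete, below is a minimal sketch, not the authors' implementation: it assumes per-view captioning losses against the view-agnostic summary are already computed (a frozen captioner scoring each view independently), and that a lower loss means a more accurate caption prediction. The tensor shapes and the function name are illustrative assumptions.

```python
import torch

def best_view_pseudo_labels(caption_losses: torch.Tensor) -> torch.Tensor:
    """Derive best-view pseudo-labels from per-view captioning losses.

    caption_losses: (T, V) tensor, where entry (t, v) is the loss of a
    frozen captioner predicting the view-agnostic text summary at
    timestep t from view v alone. Lower loss = more accurate caption
    prediction = more informative view (the paper's key hypothesis).
    """
    # The view whose caption prediction is most accurate (lowest loss)
    # becomes the pseudo best-view label at each timestep; these labels
    # then supervise the view selector.
    return caption_losses.argmin(dim=1)  # (T,) view indices in [0, V)


# Toy usage: 3 timesteps, 4 camera views (values are made up).
losses = torch.tensor([
    [2.1, 1.3, 2.8, 2.5],   # view 1 predicts the summary best at t=0
    [0.9, 1.7, 1.1, 2.0],   # view 0 best at t=1
    [1.5, 1.5, 0.8, 1.9],   # view 2 best at t=2
])
print(best_view_pseudo_labels(losses))  # tensor([1, 0, 2])
```

At inference time no such losses are available; the trained selector predicts the best view from the video frames alone, which is why the captioner is only needed to mine training labels.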