The Scene Representation Transformer (SRT) is a recent method for rendering novel views at interactive rates. Because SRT expresses camera poses relative to an arbitrarily chosen reference camera, it is not invariant to the order of the input views. As a result, SRT is not directly applicable to large-scale scenes, where the reference frame would need to be changed regularly. In this work, we propose Relative Pose Attention SRT (RePAST): instead of fixing a reference frame at the input, we inject pairwise relative camera pose information directly into the attention mechanism of the Transformers. This leads to a model that is by definition invariant to the choice of any global reference frame, while still retaining the full capabilities of the original method. Empirical results show that adding this invariance to the model does not lead to a loss in quality. We believe that this is a step towards applying fully latent Transformer-based rendering methods to large-scale scenes.
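The invariance claimed above rests on a simple geometric fact: the relative pose between two cameras, T_i^{-1} T_j, does not change when every camera is moved by the same global rigid transform G, since (G T_i)^{-1} (G T_j) = T_i^{-1} T_j. The following is a minimal sketch of that property only, not of the RePAST attention mechanism itself, whose details are not given here. It assumes camera-to-world poses stored as 4x4 homogeneous matrices; all function names (`pose`, `rigid_inverse`, `pairwise_relative_poses`) and the example values are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rigid_inverse(t):
    """Inverse of a rigid transform [R | p], computed as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def pose(yaw, x, y, z):
    """Hypothetical helper: rotation about the z axis plus a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def pairwise_relative_poses(cam_to_world):
    """rel[i][j] = T_i^{-1} T_j, i.e. camera j expressed in camera i's frame."""
    inverses = [rigid_inverse(t) for t in cam_to_world]
    return [[matmul(inv_i, t_j) for t_j in cam_to_world] for inv_i in inverses]

def max_abs_diff(a, b):
    """Largest element-wise difference between two 4x4 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4))

# Three example cameras (arbitrary illustrative values).
cameras = [pose(0.0, 0.0, 0.0, 0.0),
           pose(0.7, 2.0, -1.0, 0.5),
           pose(-1.3, -4.0, 3.0, 1.0)]

# Re-express every camera in a different global frame.
global_change = pose(1.1, 5.0, -3.0, 2.0)
moved = [matmul(global_change, t) for t in cameras]

rel_before = pairwise_relative_poses(cameras)
rel_after = pairwise_relative_poses(moved)

# Pairwise relative poses agree up to floating-point error.
invariance_error = max(max_abs_diff(rel_before[i][j], rel_after[i][j])
                       for i in range(3) for j in range(3))

# Absolute poses, by contrast, do depend on the global frame.
absolute_change = max(max_abs_diff(cameras[i], moved[i]) for i in range(3))

print(f"relative pose difference: {invariance_error:.2e}")
print(f"absolute pose difference: {absolute_change:.2e}")
```

Any model whose only pose inputs are the entries of `rel[i][j]` inherits this invariance automatically, which is the motivation for conditioning attention on pairwise relative poses rather than on poses expressed in a fixed reference frame.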