Novel view synthesis of indoor scenes can be achieved by capturing a monocular video sequence of the environment. However, redundant frames introduced by free-form human camera movements in the input video reduce the efficiency of scene modeling. In this work, we tackle this challenge from the perspective of camera selection. We begin by constructing a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images. Based on this matrix, we use the Intra-List Diversity (ILD) metric to assess camera redundancy, formulating the camera selection task as an optimization problem. We then apply a diversity-based sampling algorithm to optimize the camera selection. We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments, closely mimicking real-world scenarios. Experimental results demonstrate that our strategy outperforms other approaches under time and memory constraints. Remarkably, our method achieves performance comparable to models trained on the full dataset, while using only an average of 15% of the frames and 75% of the allotted time.
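The ILD-driven selection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' exact algorithm: it assumes ILD is the mean pairwise dissimilarity (one minus similarity) within the selected subset, and uses a simple greedy strategy that repeatedly adds the camera most dissimilar, on average, to the cameras already chosen. The function names are hypothetical.

```python
import numpy as np

def intra_list_diversity(S, idx):
    """ILD of a selected subset: mean pairwise dissimilarity (1 - similarity).

    S   : (n, n) similarity matrix combining spatial and semantic terms.
    idx : indices of the selected cameras.
    """
    idx = np.asarray(idx)
    k = len(idx)
    if k < 2:
        return 0.0
    D = 1.0 - S[np.ix_(idx, idx)]          # pairwise dissimilarities
    return D[np.triu_indices(k, 1)].mean()  # average over unordered pairs

def greedy_diverse_select(S, k, seed=0):
    """Greedily pick k cameras, each maximizing average dissimilarity
    to the cameras selected so far (an assumed diversity-based sampler)."""
    n = S.shape[0]
    selected = [seed]
    remaining = set(range(n)) - {seed}
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda j: (1.0 - S[j, selected]).mean())
        selected.append(best)
        remaining.remove(best)
    return selected
```

On a toy 4-camera matrix where cameras 0 and 1 are near-duplicates, the sampler keeps only one of the redundant pair and prefers the spatially and semantically distinct views.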