High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views, often due to real-time constraints, leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method, independent of the underlying representation, applied as an image-based post-processing step after novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. To this end, we introduce a multi-view-aware, transformer-based network architecture that uses spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, enabling real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image- and video-based metrics.
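The adaptive patch selection strategy mentioned above can be illustrated with a minimal sketch: rather than running the inpainting network over the full frame, only patches that actually contain missing texture (as indicated by a hole mask from the renderer) are selected for processing. The function name, patch size, and threshold below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_patches(mask: np.ndarray, patch: int = 64,
                   min_hole_frac: float = 0.01) -> list:
    """Return (row, col) origins of patches whose hole coverage exceeds
    min_hole_frac. Only these patches would be sent to the inpainting
    network; fully rendered patches are skipped to save inference time.

    mask: H x W binary array, 1 where the rendered view is missing texture.
    """
    H, W = mask.shape
    coords = []
    for r in range(0, H, patch):
        for c in range(0, W, patch):
            tile = mask[r:r + patch, c:c + patch]
            if tile.mean() > min_hole_frac:
                coords.append((r, c))
    return coords

# Toy example: a 256x256 frame with one rectangular hole region.
mask = np.zeros((256, 256), dtype=np.uint8)
mask[40:80, 100:140] = 1
print(select_patches(mask))  # → [(0, 64), (0, 128), (64, 64), (64, 128)]
```

Because inference cost scales with the number of selected patches, raising `min_hole_frac` trades completeness for speed, which matches the speed/quality balance described in the abstract.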