Transient objects in casual multi-view captures cause ghosting artifacts in 3D Gaussian Splatting (3DGS) reconstruction. Existing solutions rely on scene decomposition at significant memory cost, or on motion-based heuristics that are vulnerable to parallax ambiguity. This work proposes a semantic filtering framework for category-aware transient removal using vision-language models: CLIP similarity scores between rendered views and distractor text prompts are accumulated per Gaussian across training iterations, and Gaussians exceeding a calibrated threshold undergo opacity regularization and periodic pruning. Unlike motion-based approaches, semantic classification resolves parallax ambiguity by identifying object categories independently of motion patterns. Experiments on the RobustNeRF benchmark demonstrate consistent improvements in reconstruction quality over vanilla 3DGS across four sequences, while maintaining minimal memory overhead and real-time rendering performance. Threshold calibration studies and baseline comparisons validate semantic guidance as a practical strategy for transient removal in scenarios with predictable distractor categories.
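The accumulation-and-pruning mechanism described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names (`accumulate_scores`, `apply_semantic_filter`), the per-Gaussian contribution weights, the decay factor, and the pruning cutoff are all assumed for illustration.

```python
import numpy as np

def accumulate_scores(acc, clip_sim, weights):
    """Accumulate a per-Gaussian distractor score across training iterations.

    acc      : (N,) running score per Gaussian
    clip_sim : scalar CLIP similarity of the rendered view to a distractor
               text prompt (hypothetical input; computed elsewhere)
    weights  : (N,) each Gaussian's contribution to the rendered view
    """
    # Each iteration, Gaussians that contributed to a distractor-like view
    # receive a share of that view's similarity score.
    return acc + clip_sim * weights

def apply_semantic_filter(opacity, acc, tau, decay=0.5, prune_below=0.05):
    """Regularize opacity of Gaussians whose accumulated score exceeds the
    calibrated threshold tau, and return a keep-mask for periodic pruning.

    decay and prune_below are assumed hyperparameters, not from the paper.
    """
    flagged = acc > tau
    # Opacity regularization: push flagged Gaussians toward transparency.
    new_opacity = np.where(flagged, opacity * decay, opacity)
    # Periodic pruning: drop flagged Gaussians whose opacity has collapsed.
    keep = ~(flagged & (new_opacity < prune_below))
    return new_opacity, keep

# Toy usage: 4 Gaussians, 3 iterations of the same distractor-heavy view.
acc = np.zeros(4)
weights = np.array([1.0, 0.0, 0.5, 0.0])  # Gaussians 0 and 2 render the distractor
for _ in range(3):
    acc = accumulate_scores(acc, clip_sim=0.8, weights=weights)
opacity, keep = apply_semantic_filter(np.array([0.9, 0.9, 0.04, 0.9]), acc, tau=1.0)
```

After three iterations the accumulated scores are `[2.4, 0.0, 1.2, 0.0]`, so Gaussians 0 and 2 are flagged; Gaussian 2's opacity decays below the cutoff and it is pruned, while Gaussian 0 is only regularized.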