Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing: the instruction is decomposed before the model has seen the scene. We observe that these methods are further limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified, training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D reasons jointly and grounds task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy, which allows the model to adaptively select the relevant video frames and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin (a 59.9\% relative improvement in mIoU) without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.
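To make the reported metric concrete, the following is a minimal sketch of how a mean IoU and a relative improvement are commonly computed. It is an illustration under stated assumptions, not the SceneFun3D evaluation code: the function names (`iou`, `mean_iou`, `relative_improvement`), the representation of masks as per-point booleans, and the convention for empty masks are all hypothetical, and the numbers in the usage note are toy values rather than results from the paper.

```python
from typing import Sequence


def iou(pred: Sequence[bool], gt: Sequence[bool]) -> float:
    """Intersection over union of two equal-length per-point boolean masks."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same length")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Sequence[bool]], gts: Sequence[Sequence[bool]]) -> float:
    """Average IoU over a set of (prediction, ground truth) mask pairs."""
    if not preds or len(preds) != len(gts):
        raise ValueError("need the same non-zero number of predictions and ground truths")
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0
```

For example, with toy values, `relative_improvement(0.15, 0.10)` is approximately 50: an absolute gain of 5 mIoU points corresponds to a 50% relative gain over a baseline of 10 points. A relative figure should therefore not be read as an absolute difference in mIoU.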