Natural-language teleoperation reduces operator workload and improves safety in high-risk or remote settings. In dynamic remote scenes, however, transmission latency in bidirectional communication opens a gap between the remotely perceived state and the operator's intent, which leads to misunderstood commands and incorrect execution. To mitigate this, we introduce the Spatio-Temporal Open-Vocabulary Scene Graph (ST-OVSG), a representation that enriches open-vocabulary perception with temporal dynamics and lightweight latency annotations. ST-OVSG uses large vision-language models (LVLMs) to construct open-vocabulary 3D object representations and extends them into the temporal domain via Hungarian assignment with our temporal matching cost, yielding a unified spatio-temporal scene graph. An embedded latency tag lets LVLM planners retrospectively query past scene states, resolving the local-remote state mismatches caused by transmission delays. To further reduce redundancy and highlight task-relevant cues, we propose a task-oriented subgraph filtering strategy that produces compact inputs for the planner. ST-OVSG generalizes to novel categories and improves planning robustness to transmission latency without requiring fine-tuning. Experiments show that our method achieves 74% node accuracy on the Replica benchmark, outperforming ConceptGraph. Notably, in the latency-robustness experiment, the LVLM planner assisted by ST-OVSG achieves a planning success rate of 70.5%.
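To make the temporal association step concrete, the following is a minimal sketch of frame-to-frame node matching with Hungarian assignment. The abstract does not define the temporal matching cost, so the cost used here (a weighted sum of centroid distance and label-embedding cosine distance), the weights `w_pos` and `w_sem`, the gating threshold `max_cost`, and the node fields `centroid` and `embed` are all illustrative assumptions rather than the method's actual definitions.

```python
import math

def hungarian(cost):
    """Minimum-cost assignment (Hungarian algorithm with potentials).
    Returns sorted (row, col) pairs; every row is assigned when rows <= cols."""
    n, m = len(cost), len(cost[0])
    if n > m:  # the core loop assumes rows <= cols, so transpose and swap back
        transposed = [list(col) for col in zip(*cost)]
        return sorted((r, c) for c, r in hungarian(transposed))
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: 1-based row currently matched to column j
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)

def temporal_cost(prev, curr, w_pos=0.5, w_sem=0.5):
    """Assumed cost: weighted centroid distance plus cosine distance of embeddings."""
    dist = math.dist(prev["centroid"], curr["centroid"])
    dot = sum(a * b for a, b in zip(prev["embed"], curr["embed"]))
    norm = math.hypot(*prev["embed"]) * math.hypot(*curr["embed"])
    return w_pos * dist + w_sem * (1.0 - dot / norm)

def match_frames(prev_nodes, curr_nodes, max_cost=1.0):
    """Associate nodes across two frames; unmatched current nodes count as new."""
    if not prev_nodes or not curr_nodes:
        return [], list(range(len(curr_nodes)))
    cost = [[temporal_cost(p, c) for c in curr_nodes] for p in prev_nodes]
    # Gate out assignments that are optimal but still implausibly expensive.
    matches = [(i, j) for i, j in hungarian(cost) if cost[i][j] <= max_cost]
    matched = {j for _, j in matches}
    new_nodes = [j for j in range(len(curr_nodes)) if j not in matched]
    return matches, new_nodes

# Toy frames: the cup moves slightly, the chair barely moves, a lamp appears.
prev_frame = [
    {"label": "cup", "centroid": (0.0, 0.0, 0.0), "embed": (1.0, 0.0)},
    {"label": "chair", "centroid": (2.0, 0.0, 0.0), "embed": (0.0, 1.0)},
]
curr_frame = [
    {"label": "chair", "centroid": (2.1, 0.0, 0.0), "embed": (0.0, 1.0)},
    {"label": "cup", "centroid": (0.3, 0.0, 0.0), "embed": (1.0, 0.0)},
    {"label": "lamp", "centroid": (5.0, 5.0, 1.0), "embed": (0.7, 0.7)},
]
matches, new_nodes = match_frames(prev_frame, curr_frame)
print(matches, new_nodes)
```

On the toy frames, the cup and chair from the earlier frame are linked to their counterparts in the later frame, and the lamp is left unmatched as a new node. In a full pipeline, each matched pair would extend a node's track through time, which is what a timestamped or latency-tagged query over past states would then read from; that part is not sketched here.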