Open-vocabulary Object Goal Navigation requires an embodied agent to reach objects described by free-form language, including categories never seen during training. Existing end-to-end policies overfit small simulator datasets, achieving high success on training scenes but failing to generalize and exhibiting unsafe behaviour (frequent collisions). We introduce OVSegDT, a lightweight transformer policy that tackles these issues with two synergistic components. The first is a semantic branch, comprising an encoder for the target binary mask and an auxiliary segmentation loss, which grounds the textual goal and provides precise spatial cues. The second is the proposed Entropy-Adaptive Loss Modulation, a per-sample scheduler that continuously balances imitation and reinforcement signals according to the policy entropy, eliminating brittle manual phase switches. Together, these additions cut training sample complexity by 33% and halve the collision count while keeping inference cost low (130M parameters, RGB-only input). On HM3D-OVON, our model matches its performance on unseen categories to that on seen ones and establishes state-of-the-art results (40.1% SR, 20.9% SPL on val unseen) without depth, odometry, or large vision-language models. Code is available at https://github.com/CognitiveAISystems/OVSegDT.
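To make the entropy-driven balancing concrete, the sketch below shows one plausible per-sample weighting in PyTorch. It is an illustrative assumption, not the paper's exact formulation: the normalized policy entropy is used as the blending weight, so uncertain (high-entropy) samples lean on the imitation signal while confident (low-entropy) samples rely more on the reinforcement signal; the function name and the linear blending rule are hypothetical.

```python
import torch


def entropy_adaptive_loss(il_loss: torch.Tensor,
                          rl_loss: torch.Tensor,
                          logits: torch.Tensor) -> torch.Tensor:
    """Blend per-sample imitation and RL losses by normalized policy entropy.

    Illustrative sketch (assumed form, not the paper's exact rule):
      il_loss, rl_loss: per-sample losses, shape (batch,)
      logits: action logits, shape (batch, num_actions)
    """
    probs = torch.softmax(logits, dim=-1)
    # Per-sample policy entropy, shape (batch,)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)
    # Normalize by the maximum entropy log(num_actions) -> weight in [0, 1]
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))
    w = entropy / max_entropy
    # High entropy -> follow the expert (imitation); low entropy -> trust RL
    return (w * il_loss + (1.0 - w) * rl_loss).mean()
```

Because the weight is computed per sample from the current logits, the imitation-to-reinforcement transition happens continuously and independently for each sample, rather than through a single hand-tuned phase switch for the whole training run.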