Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has focused primarily on point-and-click (PnC) interaction, while remote-control (RC) interaction, common in everyday TV usage, remains largely underexplored. To fill this gap, we introduce \textbf{TVWorld}, an offline graph-based abstraction of real-world TV navigation that enables reproducible, deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: \textbf{TVWorld-N} for topology-aware navigation and \textbf{TVWorld-G} for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a \emph{Topology-Aware Training} framework that injects topology awareness into LVLMs. Using this framework, we develop \textbf{TVTheseus}, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of $68.3\%$ on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Further analyses offer valuable insights for developing effective TV-use agents.