Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge in robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost, and is trained end-to-end with only RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% $\mathrm{IoU}_{50}$ and 54.1% under the $10^\circ 10\,\mathrm{cm}$ metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. Code, models, and additional qualitative results are available at https://mikigom.github.io/YOPO-project-page.