To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a factorization into center-from-camera, dimensions, and rotation ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results with 38.90 AP_3D, surpassing the previous best by +13.98 absolute, even when that baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models that perceive in 3D.
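The abstract's decoding order can be made concrete with a small sketch. The code below serializes detections into a CoS-style target: all 2D boxes first (the visual chain-of-thought), then 3D boxes in near-to-far order, with each box factorized as center-from-camera, then dimensions, then rotation. The field names, delimiter tokens, and numeric format here are illustrative assumptions, not the paper's actual tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    box2d: tuple    # (x1, y1, x2, y2) in image coordinates
    center3d: tuple # (x, y, z) in camera coordinates; z is depth from camera
    dims: tuple     # (width, height, length) in meters
    yaw: float      # rotation about the vertical axis, in radians

def serialize_cos(detections):
    """Emit a hypothetical CoS target string: 2D detections first,
    then 3D boxes, both in near-to-far (easy-to-hard) order."""
    # Across-object curriculum: sort by distance from the camera.
    ordered = sorted(detections, key=lambda d: d.center3d[2])
    tokens = []
    for d in ordered:  # stage 1: 2D grounding as a visual chain-of-thought
        tokens.append(f"<2d> {d.label} {d.box2d}")
    for d in ordered:  # stage 2: within-object center -> dims -> rotation
        tokens.append(f"<3d> {d.center3d} {d.dims} {d.yaw:.2f}")
    return " ".join(tokens)

dets = [
    Detection("car", (10, 20, 60, 80), (1.0, 0.5, 12.0), (1.8, 1.5, 4.2), 0.1),
    Detection("pedestrian", (70, 30, 90, 90), (-0.5, 0.2, 5.0), (0.6, 1.7, 0.5), 1.5),
]
print(serialize_cos(dets))  # the pedestrian (z=5.0) precedes the car (z=12.0)
```

The near-to-far sort reflects the across-object curriculum: closer objects are less ambiguous in depth and more relevant to an ego-centric agent, so the model resolves them first and can condition later, harder predictions on them.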