Recent reinforcement-learning frameworks for visual perception policies typically incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in its form: these chains perform semantic reasoning in an unstructured linguistic space, whereas \textbf{visual perception requires reasoning in a spatial and object-centric space}. In response, we introduce \textbf{Artemis}, a perception-policy learning method that performs structured visual reasoning, in which each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states, provides direct supervision of proposal quality, and avoids the ambiguity introduced by language-based reasoning. Building on verifiable and spatially grounded reasoning chains, Artemis provides a unified architecture for diverse perceptual tasks, without the task-specific designs that prior perceptual policy models rely on. Trained on grounding and detection samples from natural-image domains, Artemis generalizes to counting and geometric perception tasks. At its core, a spatially grounded, object-centric chain rule provides a principled foundation for scalable and general perceptual policies.
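To make the idea of a verifiable (label, bounding-box) step concrete, the following is a minimal sketch, not the Artemis implementation. It assumes corner-format boxes `(x_min, y_min, x_max, y_max)` and uses intersection-over-union (IoU) against same-label reference boxes as one plausible way to check each step. The names `Step`, `iou`, and `step_scores` are hypothetical, and the source does not specify how proposal quality is actually scored.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Step:
    """One intermediate reasoning step: a labeled bounding box (hypothetical type)."""
    label: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def step_scores(chain: List[Step], reference: List[Step]) -> List[float]:
    """Score each step by its best IoU with any reference box of the same label.

    This is an illustrative check only; a step with no same-label reference
    box receives a score of 0.0.
    """
    scores = []
    for step in chain:
        candidates = [iou(step.box, r.box) for r in reference if r.label == step.label]
        scores.append(max(candidates, default=0.0))
    return scores
```

Because every step is a concrete box with a label, each one can be checked on its own, which is not possible for a free-form sentence. Under the same assumptions, a counting answer could be read off as the number of steps carrying the queried label.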