Recently, "visual o1" has begun to attract attention, with the expectation that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. In reality, however, current LVLMs (Large Vision Language Models) can hardly even copy a geometric figure accurately, let alone truly understand the complex inherent logic and spatial relationships within geometric shapes. We believe accurate copying (strong perception) is the first step toward visual o1. Accordingly, we introduce the concept of "slow perception" (SP), which guides the model to perceive basic point-line combinations gradually, as humans do, and to reconstruct complex geometric structures progressively. SP has two stages: a) perception decomposition. Perception is not instantaneous. In this stage, complex geometric figures are broken down into basic simple units to unify the geometry representation. b) perception flow, which acknowledges that accurately tracing a line is not an easy task. This stage aims to avoid "long visual jumps" in regressing line segments by using a proposed "perceptual ruler" to trace each line stroke by stroke. Surprisingly, this human-like manner of perception enjoys an inference-time scaling law: the slower, the better. In the past, researchers strove to speed up the model's perception; we slow it down again, allowing the model to read the image step by step and carefully.
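To make the two stages concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a figure is given as a polyline of 2D points, that "perception decomposition" amounts to splitting it into individual line segments, and that the "perceptual ruler" is a maximum stroke length used to insert intermediate waypoints along each segment. The names `decompose`, `trace_segment`, and `ruler` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def decompose(polyline: List[Point]) -> List[Tuple[Point, Point]]:
    """Break a polyline into basic units: consecutive line segments.

    Assumption: the figure is representable as an ordered list of points.
    """
    return list(zip(polyline, polyline[1:]))


def trace_segment(start: Point, end: Point, ruler: float) -> List[Point]:
    """Trace a segment stroke by stroke.

    Returns waypoints from `start` to `end` such that no single stroke
    is longer than `ruler` (the hypothetical perceptual ruler length).
    """
    if ruler <= 0:
        raise ValueError("ruler must be positive")
    length = math.dist(start, end)
    # Smallest number of equal strokes that keeps each stroke within the ruler.
    steps = max(1, math.ceil(length / ruler))
    return [
        (
            start[0] + (end[0] - start[0]) * i / steps,
            start[1] + (end[1] - start[1]) * i / steps,
        )
        for i in range(steps + 1)
    ]


if __name__ == "__main__":
    triangle = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 0.0)]
    for seg_start, seg_end in decompose(triangle):
        print(trace_segment(seg_start, seg_end, ruler=2.0))
```

In this toy model, a shorter ruler produces more waypoints per segment, so the traced output becomes longer. This only illustrates what "slower" perception could mean in terms of output length; it says nothing about accuracy, which is an empirical claim of the paper.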