Slow Perception: Let's Perceive Geometric Figures Step-by-step

Recently, "visual o1" began to enter people's vision, with expectations that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. However, the reality is that current LVLMs (Large Vision Language Models) can hardly even accurately copy a geometric figure, let alone truly understand the complex inherent logic and spatial relationships within geometric shapes. We believe accurate copying (strong perception) is the first step to visual o1. Accordingly, we introduce the concept of "slow perception" (SP), which guides the model to gradually perceive basic point-line combinations, as our humans, reconstruct complex geometric structures progressively. There are two-fold stages in SP: a) perception decomposition. Perception is not instantaneous. In this stage, complex geometric figures are broken down into basic simple units to unify geometry representation. b) perception flow, which acknowledges that accurately tracing a line is not an easy task. This stage aims to avoid "long visual jumps" in regressing line segments by using a proposed "perceptual ruler" to trace each line stroke-by-stroke. Surprisingly, such a human-like perception manner enjoys an inference time scaling law -- the slower, the better. Researchers strive to speed up the model's perception in the past, but we slow it down again, allowing the model to read the image step-by-step and carefully.

翻译：近期，“视觉o1”开始进入人们的视野，人们期望这种慢思考设计能够解决视觉推理任务，尤其是几何数学问题。然而，现实情况是，当前的大型视觉语言模型甚至难以精确复制一个几何图形，更不用说真正理解几何形状内部复杂的固有逻辑和空间关系。我们认为，精确复制（强感知）是视觉o1的第一步。为此，我们引入了“慢感知”的概念，它引导模型像人类一样逐步感知基本的点线组合，渐进地重建复杂的几何结构。慢感知包含两个阶段：a) 感知分解。感知并非瞬时完成。在此阶段，复杂的几何图形被分解为基本的简单单元，以统一几何表示。b) 感知流，该阶段承认精确描绘一条线段并非易事。其目标是通过提出的“感知标尺”逐笔追踪每条线段，避免在线段回归中出现“长视觉跳跃”。令人惊讶的是，这种类人的感知方式遵循一种推理时间缩放定律——越慢越好。过去研究者致力于加速模型的感知，而我们却再次将其放缓，使模型能够逐步、仔细地读取图像。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日