Contrast maximization (CMAX) is a direct geometric framework for event-based motion estimation, but its iterative warp-and-accumulate pipeline incurs input-dependent computation and frequent memory accesses, challenging real-time, low-power edge deployment. We present CMAX-CAMEL, a coarse-to-fine adaptive, memory-efficient, low-power edge processor for CMAX. CMAX-CAMEL combines a runtime-adaptive execution strategy with a memory-centric processor architecture. It adjusts coarse-to-fine execution according to the observed event distribution, prioritizing stages likely to improve estimation accuracy while suppressing low-value iterations and unnecessary stage transitions. Architecturally, a banked parallel memory organization sustains real-time throughput while reducing latency, and a subsampling-coupled accumulation structure lowers memory-access activity along the warp-and-accumulate dataflow. On a Virtex FPGA prototype operating at 200 MHz, CMAX-CAMEL improves estimation accuracy by up to 19% over fixed coarse-to-fine schedules, reduces processing latency by 53.3%, lowers effective memory accesses by 42%, and cuts total system energy by 52.2%, including adaptation overheads. These results show that CMAX-CAMEL is an HW-SW co-design that co-optimizes execution policy and data movement for real-time, low-power event-based motion estimation at the edge.
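For readers unfamiliar with the warp-and-accumulate loop that the abstract refers to, the following is a minimal NumPy sketch of the generic contrast-maximization objective. It is not the CMAX-CAMEL implementation: the constant-velocity motion model, the variance-based contrast measure, the nearest-pixel accumulation, and the function names `warp_and_accumulate`, `contrast`, and `grid_search` are all assumptions made for illustration. The exhaustive candidate search shown here is a plain fixed grid, not the adaptive coarse-to-fine schedule described above.

```python
import numpy as np


def warp_and_accumulate(x, y, t, velocity, shape, t_ref=0.0):
    """Warp events to t_ref along a candidate velocity and count them per pixel.

    Assumes a constant image-plane velocity (vx, vy) in pixels per second.
    """
    vx, vy = velocity
    # Move every event back along the candidate motion to the reference time.
    xw = np.rint(x - vx * (t - t_ref)).astype(np.int64)
    yw = np.rint(y - vy * (t - t_ref)).astype(np.int64)

    # Discard events that leave the image after warping.
    height, width = shape
    inside = (xw >= 0) & (xw < width) & (yw >= 0) & (yw < height)

    # Image of warped events (IWE). np.add.at accumulates correctly even
    # when several events land on the same pixel.
    iwe = np.zeros(shape, dtype=np.float64)
    np.add.at(iwe, (yw[inside], xw[inside]), 1.0)
    return iwe


def contrast(iwe):
    """Variance of the IWE, one common choice of contrast measure."""
    return float(np.var(iwe))


def grid_search(x, y, t, shape, vx_candidates, vy_candidates):
    """Return the candidate velocity with the highest contrast (fixed grid)."""
    best_velocity, best_contrast = None, -np.inf
    for vx in vx_candidates:
        for vy in vy_candidates:
            c = contrast(warp_and_accumulate(x, y, t, (vx, vy), shape))
            if c > best_contrast:
                best_velocity, best_contrast = (float(vx), float(vy)), c
    return best_velocity, best_contrast
```

Each candidate evaluation touches every event once and performs one read-modify-write on the accumulation buffer per event, which is the input-dependent computation and memory traffic that the abstract identifies as the bottleneck for edge deployment.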