Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency, which limits them to a low-frequency batch-and-execute paradigm. This frequency mismatch creates an execution blind spot and causes failures in dynamic environments where targets move during the open-loop execution window. We propose TIDAL (Temporally Interleaved Diffusion and Action Loop), a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs and uses a dual-frequency architecture to redistribute the computational budget. Specifically, a low-frequency macro-intent loop caches semantic embeddings, while a high-frequency micro-control loop interleaves single-step flow integration with execution. This design enables approximately 9 Hz control updates on edge hardware (vs. approximately 2.4 Hz for baselines) without increasing marginal overhead. To handle the resulting latency shift, we introduce a temporally misaligned training strategy in which the policy learns predictive compensation from stale semantic intent combined with real-time proprioception. Additionally, we address the insensitivity of static vision encoders to velocity by incorporating a differential motion predictor. Because TIDAL is an architectural change, it is orthogonal to system-level optimizations. Experiments show a 2x performance gain over open-loop baselines in dynamic interception tasks. Despite a marginal regression in static success rates, our approach yields a roughly 4x increase in feedback frequency and extends the effective horizon of semantic embeddings beyond the native action chunk size. Under non-paused inference protocols, TIDAL remains robust where standard baselines fail due to latency.
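The dual-frequency idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`run_dual_frequency_loop`, `euler_flow_step`), the callback signatures, the zero initialization of the action (instead of sampled noise), and the toy velocity field are all assumptions made for illustration. The slow macro-intent path is modeled synchronously as "every `macro_period` ticks", whereas a real system would presumably run it asynchronously alongside the fast loop.

```python
from typing import Callable, List, Tuple

Vec = List[float]


def euler_flow_step(action: Vec, velocity: Vec, dt: float) -> Vec:
    """One explicit Euler step of a flow ODE: a <- a + dt * v."""
    return [a + dt * v for a, v in zip(action, velocity)]


def run_dual_frequency_loop(
    encode_intent: Callable[[Vec], Vec],             # slow path (hypothetical stand-in for the VLA backbone)
    velocity_field: Callable[[Vec, Vec, Vec], Vec],  # fast path: (action, cached intent, proprio) -> da/dt
    read_observation: Callable[[int], Vec],
    read_proprio: Callable[[int], Vec],
    n_ticks: int,
    macro_period: int,
    dt: float,
) -> Tuple[List[Vec], int]:
    """Return the executed actions and the number of slow-path calls."""
    cached_intent: Vec = []
    action: Vec = []
    macro_calls = 0
    executed: List[Vec] = []
    for tick in range(n_ticks):
        # Low-frequency macro-intent loop: refresh the cached semantic embedding.
        if tick % macro_period == 0:
            cached_intent = encode_intent(read_observation(tick))
            macro_calls += 1
            if not action:
                # Assumption: start from zeros rather than sampled noise,
                # so that the sketch is deterministic.
                action = [0.0] * len(cached_intent)
        # High-frequency micro-control loop: one integration step, then act.
        # The intent may be stale here, while proprioception is always fresh.
        proprio = read_proprio(tick)
        velocity = velocity_field(action, cached_intent, proprio)
        action = euler_flow_step(action, velocity, dt)
        executed.append(list(action))  # stand-in for sending the command to the robot
    return executed, macro_calls


# Toy stand-ins (hypothetical): the "embedding" is simply the observed target,
# and the velocity field pulls the action toward it, ignoring proprioception.
def toy_encode(obs: Vec) -> Vec:
    return list(obs)


def toy_velocity(action: Vec, intent: Vec, proprio: Vec) -> Vec:
    return [goal - a for a, goal in zip(action, intent)]


executed, macro_calls = run_dual_frequency_loop(
    encode_intent=toy_encode,
    velocity_field=toy_velocity,
    read_observation=lambda tick: [1.0, -1.0],
    read_proprio=lambda tick: [0.0, 0.0],
    n_ticks=8,
    macro_period=4,
    dt=0.5,
)
# Eight control updates were issued, but the slow encoder ran only twice.
```

In this toy run the control command is updated on every tick while the expensive encoder is called once per four ticks, which mirrors the budget redistribution the abstract describes. The actual ratio between the two loops, the flow parameterization, and the conditioning inputs are not specified in the abstract and would need to be taken from the paper.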