TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control

Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency, which limits them to a low-frequency batch-and-execute paradigm. This frequency mismatch creates an execution blind spot and causes failures in dynamic environments, where targets move during the open-loop execution window. We propose TIDAL (Temporally Interleaved Diffusion and Action Loop), a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs and uses a dual-frequency architecture to redistribute the computational budget. Specifically, a low-frequency macro-intent loop caches semantic embeddings, while a high-frequency micro-control loop interleaves single-step flow integration with execution. This design enables control updates at approximately 9 Hz on edge hardware (versus approximately 2.4 Hz for baselines) without increasing marginal overhead. To handle the resulting latency shift, we introduce a temporally misaligned training strategy in which the policy learns predictive compensation from stale semantic intent combined with real-time proprioception. We also address the insensitivity of static vision encoders to velocity by incorporating a differential motion predictor. Because TIDAL is an architectural change, it is orthogonal to system-level optimizations. Experiments show a 2x performance gain over open-loop baselines in dynamic interception tasks. Despite a marginal regression in static success rates, our approach yields a 4x increase in feedback frequency and extends the effective horizon of semantic embeddings beyond the native action chunk size. Under non-paused inference protocols, TIDAL remains robust where standard baselines fail due to latency.
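The dual-frequency idea described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: the names `slow_semantic_encoder`, `flow_velocity`, and `run_dual_frequency_loop`, the constants `MACRO_PERIOD` and `STEP_SIZE`, and the one-dimensional action are all assumptions made for illustration. The sketch only shows the scheduling pattern: the expensive semantic encoder runs once every few ticks and its output is cached, while every tick performs one Euler step of a toy flow and immediately executes the result.

```python
# Toy illustration only: every name and constant here is hypothetical.
MACRO_PERIOD = 4   # micro ticks between semantic refreshes (assumed)
STEP_SIZE = 0.5    # Euler step size for the toy flow (assumed)


def slow_semantic_encoder(target_position):
    """Stand-in for the expensive VLA backbone: returns a cached intent."""
    return {"goal": target_position}


def flow_velocity(action, intent):
    """Toy velocity field that pulls the action toward the cached goal."""
    return intent["goal"] - action


def run_dual_frequency_loop(target_positions):
    """Run one micro tick per entry of target_positions.

    Returns (backbone_calls, executed_actions).
    """
    intent = None
    action = 0.0          # partially integrated action sample
    backbone_calls = 0
    executed = []
    for tick, target in enumerate(target_positions):
        # Low-frequency macro-intent loop: refresh the cached embedding.
        if tick % MACRO_PERIOD == 0:
            intent = slow_semantic_encoder(target)
            backbone_calls += 1
        # High-frequency micro-control loop: one Euler step, then execute.
        action += STEP_SIZE * flow_velocity(action, intent)
        executed.append(action)
    return backbone_calls, executed


calls, actions = run_dual_frequency_loop([1.0] * 8)
print(calls, len(actions))  # prints "2 8"
```

In this toy run the encoder is called twice while eight control updates are issued, which mirrors how the control rate can exceed the semantic refresh rate. For reference, the reported rates of roughly 9 Hz and 2.4 Hz give a ratio of about 3.75, consistent with the stated 4x increase in feedback frequency.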
