Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining and process every input at a single rate. This is misaligned with physical interaction, where high-frequency modalities change at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, so that each modality updates and retains information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers that are refreshed at sensor rates and read continuously by the action head. New high-frequency modalities are integrated through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2\% vs.\ 40.95\%) while sustaining smooth, reactive 100\,Hz control. Project website: \href{https://intuitive-robots.github.io/DAM-VLA/}{intuitive-robots.github.io/DAM-VLA/}.
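To make the decoupling idea concrete, the following is a minimal sketch of per-modality latent buffers that are refreshed on their own schedules and read by a 100\,Hz control loop. It is an illustration under stated assumptions, not the DAM-VLA implementation: the names `LatentBuffer`, `gated_cross_attention`, and `run_episode` are hypothetical, the 10\,Hz vision rate and 500\,Hz fast-sensor rate are assumed values, the encoders are stand-ins, and the tanh-gated residual is one common form of gated cross-attention that may differ from the one used in the paper. Only the 100\,Hz control rate comes from the abstract.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LatentBuffer:
    """Holds the most recent latent of one modality (hypothetical helper)."""
    period_ms: Optional[int]  # refresh period; None means "set once per episode"
    latent: List[float] = field(default_factory=list)
    refreshes: int = 0

    def maybe_refresh(self, t_ms: int, encode: Callable[[int], List[float]]) -> None:
        # Refresh only when this modality's own clock says new data arrived.
        if self.period_ms is None:
            due = self.refreshes == 0
        else:
            due = t_ms % self.period_ms == 0
        if due:
            self.latent = encode(t_ms)
            self.refreshes += 1


def gated_cross_attention(query: List[float],
                          keys_values: List[List[float]],
                          gate: float) -> List[float]:
    """Residual cross-attention scaled by tanh(gate).

    With gate == 0 the query passes through unchanged, so a new modality
    can be attached without disturbing existing features (assumed design).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, kv)) / scale for kv in keys_values]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    attended = [
        sum((w / total) * kv[i] for w, kv in zip(weights, keys_values))
        for i in range(len(query))
    ]
    g = math.tanh(gate)
    return [q + g * a for q, a in zip(query, attended)]


def run_episode(duration_ms: int = 1000, control_period_ms: int = 10):
    """Simulate one episode on a 1 ms tick; control runs every 10 ms (100 Hz)."""
    buffers = {
        "language": LatentBuffer(period_ms=None),  # constant across the episode
        "vision": LatentBuffer(period_ms=100),     # 10 Hz, assumed rate
        "fast": LatentBuffer(period_ms=2),         # 500 Hz, assumed rate
    }
    # Stand-in encoders: a real system would run a learned encoder here.
    encoders = {
        "language": lambda t: [1.0, 0.0],
        "vision": lambda t: [float(t), 1.0],
        "fast": lambda t: [float(t % 7), 2.0],
    }
    actions = 0
    for t in range(duration_ms):
        for name, buf in buffers.items():
            buf.maybe_refresh(t, encoders[name])
        if t % control_period_ms == 0:
            # The action head reads whatever each buffer currently holds.
            _fused = gated_cross_attention(
                buffers["vision"].latent, [buffers["fast"].latent], gate=0.0
            )
            actions += 1
    return {name: buf.refreshes for name, buf in buffers.items()}, actions
```

In this sketch, calling `run_episode()` simulates one second: each buffer is refreshed only when its own period elapses, while the control loop reads the buffers on every 10 ms tick regardless of which modality last changed. Initializing the gate at zero is the design choice that keeps the query features untouched until the gate is trained away from zero.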