The high cost of collecting real-robot data has made robotic simulation a scalable platform for both evaluation and data generation. Yet most existing benchmarks concentrate on simple manipulation tasks such as pick-and-place, failing to capture the non-Markovian characteristics of real-world tasks and the complexity of articulated object interactions. To address this limitation, we present RuleSafe, a new articulated manipulation benchmark built upon a scalable LLM-aided simulation framework. RuleSafe features safes with diverse unlocking mechanisms, such as key locks, password locks, and logic locks, each demanding a different multi-stage reasoning and manipulation strategy. These LLM-generated rules yield non-Markovian, long-horizon tasks that require temporal modeling and memory-based reasoning. We further propose VQ-Memory, a compact and structured temporal representation that uses vector-quantized variational autoencoders (VQ-VAEs) to encode past proprioceptive states into discrete latent tokens. This representation filters out low-level noise while preserving high-level task-phase context, providing lightweight yet robust temporal cues compatible with existing Vision-Language-Action (VLA) models. Extensive experiments on state-of-the-art VLA models and diffusion policies show that VQ-Memory consistently improves long-horizon planning, enhances generalization to unseen configurations, and enables more efficient manipulation at reduced computational cost. Project page: vqmemory.github.io
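
To make the VQ-Memory idea concrete, below is a minimal sketch of a VQ-VAE that compresses a window of past proprioceptive states into a handful of discrete tokens, as the abstract describes. All module names, dimensions, and training details here (window length, codebook size, the straight-through estimator) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of VQ-Memory-style tokenization of proprioceptive history.
# Assumed shapes and hyperparameters are placeholders, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQMemorySketch(nn.Module):
    """Encode a window of past proprioceptive states into discrete latent tokens."""

    def __init__(self, state_dim=14, window=16, hidden=128,
                 codebook_size=64, code_dim=32, n_tokens=4, beta=0.25):
        super().__init__()
        self.n_tokens, self.code_dim, self.beta = n_tokens, code_dim, beta
        self.window, self.state_dim = window, state_dim
        # Encoder: flatten the (window, state_dim) history and map it
        # to n_tokens continuous latent vectors of size code_dim.
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(window * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_tokens * code_dim),
        )
        # Learnable codebook shared across all token slots.
        self.codebook = nn.Embedding(codebook_size, code_dim)
        # Decoder: reconstruct the state history from quantized latents.
        self.decoder = nn.Sequential(
            nn.Linear(n_tokens * code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, window * state_dim),
        )

    def quantize(self, z):
        # z: (B, n_tokens, code_dim). Nearest-neighbour lookup in the codebook.
        d = torch.cdist(z.reshape(-1, self.code_dim), self.codebook.weight)
        idx = d.argmin(dim=-1)                      # discrete token ids
        z_q = self.codebook(idx).view_as(z)
        # Straight-through estimator: gradients pass to the encoder unchanged.
        z_q_st = z + (z_q - z).detach()
        # Standard VQ-VAE objective: codebook loss + commitment loss.
        vq_loss = (F.mse_loss(z_q, z.detach())
                   + self.beta * F.mse_loss(z, z_q.detach()))
        return z_q_st, idx.view(z.shape[0], self.n_tokens), vq_loss

    def forward(self, states):
        # states: (B, window, state_dim) past proprioceptive readings.
        z = self.encoder(states).view(-1, self.n_tokens, self.code_dim)
        z_q, tokens, vq_loss = self.quantize(z)
        recon = self.decoder(z_q.flatten(1)).view(-1, self.window, self.state_dim)
        recon_loss = F.mse_loss(recon, states)
        # `tokens` is the compact memory a downstream VLA policy could consume.
        return tokens, recon_loss + vq_loss

# Usage: tokenize a batch of 8 histories of 16 steps of a 14-DoF state.
model = VQMemorySketch()
states = torch.randn(8, 16, 14)
tokens, loss = model(states)   # tokens: (8, 4) integer ids
loss.backward()
```

Discretization is what gives the filtering behavior claimed in the abstract: small perturbations of the proprioceptive signal map to the same codebook entry, so the downstream policy sees a stable, low-bandwidth summary of the task phase rather than raw noisy state trajectories.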