To address a fundamental limitation in cognitive systems, namely the absence of a time-updatable mediating thought space between semantics and continuous control, this work constructs and trains a vision-language-action (VLA) model termed Sigma, deployed on a single RTX 4090 GPU. The model is built on the open-source pi0.5_base backbone, with the svla_so101_pickplace dataset preprocessed into a structured training corpus. An independently designed VLA architecture is introduced to integrate deep semantic understanding with associative reasoning, enabling telepathic-style alignment between perception and action. Training proceeds through iterative optimization of data preprocessing, LoRA-based fine-tuning, and inference-stage adapter design. Evaluation is conducted via offline closed-loop replay, comparing Sigma against the untuned pi0.5_base under identical data conditions. Experimental results indicate a consistent reduction in control MSE at the vector, fragment, and trajectory scales, while preserving the stability of the telepathy norm and the quality of semantic-text alignment. These findings demonstrate that mind-responsive alignment control can be achieved quantitatively through semantic and associative architectural integration without retraining the base model, providing a reproducible pathway toward semantic alignment and intention-driven behavior.
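To make the three evaluation scales concrete, the multi-scale control MSE could be computed as sketched below. This is a minimal illustration, not the paper's implementation: the function name, the `fragment_len` parameter, and the equal-weight averaging are all assumptions.

```python
import numpy as np

def multiscale_mse(pred, ref, fragment_len=16):
    """Compare predicted and reference action trajectories at three scales.

    pred, ref: (T, D) arrays of T action vectors of dimension D.
    The fragment length and uniform averaging are illustrative assumptions;
    the paper does not specify these details.
    """
    err = (pred - ref) ** 2
    # Vector level: one MSE per timestep, shape (T,).
    vector_mse = err.mean(axis=1)
    # Fragment level: average over non-overlapping windows of fragment_len steps.
    n_frag = len(err) // fragment_len
    fragment_mse = (
        err[: n_frag * fragment_len]
        .reshape(n_frag, fragment_len, -1)
        .mean(axis=(1, 2))
    )
    # Trajectory level: a single scalar over the whole rollout.
    trajectory_mse = err.mean()
    return vector_mse, fragment_mse, trajectory_mse
```

Under this equal-weighting convention, the trajectory-level MSE is simply the mean of the vector-level values, so a consistent reduction at the finer scales implies one at the coarser scale.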