We introduce SuperIntelliAgent, an agentic learning framework that couples a trainable small diffusion model (the learner) with a frozen large language model (the verifier) to enable continual intelligence growth through self-supervised interaction. Unlike conventional supervised fine-tuning, SuperIntelliAgent learns autonomously without human annotation: the learner generates candidate outputs, the verifier evaluates them through step-by-step reasoning, and their interaction produces chosen/rejected pairs for Direct Preference Optimization (DPO). This converts each input into a pseudo-training signal for continual improvement. The framework integrates dual-scale memory: short-term in-context memory that preserves reasoning traces across refinement cycles, and long-term memory that consolidates acquired knowledge through lightweight on-the-fly fine-tuning. A replay buffer retains samples that show verifiable progress and replays them as auxiliary supervision, reinforcing recent learning while forming adaptive curricula. SuperIntelliAgent is infrastructure-agnostic and can be plugged into existing agentic frameworks, turning ordinary inference loops into a lifelong optimization process. We posit that pairing a trainable learner with a reasoning-capable verifier forms a minimal reliable unit of growing intelligence, as paired feedback and partial-history replay yield richer learning curricula and stronger preference alignment. With a small number of automatically generated DPO pairs, the learner improves across all evaluated benchmarks, indicating that this mechanism provides a promising direction for continual intelligence accumulation and real-world deployment.
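To make the loop concrete, the sketch below shows one self-supervised interaction step in plain Python. It is a minimal illustration under assumptions: `generate`, `score`, `dpo_update`, and `PreferencePair` are hypothetical stand-ins for the learner, verifier, and optimizer described above, not the framework's actual API.

```python
# Minimal sketch of one SuperIntelliAgent interaction step (hypothetical API:
# `generate`, `score`, and `dpo_update` are illustrative stand-ins, not the paper's code).
from dataclasses import dataclass
from typing import Callable, List
import random


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def agentic_learning_step(
    prompt: str,
    generate: Callable[[str, int], List[str]],   # trainable small diffusion learner
    score: Callable[[str, str], float],          # frozen LLM verifier (reasoned judgment -> scalar)
    dpo_update: Callable[[List[PreferencePair]], None],
    replay_buffer: List[PreferencePair],
    num_candidates: int = 4,
    replay_size: int = 2,
) -> None:
    """Turn one unlabeled input into a pseudo-training signal."""
    # 1. The learner proposes several candidate outputs for the same input.
    candidates = generate(prompt, num_candidates)

    # 2. The frozen verifier scores each candidate; the best and worst form a
    #    chosen/rejected pair for DPO.
    scored = sorted(((score(prompt, c), c) for c in candidates), reverse=True)
    (best_score, best), (worst_score, worst) = scored[0], scored[-1]
    pair = PreferencePair(prompt, best, worst)

    # 3. Replay a few past pairs that showed verifiable progress as auxiliary
    #    supervision, forming an adaptive curriculum.
    replay = random.sample(replay_buffer, min(replay_size, len(replay_buffer)))

    # 4. A lightweight on-the-fly DPO update consolidates the preference into
    #    the learner's weights (long-term memory).
    dpo_update([pair] + replay)

    # 5. Keep the new pair for future replay only if the verifier saw a clear
    #    margin between chosen and rejected (verifiable progress).
    if best_score > worst_score:
        replay_buffer.append(pair)
```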