We introduce CosmoCore, a neuroscience-inspired reinforcement learning (RL) architecture that integrates affective signals to improve code generation in large language models (LLMs). Motivated by human and animal learning, where embarrassment over a mistake drives rapid correction (much as a puppy avoids repeating an error after a single scolding), CosmoCore tags code generation trajectories with valence and surprise scores using a lightweight multi-layer perceptron (MLP). High-negative-valence (cringe) episodes, such as buggy code outputs, are prioritized in a Dream Queue for five-fold replay during off-policy updates, while low-surprise successes are pruned to prevent overconfidence and buffer bloat. Evaluated on code generation benchmarks such as HumanEval and BigCodeBench, as well as simulations in a custom data-pipeline environment, CosmoCore reduces hallucinated code (e.g., syntax errors and logical bugs) by 48\% and accelerates self-correction by 45\%. Local experiments with Hugging Face models in a PySpark environment validate these gains, and code snippets are provided for replication. Ablations confirm that valence tagging boosts exploratory curiosity and that pruning mitigates inefficiency. The framework extends RL from human feedback (RLHF) toward more emotionally aware code assistants, with applications in IDEs and data pipelines. Code and the custom mini-world simulation are released.
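To make the tagging step concrete, the following is a minimal sketch of the lightweight MLP described above, assuming a PyTorch implementation; the class name `AffectTagger`, the embedding dimension, and the hidden width are illustrative assumptions, since the abstract does not specify the released architecture.

```python
import torch
import torch.nn as nn


class AffectTagger(nn.Module):
    """Hypothetical sketch of CosmoCore's lightweight MLP that maps a
    trajectory embedding to a (valence, surprise) pair. Layer sizes are
    assumptions; only the two-head output follows the paper's description."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # two heads: valence and surprise
        )

    def forward(self, traj_embedding: torch.Tensor):
        valence_logit, surprise_logit = self.net(traj_embedding).unbind(-1)
        # Valence in [-1, 1] (negative = cringe); surprise in [0, 1].
        return torch.tanh(valence_logit), torch.sigmoid(surprise_logit)
```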
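The Dream Queue's replay-and-prune logic can likewise be sketched as a simple buffer, shown below. The five-fold replay factor follows the abstract; the valence and surprise thresholds, and the `push`/`sample` interface, are hypothetical choices for illustration.

```python
import random


class DreamQueue:
    """Sketch of the Dream Queue under stated assumptions: high-negative-valence
    (cringe) episodes are enqueued replay_factor times so they are replayed
    five-fold during off-policy updates; low-surprise successes are dropped
    to prevent overconfidence and buffer bloat. Thresholds are illustrative."""

    def __init__(self, replay_factor: int = 5,
                 cringe_threshold: float = -0.5,
                 prune_surprise: float = 0.1):
        self.buffer = []
        self.replay_factor = replay_factor
        self.cringe_threshold = cringe_threshold
        self.prune_surprise = prune_surprise

    def push(self, episode, valence: float, surprise: float) -> None:
        if valence <= self.cringe_threshold:
            # Five-fold replay for cringe episodes (e.g., buggy code outputs).
            self.buffer.extend([episode] * self.replay_factor)
        elif valence > 0 and surprise < self.prune_surprise:
            # Prune low-surprise successes: do not store them at all.
            return
        else:
            self.buffer.append(episode)

    def sample(self, batch_size: int):
        # Uniform sampling; the duplication in push() realizes the priority.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

Duplicating cringe episodes at insertion time is one simple way to realize five-fold replay under uniform sampling; a weighted-sampling buffer would be an equivalent design choice.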