NeuroClaw Technical Report

Agentic artificial intelligence systems promise to accelerate scientific workflows, but neuroimaging poses unique challenges: heterogeneous modalities (sMRI, fMRI, dMRI, EEG), long multi-stage pipelines, and persistent reproducibility risks. To address these challenges, we present NeuroClaw, a domain-specialized multi-agent research assistant for executable and reproducible neuroimaging research. NeuroClaw operates directly on raw neuroimaging data across formats and modalities, grounding its decisions in dataset semantics and BIDS metadata so that users need not prepare curated inputs or bespoke model code. The platform combines harness engineering with end-to-end environment management, including pinned Python environments, Docker support, automated installers for common neuroimaging tools, and GPU configuration. In practice, this layer emphasizes checkpointing, post-execution verification, structured audit traces, and controlled runtime setup, making toolchains more transparent while improving reproducibility and auditability. A three-tier skill/agent hierarchy separates user-facing interaction, high-level orchestration, and low-level tool skills, decomposing complex workflows into safe, reusable units. Alongside the NeuroClaw framework, we introduce NeuroBench, a system-level benchmark for executability, artifact validity, and reproducibility readiness. Across multiple multimodal LLMs, NeuroClaw-enabled runs yield consistent and substantial score improvements compared with direct agent invocation. Project homepage: https://cuhk-aim-group.github.io/NeuroClaw/index.html
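The checkpointing, post-execution verification, and structured audit traces mentioned above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not NeuroClaw's actual implementation: the `AuditedRunner` class, the `run_step` method, the marker-file naming, and the trace fields are all hypothetical names introduced here.

```python
import hashlib
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class AuditedRunner:
    """Hypothetical helper (not part of NeuroClaw): runs pipeline steps with
    checkpoints, output verification, and an in-memory audit trace."""
    workdir: Path
    trace: list = field(default_factory=list)

    def run_step(self, name: str, action: Callable[[Path], None],
                 expected: list[str]) -> bool:
        marker = self.workdir / f".{name}.done"
        # Checkpointing: a marker file means the step already succeeded.
        if marker.exists():
            self.trace.append({"step": name, "status": "skipped"})
            return True

        action(self.workdir)

        # Post-execution verification: every expected artifact must exist
        # and be non-empty.
        missing = [
            f for f in expected
            if not (self.workdir / f).is_file()
            or (self.workdir / f).stat().st_size == 0
        ]
        # Structured audit entry, with checksums of the verified artifacts.
        entry = {
            "step": name,
            "status": "failed" if missing else "ok",
            "missing": missing,
            "sha256": {
                f: hashlib.sha256((self.workdir / f).read_bytes()).hexdigest()
                for f in expected if f not in missing
            },
        }
        self.trace.append(entry)
        if not missing:
            marker.write_text(json.dumps(entry))
        return not missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        runner = AuditedRunner(Path(tmp))
        # Stand-in for a real tool call; writes a placeholder artifact.
        runner.run_step(
            "skull_strip",
            lambda w: (w / "brain.nii.gz").write_bytes(b"placeholder"),
            ["brain.nii.gz"],
        )
        print(json.dumps(runner.trace, indent=2))
```

Writing the checkpoint marker only after verification passes means a step that ran but produced no valid artifact is retried on the next invocation rather than silently skipped.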
