NeuroClaw Technical Report

Agentic artificial intelligence systems promise to accelerate scientific workflows, but neuroimaging poses unique challenges: heterogeneous modalities (sMRI, fMRI, dMRI, EEG), long multi-stage pipelines, and persistent reproducibility risks. To address these challenges, we present NeuroClaw, a domain-specialized multi-agent research assistant for executable and reproducible neuroimaging research. NeuroClaw operates directly on raw neuroimaging data across formats and modalities, grounding decisions in dataset semantics and BIDS metadata so users need not prepare curated inputs or bespoke model code. The platform combines harness engineering with end-to-end environment management, including pinned Python environments, Docker support, automated installers for common neuroimaging tools, and GPU configuration. In practice, this layer emphasizes checkpointing, post-execution verification, structured audit traces, and controlled runtime setup, making toolchains more transparent while improving reproducibility and auditability. A three-tier skill/agent hierarchy separates user-facing interaction, high-level orchestration, and low-level tool skills to decompose complex workflows into safe, reusable units. Alongside the NeuroClaw framework, we introduce NeuroBench, a system-level benchmark for executability, artifact validity, and reproducibility readiness. Across multiple multimodal LLMs, NeuroClaw-enabled runs yield consistent and substantial score improvements compared with direct agent invocation. Project homepage: https://cuhk-aim-group.github.io/NeuroClaw/index.html
