Large language models (LLMs) and large multimodal models (LMMs) promise to accelerate biomedical discovery, yet their reliability remains unclear. We introduce ARIEL (AI Research Assistant for Expert-in-the-Loop Learning), an open-source evaluation and optimization framework that pairs a curated multimodal biomedical corpus with expert-vetted tasks to probe two capabilities: full-length article summarization and fine-grained figure interpretation. Using uniform protocols and blinded PhD-level evaluation, we find that state-of-the-art models generate fluent but incomplete summaries, whereas LMMs struggle with detailed visual reasoning. We further find that prompt engineering and lightweight fine-tuning substantially improve textual coverage, and that a compute-scaled inference strategy enhances visual question answering. We build an ARIEL agent that integrates textual and visual cues, and we show that it can propose testable mechanistic hypotheses. ARIEL delineates the current strengths and limitations of foundation models and provides a reproducible platform for advancing trustworthy AI in biomedicine.