K-12 science classrooms are rich sites of inquiry where students coordinate phenomena, evidence, and explanatory models through discourse, yet the multimodal complexity of these interactions has made automated analysis elusive. Existing benchmarks for classroom discourse focus primarily on mathematics and rely solely on transcripts, overlooking the visual artifacts and model-based reasoning emphasized by the Next Generation Science Standards (NGSS). We address this gap with SciIBI, the first video benchmark for analyzing science classroom discourse, featuring 113 NGSS-aligned clips annotated with Core Instructional Practices (CIP) and sophistication levels. Evaluating eight state-of-the-art LLMs and multimodal LLMs reveals fundamental limitations: current models struggle to distinguish pedagogically similar practices, suggesting that CIP coding requires instructional reasoning beyond surface pattern matching. Furthermore, adding video input yields inconsistent gains across architectures. Crucially, our evidence-based evaluation shows that models often succeed through surface shortcuts rather than genuine pedagogical understanding. These findings establish science classroom discourse as a challenging frontier for multimodal AI and point toward human-AI collaboration, in which models retrieve evidence to accelerate expert review rather than replace it.