This paper presents an overview of the ClinicalSkillQA 2026 shared task, organized in conjunction with the BioNLP Workshop at ACL 2026. The shared task evaluates continuous perception and procedural reasoning in clinical skill assessment: systems must reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures. Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. Seven teams participated, making 90 submissions in total, and four of them provided system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. We describe the task setup, dataset construction, and evaluation criteria, summarize the methodologies adopted by the participating teams, and analyze the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.
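To make the two ordering metrics named in the abstract concrete, the following is a minimal sketch, not the official scorer. It assumes that Task Accuracy is the fraction of instances whose predicted order matches the ground truth exactly, and that Pairwise Accuracy is the fraction of frame pairs whose relative order in the prediction agrees with the ground truth. The official definition may differ (for example, it may count adjacent pairs only). The function names and the frame labels are hypothetical. BERTScore is omitted because it requires an external pretrained model.

```python
from itertools import combinations
from typing import Hashable, Sequence


def task_accuracy(
    preds: Sequence[Sequence[Hashable]],
    golds: Sequence[Sequence[Hashable]],
) -> float:
    """Fraction of instances whose predicted order equals the gold order exactly.

    Assumed definition; not taken from the official evaluation script.
    """
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and the same length")
    exact = sum(list(p) == list(g) for p, g in zip(preds, golds))
    return exact / len(golds)


def pairwise_accuracy(pred: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """Fraction of frame pairs ordered the same way in pred as in gold.

    Assumption: all pairs of frames are counted, not only adjacent ones.
    """
    if len(gold) < 2:
        raise ValueError("need at least two frames to form a pair")
    # Position of each frame in the predicted sequence.
    position = {frame: i for i, frame in enumerate(pred)}
    # combinations() preserves gold order, so a always precedes b in gold.
    pairs = list(combinations(gold, 2))
    correct = sum(
        1
        for a, b in pairs
        if a in position and b in position and position[a] < position[b]
    )
    return correct / len(pairs)


# Hypothetical frame labels for one instance with four key frames.
gold_order = ["A", "B", "C", "D"]
pred_order = ["A", "C", "B", "D"]  # frames B and C are swapped

instance_pairwise = pairwise_accuracy(pred_order, gold_order)
batch_task_acc = task_accuracy([pred_order, gold_order], [gold_order, gold_order])
```

Under these assumptions, a single swapped pair of neighboring frames makes the instance count as wrong for Task Accuracy while still earning most of the credit under Pairwise Accuracy, which is why the two metrics are reported together.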