Automated task guidance has recently attracted attention from the AI research community. Procedural mistake detection (PMD) is a challenging sub-problem of classifying whether a human user (observed through egocentric video) has successfully executed the task at hand (specified by a procedural text). Despite significant efforts to build resources and models for PMD, machine performance remains nonviable, and the reasoning processes underlying it are opaque. As such, we recast PMD as an explanatory self-dialog of questions and answers that serve as evidence for a decision. As this reformulation enables unprecedented transparency, we leverage a fine-tuned natural language inference (NLI) model to formulate two automated coherence metrics for generated explanations. Our results show that while open-source VLMs struggle with this task off-the-shelf, their accuracy, coherence, and dialog efficiency can be vastly improved by incorporating these coherence metrics into common inference and fine-tuning methods. Furthermore, our multi-faceted metrics visualize common outcomes at a glance, highlighting areas for improvement.
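To make the idea of an NLI-based coherence metric concrete, here is a minimal sketch of how entailment scores between question-answer evidence and a final decision could be aggregated. This is an illustration under stated assumptions, not the paper's implementation: it uses the off-the-shelf `roberta-large-mnli` checkpoint as a stand-in for the fine-tuned NLI model, and `dialog_coherence` is a hypothetical aggregate rather than either of the two metrics defined in the work.

```python
# Sketch: NLI-based coherence scoring for a PMD self-dialog.
# Assumptions: roberta-large-mnli stands in for the paper's fine-tuned
# NLI model; dialog_coherence is a hypothetical mean-entailment aggregate.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1)[0]
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()

def dialog_coherence(qa_pairs: list[tuple[str, str]], decision: str) -> float:
    """Mean entailment of the final decision by each question-answer turn."""
    evidence = [f"Q: {q} A: {a}" for q, a in qa_pairs]
    scores = [entailment_prob(e, decision) for e in evidence]
    return sum(scores) / len(scores) if scores else 0.0

# Example: evidence turns from a self-dialog about a cooking step.
qa = [
    ("Is the pan on the stove?", "Yes, the pan is on the lit burner."),
    ("Was oil added before the egg?", "No oil is visible in the pan."),
]
print(dialog_coherence(qa, "The user made a mistake in this step."))
```

A score near 1 would indicate that the generated evidence consistently supports the decision, while a low score flags explanations that contradict or fail to support it; this is the kind of signal the abstract describes feeding back into inference and fine-tuning.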