A Standard Operating Procedure (SOP) is a low-level, step-by-step written guide for a business software workflow, based on a video demonstration. SOPs are a crucial step toward automating end-to-end software workflows, but creating them manually can be time-consuming. Recent advances in large video-language models offer the potential to automate SOP generation by analyzing recordings of human demonstrations. However, current large video-language models face challenges with zero-shot SOP generation. We explore in-context learning with video-language models for SOP generation and report that it sometimes helps these models generate SOPs. We then propose in-context ensemble learning to further enhance the models' SOP generation capabilities.