Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks such as HD-EPIC VQA due to ambiguous queries and answer options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific fine-tuning of Qwen2.5-VL, a novel Temporal Chain-of-Thought (T-CoT) prompting strategy for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding tasks. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.