Video Corpus Visual Answer Localization (VCVAL) comprises question-related video retrieval and visual answer localization within the retrieved videos. Specifically, we use text-to-text retrieval to find videos relevant to a medical question, based on the similarity between video transcripts and answers generated by GPT-4. For visual answer localization, the start and end timestamps of the answer are predicted by aligning the query with both the visual content and the subtitles. For the Query-Focused Instructional Step Captioning (QFISC) task, step captions are generated by GPT-4. Specifically, we provide video captions generated by the LLaVA-Next-Video model and the video subtitles with timestamps as context, and ask GPT-4 to generate step captions for the given medical query. We submit a single run for evaluation; it obtains an F-score of 11.92 and a mean IoU of 9.6527.
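The text-to-text retrieval step can be illustrated with a minimal sketch: rank candidate videos by the lexical similarity between each video's transcript and the GPT-4-generated answer. The bag-of-words cosine scorer below is a simplified stand-in for whatever retrieval model is actually used; the function names and example data are purely illustrative.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words term-count vectors of two texts.

    A toy proxy for the transcript/answer similarity used in retrieval;
    a real system would use a stronger text encoder.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_videos(answer: str, transcripts: dict) -> list:
    """Rank candidate videos by transcript similarity to the generated answer."""
    scored = [(vid, bow_cosine(answer, text)) for vid, text in transcripts.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)

# Illustrative usage: the transcript sharing more terms with the answer ranks first.
ranking = rank_videos(
    "apply ice to the sprained ankle and rest it",
    {"vid1": "first apply ice and then rest the injured ankle",
     "vid2": "how to bake sourdough bread at home"},
)
```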