MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian

Multimodal learning on video and text data has received growing attention across research tasks such as text-to-video retrieval, video-to-text retrieval, and video captioning. Although many algorithms have been proposed for these challenging tasks, most of them were developed on English-language datasets. Although Indonesian is one of the most widely spoken languages in the world, multimodal video-text research with Indonesian sentences remains under-explored, likely because no public benchmark dataset exists. To address this issue, we construct the first public Indonesian video-text dataset by translating the English sentences of the MSVD dataset into Indonesian. Using our dataset, we then train neural network models that were originally developed for English video-text datasets on three tasks: text-to-video retrieval, video-to-text retrieval, and video captioning. Recent neural network-based approaches to video-text tasks often rely on a feature extractor that is pretrained primarily on an English vision-language dataset. Because pretraining resources with Indonesian sentences are relatively limited, the applicability of those approaches to our dataset remains an open question. To overcome the lack of pretraining resources, we apply cross-lingual transfer learning: we take feature extractors pretrained on the English dataset and fine-tune the models on our Indonesian dataset. Our experimental results show that this approach helps improve performance on all three tasks across all metrics. Finally, we discuss potential future work using our dataset, with the aim of inspiring further research on Indonesian multimodal video-text tasks. We believe that our dataset and experimental results provide valuable contributions to the community. Our dataset is available on GitHub.
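As an illustration of how the two retrieval tasks are commonly scored, the following is a minimal sketch of Recall@K computed from a caption-to-video similarity matrix. The abstract does not name the metrics used, so Recall@K is an assumption here, and the function name `recall_at_k`, the toy scores, and the one-correct-item-per-query setup are all hypothetical simplifications rather than the evaluation code behind the reported results.

```python
from typing import Dict, List, Sequence


def recall_at_k(similarity: Sequence[Sequence[float]],
                ground_truth: Sequence[int],
                ks: Sequence[int] = (1, 5, 10)) -> Dict[int, float]:
    """Fraction of queries whose correct item is ranked within the top k.

    similarity[i][j] is the score between query i and candidate j;
    ground_truth[i] is the index of the correct candidate for query i.
    (Hypothetical helper: assumes exactly one correct candidate per query.)
    """
    ranks: List[int] = []
    for scores, target in zip(similarity, ground_truth):
        # Rank = 1 + number of candidates scoring strictly higher than the target.
        target_score = scores[target]
        rank = 1 + sum(1 for s in scores if s > target_score)
        ranks.append(rank)
    return {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}


# Toy example (made-up scores): 3 Indonesian captions (rows) vs. 3 videos (columns).
toy_similarity = [
    [0.9, 0.1, 0.3],  # caption 0: correct video 0 is ranked first
    [0.2, 0.4, 0.8],  # caption 1: correct video 1 is ranked second
    [0.5, 0.6, 0.1],  # caption 2: correct video 2 is ranked third
]
toy_targets = [0, 1, 2]

# Text-to-video retrieval: each caption queries the set of videos.
text_to_video = recall_at_k(toy_similarity, toy_targets, ks=(1, 2, 3))
print(text_to_video)
```

On the toy scores, one of the three captions retrieves its video at rank 1, two within rank 2, and all three within rank 3. Video-to-text retrieval could be scored the same way on the transposed matrix, although a dataset with several captions per video would need a rule for which caption counts as correct, which this sketch leaves out.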
