In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language remains beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time, in an explainable and programmatic way, to connect state-of-the-art learning-based vision and language models and to provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich, and relevant textual descriptions for videos collected from a variety of datasets, using both standard metrics (e.g., BLEU, ROUGE) and the modern LLM-as-a-Jury approach.