Domain randomization is an effective computer vision technique for improving the transferability of vision models across visually distinct domains that share similar content. Existing approaches, however, rely heavily on tweaking complex, specialized simulation engines that are difficult to construct, which limits their feasibility and scalability. This paper introduces BehAVE, a video understanding framework that leverages the plethora of existing commercial video games for domain randomization without requiring access to their simulation engines. Under BehAVE, (1) the inherent rich visual diversity of video games acts as the source of randomization and (2) player behavior -- represented semantically via textual descriptions of actions -- guides the *alignment* of videos with similar content. We test BehAVE on 25 games of the first-person shooter (FPS) genre across various video and text foundation models and report its robustness for domain randomization. BehAVE successfully aligns player behavioral patterns and can transfer them zero-shot to multiple unseen FPS games when trained on just one FPS game. In a more challenging setting, BehAVE improves the zero-shot transferability of foundation models to unseen FPS games by up to 22%, even when trained on a game of a different genre (Minecraft). Code and dataset can be found at https://github.com/nrasajski/BehAVE.