We introduce a challenging decision-making task that we call active acquisition for multimodal temporal data (A2MT). In many real-world scenarios, input features are not readily available at test time and must instead be acquired at significant cost. With A2MT, we aim to learn agents that actively select which modalities of an input to acquire, trading off acquisition cost and predictive performance. A2MT extends a previous task called active feature acquisition to temporal decision making about high-dimensional inputs. We propose a method based on the Perceiver IO architecture to address A2MT in practice. Our agents are able to solve a novel synthetic scenario requiring practically relevant cross-modal reasoning skills. On two large-scale, real-world datasets, Kinetics-700 and AudioSet, our agents successfully learn cost-reactive acquisition behavior. However, an ablation reveals they are unable to learn adaptive acquisition strategies, emphasizing the difficulty of the task even for state-of-the-art models. Applications of A2MT may be impactful in domains like medicine, robotics, or finance, where modalities differ in acquisition cost and informativeness.