Foundation models, after excelling in computer vision and natural language processing, have recently expanded into robotics. They are available in two forms: free open-source models and paid closed-source models. Users with access to both must choose between closed-source models that are effective but costly and open-source alternatives that are free but less powerful. We call this the model selection problem. Existing supervised-learning methods are impractical here because collecting extensive training data from closed-source models is expensive. We therefore focus on the online learning setting, in which an algorithm learns while it collects data and so needs no large pre-collected dataset. We formulate a user-centric online model selection problem and propose a novel solution that combines an open-source encoder, which outputs a context, with an online learning algorithm that processes this context. The encoder distills vast data distributions into low-dimensional features (the context) without additional training. Based on the context extracted from the data, the online learning algorithm aims to maximize a composite reward that accounts for model performance, execution time, and cost. As supported by our theoretical analysis, this yields a better trade-off between open-source and closed-source models than non-contextual methods. Experiments on language-based robotic tasks drawn from the Waymo Open Dataset, ALFRED, and Open X-Embodiment demonstrate real-world applications of the solution, which improves the task success rate by up to 14%.
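To make the selection loop concrete, the following is a minimal sketch under stated assumptions, not the method of the paper. The abstract does not name the online learning algorithm, so LinUCB (a standard contextual bandit) is used here as a stand-in. The reward weights `w_time` and `w_cost`, the class name `LinUCBSelector`, and the assumption that the context is a fixed-length vector produced by a frozen open-source encoder are all hypothetical.

```python
import numpy as np


def composite_reward(success, seconds, dollars, w_time=0.1, w_cost=0.5):
    """Hypothetical composite reward: task success minus weighted time and cost.

    The weights are illustrative assumptions, not values from the paper.
    """
    return float(success) - w_time * seconds - w_cost * dollars


class LinUCBSelector:
    """LinUCB stand-in: one ridge-regression reward model per candidate model."""

    def __init__(self, n_models, dim, alpha=1.0):
        self.alpha = alpha  # width of the exploration bonus
        self.A = [np.eye(dim) for _ in range(n_models)]    # per-arm Gram matrix
        self.b = [np.zeros(dim) for _ in range(n_models)]  # per-arm reward sums

    def select(self, context):
        """Return the index of the model with the highest upper confidence bound."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b  # ridge estimate of the reward weights
            bonus = self.alpha * np.sqrt(context @ A_inv @ context)
            scores.append(theta @ context + bonus)
        return int(np.argmax(scores))

    def update(self, model, context, reward):
        """Fold the observed reward for the chosen model into its statistics."""
        self.A[model] += np.outer(context, context)
        self.b[model] += reward * context
```

In use, each incoming task would be encoded into a context vector, `select` would pick index 0 (say, the open-source model) or 1 (the closed-source model), the chosen model would run, and `update` would be called with the resulting `composite_reward`. A non-contextual baseline corresponds to feeding the same constant vector for every task, in which case the selector cannot tell easy tasks from hard ones.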