Running AI models on smart edge devices can unlock versatile user experiences, but it is challenging because compute is limited and multiple tasks must be handled simultaneously. This calls for a vision encoder that is small yet provides powerful, versatile representations. We present the Efficient Universal Perception Encoder (EUPE), which offers both inference efficiency and universally good representations for diverse downstream tasks. We achieve this by distilling from multiple domain-expert foundation vision encoders. Unlike previous agglomerative methods that scale down directly from multiple teachers to an efficient encoder, we demonstrate the importance of first scaling up to a large proxy teacher and then scaling down from this single teacher. Experiments show that EUPE performs on par with or better than individual domain experts of the same size across diverse task domains, and that it also outperforms previous agglomerative encoders. We release the full family of EUPE models and the code to foster future research.
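To make the two-stage data flow concrete, the following is a minimal toy sketch, not the EUPE implementation. Everything in it is an assumption made for illustration: the teachers are hypothetical fixed linear maps (`TEACHER_A`, `TEACHER_B`), the proxy and the student are linear models fitted by per-sample SGD on squared error, and the helper names `linear`, `distill`, and `all_teachers` are invented here. Because all models are linear, the sketch does not capture the capacity gap between a large proxy and a small student; it only shows that the student in stage 2 is trained against the single proxy teacher and never sees the domain experts directly.

```python
import random

def linear(W, x):
    """Apply a linear encoder: one output feature per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def distill(teacher_fn, inputs, out_dim, in_dim, lr=0.05, epochs=200, seed=0):
    """Fit a linear model to teacher_fn's features with per-sample SGD on squared error."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    for _ in range(epochs):
        for x in inputs:
            target = teacher_fn(x)
            pred = linear(W, x)
            for i in range(out_dim):
                grad_scale = 2.0 * (pred[i] - target[i])
                for j in range(in_dim):
                    W[i][j] -= lr * grad_scale * x[j]
    return W

# Hypothetical frozen "domain expert" teachers: 3-D input, 2-D features each.
TEACHER_A = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
TEACHER_B = [[0.0, 2.0, 1.0], [-1.0, 0.0, 0.5]]

def all_teachers(x):
    """Concatenate the features of every teacher for one input."""
    return linear(TEACHER_A, x) + linear(TEACHER_B, x)

data_rng = random.Random(42)
data = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]

# Stage 1 (scale up): a single proxy teacher absorbs all domain experts.
proxy_W = distill(all_teachers, data, out_dim=4, in_dim=3)

# Stage 2 (scale down): the student is trained only against the proxy.
student_W = distill(lambda x: linear(proxy_W, x), data, out_dim=4, in_dim=3, seed=1)

# Compare the student with the original experts on an unseen input.
probe = [0.3, -0.2, 0.7]
max_gap = max(abs(s - t) for s, t in zip(linear(student_W, probe), all_teachers(probe)))
print(f"max |student - experts| on probe: {max_gap:.2e}")
```

In a realistic setting the proxy would be a large network, the student a much smaller one, and the distillation loss would be chosen per teacher; those details are not specified in the abstract and are left out of this sketch.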