Parameter management is essential for distributed training of large machine learning (ML) tasks. Some ML tasks are hard to distribute because common approaches to parameter management can be highly inefficient. Advanced parameter management approaches -- such as selective replication or dynamic parameter allocation -- can improve efficiency, but to do so, they typically need to be integrated manually into each task's implementation and they require expensive upfront experimentation to tune correctly. In this work, we explore whether these two problems can be avoided. We first propose a novel intent signaling mechanism that integrates naturally into existing ML stacks and provides the parameter manager with crucial information about parameter accesses. We then describe AdaPM, a fully adaptive, zero-tuning parameter manager based on this mechanism. In contrast to prior systems, this approach separates providing information (simple, done by the task) from exploiting it effectively (hard, done automatically by AdaPM). In our experimental evaluation, AdaPM matched or outperformed state-of-the-art parameter managers out of the box, suggesting that automatic parameter management is possible.
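The separation between signaling and exploiting intent can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not AdaPM's actual interface: the class `ToyIntentManager`, the methods `signal_intent` and `plan`, the logical-clock window, and the simple decision rule (relocate a parameter when one node intends to access it, replicate it when several do). The task only declares which keys it will access and when; the manager alone decides what to do with that information.

```python
from collections import defaultdict


class ToyIntentManager:
    """Hypothetical stand-in for a parameter manager (not AdaPM's real API)."""

    def __init__(self):
        # key -> list of (node, start_clock, end_clock) intent records
        self._intents = defaultdict(list)

    def signal_intent(self, node, keys, start, end):
        """Task side: declare that `node` will access `keys` in clocks [start, end)."""
        for key in keys:
            self._intents[key].append((node, start, end))

    def plan(self, key, clock):
        """Manager side: choose an action for `key` at logical time `clock`."""
        records = self._intents.get(key, [])
        # Nodes whose declared window covers the current clock.
        active = sorted({n for n, s, e in records if s <= clock < e})
        if not active:
            return ("keep", [])           # nobody needs the key right now
        if len(active) == 1:
            return ("relocate", active)   # assumed rule: move it to the sole user
        return ("replicate", active)      # assumed rule: copy it to all users


pm = ToyIntentManager()
pm.signal_intent("node0", keys=[1, 2], start=0, end=5)
pm.signal_intent("node1", keys=[2, 3], start=3, end=8)

print(pm.plan(1, clock=2))  # only node0 has declared intent for key 1
print(pm.plan(2, clock=4))  # both nodes have declared intent for key 2
```

In this sketch the task code never chooses between relocation and replication; it only reports upcoming accesses, which mirrors the division of labor described above.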