Recently, the success of pre-training in the text domain has extended to vision, audio, and cross-modal scenarios. Pre-training models proposed for different modalities are becoming increasingly homogeneous in their model structures, which creates an opportunity to implement them within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is its modular design. The toolkit uniformly divides pre-training models into five components: embedding, encoder, target embedding, decoder, and target. Because almost all common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.
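To make the five-component decomposition concrete, the following is a minimal sketch in plain Python, not the toolkit's actual API. All names here (`REGISTRY`, `PretrainModel`, `build_model`, and the toy modules) are hypothetical and chosen only for illustration. The sketch also assumes that the target embedding and decoder may be omitted for encoder-only models, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Hidden = List[List[float]]

# Toy stand-ins for real modules; each component offers interchangeable choices.
def word_embedding(tokens: List[int]) -> Hidden:
    return [[float(t), 1.0] for t in tokens]

def identity_encoder(hidden: Hidden) -> Hidden:
    return hidden

def scale_encoder(hidden: Hidden) -> Hidden:
    return [[2.0 * x for x in vec] for vec in hidden]

def concat_decoder(memory: Hidden, tgt_hidden: Hidden) -> Hidden:
    return memory + tgt_hidden

def sum_target(hidden: Hidden) -> float:
    return sum(sum(vec) for vec in hidden)

# Hypothetical registry: component name -> module name -> implementation.
REGISTRY: Dict[str, Dict[str, Callable]] = {
    "embedding": {"word": word_embedding},
    "encoder": {"identity": identity_encoder, "scale": scale_encoder},
    "tgt_embedding": {"word": word_embedding},
    "decoder": {"concat": concat_decoder},
    "target": {"sum": sum_target},
}

@dataclass
class PretrainModel:
    embedding: Callable
    encoder: Callable
    target: Callable
    # Assumption: these two are optional so encoder-only models can be built.
    tgt_embedding: Optional[Callable] = None
    decoder: Optional[Callable] = None

    def __call__(self, src: List[int], tgt: Optional[List[int]] = None):
        hidden = self.encoder(self.embedding(src))
        if self.decoder is not None:
            hidden = self.decoder(hidden, self.tgt_embedding(tgt))
        return self.target(hidden)

def build_model(config: Dict[str, str]) -> PretrainModel:
    """Pick one module per component by name and assemble a model."""
    def pick(component: str) -> Optional[Callable]:
        name = config.get(component)
        return None if name is None else REGISTRY[component][name]

    model = PretrainModel(
        embedding=pick("embedding"),
        encoder=pick("encoder"),
        target=pick("target"),
        tgt_embedding=pick("tgt_embedding"),
        decoder=pick("decoder"),
    )
    if (model.decoder is None) != (model.tgt_embedding is None):
        raise ValueError("decoder and tgt_embedding must be set together")
    return model

# Two models assembled from the same pool of modules.
encoder_only = build_model(
    {"embedding": "word", "encoder": "scale", "target": "sum"}
)
encoder_decoder = build_model(
    {"embedding": "word", "encoder": "identity",
     "tgt_embedding": "word", "decoder": "concat", "target": "sum"}
)
```

In this sketch, switching from an encoder-only model to an encoder-decoder model only changes the configuration dictionary, which is the kind of reuse the modular design is meant to allow.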