Decoder-based transformers, while revolutionizing language modeling and scaling to immense sizes, have not completely overtaken encoder-only architectures in natural language processing. Specifically, encoder-only models remain dominant in tasks like classification, regression, and ranking. This is primarily due to the inherent structure of decoder-based models, which limits their direct applicability to these tasks. In this paper, we introduce Gemma Encoder, which adapts the powerful Gemma decoder model into an encoder architecture, thereby unlocking its potential for a wider range of non-generative applications. To optimize the adaptation from decoder to encoder, we systematically analyze various pooling strategies, attention mechanisms, and hyperparameters (e.g., dropout rate). Furthermore, we benchmark Gemma Encoder against established approaches on the GLUE benchmark and the MS MARCO ranking benchmark, demonstrating its effectiveness and versatility.
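To make the decoder-to-encoder adaptation concrete, the following is a minimal illustrative sketch, not Gemma Encoder's actual implementation: a backbone run with bidirectional (non-causal) attention, followed by a configurable pooling strategy, dropout, and a task head. All names (`PooledClassifier`, the `pooling` options, the toy backbone) are hypothetical and chosen only to show the design space the paper analyzes.

```python
import torch
import torch.nn as nn

class PooledClassifier(nn.Module):
    """Illustrative sketch: pool bidirectional transformer states, then classify.

    `encoder` stands in for a decoder backbone whose causal mask has been
    disabled (bidirectional attention). The pooling variants and dropout
    mirror the kinds of choices the paper ablates; they are not Gemma's API.
    """
    def __init__(self, encoder: nn.Module, hidden: int, num_classes: int,
                 pooling: str = "mean", dropout: float = 0.1):
        super().__init__()
        self.encoder = encoder
        self.pooling = pooling
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) token embeddings; mask: (batch, seq), 1 = real token.
        h = self.encoder(x)                       # bidirectional contextual states
        m = mask.unsqueeze(-1).float()
        if self.pooling == "mean":                # masked mean over tokens
            pooled = (h * m).sum(1) / m.sum(1).clamp(min=1.0)
        elif self.pooling == "first":             # first-token ([CLS]-style) pooling
            pooled = h[:, 0]
        else:                                     # "last": last non-padding token
            idx = mask.sum(1).long() - 1
            pooled = h[torch.arange(h.size(0)), idx]
        return self.head(self.dropout(pooled))

# Toy usage with a stand-in bidirectional backbone (no causal mask applied):
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
model = PooledClassifier(backbone, hidden=64, num_classes=3, pooling="mean")
logits = model(torch.randn(2, 10, 64), torch.ones(2, 10))  # (2, 3)
```

The same pooled representation can feed a regression or ranking head instead of a classifier; which pooling works best is an empirical question the paper's ablations address.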