Large Language Models (LLMs) are gaining popularity in a variety of use cases, from language understanding and writing to assistance in application development. One of the most important components for the optimal functionality of LLMs is the embedding layer. Word embeddings are distributed representations of words in a continuous vector space. In the context of LLMs, words or tokens from the input text are transformed into high-dimensional vectors using algorithms specific to each model. Our research examines leading embedding algorithms in the industry, such as those from OpenAI, Google's PaLM, and BERT. Using medical data, we analyzed the similarity scores produced by each embedding layer and observed differences in performance among the algorithms. To enhance each model and provide an additional encoding layer, we also implemented Siamese Neural Networks. After observing the changes in performance with the addition of this network, we measured the carbon footprint per epoch of training. The carbon footprint associated with LLMs is a significant concern and should be taken into consideration when selecting algorithms for a variety of use cases. Overall, our research compared the accuracy of different leading embedding algorithms and their carbon footprints, allowing for a holistic review of each embedding algorithm.
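As an illustration of what a similarity score between two embeddings can look like, the following is a minimal sketch of cosine similarity, a common choice for comparing embedding vectors. The abstract does not state which similarity measure was used, so cosine similarity is an assumption here. The function name `cosine_similarity` and the three-dimensional toy vectors are hypothetical and are not outputs of OpenAI, PaLM, or BERT.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real model embeddings,
# which typically have hundreds or thousands of dimensions.
toy_embeddings = {
    "hypertension": [0.9, 0.1, 0.3],
    "high blood pressure": [0.8, 0.2, 0.4],
    "fracture": [0.1, 0.9, 0.2],
}

sim_related = cosine_similarity(
    toy_embeddings["hypertension"], toy_embeddings["high blood pressure"]
)
sim_unrelated = cosine_similarity(
    toy_embeddings["hypertension"], toy_embeddings["fracture"]
)

print(f"related terms:   {sim_related:.3f}")
print(f"unrelated terms: {sim_unrelated:.3f}")
```

In this toy setup the two related medical terms score higher than the unrelated pair, which is the kind of behavior one would compare across embedding algorithms.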