Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of the retrieved documents, which in turn depends on how well the embeddings are semantically aligned with the domain's specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level-wise contrastive losses and incorporates depth-dependent penalties to align embeddings with the underlying relationships in label hierarchies. This approach improves retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark HEAL across diverse domains, including Healthcare, Material Science, Cybersecurity, and Applied Mathematics.
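The level-wise construction described above can be illustrated with a minimal sketch: a supervised contrastive loss is computed at each level of the label hierarchy, and the per-level terms are combined with a depth-dependent weight. This is an assumption-laden illustration, not the paper's implementation; the function names (`supcon_loss`, `heal_loss`), the geometric `decay` weighting, and the NumPy formulation are hypothetical choices made for clarity.

```python
import numpy as np

def supcon_loss(emb, labels, temperature=0.1):
    """Supervised contrastive (InfoNCE-style) loss over one label level."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize
    sim = z @ z.T / temperature                            # scaled cosine similarities
    sim -= sim.max(axis=1, keepdims=True)                  # numerical stability
    exp = np.exp(sim)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    denom = (exp * not_self).sum(axis=1)                   # sum over all non-self pairs
    total, anchors = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue  # anchor has no positives at this level
        total += -np.mean([np.log(exp[i, j] / denom[i]) for j in pos])
        anchors += 1
    return total / max(anchors, 1)

def heal_loss(emb, hier_labels, decay=0.5, temperature=0.1):
    """Combine per-level contrastive losses with a depth-dependent penalty.

    hier_labels: list of label arrays, coarsest level first; deeper levels
    are down-weighted geometrically (a hypothetical weighting scheme).
    """
    return sum((decay ** level) * supcon_loss(emb, labels, temperature)
               for level, labels in enumerate(hier_labels))
```

As a sanity check, embeddings whose cluster structure matches the label hierarchy should incur a lower loss than the same embeddings paired with mismatched labels, since positives then sit far apart in embedding space.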