Large semantic knowledge bases are grounded in factual knowledge. However, recent approaches to dense text representations (i.e., embeddings) do not efficiently exploit these resources. Dense and robust document representations are essential for effectively solving downstream classification and retrieval tasks. This work demonstrates that injecting embedded information from knowledge bases can improve the performance of contemporary Large Language Model (LLM)-based representations on text classification. Further, by applying automated machine learning (AutoML) to the fused representation space, we demonstrate that classification accuracy can be improved even when using low-dimensional projections of the original representation space, obtained via efficient matrix factorization. This result shows that significantly faster classifiers can be achieved with minimal or no loss in predictive performance, as demonstrated with five strong LLM baselines on six diverse real-life datasets. The code is freely available at \url{https://github.com/bkolosk1/bablfusion.git}.
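The fusion-and-projection pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes simple concatenation as the fusion operator and a truncated SVD as the matrix factorization; the embedding matrices here are random placeholders standing in for real LLM and knowledge-base embeddings.

```python
# Hypothetical sketch: fuse LLM and knowledge-base embeddings by
# concatenation, then project to a low-dimensional space via truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
n_docs = 200

# Placeholder embeddings (in practice: LLM sentence embeddings and
# pretrained knowledge-base entity embeddings aggregated per document).
llm_emb = rng.normal(size=(n_docs, 384))
kb_emb = rng.normal(size=(n_docs, 100))

# Fusion by concatenation along the feature axis.
fused = np.hstack([llm_emb, kb_emb])          # shape: (200, 484)

# Low-dimensional projection via SVD (an efficient matrix factorization);
# keeping the top-k singular directions yields a much smaller, faster
# representation for downstream classifiers.
k = 64
centered = fused - fused.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
projected = U[:, :k] * S[:k]                  # shape: (200, 64)
```

Any classifier (e.g., one selected by an AutoML system) can then be trained on `projected` instead of the full fused matrix, trading a small amount of information for a large reduction in dimensionality.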