Text embeddings have become an essential part of a variety of language applications. However, methods for interpreting, exploring and reversing embedding spaces are limited, reducing transparency and precluding potentially valuable generative use cases. In this work, we align Large Language Models to embeddings of clinical trials using the recently reported Embedding Language Model (ELM) method. We develop an open-source, domain-agnostic ELM architecture and training framework, design training tasks for clinical trials, and introduce an expert-validated synthetic dataset. We then train a series of ELMs exploring the impact of tasks and training regimes. Our final model, ctELM, can accurately describe and compare unseen clinical trials from embeddings alone and produce plausible clinical trials from novel vectors. We further show that generated trial abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects. Our public ELM implementation and experimental results will aid the alignment of Large Language Models to embedding spaces in the biomedical domain and beyond.