Public olfaction datasets are small and fragmented across single molecules and mixtures, which limits the learning of generalizable odor representations. Recent work either learns single-molecule embeddings or handles mixtures through similarity or pairwise label prediction, leaving the two kinds of representation separate and unaligned. In this work, we propose AROMMA, a framework that learns a unified embedding space for single molecules and two-molecule mixtures. Each molecule is encoded by a chemical foundation model, and mixtures are composed by an attention-based aggregator that is permutation invariant while still capturing asymmetric molecular interactions. We further align odor descriptor sets using knowledge distillation and class-aware pseudo-labeling to fill in missing mixture annotations. AROMMA achieves state-of-the-art performance on both single-molecule and molecule-pair datasets, with up to 19.1% AUROC improvement, demonstrating robust generalization across both domains.
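To illustrate how an attention-based aggregator can be permutation invariant while still modeling asymmetric interactions, the following is a minimal sketch and not the AROMMA implementation. It assumes single-head self-attention without positional information followed by mean pooling; the names `aggregate` and `attention_weights`, the 4-dimensional embeddings, and the random projection matrices `Wq`, `Wk`, `Wv` are all hypothetical stand-ins for learned components and foundation-model outputs.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(embs, Wq, Wk):
    """Row i holds how strongly molecule i attends to every molecule.

    Separate query and key projections make the weight from i to j
    differ from the weight from j to i (asymmetric interaction).
    """
    d = len(Wq)
    queries = [matvec(Wq, e) for e in embs]
    keys = [matvec(Wk, e) for e in embs]
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights


def aggregate(embs, Wq, Wk, Wv):
    """Self-attention over the set of molecules, then mean pooling.

    No positional information is used, so reordering the inputs only
    reorders the attended vectors, and the mean is unchanged.
    """
    values = [matvec(Wv, e) for e in embs]
    weights = attention_weights(embs, Wq, Wk)
    dim = len(values[0])
    attended = [
        [sum(w * v[t] for w, v in zip(row, values)) for t in range(dim)]
        for row in weights
    ]
    return [sum(a[t] for a in attended) / len(attended) for t in range(dim)]


# Hypothetical data: random vectors stand in for foundation-model embeddings.
random.seed(0)
D = 4
rand_matrix = lambda: [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]
Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
mol_a = [random.gauss(0, 1) for _ in range(D)]
mol_b = [random.gauss(0, 1) for _ in range(D)]

mixture = aggregate([mol_a, mol_b], Wq, Wk, Wv)   # two-molecule mixture
single = aggregate([mol_a], Wq, Wk, Wv)           # single molecule, same space
```

Under these assumptions, `aggregate([mol_a, mol_b], ...)` and `aggregate([mol_b, mol_a], ...)` agree up to floating-point rounding, while `attention_weights` generally assigns a different weight to a attending to b than to b attending to a. A single molecule passes through the same function, which is one simple way to place singles and mixtures in a shared space.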