We present Banyan, an improved model for learning semantic representations by inducing explicit structure over data. In contrast to prior approaches that use structure spanning only single sentences, Banyan learns by resolving multiple constituent structures into a shared one that explicitly incorporates global context. Combined with an improved message-passing scheme inspired by Griffin, Banyan learns significantly better representations, avoids spurious false negatives in contrastive learning, and drastically improves the memory efficiency of such explicit-structured models. Using the Self-StrAE framework, we show that Banyan (a) outperforms baselines that use sentential structure across various settings; (b) matches or outperforms unstructured baselines such as GloVe (+ augmentations) and a RoBERTa medium (+ SimCSE) pre-trained on 100M tokens, despite having just a handful of (non-embedding) parameters; and (c) learns effective representations across several low-resource (Asian and African) languages, as measured on SemRel tasks.
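As a loose illustration of the two mechanisms the abstract names, the sketch below builds a toy shared structure by caching identical constituents across sentences, so a repeated span resolves to a single corpus-wide node, and composes children with a sigmoid-gated blend in the spirit of Griffin-style gating. Everything here is a hypothetical assumption for illustration: the names (`compose_gated`, `get_node`), the fixed midpoint split, and the exact gating form are not Banyan's actual induction or message-passing rules.

```python
# Minimal illustrative sketch (assumptions, not the paper's implementation):
# (1) a corpus-wide cache so identical constituents resolve to one shared node,
# (2) a sigmoid-gated composition of children, loosely echoing Griffin's gating.
import numpy as np

DIM = 16
rng = np.random.default_rng(0)
W_gate = rng.normal(scale=0.1, size=(2 * DIM, DIM))  # hypothetical gate weights

def compose_gated(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Blend two child embeddings via an input-dependent sigmoid gate."""
    gate = 1.0 / (1.0 + np.exp(-np.concatenate([left, right]) @ W_gate))
    return gate * left + (1.0 - gate) * right

nodes: dict[tuple[str, ...], np.ndarray] = {}  # one node per unique span

def get_node(span: tuple[str, ...]) -> np.ndarray:
    """Return the embedding for a span, reusing it if seen anywhere before."""
    if span not in nodes:
        if len(span) == 1:
            nodes[span] = rng.normal(size=DIM)  # leaf (token) embedding
        else:
            mid = (len(span) + 1) // 2          # toy split; Banyan induces structure
            nodes[span] = compose_gated(get_node(span[:mid]), get_node(span[mid:]))
    return nodes[span]

corpus = [("the", "cat", "sat"), ("the", "cat", "ran")]
roots = [get_node(s) for s in corpus]
# ("the", "cat") is built once and shared by both sentences.
print(len(nodes), "unique constituents for", sum(map(len, corpus)), "tokens")
```

The point of the sketch is only that node reuse is what injects global context and reduces memory: in the model itself the shared structure is induced bottom-up from the data and the composition functions are trained end-to-end.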