A plethora of sentence embedding models makes it challenging to choose one, especially for technical domains rich with specialized vocabulary. In this work, we domain-adapt embeddings using telecom, health and science datasets for question answering. We evaluate embeddings obtained from publicly available models and their domain-adapted variants, on both point retrieval accuracies and their (95\%) confidence intervals. We establish a systematic method to obtain similarity-score thresholds for different embeddings. As expected, we observe that fine-tuning improves mean bootstrapped accuracies. We also observe that it yields tighter confidence intervals, which tighten further when fine-tuning is preceded by pre-training. We introduce metrics that measure the distributional overlaps of top-$K$, correct, and random document similarities with the question, and show that these metrics correlate with retrieval accuracy and similarity thresholds. Recent literature reports conflicting effects of isotropy on retrieval accuracy. Our experiments establish that the isotropy of embeddings (as measured by two independent state-of-the-art isotropy metrics) is poorly correlated with retrieval performance. We show that embeddings of domain-specific sentences have little overlap with those of domain-agnostic ones, and that fine-tuning moves them further apart. Based on our results, we provide recommendations for the use of our methodology and metrics by researchers and practitioners.