Cosine similarity is prevalent in contrastive learning, yet it makes an implicit assumption: embedding magnitude is noise. Prior work has occasionally found dot product and cosine similarity comparable, but left unanswered *what* information magnitude carries, *when* it helps, and *how* to leverage it. We conduct a systematic study through a $2 \times 2$ ablation that independently controls input-side and output-side normalization across text and vision models. Our findings reveal three key insights. First, in text retrieval, output (document) magnitude correlates strongly with relevance (Cohen's $d$ up to 1.80), yielding the largest gains on reasoning-intensive tasks. Second, input and output magnitudes serve asymmetric roles: output magnitude directly scales similarity scores, while input magnitude modulates training dynamics. Third, magnitude learning benefits asymmetric tasks (text retrieval, RAG) but harms symmetric tasks (STS, text-image alignment). These findings establish a task-symmetry principle: the choice between cosine similarity and dot product depends on whether the task assigns distinct roles to its inputs, enabling cost-free improvements by simply removing the unnecessary normalization constraint.
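To make the $2 \times 2$ ablation concrete, the following is a minimal sketch of a scoring function in which input-side (query) and output-side (document) L2 normalization are independent switches. The function name, signature, and PyTorch framing are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def similarity(query_emb: torch.Tensor,
               doc_emb: torch.Tensor,
               normalize_input: bool,
               normalize_output: bool) -> torch.Tensor:
    """Score matrix with input-side (query) and output-side (document)
    L2 normalization controlled independently.

    (True,  True)  -> cosine similarity
    (False, False) -> raw dot product
    (True,  False) and (False, True) complete the 2x2 ablation grid.
    """
    if normalize_input:
        query_emb = F.normalize(query_emb, dim=-1)  # unit-norm queries
    if normalize_output:
        doc_emb = F.normalize(doc_emb, dim=-1)      # unit-norm documents
    return query_emb @ doc_emb.T  # [num_queries, num_docs] scores
```

Under this framing, the fully normalized cell recovers cosine similarity, while disabling output-side normalization lets learned document magnitude scale the scores directly, matching the asymmetric roles described above.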
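The effect size quoted above is Cohen's $d$ computed over document-embedding norms. A standard pooled-variance computation looks like the sketch below; the grouping of norms by relevance label and the function name are assumptions for illustration:

```python
import numpy as np

def cohens_d(relevant_norms: np.ndarray, irrelevant_norms: np.ndarray) -> float:
    """Cohen's d for the gap in document-embedding norms between
    relevant and irrelevant documents, using the pooled standard deviation."""
    n1, n2 = len(relevant_norms), len(irrelevant_norms)
    var1 = relevant_norms.var(ddof=1)   # unbiased sample variances
    var2 = irrelevant_norms.var(ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return float((relevant_norms.mean() - irrelevant_norms.mean()) / pooled_sd)
```

A $d$ of 1.80 would mean the mean norm of relevant documents sits 1.8 pooled standard deviations above that of irrelevant ones, a large separation by conventional effect-size thresholds.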