Embedding models are crucial for Information Retrieval (IR) and semantic similarity tasks, yet their handling of longer texts, and the positional biases that arise with them, remains underexplored. In this study, we investigate the impact of content position and input size on text embeddings. Our experiments reveal that embedding models, irrespective of their positional encoding mechanisms, disproportionately prioritize the beginning of an input. Ablation studies show that inserting irrelevant text at, or removing text from, the start of a document reduces the cosine similarity between the altered and original embeddings by up to 12.3% more than the same ablations applied at the end. Regression analysis further confirms this bias: estimated sentence importance declines as position moves away from the start, even when the analysis is content-agnostic. We hypothesize that this effect arises from pre-processing strategies and the choice of positional encoding technique. These findings quantify the positional sensitivity of retrieval systems and suggest a new lens on embedding model robustness.
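The start-versus-end ablation described above can be sketched as follows. This is a minimal illustration of the measurement protocol only: the `embed` function here is a hypothetical hash-based bag-of-words stand-in (position-insensitive by construction), whereas the actual experiments use real embedding models, for which the prepend and append similarities would differ.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in embedding (hypothetical): hashed bag-of-words counts.
    # A real experiment would call an actual embedding model here.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

doc = "Embedding models map text to vectors for retrieval. " * 4
noise = "Completely unrelated filler sentence about cooking pasta."

# Ablation protocol: insert the same irrelevant text at the start vs. the
# end, then compare each altered embedding with the original embedding.
orig = embed(doc)
sim_start = cosine_similarity(orig, embed(noise + " " + doc))
sim_end = cosine_similarity(orig, embed(doc + " " + noise))

# The positional bias reported in the paper corresponds to sim_start being
# systematically lower than sim_end for real embedding models.
drop_gap = sim_end - sim_start
```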