Dense retrievers compress source documents into (possibly lossy) vector representations, yet there is little analysis of what information is lost versus preserved, and how this affects downstream tasks. We conduct the first analysis of the information captured by dense retrievers compared to the language models they are based on (e.g., Contriever versus BERT). We use 25 MultiBert checkpoints as randomized initialisations to train MultiContrievers, a set of 25 Contriever models. We test whether specific pieces of information, such as gender and occupation, can be extracted from Contriever vectors of Wikipedia-like documents. We measure this extractability via information-theoretic probing. We then examine the relationship of extractability to performance and gender bias, as well as the sensitivity of these results to many random initialisations and data shuffles. We find that (1) Contriever models have significantly increased extractability, but extractability usually correlates poorly with benchmark performance; (2) gender bias is present, but is not caused by the Contriever representations; and (3) there is high sensitivity to both random initialisation and data shuffle, suggesting that future retrieval research should test across a wider spread of both.