This paper introduces and analyzes a search and retrieval model for RAG-like systems under token erasures. We provide an information-theoretic analysis of remote document retrieval when the query representation is only partially preserved. The query is represented using term-frequency-based features, and semantically adaptive redundancy is assigned to each feature according to its importance. Retrieval is performed using TF-IDF-weighted similarity. We characterize the retrieval error probability by showing that the vector of similarity margins converges to a multivariate Gaussian distribution, which yields an explicit approximation and computable upper bounds. Numerical results support the analysis, and a separate data-driven evaluation using embedding-based retrieval on real-world data shows that the same importance-aware redundancy principles extend to modern retrieval pipelines. Overall, the results show that assigning higher redundancy to semantically important query features improves retrieval reliability.
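To make the setting concrete, the following is a minimal Monte Carlo sketch of importance-aware redundancy under token erasures. It is not the paper's construction: the repetition code, the independent erasure channel, the smoothed IDF formula, the toy corpus, the rule that ties and empty queries count as failures, and all function names (`idf_weights`, `transmit`, `error_rate`, and so on) are assumptions introduced here for illustration.

```python
import math
import random
from collections import Counter

def idf_weights(docs):
    """Smoothed inverse document frequency for every corpus term (assumed formula)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms outside the corpus vocabulary are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_tokens, doc_vecs, idf):
    """Index of the unique best-scoring document, or None on a tie or empty query."""
    q = tfidf(query_tokens, idf)
    scores = [cosine(q, d) for d in doc_vecs]
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    return winners[0] if best > 0 and len(winners) == 1 else None

def transmit(query_tokens, copies, p_erase, rng):
    """Repetition code (assumed): a token is lost only if all its copies are erased."""
    return [t for t in query_tokens
            if any(rng.random() >= p_erase for _ in range(copies[t]))]

def error_rate(query_tokens, target, copies, docs, p_erase, trials=2000, seed=0):
    """Fraction of trials in which the target document is not retrieved."""
    rng = random.Random(seed)
    idf = idf_weights(docs)
    doc_vecs = [tfidf(d, idf) for d in docs]
    errors = sum(
        retrieve(transmit(query_tokens, copies, p_erase, rng), doc_vecs, idf) != target
        for _ in range(trials)
    )
    return errors / trials

# Toy corpus (hypothetical); document 0 is the retrieval target.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the stock market fell on monday".split(),
]
query = "the cat on the mat".split()

# Both allocations send 10 symbols in total for the 5 query tokens.
uniform = {"the": 2, "cat": 2, "on": 2, "mat": 2}
importance_aware = {"the": 1, "cat": 3, "on": 2, "mat": 3}  # more copies for high-IDF terms

if __name__ == "__main__":
    for name, copies in [("uniform", uniform), ("importance-aware", importance_aware)]:
        print(name, error_rate(query, 0, copies, docs, p_erase=0.5))
```

In this toy corpus only "cat" and "mat" distinguish the target from the second document, so retrieval succeeds exactly when at least one of them survives. Under the stated assumptions the failure probability is p^4 for the uniform allocation and p^6 for the importance-aware one at erasure probability p, which illustrates, at equal transmission budget, why shifting redundancy toward high-IDF features can help.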