Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain-specific hard negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic ablations reveal that multi-task training improves external generalization despite minor costs on in-benchmark performance, and that domain-pretrained initialization provides consistent advantages across task families. All resources will be made available at https://github.com/iliass-y/patenteb.

Keywords: patent retrieval, sentence embeddings, multi-task learning, asymmetric retrieval, benchmark evaluation, contrastive learning.