Lossless compression has significantly advanced the storage, sharing, and management of Genomics Data (GD). Current learning-based methods, however, are non-evolvable and suffer from low-level compression modeling, limited adaptability, and user-unfriendly interfaces. To this end, we propose AgentGC, the first evolutionary Agent-based GD Compressor, consisting of three layers and a multi-agent design with agents named Leader and Worker. Specifically, 1) the User layer provides a user-friendly interface via the Leader combined with an LLM; 2) the Cognitive layer, driven by the Leader, integrates an LLM to jointly optimize algorithm, dataset, and system, addressing the issues of low-level modeling and limited adaptability; and 3) the Compression layer, headed by the Worker, performs compression and decompression via an automated multi-knowledge learning-based compression framework. On top of AgentGC, we design three modes to support diverse scenarios: CP for compression-ratio priority, TP for throughput priority, and BM for balanced mode. Compared with 14 baselines on 9 datasets, the average compression ratio gains of the three modes are 16.66%, 16.11%, and 16.33%, and the throughput gains are 4.73x, 9.23x, and 9.15x, respectively.