LaQual: An Automated Framework for LLM App Quality Evaluation

LLM app stores are rapidly emerging as a new paradigm in software distribution, offering users diverse choices for content generation, coding assistance, education, and more. However, current ranking and recommendation mechanisms in these stores rely predominantly on static metrics, such as user interactions and favorites, which makes it difficult for users to identify high-quality apps efficiently. Meanwhile, existing academic research focuses on specific vertical domains and lacks a general, automated evaluation framework applicable to the diverse LLM app ecosystem. To address these challenges, we present LaQual, an automated framework for LLM app quality evaluation. LaQual integrates three key stages: (1) LLM app labeling and hierarchical classification for precise scenario mapping; (2) static indicator evaluation, which uses time-weighted user engagement and functional capability indicators to filter out low-quality apps; and (3) dynamic scenario-adapted evaluation, in which an LLM generates scenario-specific evaluation metrics, scoring criteria, and tasks for comprehensive quality evaluation. Experiments on a mainstream LLM app store demonstrate the effectiveness of LaQual: its automated scores show high consistency with human judgments, and through effective screening it reduces the candidate LLM app pool by 66.7% to 81.3%. User studies further show that LaQual significantly outperforms baseline systems, particularly in comparison efficiency (mean 5.45 vs. 3.30) and the value of explanatory information (4.75 vs. 2.25). These results demonstrate that LaQual provides a scalable, objective, and user-centric solution for discovering and recommending high-quality LLM apps in real-world scenarios.
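
To make stage (2) more concrete, the following is a minimal sketch of how a time-weighted engagement filter might look. The abstract does not specify the weighting scheme, so everything here is an assumption made for illustration: the exponential half-life decay, the `App` fields, the `has_capabilities` flag, and the `min_score` threshold are hypothetical and are not taken from LaQual itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class App:
    name: str
    scenario: str           # label assumed to come from stage (1)
    interactions: int       # raw user engagement count
    last_updated: date      # used as the reference point for time weighting
    has_capabilities: bool  # hypothetical functional capability indicator


def time_weighted_engagement(app: App, today: date,
                             half_life_days: float = 180.0) -> float:
    """Discount engagement by age; exponential decay is an assumed choice."""
    age_days = max((today - app.last_updated).days, 0)
    return app.interactions * 0.5 ** (age_days / half_life_days)


def static_filter(apps: list[App], today: date, min_score: float) -> list[App]:
    """Keep apps that pass the capability check and the engagement threshold."""
    return [
        app for app in apps
        if app.has_capabilities
        and time_weighted_engagement(app, today) >= min_score
    ]
```

Under these assumptions, an app with 1000 interactions that was last updated one half-life ago would score 500, so a stale but once-popular app can drop below a threshold that a smaller, recently active app still clears. Only the apps that survive this filter would then go on to the more expensive LLM-driven evaluation in stage (3).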
