Benchmarking Papers - 专知

Benchmarking

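Benchmarking is the quantitative, comparable measurement of a given performance metric for a class of test subjects, carried out with scientifically designed test methods, test tools, and test systems.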

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

arXiv

0+ reads · June 16

Human-on-the-Bridge: Scalable Evaluation for AI Agents

arXiv

0+ reads · June 15

The Right Call for Software Benchmarking: Consistent Decisions in Stateful Environments

arXiv

0+ reads · June 15

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

arXiv

0+ reads · June 16

HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning

arXiv

0+ reads · June 16

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

arXiv

0+ reads · June 16

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

arXiv

0+ reads · June 16

LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)

arXiv

0+ reads · June 16

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

arXiv

0+ reads · June 16

The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act

arXiv

0+ reads · June 16

GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence

arXiv

0+ reads · June 15

Offline Preference-Based Trajectory Evaluation

arXiv

0+ reads · June 16

Geometric Consistency Protocol for Foundation Model Features in Multi-View Satellite Imagery

arXiv

0+ reads · June 16

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

arXiv

0+ reads · June 16

PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents

arXiv

0+ reads · June 16
