A team of Stanford University researchers has released MedAgentBench, a new benchmark suite designed to evaluate large language model (LLM) agents in healthcare contexts. Unlike prior question-answering datasets, MedAgentBench provides a virtual electronic health record (EHR) environment where AI systems must interact, plan, and execute multi-step clinical tasks. This marks a significant shift from testing static reasoning to assessing agentic capabilities in live, tool-based clinical workflows.


Why Do We Need Agentic Benchmarks in Healthcare?
Recent LLMs have moved beyond static chat-based interactions toward agentic behavior: interpreting high-level instructions, calling APIs, integrating patient data, and automating complex processes. In medicine, this evolution could help address staff shortages, documentation burden, and administrative inefficiencies.
While general-purpose agent benchmarks (e.g., AgentBench, AgentBoard, tau-bench) exist, healthcare lacked a standardized benchmark that captures the complexity of medical data, FHIR interoperability, and longitudinal patient records. MedAgentBench fills this gap by offering a reproducible, clinically relevant evaluation framework.
What Does MedAgentBench Include?
How Are the Tasks Structured?
MedAgentBench consists of 300 tasks across 10 categories, written by licensed physicians. These tasks include patient information retrieval, lab result monitoring, documentation, test ordering, referrals, and medication management. Tasks average 2–3 steps and mirror workflows encountered in inpatient and outpatient care.
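As a rough illustration of what one such clinician-authored task might look like as data, here is a minimal sketch; the field names are hypothetical and not the benchmark's actual schema:

```python
# Hypothetical structure of a MedAgentBench-style task record.
# Field names are illustrative assumptions, not the published schema.
task = {
    "category": "medication_management",   # one of the 10 task categories
    "instruction": ("Order a basic metabolic panel for the patient "
                    "and record the most recent potassium value."),
    "patient_id": "S123",                  # links to one of the 100 profiles
    "expected_steps": 3,                   # tasks average 2-3 steps
}

print(task["category"], task["expected_steps"])
```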
What Patient Data Supports the Benchmark?
The benchmark leverages 100 realistic patient profiles extracted from Stanford's STARR data repository, comprising over 700,000 records including labs, vitals, diagnoses, procedures, and medication orders. Data was de-identified and jittered for privacy while preserving clinical validity.
How Is the Environment Built?
The environment is FHIR-compliant, supporting both retrieval (GET) and modification (POST) of EHR data. AI systems can simulate realistic clinical interactions such as documenting vitals or placing medication orders. This design makes the benchmark directly translatable to live EHR systems.
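To make the GET/POST pattern concrete, here is a minimal sketch of how an agent's requests against a FHIR server could be constructed. This is illustrative only (the base URL and helper functions are assumptions, not the benchmark's actual client code); the FHIR search syntax and the Observation resource shape follow the HL7 FHIR standard:

```python
import json

# Hypothetical FHIR server base URL, for illustration only.
FHIR_BASE = "http://localhost:8080/fhir"

def build_observation_query(patient_id: str, loinc_code: str) -> str:
    """GET-style retrieval: search Observations for a patient by LOINC code."""
    return f"{FHIR_BASE}/Observation?patient={patient_id}&code={loinc_code}"

def build_vitals_observation(patient_id: str, heart_rate_bpm: int) -> dict:
    """POST-style modification: a minimal FHIR Observation resource
    documenting a heart-rate vital sign (LOINC 8867-4)."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": "http://loinc.org",
                             "code": "8867-4",
                             "display": "Heart rate"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": heart_rate_bpm, "unit": "beats/minute"},
    }

if __name__ == "__main__":
    print(build_observation_query("S123", "8867-4"))
    print(json.dumps(build_vitals_observation("S123", 72), indent=2))
```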
How Are Models Evaluated?
- Metric: Task success rate (SR), measured with strict pass@1 to reflect real-world safety requirements.
- Models Tested: 12 leading LLMs including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek-V3, Qwen2.5, and Llama 3.3.
- Agent Orchestrator: A baseline orchestration setup with 9 FHIR functions, limited to 8 interaction rounds per task.
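Under strict pass@1, each task gets a single attempt with no partial credit, so the success rate reduces to the fraction of tasks solved. A minimal sketch of the metric (illustrative, not the paper's evaluation harness):

```python
def pass_at_1_success_rate(results: list[bool]) -> float:
    """Strict pass@1: one attempt per task, all-or-nothing scoring.
    The success rate (SR) is the fraction of tasks solved."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Example: 209 of 300 tasks succeed, matching the scale of the
# reported top score.
rate = pass_at_1_success_rate([True] * 209 + [False] * 91)
print(f"{rate:.2%}")  # 69.67%
```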
Which Models Performed Best?
- Claude 3.5 Sonnet v2: Best overall with 69.67% success, especially strong on retrieval tasks (85.33%).
- GPT-4o: 64.0% success, showing balanced retrieval and action performance.
- DeepSeek-V3: 62.67% success, leading among open-weight models.
- Observation: Most models excelled at query tasks but struggled with action-based tasks requiring safe multi-step execution.


What Errors Did Models Make?
Two dominant failure patterns emerged:
- Instruction adherence failures: invalid API calls or incorrect JSON formatting.
- Output mismatch: providing full sentences when structured numerical values were required.
These errors highlight gaps in precision and reliability, both critical for clinical deployment.
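Both failure modes can be caught mechanically before an agent's output reaches the EHR. A hedged sketch of such a guardrail, assuming (as an illustration, not the paper's setup) that the orchestrator expects a numeric value in a named JSON field:

```python
import json

def validate_numeric_answer(raw: str, field: str = "value"):
    """Return the numeric value if the agent's output is well-formed,
    else None. Rejects malformed JSON (instruction-adherence failure)
    and non-numeric payloads (output-mismatch failure)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None  # invalid JSON: instruction-adherence failure
    value = payload.get(field) if isinstance(payload, dict) else None
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value
    return None  # e.g., a full sentence where a number was required

print(validate_numeric_answer('{"value": 7.2}'))              # 7.2
print(validate_numeric_answer('The lab value is 7.2 mg/dL'))  # None
print(validate_numeric_answer('{"value": "about seven"}'))    # None
```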
Summary
MedAgentBench establishes the first large-scale benchmark for evaluating LLM agents in realistic EHR settings, pairing 300 clinician-authored tasks with a FHIR-compliant environment and 100 patient profiles. Results show strong potential but limited reliability (Claude 3.5 Sonnet v2 leads at 69.67%), highlighting the gap between query success and safe action execution. While constrained by single-institution data and an EHR-focused scope, MedAgentBench provides an open, reproducible framework to drive the next generation of trustworthy healthcare AI agents.
Check out the paper and technical blog, and see the GitHub page for tutorials, code, and notebooks.