Benchmarking Open-Ended Inference Optimization by AI Agents

We evaluate whether frontier coding agents can optimize LLM serving workloads under a fixed compute budget, and find that their main bottleneck is experimental discipline rather than knowledge of relevant techniques.

Main results

Across all four scenarios, agents outperform the PyTorch baseline and most engine defaults but trail matched-budget hyperparameter search.

Aggregate performance, geometric mean speedup

Agents achieve 8.08× geometric-mean speedup over the PyTorch baseline, compared to 11.53× for matched-budget hyperparameter search.
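As a reminder of how a geometric-mean speedup is formed, the sketch below computes one from a list of per-scenario speedups. The input is the "Default auto" row of the forced-engine table later on this page, used purely as an illustration. The function name `geometric_mean_speedup` is hypothetical, and the benchmark's exact aggregation (for example, how runs and agents are pooled) may differ from this.

```python
import math


def geometric_mean_speedup(speedups):
    """Geometric mean of positive speedup ratios, computed in log space."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Illustrative input: the "Default auto" row of the forced-engine table.
default_auto = [3.53, 2.24, 25.84, 3.25]
print(f"{geometric_mean_speedup(default_auto):.2f}x")
```

Compared with an arithmetic mean, the geometric mean is less dominated by a single large ratio such as the throughput column.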

Benchmark setup

Each benchmark run targets one serving bottleneck, then scores only the final submitted server after correctness and integrity checks.

Benchmark flow

Each run gives the agent a base model, hardware environment, and a two-hour wall-clock budget to produce an OpenAI-compatible inference server. The objective is speedup over the PyTorch baseline on one bottleneck scenario, or on the balanced all-in-one scenario.

Final submissions must pass correctness checks and an integrity audit for reward hacking. If the final server fails these checks, is unreachable, or regresses below the PyTorch baseline, the run is scored at the PyTorch baseline; earlier intermediate results do not count.
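The scoring rule above can be sketched as a small function. This is an illustration under stated assumptions, not the benchmark's implementation: the `FinalSubmission` type and its field names are hypothetical, and "scored at the PyTorch baseline" is taken to mean a speedup of 1.0.

```python
from dataclasses import dataclass

BASELINE_SPEEDUP = 1.0  # assumption: the PyTorch baseline relative to itself


@dataclass
class FinalSubmission:
    reachable: bool
    passes_correctness: bool
    passes_integrity_audit: bool
    measured_speedup: float  # speedup over the PyTorch baseline


def score_run(sub: FinalSubmission) -> float:
    """Score only the final server; intermediate results are ignored."""
    valid = (
        sub.reachable
        and sub.passes_correctness
        and sub.passes_integrity_audit
    )
    if not valid:
        return BASELINE_SPEEDUP
    # A valid server that regresses below the baseline is floored as well.
    return max(sub.measured_speedup, BASELINE_SPEEDUP)


print(score_run(FinalSubmission(True, True, True, 4.2)))   # valid improvement
print(score_run(FinalSubmission(True, True, False, 9.0)))  # fails the audit
```

Under this rule, a high measured speedup is worth nothing if the final server fails any check, which is why preserving a valid configuration matters as much as finding a fast one.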

Four scenarios

Four serving scenarios, each targeting a distinct bottleneck.

Prefill Latency

Long-context prompts; measured as time to first token.

Decode Latency

Long generations; measured as time per output token.

Throughput

Concurrent traffic; measured across burst, Poisson, and constant-rate profiles.

All-In-One

Balanced serving; geometric mean of latency and throughput metrics.
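The metrics and traffic profiles named above can be sketched as follows. All function names are hypothetical. The time-per-output-token convention used here (excluding the first token) is one common choice and an assumption about this benchmark, as is modeling a burst as all requests sent at the same instant.

```python
import random


def time_to_first_token(request_sent: float, token_times: list) -> float:
    """Seconds from sending the request to receiving the first token."""
    return token_times[0] - request_sent


def time_per_output_token(token_times: list) -> float:
    """Mean gap between consecutive output tokens, excluding the first token."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def arrival_offsets(profile: str, n: int, rate: float, seed: int = 0) -> list:
    """Send times in seconds for n requests under a named traffic profile."""
    if profile == "burst":
        return [0.0] * n  # assumption: everything arrives at once
    if profile == "constant":
        return [i / rate for i in range(n)]
    if profile == "poisson":
        rng = random.Random(seed)
        t, offsets = 0.0, []
        for _ in range(n):
            t += rng.expovariate(rate)  # exponential inter-arrival gaps
            offsets.append(t)
        return offsets
    raise ValueError(f"unknown profile: {profile}")


# Example: three tokens received 0.5 s, 0.6 s, and 0.7 s after the request.
tokens = [10.5, 10.6, 10.7]
print(time_to_first_token(10.0, tokens), time_per_output_token(tokens))
print(arrival_offsets("constant", 3, rate=2.0))
```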

Leaderboard

Per-agent aggregate performance.

By default, agents are ranked by the aggregate speedup of their final valid submitted server; selecting a scenario focus reranks them by a single bottleneck.

Top agent: Claude Sonnet 4.6

Sonnet 4.6 ranks first by combining competitive per-scenario speedups with reliably valid final submissions. Several larger models reach higher peak configurations during the run but submit a degraded or invalid server.

Leaderboard columns: Rank, Model / Agent, Aggregate, Prefill, Decode, Throughput, All-In-One

Key finding: Aggregate rank reflects repeatable final-server quality.

Agents that find one excellent configuration but fail many other runs score below agents that repeatedly submit valid improvements. This matters for low-supervision automated R&D, where consistency is part of the result.
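A small worked example makes the effect concrete. The per-run speedups below are invented for illustration, and the sketch assumes that runs are pooled with a geometric mean and that failed runs score 1.0; the benchmark's actual pooling may differ.

```python
import math


def aggregate(run_speedups):
    """Geometric mean over per-run speedups (assumed pooling rule)."""
    return math.exp(sum(math.log(s) for s in run_speedups) / len(run_speedups))


# Hypothetical agents: one excellent run and three failures scored at 1.0,
# versus four modest but valid improvements.
peaky = [20.0, 1.0, 1.0, 1.0]
steady = [4.0, 4.0, 4.0, 4.0]

print(f"peaky:  {aggregate(peaky):.2f}x")
print(f"steady: {aggregate(steady):.2f}x")
```

With these numbers the steady agent ranks higher, even though the peaky agent found a configuration five times faster than anything the steady agent submitted.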

Scenario Breakdown

Median final-agent outcome, agent range, and matched search baseline per scenario. Higher is better.

Top Agents Across Scenarios

Per-scenario speedup of the leading agents' final submitted servers.

Agent traces

What the transcripts sound like

Premature stop

Baseline performance: generation throughput 63.53 tokens/s. TTFT p50 51.8 ms, p90 400 ms. ITL p50 10.2 ms. TPOT p50 15.7 ms. Quality check: PASS.

The key metric for Scenario B is generation throughput... Let me stop the current server and optimize.

The current server is already valid and passing the checks. Since further changes risk breaking the deployment, I will keep this configuration for final evaluation.

The agent treats the first valid server as terminal rather than as a baseline for further optimization.

Rollback discipline

V1 improved the short speed-only probe but did not hold up in the longer end-to-end path, so I’m treating it as unstable for the final submission.

I’m using the remaining time on lower-risk V0 variants; the next one is eager execution, which might trim per-request overhead without changing the serving surface.

--enforce-eager is promising: TTFT improved modestly and, more importantly, the tail got tighter than the current default.

This run exemplifies the propose-measure-compare-rollback cycle that the benchmark rewards.
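One way to sketch that cycle in code is shown below. It is an illustration only: `optimize`, the `measure` callback, and the `min_gain` threshold are hypothetical names, and `measure` is assumed to return a higher-is-better score, or `None` when a candidate fails validation.

```python
from typing import Callable, Optional


def optimize(
    baseline: dict,
    candidates: list,
    measure: Callable[[dict], Optional[float]],
    min_gain: float = 1.02,
):
    """Propose, measure, compare, and roll back unless clearly better."""
    best_cfg = baseline
    best_score = measure(baseline)
    if best_score is None:
        raise RuntimeError("baseline must be valid before optimizing")
    history = []
    for cand in candidates:
        score = measure(cand)  # None means the candidate failed validation
        accepted = score is not None and score >= best_score * min_gain
        history.append((cand, score, accepted))
        if accepted:
            best_cfg, best_score = cand, score
        # Otherwise roll back: best_cfg stays as the last accepted config.
    return best_cfg, best_score, history


# Demo with made-up scores standing in for real benchmark measurements.
fake_scores = {"v0": 1.00, "v1": None, "v0-eager": 1.10, "v0-other": 0.90}
best, score, _ = optimize(
    {"name": "v0"},
    [{"name": "v1"}, {"name": "v0-eager"}, {"name": "v0-other"}],
    lambda cfg: fake_scores[cfg["name"]],
)
print(best, score)
```

The threshold guards against accepting changes whose apparent gain is within measurement noise; its value here is arbitrary.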

Controlled candidate

candidate A: increase --max-num-seqs to improve batching headroom

candidate B: enable --enable-prefix-caching to reduce repeated-prefix cost

candidate C: change KV-cache dtype to reduce memory pressure

Stronger runs isolate single-variable changes and retain only those that show measured improvement.
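The three candidates above could be expressed as one-flag-at-a-time variants of a base command, as in this sketch. The flag names follow the vLLM command line; `MODEL_ID` is a placeholder, the flag values (`512`, `fp8`) are arbitrary illustrations rather than values from the transcripts, and the helper names are hypothetical.

```python
BASE_CMD = ["vllm", "serve", "MODEL_ID"]  # MODEL_ID is a placeholder

# Each candidate changes exactly one variable relative to the base command.
CANDIDATES = {
    "A_max_num_seqs": ["--max-num-seqs", "512"],
    "B_prefix_caching": ["--enable-prefix-caching"],
    "C_kv_cache_dtype": ["--kv-cache-dtype", "fp8"],
}


def single_variable_commands(base, candidates):
    """Build one command per candidate, each differing from base by one change."""
    return {name: base + flags for name, flags in candidates.items()}


def keep_improvements(baseline_score, measured):
    """Retain only candidates whose measured score beats the baseline."""
    return [name for name, s in measured.items() if s > baseline_score]


for name, cmd in single_variable_commands(BASE_CMD, CANDIDATES).items():
    print(name, " ".join(cmd))

# Made-up measurements, higher is better.
print(keep_improvements(1.0, {"A_max_num_seqs": 1.3, "B_prefix_caching": 0.98}))
```

Because each command differs from the base by a single change, any measured difference can be attributed to that change.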

Deep dive

Where agent gains come from.

Additional time helps initially, but gains saturate quickly and reward-hacking behavior increases at longer budgets.

Time Budget Ablation

Aggregate speedup versus native PyTorch as the per-run time budget increases. Longer runs also show more late-stage regression and reward-hacking pressure.

Takeaway

Most of the speedup is captured within the first two hours. Beyond that, we observe increased reward hacking, late-stage destabilizing edits, and more invalid final submissions.

Forced-Engine Comparison

Starting from vLLM improves reliability, and forcing an engine helps when it suits the scenario, but neither closes the gap to non-agent search.

Configuration	A Prefill	B Decode	C Throughput	D All-In-One
Default auto	3.53×	2.24×	25.84×	3.25×
vLLM-only	3.71×	4.08×	27.41×	3.48×
SGLang-only	4.62×	6.84×	29.76×	3.71×
TGI-only	3.74×	3.96×	61.38×	5.24×
Per-scenario best non-agent search	5.06×	15.23×	89.00×	6.10×

Takeaway

Restricting agents to a single engine improves reliability and matches certain scenarios well (e.g., TGI on throughput), but no engine-restricted configuration reaches the non-agent search baseline.
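The size of the remaining gap can be read directly off the table. The sketch below copies the table values and reports, for each scenario, the best agent configuration and the fraction of the non-agent search baseline it reaches; the function name is hypothetical.

```python
SCENARIOS = ["Prefill", "Decode", "Throughput", "All-In-One"]

# Speedups copied from the forced-engine comparison table.
AGENT_CONFIGS = {
    "Default auto": [3.53, 2.24, 25.84, 3.25],
    "vLLM-only": [3.71, 4.08, 27.41, 3.48],
    "SGLang-only": [4.62, 6.84, 29.76, 3.71],
    "TGI-only": [3.74, 3.96, 61.38, 5.24],
}
SEARCH_BEST = [5.06, 15.23, 89.00, 6.10]


def best_agent_fraction():
    """Per scenario: (best agent configuration, fraction of search baseline)."""
    result = {}
    for i, scenario in enumerate(SCENARIOS):
        name, row = max(AGENT_CONFIGS.items(), key=lambda kv: kv[1][i])
        result[scenario] = (name, row[i] / SEARCH_BEST[i])
    return result


for scenario, (name, frac) in best_agent_fraction().items():
    print(f"{scenario}: {name} reaches {frac:.0%} of the search baseline")
```

On these numbers the best agent configuration reaches roughly 91% of the search baseline on prefill, 45% on decode, 69% on throughput, and 86% on all-in-one, so the decode scenario shows the largest gap.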

Insights

Failure modes.

Across 180 runs, agents identify appropriate optimizations in transcripts but fail to validate them, commit to them, or preserve them in the final submitted server.

Run outcomes across 180 runs

Agent behavior patterns

93.9% of runs submit a vLLM-based server.

Transcripts repeatedly name relevant optimizations, while the measured search remains shallow.

Chunked prefill: 97%; quantization: 96%; speculative decoding: 84%.

Distribution of distinct non-default configs per run

Found vs submitted

Found configurations often disappear before submission because agents fail to validate and preserve them.
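A simple safeguard against this failure is to persist the best validated configuration as soon as it is measured, then restore it for the final submission. The sketch below is one hypothetical way to do that; `BestConfigLedger`, its methods, and the JSON layout are assumptions for illustration and are not part of the benchmark.

```python
import json
from pathlib import Path


class BestConfigLedger:
    """Keeps the best validated configuration on disk across edits."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        """Return the stored record, or None if nothing has been recorded."""
        if not self.path.exists():
            return None
        return json.loads(self.path.read_text())

    def record(self, config: dict, speedup: float, validated: bool) -> bool:
        """Store config only if it is validated and beats the stored best."""
        if not validated:
            return False
        best = self.load()
        if best is not None and speedup <= best["speedup"]:
            return False
        self.path.write_text(json.dumps({"config": config, "speedup": speedup}))
        return True
```

Before submitting, an agent following this pattern would call `load()` and redeploy the stored configuration rather than whatever happens to be running after the last edit.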

Speedup at each stage and drop relative to the previous stage.

Agents are aware of relevant techniques but conduct shallow search and frequently fail to preserve their best configuration in the final submission.

Team

Jehyeok Yeon¹,² Ben Rank¹,²,³ Maksym Andriushchenko¹,²,³

¹ELLIS Institute Tübingen ²Max Planck Institute for Intelligent Systems ³Tübingen AI Center