[Leaderboard] Altimate Code (GPT-5.5 + Claude Sonnet 4.6 AutoContext) — Pass@1 0.6893 by sahrizvi · Pull Request #53 · ucbepic/DataAgentBench

sahrizvi · 2026-05-27T20:14:12Z

Altimate Code — Leaderboard Submission (R27)

Agent name: Altimate Code
Project page: altimate.sh

Component	Model	Role
Trial-time backbone	GPT-5.5 (Azure AI Foundry, deployment alias `gpt-5-chat`, Chat Completions API)	Answers each of the 270 trials
AutoContext author	Claude Sonnet 4.6 (Google Vertex AI)	One-shot per dataset — produces a schema-orientation document (joins, encodings, format quirks, sampled rows). GT-firewalled (no `ground_truth.csv` read path).

Hints: Yes (db_description_withhint.txt injected into the user prompt; AutoContext document also dropped into each trial workspace)
Trials: 5 per query
Consensus: K=3 sub-trials per trial; the most frequent answer wins
Total trials: 270 (54 queries across 12 datasets × 5 trials per query)

Prior submission

This supersedes our earlier submission #44 (Altimate Code + Claude Sonnet 4.6, 0.6040 stratified, 2026-05-10). The harness is the same; the trial-time backbone has been swapped from Claude Sonnet 4.6 to GPT-5.5, and a K=3 consensus pass has been added per trial. AutoContext (authored by Sonnet 4.6) is new in this submission relative to #44.

Result

The 270 trial answers were scored against two validator versions of the benchmark — the version we ran against, and the current upstream main at submission time.

Metric	Trial-time validators (`9031c68ad`)	Latest validators (`634cd61ad`)
Stratified Pass@1 (leaderboard metric)	0.6893	0.6893
Micro Pass@1 (passes / trials)	0.7444 (201 / 270)	0.7444 (201 / 270)
Total trials	270	270

Validator-version note. Trials ran against vendor/DataAgentBench at commit 9031c68ad. Upstream subsequently merged 18 validate.py updates and one ground-truth correction (music_brainz_20k/q2: Amazon Music → iTunes) for the 12 datasets covered. Re-applying the latest validators (commit 634cd61ad) against the same 270 saved answers produced an identical pass/fail distribution. DB content for our 12 datasets was unchanged between the two commits.

Per-dataset stratified Pass@1

Dataset	Pass@1	Passed / Trials
agnews	1.000	20/20
bookreview	1.000	15/15
stockindex	1.000	15/15
yelp	0.971	34/35
crmarenapro	0.831	54/65
googlelocal	0.750	15/20
stockmarket	0.720	18/25
music_brainz_20k	0.667	10/15
DEPS_DEV_V1	0.500	5/10
GITHUB_REPOS	0.500	10/20
PANCANCER_ATLAS	0.333	5/15
PATENTS	0.000	0/15
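
Both headline metrics can be recomputed from the table above. The sketch below assumes that stratified Pass@1 is the unweighted mean of the per-dataset pass rates and that micro Pass@1 is total passes over total trials. That reading is an assumption rather than the benchmark's stated definition, although it reproduces both reported figures.

```python
# (passes, trials) per dataset, copied from the table above.
results = {
    "agnews": (20, 20),
    "bookreview": (15, 15),
    "stockindex": (15, 15),
    "yelp": (34, 35),
    "crmarenapro": (54, 65),
    "googlelocal": (15, 20),
    "stockmarket": (18, 25),
    "music_brainz_20k": (10, 15),
    "DEPS_DEV_V1": (5, 10),
    "GITHUB_REPOS": (10, 20),
    "PANCANCER_ATLAS": (5, 15),
    "PATENTS": (0, 15),
}


def stratified_pass_at_1(results):
    """Unweighted mean of per-dataset pass rates (assumed definition)."""
    rates = [passed / trials for passed, trials in results.values()]
    return sum(rates) / len(rates)


def micro_pass_at_1(results):
    """Total passes divided by total trials."""
    passed = sum(p for p, _ in results.values())
    trials = sum(t for _, t in results.values())
    return passed / trials


print(f"stratified: {stratified_pass_at_1(results):.4f}")  # stratified: 0.6893
print(f"micro:      {micro_pass_at_1(results):.4f}")       # micro:      0.7444
```

The gap between the two numbers comes from weighting: the micro figure is pulled up by the large, high-scoring crmarenapro and yelp datasets, while the stratified figure counts PATENTS (0/15) as heavily as any other dataset.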

Architecture

A heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime, that:

- Reads each dataset's db_description_withhint.txt (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
- Reads a per-dataset AutoContext document, authored once by Claude Sonnet 4.6 before any trials run, for schema-orientation notes: column annotations, verified join keys, sample-row encodings, NULL semantics, and entity-resolution caveats. Sonnet authors this with access to the warehouse schema and 5 sampled rows per table; no ground_truth.csv access path exists in the author code (dab_bench/auto_context.py).
- Uses native data tools (schema_index, schema_search, schema_inspect, sql_execute, warehouse_list, fuzzy_match, explore_dataset) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
- Validates output shape before commit via a validate_shape CLI that compares the draft ANSWER against format_hint.txt for row/field/separator compliance (no ground_truth.csv access).
- Aggregates K=3 sub-trials per trial by exact-match voting on the ANSWER file string. The top count wins; ties break to the first.
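
The aggregation step in the last item can be sketched as follows. This is a minimal illustration, not the harness code: the function name `consensus` is hypothetical, and "ties break to the first" is read here as "the tied answer from the earliest sub-trial wins", which is an assumption.

```python
from collections import Counter


def consensus(answers):
    """Pick the most frequent exact ANSWER string among the sub-trials.

    Ties go to the tied answer that appears earliest in `answers`
    (assumed meaning of "ties break to the first").
    """
    if not answers:
        raise ValueError("need at least one sub-trial answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:  # preserves sub-trial order
        if counts[answer] == best:
            return answer


print(consensus(["42", "41", "42"]))  # 42
print(consensus(["a", "b", "c"]))     # a
```

Because the comparison is on the raw string, two answers that differ only in whitespace or number formatting count as different votes.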

Reproducibility

The submission JSON in this PR contains 270 records of {dataset, query, run, answer} — one per (dataset, query, trial). Per-trial event logs (events.jsonl), result.json, and stderr.log for each sub-trial are available as an out-of-band trace bundle on request (~475 MB).
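
A quick structural check on the submission file might look like the sketch below. It assumes the JSON top level is a list of objects carrying the four keys named above; the helper `check_submission` is hypothetical, and the demo data is synthetic rather than taken from the real file.

```python
from collections import defaultdict

REQUIRED_KEYS = {"dataset", "query", "run", "answer"}


def check_submission(records, trials_per_query=5):
    """Return the number of distinct (dataset, query) pairs, or raise ValueError."""
    runs = defaultdict(list)
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
        runs[(rec["dataset"], rec["query"])].append(rec["run"])
    for key, seen in runs.items():
        # Each query needs exactly `trials_per_query` distinct run ids.
        if len(seen) != trials_per_query or len(set(seen)) != trials_per_query:
            raise ValueError(f"{key} has unexpected runs: {seen}")
    return len(runs)


# Synthetic stand-in for the real file: 54 queries x 5 runs = 270 records.
demo = [
    {"dataset": "demo", "query": q, "run": r, "answer": ""}
    for q in range(54)
    for r in range(5)
]
print(len(demo), check_submission(demo))  # 270 54
```

Against the real file, the records would be loaded with `json.load` first and passed to the same function.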

Reproduce locally with:

# Prereqs: postgres on :55432, mongodb on :57017, an Azure AI Foundry deployment
# named "gpt-5-chat" fronted by scripts_python/azure_foundry_proxy.py (rewrites
# max_tokens → max_completion_tokens), and a Google Vertex project for the
# Sonnet 4.6 AutoContext author pass.

GOOGLE_CLOUD_PROJECT=<your-gcp-project> GOOGLE_CLOUD_LOCATION=us-east5 \
AZURE_FOUNDRY_BASE_URL=http://localhost:9997/v1 \
AZURE_FOUNDRY_API_KEY=<your-azure-foundry-key> \
AZURE_FOUNDRY_MODELS=gpt-5-chat \
PG_HOST=127.0.0.1 PG_PORT=55432 PG_USER=postgres PG_PASSWORD=postgres \
MONGO_URI="mongodb://127.0.0.1:57017/" \
uv run python scripts_python/run_benchmark.py \
  --dab-root vendor/DataAgentBench \
  --datasets agnews bookreview crmarenapro DEPS_DEV_V1 GITHUB_REPOS googlelocal \
             music_brainz_20k PANCANCER_ATLAS PATENTS stockindex stockmarket yelp \
  --trials 5 --concurrency 5 --consensus-k 3 \
  --profile bash --runtime altimate \
  --model azure-foundry/gpt-5-chat \
  --max-turns 75 --timeout-sec 2000 --yolo --prepare-external \
  --autocontext --autocontext-model claude-sonnet-4-6 \
  --experiments-dir baseline_runs --run-name run27_azure_gpt55

Limitations disclosed for completeness

PATENTS (0/15): every PATENTS trial produced a well-formed CSV answer but matched a different subset of CPC codes than the reference set. Failure mode is query-interpretation (EMA initialization convention and CPC hierarchy-level definition are under-specified in the question), not format or harness. We chose not to add per-dataset hand-tuning.
PANCANCER_ATLAS (5/15): failures are concentrated in q2/q3, where the agent grouped on the coded histology column instead of the descriptive one. The AutoContext Operational Rule for histology was authored unconditionally ("use icd_o_3_histology NOT histological_type"), which fails q2/q3, whose GT expects descriptive labels. A planned next-iteration fix updates the AutoContext author prompt to emit conditional rules.

🤖 PR description drafted with Claude Code
Edits: Formatting

sahrizvi · 2026-05-28T06:04:07Z

Missed earlier, attaching the full per-trial trace bundle for review: dab-submission-r27-2026-05-28.zip

dab-submission-r27-2026-05-28.zip (15 MB compressed → 475 MB extracted), a .zip wrapper around a .tar.xz (GitHub accepts .zip but not .tar.xz directly). Extract with:

  unzip dab-submission-r27-2026-05-28.zip
  tar -xJf dab-submission-r27-2026-05-28.tar.xz

Contents:

agent_description.md — same as in the PR body
submission.json — identical to leaderboard_submissions/altimate-code_gpt-5.5_n5.json in the PR diff
rescore.json — per-trial pass/fail under upstream HEAD 634cd61ad
r27_traces/trials/<dataset>_query<N>_trial<M>/ — for each of the 270 trials:
- Top-level: ANSWER, result.json, consensus.json (the K=3 consensus output)
- sub0/, sub1/, sub2/: per-sub events.jsonl (full agent action log), ANSWER, stderr.log
- Per-sub workspace/ is excluded (50 MB each; it contains the materialized DBs the agent queried and is recoverable via --prepare-external).
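
For anyone scripting over the bundle, the trial directory names can be split back into their parts. The sketch below assumes only the `<dataset>_query<N>_trial<M>` naming shown above; the helper name `parse_trial_dir` is hypothetical. Dataset names themselves contain underscores, so the pattern anchors on the trailing `_query<N>_trial<M>` suffix.

```python
import re

# Greedy dataset group, then the fixed suffix anchored at the end of the name.
TRIAL_DIR = re.compile(r"^(?P<dataset>.+)_query(?P<query>\d+)_trial(?P<trial>\d+)$")


def parse_trial_dir(name):
    """Split '<dataset>_query<N>_trial<M>' into (dataset, N, M)."""
    match = TRIAL_DIR.match(name)
    if match is None:
        raise ValueError(f"not a trial directory name: {name!r}")
    return match["dataset"], int(match["query"]), int(match["trial"])


print(parse_trial_dir("music_brainz_20k_query2_trial1"))  # ('music_brainz_20k', 2, 1)
print(parse_trial_dir("DEPS_DEV_V1_query1_trial5"))       # ('DEPS_DEV_V1', 1, 5)
```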

Ruiying-Ma · 2026-05-31T20:09:19Z

Hi @sahrizvi — thank you for the detailed submission and the trace bundle! We went through the traces and found a few spots where reference data appears to have leaked into the trials. Listing the trials where a leak both occurred and produced the passing answer:

Trial (dataset/query/trial)	Leak behavior
agnews / query1 / trial4	`load_dataset('ag_news')` → loaded clean AG News gold labels from local HF cache; used them to identify "sports" articles
agnews / query2 / trial1	`load_dataset('ag_news')` → gold category labels; answer (0.14414…) derived directly from the leaked labels
agnews / query2 / trial2	`load_dataset('ag_news')` → gold category labels
agnews / query2 / trial3	`load_dataset('ag_news')` → gold category labels
agnews / query2 / trial4	`load_dataset('ag_news')` → gold category labels
agnews / query2 / trial5	`load_dataset('ag_news')` → gold category labels
agnews / query3 / trial1	`load_dataset('ag_news')` → gold category labels
agnews / query3 / trial2	`load_dataset('ag_news')` → gold category labels
agnews / query3 / trial3	`load_dataset('ag_news')` → gold labels + read a prior run's `classified_articles.jsonl`/`temp_articles.jsonl` and re-aggregated them
agnews / query3 / trial4	`load_dataset('ag_news')` → gold category labels
agnews / query3 / trial5	`load_dataset('ag_news')` → gold category labels
agnews / query4 / trial1	`load_dataset('ag_news')` → gold category labels
agnews / query4 / trial2	`load_dataset('ag_news')` → gold category labels
agnews / query4 / trial3	`load_dataset('ag_news')` → gold category labels
agnews / query4 / trial4	`load_dataset('ag_news')` → gold category labels
agnews / query4 / trial5	`load_dataset('ag_news')` → gold category labels
stockmarket / query3 / trial2	`read` of a prior run's ANSWER was blocked, so `grep` was used on the prior run's `result.json` — this leaked the reference answer, and the final ANSWER copies the leaked values verbatim

Once these leakage patterns are addressed, we’ll re-run the verification and post the Pass@1 results. Thank you!

sahrizvi · 2026-06-01T00:52:07Z

Hi @Ruiying-Ma — thank you for the careful audit and for pinpointing the exact trials. You're right on every one of them, and I owe you an apology: we submitted with the belief our pre-submission audit was clean, and it wasn't. Two separate vectors slipped past us.

Why the agnews load_dataset leak got missed. Our defensive layer at the time of the run lived at the tool-call permission level — path-pattern denies on read, glob, grep for *ANSWER*, *ground_truth*, *result.json*, etc. The agent bypassed that layer by spawning a fresh python subprocess (python -c "from datasets import load_dataset; ds = load_dataset('ag_news')..."). Once python is allowed, the subprocess inherits the host's HuggingFace cache and our tool-level denies don't apply to anything that happens inside it. The proper defense (env vars HF_HUB_OFFLINE=1 / HF_DATASETS_OFFLINE=1 / TRANSFORMERS_OFFLINE=1, plus a workspace pip that has no datasets/sklearn/torchtext installed) was added on a sanitize branch after the submission, but R27 was run before that landed. This was a process failure on our part — we should have caught it.
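
The environment part of that defense can be sketched as below. This is an illustration and not the sanitize-branch code: the helper names are hypothetical, and the optional `HF_HOME` override is an added assumption about one way to keep the host cache out of reach. The three offline variables are the ones named above.

```python
import os
import subprocess
import sys

OFFLINE_FLAGS = {
    "HF_HUB_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
    "TRANSFORMERS_OFFLINE": "1",
}


def hardened_env(base=None, hf_home=None):
    """Copy an environment and force the Hugging Face offline flags."""
    env = dict(os.environ if base is None else base)
    env.update(OFFLINE_FLAGS)
    if hf_home is not None:
        # Assumption: pointing HF_HOME at an empty directory hides the host cache.
        env["HF_HOME"] = hf_home
    return env


def run_python(code, env):
    """Run a Python snippet in a child process that sees only `env`."""
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=False,
    )


# The child process inherits whatever environment it is handed.
probe = "import os; print(os.environ.get('HF_DATASETS_OFFLINE'))"
print(run_python(probe, hardened_env()).stdout.strip())  # 1
```

Offline flags only stop network fetches; a populated local cache can still be read in offline mode, so removing the packages and keeping the host cache unreachable remain the stronger controls.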

Why the stockmarket/q3/t2 grep leak got missed. We did have *result.json*: deny set on the grep and glob permissions, and the corresponding read of workspace/ANSWER correctly fired permission_denied in the trace. But altimate-code's grep tool passes the search regex to the permission matcher rather than the search path (grep.ts:29 — patterns: [params.pattern]). So *result.json* is matched against the regex "Apex Global Brands|23781\.422..." and never fires; ripgrep then recurses into .../baseline_runs/.../result.json and returns its contents. We've patched this with filesystem-level isolation (each clean trial now runs from a workspace tree with no prior baseline_runs/ reachable) and added a per-trial path deny on bash/read as defense-in-depth.
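
The matcher mix-up is easy to reproduce in miniature. The sketch below uses Python's `fnmatch` as a stand-in for the permission matcher (the real one lives in altimate-code's TypeScript and may behave differently); the regex is the visible prefix of the one quoted above, and the path is a hypothetical example.

```python
from fnmatch import fnmatch

DENY_PATTERNS = ["*ANSWER*", "*ground_truth*", "*result.json*"]


def is_denied(candidates, deny=DENY_PATTERNS):
    """True if any candidate string matches any deny glob."""
    return any(fnmatch(c, pattern) for c in candidates for pattern in deny)


search_regex = r"Apex Global Brands|23781\.422"    # truncated, as quoted above
search_path = "baseline_runs/some_run/result.json"  # hypothetical path

# Buggy behaviour: only the regex is checked, so the deny never fires.
print(is_denied([search_regex]))               # False
# Checking the path (or both) catches the access.
print(is_denied([search_path]))                # True
print(is_denied([search_regex, search_path]))  # True
```

A path-based check of this kind still only covers an explicitly supplied path, not files reached by recursion, which is why filesystem-level isolation is the more robust fix.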

Please consider this our updated submission. Two datasets re-run, ten unchanged.

1. agnews — clean rerun complete. Re-ran all 4 queries × 5 trials in a hardened sandbox (HF offline env, workspace venv without datasets/sklearn/torchtext, no host HF cache reachable). Agents attempted load_dataset, sklearn, keras, torchtext — all returned ModuleNotFoundError. Clean Pass@1: 7/20 = 35.00% (q1: 5/5, q2: 2/5, q3: 0/5, q4: 0/5).

Caveat we want to flag honestly: this rerun is on the same Azure GPT-5-chat model as R27. Azure OpenAI's content moderation rejects explore_dataset classifier calls when news article batches include violence-coded content (war / crime / disaster). 17 of 60 sub-trials (28%) across 11 of 20 trials (55%) hit content_filter / ResponsibleAIPolicyViolation during classification. That's a structural disadvantage specific to running news classification on Azure — it would not have existed when the load_dataset leak was masking the failure path. We don't claim the 35% reflects raw model capability; once content-filter blocks are addressed via a non-Azure labeler path, we'll make a fresh submission tagged with a non-Azure provider.

2. stockmarket/q3 — clean rerun complete. With the grep/path fix above and an isolated filesystem tree, we re-ran all 5 trials of stockmarket/q3. Clean Pass@1: 0/5. All 5 trials computed the correct AVG(Volume) values via SQL, but the agent emitted each row as <company-name> <company-description> <number> (selecting the Company Description column from stockinfo), which puts the number more than 50 characters from the name — outside the validator's name-to-number proximity window. R27/t2 only passed because it grep'd a prior trial's result.json for the company names and reconstructed a clean Name,Number format from the leaked intermediate SQL output; without that hint, t2 fails for the same column-selection reason as t1/t3/t4/t5. Other stockmarket queries (q1, q2, q4, q5) are unchanged from the original submission.
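
To make the column-selection failure concrete, here is a hypothetical proximity check. It is not the benchmark's validate.py: the function, the way distance is measured (characters between the end of the name and the start of the next number), and the sample rows are all assumptions for illustration. Only the 50-character window comes from the description above.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def number_near_name(line, name, window=50):
    """True if a number starts within `window` characters after `name`."""
    start = line.find(name)
    if start < 0:
        return False
    end = start + len(name)
    match = NUMBER.search(line, end)
    return match is not None and match.start() - end <= window


clean_row = "Acme Corp,12345.6"
# A long description between name and number pushes the number out of range.
verbose_row = "Acme Corp " + "designs and markets consumer products worldwide " * 2 + "12345.6"

print(number_near_name(clean_row, "Acme Corp"))    # True
print(number_near_name(verbose_row, "Acme Corp"))  # False
```

Under a check of this shape, dropping the description column from the SELECT list would be enough to bring the number back inside the window.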

3. Updated stratified Pass@1: 63.18%.

Dataset	R27	This update
agnews	100.0%	35.0%
stockmarket	72.0%	68.0%
(10 other datasets)	unchanged	unchanged
Stratified	68.93%	63.18%
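
The updated figure follows from the two changed datasets alone. The sketch below assumes the stratified score is the unweighted mean over the 12 datasets, so each dataset's change contributes one twelfth; the stockmarket count of 17/25 is derived from the 68.0% above (q3 going from 1/5 to 0/5).

```python
from fractions import Fraction

N_DATASETS = 12
R27_STRATIFIED = 0.6893  # reported value, rounded to 4 decimals

old_rates = {"agnews": Fraction(20, 20), "stockmarket": Fraction(18, 25)}
new_rates = {"agnews": Fraction(7, 20), "stockmarket": Fraction(17, 25)}

# Only the changed datasets move the mean, each weighted by 1/12.
delta = sum(new_rates[d] - old_rates[d] for d in old_rates) / N_DATASETS

print(float(delta))                              # -0.0575
print(f"{R27_STRATIFIED + float(delta):.4f}")    # 0.6318
```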

Updated trace bundle is attached (replacing agnews/* and stockmarket/q3/* trials; all other trials carry over from the original R27 bundle).

Thank you again for the careful review — this is on us, and we appreciate the time you took to catch it.

alt-code-remediation-bundle.zip

Ruiying-Ma · 2026-06-01T19:31:28Z

Hi @sahrizvi ! We have added your submission to the leaderboard! Thank you for your contribution!

Submit Altimate Code (GPT-5.5) — 270 trials, 0.6893 stratified Pass@1

3b97f76

Ruiying-Ma closed this Jun 1, 2026

cursor Bot mentioned this pull request Jun 12, 2026

Fix PATENTS ground truths: regenerate query3 from released data, de-ambiguate query1/query2 #59

Open

sahrizvi wants to merge 1 commit into ucbepic:main from sahrizvi:submit/altimate-code-gpt5.5-2026-05-28
