[Leaderboard] Altimate Code (GPT-5.5 + Claude Sonnet 4.6 AutoContext) — Pass@1 0.6893 #53

Closed
sahrizvi wants to merge 1 commit into ucbepic:main from sahrizvi:submit/altimate-code-gpt5.5-2026-05-28

sahrizvi commented May 27, 2026


Altimate Code — Leaderboard Submission (R27)

Agent name: Altimate Code
Project page: altimate.sh

| Component | Model | Role |
| --- | --- | --- |
| Trial-time backbone | GPT-5.5 (Azure AI Foundry, deployment alias gpt-5-chat, Chat Completions API) | Answers each of the 270 trials |
| AutoContext author | Claude Sonnet 4.6 (Google Vertex AI) | One-shot per dataset: produces a schema-orientation document (joins, encodings, format quirks, sampled rows). GT-firewalled (no ground_truth.csv read path). |

Hints: Yes (db_description_withhint.txt injected into the user prompt; AutoContext document also dropped into each trial workspace)
Trials: 5 per query
Consensus: K=3 sub-trials per trial; the modal (most frequent) answer wins
Total trials: 270 (54 queries across 12 datasets × 5 trials each)

Prior submission

This supersedes our earlier submission #44 (Altimate Code + Claude Sonnet 4.6, 0.6040 stratified, 2026-05-10). The harness is the same; the trial-time backbone has been swapped from Claude Sonnet 4.6 to GPT-5.5, and a K=3 consensus pass has been added per trial. AutoContext (authored by Sonnet 4.6) is new in this submission relative to #44.

Result

The 270 trial answers were scored against two validator versions of the benchmark — the version we ran against, and the current upstream main at submission time.

| Metric | Trial-time validators (9031c68ad) | Latest validators (634cd61ad) |
| --- | --- | --- |
| Stratified Pass@1 (leaderboard metric) | 0.6893 | 0.6893 |
| Micro Pass@1 (passes / trials) | 0.7444 (201 / 270) | 0.7444 (201 / 270) |
| Total trials | 270 | 270 |

Validator-version note. Trials ran against vendor/DataAgentBench at commit 9031c68ad. Upstream subsequently merged 18 validate.py updates and one ground-truth correction (music_brainz_20k/q2: Amazon Music → iTunes) for the 12 datasets covered. Re-applying the latest validators (commit 634cd61ad) to the same 270 saved answers produced an identical pass/fail distribution. DB content for our 12 datasets was unchanged between the two commits.

Per-dataset stratified Pass@1

| Dataset | Pass@1 | Trials (passed / total) |
| --- | --- | --- |
| agnews | 1.000 | 20/20 |
| bookreview | 1.000 | 15/15 |
| stockindex | 1.000 | 15/15 |
| yelp | 0.971 | 34/35 |
| crmarenapro | 0.831 | 54/65 |
| googlelocal | 0.750 | 15/20 |
| stockmarket | 0.720 | 18/25 |
| music_brainz_20k | 0.667 | 10/15 |
| DEPS_DEV_V1 | 0.500 | 5/10 |
| GITHUB_REPOS | 0.500 | 10/20 |
| PANCANCER_ATLAS | 0.333 | 5/15 |
| PATENTS | 0.000 | 0/15 |
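The two headline metrics can be recomputed from the per-dataset counts above. This is a minimal sketch that assumes "stratified" means the unweighted mean of per-dataset pass rates and "micro" means total passes over total trials; the function names are illustrative and are not taken from the benchmark's own scoring code.

```python
# Per-dataset (passed, total) counts copied from the table above.
RESULTS = {
    "agnews": (20, 20),
    "bookreview": (15, 15),
    "stockindex": (15, 15),
    "yelp": (34, 35),
    "crmarenapro": (54, 65),
    "googlelocal": (15, 20),
    "stockmarket": (18, 25),
    "music_brainz_20k": (10, 15),
    "DEPS_DEV_V1": (5, 10),
    "GITHUB_REPOS": (10, 20),
    "PANCANCER_ATLAS": (5, 15),
    "PATENTS": (0, 15),
}


def stratified_pass_at_1(results):
    """Unweighted mean of per-dataset pass rates (assumed definition)."""
    rates = [passed / total for passed, total in results.values()]
    return sum(rates) / len(rates)


def micro_pass_at_1(results):
    """Total passes divided by total trials, ignoring dataset boundaries."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return passed / total


if __name__ == "__main__":
    print(f"stratified: {stratified_pass_at_1(RESULTS):.4f}")
    print(f"micro:      {micro_pass_at_1(RESULTS):.4f}")
```

Under these assumptions the script reproduces the reported 0.6893 and 0.7444. The gap between the two comes from dataset size: large, high-scoring datasets such as crmarenapro count for more in the micro figure than in the stratified one.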

Architecture

A heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime, that:

  1. Reads each dataset's db_description_withhint.txt (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
  2. Reads an AutoContext document (per-dataset, authored by Claude Sonnet 4.6 once before any trials run) for schema-orientation notes — column annotations, verified join keys, sample-row encodings, NULL semantics, and entity-resolution caveats. Sonnet authors this with access to the warehouse schema and 5 sampled rows per table; no ground_truth.csv access path exists in the author code (dab_bench/auto_context.py).
  3. Uses native data tools (schema_index, schema_search, schema_inspect, sql_execute, warehouse_list, fuzzy_match, explore_dataset) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
  4. Validates output shape before commit via a validate_shape CLI that compares the draft ANSWER against format_hint.txt for row/field/separator compliance (no ground_truth.csv access).
  5. Aggregates K=3 sub-trials per trial by exact string match on the ANSWER file: the answer with the highest count wins, and ties break to the first.
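The aggregation in step 5 can be sketched as follows. This is an illustration under stated assumptions, not the harness's actual code: `consensus_answer` is a hypothetical name, answers are compared as raw strings with no normalization, and "ties break to the first" is read as "the earliest sub-trial among the tied answers".

```python
from collections import Counter


def consensus_answer(sub_answers):
    """Pick the most frequent ANSWER string among the sub-trials.

    Comparison is exact: no whitespace or case normalization is applied.
    If several answers share the top count, the one produced by the
    earliest sub-trial wins (assumed reading of "ties break to the first").
    """
    if not sub_answers:
        raise ValueError("need at least one sub-trial answer")
    counts = Counter(sub_answers)
    top = max(counts.values())
    # Walk the answers in sub-trial order so the earliest top-count answer wins.
    for answer in sub_answers:
        if counts[answer] == top:
            return answer


if __name__ == "__main__":
    print(consensus_answer(["42", "41", "42"]))  # two of three agree
    print(consensus_answer(["a", "b", "c"]))     # three-way tie, first wins
```

With K=3 and exact matching, a trial only benefits from consensus when at least two sub-trials emit byte-identical output; a three-way split falls back to sub-trial 0.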

Reproducibility

The submission JSON in this PR contains 270 records of {dataset, query, run, answer} — one per (dataset, query, trial). Per-trial event logs (events.jsonl), result.json, and stderr.log for each sub-trial are available as an out-of-band trace bundle on request (~475 MB).

Reproduce locally with:

# Prereqs: postgres on :55432, mongodb on :57017, an Azure AI Foundry deployment
# named "gpt-5-chat" fronted by scripts_python/azure_foundry_proxy.py (rewrites
# max_tokens → max_completion_tokens), and a Google Vertex project for the
# Sonnet 4.6 AutoContext author pass.

GOOGLE_CLOUD_PROJECT=<your-gcp-project> GOOGLE_CLOUD_LOCATION=us-east5 \
AZURE_FOUNDRY_BASE_URL=http://localhost:9997/v1 \
AZURE_FOUNDRY_API_KEY=<your-azure-foundry-key> \
AZURE_FOUNDRY_MODELS=gpt-5-chat \
PG_HOST=127.0.0.1 PG_PORT=55432 PG_USER=postgres PG_PASSWORD=postgres \
MONGO_URI="mongodb://127.0.0.1:57017/" \
uv run python scripts_python/run_benchmark.py \
  --dab-root vendor/DataAgentBench \
  --datasets agnews bookreview crmarenapro DEPS_DEV_V1 GITHUB_REPOS googlelocal \
             music_brainz_20k PANCANCER_ATLAS PATENTS stockindex stockmarket yelp \
  --trials 5 --concurrency 5 --consensus-k 3 \
  --profile bash --runtime altimate \
  --model azure-foundry/gpt-5-chat \
  --max-turns 75 --timeout-sec 2000 --yolo --prepare-external \
  --autocontext --autocontext-model claude-sonnet-4-6 \
  --experiments-dir baseline_runs --run-name run27_azure_gpt55

Limitations disclosed for completeness

  • PATENTS (0/15): every PATENTS trial produced a well-formed CSV answer but matched a different subset of CPC codes than the reference set. Failure mode is query-interpretation (EMA initialization convention and CPC hierarchy-level definition are under-specified in the question), not format or harness. We chose not to add per-dataset hand-tuning.
  • PANCANCER_ATLAS (5/15): concentrated on q2/q3 mis-grouping (descriptive vs coded histology column). The AutoContext Operational Rule for histology was authored unconditionally ("use icd_o_3_histology NOT histological_type"), which fails q2/q3 whose GT expects descriptive labels. A planned next-iteration fix updates the AutoContext author prompt to emit conditional rules.

🤖 PR description drafted with Claude Code
Edits: Formatting

@sahrizvi

Author

Missed earlier, attaching the full per-trial trace bundle for review: dab-submission-r27-2026-05-28.zip

dab-submission-r27-2026-05-28.zip (15 MB compressed → 475 MB extracted) is a .zip wrapper around a .tar.xz (GitHub accepts .zip but not .tar.xz directly). Extract with:

  unzip dab-submission-r27-2026-05-28.zip
  tar -xJf dab-submission-r27-2026-05-28.tar.xz

Contents:

  • agent_description.md — same as in the PR body
  • submission.json — identical to leaderboard_submissions/altimate-code_gpt-5.5_n5.json in the PR diff
  • rescore.json — per-trial pass/fail under upstream HEAD 634cd61ad
  • r27_traces/trials/<dataset>_query<N>_trial<M>/ — for each of the 270 trials:
    • Top-level: ANSWER, result.json, consensus.json (the K=3 consensus output)
    • sub0/, sub1/, sub2/: per-sub events.jsonl (full agent action log), ANSWER, stderr.log
    • Per-sub workspace/ is excluded (50 MB each — contains the materialized DBs the agent queried; recoverable via
      --prepare-external).

@Ruiying-Ma

Collaborator

Hi @sahrizvi — thank you for the detailed submission and the trace bundle! We went through the traces and found a few spots where reference data appears to have leaked into the trials. Listing the trials where a leak both occurred and produced the passing answer:

| Trial (dataset / query / trial) | Leak behavior |
| --- | --- |
| agnews / query1 / trial4 | load_dataset('ag_news') → loaded clean AG News gold labels from local HF cache; used them to identify "sports" articles |
| agnews / query2 / trial1 | load_dataset('ag_news') → gold category labels; answer (0.14414…) derived directly from the leaked labels |
| agnews / query2 / trial2 | load_dataset('ag_news') → gold category labels |
| agnews / query2 / trial3 | load_dataset('ag_news') → gold category labels |
| agnews / query2 / trial4 | load_dataset('ag_news') → gold category labels |
| agnews / query2 / trial5 | load_dataset('ag_news') → gold category labels |
| agnews / query3 / trial1 | load_dataset('ag_news') → gold category labels |
| agnews / query3 / trial2 | load_dataset('ag_news') → gold category labels |
| agnews / query3 / trial3 | load_dataset('ag_news') → gold labels + read a prior run's classified_articles.jsonl/temp_articles.jsonl and re-aggregated them |
| agnews / query3 / trial4 | load_dataset('ag_news') → gold category labels |
| agnews / query3 / trial5 | load_dataset('ag_news') → gold category labels |
| agnews / query4 / trial1 | load_dataset('ag_news') → gold category labels |
| agnews / query4 / trial2 | load_dataset('ag_news') → gold category labels |
| agnews / query4 / trial3 | load_dataset('ag_news') → gold category labels |
| agnews / query4 / trial4 | load_dataset('ag_news') → gold category labels |
| agnews / query4 / trial5 | load_dataset('ag_news') → gold category labels |
| stockmarket / query3 / trial2 | read of a prior run's ANSWER was blocked, so grep was used on the prior run's result.json; this leaked the reference answer, and the final ANSWER copies the leaked values verbatim |

Once these leakage patterns are addressed, we’ll re-run the verification and post the Pass@1 results. Thank you!

sahrizvi commented Jun 1, 2026

Author

Hi @Ruiying-Ma — thank you for the careful audit and for pinpointing the exact trials. You're right on every one of them, and I owe you an apology: we submitted with the belief our pre-submission audit was clean, and it wasn't. Two separate vectors slipped past us.

Why the agnews load_dataset leak got missed. Our defensive layer at the time of the run lived at the tool-call permission level — path-pattern denies on read, glob, grep for *ANSWER*, *ground_truth*, *result.json*, etc. The agent bypassed that layer by spawning a fresh python subprocess (python -c "from datasets import load_dataset; ds = load_dataset('ag_news')..."). Once python is allowed, the subprocess inherits the host's HuggingFace cache and our tool-level denies don't apply to anything that happens inside it. The proper defense (env vars HF_HUB_OFFLINE=1 / HF_DATASETS_OFFLINE=1 / TRANSFORMERS_OFFLINE=1, plus a workspace pip that has no datasets/sklearn/torchtext installed) was added on a sanitize branch after the submission, but R27 was run before that landed. This was a process failure on our part — we should have caught it.
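The environment-level part of that defense can be sketched as below. This is a minimal illustration, not the code on the sanitize branch: `hardened_env` and `run_sandboxed` are hypothetical names, and pointing `HF_HOME` at an empty per-trial directory is an added assumption about how "no host HF cache reachable" might be achieved. To the best of our understanding the three offline flags only stop network fetches, so on their own they would still let a subprocess read an already-cached dataset; the empty cache directory and the package-free venv are what close that path.

```python
import os
import subprocess
import sys

# Offline switches named in the comment above.
OFFLINE_FLAGS = {
    "HF_HUB_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
    "TRANSFORMERS_OFFLINE": "1",
}


def hardened_env(cache_dir, base_env=None):
    """Copy the environment, force offline mode, and redirect the HF cache.

    `cache_dir` should be an empty directory inside the trial workspace
    (assumption: this is one way to keep the host cache out of reach).
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(OFFLINE_FLAGS)
    env["HF_HOME"] = str(cache_dir)
    return env


def run_sandboxed(code, cache_dir):
    """Run `python -c <code>` with the hardened environment applied."""
    return subprocess.run(
        [sys.executable, "-c", code],
        env=hardened_env(cache_dir),
        capture_output=True,
        text=True,
        check=False,
    )
```

Because the variables are set on the child process rather than checked at the tool-call layer, they also apply to anything the agent spawns from inside `python -c`, which is the path the original denies did not cover.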

Why the stockmarket/q3/t2 grep leak got missed. We did have a *result.json*: deny rule set on the grep and glob permissions, and the corresponding read of workspace/ANSWER correctly fired permission_denied in the trace. But altimate-code's grep tool passes the search regex to the permission matcher rather than the search path (grep.ts:29, patterns: [params.pattern]). So *result.json* is matched against the regex "Apex Global Brands|23781\.422..." and never fires; ripgrep then recurses into .../baseline_runs/.../result.json and returns its contents. We've patched this with filesystem-level isolation (each clean trial now runs from a workspace tree with no prior baseline_runs/ reachable) and added a per-trial path deny on bash/read as defense-in-depth.
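The mismatch is easy to reproduce in isolation. The sketch below is illustrative Python, not altimate-code's TypeScript source: the function names, the example path, and the use of `fnmatch` as the glob matcher are all assumptions, and the regex is only the truncated prefix quoted above.

```python
from fnmatch import fnmatch

# Deny globs listed in the explanation of the tool-call permission layer.
DENY_GLOBS = ["*ANSWER*", "*ground_truth*", "*result.json*"]


def is_denied(candidates, deny_globs=DENY_GLOBS):
    """True if any candidate string matches any deny glob."""
    return any(fnmatch(c, g) for c in candidates for g in deny_globs)


def grep_denied_buggy(pattern, path):
    """Mirrors the described bug: the search regex is checked, the path is not."""
    return is_denied([pattern])


def grep_denied_path_aware(pattern, path):
    """Checks the search path instead of the regex."""
    return is_denied([path])


if __name__ == "__main__":
    regex = r"Apex Global Brands|23781\.422"      # truncated, as quoted above
    target = "baseline_runs/some_run/result.json"  # hypothetical path
    print(grep_denied_buggy(regex, target))       # the deny rule never fires
    print(grep_denied_path_aware(regex, target))  # the deny rule fires
```

Note that checking the search path only helps when the sensitive file is named directly. A recursive search rooted at a parent directory would still pass a path check on the root, which may be why the fix described above relies on filesystem isolation rather than on the matcher alone.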

Please consider this our updated submission. Two datasets re-run, ten unchanged.

1. agnews — clean rerun complete. Re-ran all 4 queries × 5 trials in a hardened sandbox (HF offline env, workspace venv without datasets/sklearn/torchtext, no host HF cache reachable). Agents attempted load_dataset, sklearn, keras, torchtext — all returned ModuleNotFoundError. Clean Pass@1: 7/20 = 35.00% (q1: 5/5, q2: 2/5, q3: 0/5, q4: 0/5).

Caveat we want to flag honestly: this rerun is on the same Azure GPT-5-chat model as R27. Azure OpenAI's content moderation rejects explore_dataset classifier calls when news article batches include violence-coded content (war / crime / disaster). 17 of 60 sub-trials (28%) across 11 of 20 trials (55%) hit content_filter / ResponsibleAIPolicyViolation during classification. That's a structural disadvantage specific to running news classification on Azure — it would not have existed when the load_dataset leak was masking the failure path. We don't claim the 35% reflects raw model capability; once content-filter blocks are addressed via a non-Azure labeler path, we'll make a fresh submission tagged with a non-Azure provider.

2. stockmarket/q3 — clean rerun complete. With the grep/path fix above and an isolated filesystem tree, we re-ran all 5 trials of stockmarket/q3. Clean Pass@1: 0/5. All 5 trials computed the correct AVG(Volume) values via SQL, but the agent emitted each row as <company-name> <company-description> <number> (selecting the Company Description column from stockinfo), which puts the number more than 50 characters from the name — outside the validator's name-to-number proximity window. R27/t2 only passed because it grep'd a prior trial's result.json for the company names and reconstructed a clean Name,Number format from the leaked intermediate SQL output; without that hint, t2 fails for the same column-selection reason as t1/t3/t4/t5. Other stockmarket queries (q1, q2, q4, q5) are unchanged from the original submission.
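To make the failure concrete, here is a hypothetical proximity check of the kind described. The benchmark's real validate.py is not shown in this thread and may differ in direction, tolerance, or tokenization; the function name, the example company, and the example value are invented for illustration, and only the 50-character window comes from the comment above.

```python
WINDOW = 50  # characters; the figure quoted in the comment above


def number_within_window(row, name, number, window=WINDOW):
    """True if `number` begins within `window` characters after `name` ends."""
    name_at = row.find(name)
    if name_at == -1:
        return False
    name_end = name_at + len(name)
    number_at = row.find(number, name_end)
    return number_at != -1 and number_at - name_end <= window


if __name__ == "__main__":
    # Hypothetical rows: a compact Name,Number row and one with a long
    # description column wedged between the name and the value.
    compact = "Example Corp,12345.6"
    verbose = "Example Corp " + "d" * 60 + " 12345.6"
    print(number_within_window(compact, "Example Corp", "12345.6"))
    print(number_within_window(verbose, "Example Corp", "12345.6"))
```

Under this reading, a row can carry the correct value and still fail, because the description column pushes the number out of range of the name it belongs to.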

3. Updated stratified Pass@1: 63.18%.

| Dataset | R27 | This update |
| --- | --- | --- |
| agnews | 100.0% | 35.0% |
| stockmarket | 72.0% | 68.0% |
| (10 other datasets) | unchanged | unchanged |
| Stratified | 68.93% | 63.18% |
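The updated figure follows from the two changed rows alone, assuming the stratified score is the unweighted mean over the 12 datasets (an assumption that is consistent with the reported R27 value):

```latex
\[
\text{Stratified}_{\text{updated}}
  = \text{Stratified}_{\text{R27}}
    - \frac{(1.000 - 0.350) + (0.720 - 0.680)}{12}
  = 0.6893 - \frac{0.690}{12}
  = 0.6893 - 0.0575
  = 0.6318
\]
```

The agnews drop accounts for 0.650 of the 0.690 total, so nearly all of the change comes from that one dataset.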

Updated trace bundle is attached (replacing agnews/* and stockmarket/q3/* trials; all other trials carry over from the original R27 bundle).

Thank you again for the careful review — this is on us, and we appreciate the time you took to catch it.

alt-code-remediation-bundle.zip

@Ruiying-Ma

Collaborator

Hi @sahrizvi! We have added your submission to the leaderboard! Thank you for your contribution!
