[Leaderboard] Altimate Code (GPT-5.5 + Claude Sonnet 4.6 AutoContext), Pass@1 0.6893 #53
Missed earlier, attaching the full per-trial trace bundle for review: dab-submission-r27-2026-05-28.zip
Contents:
Hi @sahrizvi — thank you for the detailed submission and the trace bundle! We went through the traces and found a few spots where reference data appears to have leaked into the trials. Listing the trials where a leak both occurred and produced the passing answer:
Once these leakage patterns are addressed, we’ll re-run the verification and post the Pass@1 results. Thank you!
Hi @Ruiying-Ma, thank you for the careful audit and for pinpointing the exact trials. You're right on every one of them, and I owe you an apology: we submitted believing our pre-submission audit was clean, and it wasn't. Two separate vectors slipped past us.
Why the agnews
Why the stockmarket/q3/t2
Please consider this our updated submission. Two datasets re-run, ten unchanged.
1. agnews: clean rerun complete. Re-ran all 4 queries × 5 trials in a hardened sandbox (HF offline env, workspace venv without
Caveat we want to flag honestly: this rerun is on the same Azure GPT-5-chat model as R27. Azure OpenAI's content moderation rejects
2. stockmarket/q3: clean rerun complete. With the grep/path fix above and an isolated filesystem tree, we re-ran all 5 trials of stockmarket/q3. Clean Pass@1: 0/5. All 5 trials computed the correct
3. Updated stratified Pass@1: 63.18%.
Updated trace bundle is attached (replacing agnews/* and stockmarket/q3/* trials; all other trials carry over from the original R27 bundle). Thank you again for the careful review. This is on us, and we appreciate the time you took to catch it.
Hi @sahrizvi! We have added your submission to the leaderboard! Thank you for your contribution!
Altimate Code — Leaderboard Submission (R27)
Agent name: Altimate Code
Project page: altimate.sh
gpt-5-chat, Chat Completions API)
ground_truth.csv read path).
Hints: Yes (db_description_withhint.txt injected into the user prompt; AutoContext document also dropped into each trial workspace)
Trials: 5 per query
Consensus: K=3 sub-trials per trial, top-of-modal-answer wins
Total trials: 270 (54 queries across 12 datasets × 5 trials per query)
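The consensus step listed above (K=3 sub-trials, modal answer wins) can be illustrated with a minimal sketch. This is not the submission's code: the function name `consensus_answer`, the whitespace normalization, and the first-seen tie-break are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def consensus_answer(sub_trial_answers: Sequence[str]) -> str:
    """Return the modal (most frequent) answer among a trial's sub-trials.

    Hypothetical helper. Ties are broken in favor of the answer that
    appeared first, which is an assumed rule, not one stated in the PR.
    """
    if not sub_trial_answers:
        raise ValueError("need at least one sub-trial answer")

    # Assumption: strip surrounding whitespace so trivially different
    # strings vote for the same answer.
    normalized = [answer.strip() for answer in sub_trial_answers]
    counts = Counter(normalized)
    modal_count = max(counts.values())

    # Walk the answers in their original order and return the first one
    # that reaches the modal count.
    for answer in normalized:
        if counts[answer] == modal_count:
            return answer
    raise AssertionError("unreachable: the modal answer is always present")
```

With K=3, two agreeing sub-trials outvote the third, for example `consensus_answer(["42", "41", "42"])` returns `"42"`. When all three disagree, this sketch falls back to the first sub-trial's answer.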
Prior submission
This supersedes our earlier submission #44 (Altimate Code + Claude Sonnet 4.6, 0.6040 stratified, 2026-05-10). The harness is the same; the trial-time backbone has been swapped from Claude Sonnet 4.6 to GPT-5.5, and a K=3 consensus pass has been added per trial. AutoContext (authored by Sonnet 4.6) is new in this submission relative to #44.
Result
The 270 trial answers were scored against two validator versions of the benchmark: the version we ran against, and the current upstream main at submission time.
9031c68ad)
634cd61ad)
Validator-version note. Trials ran against vendor/DataAgentBench at commit 9031c68ad. Upstream subsequently merged 18 validate.py updates and one ground-truth correction (music_brainz_20k/q2: Amazon Music → iTunes) for the 12 datasets covered. Re-applying the latest validators (commit 634cd61ad) against the same 270 saved answers produced an identical pass/fail distribution. DB content for our 12 datasets was unchanged between the two commits.
Per-dataset stratified Pass@1
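As a rough illustration of how a stratified Pass@1 could be derived from per-trial pass/fail outcomes, here is a minimal sketch. It assumes "stratified" means a macro-average: average the trials within each query, average the queries within each dataset, then average across datasets. The benchmark's actual definition may differ. The `passed` field and the function name are hypothetical; the submission records only carry `{dataset, query, run, answer}`, and pass/fail comes from the validators.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def stratified_pass_at_1(results: Iterable[Mapping]) -> float:
    """Macro-averaged Pass@1 over datasets (assumed definition).

    Each record needs the keys "dataset", "query", and a boolean "passed"
    (hypothetical field holding the validator verdict for one trial).
    """
    # Step 1: collect trial outcomes per (dataset, query) pair.
    per_query = defaultdict(list)
    for record in results:
        key = (record["dataset"], record["query"])
        per_query[key].append(1.0 if record["passed"] else 0.0)

    # Step 2: per-query pass rate, grouped by dataset.
    per_dataset = defaultdict(list)
    for (dataset, _query), outcomes in per_query.items():
        per_dataset[dataset].append(sum(outcomes) / len(outcomes))

    if not per_dataset:
        raise ValueError("no results to score")

    # Step 3: mean over queries within a dataset, then mean over datasets,
    # so every dataset carries equal weight regardless of its query count.
    dataset_scores = [sum(rates) / len(rates) for rates in per_dataset.values()]
    return sum(dataset_scores) / len(dataset_scores)
```

Under this assumed definition, a dataset with few queries weighs as much as one with many, which is why a stratified figure can differ from the pooled fraction of passing trials.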
Architecture
A heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime, that:
db_description_withhint.txt (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
ground_truth.csv access path exists in the author code (dab_bench/auto_context.py).
schema_index, schema_search, schema_inspect, sql_execute, warehouse_list, fuzzy_match, explore_dataset) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
validate_shape CLI that compares the draft ANSWER against format_hint.txt for row/field/separator compliance (no ground_truth.csv access).
Reproducibility
The submission JSON in this PR contains 270 records of {dataset, query, run, answer}, one per (dataset, query, trial). Per-trial event logs (events.jsonl), result.json, and stderr.log for each sub-trial are available as an out-of-band trace bundle on request (~475 MB).
Reproduce locally with:
Limitations disclosed for completeness
icd_o_3_histology NOT histological_type"), which fails q2/q3 whose GT expects descriptive labels. A planned next-iteration fix updates the AutoContext author prompt to emit conditional rules.
🤖 PR description drafted with Claude Code
Edits: Formatting