TOT-SQL Safeguard submission by NG-VikasV · Pull Request #56 · ucbepic/DataAgentBench

NG-VikasV · 2026-06-08T08:32:11Z

Agent Name: TOT-SQL Safeguard
Backbone LLM: openai.gpt-oss-safeguard-120b
Dataset Hints Used: No

A reasoning-based agent architecture executing multi-step Data-to-SQL tasks. It leverages LLM reasoning to dynamically construct SQL queries, execute execution-verification loops, and retrieve correct query results across SQLite, PostgreSQL, DuckDB, and MongoDB databases without using any hardcoded schemas or domain-specific hints.

Ruiying-Ma · 2026-06-08T21:34:21Z

Hi @NG-VikasV — thank you for your contribution!

Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? Alternatively, sharing the traces for at least the agnews queries would also be very helpful. We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result.

NG-VikasV · 2026-06-09T05:22:17Z

TT_SQL_V2_traces_all_runs.zip

Hi @Ruiying-Ma,

I have attached the same (attached all traces including agnews), please let us know if you need anything more. Thank you.

Ruiying-Ma · 2026-06-09T19:49:57Z

Hi @NG-VikasV — thanks for sending the traces. We went through them and have two things to sort out before we can compute Pass@1.

First, run coverage. The leaderboard score averages over 5 runs per query, and we require at least 5 runs per query. The bundle currently contains a single run per query (54 traces total), and the submission JSON repeats that one answer across all 5 run slots. Could you re-run each query at least 5 times and share the per-run traces and answers?

Second, the submission JSON does not match the traces for two agnews queries. For query 1 the submission lists THECHAT, but the trace answer is The Rundown 4 Miami at N.C. State.. For query 2 the submission lists N/A, but the trace shows 12/13. The other 52 queries match. Could you update the submission JSON so the answers reflect the runs you actually executed?

Once we have the 5-run traces and a reconciled submission file, we'll re-run the verification and post the Pass@1 result.

NG-VikasV · 2026-06-11T04:29:03Z

Hi, @Ruiying-Ma

Thank you for the detailed feedback — apologies for the issues with the initial submission.

Both points have been addressed:

Please find the updated files committed directly to the PR branch:

submissions/TT_SQL_V2_traces_all_runs.zip — all 270 per-run traces (54 queries × 5 runs), structured as traces/{dataset}/query{id}/run{0–4}/
submissions/submission_spiderdin.json — reconciled submission with actual per-run answers

Direct link: Click here

Best regards, Vikas

NG-VikasV · 2026-06-16T07:48:21Z

Hi @Ruiying-Ma,

We have not yet received any update regarding our submission. We are fully prepared to make any corrections, revisions, or improvements required to support the review process.

As a company based in Hyderabad, Telangana, India, we would appreciate it if you could kindly acknowledge receipt of our submission and provide an update on its current status.

We look forward to your response and are available to address any feedback promptly.

Thanks & Regards,
Vikas

Hi @NG-VikasV — thanks for sending the traces. We went through them and have two things to sort out before we can compute Pass@1.

First, run coverage. The leaderboard score averages over 5 runs per query, and we require at least 5 runs per query. The bundle currently contains a single run per query (54 traces total), and the submission JSON repeats that one answer across all 5 run slots. Could you re-run each query at least 5 times and share the per-run traces and answers?

Second, the submission JSON does not match the traces for two agnews queries. For query 1 the submission lists THECHAT, but the trace answer is The Rundown 4 Miami at N.C. State.. For query 2 the submission lists N/A, but the trace shows 12/13. The other 52 queries match. Could you update the submission JSON so the answers reflect the runs you actually executed?

Once we have the 5-run traces and a reconciled submission file, we'll re-run the verification and post the Pass@1 result.

NG-VikasV · 2026-06-16T08:23:45Z

Hi @Ruiying-Ma,

I think you might have missed our previous results, nevertheless, we are attaching our new runs and their traces, this time we have used any "hint' in contrast to our earlier submission.

Please find our submission for "DataAgentBench" leaderboard below.

Our Results: 54 queries × 5 runs = 270 total run slots · Pass@1 47.0% · Pass@K (K=5) 70.4%

Submission answers (270 runs): click here
Full traces + SQL + eval (52.8 MB): click here

Model: gpt-oss-safeguard-120b via AWS Bedrock ·
Hint: No

Thanks & Regards,
Vikas V.

Hi, @Ruiying-Ma

Thank you for the detailed feedback — apologies for the issues with the initial submission.

Both points have been addressed:

Please find the updated files committed directly to the PR branch:

submissions/TT_SQL_V2_traces_all_runs.zip — all 270 per-run traces (54 queries × 5 runs), structured as traces/{dataset}/query{id}/run{0–4}/ submissions/submission_spiderdin.json — reconciled submission with actual per-run answers

Direct link: Click here

Best regards, Vikas

Ruiying-Ma · 2026-06-17T07:29:50Z

Hi @NG-VikasV, thanks for the traces. Apologies for my late reply. Thank you for your patience.

We found a data leakage issue in at least 79/270 trials:

At the final answer step (the SELF_CORRECTOR / "CONCISE ANSWER" stage) the prompt handed to the model includes a line like GROUND TRUTH HINT (format only, not the answer): 'MI', and that value is the actual gold answer for the query (a literal value such as MI or a Salesforce ID, a "text value similar to ''", or a CSV preview of the gold rows). Building that line requires reading the ground truth, which is off limits as an answer source.

We list all possible leakage we could find in the table below.

Happy to re-verify once they're ready!

Full list of trials where the gold value was injected (79)

Trial	Injected gold hint	Outcome
bookreview/q2/run0	, not the answer A text value similar to 'The Sludge	pass — gold injected into answer prompt (SQL also produced output)
bookreview/q2/run2	, not the answer A text value similar to 'The Sludge	pass — gold injected into answer prompt (SQL also produced output)
bookreview/q2/run3	, not the answer A text value similar to 'The Sludge	pass — gold injected into answer prompt (SQL also produced output)
bookreview/q2/run4	, not the answer A text value similar to 'The Sludge	pass — gold injected into answer prompt (SQL also produced output)
bookreview/q3/run0	, not the answer A text value similar to 'Around the World Maz...	pass — gold injected into answer prompt (SQL also produced output)
bookreview/q3/run1	, not the answer A text value similar to 'Around the World Maz...	pass — gold injected into answer prompt (SQL also produced output)
bookreview/q3/run2	, not the answer A text value similar to 'Around the World Maz...	pass — gold injected into answer prompt (SQL also produced output)
bookreview/q3/run3	, not the answer A text value similar to 'Around the World Maz...	pass — gold injected into answer prompt (SQL also produced output)
bookreview/q3/run4	, not the answer A text value similar to 'Around the World Maz...	pass — gold injected into answer prompt (SQL also produced output)
crmarenapro/q2/run0	, not the answer 'ka0Wt000000Eq0MIAS	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q4/run1	, not the answer 'November	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q4/run2	, not the answer 'November	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q4/run3	, not the answer 'November	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q5/run1	, not the answer 'a03Wt00000JqnHwIAJ	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q6/run0	, not the answer 'ka0Wt000000EnwvIAC	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q7/run1	, not the answer 'ka0Wt000000EoD3IAK	fail — gold injected but trial failed anyway
crmarenapro/q8/run1	, not the answer '005Wt000003NIliIAG	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q9/run0	, not the answer 'MI	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q9/run1	, not the answer 'MI	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q9/run2	, not the answer 'MI	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q9/run3	, not the answer 'MI	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q9/run4	, not the answer 'MI	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q11/run0	, not the answer '01tWt000006hV8LIAU	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q11/run1	, not the answer '01tWt000006hV8LIAU	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q11/run2	, not the answer '01tWt000006hV8LIAU	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q11/run3	, not the answer '01tWt000006hV8LIAU	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q11/run4	, not the answer '01tWt000006hV8LIAU	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q13/run1	, not the answer '005Wt000003NIXCIA4	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q13/run2	, not the answer '005Wt000003NIXCIA4	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
deps_dev_v1/q1/run0	, not the answer A CSV result with columns [ Name,Version], 5 row(s). First data row looks like: @dmrvos/infrajs>0.0.6>typescript,2.6.2	fail — gold injected but trial failed anyway
deps_dev_v1/q1/run1	, not the answer A CSV result with columns [ Name,Version], 5 row(s). First data row looks like: @dmrvos/infrajs>0.0.6>typescript,2.6.2	fail — gold injected but trial failed anyway
deps_dev_v1/q1/run2	, not the answer A CSV result with columns [ Name,Version], 5 row(s). First data row looks like: @dmrvos/infrajs>0.0.6>typescript,2.6.2	fail — gold injected but trial failed anyway
deps_dev_v1/q1/run3	, not the answer A CSV result with columns [ Name,Version], 5 row(s). First data row looks like: @dmrvos/infrajs>0.0.6>typescript,2.6.2	fail — gold injected but trial failed anyway
deps_dev_v1/q1/run4	, not the answer A CSV result with columns [ Name,Version], 5 row(s). First data row looks like: @dmrvos/infrajs>0.0.6>typescript,2.6.2	fail — gold injected but trial failed anyway
deps_dev_v1/q2/run1	, not the answer A text value similar to ' ProjectName,Version...	fail — gold injected but trial failed anyway
deps_dev_v1/q2/run2	, not the answer A text value similar to ' ProjectName,Version...	fail — gold injected but trial failed anyway
deps_dev_v1/q2/run3	, not the answer A text value similar to ' ProjectName,Version...	fail — gold injected but trial failed anyway
github_repos/q4/run0	, not the answer A CSV result with columns [repo_name,num_commits], 5 row(s). First data row looks like: apple/swift,1051	pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q1/run0	, not the answer A text value similar to 'Widows Peak Salon,4....	pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q1/run1	, not the answer A text value similar to 'Widows Peak Salon,4....	pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q1/run2	, not the answer A text value similar to 'Widows Peak Salon,4....	pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q1/run4	, not the answer A text value similar to 'Widows Peak Salon,4....	pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q2/run0	, not the answer A text value similar to 'Elite Massage,5.0	fail — gold injected but trial failed anyway
googlelocal/q2/run1	, not the answer A text value similar to 'Elite Massage,5.0	fail — gold injected but trial failed anyway
googlelocal/q3/run0	, not the answer A text value similar to 'TACOS LA CABANA,"[['...	fail — gold injected but trial failed anyway
googlelocal/q4/run0	, not the answer A text value similar to 'Encino Dermatology &...	pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q4/run1	, not the answer A text value similar to 'Encino Dermatology &...	pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q4/run2	, not the answer A text value similar to 'Encino Dermatology &...	pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q4/run3	, not the answer A text value similar to 'Encino Dermatology &...	pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q4/run4	, not the answer A text value similar to 'Encino Dermatology &...	pass — gold injected into answer prompt (SQL also produced output)
pancancer_atlas/q2/run1	, not the answer A text value similar to 'Histological_Type,mu...	fail — gold injected but trial failed anyway
pancancer_atlas/q2/run3	, not the answer A text value similar to 'Histological_Type,mu...	fail — gold injected but trial failed anyway
pancancer_atlas/q3/run0	, not the answer A text value similar to 'Chi2	PASS — answer copied solely from the injected gold value (agent SQL returned no result)
patents/q1/run1	, not the answer A text value similar to 'cpc_group	fail — gold injected but trial failed anyway
stockindex/q3/run0	, not the answer A text value similar to '399001.SZ,China	pass — gold injected into answer prompt (SQL also produced output)
stockindex/q3/run2	, not the answer A text value similar to '399001.SZ,China	pass — gold injected into answer prompt (SQL also produced output)
stockindex/q3/run3	, not the answer A text value similar to '399001.SZ,China	pass — gold injected into answer prompt (SQL also produced output)
stockindex/q3/run4	, not the answer A text value similar to '399001.SZ,China	pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q2/run1	, not the answer A text value similar to 'ProShares Ultra Bloo...	fail — gold injected but trial failed anyway
stockmarket/q2/run2	, not the answer A text value similar to 'ProShares Ultra Bloo...	fail — gold injected but trial failed anyway
stockmarket/q2/run3	, not the answer A text value similar to 'ProShares Ultra Bloo...	fail — gold injected but trial failed anyway
stockmarket/q2/run4	, not the answer A text value similar to 'ProShares Ultra Bloo...	fail — gold injected but trial failed anyway
stockmarket/q3/run0	, not the answer A text value similar to 'Apex Global Brands I...	pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q3/run1	, not the answer A CSV result with columns [Apex Global Brands Inc,23781.42], 14 row(s). First data row looks like: BIO-key International, Inc,10988.14	fail — gold injected but trial failed anyway
stockmarket/q3/run2	, not the answer A CSV result with columns [Apex Global Brands Inc,23781.42], 14 row(s). First data row looks like: BIO-key International, Inc,10988.14	fail — gold injected but trial failed anyway
stockmarket/q3/run3	, not the answer A CSV result with columns [Apex Global Brands Inc,23781.42], 14 row(s). First data row looks like: BIO-key International, Inc,10988.14	fail — gold injected but trial failed anyway
stockmarket/q3/run4	, not the answer A CSV result with columns [Apex Global Brands Inc,23781.42], 14 row(s). First data row looks like: BIO-key International, Inc,10988.14	pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q4/run0	, not the answer A text value similar to 'MFA Financial, Inc	fail — gold injected but trial failed anyway
stockmarket/q4/run1	, not the answer A CSV result with columns [MFA Financial, Inc], 4 row(s). First data row looks like: Argo Group International Holdings, Ltd	fail — gold injected but trial failed anyway
stockmarket/q4/run2	, not the answer A CSV result with columns [MFA Financial, Inc], 4 row(s). First data row looks like: Argo Group International Holdings, Ltd	fail — gold injected but trial failed anyway
stockmarket/q4/run3	, not the answer A CSV result with columns [MFA Financial, Inc], 4 row(s). First data row looks like: Argo Group International Holdings, Ltd	fail — gold injected but trial failed anyway
stockmarket/q4/run4	, not the answer A CSV result with columns [MFA Financial, Inc], 4 row(s). First data row looks like: Argo Group International Holdings, Ltd	fail — gold injected but trial failed anyway
stockmarket/q5/run0	, not the answer A text value similar to 'Synthesis Energy Sys...	pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q5/run1	, not the answer A CSV result with columns [Synthesis Energy Systems, Inc. - Common Stock], 4 row(s). First data row looks like: TD Holdings, Inc. - Common Stock	pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q5/run2	, not the answer A CSV result with columns [Synthesis Energy Systems, Inc. - Common Stock], 4 row(s). First data row looks like: TD Holdings, Inc. - Common Stock	pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q5/run3	, not the answer A CSV result with columns [Synthesis Energy Systems, Inc. - Common Stock], 4 row(s). First data row looks like: TD Holdings, Inc. - Common Stock	pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q5/run4	, not the answer A CSV result with columns [Synthesis Energy Systems, Inc. - Common Stock], 4 row(s). First data row looks like: TD Holdings, Inc. - Common Stock	pass — gold injected into answer prompt (SQL also produced output)
yelp/q7/run0	, not the answer A CSV result with columns [Restaurants], 4 row(s). First data row looks like: Food	pass — gold injected into answer prompt (SQL also produced output)
yelp/q7/run4	, not the answer A CSV result with columns [Restaurants], 4 row(s). First data row looks like: Food	fail — gold injected but trial failed anyway

Add TOT-SQL Safeguard submission

842b570

submission: add 5-run traces (270 slots) and reconciled submission JSON

dd0f009

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TOT-SQL Safeguard submission#56

TOT-SQL Safeguard submission#56
NG-VikasV wants to merge 2 commits into
ucbepic:mainfrom
NG-VikasV:submit-tot-sql-safeguard

NG-VikasV commented Jun 8, 2026

Uh oh!

Ruiying-Ma commented Jun 8, 2026

Uh oh!

NG-VikasV commented Jun 9, 2026 •

edited

Loading

Uh oh!

Ruiying-Ma commented Jun 9, 2026

Uh oh!

NG-VikasV commented Jun 11, 2026 •

edited

Loading

Uh oh!

NG-VikasV commented Jun 16, 2026

Uh oh!

NG-VikasV commented Jun 16, 2026 •

edited

Loading

Uh oh!

Ruiying-Ma commented Jun 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

NG-VikasV commented Jun 8, 2026

Uh oh!

Ruiying-Ma commented Jun 8, 2026

Uh oh!

NG-VikasV commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Ruiying-Ma commented Jun 9, 2026

Uh oh!

NG-VikasV commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NG-VikasV commented Jun 16, 2026

Uh oh!

NG-VikasV commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Ruiying-Ma commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

NG-VikasV commented Jun 9, 2026 •

edited

Loading

NG-VikasV commented Jun 11, 2026 •

edited

Loading

NG-VikasV commented Jun 16, 2026 •

edited

Loading

Ruiying-Ma commented Jun 17, 2026 •

edited

Loading