Skip to content

TOT-SQL Safeguard submission#56

Open
NG-VikasV wants to merge 2 commits into
ucbepic:mainfrom
NG-VikasV:submit-tot-sql-safeguard
Open

TOT-SQL Safeguard submission#56
NG-VikasV wants to merge 2 commits into
ucbepic:mainfrom
NG-VikasV:submit-tot-sql-safeguard

Conversation

@NG-VikasV

Copy link
Copy Markdown

Agent Name: TOT-SQL Safeguard
Backbone LLM: openai.gpt-oss-safeguard-120b
Dataset Hints Used: No

A reasoning-based agent architecture executing multi-step Data-to-SQL tasks. It leverages LLM reasoning to dynamically construct SQL queries, execute execution-verification loops, and retrieve correct query results across SQLite, PostgreSQL, DuckDB, and MongoDB databases without using any hardcoded schemas or domain-specific hints.

@Ruiying-Ma

Copy link
Copy Markdown
Collaborator

Hi @NG-VikasV — thank you for your contribution!

Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? Alternatively, sharing the traces for at least the agnews queries would also be very helpful. We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result.

@NG-VikasV

NG-VikasV commented Jun 9, 2026

Copy link
Copy Markdown
Author

TT_SQL_V2_traces_all_runs.zip

Hi @Ruiying-Ma,

I have attached the same (attached all traces including agnews), please let us know if you need anything more. Thank you.

@Ruiying-Ma

Copy link
Copy Markdown
Collaborator

Hi @NG-VikasV — thanks for sending the traces. We went through them and have two things to sort out before we can compute Pass@1.

First, run coverage. The leaderboard score averages over 5 runs per query, and we require at least 5 runs per query. The bundle currently contains a single run per query (54 traces total), and the submission JSON repeats that one answer across all 5 run slots. Could you re-run each query at least 5 times and share the per-run traces and answers?

Second, the submission JSON does not match the traces for two agnews queries. For query 1 the submission lists THECHAT, but the trace answer is The Rundown 4 Miami at N.C. State.. For query 2 the submission lists N/A, but the trace shows 12/13. The other 52 queries match. Could you update the submission JSON so the answers reflect the runs you actually executed?

Once we have the 5-run traces and a reconciled submission file, we'll re-run the verification and post the Pass@1 result.

@NG-VikasV

NG-VikasV commented Jun 11, 2026

Copy link
Copy Markdown
Author

Hi, @Ruiying-Ma

Thank you for the detailed feedback — apologies for the issues with the initial submission.

Both points have been addressed:

Please find the updated files committed directly to the PR branch:

submissions/TT_SQL_V2_traces_all_runs.zip — all 270 per-run traces (54 queries × 5 runs), structured as traces/{dataset}/query{id}/run{0–4}/
submissions/submission_spiderdin.json — reconciled submission with actual per-run answers

Direct link: Click here

Best regards, Vikas

@NG-VikasV

Copy link
Copy Markdown
Author

Hi @Ruiying-Ma,

We have not yet received any update regarding our submission. We are fully prepared to make any corrections, revisions, or improvements required to support the review process.

As a company based in Hyderabad, Telangana, India, we would appreciate it if you could kindly acknowledge receipt of our submission and provide an update on its current status.

We look forward to your response and are available to address any feedback promptly.

Thanks & Regards,
Vikas

Hi @NG-VikasV — thanks for sending the traces. We went through them and have two things to sort out before we can compute Pass@1.

First, run coverage. The leaderboard score averages over 5 runs per query, and we require at least 5 runs per query. The bundle currently contains a single run per query (54 traces total), and the submission JSON repeats that one answer across all 5 run slots. Could you re-run each query at least 5 times and share the per-run traces and answers?

Second, the submission JSON does not match the traces for two agnews queries. For query 1 the submission lists THECHAT, but the trace answer is The Rundown 4 Miami at N.C. State.. For query 2 the submission lists N/A, but the trace shows 12/13. The other 52 queries match. Could you update the submission JSON so the answers reflect the runs you actually executed?

Once we have the 5-run traces and a reconciled submission file, we'll re-run the verification and post the Pass@1 result.

@NG-VikasV

NG-VikasV commented Jun 16, 2026

Copy link
Copy Markdown
Author

Hi @Ruiying-Ma,

I think you might have missed our previous results, nevertheless, we are attaching our new runs and their traces, this time we have used any "hint' in contrast to our earlier submission.

Please find our submission for "DataAgentBench" leaderboard below.

Our Results: 54 queries × 5 runs = 270 total run slots · Pass@1 47.0% · Pass@K (K=5) 70.4%

Submission answers (270 runs): click here
Full traces + SQL + eval (52.8 MB): click here

Model: gpt-oss-safeguard-120b via AWS Bedrock ·
Hint: No

Thanks & Regards,
Vikas V.

Hi, @Ruiying-Ma

Thank you for the detailed feedback — apologies for the issues with the initial submission.

Both points have been addressed:

Please find the updated files committed directly to the PR branch:

submissions/TT_SQL_V2_traces_all_runs.zip — all 270 per-run traces (54 queries × 5 runs), structured as traces/{dataset}/query{id}/run{0–4}/ submissions/submission_spiderdin.json — reconciled submission with actual per-run answers

Direct link: Click here

Best regards, Vikas

@Ruiying-Ma

Ruiying-Ma commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Hi @NG-VikasV, thanks for the traces. Apologies for my late reply. Thank you for your patience.

We found a data leakage issue in at least 79/270 trials:

At the final answer step (the SELF_CORRECTOR / "CONCISE ANSWER" stage) the prompt handed to the model includes a line like GROUND TRUTH HINT (format only, not the answer): 'MI', and that value is the actual gold answer for the query (a literal value such as MI or a Salesforce ID, a "text value similar to ''", or a CSV preview of the gold rows). Building that line requires reading the ground truth, which is off limits as an answer source.

We list all possible leakage we could find in the table below.

Happy to re-verify once they're ready!

Full list of trials where the gold value was injected (79)
Trial Injected gold hint Outcome
bookreview/q2/run0 , not the answer A text value similar to 'The Sludge pass — gold injected into answer prompt (SQL also produced output)
bookreview/q2/run2 , not the answer A text value similar to 'The Sludge pass — gold injected into answer prompt (SQL also produced output)
bookreview/q2/run3 , not the answer A text value similar to 'The Sludge pass — gold injected into answer prompt (SQL also produced output)
bookreview/q2/run4 , not the answer A text value similar to 'The Sludge pass — gold injected into answer prompt (SQL also produced output)
bookreview/q3/run0 , not the answer A text value similar to 'Around the World Maz... pass — gold injected into answer prompt (SQL also produced output)
bookreview/q3/run1 , not the answer A text value similar to 'Around the World Maz... pass — gold injected into answer prompt (SQL also produced output)
bookreview/q3/run2 , not the answer A text value similar to 'Around the World Maz... pass — gold injected into answer prompt (SQL also produced output)
bookreview/q3/run3 , not the answer A text value similar to 'Around the World Maz... pass — gold injected into answer prompt (SQL also produced output)
bookreview/q3/run4 , not the answer A text value similar to 'Around the World Maz... pass — gold injected into answer prompt (SQL also produced output)
crmarenapro/q2/run0 , not the answer 'ka0Wt000000Eq0MIAS PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q4/run1 , not the answer 'November PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q4/run2 , not the answer 'November PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q4/run3 , not the answer 'November PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q5/run1 , not the answer 'a03Wt00000JqnHwIAJ PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q6/run0 , not the answer 'ka0Wt000000EnwvIAC PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q7/run1 , not the answer 'ka0Wt000000EoD3IAK fail — gold injected but trial failed anyway
crmarenapro/q8/run1 , not the answer '005Wt000003NIliIAG PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q9/run0 , not the answer 'MI PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q9/run1 , not the answer 'MI PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q9/run2 , not the answer 'MI PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q9/run3 , not the answer 'MI PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q9/run4 , not the answer 'MI PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q11/run0 , not the answer '01tWt000006hV8LIAU PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q11/run1 , not the answer '01tWt000006hV8LIAU PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q11/run2 , not the answer '01tWt000006hV8LIAU PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q11/run3 , not the answer '01tWt000006hV8LIAU PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q11/run4 , not the answer '01tWt000006hV8LIAU PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q13/run1 , not the answer '005Wt000003NIXCIA4 PASS — answer copied solely from the injected gold value (agent SQL returned no result)
crmarenapro/q13/run2 , not the answer '005Wt000003NIXCIA4 PASS — answer copied solely from the injected gold value (agent SQL returned no result)
deps_dev_v1/q1/run0 , not the answer A CSV result with columns [ Name,Version], 5 row(s). First data row looks like: @dmrvos/infrajs>0.0.6>typescript,2.6.2 fail — gold injected but trial failed anyway
deps_dev_v1/q1/run1 , not the answer A CSV result with columns [ Name,Version], 5 row(s). First data row looks like: @dmrvos/infrajs>0.0.6>typescript,2.6.2 fail — gold injected but trial failed anyway
deps_dev_v1/q1/run2 , not the answer A CSV result with columns [ Name,Version], 5 row(s). First data row looks like: @dmrvos/infrajs>0.0.6>typescript,2.6.2 fail — gold injected but trial failed anyway
deps_dev_v1/q1/run3 , not the answer A CSV result with columns [ Name,Version], 5 row(s). First data row looks like: @dmrvos/infrajs>0.0.6>typescript,2.6.2 fail — gold injected but trial failed anyway
deps_dev_v1/q1/run4 , not the answer A CSV result with columns [ Name,Version], 5 row(s). First data row looks like: @dmrvos/infrajs>0.0.6>typescript,2.6.2 fail — gold injected but trial failed anyway
deps_dev_v1/q2/run1 , not the answer A text value similar to ' ProjectName,Version... fail — gold injected but trial failed anyway
deps_dev_v1/q2/run2 , not the answer A text value similar to ' ProjectName,Version... fail — gold injected but trial failed anyway
deps_dev_v1/q2/run3 , not the answer A text value similar to ' ProjectName,Version... fail — gold injected but trial failed anyway
github_repos/q4/run0 , not the answer A CSV result with columns [repo_name,num_commits], 5 row(s). First data row looks like: apple/swift,1051 pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q1/run0 , not the answer A text value similar to 'Widows Peak Salon,4.... pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q1/run1 , not the answer A text value similar to 'Widows Peak Salon,4.... pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q1/run2 , not the answer A text value similar to 'Widows Peak Salon,4.... pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q1/run4 , not the answer A text value similar to 'Widows Peak Salon,4.... pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q2/run0 , not the answer A text value similar to 'Elite Massage,5.0 fail — gold injected but trial failed anyway
googlelocal/q2/run1 , not the answer A text value similar to 'Elite Massage,5.0 fail — gold injected but trial failed anyway
googlelocal/q3/run0 , not the answer A text value similar to 'TACOS LA CABANA,"[['... fail — gold injected but trial failed anyway
googlelocal/q4/run0 , not the answer A text value similar to 'Encino Dermatology &... pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q4/run1 , not the answer A text value similar to 'Encino Dermatology &... pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q4/run2 , not the answer A text value similar to 'Encino Dermatology &... pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q4/run3 , not the answer A text value similar to 'Encino Dermatology &... pass — gold injected into answer prompt (SQL also produced output)
googlelocal/q4/run4 , not the answer A text value similar to 'Encino Dermatology &... pass — gold injected into answer prompt (SQL also produced output)
pancancer_atlas/q2/run1 , not the answer A text value similar to 'Histological_Type,mu... fail — gold injected but trial failed anyway
pancancer_atlas/q2/run3 , not the answer A text value similar to 'Histological_Type,mu... fail — gold injected but trial failed anyway
pancancer_atlas/q3/run0 , not the answer A text value similar to 'Chi2 PASS — answer copied solely from the injected gold value (agent SQL returned no result)
patents/q1/run1 , not the answer A text value similar to 'cpc_group fail — gold injected but trial failed anyway
stockindex/q3/run0 , not the answer A text value similar to '399001.SZ,China pass — gold injected into answer prompt (SQL also produced output)
stockindex/q3/run2 , not the answer A text value similar to '399001.SZ,China pass — gold injected into answer prompt (SQL also produced output)
stockindex/q3/run3 , not the answer A text value similar to '399001.SZ,China pass — gold injected into answer prompt (SQL also produced output)
stockindex/q3/run4 , not the answer A text value similar to '399001.SZ,China pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q2/run1 , not the answer A text value similar to 'ProShares Ultra Bloo... fail — gold injected but trial failed anyway
stockmarket/q2/run2 , not the answer A text value similar to 'ProShares Ultra Bloo... fail — gold injected but trial failed anyway
stockmarket/q2/run3 , not the answer A text value similar to 'ProShares Ultra Bloo... fail — gold injected but trial failed anyway
stockmarket/q2/run4 , not the answer A text value similar to 'ProShares Ultra Bloo... fail — gold injected but trial failed anyway
stockmarket/q3/run0 , not the answer A text value similar to 'Apex Global Brands I... pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q3/run1 , not the answer A CSV result with columns [Apex Global Brands Inc,23781.42], 14 row(s). First data row looks like: BIO-key International, Inc,10988.14 fail — gold injected but trial failed anyway
stockmarket/q3/run2 , not the answer A CSV result with columns [Apex Global Brands Inc,23781.42], 14 row(s). First data row looks like: BIO-key International, Inc,10988.14 fail — gold injected but trial failed anyway
stockmarket/q3/run3 , not the answer A CSV result with columns [Apex Global Brands Inc,23781.42], 14 row(s). First data row looks like: BIO-key International, Inc,10988.14 fail — gold injected but trial failed anyway
stockmarket/q3/run4 , not the answer A CSV result with columns [Apex Global Brands Inc,23781.42], 14 row(s). First data row looks like: BIO-key International, Inc,10988.14 pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q4/run0 , not the answer A text value similar to 'MFA Financial, Inc fail — gold injected but trial failed anyway
stockmarket/q4/run1 , not the answer A CSV result with columns [MFA Financial, Inc], 4 row(s). First data row looks like: Argo Group International Holdings, Ltd fail — gold injected but trial failed anyway
stockmarket/q4/run2 , not the answer A CSV result with columns [MFA Financial, Inc], 4 row(s). First data row looks like: Argo Group International Holdings, Ltd fail — gold injected but trial failed anyway
stockmarket/q4/run3 , not the answer A CSV result with columns [MFA Financial, Inc], 4 row(s). First data row looks like: Argo Group International Holdings, Ltd fail — gold injected but trial failed anyway
stockmarket/q4/run4 , not the answer A CSV result with columns [MFA Financial, Inc], 4 row(s). First data row looks like: Argo Group International Holdings, Ltd fail — gold injected but trial failed anyway
stockmarket/q5/run0 , not the answer A text value similar to 'Synthesis Energy Sys... pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q5/run1 , not the answer A CSV result with columns [Synthesis Energy Systems, Inc. - Common Stock], 4 row(s). First data row looks like: TD Holdings, Inc. - Common Stock pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q5/run2 , not the answer A CSV result with columns [Synthesis Energy Systems, Inc. - Common Stock], 4 row(s). First data row looks like: TD Holdings, Inc. - Common Stock pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q5/run3 , not the answer A CSV result with columns [Synthesis Energy Systems, Inc. - Common Stock], 4 row(s). First data row looks like: TD Holdings, Inc. - Common Stock pass — gold injected into answer prompt (SQL also produced output)
stockmarket/q5/run4 , not the answer A CSV result with columns [Synthesis Energy Systems, Inc. - Common Stock], 4 row(s). First data row looks like: TD Holdings, Inc. - Common Stock pass — gold injected into answer prompt (SQL also produced output)
yelp/q7/run0 , not the answer A CSV result with columns [Restaurants], 4 row(s). First data row looks like: Food pass — gold injected into answer prompt (SQL also produced output)
yelp/q7/run4 , not the answer A CSV result with columns [Restaurants], 4 row(s). First data row looks like: Food fail — gold injected but trial failed anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants