TOT-SQL Safeguard submission #56
|
Hi @NG-VikasV — thank you for your contribution! Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? Alternatively, sharing the traces for at least the agnews queries would also be very helpful. We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result. |
|
Hi @Ruiying-Ma, I have attached all traces, including agnews. Please let us know if you need anything more. Thank you. |
|
Hi @NG-VikasV — thanks for sending the traces. We went through them and have two things to sort out before we can compute Pass@1. First, run coverage. The leaderboard score averages over 5 runs per query, and we require at least 5 runs per query. The bundle currently contains a single run per query (54 traces total), and the submission JSON repeats that one answer across all 5 run slots. Could you re-run each query at least 5 times and share the per-run traces and answers? Second, the submission JSON does not match the traces for two agnews queries. For query 1 the submission lists Once we have the 5-run traces and a reconciled submission file, we'll re-run the verification and post the Pass@1 result. |
|
Hi, @Ruiying-Ma Thank you for the detailed feedback — apologies for the issues with the initial submission. Both points have been addressed: Please find the updated files committed directly to the PR branch: submissions/TT_SQL_V2_traces_all_runs.zip — all 270 per-run traces (54 queries × 5 runs), structured as traces/{dataset}/query{id}/run{0–4}/ Direct link: Click here Best regards, Vikas |
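For anyone validating the bundle, a minimal sketch of a coverage check over the traces/{dataset}/query{id}/run{0–4}/ layout described above. The function name and return shape are hypothetical, and it assumes each run is stored as a directory named run0 through run4; it is not part of the benchmark tooling.

```python
from pathlib import Path

RUNS_PER_QUERY = 5  # minimum number of runs per query requested in the review


def find_incomplete_queries(root, runs_per_query=RUNS_PER_QUERY):
    """Return {"dataset/queryN": [missing run indices]} for queries lacking runs.

    Assumes the layout <root>/<dataset>/query<id>/run<i>/ (hypothetical helper).
    """
    incomplete = {}
    for query_dir in sorted(Path(root).glob("*/query*")):
        if not query_dir.is_dir():
            continue
        # Collect every expected run index that has no directory on disk.
        missing = [
            i for i in range(runs_per_query)
            if not (query_dir / f"run{i}").is_dir()
        ]
        if missing:
            incomplete[f"{query_dir.parent.name}/{query_dir.name}"] = missing
    return incomplete
```

An empty result means every query directory found under the root has all five run directories. It does not confirm that all 54 queries are present, so the number of query directories would still need to be counted separately.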
|
Hi @Ruiying-Ma, We have not yet received any update regarding our submission. We are fully prepared to make any corrections, revisions, or improvements required to support the review process. As a company based in Hyderabad, Telangana, India, we would appreciate it if you could kindly acknowledge receipt of our submission and provide an update on its current status. We look forward to your response and are available to address any feedback promptly. Thanks & Regards,
|
Hi @Ruiying-Ma, I think you might have missed our previous results. Nevertheless, we are attaching our new runs and their traces; this time we have not used any "hint", in contrast to our earlier submission. Please find our submission for the "DataAgentBench" leaderboard below. Our Results: 54 queries × 5 runs = 270 total run slots · Pass@1 47.0% · Pass@K (K=5) 70.4% Submission answers (270 runs): click here Model: gpt-oss-safeguard-120b via AWS Bedrock · Thanks & Regards,
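As a rough illustration of how the two reported metrics relate to per-run outcomes, here is a small sketch. It assumes Pass@1 is the fraction of correct run slots over all queries and runs, and Pass@K is the fraction of queries with at least one correct run among the K runs; the leaderboard's exact definitions may differ. The function names and the toy data are hypothetical and are not the submission's results.

```python
def pass_at_1(results):
    """Fraction of correct run slots across all queries (assumed definition)."""
    total = sum(len(runs) for runs in results.values())
    correct = sum(sum(runs) for runs in results.values())
    return correct / total


def pass_at_k(results):
    """Fraction of queries with at least one correct run (assumed definition)."""
    return sum(any(runs) for runs in results.values()) / len(results)


# Toy data: query id -> per-run correctness flags (illustrative only).
toy = {
    "q1": [True, True, True, True, True],
    "q2": [False, True, False, False, False],
    "q3": [False, False, False, False, False],
}
print(f"Pass@1 = {pass_at_1(toy):.3f}, Pass@K = {pass_at_k(toy):.3f}")
```

Under these assumed definitions, 47.0% of 270 run slots would correspond to roughly 127 correct runs, and 70.4% of 54 queries to 38 queries solved at least once.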
|
Hi @NG-VikasV, thanks for the traces. Apologies for my late reply. Thank you for your patience. We found a data leakage issue in at least 79/270 trials: At the final answer step (the SELF_CORRECTOR / "CONCISE ANSWER" stage) the prompt handed to the model includes a line like We list all possible leakage we could find in the table below. Happy to re-verify once they're ready! Full list of trials where the gold value was injected (79)
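A check of this kind can be sketched as a scan of each trial's final-step prompt for the gold value. This is only an illustration: the record fields `id`, `final_prompt`, and `gold` are hypothetical and do not reflect the actual trace format or the procedure used in this review.

```python
def find_leaked_trials(trials):
    """Return ids of trials whose final-step prompt contains the gold value verbatim.

    Each trial is a dict with hypothetical keys: "id", "final_prompt", "gold".
    """
    leaked = []
    for trial in trials:
        gold = str(trial["gold"]).strip()
        # Case-insensitive substring match; empty gold values are skipped.
        if gold and gold.lower() in trial["final_prompt"].lower():
            leaked.append(trial["id"])
    return leaked


# Toy records (illustrative only, not taken from the submitted traces).
example_trials = [
    {"id": "t1", "final_prompt": "Expected: Sports. Give a concise answer.", "gold": "Sports"},
    {"id": "t2", "final_prompt": "Give a concise answer.", "gold": "Business"},
]
print(find_leaked_trials(example_trials))
```

Plain substring matching over-flags short gold values such as single digits, so any hits would still need manual inspection of the surrounding prompt line.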
|
Agent Name: TOT-SQL Safeguard
Backbone LLM: openai.gpt-oss-safeguard-120b
Dataset Hints Used: No
A reasoning-based agent architecture for multi-step Data-to-SQL tasks. It leverages LLM reasoning to dynamically construct SQL queries, run execution-verification loops, and retrieve correct query results across SQLite, PostgreSQL, DuckDB, and MongoDB databases without using any hardcoded schemas or domain-specific hints.