This repository contains the code and data for the paper:

**Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards**
Yuxuan Zhu, Daniel Kang
arXiv:2603.16140
```
rlvr-noisy-data/
├── data/                                   # Datasets for both math and SQL experiments
├── data_curation/                          # Data curation pipeline (Section 3.3)
│   └── math/                               # <-- Scripts for curating noisy math data
├── SkyRL/                                  # Math RLVR framework and experiments (Sections 4-5)
│   └── skyrl-train/rl_noise/math/          # <-- Key experiment code
└── tinker-sql/                             # Text2SQL RLVR framework and experiments (Section 6)
    ├── scripts/                            # <-- Training scripts (GRPO and PGFC per model)
    ├── experiments/bird/                   # <-- Experiment entry points and evaluation
    └── tinker_cookbook/recipes/sql_rl/     # <-- SQL RL environment and grader
```
| File | Description |
|---|---|
| `noisy_data.json` | DeepScaleR dataset with verified incorrect annotations from Qwen2.5-Math-7B |
| `data_with_issues.json` | Data points where "incorrect" annotations were found to be actually correct |
| `bird-corrected-600.json` | 600 BIRD instances with manually corrected SQL annotations |
| `bird-original-600.json` | Original (noisy) BIRD instances for comparison |
| `bird-test-600.json` | BIRD Mini-Dev test set (598 instances) for evaluation |
| `bird_db_schemas.json` | Database schemas (with column descriptions) for all 79 BIRD databases |
Datasets are also available on HuggingFace:
- `uiuc-kang-lab/DeepScaleR-Qwen2.5-Math-7B-Incorrect-Answers`
- `uiuc-kang-lab/DeepScaleR-Qwen2.5-Math-7B-Falsely-Incorrect-Answers`
Code is in `SkyRL/skyrl-train/rl_noise/math/`. Built on the SkyRL framework.

```bash
cd SkyRL/skyrl-train
pip install -e .
```

See `data_curation/README.md` for the full pipeline. Key steps:
- Convert `data/noisy_data.json` to a training parquet:

  ```bash
  python data_curation/math/json_to_parquet.py --mode clean --output <output.parquet>
  ```

- Download evaluation benchmarks:

  ```bash
  python data_curation/math/test_dataset.py --output_dir <output_dir>
  ```

- Create controlled noise levels:

  ```bash
  python data_curation/math/adjust_noise_rate.py 0.5 \
      --clean_parquet <clean.parquet> \
      --noisy_parquet <noisy.parquet> \
      --output <output.parquet>
  ```
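The noise-level step above mixes verified-incorrect annotations into the clean set at a target rate. A minimal pure-Python sketch of that mixing logic (illustrative only; the function and the `question`/`answer` field names are assumptions, not the repository's actual implementation):

```python
import random

def mix_noise(clean, noisy, noise_level, seed=0):
    """Swap a `noise_level` fraction of clean annotations for their
    verified-incorrect counterparts (matched by index).

    `clean` and `noisy` are parallel lists of dicts; only the `answer`
    annotation changes, the question stays the same.
    """
    assert len(clean) == len(noisy)
    rng = random.Random(seed)
    swap = set(rng.sample(range(len(clean)), round(noise_level * len(clean))))
    return [
        dict(c, answer=noisy[i]["answer"]) if i in swap else dict(c)
        for i, c in enumerate(clean)
    ]

# Example: at noise_level=0.5, half the answers come from the noisy set
clean = [{"question": f"q{i}", "answer": "right"} for i in range(10)]
noisy = [{"question": f"q{i}", "answer": "wrong"} for i in range(10)]
mixed = mix_noise(clean, noisy, 0.5)
assert sum(ex["answer"] == "wrong" for ex in mixed) == 5
```

A fixed seed keeps the swapped subset reproducible across noise levels.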
```bash
cd SkyRL/skyrl-train/rl_noise/math

# GRPO on clean data
bash run_deepscaler.sh --base_dir=<path>

# GRPO with noise (set noise_level to 0.1, 0.2, ..., 1.0)
bash run_deepscaler_noise.sh --base_dir=<path> --noise_level=1.0

# Format-only reward
bash run_deepscaler.sh --base_dir=<path> --format_only=true
```

All of the following scripts use 50% noise by default:
```bash
bash run_deepscaler_dapo.sh --base_dir=<path>    # DAPO
bash run_deepscaler_sapo.sh --base_dir=<path>    # SAPO
bash run_deepscaler_tis.sh --base_dir=<path>     # TIS
bash run_deepscaler_drgrpo.sh --base_dir=<path>  # Dr. GRPO

# PGFC (reward correction using noise rate)
bash run_deepscaler_noise.sh --base_dir=<path> --noise_level=0.5 --use_pgfc
```

Code is in `tinker-sql/`. Built on the Tinker SDK.
```bash
cd tinker-sql
pip install -e .
```

Download the BIRD databases following the official instructions at https://bird-bench.github.io/ and place them in `tinker-sql/databases/`.
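The SQL RL environment in `tinker_cookbook/recipes/sql_rl/` grades queries against these databases. BIRD-style grading typically compares execution results; a minimal sketch of such an execution-accuracy check with `sqlite3` (illustrative only, not the repository's grader, which may normalize results differently):

```python
import sqlite3

def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """BIRD-style execution accuracy: the predicted query scores 1 iff it
    executes and returns the same (unordered) result set as the gold query."""
    try:
        pred = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # non-executable SQL scores 0
    gold = conn.execute(gold_sql).fetchall()
    return set(pred) == set(gold)  # set comparison ignores row order

# Tiny in-memory demo with a hypothetical schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders(id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

assert execution_match(conn, "SELECT id FROM orders WHERE total > 10",
                       "SELECT id FROM orders WHERE total > 10.0")
assert not execution_match(conn, "SELECT id FROM orders",
                           "SELECT id FROM orders WHERE total > 10")
```

The set comparison is a simplification: it also ignores duplicate rows, which a production grader would likely handle explicitly.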
Training scripts for each model on BIRD-Corrected and BIRD-Original:

| Model | Corrected | Original |
|---|---|---|
| Qwen3-235B | `scripts/qwen-235b-clean.sh` | `scripts/qwen-235b-noisy.sh` |
| DeepSeek-V3.1 | `scripts/deepseek_clean.sh` | `scripts/deepseek_noisy.sh` |
| Qwen3-32B | `scripts/qwen-32b-clean.sh` | `scripts/qwen-32b-noisy.sh` |
| GPT-OSS-120B-A5B | `scripts/gpt-oss-clean.sh` | `scripts/gpt-oss-noisy.sh` |
| Llama-3.3-70B | `scripts/llama-70b-clean.sh` | `scripts/llama-70b-noisy.sh` |
PGFC is enabled by setting `noise_rate` (the estimated annotation error rate):

```bash
bash scripts/pgfc_qwen_235b.sh   # Qwen3-235B
bash scripts/pgfc_deepseek.sh    # DeepSeek-V3.1
bash scripts/pgfc_qwen_32b.sh    # Qwen3-32B
bash scripts/pgfc_gpt.sh         # GPT-OSS-120B-A5B
bash scripts/pgfc_llama_70b.sh   # Llama-3.3-70B
```
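To give intuition for noise-rate-based reward correction, here is the standard surrogate-reward construction for a binary reward flipped with symmetric probability `e < 0.5`. This is a generic sketch of the idea, not necessarily PGFC's exact formula:

```python
def corrected_reward(observed_reward: float, noise_rate: float) -> float:
    """Debias a 0/1 verifier reward under symmetric annotation noise.

    If the reward is flipped with probability e = noise_rate < 0.5, then
        r_hat = (r - e) / (1 - 2e)
    is unbiased: E[r_hat] equals the true reward. Generic construction;
    PGFC's actual correction may differ.
    """
    assert 0.0 <= noise_rate < 0.5, "correction requires noise_rate < 0.5"
    return (observed_reward - noise_rate) / (1.0 - 2.0 * noise_rate)

# Unbiasedness check at e = 0.3: a truly-correct answer is observed as
# reward 1 with probability 0.7 and reward 0 with probability 0.3.
e = 0.3
expected = (1 - e) * corrected_reward(1.0, e) + e * corrected_reward(0.0, e)
assert abs(expected - 1.0) < 1e-12
```

In RLVR the noise is one-sided rather than symmetric (wrong annotations reward wrong answers), so the actual correction conditions on the annotation error rate instead of a flip probability.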
```bash
cd tinker-sql
python experiments/bird/evaluation.py \
    --file_path_prefix <generated_queries> \
    --data_path <bird_mini_dev_path> \
    --db_path <database_path> \
    --dump_path <results_output> \
    --run_name <wandb_run_name> \
    --log_path <log_directory>
```

```bibtex
@article{zhu2025noisy,
  title={Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards},
  author={Zhu, Yuxuan and Kang, Daniel},
  journal={arXiv preprint arXiv:2603.16140},
  year={2025}
}
```