Skip to content

feat(cubestore): tolerant job reads + cleanup of unknown job types#11087

Merged
waralexrom merged 2 commits into
masterfrom
cubestore-tolerant-job-read
Jun 15, 2026
Merged

feat(cubestore): tolerant job reads + cleanup of unknown job types#11087
waralexrom merged 2 commits into
masterfrom
cubestore-tolerant-job-read

Conversation

@waralexrom

Copy link
Copy Markdown
Member

Summary

Make the CubeStore worker's job read path forward-compatible across latest/release channel switches. Today, a job row carrying a JobType variant this binary doesn't know fails to deserialize, and because the shard scan does get_row(id)? in a loop, the whole scan errors — the worker stops picking up any of its jobs until upgraded. This makes it unsafe to ever add a new JobType, since a deployment can move between channels in both directions. This PR is the groundwork that makes future JobType additions channel-switch safe.

Changes

  • JobType::Unknown fallback (metastore/job.rs): add #[serde(other)] Unknown so a foreign serialized variant decodes to Unknown instead of erroring. Verified flexbuffers honors #[serde(other)] — including for data-carrying unknown variants (payload is skipped, no Err) — both bare and nested in the Job struct. Stable index id (11), lowest priority.
  • Worker ignores Unknown silently in start_processing_job selection; route_job rejects it with a clean Err instead of a non-exhaustive-match panic. process_separate_job already had a catch-all.
  • Cleanup sweep MetaStore::delete_unknown_jobs: scans, deletes Unknown jobs, then rebuilds the Jobs RowReference index. The rebuild is required because delete recomputes the unique-index key from the (now Unknown) job_type, while the stored key was written by the newer binary over the real variant — a plain delete would leave a dangling unique-index entry and cause a phantom add_job dedup on a later upgrade. Wired into the scheduler reconcile loop (runs on the router, ~30s); warns per removed job and bumps cs.jobs.unknown.deleted.

This lets a leftover unknown job stay in the metastore without blocking the worker from picking up its other jobs (the success bar), while the sweep removes them so they don't pile up on a long downgrade.

Testing

  • New unit tests in metastore/job.rs: flexbuffers round-trip of unknown unit/data variants → Unknown (bare + nested in Job), and known variants still round-trip.
  • New start_processing_job_skips_unknown in metastore/mod.rs: a known sibling job is selected over an Unknown one, the Unknown job is never picked, scan paths still succeed, and the cleanup sweep removes it (count 1 then 0) leaving the known job + a consistent index (re-add for the freed reference succeeds).
  • cargo build -p cubestore clean; cargo test -p cubestore --lib metastore 35 passed; cargo test -p cubestore --lib scheduler 4 passed; cargo fmt applied.

@waralexrom waralexrom requested a review from a team as a code owner June 15, 2026 15:57
@github-actions github-actions Bot added cube store Issues relating to Cube Store rust Pull requests that update Rust code labels Jun 15, 2026
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 1m 9s —— View job


Code Review — tolerant job reads + Unknown cleanup

Verdict: LGTM. Approach is sound, scope is tight, and tests cover the surprising parts (data-carrying unknown variants, nested-in-struct decode, index rebuild). A few low/medium notes below — nothing blocking.

Review details (4 notes, 0 high)

Medium

1. Cleanup sweep runs an unconditional all_rows() over Jobs every reconcile (~30s).
scheduler/mod.rs:294metastore/mod.rs:4338-4341: even in the steady state (zero unknown jobs, the overwhelmingly common case), every reconcile tick takes the write queue and iterates every job row to filter for Unknown. Jobs tables are typically small so absolute cost is low, but this competes with real writes on the same single-writer queue. Cheap mitigation: do a read_operation_out_of_queue scan first and only enter write_operation when at least one Unknown is found.

2. add_job does not reject JobType::Unknown.
metastore/mod.rs:4233: nothing in this binary writes Unknown, but the variant is now a real, addressable value. A future caller (or a test) could insert one and the cleanup sweep would silently delete it on the next tick. A debug_assert or early Err in add_job would make this intent explicit and prevent footguns.

Low

3. Index rebuild runs in a second write_operation, separate from the delete.
metastore/mod.rs:4356-4364: the comment correctly explains why the rebuild is needed, but splitting the delete and rebuild across two write transactions means a crash or shutdown between them leaves the unique-index entry dangling until the next sweep. Not a correctness bug (next sweep finds it again — wait, no: the row is already deleted, so the next sweep finds zero Unknown rows and skips the rebuild, leaving the dangling index until the next time an unknown job appears). Consider doing the delete + rebuild in a single write_operation so they commit atomically.

4. route_job returns Err for Unknown, but start_processing_job filters them out first.
cluster/ingestion/job_runner.rs:400 + metastore/mod.rs:4392: in practice the Err arm is unreachable because the selection filter skips Unknown. That's fine as defense-in-depth (the PR body calls this out), just worth a one-line comment on the match arm so a future reader doesn't think it's a live code path.

Nits

  • app_metrics.rs:140 — nice that the doc-comment explains what a non-zero counter means operationally. Keep doing this.
  • The test start_processing_job_skips_unknown (metastore/mod.rs:7547) uses let _ = fs::remove_dir_all(...) for cleanup, matching the surrounding test style — good consistency.
  • flex_roundtrip helper in job.rs is well-scoped and the NewerJobType mirror to simulate version skew is a clean pattern worth reusing for other forward-compat tests.
• Branch: `cubestore-tolerant-job-read`

@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 43s —— View job


Re-review — follow-up commit f1a51fd

Verdict: LGTM. All four notes from the prior review are addressed cleanly.

Walkthrough of fixes
Prior note Fix in f1a51fd Status
#1 unconditional all_rows() every reconcile delete_unknown_jobs now does an read_operation_out_of_queue scan first and only enters write_operation when an Unknown is observed (metastore/mod.rs:4346-4358). Steady state no longer takes the writer queue. Fixed
#2 add_job silently accepting Unknown debug_assert! on JobType::Unknown at the top of add_job (metastore/mod.rs:4233-4240); the test updated to insert directly via JobRocksTable::insert to simulate a newer-binary write. Fixed
#3 delete + rebuild split across two write txns Folded into a single write_operation (metastore/mod.rs:4361-4387), so the cleanup commits atomically — no window where the unique-index entry is dangling. Early unknown.is_empty() short-circuit handles the TOCTOU between the gate scan and the write. Fixed
#4 route_job Unknown arm look unreachable One-line comment added at cluster/ingestion/job_runner.rs:400 explaining it's defense-in-depth, not a live path. Fixed

Small things still worth noting (non-blocking)

  • The out-of-queue gate scan still does a full all_rows() on Jobs, but it's lock-free and Jobs are small — the original concern was the write-queue contention, which is gone. Fine as-is.
  • The debug_assert is debug-only; in release builds an Unknown insert would still succeed silently. That's deliberate (asserts are the right tool for "this binary's own code never does X"), but worth being aware of if you ever wire add_job to anything that proxies external input.
• Branch: `cubestore-tolerant-job-read`

@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.50%. Comparing base (689d842) to head (2b1ad41).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #11087   +/-   ##
=======================================
  Coverage   58.50%   58.50%           
=======================================
  Files         216      216           
  Lines       17270    17270           
  Branches     3524     3524           
=======================================
  Hits        10103    10103           
  Misses       6652     6652           
  Partials      515      515           
Flag Coverage Δ
cube-backend 58.50% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Make the worker's job read path forward-compatible across latest/release
channel switches: an unknown JobType variant (written by a newer binary)
no longer fails the whole job-shard scan.

- Add `#[serde(other)] Unknown` to JobType so foreign variants decode to
  Unknown instead of erroring (flexbuffers honors serde(other), incl.
  data-carrying variants). Stable index id (11) and lowest priority.
- start_processing_job filters Unknown out silently; route_job rejects it
  cleanly instead of panicking.
- Add MetaStore::delete_unknown_jobs: scans, deletes Unknown jobs and
  rebuilds the Jobs RowReference index (a plain delete would leave the
  unique-index entry dangling, since its key was written over the real
  variant). Wired into the scheduler reconcile loop; warns per removed
  job and bumps cs.jobs.unknown.deleted.
- Gate delete_unknown_jobs with an out-of-queue scan so the common
  steady state (no unknown jobs) doesn't take the single-writer queue
  every reconcile tick.
- Fold the RowReference index rebuild into the same write_operation as
  the delete so the cleanup commits atomically (no window that leaves a
  dangling unique-index entry).
- Reject JobType::Unknown in add_job via debug_assert; this binary must
  never create one. Test injects Unknown via a direct insert instead.
- Note the route_job Unknown arm is defense-in-depth (not a live path).
@waralexrom waralexrom force-pushed the cubestore-tolerant-job-read branch from f1a51fd to 2b1ad41 Compare June 15, 2026 19:26
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 46s —— View job


Re-review — commit 2b1ad41

Verdict: LGTM. The follow-up cleanly addresses all four notes from the prior review; PR is also approved by @paveltiunov.

Review details
Prior note Fix in 2b1ad41 Status
#1 unconditional all_rows() every reconcile delete_unknown_jobs now gates on a read_operation_out_of_queue scan (metastore/mod.rs:4365-4376) and only enters write_operation when an Unknown is observed. Steady state no longer takes the writer queue. Fixed
#2 add_job silently accepting Unknown debug_assert! at metastore/mod.rs:4254-4258; the test now injects Unknown via a direct JobRocksTable::insert to simulate a newer-binary write (metastore/mod.rs:7592-7605). Fixed
#3 delete + rebuild split across two write txns Folded into a single write_operation (metastore/mod.rs:4378-4406); the unknown.is_empty() early-return handles the TOCTOU between the gate scan and the write. Cleanup now commits atomically. Fixed
#4 route_job Unknown arm looking unreachable One-line comment added (cluster/ingestion/job_runner.rs) noting it is defense-in-depth, not a live path. Fixed

Non-blocking observations

  • The out-of-queue gate scan still does a full all_rows() materialize before .any(...) — it can't short-circuit because all_rows() returns a Vec. Fine: Jobs is small and this path runs outside the writer queue, which was the real concern.
  • debug_assert! is release-stripped by design — correct choice for "this binary's own code never does X." Worth being aware of if add_job is ever wired to anything that proxies external input.
  • Atomic delete+rebuild closes the crash-window correctly; the unknown.is_empty() guard inside the write is the right belt-and-braces for the gate→write race.
• Branch: `cubestore-tolerant-job-read`

@waralexrom waralexrom merged commit 87ae095 into master Jun 15, 2026
66 of 67 checks passed
@waralexrom waralexrom deleted the cubestore-tolerant-job-read branch June 15, 2026 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cube store Issues relating to Cube Store rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants