[adapter] Detect concurrent dependency mutation in PlanValidity#37078
[adapter] Detect concurrent dependency mutation in PlanValidity#37078mtabebe wants to merge 1 commit into
Conversation
656cd63 to
38d6107
Compare
ggevay
left a comment
There was a problem hiding this comment.
LGTM, just a test comment.
(At first glance I was worried about the performance cost the hashing introduces, but PlanValidity is not used by peeks nowadays that we have frontend_peek, and other statements are not that performance-critical, so it's fine.)
There was a problem hiding this comment.
How about a fail_point! between rotate_keys_ensure and rotate_keys_finish? I think that would make this test deterministic, plus it would dramatically simplify it.
There was a problem hiding this comment.
It would be great if we don't add a new mzcompose.py for this specific test. If you want a real race, maybe test/race-condition, otherwise what Gabor says works great.
def-
left a comment
There was a problem hiding this comment.
With this extra test:
diff --git a/test/race-condition-rotate-keys/mzcompose.py b/test/race-condition-rotate-keys/mzcompose.py
index 1674d8dd56..e13b7cae7d 100644
--- a/test/race-condition-rotate-keys/mzcompose.py
+++ b/test/race-condition-rotate-keys/mzcompose.py
@@ -27,8 +27,21 @@ instead of silently reverting. The test asserts two things:
- zero silent reverts across the loop, and
- at least one OCC conflict, so the test fails if the race window moves and
we stop actually exercising the fix.
+
+Workflows:
+ - `default`: the SQL-272 regression test described above.
+ - `peek-rename-conflict`: proves Finding 1 from the PR review -- the same
+ `create_sql`-hash check spuriously fails a SELECT staged through the
+ Coordinator when a dependency is RENAMEd during the off-thread optimize
+ window, even though a rename can't invalidate the plan. It also documents
+ the real scope: in the default config simple SELECTs use frontend peek
+ sequencing and bypass PlanValidity, so the regression only bites the
+ Coordinator peek path (frontend sequencing off, or a fallback query shape).
+ RED on the PR branch; would go GREEN once the check is scoped to the DDL
+ paths that need it.
"""
+import threading
import time
from textwrap import dedent
@@ -190,3 +203,234 @@ def workflow_default(c: Composition, parser: WorkflowArgumentParser) -> None:
"iterations; saw zero. Either the race window has moved, the "
"stagger is wrong, or the fix in PlanValidity has regressed."
)
+
+
+def _race_select_vs_rename(
+ c: Composition, iterations: int, join_width: int
+) -> tuple[int, list[str], int]:
+ """Run a tight `SELECT` loop against a tight `RENAME` loop on one of the
+ SELECT's dependencies, and report how many SELECTs spuriously failed with
+ `ConcurrentDependencyMutation` / SQLSTATE 40001.
+
+ The caller decides which sequencing path the SELECT takes by setting
+ `enable_frontend_peek_sequencing` *before* calling this (the value is read
+ when a session is created, so the SELECT connection opened here picks up the
+ current setting).
+
+ Returns `(selects_ok, spurious_conflict_messages, renames_committed)`.
+ """
+ join_width = max(2, join_width)
+ # A wide self-join over a 3-row table: cheap to *execute*, expensive to
+ # *optimize*. We need the optimizer's off-thread `spawn_blocking` window
+ # (peek.rs `peek_optimize`) to be wide enough that a concurrently-committed
+ # RENAME reliably lands inside it; the join width buys that window without
+ # inflating data volume. (~50ms/SELECT at width 8.)
+ rels = ", ".join(f"rename_race.t AS r{i}" for i in range(join_width))
+ preds = " AND ".join(f"r0.a = r{i}.a" for i in range(1, join_width))
+ # A dedicated schema dropped with CASCADE makes setup idempotent regardless
+ # of what name the table was left under by a previous run's RENAME loop.
+ c.sql(dedent(f"""
+ DROP SCHEMA IF EXISTS rename_race CASCADE;
+ CREATE SCHEMA rename_race;
+ CREATE TABLE rename_race.t (a int);
+ INSERT INTO rename_race.t VALUES (1), (2), (3);
+ CREATE VIEW rename_race.v AS SELECT count(*) AS c FROM {rels} WHERE {preds};
+ """))
+
+ # The RENAME loop renames the base table to a *fresh* name every iteration
+ # (t_1, t_2, ...). Renaming to brand-new names -- rather than flipping
+ # between two -- means the dependency's `create_sql` never cycles back to a
+ # value the SELECT captured at plan time, so any rename that commits inside
+ # the optimize window is guaranteed to mismatch. (Flipping between two names
+ # tends to cancel out: ~2 renames land per window, an even number, leaving
+ # `create_sql` back where it started.)
+ #
+ # `v` is never renamed, so `SELECT c FROM rename_race.v` always resolves from
+ # the client's side; the only rename-induced failure the SELECT can observe
+ # is the spurious `create_sql`-hash 40001. We never DROP anything mid-loop,
+ # so a `ConcurrentDependencyDrop` can't fire either.
+ stop = threading.Event()
+ renames_committed = 0
+ rename_err: Exception | None = None
+
+ def renamer() -> None:
+ nonlocal renames_committed, rename_err
+ cur_name = "t"
+ i = 0
+ try:
+ # Autocommit is on (see Composition.sql_connection), so each RENAME
+ # is its own committed catalog transaction that bumps the catalog's
+ # transient revision -- exactly what forces `check` off its fast path.
+ with c.sql_cursor(reuse_connection=False) as cur:
+ while not stop.is_set():
+ i += 1
+ nxt = f"t_{i}"
+ cur.execute(
+ f"ALTER TABLE rename_race.{cur_name} RENAME TO {nxt}".encode()
+ )
+ cur_name = nxt
+ renames_committed += 1
+ except Exception as e:
+ rename_err = e
+
+ t_rename = PropagatingThread(target=renamer, name="renamer")
+ t_rename.start()
+
+ spurious_conflicts: list[str] = []
+ selects_ok = 0
+ try:
+ # Fresh connection so the SELECTs run concurrently with the renamer and
+ # so the session picks up the current `enable_frontend_peek_sequencing`.
+ with c.sql_cursor(reuse_connection=False) as cur:
+ for i in range(iterations):
+ try:
+ cur.execute(b"SELECT c FROM rename_race.v")
+ cur.fetchall()
+ selects_ok += 1
+ except Exception as e:
+ msg = str(e)
+ sqlstate = getattr(e, "sqlstate", None)
+ if OCC_CONFLICT_MESSAGE in msg or sqlstate == "40001":
+ # The bug: a benign concurrent RENAME spuriously failed a
+ # plain SELECT.
+ spurious_conflicts.append(
+ f"iteration {i}: sqlstate={sqlstate}: {msg}"
+ )
+ else:
+ # Any other error is not what this test is about; surface
+ # it loudly rather than letting it look like the bug.
+ raise Exception(
+ f"iteration {i}: SELECT failed with an unexpected error: {e}"
+ )
+ finally:
+ stop.set()
+ t_rename.join()
+
+ if rename_err is not None:
+ raise Exception(f"RENAME loop failed unexpectedly: {rename_err}")
+
+ return selects_ok, spurious_conflicts, renames_committed
+
+
+def workflow_peek_rename_conflict(
+ c: Composition, parser: WorkflowArgumentParser
+) -> None:
+ """
+ Demonstrates the blast radius of the SQL-272 fix (PR #37078 review,
+ Finding 1): the `create_sql`-hash check was added to the *shared*
+ `PlanValidity::check`, which runs for every plan staged through the
+ Coordinator -- not just the read-modify-write DDL (ROTATE KEYS) it was meant
+ for.
+
+ A SELECT staged through the Coordinator has an off-thread optimize hop
+ (`spawn_blocking` in peek.rs `peek_optimize`). `PlanValidity::new` captures a
+ hash of each dependency's `create_sql` before that hop; `check` re-compares
+ it after. A peek references its dependencies by stable id, so a concurrent,
+ *benign* `ALTER ... RENAME` of a dependency leaves the optimized dataflow
+ 100% valid -- yet the rename rewrites that dependency's `create_sql` (see
+ `Op::RenameItem` in catalog/transact.rs, which also rewrites every
+ dependent's `create_sql`), the hash changes, and the SELECT fails with
+ `ConcurrentDependencyMutation` / SQLSTATE 40001. Before PR #37078 the SELECT
+ succeeded: the old `PlanValidity` only failed on *dropped* dependencies.
+
+ IMPORTANT SCOPE REFINEMENT (discovered while building this test): in the
+ default configuration, simple SELECTs are sequenced in the Adapter Frontend
+ (`enable_frontend_peek_sequencing`, default true) and *never touch*
+ `PlanValidity`, so they are NOT affected. The regression only reaches the
+ Coordinator peek path, which is used when frontend peek sequencing is
+ disabled (a supported config) or when a query shape falls back to the old
+ sequencer. RENAME also can't race a lock-taking DDL like CREATE
+ INDEX/MV/VIEW: those serialize behind the global `serialized_ddl` lock (see
+ `must_serialize_ddl`), so the affected staged operations are the lock-exempt
+ ones -- peeks, SUBSCRIBE, and the purification/exempt DDLs.
+
+ The test runs the same SELECT-vs-RENAME race twice:
+ * Phase A, frontend peek sequencing ON (the default): documents that plain
+ SELECTs are unaffected (expected zero spurious failures).
+ * Phase B, frontend peek sequencing OFF (Coordinator peek path): asserts
+ the SELECT never fails with 40001. This is RED on the PR branch (the bug)
+ and would go GREEN once the hash check is scoped to the DDL paths that
+ actually need it.
+ """
+ parser.add_argument(
+ "--iterations",
+ default=100,
+ type=int,
+ help="Number of SELECTs per phase to run against the racing RENAME loop. "
+ "At the default join width the Coordinator path fails roughly half of "
+ "them, so even a few dozen reliably reproduces the bug.",
+ )
+ parser.add_argument(
+ "--join-width",
+ default=8,
+ type=int,
+ help="Number of self-joined copies of the base table in the view. A "
+ "wider join makes the optimizer's off-thread window longer, so a "
+ "concurrently-committed RENAME reliably lands inside it. Execution stays "
+ "instant because the base table is tiny -- the cost is all in planning.",
+ )
+ args = parser.parse_args()
+
+ c.up("materialized")
+
+ # Phase A: default config. Simple SELECTs go through frontend peek
+ # sequencing, which does not use PlanValidity, so the hash check can't fire.
+ c.sql(
+ "ALTER SYSTEM RESET enable_frontend_peek_sequencing",
+ user="mz_system",
+ port=6877,
+ )
+ ok_a, spurious_a, renames_a = _race_select_vs_rename(
+ c, args.iterations, args.join_width
+ )
+ print(
+ f"--- Phase A (enable_frontend_peek_sequencing = ON, the default): "
+ f"selects_ok={ok_a}, renames_committed={renames_a}, "
+ f"spurious_40001={len(spurious_a)} "
+ f"(expected 0: frontend peek sequencing bypasses PlanValidity)"
+ )
+
+ # Phase B: Coordinator peek path. New sessions pick up the changed value, and
+ # `_race_select_vs_rename` opens a fresh SELECT connection, so the SELECTs
+ # are staged through the Coordinator and run `PlanValidity::check`.
+ c.sql(
+ "ALTER SYSTEM SET enable_frontend_peek_sequencing = false",
+ user="mz_system",
+ port=6877,
+ )
+ ok_b, spurious_b, renames_b = _race_select_vs_rename(
+ c, args.iterations, args.join_width
+ )
+ c.sql(
+ "ALTER SYSTEM RESET enable_frontend_peek_sequencing",
+ user="mz_system",
+ port=6877,
+ )
+ print(
+ f"--- Phase B (enable_frontend_peek_sequencing = OFF, Coordinator path): "
+ f"selects_ok={ok_b}, renames_committed={renames_b}, "
+ f"spurious_40001={len(spurious_b)}"
+ )
+
+ # Guard against a silently-ineffective test: if the RENAME loop never
+ # actually committed, the race was never exercised and a green result means
+ # nothing.
+ if renames_b == 0:
+ raise Exception("Phase B committed zero renames; the race was never exercised.")
+
+ if spurious_b:
+ sample = "\n ".join(spurious_b[:10])
+ raise Exception(
+ f"Finding 1 reproduced: {len(spurious_b)} of {args.iterations} plain "
+ "SELECTs on the Coordinator peek path failed with "
+ "ConcurrentDependencyMutation / SQLSTATE 40001 because a concurrent, "
+ "benign ALTER ... RENAME changed a dependency's create_sql during the "
+ "SELECT's off-thread optimize window.\n"
+ "A rename leaves the optimized dataflow valid (dependencies are "
+ "referenced by stable id, not name), so this failure is spurious; "
+ "before PR #37078 the SELECT succeeded. The create_sql-hash check in "
+ "PlanValidity::check should not run for read-only peek plans.\n"
+ f"(For contrast, Phase A on the default frontend peek path saw "
+ f"{len(spurious_a)} such failures.)\n"
+ f"Sample failures:\n {sample}"
+ )We're seeing a bunch of concurrent dependency mutation errors which shouldn't fail:
iteration 5: sqlstate=40001: catalog item 'u1027' was concurrently modified
HINT: Another session modified one of this statement's dependencies before it could commit. Retry the statement.
iteration 6: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT: Another session modified one of this statement's dependencies before it could commit. Retry the statement.
iteration 8: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT: Another session modified one of this statement's dependencies before it could commit. Retry the statement.
iteration 11: sqlstate=40001: catalog item 'u1027' was concurrently modified
HINT: Another session modified one of this statement's dependencies before it could commit. Retry the statement.
iteration 12: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT: Another session modified one of this statement's dependencies before it could commit. Retry the statement.
iteration 13: sqlstate=40001: catalog item 'u1027' was concurrently modified
HINT: Another session modified one of this statement's dependencies before it could commit. Retry the statement.
iteration 14: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT: Another session modified one of this statement's dependencies before it could commit. Retry the statement.
iteration 16: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT: Another session modified one of this statement's dependencies before it could commit. Retry the statement.
iteration 17: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT: Another session modified one of this statement's dependencies before it could commit. Retry the statement.
iteration 19: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT: Another session modified one of this statement's dependencies before it could commit. Retry the statement.
|
(The nightly failures look unrelated) |
|
Triggered a nightly run with the new test: https://buildkite.com/materialize/nightly/builds/16822 Edit: The test doesn't seem to work:
|
b09d535 to
3fc560e
Compare
There was a problem hiding this comment.
The PR description is now wrong about the regression test location.
Triggered another nightly run because the test just timed out in previous runs: https://buildkite.com/materialize/nightly/builds/16859
Edit: Failed for me locally, works with this diff:
diff --git a/src/adapter/src/coord/sequencer/inner/secret.rs b/src/adapter/src/coord/sequencer/inner/secret.rs
index 71862615e0..09913a39c9 100644
--- a/src/adapter/src/coord/sequencer/inner/secret.rs
+++ b/src/adapter/src/coord/sequencer/inner/secret.rs
@@ -385,6 +385,13 @@ impl Coordinator {
name: entry.name,
to_item,
}];
+
+ // pause between the secret-store write and the catalog write so a concurrent ALTER
+ // CONNECTION SET can commit. Runs in the off-thread ensure task, not the
+ // coordinator main loop, so the pause won't freeze the coordinator (which would
+ // block that SET).
+ fail::fail_point!("rotate_keys_between_ensure_and_finish");
+
let stage = SecretStage::RotateKeysFinish(RotateKeysSecretFinish { validity, ops });
Ok(Box::new(stage))
}
@@ -398,10 +405,6 @@ impl Coordinator {
session: &Session,
RotateKeysSecretFinish { ops, validity: _ }: RotateKeysSecretFinish,
) -> Result<StageResult<Box<SecretStage>>, AdapterError> {
- // Test-only seam: lets a regression test pause ROTATE here so a
- // concurrent ALTER CONNECTION SET can commit before `catalog_transact`
- // re-validates dependencies.
- fail::fail_point!("rotate_keys_between_ensure_and_finish");
self.catalog_transact(Some(session), ops).await?;
Ok(StageResult::Response(ExecuteResponse::AlteredObject(
ObjectType::Connection,
diff --git a/test/race-condition/mzcompose.py b/test/race-condition/mzcompose.py
index f41a4491df..cfcc6cf68b 100644
--- a/test/race-condition/mzcompose.py
+++ b/test/race-condition/mzcompose.py
@@ -1407,8 +1407,10 @@ def toxiproxy_start(c: Composition, jitter: int) -> None:
assert r.status_code == 200, r
-# Failpoint that pauses `rotate_keys_finish` between the secret-store write and
-# the catalog write. Defined in src/adapter/src/coord/sequencer/inner/secret.rs.
+# Pauses ROTATE in its off-thread `rotate_keys_ensure` task, between the
+# secret-store write and the catalog write. Must fire off the coordinator main
+# loop; pausing on-loop would freeze the coordinator and block this test's SET.
+# Defined in src/adapter/src/coord/sequencer/inner/secret.rs.
ROTATE_KEYS_FAILPOINT = "rotate_keys_between_ensure_and_finish"
# Substring of the Display impl of `AdapterError::ConcurrentDependencyMutation`| skip: "Not stable yet, not clear if this is a product issue" | ||
|
|
||
| - id: race-condition-rotate-keys | ||
| label: "ALTER CONNECTION ROTATE KEYS vs SET race (SQL-272)" |
There was a problem hiding this comment.
| label: "ALTER CONNECTION ROTATE KEYS vs SET race (SQL-272)" | |
| label: "ALTER CONNECTION ROTATE KEYS vs SET race" |
Problem: `ALTER CONNECTION ... ROTATE KEYS` is a read-modify-write with a wide window between the catalog read and the catalog write. The slow steps in the middle are SSH key generation and secret-store I/O, all running in a spawned task. If `ALTER CONNECTION SET (...)` commits inside that window, the blind catalog write at the end of ROTATE overwrites the SET's change with the pre-SET `create_sql`. The SET silently disappears. Note: ROTATE KEYS is one of the few DDLs exempt from the global `serialized_ddl` lock (see `must_serialize_ddl` in `command_handler.rs`), which is what opens the race window in the first place. Other ALTER CONNECTION variants take the lock and can't race against themselves. Solution: `PlanValidity` already runs at every off-thread to on-thread hop and catches dropped dependencies. Extend it to also catch content changes: capture a hash of each dependency's `create_sql` at plan time, compare against the live hash in `check`, and return a new `ConcurrentDependencyMutation` error (SQLSTATE 40001) on mismatch. The user retries the statement. `PlanValidity::new` and `extend_dependencies` now take `&Catalog` so they can read entries to hash; all callsites are updated. Testing: Regression test that triggers the wait point Closes SQL-272.
|
Ah I updated the commit message but that didn't update the PR |
Problem:
ALTER CONNECTION ... ROTATE KEYSis a read-modify-write with a wide window between the catalog read and the catalog write. The slow steps in the middle are SSH key generation and secret-store I/O, all running in a spawned task. IfALTER CONNECTION SET (...)commits inside that window, the blind catalog write at the end of ROTATE overwrites the SET's change with the pre-SETcreate_sql. The SET silently disappears.Note:
ROTATE KEYS is one of the few DDLs exempt from the global
serialized_ddllock (seemust_serialize_ddlincommand_handler.rs), which is what opens the race window in the first place. Other ALTER CONNECTION variants take the lock and can't race against themselves.Solution:
PlanValidityalready runs at every off-thread to on-thread hop and catches dropped dependencies. Extend it to also catch content changes: capture a hash of each dependency'screate_sqlat plan time, compare against the live hash incheck, and return a newConcurrentDependencyMutationerror (SQLSTATE 40001) on mismatch. The user retries the statement.PlanValidity::newandextend_dependenciesnow take&Catalogso they can read entries to hash; all callsites are updated.Testing:
Regression test that triggers the wait point
Closes SQL-272.