Skip to content

[adapter] Detect concurrent dependency mutation in PlanValidity#37078

Open
mtabebe wants to merge 1 commit into
MaterializeInc:mainfrom
mtabebe:ma/sql-272
Open

[adapter] Detect concurrent dependency mutation in PlanValidity#37078
mtabebe wants to merge 1 commit into
MaterializeInc:mainfrom
mtabebe:ma/sql-272

Conversation

@mtabebe

@mtabebe mtabebe commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Problem:
ALTER CONNECTION ... ROTATE KEYS is a read-modify-write with a wide window between the catalog read and the catalog write. The slow steps in the middle are SSH key generation and secret-store I/O, all running in a spawned task. If ALTER CONNECTION SET (...) commits inside that window, the blind catalog write at the end of ROTATE overwrites the SET's change with the pre-SET create_sql. The SET silently disappears.

Note:
ROTATE KEYS is one of the few DDLs exempt from the global serialized_ddl lock (see must_serialize_ddl in command_handler.rs), which is what opens the race window in the first place. Other ALTER CONNECTION variants take the lock and can't race against themselves.

Solution:
PlanValidity already runs at every off-thread to on-thread hop and catches dropped dependencies. Extend it to also catch content changes: capture a hash of each dependency's create_sql at plan time, compare against the live hash in check, and return a new ConcurrentDependencyMutation error (SQLSTATE 40001) on mismatch. The user retries the statement.

PlanValidity::new and extend_dependencies now take &Catalog so they can read entries to hash; all callsites are updated.

Testing:
Regression test that triggers the wait point

Closes SQL-272.

@mtabebe mtabebe force-pushed the ma/sql-272 branch 4 times, most recently from 656cd63 to 38d6107 Compare June 16, 2026 20:42
@mtabebe mtabebe marked this pull request as ready for review June 16, 2026 22:27
@mtabebe mtabebe requested review from a team as code owners June 16, 2026 22:27
@ggevay ggevay self-requested a review June 17, 2026 08:11

@ggevay ggevay left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a test comment.

(At first glance I was worried about the performance cost the hashing introduces, but PlanValidity is not used by peeks nowadays that we have frontend_peek, and other statements are not that performance-critical, so it's fine.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a fail_point! between rotate_keys_ensure and rotate_keys_finish? I think that would make this test deterministic, plus it would dramatically simplify it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we don't add a new mzcompose.py for this specific test. If you want a real race, maybe test/race-condition, otherwise what Gabor says works great.

@def- def- left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this extra test:

diff --git a/test/race-condition-rotate-keys/mzcompose.py b/test/race-condition-rotate-keys/mzcompose.py
index 1674d8dd56..e13b7cae7d 100644
--- a/test/race-condition-rotate-keys/mzcompose.py
+++ b/test/race-condition-rotate-keys/mzcompose.py
@@ -27,8 +27,21 @@ instead of silently reverting. The test asserts two things:
   - zero silent reverts across the loop, and
   - at least one OCC conflict, so the test fails if the race window moves and
     we stop actually exercising the fix.
+
+Workflows:
+  - `default`: the SQL-272 regression test described above.
+  - `peek-rename-conflict`: proves Finding 1 from the PR review -- the same
+    `create_sql`-hash check spuriously fails a SELECT staged through the
+    Coordinator when a dependency is RENAMEd during the off-thread optimize
+    window, even though a rename can't invalidate the plan. It also documents
+    the real scope: in the default config simple SELECTs use frontend peek
+    sequencing and bypass PlanValidity, so the regression only bites the
+    Coordinator peek path (frontend sequencing off, or a fallback query shape).
+    RED on the PR branch; would go GREEN once the check is scoped to the DDL
+    paths that need it.
 """

+import threading
 import time
 from textwrap import dedent

@@ -190,3 +203,234 @@ def workflow_default(c: Composition, parser: WorkflowArgumentParser) -> None:
             "iterations; saw zero. Either the race window has moved, the "
             "stagger is wrong, or the fix in PlanValidity has regressed."
         )
+
+
+def _race_select_vs_rename(
+    c: Composition, iterations: int, join_width: int
+) -> tuple[int, list[str], int]:
+    """Run a tight `SELECT` loop against a tight `RENAME` loop on one of the
+    SELECT's dependencies, and report how many SELECTs spuriously failed with
+    `ConcurrentDependencyMutation` / SQLSTATE 40001.
+
+    The caller decides which sequencing path the SELECT takes by setting
+    `enable_frontend_peek_sequencing` *before* calling this (the value is read
+    when a session is created, so the SELECT connection opened here picks up the
+    current setting).
+
+    Returns `(selects_ok, spurious_conflict_messages, renames_committed)`.
+    """
+    join_width = max(2, join_width)
+    # A wide self-join over a 3-row table: cheap to *execute*, expensive to
+    # *optimize*. We need the optimizer's off-thread `spawn_blocking` window
+    # (peek.rs `peek_optimize`) to be wide enough that a concurrently-committed
+    # RENAME reliably lands inside it; the join width buys that window without
+    # inflating data volume. (~50ms/SELECT at width 8.)
+    rels = ", ".join(f"rename_race.t AS r{i}" for i in range(join_width))
+    preds = " AND ".join(f"r0.a = r{i}.a" for i in range(1, join_width))
+    # A dedicated schema dropped with CASCADE makes setup idempotent regardless
+    # of what name the table was left under by a previous run's RENAME loop.
+    c.sql(dedent(f"""
+            DROP SCHEMA IF EXISTS rename_race CASCADE;
+            CREATE SCHEMA rename_race;
+            CREATE TABLE rename_race.t (a int);
+            INSERT INTO rename_race.t VALUES (1), (2), (3);
+            CREATE VIEW rename_race.v AS SELECT count(*) AS c FROM {rels} WHERE {preds};
+            """))
+
+    # The RENAME loop renames the base table to a *fresh* name every iteration
+    # (t_1, t_2, ...). Renaming to brand-new names -- rather than flipping
+    # between two -- means the dependency's `create_sql` never cycles back to a
+    # value the SELECT captured at plan time, so any rename that commits inside
+    # the optimize window is guaranteed to mismatch. (Flipping between two names
+    # tends to cancel out: ~2 renames land per window, an even number, leaving
+    # `create_sql` back where it started.)
+    #
+    # `v` is never renamed, so `SELECT c FROM rename_race.v` always resolves from
+    # the client's side; the only rename-induced failure the SELECT can observe
+    # is the spurious `create_sql`-hash 40001. We never DROP anything mid-loop,
+    # so a `ConcurrentDependencyDrop` can't fire either.
+    stop = threading.Event()
+    renames_committed = 0
+    rename_err: Exception | None = None
+
+    def renamer() -> None:
+        nonlocal renames_committed, rename_err
+        cur_name = "t"
+        i = 0
+        try:
+            # Autocommit is on (see Composition.sql_connection), so each RENAME
+            # is its own committed catalog transaction that bumps the catalog's
+            # transient revision -- exactly what forces `check` off its fast path.
+            with c.sql_cursor(reuse_connection=False) as cur:
+                while not stop.is_set():
+                    i += 1
+                    nxt = f"t_{i}"
+                    cur.execute(
+                        f"ALTER TABLE rename_race.{cur_name} RENAME TO {nxt}".encode()
+                    )
+                    cur_name = nxt
+                    renames_committed += 1
+        except Exception as e:
+            rename_err = e
+
+    t_rename = PropagatingThread(target=renamer, name="renamer")
+    t_rename.start()
+
+    spurious_conflicts: list[str] = []
+    selects_ok = 0
+    try:
+        # Fresh connection so the SELECTs run concurrently with the renamer and
+        # so the session picks up the current `enable_frontend_peek_sequencing`.
+        with c.sql_cursor(reuse_connection=False) as cur:
+            for i in range(iterations):
+                try:
+                    cur.execute(b"SELECT c FROM rename_race.v")
+                    cur.fetchall()
+                    selects_ok += 1
+                except Exception as e:
+                    msg = str(e)
+                    sqlstate = getattr(e, "sqlstate", None)
+                    if OCC_CONFLICT_MESSAGE in msg or sqlstate == "40001":
+                        # The bug: a benign concurrent RENAME spuriously failed a
+                        # plain SELECT.
+                        spurious_conflicts.append(
+                            f"iteration {i}: sqlstate={sqlstate}: {msg}"
+                        )
+                    else:
+                        # Any other error is not what this test is about; surface
+                        # it loudly rather than letting it look like the bug.
+                        raise Exception(
+                            f"iteration {i}: SELECT failed with an unexpected error: {e}"
+                        )
+    finally:
+        stop.set()
+        t_rename.join()
+
+    if rename_err is not None:
+        raise Exception(f"RENAME loop failed unexpectedly: {rename_err}")
+
+    return selects_ok, spurious_conflicts, renames_committed
+
+
+def workflow_peek_rename_conflict(
+    c: Composition, parser: WorkflowArgumentParser
+) -> None:
+    """
+    Demonstrates the blast radius of the SQL-272 fix (PR #37078 review,
+    Finding 1): the `create_sql`-hash check was added to the *shared*
+    `PlanValidity::check`, which runs for every plan staged through the
+    Coordinator -- not just the read-modify-write DDL (ROTATE KEYS) it was meant
+    for.
+
+    A SELECT staged through the Coordinator has an off-thread optimize hop
+    (`spawn_blocking` in peek.rs `peek_optimize`). `PlanValidity::new` captures a
+    hash of each dependency's `create_sql` before that hop; `check` re-compares
+    it after. A peek references its dependencies by stable id, so a concurrent,
+    *benign* `ALTER ... RENAME` of a dependency leaves the optimized dataflow
+    100% valid -- yet the rename rewrites that dependency's `create_sql` (see
+    `Op::RenameItem` in catalog/transact.rs, which also rewrites every
+    dependent's `create_sql`), the hash changes, and the SELECT fails with
+    `ConcurrentDependencyMutation` / SQLSTATE 40001. Before PR #37078 the SELECT
+    succeeded: the old `PlanValidity` only failed on *dropped* dependencies.
+
+    IMPORTANT SCOPE REFINEMENT (discovered while building this test): in the
+    default configuration, simple SELECTs are sequenced in the Adapter Frontend
+    (`enable_frontend_peek_sequencing`, default true) and *never touch*
+    `PlanValidity`, so they are NOT affected. The regression only reaches the
+    Coordinator peek path, which is used when frontend peek sequencing is
+    disabled (a supported config) or when a query shape falls back to the old
+    sequencer. RENAME also can't race a lock-taking DDL like CREATE
+    INDEX/MV/VIEW: those serialize behind the global `serialized_ddl` lock (see
+    `must_serialize_ddl`), so the affected staged operations are the lock-exempt
+    ones -- peeks, SUBSCRIBE, and the purification/exempt DDLs.
+
+    The test runs the same SELECT-vs-RENAME race twice:
+      * Phase A, frontend peek sequencing ON (the default): documents that plain
+        SELECTs are unaffected (expected zero spurious failures).
+      * Phase B, frontend peek sequencing OFF (Coordinator peek path): asserts
+        the SELECT never fails with 40001. This is RED on the PR branch (the bug)
+        and would go GREEN once the hash check is scoped to the DDL paths that
+        actually need it.
+    """
+    parser.add_argument(
+        "--iterations",
+        default=100,
+        type=int,
+        help="Number of SELECTs per phase to run against the racing RENAME loop. "
+        "At the default join width the Coordinator path fails roughly half of "
+        "them, so even a few dozen reliably reproduces the bug.",
+    )
+    parser.add_argument(
+        "--join-width",
+        default=8,
+        type=int,
+        help="Number of self-joined copies of the base table in the view. A "
+        "wider join makes the optimizer's off-thread window longer, so a "
+        "concurrently-committed RENAME reliably lands inside it. Execution stays "
+        "instant because the base table is tiny -- the cost is all in planning.",
+    )
+    args = parser.parse_args()
+
+    c.up("materialized")
+
+    # Phase A: default config. Simple SELECTs go through frontend peek
+    # sequencing, which does not use PlanValidity, so the hash check can't fire.
+    c.sql(
+        "ALTER SYSTEM RESET enable_frontend_peek_sequencing",
+        user="mz_system",
+        port=6877,
+    )
+    ok_a, spurious_a, renames_a = _race_select_vs_rename(
+        c, args.iterations, args.join_width
+    )
+    print(
+        f"--- Phase A (enable_frontend_peek_sequencing = ON, the default): "
+        f"selects_ok={ok_a}, renames_committed={renames_a}, "
+        f"spurious_40001={len(spurious_a)} "
+        f"(expected 0: frontend peek sequencing bypasses PlanValidity)"
+    )
+
+    # Phase B: Coordinator peek path. New sessions pick up the changed value, and
+    # `_race_select_vs_rename` opens a fresh SELECT connection, so the SELECTs
+    # are staged through the Coordinator and run `PlanValidity::check`.
+    c.sql(
+        "ALTER SYSTEM SET enable_frontend_peek_sequencing = false",
+        user="mz_system",
+        port=6877,
+    )
+    ok_b, spurious_b, renames_b = _race_select_vs_rename(
+        c, args.iterations, args.join_width
+    )
+    c.sql(
+        "ALTER SYSTEM RESET enable_frontend_peek_sequencing",
+        user="mz_system",
+        port=6877,
+    )
+    print(
+        f"--- Phase B (enable_frontend_peek_sequencing = OFF, Coordinator path): "
+        f"selects_ok={ok_b}, renames_committed={renames_b}, "
+        f"spurious_40001={len(spurious_b)}"
+    )
+
+    # Guard against a silently-ineffective test: if the RENAME loop never
+    # actually committed, the race was never exercised and a green result means
+    # nothing.
+    if renames_b == 0:
+        raise Exception("Phase B committed zero renames; the race was never exercised.")
+
+    if spurious_b:
+        sample = "\n  ".join(spurious_b[:10])
+        raise Exception(
+            f"Finding 1 reproduced: {len(spurious_b)} of {args.iterations} plain "
+            "SELECTs on the Coordinator peek path failed with "
+            "ConcurrentDependencyMutation / SQLSTATE 40001 because a concurrent, "
+            "benign ALTER ... RENAME changed a dependency's create_sql during the "
+            "SELECT's off-thread optimize window.\n"
+            "A rename leaves the optimized dataflow valid (dependencies are "
+            "referenced by stable id, not name), so this failure is spurious; "
+            "before PR #37078 the SELECT succeeded. The create_sql-hash check in "
+            "PlanValidity::check should not run for read-only peek plans.\n"
+            f"(For contrast, Phase A on the default frontend peek path saw "
+            f"{len(spurious_a)} such failures.)\n"
+            f"Sample failures:\n  {sample}"
+        )

We're seeing a bunch of concurrent dependency mutation errors which shouldn't fail:

  iteration 5: sqlstate=40001: catalog item 'u1027' was concurrently modified
HINT:  Another session modified one of this statement's dependencies before it could commit. Retry the statement.
  iteration 6: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT:  Another session modified one of this statement's dependencies before it could commit. Retry the statement.
  iteration 8: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT:  Another session modified one of this statement's dependencies before it could commit. Retry the statement.
  iteration 11: sqlstate=40001: catalog item 'u1027' was concurrently modified
HINT:  Another session modified one of this statement's dependencies before it could commit. Retry the statement.
  iteration 12: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT:  Another session modified one of this statement's dependencies before it could commit. Retry the statement.
  iteration 13: sqlstate=40001: catalog item 'u1027' was concurrently modified
HINT:  Another session modified one of this statement's dependencies before it could commit. Retry the statement.
  iteration 14: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT:  Another session modified one of this statement's dependencies before it could commit. Retry the statement.
  iteration 16: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT:  Another session modified one of this statement's dependencies before it could commit. Retry the statement.
  iteration 17: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT:  Another session modified one of this statement's dependencies before it could commit. Retry the statement.
  iteration 19: sqlstate=40001: catalog item 'u1028' was concurrently modified
HINT:  Another session modified one of this statement's dependencies before it could commit. Retry the statement.

@def-

def- commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

(The nightly failures look unrelated)

@def-

def- commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Triggered a nightly run with the new test: https://buildkite.com/materialize/nightly/builds/16822 Edit: The test doesn't seem to work:

ALTER CONNECTION ROTATE KEYS vs SET race (SQL-272): test timed out

@mtabebe mtabebe force-pushed the ma/sql-272 branch 4 times, most recently from b09d535 to 3fc560e Compare June 22, 2026 21:17
@mtabebe mtabebe requested a review from def- June 23, 2026 01:48

@def- def- left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description is now wrong about the regression test location.

Triggered another nightly run because the test just timed out in previous runs: https://buildkite.com/materialize/nightly/builds/16859
Edit: Failed for me locally, works with this diff:

diff --git a/src/adapter/src/coord/sequencer/inner/secret.rs b/src/adapter/src/coord/sequencer/inner/secret.rs
index 71862615e0..09913a39c9 100644
--- a/src/adapter/src/coord/sequencer/inner/secret.rs
+++ b/src/adapter/src/coord/sequencer/inner/secret.rs
@@ -385,6 +385,13 @@ impl Coordinator {
                     name: entry.name,
                     to_item,
                 }];
+
+                // pause between the secret-store write and the catalog write so a concurrent ALTER
+                // CONNECTION SET can commit.  Runs in the off-thread ensure task, not the
+                // coordinator main loop, so the pause won't freeze the coordinator (which would
+                // block that SET).
+                fail::fail_point!("rotate_keys_between_ensure_and_finish");
+
                 let stage = SecretStage::RotateKeysFinish(RotateKeysSecretFinish { validity, ops });
                 Ok(Box::new(stage))
             }
@@ -398,10 +405,6 @@ impl Coordinator {
         session: &Session,
         RotateKeysSecretFinish { ops, validity: _ }: RotateKeysSecretFinish,
     ) -> Result<StageResult<Box<SecretStage>>, AdapterError> {
-        // Test-only seam: lets a regression test pause ROTATE here so a
-        // concurrent ALTER CONNECTION SET can commit before `catalog_transact`
-        // re-validates dependencies.
-        fail::fail_point!("rotate_keys_between_ensure_and_finish");
         self.catalog_transact(Some(session), ops).await?;
         Ok(StageResult::Response(ExecuteResponse::AlteredObject(
             ObjectType::Connection,
diff --git a/test/race-condition/mzcompose.py b/test/race-condition/mzcompose.py
index f41a4491df..cfcc6cf68b 100644
--- a/test/race-condition/mzcompose.py
+++ b/test/race-condition/mzcompose.py
@@ -1407,8 +1407,10 @@ def toxiproxy_start(c: Composition, jitter: int) -> None:
     assert r.status_code == 200, r


-# Failpoint that pauses `rotate_keys_finish` between the secret-store write and
-# the catalog write. Defined in src/adapter/src/coord/sequencer/inner/secret.rs.
+# Pauses ROTATE in its off-thread `rotate_keys_ensure` task, between the
+# secret-store write and the catalog write. Must fire off the coordinator main
+# loop; pausing on-loop would freeze the coordinator and block this test's SET.
+# Defined in src/adapter/src/coord/sequencer/inner/secret.rs.
 ROTATE_KEYS_FAILPOINT = "rotate_keys_between_ensure_and_finish"

 # Substring of the Display impl of `AdapterError::ConcurrentDependencyMutation`

Comment thread ci/nightly/pipeline.template.yml Outdated
skip: "Not stable yet, not clear if this is a product issue"

- id: race-condition-rotate-keys
label: "ALTER CONNECTION ROTATE KEYS vs SET race (SQL-272)"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
label: "ALTER CONNECTION ROTATE KEYS vs SET race (SQL-272)"
label: "ALTER CONNECTION ROTATE KEYS vs SET race"

Problem:
`ALTER CONNECTION ... ROTATE KEYS` is a read-modify-write with a wide
window between the catalog read and the catalog write. The slow steps
in the middle are SSH key generation and secret-store I/O, all running
in a spawned task. If `ALTER CONNECTION SET (...)` commits inside that
window, the blind catalog write at the end of ROTATE overwrites the
SET's change with the pre-SET `create_sql`. The SET silently disappears.

Note:
ROTATE KEYS is one of the few DDLs exempt from the global
`serialized_ddl` lock (see `must_serialize_ddl` in `command_handler.rs`),
which is what opens the race window in the first place. Other ALTER
CONNECTION variants take the lock and can't race against themselves.

Solution:
`PlanValidity` already runs at every off-thread to on-thread hop and
catches dropped dependencies. Extend it to also catch content changes:
capture a hash of each dependency's `create_sql` at plan time, compare
against the live hash in `check`, and return a new
`ConcurrentDependencyMutation` error (SQLSTATE 40001) on mismatch. The
user retries the statement.

`PlanValidity::new` and `extend_dependencies` now take `&Catalog` so
they can read entries to hash; all callsites are updated.

Testing:
Regression test that triggers the wait point

Closes SQL-272.
@mtabebe

mtabebe commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Ah I updated the commit message but that didn't update the PR

@mtabebe mtabebe requested a review from def- June 23, 2026 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants