Skip to content

Harden instance/node ID collisions with retry + composite PK (#129)#238

Merged
pinodeca merged 1 commit into
microsoft:mainfrom
crprashant:crprashant/issue-129-full-uuid
Jun 23, 2026
Merged

Harden instance/node ID collisions with retry + composite PK (#129)#238
pinodeca merged 1 commit into
microsoft:mainfrom
crprashant:crprashant/issue-129-full-uuid

Conversation

@crprashant

@crprashant crprashant commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements the maintainer-directed minimal approach for #129 (per @pinodeca / @tjgreen42):

  • df.instancesdf.start() now retries instance-ID generation on collision instead of trusting a single random draw. The ID stays VARCHAR(8) HEX, unchanged — minimal change.
  • df.nodes — the primary key is now composite (instance_id, id), so node IDs only need to be unique per instance rather than globally. The existing nodes_instance_node_key UNIQUE (instance_id, id) is promoted to the primary key.
  • Periodically prune the instance and node tables #261 — left as a future improvement (not implemented here).

How collisions are handled (atomic, no TOCTOU)

Both the instance reserve and every node insert claim their ID via INSERT … ON CONFLICT DO NOTHING RETURNING id and re-roll when zero rows return — there is no check-then-insert window. The shared pick_id_with_retry helper treats the claim as the loop tail and returns an error on exhaustion, never an unverified ID.

The instance row is reserved up front with root_node bound to a pre-generated ID, which is then forced onto the root node — satisfying the deferred same-instance FK at commit without an extra UPDATE (ordinary df roles cannot UPDATE df.instances.root_node).

Node-status scoping

update_node_status now requires instance_id and always scopes its UPDATE by (instance_id, id), removing the global-ID fallback and asserting exactly one row is affected.

Upgrade impact

Adding instance_id to the activity input is a duroxide replay-breaking change for orchestrations in flight across the 0.2.3 → 0.2.4 binary upgrade, so operators must drain or recreate in-flight instances before upgrading. Documented in docs/upgrade-testing.md alongside the instance-retry vs node-composite-PK asymmetry rationale and ADD PRIMARY KEY lock guidance.

Tests

  • 51_node_composite_pk — composite-PK schema contract + instance_id-scoped node-status regression.
  • 52_node_id_collision_across_instances — the same node ID reused across two different instances.
  • Plus pick_id_with_retry unit tests (first-try success, re-roll on conflict, exhaustion without returning an unverified ID, claim-error propagation).

Validation (local, full CONTRIBUTING workflow)

fmt + clippy clean · unit 187/0 · e2e 35/0 · upgrade 36/0 (Scenario A schema-convergence + B1 backward-compat against 0.2.2 & 0.2.3 + B2 data-survival).

@pinodeca pinodeca left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have not yet released v0.2.3, so my primary request is to reverse the version bump in Cargo.toml and to fold the 0.2.3->0.2.4 upgrade script into 0.2.2->0.2.3

Opus 4.7 also suggested these changes, which we can discuss individually and potentially address as follow-ups:

Minor suggestions (non-blocking)

  1. explain.rs UUID detection is too permissive. explain.rs uses uuid::Uuid::parse_str(trimmed).is_ok(), which accepts simple (32-char no hyphens), hyphenated, braced ({...}), and URN (urn:uuid:...) forms. Only canonical hyphenated UUIDs ever appear in df.instances.id, so non-canonical forms will silently fall through to "not found". Tightening to require length 36 + parse_str would make df.explain('garbage32hexcharsthatlookslikeauuid') behave as a DSL-parse error rather than an instance lookup miss. Trivial impact.

  2. Schema snapshot doesn't compare CHECK bodies. Per the PR's own constraint-drift note, test-upgrade.sh doesn't pin pg_get_constraintdef() / convalidated. The new id-format regex is duplicated byte-for-byte in lib.rs and sql/pg_durable--0.2.3--0.2.4.sql; a future drift in one but not the other would only be caught by pgspot + the functional B1/B2 path. Hardening the snapshot to include pg_get_constraintdef() is filed implicitly in the docs — worth a follow-up issue.

  3. No CHANGELOG.md entry. CHANGELOG.md exists at repo root but isn't in the diff. The upgrade-testing.md "v0.2.3 → v0.2.4" section is great, but a CHANGELOG line for the version bump is conventional.

  4. full_uuid() could just be Uuid::new_v4().to_string() inline. It's a one-liner used in exactly one place (new_id). The doc comment is valuable, but the wrapper is borderline. Optional.

  5. Existing-row revalidation. The upgrade re-adds all six format CHECKs as NOT VALID to skip the table scan, which is appropriate for a maintenance-window upgrade. Worth a brief operator-doc line that anyone wanting full constraint validation later can run ALTER TABLE … VALIDATE CONSTRAINT … (rows already conform since 8-char IDs match the relaxed regex).

Suggested follow-ups (separate PRs)

  • Tighten explain.rs UUID detection to canonical form only.
  • Extend test-upgrade.sh schema snapshot to include pg_get_constraintdef() and convalidated.
  • Add a CHANGELOG.md entry

@crprashant crprashant force-pushed the crprashant/issue-129-full-uuid branch from cc4f375 to 74ad427 Compare June 16, 2026 18:55
@crprashant

crprashant commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review, @pinodeca! Addressed the blocking item plus the cheap niceties.

Blocking — version fold

  • Reverted the version bump. There's no longer any Cargo.toml/Cargo.lock change, so the diff is now net-zero against main (0.2.3 is the in-progress version, not a new one).
  • Folded the UUID-widening DDL into the existing sql/pg_durable--0.2.2--0.2.3.sql — it now carries both the df.duroxide_schema() change and the id-column widening (disjoint objects, so order is irrelevant) — and deleted the separate 0.2.3--0.2.4.sql.
  • The runtime gate is now extversion >= 0.2.3.

Minor suggestions (numbered as in your review)

  • Suggestion 1 (explain.rs) — tightened to the canonical 36-char form (trimmed.len() == 36 && Uuid::parse_str(...).is_ok()), so non-canonical UUID spellings now surface as a DSL-parse error instead of a silent instance-lookup miss. ✅
  • Suggestion 3 (CHANGELOG) — added a [0.2.3] - Unreleased entry. ✅
  • Suggestion 5 (VALIDATE CONSTRAINT) — added an operator note in docs/upgrade-testing.md: because the relaxed regex also matches every legacy 8-char id, an operator can run ALTER TABLE … VALIDATE CONSTRAINT … later to clear the NOT VALID marker, and the validating scan cannot fail. ✅
  • Suggestion 4 (full_uuid() wrapper) — left as-is for the doc comment; happy to inline it if you'd prefer.
  • Suggestion 2 (schema-snapshot CHECK-body / convalidated hardening) — agree this is worth doing. I'll file it as a separate follow-up as you suggested, rather than expanding this PR.

Re-ran the full CONTRIBUTING dev workflow locally: fmt/build/clippy clean, unit 171/0, e2e 33/33, and upgrade 20/20 — Scenario A confirms the folded 0.2.2→0.2.3 upgrade converges with a fresh install, and B1/B2 confirm a 0.2.3 .so still emits 8-char ids against an un-upgraded 0.2.2 schema and that existing rows survive.

@crprashant crprashant force-pushed the crprashant/issue-129-full-uuid branch from 74ad427 to 12b50e4 Compare June 16, 2026 19:38
@crprashant

Copy link
Copy Markdown
Contributor Author

CI fix: hardened e2e test 45_connection_limit_timeout against a pre-existing status/output race

The Clippy & Tests (PG17) job failed on one e2e test (30 of 31 passing):

45_connection_limit_timeout.sql:73: ERROR: TEST FAILED: victim output is NULL

I've pushed a fix (folded into the existing commit). Details below.

Root cause — a pre-existing race, not a UUID change. A failed instance's status and its output come from two different sources:

  • df.instances.status is flipped to failed by the update-instance-status activity, which commits to Postgres immediately (a direct UPDATE df.instances SET status=...).
  • In execute_function_graph's failure path, the orchestration .awaits that activity and only afterwards returns Err(err). duroxide records the orchestration's terminal output at the moment the orchestrator returns.
  • df.instance_info() reads status from df.instances but reads output from duroxide.

So there is a brief window where wait_for_completion() already observes status = 'failed' (it polls df.instances.status every 100ms and returns the instant it sees a terminal status) while the output column has not yet been populated. The test read output exactly once, immediately after wait_for_completion() returned, and under slow CI I/O it landed inside that window → NULL.

Why this is not a UUID regression. instance_info() reads output by the very same full-UUID instance_id used for the status read — and the status assertion (victim_status = 'failed') passed. A key/ID mismatch would have broken that read too, and would have failed many other tests rather than this single one. This is also the only e2e test that reads a failed instance's output immediately after wait_for_completion; it passed on the prior CI run and passes locally.

The fix. Poll for the message to materialize instead of reading it once — the same poll-until pattern already used throughout the suite for status:

FOR i IN 1..100 LOOP
    SELECT output INTO victim_output FROM df.instance_info(victim_id);
    EXIT WHEN victim_output IS NOT NULL;
    PERFORM pg_sleep(0.1);
END LOOP;

Why up to 10s (100 × 100ms). It is a deliberately generous upper bound to absorb worst-case CI scheduling/I/O latency between the status commit and duroxide's terminal-output flush. The loop EXITs as soon as output is non-NULL, so on the normal path it resolves on the first iteration and adds no measurable delay; the 10s ceiling only ever applies if something is genuinely wrong (in which case the assertion message now says so). It stays well within the test's existing 30s wait_for_completion budget and the suite's per-test timeouts.

The change is isolated to tests/e2e/sql/45_connection_limit_timeout.sql — no source, schema, or upgrade-script changes. Verified locally: 45_connection_limit_timeout ... PASS.

@pinodeca

Copy link
Copy Markdown
Contributor

Hi @crprashant, I chatted with @tjgreen42 and we agreed on the following:

  • When generating the random df.instance ID, check the table and retry until there's no collision. This reduces collisions on df.instances as long as there are still lots of unused IDs. Keep the ID a varchar 8, HEX, same as today - i.e. minimal change
  • Remove risk of collision on df.nodes by making its PK include the instance_id
  • Periodically prune the instance and node tables #261 - we'll implement this at some point, it doesn't seem urgent now

Would you mind updating your PR accordingly? If you prefer to open a new PR, that's also fine, your choice.

@crprashant crprashant force-pushed the crprashant/issue-129-full-uuid branch from 7adae26 to 306f3bd Compare June 23, 2026 17:12
crprashant added a commit to crprashant/pg_durable that referenced this pull request Jun 23, 2026
…ft#129)

Maintainer-directed minimal change for microsoft#129 / PR microsoft#238: df.start() now
retries instance-ID generation on collision (the ID stays VARCHAR(8)
HEX, unchanged), and df.nodes uses a composite PRIMARY KEY
(instance_id, id) so node IDs only need to be unique per instance
rather than globally. The legacy nodes_instance_node_key UNIQUE
constraint is promoted to the primary key.

Collision handling is atomic: both the instance reserve and every node
insert claim their ID via INSERT ... ON CONFLICT DO NOTHING RETURNING
id and re-roll when zero rows return, so there is no check-then-insert
TOCTOU window. The shared pick_id_with_retry helper treats the claim as
the loop tail and returns an error on exhaustion, never an unverified
ID. The instance row is reserved up front with its root_node bound to a
pre-generated ID that is then forced onto the root node, satisfying the
deferred same-instance FK at commit without an extra UPDATE (ordinary
df roles cannot UPDATE df.instances.root_node).

update_node_status now requires instance_id and always scopes its
UPDATE by (instance_id, id), removing the global-ID fallback and
asserting exactly one row is affected. Adding instance_id to the
activity input is a duroxide replay-breaking change for orchestrations
in flight across the 0.2.3 -> 0.2.4 binary upgrade, so operators must
drain or recreate in-flight instances before upgrading. This is
documented in docs/upgrade-testing.md alongside the instance-retry vs
node-composite-PK asymmetry rationale and ADD PRIMARY KEY lock
guidance.

Add e2e tests 50 (composite-PK schema contract + instance_id-scoped
node-status regression) and 51 (cross-instance node-ID collision).
Update CHANGELOG and the E2E test inventory.
@crprashant crprashant changed the title Implement full UUID instance/node IDs (#129) Harden instance/node ID collisions with retry + composite PK (#129) Jun 23, 2026
@crprashant

Copy link
Copy Markdown
Contributor Author

Thanks @pinodeca — done. I've updated this PR (same branch, force-pushed) to exactly the approach you and @tjgreen42 agreed on:

  • df.instances: df.start() now retries ID generation on collision; the ID stays VARCHAR(8) HEX — minimal change.
  • df.nodes: the PK is now composite (instance_id, id), so node IDs only need to be unique per instance (the existing nodes_instance_node_key UNIQUE is promoted to the primary key).
  • Periodically prune the instance and node tables #261: left as a future improvement, not touched here.

Both the instance reserve and every node insert claim their ID atomically via INSERT … ON CONFLICT DO NOTHING RETURNING id and re-roll on zero rows, so there's no check-then-insert race.

One upgrade caveat worth flagging: update_node_status now takes instance_id, which is a duroxide replay-breaking change for orchestrations in flight across the 0.2.3 → 0.2.4 binary upgrade — so in-flight instances should be drained/recreated before upgrading. That's documented in docs/upgrade-testing.md along with the retry-vs-composite-PK asymmetry rationale and ADD PRIMARY KEY lock guidance.

Local gates all green: fmt/clippy clean, unit 187/0, e2e 35/0, upgrade 36/0. Updated the title and description to match the new approach.

…ft#129)

Maintainer-directed minimal change for microsoft#129 / PR microsoft#238: df.start() now
retries instance-ID generation on collision (the ID stays VARCHAR(8)
HEX, unchanged), and df.nodes uses a composite PRIMARY KEY
(instance_id, id) so node IDs only need to be unique per instance
rather than globally. The legacy nodes_instance_node_key UNIQUE
constraint is promoted to the primary key.

Collision handling is atomic: both the instance reserve and every node
insert claim their ID via INSERT ... ON CONFLICT DO NOTHING RETURNING
id and re-roll when zero rows return, so there is no check-then-insert
TOCTOU window. The shared pick_id_with_retry helper treats the claim as
the loop tail and returns an error on exhaustion, never an unverified
ID. The instance row is reserved up front with its root_node bound to a
pre-generated ID that is then forced onto the root node, satisfying the
deferred same-instance FK at commit without an extra UPDATE (ordinary
df roles cannot UPDATE df.instances.root_node).

update_node_status now requires instance_id and always scopes its
UPDATE by (instance_id, id), removing the global-ID fallback and
asserting exactly one row is affected. Adding instance_id to the
activity input is a duroxide replay-breaking change for orchestrations
in flight across the 0.2.3 -> 0.2.4 binary upgrade, so operators must
drain or recreate in-flight instances before upgrading. This is
documented in docs/upgrade-testing.md alongside the instance-retry vs
node-composite-PK asymmetry rationale and ADD PRIMARY KEY lock
guidance.

Add e2e tests 50 (composite-PK schema contract + instance_id-scoped
node-status regression) and 51 (cross-instance node-ID collision).
Update CHANGELOG and the E2E test inventory.
@crprashant crprashant force-pushed the crprashant/issue-129-full-uuid branch from 306f3bd to e69198b Compare June 23, 2026 18:52

@pinodeca pinodeca left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Reviewed with Opus 4.8

@pinodeca pinodeca merged commit b82f9cd into microsoft:main Jun 23, 2026
12 checks passed
@pinodeca pinodeca linked an issue Jun 23, 2026 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: 8-character hex instance IDs have 32-bit collision risk

2 participants