Harden instance/node ID collisions with retry + composite PK (#129) by crprashant · Pull Request #238 · microsoft/pg_durable

crprashant · 2026-06-16T03:18:08Z

Summary

Implements the maintainer-directed minimal approach for #129 (per @pinodeca / @tjgreen42):

df.instances — df.start() now retries instance-ID generation on collision instead of trusting a single random draw. The ID stays VARCHAR(8) HEX, unchanged — minimal change.
df.nodes — the primary key is now composite (instance_id, id), so node IDs only need to be unique per instance rather than globally. The existing nodes_instance_node_key UNIQUE (instance_id, id) is promoted to the primary key.
Periodically prune the instance and node tables #261 — left as a future improvement (not implemented here).

How collisions are handled (atomic, no TOCTOU)

Both the instance reserve and every node insert claim their ID via INSERT … ON CONFLICT DO NOTHING RETURNING id and re-roll when zero rows return — there is no check-then-insert window. The shared pick_id_with_retry helper treats the claim as the loop tail and returns an error on exhaustion, never an unverified ID.

The instance row is reserved up front with root_node bound to a pre-generated ID, which is then forced onto the root node — satisfying the deferred same-instance FK at commit without an extra UPDATE (ordinary df roles cannot UPDATE df.instances.root_node).

Node-status scoping

update_node_status now requires instance_id and always scopes its UPDATE by (instance_id, id), removing the global-ID fallback and asserting exactly one row is affected.

Upgrade impact

Adding instance_id to the activity input is a duroxide replay-breaking change for orchestrations in flight across the 0.2.3 → 0.2.4 binary upgrade, so operators must drain or recreate in-flight instances before upgrading. Documented in docs/upgrade-testing.md alongside the instance-retry vs node-composite-PK asymmetry rationale and ADD PRIMARY KEY lock guidance.

Tests

51_node_composite_pk — composite-PK schema contract + instance_id-scoped node-status regression.
52_node_id_collision_across_instances — the same node ID reused across two different instances.
Plus pick_id_with_retry unit tests (first-try success, re-roll on conflict, exhaustion without returning an unverified ID, claim-error propagation).

Validation (local, full CONTRIBUTING workflow)

fmt + clippy clean · unit 187/0 · e2e 35/0 · upgrade 36/0 (Scenario A schema-convergence + B1 backward-compat against 0.2.2 & 0.2.3 + B2 data-survival).

pinodeca

We have not yet released v0.2.3, so my primary request is to reverse the version bump in Cargo.toml and to fold the 0.2.3->0.2.4 upgrade script into 0.2.2->0.2.3

Opus 4.7 also suggested these changes, which we can discuss individually and potentially address as follow-ups:

Minor suggestions (non-blocking)

explain.rs UUID detection is too permissive. explain.rs uses uuid::Uuid::parse_str(trimmed).is_ok(), which accepts simple (32-char no hyphens), hyphenated, braced ({...}), and URN (urn:uuid:...) forms. Only canonical hyphenated UUIDs ever appear in df.instances.id, so non-canonical forms will silently fall through to "not found". Tightening to require length 36 + parse_str would make df.explain('garbage32hexcharsthatlookslikeauuid') behave as a DSL-parse error rather than an instance lookup miss. Trivial impact.
Schema snapshot doesn't compare CHECK bodies. Per the PR's own constraint-drift note, test-upgrade.sh doesn't pin pg_get_constraintdef() / convalidated. The new id-format regex is duplicated byte-for-byte in lib.rs and sql/pg_durable--0.2.3--0.2.4.sql; a future drift in one but not the other would only be caught by pgspot + the functional B1/B2 path. Hardening the snapshot to include pg_get_constraintdef() is filed implicitly in the docs — worth a follow-up issue.
No CHANGELOG.md entry. CHANGELOG.md exists at repo root but isn't in the diff. The upgrade-testing.md "v0.2.3 → v0.2.4" section is great, but a CHANGELOG line for the version bump is conventional.
full_uuid() could just be Uuid::new_v4().to_string() inline. It's a one-liner used in exactly one place (new_id). The doc comment is valuable, but the wrapper is borderline. Optional.
Existing-row revalidation. The upgrade re-adds all six format CHECKs as NOT VALID to skip the table scan, which is appropriate for a maintenance-window upgrade. Worth a brief operator-doc line that anyone wanting full constraint validation later can run ALTER TABLE … VALIDATE CONSTRAINT … (rows already conform since 8-char IDs match the relaxed regex).

Suggested follow-ups (separate PRs)

Tighten explain.rs UUID detection to canonical form only.
Extend test-upgrade.sh schema snapshot to include pg_get_constraintdef() and convalidated.
Add a CHANGELOG.md entry

crprashant · 2026-06-16T18:56:15Z

Thanks for the review, @pinodeca! Addressed the blocking item plus the cheap niceties.

Blocking — version fold

Reverted the version bump. There's no longer any Cargo.toml/Cargo.lock change, so the diff is now net-zero against main (0.2.3 is the in-progress version, not a new one).
Folded the UUID-widening DDL into the existing sql/pg_durable--0.2.2--0.2.3.sql — it now carries both the df.duroxide_schema() change and the id-column widening (disjoint objects, so order is irrelevant) — and deleted the separate 0.2.3--0.2.4.sql.
The runtime gate is now extversion >= 0.2.3.

Minor suggestions (numbered as in your review)

Suggestion 1 (explain.rs) — tightened to the canonical 36-char form (trimmed.len() == 36 && Uuid::parse_str(...).is_ok()), so non-canonical UUID spellings now surface as a DSL-parse error instead of a silent instance-lookup miss. ✅
Suggestion 3 (CHANGELOG) — added a [0.2.3] - Unreleased entry. ✅
Suggestion 5 (VALIDATE CONSTRAINT) — added an operator note in docs/upgrade-testing.md: because the relaxed regex also matches every legacy 8-char id, an operator can run ALTER TABLE … VALIDATE CONSTRAINT … later to clear the NOT VALID marker, and the validating scan cannot fail. ✅
Suggestion 4 (full_uuid() wrapper) — left as-is for the doc comment; happy to inline it if you'd prefer.
Suggestion 2 (schema-snapshot CHECK-body / convalidated hardening) — agree this is worth doing. I'll file it as a separate follow-up as you suggested, rather than expanding this PR.

Re-ran the full CONTRIBUTING dev workflow locally: fmt/build/clippy clean, unit 171/0, e2e 33/33, and upgrade 20/20 — Scenario A confirms the folded 0.2.2→0.2.3 upgrade converges with a fresh install, and B1/B2 confirm a 0.2.3 .so still emits 8-char ids against an un-upgraded 0.2.2 schema and that existing rows survive.

crprashant · 2026-06-16T19:39:22Z

CI fix: hardened e2e test `45_connection_limit_timeout` against a pre-existing status/output race

The Clippy & Tests (PG17) job failed on one e2e test (30 of 31 passing):

45_connection_limit_timeout.sql:73: ERROR: TEST FAILED: victim output is NULL

I've pushed a fix (folded into the existing commit). Details below.

Root cause — a pre-existing race, not a UUID change. A failed instance's status and its output come from two different sources:

df.instances.status is flipped to failed by the update-instance-status activity, which commits to Postgres immediately (a direct UPDATE df.instances SET status=...).
In execute_function_graph's failure path, the orchestration .awaits that activity and only afterwards returns Err(err). duroxide records the orchestration's terminal output at the moment the orchestrator returns.
df.instance_info() reads status from df.instances but reads output from duroxide.

So there is a brief window where wait_for_completion() already observes status = 'failed' (it polls df.instances.status every 100ms and returns the instant it sees a terminal status) while the output column has not yet been populated. The test read output exactly once, immediately after wait_for_completion() returned, and under slow CI I/O it landed inside that window → NULL.

Why this is not a UUID regression. instance_info() reads output by the very same full-UUID instance_id used for the status read — and the status assertion (victim_status = 'failed') passed. A key/ID mismatch would have broken that read too, and would have failed many other tests rather than this single one. This is also the only e2e test that reads a failed instance's output immediately after wait_for_completion; it passed on the prior CI run and passes locally.

The fix. Poll for the message to materialize instead of reading it once — the same poll-until pattern already used throughout the suite for status:

FOR i IN 1..100 LOOP
    SELECT output INTO victim_output FROM df.instance_info(victim_id);
    EXIT WHEN victim_output IS NOT NULL;
    PERFORM pg_sleep(0.1);
END LOOP;

Why up to 10s (100 × 100ms). It is a deliberately generous upper bound to absorb worst-case CI scheduling/I/O latency between the status commit and duroxide's terminal-output flush. The loop EXITs as soon as output is non-NULL, so on the normal path it resolves on the first iteration and adds no measurable delay; the 10s ceiling only ever applies if something is genuinely wrong (in which case the assertion message now says so). It stays well within the test's existing 30s wait_for_completion budget and the suite's per-test timeouts.

The change is isolated to tests/e2e/sql/45_connection_limit_timeout.sql — no source, schema, or upgrade-script changes. Verified locally: 45_connection_limit_timeout ... PASS.

pinodeca · 2026-06-22T20:32:21Z

Hi @crprashant, I chatted with @tjgreen42 and we agreed on the following:

When generating the random df.instance ID, check the table and retry until there's no collision. This reduces collisions on df.instances as long as there are still lots of unused IDs. Keep the ID a varchar 8, HEX, same as today - i.e. minimal change
Remove risk of collision on df.nodes by making its PK include the instance_id
Periodically prune the instance and node tables #261 - we'll implement this at some point, it doesn't seem urgent now

Would you mind updating your PR accordingly? If you prefer to open a new PR, that's also fine, your choice.

…ft#129) Maintainer-directed minimal change for microsoft#129 / PR microsoft#238: df.start() now retries instance-ID generation on collision (the ID stays VARCHAR(8) HEX, unchanged), and df.nodes uses a composite PRIMARY KEY (instance_id, id) so node IDs only need to be unique per instance rather than globally. The legacy nodes_instance_node_key UNIQUE constraint is promoted to the primary key. Collision handling is atomic: both the instance reserve and every node insert claim their ID via INSERT ... ON CONFLICT DO NOTHING RETURNING id and re-roll when zero rows return, so there is no check-then-insert TOCTOU window. The shared pick_id_with_retry helper treats the claim as the loop tail and returns an error on exhaustion, never an unverified ID. The instance row is reserved up front with its root_node bound to a pre-generated ID that is then forced onto the root node, satisfying the deferred same-instance FK at commit without an extra UPDATE (ordinary df roles cannot UPDATE df.instances.root_node). update_node_status now requires instance_id and always scopes its UPDATE by (instance_id, id), removing the global-ID fallback and asserting exactly one row is affected. Adding instance_id to the activity input is a duroxide replay-breaking change for orchestrations in flight across the 0.2.3 -> 0.2.4 binary upgrade, so operators must drain or recreate in-flight instances before upgrading. This is documented in docs/upgrade-testing.md alongside the instance-retry vs node-composite-PK asymmetry rationale and ADD PRIMARY KEY lock guidance. Add e2e tests 50 (composite-PK schema contract + instance_id-scoped node-status regression) and 51 (cross-instance node-ID collision). Update CHANGELOG and the E2E test inventory.

crprashant · 2026-06-23T17:17:17Z

Thanks @pinodeca — done. I've updated this PR (same branch, force-pushed) to exactly the approach you and @tjgreen42 agreed on:

df.instances: df.start() now retries ID generation on collision; the ID stays VARCHAR(8) HEX — minimal change.
df.nodes: the PK is now composite (instance_id, id), so node IDs only need to be unique per instance (the existing nodes_instance_node_key UNIQUE is promoted to the primary key).
Periodically prune the instance and node tables #261: left as a future improvement, not touched here.

Both the instance reserve and every node insert claim their ID atomically via INSERT … ON CONFLICT DO NOTHING RETURNING id and re-roll on zero rows, so there's no check-then-insert race.

One upgrade caveat worth flagging: update_node_status now takes instance_id, which is a duroxide replay-breaking change for orchestrations in flight across the 0.2.3 → 0.2.4 binary upgrade — so in-flight instances should be drained/recreated before upgrading. That's documented in docs/upgrade-testing.md along with the retry-vs-composite-PK asymmetry rationale and ADD PRIMARY KEY lock guidance.

Local gates all green: fmt/clippy clean, unit 187/0, e2e 35/0, upgrade 36/0. Updated the title and description to match the new approach.

…ft#129) Maintainer-directed minimal change for microsoft#129 / PR microsoft#238: df.start() now retries instance-ID generation on collision (the ID stays VARCHAR(8) HEX, unchanged), and df.nodes uses a composite PRIMARY KEY (instance_id, id) so node IDs only need to be unique per instance rather than globally. The legacy nodes_instance_node_key UNIQUE constraint is promoted to the primary key. Collision handling is atomic: both the instance reserve and every node insert claim their ID via INSERT ... ON CONFLICT DO NOTHING RETURNING id and re-roll when zero rows return, so there is no check-then-insert TOCTOU window. The shared pick_id_with_retry helper treats the claim as the loop tail and returns an error on exhaustion, never an unverified ID. The instance row is reserved up front with its root_node bound to a pre-generated ID that is then forced onto the root node, satisfying the deferred same-instance FK at commit without an extra UPDATE (ordinary df roles cannot UPDATE df.instances.root_node). update_node_status now requires instance_id and always scopes its UPDATE by (instance_id, id), removing the global-ID fallback and asserting exactly one row is affected. Adding instance_id to the activity input is a duroxide replay-breaking change for orchestrations in flight across the 0.2.3 -> 0.2.4 binary upgrade, so operators must drain or recreate in-flight instances before upgrading. This is documented in docs/upgrade-testing.md alongside the instance-retry vs node-composite-PK asymmetry rationale and ADD PRIMARY KEY lock guidance. Add e2e tests 50 (composite-PK schema contract + instance_id-scoped node-status regression) and 51 (cross-instance node-ID collision). Update CHANGELOG and the E2E test inventory.

pinodeca

LGTM! Reviewed with Opus 4.8

crprashant mentioned this pull request Jun 16, 2026

Bug: 8-character hex instance IDs have 32-bit collision risk #129

Closed

pinodeca requested changes Jun 16, 2026

View reviewed changes

crprashant force-pushed the crprashant/issue-129-full-uuid branch from cc4f375 to 74ad427 Compare June 16, 2026 18:55

crprashant force-pushed the crprashant/issue-129-full-uuid branch from 74ad427 to 12b50e4 Compare June 16, 2026 19:38

crprashant mentioned this pull request Jun 16, 2026

test-upgrade.sh: schema snapshot should compare CHECK bodies (pg_get_constraintdef) and convalidated #241

Closed

crprashant requested a review from pinodeca June 16, 2026 20:18

pinodeca mentioned this pull request Jun 17, 2026

Release v0.2.3 #243

Closed

8 tasks

crprashant mentioned this pull request Jun 18, 2026

Remove df.debug_connection(); reclassify #110 as non-security cleanup #250

Merged

crprashant force-pushed the crprashant/issue-129-full-uuid branch from 7adae26 to 306f3bd Compare June 23, 2026 17:12

crprashant changed the title ~~Implement full UUID instance/node IDs (#129)~~ Harden instance/node ID collisions with retry + composite PK (#129) Jun 23, 2026

crprashant mentioned this pull request Jun 23, 2026

Follow-ups from #238 (#129): node-status scoping test, friendlier NULL-instance_id upgrade abort, short_id format test #264

Open

3 tasks

crprashant force-pushed the crprashant/issue-129-full-uuid branch from 306f3bd to e69198b Compare June 23, 2026 18:52

pinodeca approved these changes Jun 23, 2026

View reviewed changes

pinodeca merged commit b82f9cd into microsoft:main Jun 23, 2026
12 checks passed

pinodeca linked an issue Jun 23, 2026 that may be closed by this pull request

Bug: 8-character hex instance IDs have 32-bit collision risk #129

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Harden instance/node ID collisions with retry + composite PK (#129)#238

Harden instance/node ID collisions with retry + composite PK (#129)#238
pinodeca merged 1 commit into
microsoft:mainfrom
crprashant:crprashant/issue-129-full-uuid

crprashant commented Jun 16, 2026 •

edited

Loading

Uh oh!

pinodeca left a comment

Uh oh!

crprashant commented Jun 16, 2026 •

edited

Loading

Uh oh!

crprashant commented Jun 16, 2026

Uh oh!

pinodeca commented Jun 22, 2026

Uh oh!

crprashant commented Jun 23, 2026

Uh oh!

pinodeca left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

crprashant commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How collisions are handled (atomic, no TOCTOU)

Node-status scoping

Upgrade impact

Tests

Validation (local, full CONTRIBUTING workflow)

Uh oh!

pinodeca left a comment

Choose a reason for hiding this comment

Minor suggestions (non-blocking)

Suggested follow-ups (separate PRs)

Uh oh!

crprashant commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

crprashant commented Jun 16, 2026

CI fix: hardened e2e test 45_connection_limit_timeout against a pre-existing status/output race

Uh oh!

pinodeca commented Jun 22, 2026

Uh oh!

crprashant commented Jun 23, 2026

Uh oh!

pinodeca left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

crprashant commented Jun 16, 2026 •

edited

Loading

crprashant commented Jun 16, 2026 •

edited

Loading

CI fix: hardened e2e test `45_connection_limit_timeout` against a pre-existing status/output race