Harden instance/node ID collisions with retry + composite PK (#129)#238
Conversation
pinodeca
left a comment
There was a problem hiding this comment.
We have not yet released v0.2.3, so my primary request is to reverse the version bump in Cargo.toml and to fold the 0.2.3->0.2.4 upgrade script into 0.2.2->0.2.3
Opus 4.7 also suggested these changes, which we can discuss individually and potentially address as follow-ups:
Minor suggestions (non-blocking)
-
explain.rs UUID detection is too permissive. explain.rs uses
uuid::Uuid::parse_str(trimmed).is_ok(), which accepts simple (32-char no hyphens), hyphenated, braced ({...}), and URN (urn:uuid:...) forms. Only canonical hyphenated UUIDs ever appear indf.instances.id, so non-canonical forms will silently fall through to "not found". Tightening to require length 36 +parse_strwould makedf.explain('garbage32hexcharsthatlookslikeauuid')behave as a DSL-parse error rather than an instance lookup miss. Trivial impact. -
Schema snapshot doesn't compare CHECK bodies. Per the PR's own constraint-drift note, test-upgrade.sh doesn't pin
pg_get_constraintdef()/convalidated. The new id-format regex is duplicated byte-for-byte in lib.rs and sql/pg_durable--0.2.3--0.2.4.sql; a future drift in one but not the other would only be caught by pgspot + the functional B1/B2 path. Hardening the snapshot to includepg_get_constraintdef()is filed implicitly in the docs — worth a follow-up issue. -
No CHANGELOG.md entry. CHANGELOG.md exists at repo root but isn't in the diff. The upgrade-testing.md "v0.2.3 → v0.2.4" section is great, but a CHANGELOG line for the version bump is conventional.
-
full_uuid()could just beUuid::new_v4().to_string()inline. It's a one-liner used in exactly one place (new_id). The doc comment is valuable, but the wrapper is borderline. Optional. -
Existing-row revalidation. The upgrade re-adds all six format CHECKs as
NOT VALIDto skip the table scan, which is appropriate for a maintenance-window upgrade. Worth a brief operator-doc line that anyone wanting full constraint validation later can runALTER TABLE … VALIDATE CONSTRAINT …(rows already conform since 8-char IDs match the relaxed regex).
Suggested follow-ups (separate PRs)
- Tighten explain.rs UUID detection to canonical form only.
- Extend test-upgrade.sh schema snapshot to include
pg_get_constraintdef()andconvalidated. - Add a CHANGELOG.md entry
cc4f375 to
74ad427
Compare
|
Thanks for the review, @pinodeca! Addressed the blocking item plus the cheap niceties. Blocking — version fold
Minor suggestions (numbered as in your review)
Re-ran the full CONTRIBUTING dev workflow locally: |
74ad427 to
12b50e4
Compare
CI fix: hardened e2e test
|
|
Hi @crprashant, I chatted with @tjgreen42 and we agreed on the following:
Would you mind updating your PR accordingly? If you prefer to open a new PR, that's also fine, your choice. |
7adae26 to
306f3bd
Compare
…ft#129) Maintainer-directed minimal change for microsoft#129 / PR microsoft#238: df.start() now retries instance-ID generation on collision (the ID stays VARCHAR(8) HEX, unchanged), and df.nodes uses a composite PRIMARY KEY (instance_id, id) so node IDs only need to be unique per instance rather than globally. The legacy nodes_instance_node_key UNIQUE constraint is promoted to the primary key. Collision handling is atomic: both the instance reserve and every node insert claim their ID via INSERT ... ON CONFLICT DO NOTHING RETURNING id and re-roll when zero rows return, so there is no check-then-insert TOCTOU window. The shared pick_id_with_retry helper treats the claim as the loop tail and returns an error on exhaustion, never an unverified ID. The instance row is reserved up front with its root_node bound to a pre-generated ID that is then forced onto the root node, satisfying the deferred same-instance FK at commit without an extra UPDATE (ordinary df roles cannot UPDATE df.instances.root_node). update_node_status now requires instance_id and always scopes its UPDATE by (instance_id, id), removing the global-ID fallback and asserting exactly one row is affected. Adding instance_id to the activity input is a duroxide replay-breaking change for orchestrations in flight across the 0.2.3 -> 0.2.4 binary upgrade, so operators must drain or recreate in-flight instances before upgrading. This is documented in docs/upgrade-testing.md alongside the instance-retry vs node-composite-PK asymmetry rationale and ADD PRIMARY KEY lock guidance. Add e2e tests 50 (composite-PK schema contract + instance_id-scoped node-status regression) and 51 (cross-instance node-ID collision). Update CHANGELOG and the E2E test inventory.
|
Thanks @pinodeca — done. I've updated this PR (same branch, force-pushed) to exactly the approach you and @tjgreen42 agreed on:
Both the instance reserve and every node insert claim their ID atomically via One upgrade caveat worth flagging: Local gates all green: |
…ft#129) Maintainer-directed minimal change for microsoft#129 / PR microsoft#238: df.start() now retries instance-ID generation on collision (the ID stays VARCHAR(8) HEX, unchanged), and df.nodes uses a composite PRIMARY KEY (instance_id, id) so node IDs only need to be unique per instance rather than globally. The legacy nodes_instance_node_key UNIQUE constraint is promoted to the primary key. Collision handling is atomic: both the instance reserve and every node insert claim their ID via INSERT ... ON CONFLICT DO NOTHING RETURNING id and re-roll when zero rows return, so there is no check-then-insert TOCTOU window. The shared pick_id_with_retry helper treats the claim as the loop tail and returns an error on exhaustion, never an unverified ID. The instance row is reserved up front with its root_node bound to a pre-generated ID that is then forced onto the root node, satisfying the deferred same-instance FK at commit without an extra UPDATE (ordinary df roles cannot UPDATE df.instances.root_node). update_node_status now requires instance_id and always scopes its UPDATE by (instance_id, id), removing the global-ID fallback and asserting exactly one row is affected. Adding instance_id to the activity input is a duroxide replay-breaking change for orchestrations in flight across the 0.2.3 -> 0.2.4 binary upgrade, so operators must drain or recreate in-flight instances before upgrading. This is documented in docs/upgrade-testing.md alongside the instance-retry vs node-composite-PK asymmetry rationale and ADD PRIMARY KEY lock guidance. Add e2e tests 50 (composite-PK schema contract + instance_id-scoped node-status regression) and 51 (cross-instance node-ID collision). Update CHANGELOG and the E2E test inventory.
306f3bd to
e69198b
Compare
pinodeca
left a comment
There was a problem hiding this comment.
LGTM! Reviewed with Opus 4.8
Summary
Implements the maintainer-directed minimal approach for #129 (per @pinodeca / @tjgreen42):
df.instances—df.start()now retries instance-ID generation on collision instead of trusting a single random draw. The ID staysVARCHAR(8)HEX, unchanged — minimal change.df.nodes— the primary key is now composite(instance_id, id), so node IDs only need to be unique per instance rather than globally. The existingnodes_instance_node_key UNIQUE (instance_id, id)is promoted to the primary key.How collisions are handled (atomic, no TOCTOU)
Both the instance reserve and every node insert claim their ID via
INSERT … ON CONFLICT DO NOTHING RETURNING idand re-roll when zero rows return — there is no check-then-insert window. The sharedpick_id_with_retryhelper treats the claim as the loop tail and returns an error on exhaustion, never an unverified ID.The instance row is reserved up front with
root_nodebound to a pre-generated ID, which is then forced onto the root node — satisfying the deferred same-instance FK at commit without an extraUPDATE(ordinarydfroles cannotUPDATE df.instances.root_node).Node-status scoping
update_node_statusnow requiresinstance_idand always scopes itsUPDATEby(instance_id, id), removing the global-ID fallback and asserting exactly one row is affected.Upgrade impact
Adding
instance_idto the activity input is a duroxide replay-breaking change for orchestrations in flight across the0.2.3 → 0.2.4binary upgrade, so operators must drain or recreate in-flight instances before upgrading. Documented indocs/upgrade-testing.mdalongside the instance-retry vs node-composite-PK asymmetry rationale andADD PRIMARY KEYlock guidance.Tests
51_node_composite_pk— composite-PK schema contract +instance_id-scoped node-status regression.52_node_id_collision_across_instances— the same node ID reused across two different instances.pick_id_with_retryunit tests (first-try success, re-roll on conflict, exhaustion without returning an unverified ID, claim-error propagation).Validation (local, full CONTRIBUTING workflow)
fmt+clippyclean · unit 187/0 · e2e 35/0 · upgrade 36/0 (Scenario A schema-convergence + B1 backward-compat against 0.2.2 & 0.2.3 + B2 data-survival).