Skip to content

feat(server-ng): commit-time metadata validation with result codes#3553

Merged
numinnex merged 5 commits into
apache:masterfrom
krishvishal:workload-prereqs
Jun 26, 2026
Merged

feat(server-ng): commit-time metadata validation with result codes#3553
numinnex merged 5 commits into
apache:masterfrom
krishvishal:workload-prereqs

Conversation

@krishvishal

Copy link
Copy Markdown
Contributor

Why

server-ng previously validated metadata business rules during preflight: primary-only, before replication. Invalid requests were then silently dropped.

This caused two issues:

1. VSR correctness. A rejection must be a deterministic function of the committed log. It should be computed during apply on every replica and recorded so that all replicas agree. Preflight validation uses a primary-local snapshot, and its verdict is never replicated. After a view change, the new primary is not guaranteed to reach the same decision.

2. Infinite retry. The in-process path mapped the preflight Err to Canceled, leaving the home shard silent. As a result, the SDK kept retrying permanently invalid requests forever, such as CreatePartitions on a missing topic.

What Changed

Result taxonomy. Added core/metadata/src/stm/result.rs. There is now one closed #[repr(u32)] enum per operation: Ok = 0, with other discriminants reused from the existing IggyError code space. This keeps SDK error mapping compatible with existing codes.

Apply now returns committed results. StateHandler::apply now returns ApplyReply { code, body } instead of Bytes. Every previous silent Bytes::new() no-op is now represented as a committed result code. MuxStateMachine::update now reserves Err for decode/corruption failures only.

Result is encoded in the reply body. The result rides in the reply body. The reply body now starts with a sparse result section, [count][index, result]*, followed by the payload. This is written in place via build_reply_message_with, using one allocation and one payload copy. There is no ReplyHeader layout change.

Eviction instead of silent drop. Structurally invalid requests are now evicted rather than silently dropped: not-client-allowed maps to InvalidRequestOperation, while undecodable/overflow maps to InvalidRequestBody.

Removed preflight partition-count read. Deleted the current_partition_count preflight read. Parent existence is now checked at commit time and returned as CreatePartitionsResult::{Stream, Topic}NotFound, removing the TOCTOU window.

Consensus frame size floor. Every consensus frame’s size must now span its header. This is asserted during consensus_message construction, PrepareOk projection, and in the simulator. This prevents reply-body slicing from underflowing.

Simulator changes. The simulator now uses outcome-first generation: operations target a chosen outcome, including error outcomes; on_reply decodes the committed code from the body; and the decoded code is asserted to be a declared result code. The strict targeted-equals-committed oracle is gated to serial runs only: client_count == 1 && CLIENT_REQUEST_QUEUE_MAX == 1.

Wire / SDK Impact

ReplyHeader is unchanged. This is not a #[repr(C)] layout change. However, the reply body format changes. A result section now precedes the payload, so body decoding changes for success replies as well. There is no released server-ng wire format to break.

@github-actions github-actions Bot added the S-waiting-on-review PR is waiting on a reviewer label Jun 23, 2026
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 66.56687% with 335 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.01%. Comparing base (8c1fefb) to head (97bf3e3).

Files with missing lines Patch % Lines
core/metadata/src/stm/stream.rs 72.22% 40 Missing ⚠️
core/metadata/src/stm/consumer_group.rs 81.03% 22 Missing ⚠️
...re/simulator/src/workload/ops/create_partitions.rs 0.00% 20 Missing and 1 partial ⚠️
...re/simulator/src/workload/ops/delete_partitions.rs 0.00% 20 Missing ⚠️
core/simulator/src/workload/ops/create_topic.rs 0.00% 19 Missing ⚠️
core/simulator/src/workload/ops/update_user.rs 0.00% 19 Missing ⚠️
...imulator/src/workload/ops/delete_consumer_group.rs 0.00% 17 Missing ⚠️
core/metadata/src/stm/user.rs 79.48% 16 Missing ⚠️
core/simulator/src/workload/ops/update_topic.rs 0.00% 16 Missing ⚠️
core/metadata/src/stm/result.rs 91.66% 15 Missing ⚠️
... and 19 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3553      +/-   ##
============================================
- Coverage     74.02%   74.01%   -0.01%     
  Complexity      937      937              
============================================
  Files          1247     1249       +2     
  Lines        127567   128248     +681     
  Branches     103436   104159     +723     
============================================
+ Hits          94427    94927     +500     
- Misses        30104    30232     +128     
- Partials       3036     3089      +53     
Components Coverage Δ
Rust Core 74.70% <66.56%> (+0.04%) ⬆️
Java SDK 62.44% <ø> (ø)
C# SDK 71.41% <ø> (-0.70%) ⬇️
Python SDK 88.88% <ø> (ø)
PHP SDK 84.29% <ø> (ø)
Node SDK 91.13% <ø> (-0.10%) ⬇️
Go SDK 40.14% <ø> (ø)
Files with missing lines Coverage Δ
core/binary_protocol/src/consensus/reply_result.rs 100.00% <100.00%> (ø)
core/common/src/error/iggy_error.rs 100.00% <ø> (ø)
core/consensus/src/metadata_helpers.rs 91.51% <100.00%> (ø)
core/consensus/src/plane_helpers.rs 92.53% <100.00%> (+0.14%) ⬆️
core/metadata/src/stm/mod.rs 75.44% <ø> (ø)
core/metadata/src/stm/mux.rs 94.23% <100.00%> (ø)
core/shard/src/lib.rs 73.97% <100.00%> (ø)
core/simulator/src/lib.rs 93.04% <100.00%> (+0.54%) ⬆️
core/simulator/src/packet.rs 80.21% <100.00%> (+0.25%) ⬆️
core/simulator/src/workload/auditor.rs 77.19% <100.00%> (+5.76%) ⬆️
... and 36 more

... and 38 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@numinnex

numinnex commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

I think is worth addressing in this PR is using the custom decoder in the SDK (hidden behind the vsr feature flag) to take into account those status error codes and map them to IggyError. Alternatively you could put that inside of the binary_protocol crate, and reuse that inside of simulator for the mapping.

@numinnex numinnex left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add support to SDK to parse the new status codes

@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels Jun 24, 2026
@numinnex numinnex merged commit 307fdb1 into apache:master Jun 26, 2026
165 of 167 checks passed
@github-actions github-actions Bot removed the S-waiting-on-author PR is waiting on author response label Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants