User Tracing #223

Open
mar-cf wants to merge 7 commits into cloudflare:main from mar-cf:user-tracing-7

@mar-cf mar-cf commented Jun 24, 2026

Overview

This stack adds a user-facing span pipeline to foundations, parallel to the existing internal tracing pipeline. Application code can emit spans into a separate USER_HARNESS, which exports OTLP over HTTP through a Unix domain socket (gRPC is not supported) to an OTLP endpoint. Per-trace routing metadata is carried on the wire, and W3C traceparent continuation is supported both inbound and outbound.

The design is deliberately a mirror of internal tracing onto a second harness: every user API (user_span, start_user_trace, add_user_span_tags!, TelemetryContext.user_span, …) is the get_user() twin of an existing get() API. Two hard rules shape the surface:

  • The user and internal pipelines are independent: separate harnesses, scope stacks, and exporters.
  • The public surface speaks W3C: TraceparentContext is the only trace-stitching type users see.

There are two layers to keep separate:

  • init-time: stand up USER_HARNESS + the exporter, pointed at the OTLP endpoint's socket.
  • per-request activation: start_user_trace(...) opens a root. Without a root, every user_span / with_user_span() / #[span_fn(user = true)] / add_user_span_tags! is a no-op.
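
The activation rule in the second bullet can be modelled with a small standalone sketch. This is not the foundations implementation: the thread-local stack, `ToyScope`, and the `toy_*` functions are hypothetical stand-ins, and the real `user_span` presumably returns a no-op guard rather than an `Option`. It only illustrates that children do nothing until a root has been opened.

```rust
use std::cell::RefCell;

thread_local! {
    // Hypothetical stand-in for the user scope stack; not the foundations type.
    static USER_SCOPES: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

/// Guard that pops its span from the stack when dropped.
pub struct ToyScope;

impl Drop for ToyScope {
    fn drop(&mut self) {
        USER_SCOPES.with(|s| {
            s.borrow_mut().pop();
        });
    }
}

/// Opens a root: the only call that can activate a user trace.
pub fn toy_start_user_trace(name: &str) -> ToyScope {
    USER_SCOPES.with(|s| s.borrow_mut().push(name.to_owned()));
    ToyScope
}

/// Opens a child only when a root is active; otherwise does nothing.
pub fn toy_user_span(name: &str) -> Option<ToyScope> {
    USER_SCOPES.with(|s| {
        let mut stack = s.borrow_mut();
        if stack.is_empty() {
            return None; // no active user trace: no-op
        }
        stack.push(name.to_owned());
        Some(ToyScope)
    })
}

/// Number of user spans currently open on this thread.
pub fn toy_depth() -> usize {
    USER_SCOPES.with(|s| s.borrow().len())
}
```

Calling `toy_user_span` before `toy_start_user_trace` leaves the stack empty, which is the cheap path a request without user tracing would take.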

User guide

Settings (init-time)

User tracing is configured via one optional block, gated by the user-tracing feature and fed to the existing foundations::telemetry::init:

TelemetrySettings.user_tracing: Option<UserTracingSettings>   // None = off
UserTracingSettings {
    enabled: bool,                         // default: true   (mirrors internal TracingSettings)
    max_queue_size: Option<NonZeroUsize>,  // default: 1_000_000; None = unbounded (mirrors internal)
    output: UserTracesOutput,              // only variant today: OtlpUds(..)
}

OtlpUdsOutputSettings {
    socket_path: String,    // ← THE dial that matters: path to the OTLP endpoint's UDS (required, no default)
                            //   e.g. "/path/to/otlp-receptor.sock"
    num_tasks: usize,       // default: 2    (concurrent export workers)
    max_batch_size: usize,  // default: 512  (spans drained per export batch)
}

The only setting a consumer must choose is socket_path (the OTLP endpoint's UDS); it is required and has no default. enabled and max_queue_size mirror their internal-tracing equivalents; num_tasks and max_batch_size tune the exporter. These four all have defaults and can be overridden as needed.
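
To make the defaults concrete, here is a standalone mirror of the settings shape listed above. These are local types written for illustration, not the foundations definitions, and the constructors `new` and `otlp_uds` are hypothetical helpers. The default values are the ones quoted in the listing.

```rust
use std::num::NonZeroUsize;

/// Local mirror of the exporter settings; not the foundations type.
#[derive(Debug, Clone)]
pub struct OtlpUdsOutputSettings {
    pub socket_path: String,
    pub num_tasks: usize,
    pub max_batch_size: usize,
}

impl OtlpUdsOutputSettings {
    /// `socket_path` has no default, so it is the only required argument.
    pub fn new(socket_path: impl Into<String>) -> Self {
        Self {
            socket_path: socket_path.into(),
            num_tasks: 2,        // concurrent export workers
            max_batch_size: 512, // spans drained per export batch
        }
    }
}

/// Only one variant today, as in the listing above.
#[derive(Debug, Clone)]
pub enum UserTracesOutput {
    OtlpUds(OtlpUdsOutputSettings),
}

/// Local mirror of the top-level block; not the foundations type.
#[derive(Debug, Clone)]
pub struct UserTracingSettings {
    pub enabled: bool,
    pub max_queue_size: Option<NonZeroUsize>,
    pub output: UserTracesOutput,
}

impl UserTracingSettings {
    /// Hypothetical helper: every field except the socket path takes its default.
    pub fn otlp_uds(socket_path: impl Into<String>) -> Self {
        Self {
            enabled: true,
            max_queue_size: NonZeroUsize::new(1_000_000), // None would mean unbounded
            output: UserTracesOutput::OtlpUds(OtlpUdsOutputSettings::new(socket_path)),
        }
    }
}
```

With this shape, `UserTracingSettings::otlp_uds("/path/to/otlp-receptor.sock")` is the whole configuration a consumer has to write.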

There is deliberately no sampling configuration here. Unlike internal tracing, the user pipeline is not sampled inside foundations — the inbound user_tracing control header drives the activation (and therefore sampling) decision upstream.

Instrumentation APIs

Toy example covering the whole surface:

use foundations::telemetry::tracing::{
    self, RoutingMetadata, TraceparentContext, add_user_span_tags, user_span,
};

// Per request: open the root. `routing` is required and fixed at construction (inherited by all
// descendants); `inbound` continues an upstream W3C trace, or None for a fresh one.
let _root = tracing::start_user_trace(
    "example_span_name",
    RoutingMetadata { zone_id, account_id, workspace_id, destinations, managed },
    inbound,                       // Option<TraceparentContext>
);

// Children — pick whichever fits:
let _child = user_span("lookup");                 // explicit standalone child
let _s     = tracing::span("db").with_user_span(); // parallel user child *off an internal span*

#[tracing::span_fn("handle_request", user = true)] // whole-fn child (sync or async)
async fn handle_request() { /* ... */ }

// Annotate the current user span (no-op if no user trace is active):
add_user_span_tags!("cache.status" => "HIT");

// Across .await / spawn / separate hooks (UserSpanScope is !Send): hold the context.
let ctx = tracing::start_user_trace("req", routing, None).into_context(); // -> TelemetryContext
ctx.apply(async { let _c = user_span("work"); }).await;

// Outbound propagation to the next hop:
let traceparent: Option<String> = tracing::user_tracing::w3c_traceparent();

// Inbound parsing (strict W3C):
let inbound = TraceparentContext::parse(header_bytes); // -> Option<TraceparentContext>

Key points for users:

  • The context carries the user span: into_context() / TelemetryContext::current() / #[span_fn] all propagate it across .await, even through an internal span's context — no manual threading.
  • with_user_span() and #[span_fn(user = true)] open a user span in parallel with an internal span. This follows the guideline of creating an internal span for every user span, so a single call feeds both pipelines.

Notes to reviewers

The stack is four stacked PRs covering seven commits, labeled PR1 through PR7 below. PR2–4 introduce the core user-span APIs; the surrounding ones prepare for and power them: PR1 makes them safe to build, PR5–6 export them off-box, and PR7 propagates them over W3C.

PR1 — user-tracing-1: Make SharedSpan construction explicit

Internal-only preparatory change: replaces the blanket impl From<Span> for SharedSpan (which always produced a tracked span) with an explicit shared_span() constructor, so user spans can later be built untracked and never enter the internal live registry.
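
The motivation can be shown with a toy model. Everything below is hypothetical (`ToyRegistry`, `ToySharedSpan`, and both constructors are illustrative names, not the foundations code); it only demonstrates why an explicit constructor makes registration a visible choice where a blanket `From` impl would hide it.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug)]
pub struct ToySpan {
    pub name: String,
}

/// Stand-in for the internal live registry of tracked spans.
#[derive(Default)]
pub struct ToyRegistry {
    live: Mutex<Vec<Arc<ToySpan>>>,
}

impl ToyRegistry {
    pub fn live_count(&self) -> usize {
        self.live.lock().unwrap().len()
    }
}

pub struct ToySharedSpan {
    pub inner: Arc<ToySpan>,
    pub tracked: bool,
}

/// Tracked construction: registers the span, which a blanket `From` impl
/// would do implicitly on every conversion.
pub fn toy_shared_span(span: ToySpan, registry: &ToyRegistry) -> ToySharedSpan {
    let inner = Arc::new(span);
    registry.live.lock().unwrap().push(Arc::clone(&inner));
    ToySharedSpan { inner, tracked: true }
}

/// Untracked construction: never touches the registry.
pub fn toy_user_shared_span(span: ToySpan) -> ToySharedSpan {
    ToySharedSpan { inner: Arc::new(span), tracked: false }
}
```

Because the tracked path needs the registry as an argument, a call site cannot register a span by accident through `.into()`.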

PR2–4 — user-tracing-2-4: the in-process user API

Introduces the core and the ways to drive it:

  • core: user_shared_span (untracked), start_user_trace / user_span, the UserSpanScope guard, the add_user_span_tags! / add_user_span_log_fields! / set_user_span_finish_callback! macros, get_user() + a dedicated USER_NOOP_HARNESS fallback, and the user_traces() test sink that observes user spans independently of the internal pipeline.
  • propagation: TelemetryContext.user_span + UserSpanScope::into_context(), so a user span survives .await / spawn and rides along on an internal span's context.
  • ergonomics: SpanScope::with_user_span() and #[span_fn(user = true)].

PR5–6 — user-tracing-5-6: ship spans off-box

  • a temporary cf-rustracing [patch.crates-io] for the construction-time RoutingMetadata span field (placeholder for a later version bump).
  • the per-process pipeline (UserTracingSettings, init_user / USER_HARNESS, the OTLP/UDS exporter that encodes RoutingMetadata into the cf-trace-config header) wired into telemetry::init, plus start_user_trace's required routing arg. Verified by producer tests that decode the exported OTLP body.

PR7 — user-tracing-7: W3C trace propagation

The TraceparentContext W3C parser, the optional inbound stitch on start_user_trace (continues the upstream trace — shared trace id, inbound parent), and user_tracing::w3c_traceparent() for outbound.
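
For reviewers less familiar with the header, the following is a minimal standalone sketch of strict `traceparent` parsing and formatting per the W3C Trace Context format (`version-traceid-parentid-flags`, lowercase hex, all-zero ids invalid). It is not the PR's `TraceparentContext`: `ToyTraceparent` is a hypothetical type, and accepting only version `00` is an assumption about what "strict" means here.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyTraceparent {
    pub trace_id: u128,
    pub parent_id: u64,
    pub sampled: bool,
}

/// True when `s` is non-empty and contains only lowercase hex digits.
fn is_lower_hex(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

impl ToyTraceparent {
    /// Parses `00-<32 hex>-<16 hex>-<2 hex>`; anything else yields `None`.
    pub fn parse(header: &[u8]) -> Option<Self> {
        let text = std::str::from_utf8(header).ok()?;
        let mut parts = text.split('-');
        let (version, trace, parent, flags) =
            (parts.next()?, parts.next()?, parts.next()?, parts.next()?);
        if parts.next().is_some() {
            return None; // trailing fields are rejected for version 00
        }
        if version != "00" || trace.len() != 32 || parent.len() != 16 || flags.len() != 2 {
            return None;
        }
        if !(is_lower_hex(trace) && is_lower_hex(parent) && is_lower_hex(flags)) {
            return None; // uppercase hex and stray characters are invalid
        }
        let trace_id = u128::from_str_radix(trace, 16).ok()?;
        let parent_id = u64::from_str_radix(parent, 16).ok()?;
        let flags = u8::from_str_radix(flags, 16).ok()?;
        if trace_id == 0 || parent_id == 0 {
            return None; // all-zero ids are invalid per the spec
        }
        Some(Self { trace_id, parent_id, sampled: flags & 0x01 == 1 })
    }

    /// Formats the outbound header; only the sampled flag bit is preserved.
    pub fn to_header(&self) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id,
            self.parent_id,
            u8::from(self.sampled)
        )
    }
}
```

A continuation keeps `trace_id` and uses the inbound `parent_id` as the parent of the new root, which matches the "shared trace id, inbound parent" behaviour described above.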


Alternatives considered

If constrained not to include user tracing specifics in this codebase, we can open up some seams to plug in similar functionality.

Opaque routing metadata. Rather than a typed RoutingMetadata, cf-rustracing/foundations could carry routing as a type-erased value they never interpret, leaving the only downcast to the exporter:

// cf-rustracing: opaque metadata on the span — no Cloudflare-specific types
metadata: Option<Arc<dyn Any + Send + Sync>>,
fn set_metadata(&mut self, value: Arc<dyn Any + Send + Sync>);
fn metadata(&self) -> Option<&(dyn Any + Send + Sync)>;

// exporter (e.g. in oxy): the only place that knows the concrete type
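// (assuming the value was stored as an Arc<RoutingMetadata> coerced to Arc<dyn Any + Send + Sync>,
// the downcast target is the payload type itself)
span.metadata().and_then(|m| m.downcast_ref::<RoutingMetadata>())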

This keeps Cloudflare routing concepts (zone / account / workspace / destinations) out of cf-rustracing. We chose the typed Option<RoutingMetadata> field: those types become visible in cf-rustracing, in exchange for direct typed access and no Any downcast.
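
For comparison, the opaque alternative would look roughly like the standalone sketch below. `OpaqueSpan`, `ToyRoutingMetadata`, and `routing_of` are hypothetical names, and the two fields are illustrative only. One detail worth noting: when the value is stored as `Arc::new(payload)` coerced to `Arc<dyn Any + Send + Sync>`, the downcast target is the payload type itself.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical routing payload; the field set is illustrative only.
#[derive(Debug, PartialEq)]
pub struct ToyRoutingMetadata {
    pub zone_id: u64,
    pub account_id: u64,
}

/// Span that carries metadata it never interprets.
#[derive(Default)]
pub struct OpaqueSpan {
    metadata: Option<Arc<dyn Any + Send + Sync>>,
}

impl OpaqueSpan {
    pub fn set_metadata(&mut self, value: Arc<dyn Any + Send + Sync>) {
        self.metadata = Some(value);
    }

    pub fn metadata(&self) -> Option<&(dyn Any + Send + Sync)> {
        self.metadata.as_deref()
    }
}

/// Exporter side: the only place that names the concrete type.
pub fn routing_of(span: &OpaqueSpan) -> Option<&ToyRoutingMetadata> {
    span.metadata().and_then(|m| m.downcast_ref::<ToyRoutingMetadata>())
}
```

The cost of this seam is visible in `routing_of`: a payload of the wrong type is only detected at run time, as a `None`, which is the trade-off the typed field avoids.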

Pluggable exporter seam (BatchHandler). Rather than a concrete exporter inside foundations, foundations could accept an object-safe Arc<dyn BatchHandler> and run a shared drain loop, with the OTLP/UDS handler implemented in oxy:

// foundations: generic over the handler, free of OTLP/UDS deps
trait BatchHandler { /* process_batch(spans), initializer() */ }
init_user_harness(.., handler: Arc<dyn BatchHandler>);
// oxy: UserTracingUdsHandler implements BatchHandler

This would keep the OTLP/UDS (hyper / prost) deps and wire format out of foundations. We chose a concrete output module (output_otlp_uds, mirroring output_jaeger_thrift_udp / output_otlp_grpc): no trait object or plugin point — future destinations are new UserTracesOutput enum variants.
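
A standalone sketch of the rejected seam, for reference. The trait below keeps only `process_batch` (the `initializer()` hook from the outline is omitted), and `QueuedSpan`, `drain`, and `RecordingHandler` are hypothetical names. The synchronous loop stands in for whatever async drain the real exporter runs.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

pub struct QueuedSpan {
    pub name: String,
}

/// Object-safe handler: no generic methods and no `Self` in return position,
/// so it can be used as `Arc<dyn BatchHandler>`.
pub trait BatchHandler: Send + Sync {
    fn process_batch(&self, spans: Vec<QueuedSpan>);
}

/// Shared drain loop: hands the queue to the handler in batches of at most
/// `max_batch_size` spans and returns the number of batches delivered.
pub fn drain(
    queue: &mut VecDeque<QueuedSpan>,
    max_batch_size: usize,
    handler: Arc<dyn BatchHandler>,
) -> usize {
    let mut batches = 0;
    while !queue.is_empty() {
        // Treat a batch size of 0 as 1 so the loop always makes progress.
        let take = max_batch_size.max(1).min(queue.len());
        let batch: Vec<QueuedSpan> = queue.drain(..take).collect();
        handler.process_batch(batch);
        batches += 1;
    }
    batches
}

/// Handler that records batch sizes; a real one would encode and send.
#[derive(Default)]
pub struct RecordingHandler {
    pub sizes: Mutex<Vec<usize>>,
}

impl BatchHandler for RecordingHandler {
    fn process_batch(&self, spans: Vec<QueuedSpan>) {
        self.sizes.lock().unwrap().push(spans.len());
    }
}
```

Draining five queued spans with a batch size of 2 produces three calls of sizes 2, 2, and 1. The chosen design gets the same batching from the concrete output module without the trait object.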

Additional notes

  • Public surface speaks W3C; the jaeger SpanContext/SpanContextState conversion is internal-only. TraceparentContext is the only stitch type users see.
  • The cf-rustracing [patch.crates-io] in PR5 is intentionally temporary; the matching change is a separate cf-rustracing PR and will become a normal version bump once released.

mar-cf changed the title from "User tracing 7" to "User Tracing" on Jun 24, 2026
Comment thread Cargo.toml
Comment on lines +126 to +128
# TEMPORARY
[patch.crates-io]
cf-rustracing = { git = "https://github.com/mar-cf/rustracing.git", branch = "user-tracing" }


Replace the blanket `impl From<Span> for SharedSpan` — which always registered the span in the harness's `active_roots` — with an explicit `shared_span()` constructor, and migrate the internal call sites. This makes tracked construction a deliberate choice and sets up a later untracked variant for user spans.
mar-cf added 6 commits June 24, 2026 18:26
Adds the user-span pipeline core: an `Untracked` `user_shared_span` constructor, the `start_user_trace`/`user_span` entry points and `UserSpanScope` guard on a separate `USER_HARNESS` (with a `USER_NOOP_HARNESS` fallback), plus the `add_user_span_tags!`/`add_user_span_log_fields!`/`set_user_span_finish_callback!` macros. Includes the test harness (`user_traces()` second sink) so user spans are observed independently of the internal pipeline. `start_user_trace` is name-only here; routing and inbound W3C continuation are layered on in later changes.
Adds a `user_span` slot to `TelemetryContext` (captured by `current()`, re-established by `scope()`, cloned across forks) plus `UserSpanScope::into_context()` and a parallel carry on `SpanScope::into_context()`. This lets a user span survive `.await`/`tokio::spawn` and ride along even when propagation goes through an internal span's context — no explicit threading. Verified by `propagates_across_await` and `user_span_carried_by_internal_context`.
Adds `SpanScope::with_user_span()` to open a parallel user span off an internal span (named after it), and a `user = true` option on `#[span_fn]` that does the same for whole functions (sync and async). Both are no-ops when no user trace is active. Covered by macro snapshot tests plus parallel and no-op runtime tests.
Points `cf-rustracing` at the fork branch that adds `RoutingMetadata` as a span property, needed by the user-tracing exporter and `start_user_trace` routing. Placeholder to be replaced by a normal version bump once the rustracing change is released.
Adds the per-process user pipeline — `UserTracingSettings`, `init_user`/`USER_HARNESS`, and the OTLP-over-UDS exporter that encodes `RoutingMetadata` into the `cf-trace-config` header — wired into `telemetry::init`. `start_user_trace` now takes a required `RoutingMetadata` attached at span construction and inherited by descendants (the exporter drops routing-less spans). Verified end-to-end by producer tests that decode the exported OTLP body.
Adds the `TraceparentContext` W3C parser and wires it through: `start_user_trace` gains an optional `inbound` traceparent that stitches the user root onto the upstream trace (shared trace id, inbound parent), and `user_tracing::w3c_traceparent()` derives the header for the current user span for outbound propagation. Covered by parser unit tests plus continuation tests through the test harness and the OTLP/UDS producer path.
@mar-cf mar-cf marked this pull request as ready for review June 24, 2026 16:27