spec(runner): sync spec with code drift, add OpenShell desired state by markturansky · Pull Request #1643 · ambient-code/platform

markturansky · 2026-06-03T20:43:14Z

Summary

Syncs runner spec with actual code: adds missing files (model.py, observability_config.py, observability_privacy.py, mlflow_observability.py, fixtures/), removes duplicate workspace.py entry
Documents undocumented features: AGUI_TOKEN session auth middleware, SDK_OPTIONS env var, POST /model endpoint
Adds "Desired State: OpenShell Credential Isolation" section describing how OpenShell's sandbox layers (netns, Landlock, seccomp-BPF, OPA) extend the existing sidecar credential model to prevent agent escape via Bash/network access

Test plan

Verified spec source layout matches actual filesystem
Confirmed OpenShell desired state is consistent with existing sidecar credential model on main
Review OpenShell desired state section for architectural accuracy

🤖 Generated with Claude Code

netlify · 2026-06-03T20:43:18Z

👷 Deploy Preview for cheerful-kitten-f556a0 processing.

Name	Link
🔨 Latest commit	`c71770d`
🔍 Latest deploy log	https://app.netlify.com/projects/cheerful-kitten-f556a0/deploys/6a22c4cf1bb94b0008de63b2

coderabbitai · 2026-06-03T20:43:19Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f3f70ee7-87fa-4fa6-862f-ad241779da73

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch spec/runner-openshell-desired-state

✨ Simplify code

Create PR with simplified code
Commit simplified code in branch spec/runner-openshell-desired-state

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

markturansky

amber review — Round 1

Three findings from cross-checking the spec changes against the current codebase on main.

Required — Must Fix

Broken companion doc references in OpenShell section

The OpenShell Desired State section references two documents that do not exist anywhere in the repo:

See: docs/internal/agents/openshell-security-analysis.md
See: docs/internal/agents/openshell-runner-adaptation.md

Checked docs/internal/agents/ on main and on this branch — only README.md and an active/ subdirectory are present. Neither file exists.

Fix: remove the See: links, or replace them with a forward reference (TBD / planned) until the docs are created.

Notable Gap

operational_events.py missing from spec layout

This PR's stated goal is to sync the spec with code drift. operational_events.py was introduced in PR #1633 at bridges/claude/operational_events.py and is present in the repo, but it is absent from the File Layout table in this spec update.

The bridges/claude/ section now lists fixtures/ and other additions — operational_events.py should be included for completeness.

Architectural Note (non-blocking)

_grpc_client.py in OpenShell "Files to Modify" may be a no-op

The OpenShell desired-state section lists _grpc_client.py as a file to modify for the sandbox adaptation. gRPC communication already happens in the runner process (outside the Claude subprocess boundary), so sandboxing the Claude subprocess with Landlock/seccomp-BPF should not require changes to the gRPC client. Worth confirming whether this file actually needs modification before implementation begins, to avoid scope creep in the adaptation work.

Blocking on the companion doc fix before approve.

markturansky

amber review — Round 2

All three Round 1 findings addressed. Codebase verification passed: populate_runtime_credentials() in platform/auth.py ✅, secret_redaction.py ✅, _grpc_client.py ✅, operational_events.py ✅, SessionWorker in bridges/claude/session.py ✅.

One notable gap in the companion doc to fix before implementation starts — the spec and companion doc disagree on what gets sandboxed.

Notable Gap — Sandboxing scope inconsistent between spec and companion doc

runner.spec.md (the spec) says the Claude subprocess is sandboxed:

Desired:  Runner container → OpenShell Supervisor → Claude subprocess (sandboxed netns)

Files to Modify: bridges/claude/bridge.py — "Launch Claude CLI via OpenShell Supervisor instead of direct subprocess." gRPC: "Verify no changes needed — gRPC runs in runner process outside sandbox boundary."

openshell-runner-adaptation.md (companion doc) describes wrapping the entire uvicorn process:

CMD ["openshell-sandbox", "--provider", "ambient", "--", \
     "/bin/bash", "-c", "umask 0022 && cd /app/ambient-runner && uvicorn main:app ..."]

The Supervisor wraps the uvicorn process, applying Landlock + seccomp + netns before exec.

In this model, the runner itself is inside the sandbox, and the companion doc's Files to Modify table accordingly lists _grpc_client.py — "Proxy-aware channel construction for intra-cluster gRPC."

These are different architectures. If an implementer follows the companion doc's Dockerfile, they'll sandbox the runner (not just Claude subprocess), and gRPC will need proxy-awareness — directly contradicting the spec. If they follow the spec, the Claude subprocess is sandboxed via bridge.py, and gRPC stays outside the sandbox.

Fix: align the companion doc to match the spec's model. Update the Dockerfile section to show openshell-sandbox wrapping the Claude CLI invocation (from bridge.py), not the uvicorn entrypoint. Update _grpc_client.py entry in companion doc's Files to Modify to "No changes needed — gRPC runs in runner process, outside Claude subprocess sandbox boundary."

Spec itself is correct and internally consistent — approving the spec. Companion doc should be fixed before this moves out of draft.

Round 2: LGTM on the spec. Companion doc fix is pre-implementation, not pre-merge.

markturansky

amber review — Round 3 LGTM

Round 2 finding addressed. The companion doc (openshell-runner-adaptation.md) now correctly describes sandboxing only the Claude subprocess:

Dockerfile keeps uvicorn entrypoint unchanged; openshell-sandbox is invoked by bridge.py when launching the Claude CLI, not wrapped around the runner itself
Explicit prose added: "The runner itself (FastAPI/uvicorn, gRPC client, credential fetching) runs outside the sandbox boundary"
_grpc_client.py row in Files to Modify updated to "No changes needed — gRPC runs in runner process, outside Claude subprocess sandbox boundary"
Step 5 is now a standalone "No changes needed" section with a clear rationale

Spec and companion docs are now consistent. No further findings.

Approved. Ready to undraft when the author is satisfied with the OpenShell desired-state section.

jsell-rh

Nice analysis — the security model breakdown and strategy comparison are thorough. A few things we learned integrating OpenShell in the ADP project that are worth flagging:

1. inference.local doesn't work outside OpenShell-managed sandboxes. The placeholder/proxy pattern (openshell:resolve:env:*) relies on the Supervisor owning the network namespace — it creates a veth pair, routes all traffic through 10.200.0.1:3128, and rewrites placeholders at the HTTP layer. If the runner pod keeps its default network namespace (no unshare(CLONE_NEWNET)), the proxy won't intercept requests and placeholders will be sent as-is to upstream APIs. Strategy 1's implementation steps should account for the Supervisor needing NET_ADMIN capability to set up the netns before the agent process starts.

2. OpenShell's provider management API is gRPC-only, not REST. The Operator integration (Step 7) will need a gRPC client to call CreateProvider, SetClusterInference, etc. on the Gateway. There are no REST equivalents — the Gateway multiplexes gRPC and HTTP on port 8080, but provider/inference management is exclusively via the openshell.v1.OpenShell gRPC service. The proto definitions are at proto/openshell.proto and proto/inference.proto in the upstream repo.

4. The Supervisor needs to control process lifecycle, not just run as a sidecar. The Supervisor expects to be the process that fork/execs the agent — it applies Landlock, seccomp, and privilege drop in the pre_exec closure (after fork, before exec). Running it as a separate sidecar container in the same pod doesn't give it control over the agent's process setup. For standard K8s pods, the Supervisor binary needs to be the entrypoint that wraps the Claude CLI invocation (as the PR's Step 3 suggests with the Dockerfile change), not a parallel container. The bridge.py modification (Step 6) is the right place to invoke it.

5. The gh wrapper and git credential helper pattern works at the HTTP layer only. The proxy can rewrite Authorization: Bearer openshell:resolve:env:GITHUB_TOKEN in HTTP requests, but it can't intercept credential usage in non-HTTP contexts. The git credential helper works because git ultimately makes HTTPS requests that pass through the proxy. But if any tool reads the placeholder token and uses it for SSH auth or embeds it in a non-HTTP protocol, the placeholder string will be sent verbatim. Worth documenting this limitation — the current install_gh_wrapper() and install_git_credential_helper() patterns should work fine since they're HTTP-based, but future credential types (SSH keys, kubeconfig client certs) would need a different approach.

- Fix source layout: add model.py, observability files, fixtures/, remove duplicate workspace.py - Document AGUI_TOKEN session auth middleware and SDK_OPTIONS env var - Document runtime model switching via POST /model - Add 'Desired State: OpenShell Credential Isolation' section with migration path 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

markturansky

Amber Review — PR #1643: runner spec sync + OpenShell desired state

Verified spec claims against actual code in components/backend/runner/. 7 of 9 sync claims check out; 2 are factually wrong and need correction before merge. The OpenShell analysis docs are high quality with one naming inconsistency to fix.

Issues requiring fixes before merge

F1 — `workspace.py` still exists and must stay in the spec layout

The spec removes workspace.py from the file layout with the note "duplicate entry." It is not a duplicate — it is a real, active module at components/backend/runner/workspace.py. It exports setup_workflow_paths() and resolve_workspace_paths() which are used by the runner. Removing it makes the spec inaccurate. Either:

Keep the entry in the layout (correct fix), or
Explain why it will be removed and mark it 🔲 planned if removal is genuinely intended

F2 — Wrong path for the model endpoint

The spec adds routers/model.py to the file layout. That path does not exist. The actual file is at components/backend/runner/endpoints/model.py. Please correct the path.

Non-blocking

N1 — "Sidecar" misnomer in strategy comparison table (`openshell-runner-adaptation.md`)

The comparison table labels Strategy 1 as "Sidecar", but the document body explicitly states the Supervisor is not a sidecar container. The correct label is "Supervisor" (or "Supervisor Process"). The mismatch will confuse readers who skim the table first.

N2 — Isolation layer 2 label in `runner.spec.md` is imprecise

The OpenShell section lists layer 2 as "Credential proxy". The body and the security analysis describe this layer as process isolation / pre-exec credential stripping — not a proxy. The credential proxy (placeholder rewriting) is a separate mechanism. Consider labeling it "Process isolation (pre-exec credential strip)" to match the five-layer model in the security analysis.

What's verified ✅

Claim	Status
`observability_config.py` exists	✅
`observability_privacy.py` exists	✅
`mlflow_observability.py` exists	✅
`operational_events.py` exists	✅
`fixtures/` directory exists (4 JSONL files)	✅
AGUI_TOKEN uses `secrets.compare_digest()`	✅ (app.py:257–275)
`SDK_OPTIONS` env var referenced in bridge.py	✅
Living-doc `> Status: Proposed` on OpenShell section	✅ correct convention
`populate_runtime_credentials()` writes real tokens to `os.environ`	✅ (auth.py:556–593)

The OpenShell security analysis is thorough and architecturally sound. The five isolation layers are accurately described, the AGUI_TOKEN middleware implementation matches the spec, and the BPF/Landlock threat model is well-reasoned.

Fix F1 and F2 before merging; N1 and N2 are recommended but non-blocking.

— Amber

markturansky

Amber Re-Review — PR #1643

Both required fixes are confirmed:

F1 resolved ✅ — workspace.py is back in the spec layout under platform/ with the correct annotation
F2 resolved ✅ — model endpoint correctly listed as endpoints/model.py, not routers/model.py

The two non-blocking items (N1: "Sidecar" label in the strategy comparison table; N2: "Credential proxy" label for isolation layer 2) are unchanged. These were flagged as nice-to-have and do not block merge.

Approved. Ready to merge.

— Amber

…hell sandbox integration Update specs and docs to reflect the actual implemented state of the OpenShell sandbox integration, replacing the original "desired state" proposal with detailed implementation records including: - Runner spec: replace proposed desired state with implemented architecture, verified isolation layers, required capabilities (7, not 1), policy format, CP integration, known limitations, and design decisions - New security spec (openshell-sandbox.spec.md): formal RFC 2119 requirements for sandbox activation, network namespace isolation, TLS proxy, Landlock filesystem restrictions, privilege drop, seccomp-BPF, and ConfigMap propagation - Adaptation doc: rewrite from proposal to implementation record with full debugging journey (7-error progression, EINVAL misdiagnosis, ptrace tracing), verified results, OpenShift SCC reference, and future work phases - Security analysis: add implementation status, integration point status table, and lessons learned (file mode, 7 caps, Landlock ABI compat) - Bookmarks: add OpenShell sandbox spec entry 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

…dboxing Add NVIDIA OpenShell Supervisor (v0.0.56, file mode) to the runner image, wrapping the Claude Code subprocess in five isolation layers: network namespace, TLS L7 proxy, Landlock filesystem sandbox, seccomp-BPF, and privilege drop to unprivileged sandbox user. Dockerfile changes: - Pin openshell-sandbox v0.0.56 from ghcr.io/nvidia/openshell/supervisor - Add iproute package for network namespace management (ip netns) - Create sandbox user/group for privilege drop target - Pre-create /workspace owned by sandbox, /var/run/netns for mount points - Symlink bundled Claude CLI to /usr/local/bin/claude for stable policy path - Set /home/sandbox permissions to 755 New files: - openshell-claude-wrapper.sh: dispatches to supervisor or direct claude based on OPENSHELL_ENABLED env var - .openshell-ref/policy.rego: official OPA Rego from OpenShell repository - .openshell-ref/policy.yaml: filesystem, network, process policy data with endpoint ACLs for Anthropic, Vertex AI, GitHub, GitLab, npm, PyPI bridge.py: 1-line change sets cli_path to wrapper when OPENSHELL_ENABLED=true 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

…iler Add conditional OpenShell sandbox support to the CP reconciler, activated by OPENSHELL_ENABLED=true environment variable. Reconciler changes (kube_reconciler.go): - buildRunnerSecurityContext: grant 7 capabilities (NET_ADMIN, SYS_ADMIN, SYS_PTRACE, SETUID, SETGID, CHOWN, DAC_OVERRIDE), allowPrivilegeEscalation, runAsUser:0 when OpenShell enabled - ensurePod: set pod-level seccompProfile to Unconfined - buildVolumes/buildVolumeMounts: mount openshell-policy ConfigMap at /etc/openshell - buildEnv: inject OPENSHELL_ENABLED, OPENSHELL_POLICY_RULES, OPENSHELL_POLICY_DATA - ensureOpenShellPolicy: propagate policy ConfigMap from CP namespace to runner namespace Config changes (config.go): - OpenShellEnabled (from OPENSHELL_ENABLED env var) - OpenShellPolicyName (from OPENSHELL_POLICY_CONFIGMAP, default: openshell-policy) KubeClient changes (kubeclient.go): - Add ConfigMapGVR, GetConfigMap, CreateConfigMap methods 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

markturansky · 2026-06-05T00:28:43Z

OpenShell Sandbox Validation Report

Summary

The OpenShell sandbox integration has been validated end-to-end on ROSA OpenShift (vteam-uat cluster) with runner image quay.io/ambient_code/vteam_claude_runner:openshell-v5. All sessions complete successfully through the sandbox's captive portal proxy.

Root Cause Fixed

The SYN-SENT networking issue that blocked sandbox connectivity across sessions 6-12 has been resolved. Root cause: when a runner pod is terminated, the OpenShell supervisor's Drop cleanup for NetworkNamespace doesn't fire, leaving stale /var/run/netns/sandbox-* files and veth interfaces. On the next session, the supervisor creates a new sandbox namespace, but the old veth's 10.200.0.0/24 route persists, creating duplicate routes that prevent the proxy from receiving connections.

Fix (openshell-claude-wrapper.sh):

Cleans stale sandbox-* netns and their veth interfaces before launching the supervisor
Remounts /proc/sys rw and disables rp_filter on default/all so ARP resolves on dynamically-created veth interfaces

Test Sessions

< /dev/null | Session | Type | Result |
|---------|------|--------|
| 3Eh11nFWuYEDHqEYmvzhDVVOQUV | Simple (file create/read) | ✅ Success |
| 3Eh1db8pBEn1jU08vw4TDQFOEee | Complex (write Python script, execute, create summary — multi-tool-use) | ✅ Success |
| 3Eh7TDfRJAOSBXI9gLn0UOLcKlq | User-initiated (4 tool_use round-trips, 1274-byte response) | ✅ Success |

Verification Evidence

Sandbox state (inside pod):

Single sandbox namespace: sandbox-8f1e7b4a (no duplicates)
Single route: 10.200.0.0/24 dev veth-h-8f1e7b4a
rp_filter=0 on all, default, and veth interface

Proxy connectivity:

OCSF HTTP:POST [INFO] ALLOWED POST http://us-east5-aiplatform.googleapis.com:443/v1/projects/ambient-code-platform/locations/us-east5/publishers/anthropic/models/claude-sonnet-4-6@default:streamRawPredict
relay_response framing / relay_chunked complete — proxy relaying responses

Session lifecycle:

ResultMessage (subtype=success) — tasks completed
Turn complete — clean shutdown
Cleared credentials — proper cleanup
Multiple tool_use/tool_result pairs — Bash tool execution works inside sandbox

What Changed

File	Change
`openshell-claude-wrapper.sh`	Stale sandbox netns cleanup + rp_filter disable before supervisor launch
`kube_reconciler.go`	Pass `OPENSHELL_LOG_LEVEL` env var to runner pods
`.openshell-ref/policy.yaml`	Allow bundled Claude CLI path and `node-22` binary

How to Reproduce

# Set up
export API_HOST=$(oc get route ambient-api-server -n ambient-openshell -o jsonpath='{.spec.host}')
export API_TOKEN=$(oc get secret ambient-control-plane-token -n ambient-openshell -o jsonpath='{.data.token}' | base64 -d)

# Create session
curl -sk -X POST "https://${API_HOST}/api/ambient/v1/sessions" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"name":"test","prompt":"Create /tmp/hello.txt with hello world, then read it back.","project_id":"openshell-test"}'

# Watch pod
oc get pods -n openshell-test -w

# Check logs — look for ALLOWED, ResultMessage (subtype=success), Turn complete
oc logs <pod-name> -n openshell-test

github-actions · 2026-06-05T12:45:17Z

⚠️ SDD Preflight — Managed Paths Modified

This PR modifies files in SDD-managed component(s). These components are migrating to Spec-Driven Development.

File	Component	Mode
`components/runners/ambient-runner/Dockerfile`	runner	warn
`components/runners/ambient-runner/openshell-claude-wrapper.sh`	runner	warn

No action required — these components are in warn mode. Consider using the component's agent workflow for future changes.

📖 Specs: Runner Spec · Runner Constitution

markturansky commented Jun 3, 2026

View reviewed changes

markturansky force-pushed the spec/runner-openshell-desired-state branch 2 times, most recently from fc5c3e2 to 90ccbcd Compare June 3, 2026 21:05

markturansky commented Jun 3, 2026

View reviewed changes

markturansky force-pushed the spec/runner-openshell-desired-state branch from 90ccbcd to f13261a Compare June 3, 2026 21:23

markturansky commented Jun 3, 2026

View reviewed changes

jsell-rh reviewed Jun 3, 2026

View reviewed changes

markturansky force-pushed the spec/runner-openshell-desired-state branch from f13261a to 597cc4c Compare June 3, 2026 22:02

markturansky commented Jun 4, 2026

View reviewed changes

user and others added 3 commits June 4, 2026 15:51

Merge branch 'main' into spec/runner-openshell-desired-state

c71770d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spec(runner): sync spec with code drift, add OpenShell desired state#1643

spec(runner): sync spec with code drift, add OpenShell desired state#1643
markturansky wants to merge 5 commits into
mainfrom
spec/runner-openshell-desired-state

markturansky commented Jun 3, 2026

Uh oh!

netlify Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

Review skipped

Uh oh!

markturansky left a comment

Uh oh!

markturansky left a comment

Uh oh!

markturansky left a comment

Uh oh!

jsell-rh left a comment

Uh oh!

markturansky left a comment

Uh oh!

markturansky left a comment

Uh oh!

markturansky commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

markturansky commented Jun 3, 2026

Summary

Test plan

Uh oh!

netlify Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

👷 Deploy Preview for cheerful-kitten-f556a0 processing.

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

markturansky left a comment

Choose a reason for hiding this comment

Required — Must Fix

Notable Gap

Architectural Note (non-blocking)

Uh oh!

markturansky left a comment

Choose a reason for hiding this comment

Notable Gap — Sandboxing scope inconsistent between spec and companion doc

Uh oh!

markturansky left a comment

Choose a reason for hiding this comment

Uh oh!

jsell-rh left a comment

Choose a reason for hiding this comment

Uh oh!

markturansky left a comment

Choose a reason for hiding this comment

Amber Review — PR #1643: runner spec sync + OpenShell desired state

Issues requiring fixes before merge

F1 — workspace.py still exists and must stay in the spec layout

F2 — Wrong path for the model endpoint

Non-blocking

N1 — "Sidecar" misnomer in strategy comparison table (openshell-runner-adaptation.md)

N2 — Isolation layer 2 label in runner.spec.md is imprecise

What's verified ✅

Uh oh!

markturansky left a comment

Choose a reason for hiding this comment

Amber Re-Review — PR #1643

Uh oh!

markturansky commented Jun 5, 2026

OpenShell Sandbox Validation Report

Summary

Root Cause Fixed

Test Sessions

Verification Evidence

What Changed

How to Reproduce

Uh oh!

github-actions Bot commented Jun 5, 2026

⚠️ SDD Preflight — Managed Paths Modified

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

netlify Bot commented Jun 3, 2026 •

edited

Loading

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

F1 — `workspace.py` still exists and must stay in the spec layout

N1 — "Sidecar" misnomer in strategy comparison table (`openshell-runner-adaptation.md`)

N2 — Isolation layer 2 label in `runner.spec.md` is imprecise