
docs(rfc): add sandbox resource requirements proposal #1360

Open
elezar wants to merge 1 commit into NVIDIA:main from elezar:1338-sandbox-resource-requirements/elezar

Conversation

@elezar (Member) commented May 13, 2026

Summary

Add RFC 0004 proposing a typed sandbox resource requirements model for CPU, memory, GPUs, and future resource domains. The RFC separates portable resource requirements from driver/platform-specific configuration and realization, and includes concrete realization examples for Kubernetes, Docker, Podman, and VM drivers.

Related Issue

Related to #1338 and #1340.

Changes

  • Adds rfc/0004-sandbox-resource-requirements/README.md.
  • Proposes SandboxSpec.resource_requirements with compute, device, dataset, and extension domains.
  • Reserves JSON-formatted CLI input for --driver-config-json, mapped to SandboxTemplate.driver_config.
  • Explicitly avoids exposing JSON-formatted portable resource request flags.
  • Documents how CPU/memory and GPU requests map to Kubernetes resources, CDI device injection, and VM device assignment.
  • Captures conflict handling between portable resource requirements and SandboxTemplate.resources passthrough.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated (not applicable; RFC-only change)
  • E2E tests added/updated (not applicable; RFC-only change)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

@elezar requested review from a team, derekwaynecarr, maxamillion, and mrunalp as code owners May 13, 2026 14:54
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar force-pushed the 1338-sandbox-resource-requirements/elezar branch from 96945aa to 1aafb69 May 13, 2026 16:29
@drew mentioned this pull request May 14, 2026
Comment on lines +97 to +100
The CLI should not expose a JSON flag for `resource_requirements`. Common
portable requests should use typed flags such as CPU, memory, and GPU-count
flags, and SDK/API callers should use the typed protobuf messages directly.
JSON-formatted CLI input is reserved for driver-specific configuration.
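Following this guidance, common portable requests would be expressed as typed flags rather than JSON. A hedged sketch of what such an invocation might look like (the flag names are illustrative assumptions, not defined by the RFC):

```shell
# Hypothetical typed flags for portable resource requests;
# exact flag names are assumptions, not taken from the RFC.
openshell sandbox create \
  --cpus 2 \
  --memory 4Gi \
  --gpus 1
```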
Collaborator

This seems right to me. NemoClaw team needs basic mem/cpu requests, so I started to implement this here, #1376.

@drew drew added the rfc label May 14, 2026
repeated GenericResourceRequirement extensions = 100;
}

message ComputeResourceRequirements {
@derekwaynecarr (Collaborator) commented May 14, 2026
I’m good with this.

We may want ephemeral storage in the future, but I'm happy to defer that for now.

I am not aware of any use case right now that would demand hugepages.

Right now, PID limiting in Kubernetes is a node-level pod setting enforced per-pod via cgroups, if we wanted to expose a `pids.max`. Ultimately our PID limiting needs to be cgroup-enforced, but I think making that settable can come later.


| Driver | Realization |
|---|---|
| Kubernetes | Populate pod container `resources.requests.cpu`, `resources.limits.cpu`, `resources.requests.memory`, and `resources.limits.memory`. |
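In Kubernetes terms, that realization amounts to populating the container's `resources` stanza. A minimal sketch, assuming a 2-CPU / 4Gi request with matching limits (values are illustrative assumptions, not taken from the RFC):

```yaml
# Pod container fragment the Kubernetes driver would populate from
# portable CPU/memory requests; values are illustrative.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
```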
Collaborator

If and when we separate the proxy into its own pod, we may need to revisit this slightly, either giving the proxy a fixed resource overhead or a separate resource configuration. We can explore that when we get there; this RFC improves on the current model either way.

@derekwaynecarr (Collaborator) left a comment

This is a good next step. I wanted to let @mrunalp take a look as well, but this is an LGTM from me.

@drew moved this from Todo to In progress in OpenShell Roadmap May 15, 2026
@drew (Collaborator) left a comment

LGTM

Comment on lines +515 to +518
```shell
openshell sandbox create \
--driver-config-json '{"kubernetes.openshell.ai":{"nodeSelector":{"accelerator":"nvidia"}}}'
```
Collaborator

This will be really useful across all our drivers.


Labels: rfc
Projects: OpenShell Roadmap (Status: In progress)
3 participants