prevent startup deadlock when watching many CRDs by RezaMash · Pull Request #290 · salesforce/sloop

RezaMash · 2026-06-10T14:51:59Z

On clusters with a large number of CRDs (e.g. 500), sloop would
hang during startup and never bind its webserver on :8080. Liveness and
startup probes against /healthz then got "connection refused" forever,
putting the pod into a permanent CrashLoopBackOff.

Root cause was a lock-ordering deadlock during the initial informer sync:

- writeToOutChan() sent to the bounded kubeWatchChan (buffer 1000) while
  holding i.protection. The author had already flagged this line as
  dangerous.
- The single processing goroutine drains that channel doing several
  synchronous, fsync'd (SyncWrites=true) BadgerDB transactions per event,
  so it could not keep up with the initial-sync burst from the CRD
  informers plus the core resources.
- Once the channel filled, informer handlers blocked on the send while
  holding i.protection. Meanwhile the main goroutine was still in
  startCustomInformers() iterating over every CRD, and each informer
  setup also needs i.protection. It blocked, so NewKubeWatcherSource()
  never returned and webserver.Run() was never reached.
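
The stall described above can be reproduced in isolation. The following is a minimal sketch, not sloop's code: the `watcher` type, `sendHoldingLock`, `lockAcquiredWithin`, and the tiny buffer size are hypothetical stand-ins chosen for illustration. It assumes only what the description states, namely that a send to a full bounded channel happens while a mutex is held, and that another goroutine needs the same mutex.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// watcher is a hypothetical stand-in for the informer wrapper in the PR.
type watcher struct {
	protection sync.Mutex
	outchan    chan string
}

// sendHoldingLock mirrors the problematic pattern: the mutex stays held
// across a channel send that blocks once the buffer is full.
// The locked channel is a demo-only hook signalling that the lock is held.
func (w *watcher) sendHoldingLock(ev string, locked chan<- struct{}) {
	w.protection.Lock()
	defer w.protection.Unlock()
	close(locked)
	w.outchan <- ev
}

// lockAcquiredWithin reports whether the mutex could be taken within d.
// It stands in for informer setup on the main goroutine needing the lock.
func (w *watcher) lockAcquiredWithin(d time.Duration) bool {
	got := make(chan struct{})
	go func() {
		w.protection.Lock()
		w.protection.Unlock()
		close(got)
	}()
	select {
	case <-got:
		return true
	case <-time.After(d):
		return false
	}
}

// demoStall fills the buffer, starts a handler that blocks on the send while
// holding the lock, and checks whether a second party can get the lock
// before and after the channel is drained.
func demoStall() (stalled, recovered bool) {
	w := &watcher{outchan: make(chan string, 2)}
	w.outchan <- "a"
	w.outchan <- "b" // buffer is now full and nothing is draining it

	locked := make(chan struct{})
	go w.sendHoldingLock("c", locked)
	<-locked // the handler now holds the lock and is about to block on the send

	stalled = !w.lockAcquiredWithin(100 * time.Millisecond)

	<-w.outchan // drain one event so the blocked send can complete
	recovered = w.lockAcquiredWithin(2 * time.Second)
	return stalled, recovered
}

func main() {
	stalled, recovered := demoStall()
	fmt.Println("setup stalled while buffer was full:", stalled)
	fmt.Println("setup proceeded after draining:", recovered)
}
```

In the real failure the consumer was too slow to drain the channel during the initial sync, so the "recovered" step never came in time for the probes.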

Two changes:

1. ingress/kubewatcher.go: do not hold i.protection across the channel
   send. Take the lock only to read `stopped`, release it, then send via a
   select on i.outchan / i.stopChan. The send no longer blocks the lock,
   and it unblocks promptly on shutdown instead of blocking forever on a
   full channel. This removes the deadlock.
2. server/server.go: construct the kube watcher in a background goroutine
   so the main goroutine reaches webserver.Run() and binds /healthz
   immediately, regardless of how long the initial CRD sync takes. This is
   defense in depth: even a merely slow (not deadlocked) watcher no longer
   causes probes to kill the pod during startup. Shutdown reads
   kubeWatcherSource under a mutex since it is now set from a goroutine.
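
The first change can be sketched roughly as follows. This is an illustration of the pattern only, not the actual diff: the `watcher` struct, the `stop` method, the boolean return value, and the buffer size of 1 are assumptions made for the example. Only the field names `protection`, `stopped`, `outchan`, and `stopChan` come from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a hypothetical reduction of the kube watcher to the fields
// named in the description.
type watcher struct {
	protection sync.Mutex
	stopped    bool
	outchan    chan string
	stopChan   chan struct{}
}

// writeToOutChan takes the lock only to read stopped, releases it, and then
// sends via select so that shutdown can interrupt a blocked send.
// It reports whether the event was delivered (an assumption for the demo).
func (w *watcher) writeToOutChan(ev string) bool {
	w.protection.Lock()
	stopped := w.stopped
	w.protection.Unlock()
	if stopped {
		return false
	}

	select {
	case w.outchan <- ev:
		return true
	case <-w.stopChan:
		return false
	}
}

// stop marks the watcher stopped and closes stopChan exactly once.
// It never waits on outchan, so it cannot be blocked by a full buffer.
func (w *watcher) stop() {
	w.protection.Lock()
	defer w.protection.Unlock()
	if !w.stopped {
		w.stopped = true
		close(w.stopChan)
	}
}

// demoShutdown fills the buffer, starts a second send that cannot be
// delivered, then stops the watcher and returns both send results.
func demoShutdown() (first, second bool) {
	w := &watcher{
		outchan:  make(chan string, 1),
		stopChan: make(chan struct{}),
	}
	first = w.writeToOutChan("first") // fits in the buffer

	result := make(chan bool)
	go func() { result <- w.writeToOutChan("second") }() // buffer is full

	w.stop() // returns promptly because the sender does not hold the lock
	return first, <-result
}

func main() {
	first, second := demoShutdown()
	fmt.Println("first delivered:", first)
	fmt.Println("second delivered:", second)
}
```

One consequence of this shape is that an event can still be sent after `stopped` was read as false and before `stop` runs; the select on `stopChan` bounds how long such a send can block rather than preventing it.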

With these changes the pod becomes Ready in seconds with watchCrds=true,
and CRD-backed resources are recorded.
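
The second change (background construction with a mutex-guarded field) might look like the sketch below. It is a simplified model under stated assumptions: the `server` and `kubeWatcher` types, `startWatcherInBackground`, and the `watcher` accessor are hypothetical, and `httptest.NewServer` stands in for sloop's webserver.Run() so the example stays self-contained.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// kubeWatcher is a hypothetical placeholder for the real watcher source.
type kubeWatcher struct{ name string }

// server holds the watcher behind a mutex because it is set from a goroutine.
type server struct {
	mu                sync.Mutex
	kubeWatcherSource *kubeWatcher
}

// startWatcherInBackground runs the (possibly slow) constructor off the main
// goroutine and publishes the result under the mutex. The returned channel
// is closed once the watcher has been stored.
func (s *server) startWatcherInBackground(build func() *kubeWatcher) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		kw := build()
		s.mu.Lock()
		s.kubeWatcherSource = kw
		s.mu.Unlock()
	}()
	return done
}

// watcher returns the current watcher, or nil if construction has not
// finished. Shutdown code would read the field through this accessor.
func (s *server) watcher() *kubeWatcher {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.kubeWatcherSource
}

// demoStartup checks that /healthz answers while the watcher is still being
// constructed, and that the watcher becomes visible afterwards.
func demoStartup() (status int, readyBefore, readyAfter bool, err error) {
	s := &server{}
	release := make(chan struct{})
	done := s.startWatcherInBackground(func() *kubeWatcher {
		<-release // stand-in for a long initial CRD sync
		return &kubeWatcher{name: "kube"}
	})

	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	srv := httptest.NewServer(mux) // the main goroutine gets here immediately
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/healthz")
	if err != nil {
		return 0, false, false, err
	}
	resp.Body.Close()
	status = resp.StatusCode
	readyBefore = s.watcher() != nil

	close(release)
	<-done
	readyAfter = s.watcher() != nil
	return status, readyBefore, readyAfter, nil
}

func main() {
	status, before, after, err := demoStartup()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("healthz status during sync:", status)
	fmt.Println("watcher set before sync finished:", before)
	fmt.Println("watcher set after sync finished:", after)
}
```

Any code that uses the watcher, including shutdown, has to tolerate a nil result from the accessor, since construction may not have finished yet.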

Verified: go build ./pkg/..., go test ./pkg/sloop/ingress/...
./pkg/sloop/server/..., and go test -race ./pkg/sloop/ingress/... all pass.

…eta1 to policy/v1

On clusters with a large number of CRDs (e.g. ~477 on GDCH), sloop would hang during startup and never bind its webserver on :8080. Liveness and startup probes against /healthz then got "connection refused" forever, putting the pod into a permanent CrashLoopBackOff.

Root cause was a lock-ordering deadlock during the initial informer sync:

- writeToOutChan() sent to the bounded kubeWatchChan (buffer 1000) while holding i.protection. The author had already flagged this line as dangerous.
- The single processing goroutine drains that channel doing several synchronous, fsync'd (SyncWrites=true) BadgerDB transactions per event, so it could not keep up with the initial-sync burst from the CRD informers plus the core resources.
- Once the channel filled, informer handlers blocked on the send while holding i.protection. Meanwhile the main goroutine was still in startCustomInformers() iterating over every CRD, and each informer setup also needs i.protection. It blocked, so NewKubeWatcherSource() never returned and webserver.Run() was never reached.

Two changes:

1. ingress/kubewatcher.go: do not hold i.protection across the channel send. Take the lock only to read `stopped`, release it, then send via a select on i.outchan / i.stopChan. The send no longer blocks the lock, and it unblocks promptly on shutdown instead of blocking forever on a full channel. This removes the deadlock.
2. server/server.go: construct the kube watcher in a background goroutine so the main goroutine reaches webserver.Run() and binds /healthz immediately, regardless of how long the initial CRD sync takes. This is defense in depth: even a merely slow (not deadlocked) watcher no longer causes probes to kill the pod during startup. Shutdown reads kubeWatcherSource under a mutex since it is now set from a goroutine.

With these changes the pod becomes Ready in seconds with watchCrds=true, and CRD-backed resources (e.g. *.dbadmin.gdc.goog Instances) are recorded.

Verified: go build ./pkg/..., go test ./pkg/sloop/ingress/... ./pkg/sloop/server/..., and go test -race ./pkg/sloop/ingress/... all pass.

salesforce-cla · 2026-06-10T14:52:06Z

Thanks for the contribution! Before we can merge this, we need @RezaMash to sign the Salesforce Inc. Contributor License Agreement.

RezaMash · 2026-06-10T15:08:50Z

I signed the CLA.

Add a GetResDescribe query that renders stored payloads via k8s.io/kubectl describers backed by a fake clientset seeded with the payload and its Events, surfacing fields like Image ID that the raw JSON views bury. Kinds without constructible describers and CRDs fall back to a generic field-tree rendering. Adds a Describe pane to the resource detail page.

Return the effective query window in view_options and use it for the timeline axis and end-time display. Default to an empty end_time so the server anchors to the newest data (the backup time when browsing a restore); replace the Now button with Latest.

getEndOfTime now uses the newest resource-summary lastSeen instead of the hour-aligned partition end, removing the dead zone at the end of restored-backup views. The Latest button now fits its label and resubmits immediately instead of leaving the end-time box empty.

RezaMash added 2 commits May 19, 2026 20:22

chore(ingress): migrate PodDisruptionBudgets informer from policy/v1b…

6dbcfb0

…eta1 to policy/v1

salesforce-cla Bot added the cla:missing label Jun 10, 2026


salesforce-cla Bot added cla:signed and removed cla:missing labels Jun 11, 2026

RezaMash added 2 commits June 11, 2026 18:28

RezaMash wants to merge 5 commits into salesforce:master from RezaMash:master
