prevent startup deadlock when watching many CRDs #290

Open
RezaMash wants to merge 5 commits into salesforce:master from RezaMash:master

Conversation

@RezaMash

On clusters with a large number of CRDs (e.g. 500), sloop would
hang during startup and never bind its webserver on :8080. Liveness and
startup probes against /healthz then got "connection refused" forever,
putting the pod into a permanent CrashLoopBackOff.

Root cause was a lock-ordering deadlock during the initial informer sync:

  • writeToOutChan() sent to the bounded kubeWatchChan (buffer 1000) while
    holding i.protection. The author had already flagged this line as
    dangerous.
  • The single processing goroutine drains that channel doing several
    synchronous, fsync'd (SyncWrites=true) BadgerDB transactions per event,
    so it could not keep up with the initial-sync burst from the CRD
    informers plus the core resources.
  • Once the channel filled, informer handlers blocked on the send while
    holding i.protection. Meanwhile the main goroutine was still in
    startCustomInformers() iterating over every CRD, and each informer
    setup also needs i.protection. It blocked, so NewKubeWatcherSource()
    never returned and webserver.Run() was never reached.
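
The stall described above can be reproduced in miniature. This is a minimal sketch, not sloop's code: the `watcher` type, the buffer size of 2, and `stallDemo` are hypothetical, and only the field names `protection` and `outchan` echo the description. It shows that a send on a full bounded channel, made while holding a mutex, keeps that mutex unavailable to any setup code until the consumer drains an event.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a hypothetical stand-in for the informer wrapper; only the
// names protection and outchan echo the description above.
type watcher struct {
	protection sync.Mutex
	outchan    chan string
}

// stallDemo fills a small bounded channel, then starts a handler that sends
// while holding the mutex. It reports whether setup code could take the mutex
// during the stall, and again after the consumer drained one event.
func stallDemo() (duringStall, afterDrain bool) {
	w := &watcher{outchan: make(chan string, 2)}
	w.outchan <- "a"
	w.outchan <- "b" // buffer is now full

	locked := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		w.protection.Lock()
		close(locked)    // signal that the handler owns the mutex
		w.outchan <- "c" // blocks with the mutex held until a slot frees up
		w.protection.Unlock()
	}()

	<-locked
	// Setup code (the analogue of per-CRD informer setup) cannot get the lock.
	duringStall = w.protection.TryLock()
	if duringStall {
		w.protection.Unlock()
	}

	<-w.outchan // the consumer finally drains one event
	<-done      // the handler's send completes and the mutex is released
	afterDrain = w.protection.TryLock()
	if afterDrain {
		w.protection.Unlock()
	}
	return duringStall, afterDrain
}

func main() {
	during, after := stallDemo()
	fmt.Println("lock available during stall:", during) // false
	fmt.Println("lock available after drain:", after)   // true
}
```

In the sketch the stall ends because the test drains the channel itself. In the scenario described above the consumer is much slower than the producers, so the main goroutine stays blocked on the lock and the web server is never started.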

Two changes:

  1. ingress/kubewatcher.go: do not hold i.protection across the channel
    send. Take the lock only to read stopped, release it, then send via a
    select on i.outchan / i.stopChan. The send no longer blocks the lock,
    and it unblocks promptly on shutdown instead of blocking forever on a
    full channel. This removes the deadlock.

  2. server/server.go: construct the kube watcher in a background goroutine
    so the main goroutine reaches webserver.Run() and binds /healthz
    immediately, regardless of how long the initial CRD sync takes. This is
    defense in depth: even a merely slow (not deadlocked) watcher no longer
    causes probes to kill the pod during startup. Shutdown reads
    kubeWatcherSource under a mutex since it is now set from a goroutine.
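
A minimal sketch of the first change, assuming a simplified `watcher` type: the struct layout, the string event type, and the `stop` helper are hypothetical, and only the names `protection`, `stopped`, `outchan`, `stopChan`, and `writeToOutChan` come from the description. The lock is held only long enough to read `stopped`, and the send is a `select` that also watches `stopChan`.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a simplified, hypothetical model of the informer wrapper.
type watcher struct {
	protection sync.Mutex
	stopped    bool
	outchan    chan string
	stopChan   chan struct{}
}

// writeToOutChan reads stopped under the lock, releases the lock, and only
// then sends. It returns false if the watcher stopped before the event could
// be delivered.
func (w *watcher) writeToOutChan(ev string) bool {
	w.protection.Lock()
	stopped := w.stopped
	w.protection.Unlock()
	if stopped {
		return false
	}
	select {
	case w.outchan <- ev:
		return true
	case <-w.stopChan:
		return false
	}
}

// stop marks the watcher as stopped and wakes any sender blocked in select.
func (w *watcher) stop() {
	w.protection.Lock()
	defer w.protection.Unlock()
	if !w.stopped {
		w.stopped = true
		close(w.stopChan)
	}
}

// stopUnblocksSend fills the channel, leaves a second send pending, and then
// stops the watcher. It returns whether the pending event was delivered.
func stopUnblocksSend() bool {
	w := &watcher{outchan: make(chan string, 1), stopChan: make(chan struct{})}
	w.writeToOutChan("a") // fills the one-slot buffer

	result := make(chan bool)
	go func() { result <- w.writeToOutChan("b") }()

	// stop() needs the same mutex; it can take it because no sender holds
	// the lock across the channel send.
	w.stop()
	return <-result
}

func main() {
	fmt.Println("pending event delivered:", stopUnblocksSend()) // false
}
```

Because nothing drains the channel in this sketch, the pending send can only finish through the `stopChan` case, which is the shutdown behaviour the change is meant to provide.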

With these changes the pod becomes Ready in seconds with watchCrds=true,
and CRD-backed resources are recorded.
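
The second change can be sketched in the same spirit. Everything here is an assumption made for illustration: the `server` and `kubeWatcher` types, `startWatcher`, `shutdown`, and the `healthz` handler are hypothetical, and the handler is called through `net/http/httptest` instead of binding :8080. The point is only the shape: construction runs in a goroutine, the result is published under a mutex, and the health endpoint answers while construction is still in progress.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// kubeWatcher is a hypothetical placeholder for the real watcher source.
type kubeWatcher struct{ stopped bool }

func (k *kubeWatcher) Stop() { k.stopped = true }

// server holds the watcher behind a mutex because a goroutine sets it.
type server struct {
	mu                sync.Mutex
	kubeWatcherSource *kubeWatcher
}

// startWatcher runs build in the background and publishes its result under
// the mutex. The returned channel is closed once the watcher is published.
func (s *server) startWatcher(build func() *kubeWatcher) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		kw := build()
		s.mu.Lock()
		s.kubeWatcherSource = kw
		s.mu.Unlock()
	}()
	return done
}

// shutdown reads the watcher under the mutex and stops it if it exists.
// It reports whether there was a watcher to stop.
func (s *server) shutdown() bool {
	s.mu.Lock()
	kw := s.kubeWatcherSource
	s.mu.Unlock()
	if kw == nil {
		return false
	}
	kw.Stop()
	return true
}

func healthz(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// probeDuringSlowSync simulates a slow initial sync and probes /healthz
// before the watcher has finished constructing.
func probeDuringSlowSync() (status int, stoppedEarly, stoppedLate bool) {
	s := &server{}
	release := make(chan struct{})
	done := s.startWatcher(func() *kubeWatcher {
		<-release // stands in for a long initial CRD sync
		return &kubeWatcher{}
	})

	rec := httptest.NewRecorder()
	healthz(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	status = rec.Code
	stoppedEarly = s.shutdown() // no watcher published yet

	close(release)
	<-done
	stoppedLate = s.shutdown()
	return status, stoppedEarly, stoppedLate
}

func main() {
	status, early, late := probeDuringSlowSync()
	fmt.Println(status, early, late) // 200 false true
}
```

The sketch does not cover a shutdown that arrives while construction is still running; the description above does not say how the PR handles that case.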

Verified: go build ./pkg/..., go test ./pkg/sloop/ingress/...
./pkg/sloop/server/..., and go test -race ./pkg/sloop/ingress/... all pass.

RezaMash added 2 commits May 19, 2026 20:22
On clusters with a large number of CRDs (e.g. ~477 on GDCH), sloop would
hang during startup and never bind its webserver on :8080. Liveness and
startup probes against /healthz then got "connection refused" forever,
putting the pod into a permanent CrashLoopBackOff.

Root cause was a lock-ordering deadlock during the initial informer sync:

  - writeToOutChan() sent to the bounded kubeWatchChan (buffer 1000) while
    holding i.protection. The author had already flagged this line as
    dangerous.
  - The single processing goroutine drains that channel doing several
    synchronous, fsync'd (SyncWrites=true) BadgerDB transactions per event,
    so it could not keep up with the initial-sync burst from the CRD
    informers plus the core resources.
  - Once the channel filled, informer handlers blocked on the send while
    holding i.protection. Meanwhile the main goroutine was still in
    startCustomInformers() iterating over every CRD, and each informer
    setup also needs i.protection. It blocked, so NewKubeWatcherSource()
    never returned and webserver.Run() was never reached.

Two changes:

1. ingress/kubewatcher.go: do not hold i.protection across the channel
   send. Take the lock only to read `stopped`, release it, then send via a
   select on i.outchan / i.stopChan. The send no longer blocks the lock,
   and it unblocks promptly on shutdown instead of blocking forever on a
   full channel. This removes the deadlock.

2. server/server.go: construct the kube watcher in a background goroutine
   so the main goroutine reaches webserver.Run() and binds /healthz
   immediately, regardless of how long the initial CRD sync takes. This is
   defense in depth: even a merely slow (not deadlocked) watcher no longer
   causes probes to kill the pod during startup. Shutdown reads
   kubeWatcherSource under a mutex since it is now set from a goroutine.

With these changes the pod becomes Ready in seconds with watchCrds=true,
and CRD-backed resources (e.g. *.dbadmin.gdc.goog Instances) are recorded.

Verified: go build ./pkg/..., go test ./pkg/sloop/ingress/...
./pkg/sloop/server/..., and go test -race ./pkg/sloop/ingress/... all pass.
@salesforce-cla

Thanks for the contribution! Before we can merge this, we need @RezaMash to sign the Salesforce Inc. Contributor License Agreement.

@RezaMash

Author

I signed the CLA.

Add a GetResDescribe query that renders stored payloads via
k8s.io/kubectl describers backed by a fake clientset seeded with the
payload and its Events, surfacing fields like Image ID that the raw
JSON views bury. Kinds without constructible describers and CRDs fall
back to a generic field-tree rendering. Adds a Describe pane to the
resource detail page.
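
The generic fallback mentioned above could look roughly like the following. This is a sketch using only the standard library, not the PR's implementation: `renderFieldTree` and its helpers are hypothetical names, and the output layout (sorted keys, two-space indentation, `[i]` for list items) is an assumption. The kubectl-describer path is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// renderFieldTree decodes a stored JSON payload and renders it as an
// indented tree of fields, with map keys in sorted order.
func renderFieldTree(payload []byte) (string, error) {
	var root interface{}
	if err := json.Unmarshal(payload, &root); err != nil {
		return "", err
	}
	var b strings.Builder
	writeNode(&b, root, 0)
	return b.String(), nil
}

// writeNode renders the children of a map or list, or a bare scalar.
func writeNode(b *strings.Builder, v interface{}, depth int) {
	indent := strings.Repeat("  ", depth)
	switch t := v.(type) {
	case map[string]interface{}:
		keys := make([]string, 0, len(t))
		for k := range t {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		for _, k := range keys {
			writeField(b, indent, k, t[k], depth)
		}
	case []interface{}:
		for i, item := range t {
			writeField(b, indent, fmt.Sprintf("[%d]", i), item, depth)
		}
	default:
		fmt.Fprintf(b, "%s%v\n", indent, t)
	}
}

// writeField prints scalars inline and recurses into nested containers.
func writeField(b *strings.Builder, indent, name string, v interface{}, depth int) {
	switch v.(type) {
	case map[string]interface{}, []interface{}:
		fmt.Fprintf(b, "%s%s:\n", indent, name)
		writeNode(b, v, depth+1)
	default:
		fmt.Fprintf(b, "%s%s: %v\n", indent, name, v)
	}
}

func main() {
	// Made-up payload for illustration only.
	out, err := renderFieldTree([]byte(`{"kind":"Instance","spec":{"replicas":3,"images":["a","b"]}}`))
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```
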
@RezaMash

This comment was marked as outdated.

RezaMash added 2 commits June 11, 2026 18:28
Return the effective query window in view_options and use it for the
timeline axis and end-time display. Default to an empty end_time so the
server anchors to the newest data (the backup time when browsing a
restore); replace the Now button with Latest.

getEndOfTime now uses the newest resource-summary lastSeen instead of
the hour-aligned partition end, removing the dead zone at the end of
restored-backup views. The Latest button now fits its label and
resubmits immediately instead of leaving the end-time box empty.