Add retry logic to updateRecoveryWindow to handle concurrent ObjectStore status updates #758

@gabrielmouallem

Problem

When running scheduled backups with retention policies, we observe transient errors:

{"level":"error","msg":"Error while updating the recovery window in the ObjectStore status stanza. Skipping.","error":"Operation cannot be fulfilled on objectstores.barmancloud.cnpg.io \"cluster-name-backup\": the object has been modified; please apply your changes to the latest version and try again"}

{"level":"error","msg":"Retention policy enforcement failed","error":"Operation cannot be fulfilled on objectstores.barmancloud.cnpg.io \"cluster-name-backup\": the object has been modified; please apply your changes to the latest version and try again"}

Root Cause Analysis

After investigating the plugin source code, we identified that the updateRecoveryWindow function in internal/cnpgi/instance/recovery_window.go performs a direct status update without retry logic:

// recovery_window.go:40
func updateRecoveryWindow(...) error {
    // ... builds status ...
    return c.Status().Update(ctx, objectStore)  // No retry on conflict
}

This function is called from two places that can run concurrently:

  1. backup.go:169 - After a backup completes successfully
  2. retention.go:66 - During periodic retention policy enforcement (default every 5 minutes)

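When both operations run close together, each works from a copy of the ObjectStore read at the same resourceVersion. The first status update succeeds and bumps the resourceVersion; Kubernetes optimistic concurrency control then rejects the second update with a conflict error because the resourceVersion it carries is stale.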

Evidence

The same file already has a function that correctly handles this scenario:

// recovery_window.go:65 - setLastFailedBackupTime
func setLastFailedBackupTime(...) error {
    return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
        var objectStore barmancloudv1.ObjectStore
        if err := c.Get(ctx, objectStoreKey, &objectStore); err != nil {
            return err
        }
        // ... update status ...
        return c.Status().Update(ctx, &objectStore)
    })
}

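The setLastFailedBackupTime function wraps its read-modify-write in retry.RetryOnConflict, so each attempt:

  1. Gets a fresh copy of the resource (with the current resourceVersion) before updating
  2. Is retried with exponential backoff if the update still fails with a conflict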

Impact

  • Severity: Low - backups complete successfully, status eventually updates
  • User experience: Confusing error messages in logs
  • Frequency: Depends on backup/retention timing overlap (we see ~2 errors per 24h)

Proposed Fix

Apply the same retry pattern to updateRecoveryWindow:

func updateRecoveryWindow(
    ctx context.Context,
    c client.Client,
    backupList *catalog.Catalog,
    objectStore *barmancloudv1.ObjectStore,
    serverName string,
) error {
    return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
        // Re-read the ObjectStore on every attempt so the update carries
        // the latest resourceVersion
        var freshObjectStore barmancloudv1.ObjectStore
        if err := c.Get(ctx, client.ObjectKeyFromObject(objectStore), &freshObjectStore); err != nil {
            return err
        }

        // Build recovery window
        convertTime := func(t *time.Time) *metav1.Time {
            if t == nil {
                return nil
            }
            return ptr.To(metav1.NewTime(*t))
        }

        // Make sure the map exists before writing to it
        if freshObjectStore.Status.ServerRecoveryWindow == nil {
            freshObjectStore.Status.ServerRecoveryWindow = make(map[string]barmancloudv1.RecoveryWindow)
        }

        // A missing key yields a zero-value RecoveryWindow, so any fields
        // already stored for this server are preserved
        recoveryWindow := freshObjectStore.Status.ServerRecoveryWindow[serverName]
        recoveryWindow.FirstRecoverabilityPoint = convertTime(backupList.GetFirstRecoverabilityPoint())
        recoveryWindow.LastSuccessfulBackupTime = convertTime(backupList.GetLastSuccessfulBackupTime())
        freshObjectStore.Status.ServerRecoveryWindow[serverName] = recoveryWindow

        return c.Status().Update(ctx, &freshObjectStore)
    })
}
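
One property of this pattern is that each attempt rebuilds the change on a freshly read copy, so fields written by another writer in the meantime should be preserved rather than overwritten. The sketch below illustrates that with two goroutines and a toy in-memory store; it does not use the plugin's types, and `store`, `status`, `mutate` and `runConcurrentWriters` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConflict = errors.New("conflict")

// status is a hypothetical, trimmed-down ObjectStore status.
type status struct {
	version          int
	recoveryWindow   string
	lastFailedBackup string
}

// store is a toy API server guarded by a mutex.
type store struct {
	mu  sync.Mutex
	cur status
}

func (s *store) get() status {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.cur
}

// update rejects writes based on a stale version, like the API server does.
func (s *store) update(next status) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if next.version != s.cur.version {
		return errConflict
	}
	next.version++
	s.cur = next
	return nil
}

// mutate re-reads the object on every attempt, applies change to the fresh
// copy, and retries while the write conflicts (up to maxAttempts).
func mutate(s *store, maxAttempts int, change func(*status)) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		fresh := s.get()
		change(&fresh)
		if err = s.update(fresh); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// runConcurrentWriters starts two writers that touch different fields.
func runConcurrentWriters(s *store) []error {
	changes := []func(*status){
		func(st *status) { st.recoveryWindow = "set by backup path" },
		func(st *status) { st.lastFailedBackup = "set by failure path" },
	}
	errs := make([]error, len(changes))
	var wg sync.WaitGroup
	for i, change := range changes {
		wg.Add(1)
		go func(i int, change func(*status)) {
			defer wg.Done()
			errs[i] = mutate(s, 5, change)
		}(i, change)
	}
	wg.Wait()
	return errs
}

func main() {
	s := &store{}
	fmt.Println(runConcurrentWriters(s), s.get())
}
```

Whichever goroutine loses the race re-reads the object, sees the other writer's field, and writes both back, so the final state contains both changes.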

Environment

  • Plugin version: 0.10.0
  • CNPG Operator: 1.26+
  • Kubernetes: 1.29+
  • Object storage: AWS S3

We're happy to submit a PR if this approach looks correct.
