
backup-verify

ugallu-backup-verify runs scheduled or ad-hoc verification jobs against cluster backups. Two backends - Velero (via the velero.io API) and etcd-snapshot (hostPath mount) - both funnel into the same result CR + SecurityEvent contract, so downstream consumers (attestor, dashboards) don’t care which one ran.

| Mode | What it does |
| --- | --- |
| checksum-only | Recomputes SHA-256 over the backup payload and compares it against the recorded checksum. Cheap, idempotent. |
| full-restore | Velero only. Creates a temporary Restore CR into a sandbox namespace, watches its phase, diffs the sandbox against the backup manifest, then tears down the sandbox and the Restore CR. |

full-restore is the only mode that exercises the actual recovery path - checksum-only proves only that the bytes survived, not that the backup is restorable.
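The core of the checksum-only comparison can be sketched in a few lines of Go. This is a minimal illustration, not the operator's actual code: verifyChecksum is a hypothetical name, and it assumes the payload is readable as an io.Reader and the recorded digest is hex-encoded.

package backupverify

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// verifyChecksum streams the payload through SHA-256 and compares the
// result against the digest recorded at backup time. Illustrative sketch,
// not the operator's actual function.
func verifyChecksum(payload io.Reader, recordedHex string) error {
	h := sha256.New()
	if _, err := io.Copy(h, payload); err != nil {
		return fmt.Errorf("reading payload: %w", err)
	}
	if actual := hex.EncodeToString(h.Sum(nil)); actual != recordedHex {
		return fmt.Errorf("checksum mismatch: recorded %s, computed %s", recordedHex, actual)
	}
	return nil
}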

The reconcile flow:

  1. The reconciler picks up a BackupVerifyRun CR.
  2. Backend dispatch on spec.backend (for the velero full-restore branch, see the sketch after this list):
    • velero → reads the velero.io/v1.Backup referenced by spec.backupRef; for full-restore, creates a velero.io/v1.Restore with a namespaceMapping aimed at spec.sandboxNamespace.
    • etcd-snapshot → opens the file at --etcd-snapshot-dir/<ref>.
  3. Findings (each carrying its own severity) are merged into a BackupVerifyResult CR.
  4. worstSeverity is computed and persisted via the Status subresource.
  5. A SecurityEvent is emitted: BackupVerifyCompleted (Compliance class), or BackupVerifyMismatch (Detection class) when any finding is ≥ high.
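A sketch of the velero branch of step 2, building the Restore as an unstructured object. spec.backupName and spec.namespaceMapping are Velero v1 API fields; the run-derived name, the placement in the velero namespace, and the single-namespace mapping are illustrative assumptions (a real backup may span several namespaces, each of which would be mapped into the sandbox).

package backupverify

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// newSandboxRestore builds the velero.io/v1 Restore for a full-restore run,
// mapping a backed-up namespace into the sandbox so the restore never
// touches the original namespace.
func newSandboxRestore(runName, backupName, srcNS, sandboxNS string) *unstructured.Unstructured {
	r := &unstructured.Unstructured{}
	r.SetAPIVersion("velero.io/v1")
	r.SetKind("Restore")
	r.SetName(runName) // keyed by run name, so a re-Create is idempotent
	r.SetNamespace("velero")
	r.Object["spec"] = map[string]interface{}{
		"backupName":       backupName,
		"namespaceMapping": map[string]interface{}{srcNS: sandboxNS},
	}
	return r
}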
Example BackupVerifyRun:

apiVersion: security.ugallu.io/v1alpha1
kind: BackupVerifyRun
metadata: { name: nightly-velero, namespace: ugallu-system }
spec:
  backend: velero
  backupRef:
    name: nightly-2026-04-29
    namespace: velero
  mode: full-restore
  sandboxNamespace: bv-nightly-bvsandbox
  timeout: 10m

The -bvsandbox suffix is enforced by an admission policy - only namespaces ending in that suffix are accepted as restore targets, to keep accidental full-restores from clobbering production namespaces.

BackupVerifyRun.status.phase: Pending (implicit at create) → Running → Succeeded | Failed. No finalizer. The CR’s spec is immutable post-creation (admission policy).

A separate BackupVerifyResult CR is created by the controller once the run reaches a terminal phase. Its status.worstSeverity ranks the findings (critical > high > medium > low > info), persisted via the status subresource.
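The ranking itself is just a max over a fixed ordering. A minimal sketch, assuming findings expose their severity as a string; the function and variable names are illustrative:

package backupverify

// severityRank mirrors the documented ordering:
// critical > high > medium > low > info.
var severityRank = map[string]int{
	"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4,
}

// worstOf returns the highest-ranked severity among the findings,
// defaulting to "info" when there are none.
func worstOf(severities []string) string {
	worst := "info"
	for _, s := range severities {
		if severityRank[s] > severityRank[worst] {
			worst = s
		}
	}
	return worst
}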

Reconciler loop (pseudocode):

on each BackupVerifyRun event:
  run := Get(req)
  if run.Status.Phase in {Succeeded, Failed}:
    return                                    # terminal
  if run.Status.Phase == "":
    patch Status.Phase=Running, StartTime=now
    emitSE(BackupVerifyStarted, info)
    return Requeue
  # Phase == Running
  result, findings := executeBackend(run)     # velero or etcd-snapshot
  if mode == full-restore:
    if Restore CR not yet terminal:
      RequeueAfter: 10s
      return
  upsert BackupVerifyResult (Get-after-Create + Status().Update for worstSeverity)
  patch Status.Phase = (success ? Succeeded : Failed)
  emitSE(BackupVerifyCompleted | BackupVerifyMismatch | BackupVerifyFailed)
  cleanupSandbox(best-effort)                 # full-restore only

Phase=Running on operator restart means the previous verifier crashed mid-run. The new pod re-Gets the CR and re-enters executeBackend (idempotent against the same BackupRef). For full-restore the Restore CR is keyed by run name, so a re-Create returns AlreadyExists and the loop simply observes its phase. Result CR creation is tolerant of AlreadyExists (re-Get + Status update keeps worstSeverity correct).
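The Get-after-Create pattern might look like the following controller-runtime sketch. It uses an unstructured object to stay self-contained; the operator would use its generated BackupVerifyResult type, and upsertResult is an illustrative name.

package backupverify

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// upsertResult creates the result CR if absent; on AlreadyExists it re-Gets
// the live object (picking up the current resourceVersion) and replays the
// status, so worstSeverity converges no matter which pod got there first.
func upsertResult(ctx context.Context, c client.Client, desired *unstructured.Unstructured) error {
	// The API server drops .status on Create (status is a subresource), so
	// keep a copy to replay via Status().Update afterwards.
	status := desired.Object["status"]
	if err := c.Create(ctx, desired); err != nil {
		if !apierrors.IsAlreadyExists(err) {
			return err
		}
		// Lost the race (or a previous pod created it): re-Get for the
		// current resourceVersion, then update status on the live copy.
		live := desired.DeepCopy()
		if err := c.Get(ctx, client.ObjectKeyFromObject(desired), live); err != nil {
			return err
		}
		desired = live
	}
	desired.Object["status"] = status
	return c.Status().Update(ctx, desired)
}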

Sandbox + Restore cleanup is best-effort: failures are logged as findings but do not prevent the run from reaching a terminal phase.

Pod killed during a full-restore tick (Restore CR in flight): the new pod resumes by polling Restore phase every 10s. If the Restore reached Completed while the operator was down, the diff runs and the run terminates Succeeded on the very next tick.
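The resume path is just the normal requeue loop. A sketch, assuming Velero's v1 terminal Restore phases; the helper names are illustrative:

package backupverify

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// restoreTerminal reports whether a velero.io/v1 Restore phase is final.
// The phase set follows Velero's v1 API but is asserted here, not verified
// against a specific Velero version.
func restoreTerminal(phase string) bool {
	switch phase {
	case "Completed", "PartiallyFailed", "Failed", "FailedValidation":
		return true
	}
	return false
}

// pollRestore re-enters the reconcile loop every 10s until the Restore
// settles; after an operator restart this same path observes a Restore
// that completed while the pod was down, and the run terminates on that
// tick. The bool reports whether the Restore is terminal.
func pollRestore(phase string) (ctrl.Result, bool) {
	if !restoreTerminal(phase) {
		return ctrl.Result{RequeueAfter: 10 * time.Second}, false
	}
	return ctrl.Result{}, true
}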

Edge cases:

  • Sandbox suffix. A ValidatingAdmissionPolicy requires the sandbox namespace name to end with -bvsandbox.
  • Velero CRD missing. Backend startup fails fast: the run goes Failed with a finding code=velero-crd-missing.
  • Backup not found. Same outcome: finding code=velero-backup-not-found, severity=high → the SE flips to Detection class.
  • Worst severity tie-breakers. When two findings share the same severity, the highest-priority finding code wins for SE type selection, but worstSeverity is still the shared value (sketched below).
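A hedged sketch of that tie-break. The per-code priority table is assumed for illustration; the real ordering of finding codes is an operator-internal detail not documented here.

package backupverify

type finding struct {
	code     string
	severity string
}

// sevRank mirrors the documented severity ordering.
var sevRank = map[string]int{"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

// codePriority is an assumed, illustrative ordering of finding codes.
var codePriority = map[string]int{
	"velero-backup-not-found": 2,
	"velero-crd-missing":      1,
}

// dominant picks the finding that drives SE type selection: highest
// severity wins; on a tie, the higher-priority code wins. worstSeverity
// itself is unaffected by the tie-break.
func dominant(fs []finding) (finding, bool) {
	if len(fs) == 0 {
		return finding{}, false
	}
	best := fs[0]
	for _, f := range fs[1:] {
		if sevRank[f.severity] > sevRank[best.severity] ||
			(sevRank[f.severity] == sevRank[best.severity] && codePriority[f.code] > codePriority[best.code]) {
			best = f
		}
	}
	return best, true
}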
RBAC:

rules:
  - apiGroups: [security.ugallu.io]
    resources: [backupverifyruns, backupverifyruns/status,
                backupverifyresults, backupverifyresults/status]
    verbs: [get, list, watch, create, update, patch]
  - apiGroups: [security.ugallu.io]
    resources: [securityevents]
    verbs: [create]
  - apiGroups: [velero.io]
    resources: [backups, backupstoragelocations]
    verbs: [get, list, watch]
  - apiGroups: [velero.io]
    resources: [restores]
    verbs: [get, list, watch, create, delete]
  - apiGroups: [""]
    resources: [pods, configmaps, secrets, serviceaccounts]
    verbs: [get, list, watch]
  - apiGroups: [""]
    resources: [namespaces]
    verbs: [get, list, watch, delete]   # sandbox cleanup
  - apiGroups: [""]
    resources: [events]
    verbs: [create, patch]
  - apiGroups: [coordination.k8s.io]
    resources: [leases]
    verbs: [get, list, watch, create, update, patch, delete]
CRDs:

  • BackupVerifyRun
    • one CR per verification; immutable spec.
  • BackupVerifyResult
    • one CR per run, populated by the controller; status carries worstSeverity (info/low/medium/high/critical).

Flags: --cluster-id, --cluster-name, --etcd-snapshot-dir, --leader-election-namespace.

Deployment (1 replica) in ugallu-system, leader election on, mounts the etcd-snapshot hostPath read-only on the control-plane nodes. RBAC includes velero.io Backups/BackupStorageLocations read, Restores CRUD, and namespaces delete (sandbox cleanup).

Metrics: ugallu_backup_verify_runs_total{backend,mode,outcome}, ugallu_backup_verify_findings_total{severity}, ugallu_backup_verify_full_restore_duration_seconds{outcome}.
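These might be declared with client_golang roughly as follows; the metric names and label sets mirror the list above, while the Go variable names and help strings are illustrative.

package backupverify

import "github.com/prometheus/client_golang/prometheus"

var (
	runsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ugallu_backup_verify_runs_total",
		Help: "Verification runs by backend, mode, and outcome.",
	}, []string{"backend", "mode", "outcome"})

	findingsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ugallu_backup_verify_findings_total",
		Help: "Findings recorded across runs, labeled by severity.",
	}, []string{"severity"})

	fullRestoreDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "ugallu_backup_verify_full_restore_duration_seconds",
		Help: "Wall-clock duration of full-restore verifications.",
	}, []string{"outcome"})
)

func init() {
	prometheus.MustRegister(runsTotal, findingsTotal, fullRestoreDuration)
}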