backup-verify
ugallu-backup-verify runs scheduled or ad-hoc verification jobs
against cluster backups. Two backends - Velero (via the
velero.io API) and etcd-snapshot (hostPath mount) - both
funnel into the same result CR + SecurityEvent contract, so
downstream consumers (attestor, dashboards) don’t care which one
ran.
| Mode | What it does |
|---|---|
| `checksum-only` | Recomputes SHA-256 over the backup payload and compares against the recorded checksum. Cheap, idempotent. |
| `full-restore` | Velero only. Creates a temporary Restore CR into a sandbox namespace, watches phase, diffs the sandbox against the backup manifest, then tears the sandbox + Restore CR down. |
full-restore is the only mode that exercises the actual recovery
path - checksum-only only proves bytes survived, not that the
backup is restorable.
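As a rough illustration of what checksum-only amounts to, the sketch below streams a payload through SHA-256 and compares the digest with a recorded hex value. The function names are hypothetical, and where the recorded checksum comes from is not specified here, so it is simply passed in as a string.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// VerifyChecksum (hypothetical name) streams the payload through SHA-256 and
// reports whether the digest equals the recorded hex checksum. The comparison
// ignores hex case and surrounding whitespace in the recorded value.
func VerifyChecksum(payload io.Reader, recordedHex string) (bool, error) {
	h := sha256.New()
	if _, err := io.Copy(h, payload); err != nil {
		return false, fmt.Errorf("read payload: %w", err)
	}
	got := hex.EncodeToString(h.Sum(nil))
	return strings.EqualFold(got, strings.TrimSpace(recordedHex)), nil
}

// VerifyChecksumBytes is a convenience wrapper for in-memory payloads.
func VerifyChecksumBytes(payload []byte, recordedHex string) (bool, error) {
	return VerifyChecksum(bytes.NewReader(payload), recordedHex)
}

func main() {
	// SHA-256 of the three bytes "abc".
	const recorded = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	ok, err := VerifyChecksumBytes([]byte("abc"), recorded)
	fmt.Println(ok, err)
}
```

Because the check only reads the payload, it can be re-run any number of times with the same outcome, which is what makes this mode idempotent.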
- Reconciler picks up a `BackupVerifyRun` CR.
- Backend dispatch on `spec.backend`:
  - `velero` → reads the `velero.io/v1.Backup` referenced by `spec.backupRef`; for full-restore, creates a `velero.io/v1.Restore` with `namespaceMapping` aimed at `spec.sandboxNamespace`.
  - `etcd-snapshot` → opens the file at `--etcd-snapshot-dir/<ref>`.
- Findings (per-finding severity) are merged into a `BackupVerifyResult` CR.
- `worstSeverity` is computed and persisted via the Status subresource.
- SE emitted: `BackupVerifyCompleted` (Compliance class) or `BackupVerifyMismatch` (Detection class) when any finding is ≥ high.
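To make the Velero branch of the flow above more concrete, here is a minimal sketch of the kind of Restore object a full-restore run might create. It uses only the standard library and builds the object as a generic map; `backupName` and `namespaceMapping` are fields of the Velero Restore spec, while `buildRestore`, the single source namespace parameter, and the `payments` namespace are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildRestore (hypothetical helper) returns a velero.io/v1 Restore that
// redirects one source namespace into the sandbox namespace. The Restore is
// named after the run so that a repeated create is detectable.
func buildRestore(runName, backupName, veleroNS, sandboxNS, sourceNS string) map[string]any {
	return map[string]any{
		"apiVersion": "velero.io/v1",
		"kind":       "Restore",
		"metadata": map[string]any{
			"name":      runName,
			"namespace": veleroNS,
		},
		"spec": map[string]any{
			"backupName":         backupName,
			"includedNamespaces": []string{sourceNS},
			// Restored objects land in the sandbox instead of sourceNS.
			"namespaceMapping": map[string]string{sourceNS: sandboxNS},
		},
	}
}

func main() {
	// "payments" is an assumed source namespace, not taken from the docs.
	r := buildRestore("nightly-velero", "nightly-2026-04-29", "velero",
		"bv-nightly-bvsandbox", "payments")
	out, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

How the real controller chooses which namespaces to map is not described here; the sketch only shows the shape of the mapping.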
Example
```yaml
apiVersion: security.ugallu.io/v1alpha1
kind: BackupVerifyRun
metadata: { name: nightly-velero, namespace: ugallu-system }
spec:
  backend: velero
  backupRef:
    name: nightly-2026-04-29
    namespace: velero
  mode: full-restore
  sandboxNamespace: bv-nightly-bvsandbox
  timeout: 10m
```

The `-bvsandbox` suffix is enforced by an admission policy - only
namespaces ending in that suffix are accepted as restore targets,
to keep accidental full-restores from clobbering production
namespaces.
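The suffix rule is simple enough to mirror client-side before submitting a run. The sketch below is only a pre-flight check with a hypothetical function name; the admission policy remains the actual enforcement point, and its exact expression is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// sandboxSuffix is the suffix the admission policy is described as requiring.
const sandboxSuffix = "-bvsandbox"

// validSandboxNamespace (hypothetical helper) reports whether a namespace
// name would pass the suffix rule.
func validSandboxNamespace(name string) bool {
	return strings.HasSuffix(name, sandboxSuffix)
}

func main() {
	for _, ns := range []string{"bv-nightly-bvsandbox", "production", "bvsandbox-prod"} {
		fmt.Printf("%s: %v\n", ns, validSandboxNamespace(ns))
	}
}
```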
Internals
State machine
`BackupVerifyRun.status.phase`: Pending (implicit at create) ->
Running -> Succeeded | Failed. No finalizer. The CR’s spec
is immutable post-creation (admission policy).
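The phase progression can be written down as a small transition table. This is a sketch with hypothetical type and function names; it models the implicit Pending phase as an empty string, on the assumption that an unset `status.phase` is what "implicit at create" means.

```go
package main

import "fmt"

// Phase mirrors BackupVerifyRun.status.phase.
type Phase string

const (
	PhasePending   Phase = "" // assumed: implicit at create, status.phase unset
	PhaseRunning   Phase = "Running"
	PhaseSucceeded Phase = "Succeeded"
	PhaseFailed    Phase = "Failed"
)

// allowed lists the only forward transitions: Pending -> Running -> terminal.
var allowed = map[Phase][]Phase{
	PhasePending: {PhaseRunning},
	PhaseRunning: {PhaseSucceeded, PhaseFailed},
}

// Terminal reports whether no further transition is possible.
func (p Phase) Terminal() bool {
	return p == PhaseSucceeded || p == PhaseFailed
}

// CanTransition reports whether moving from one phase to another is allowed.
func CanTransition(from, to Phase) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(PhasePending, PhaseRunning))   // allowed
	fmt.Println(CanTransition(PhaseSucceeded, PhaseRunning)) // terminal phases never move
}
```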
A separate BackupVerifyResult CR is created by the controller
once the run reaches a terminal phase. Its status.worstSeverity
ranks the findings (critical > high > medium > low >
info), persisted via the status subresource.
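A minimal sketch of the severity ranking and the resulting SecurityEvent type follows. The `Finding` struct and function names are hypothetical, and two behaviours are assumptions rather than documented facts: an empty finding list yields `info`, and unrecognised severities are ignored. The `BackupVerifyFailed` event is not modelled.

```go
package main

import "fmt"

// rank orders severities: critical > high > medium > low > info.
var rank = map[string]int{"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

// Finding is a hypothetical shape for a single verification finding.
type Finding struct {
	Code     string
	Severity string
}

// WorstSeverity returns the highest-ranked severity among the findings.
// Assumption: no findings (or only unknown severities) yields "info".
func WorstSeverity(findings []Finding) string {
	worst := "info"
	for _, f := range findings {
		if r, ok := rank[f.Severity]; ok && r > rank[worst] {
			worst = f.Severity
		}
	}
	return worst
}

// EventType picks the Detection-class event when any finding is >= high,
// otherwise the Compliance-class event.
func EventType(findings []Finding) string {
	if rank[WorstSeverity(findings)] >= rank["high"] {
		return "BackupVerifyMismatch"
	}
	return "BackupVerifyCompleted"
}

func main() {
	findings := []Finding{
		{Code: "example-low", Severity: "low"}, // hypothetical code
		{Code: "velero-backup-not-found", Severity: "high"},
	}
	fmt.Println(WorstSeverity(findings), EventType(findings))
}
```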
Reconcile loop
```
on each BackupVerifyRun event:
  run := Get(req)
  if run.Status.Phase in {Succeeded, Failed}:
    return                                   # terminal
  if run.Status.Phase == "":
    patch Status.Phase=Running, StartTime=now
    emitSE(BackupVerifyStarted, info)
    return Requeue

  # Phase == Running
  result, findings := executeBackend(run)    # velero or etcd-snapshot
  if mode == full-restore:
    if Restore CR not yet terminal:
      RequeueAfter: 10s
      return
  upsert BackupVerifyResult (Get-after-Create + Status().Update for worstSeverity)
  patch Status.Phase = (success ? Succeeded : Failed)
  emitSE(BackupVerifyCompleted | BackupVerifyMismatch | BackupVerifyFailed)
  cleanupSandbox(best-effort)                # full-restore only
```

Error recovery
`Phase=Running` on operator restart means the previous verifier
crashed mid-run. The new pod re-Gets the CR and re-enters
executeBackend (idempotent against the same BackupRef). For
full-restore the Restore CR is keyed by run name, so a re-Create
returns AlreadyExists and the loop simply observes its phase.
Result CR creation is tolerant of AlreadyExists (re-Get +
Status update keeps worstSeverity correct).
Sandbox + Restore cleanup is best-effort: failures are logged as findings but do not prevent the run from reaching a terminal phase.
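The AlreadyExists handling described above can be sketched with an in-memory stand-in for the API server. Everything here is hypothetical scaffolding: `store`, `ErrAlreadyExists`, and `ensureRestore` are illustrative names, and a real controller would more likely check the error with `apierrors.IsAlreadyExists` from `k8s.io/apimachinery`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyExists stands in for the API server's AlreadyExists error.
var ErrAlreadyExists = errors.New("already exists")

// store is a toy object store mapping Restore name -> phase.
type store struct{ objects map[string]string }

func newStore() *store { return &store{objects: map[string]string{}} }

// Create fails with ErrAlreadyExists if the name is taken.
func (s *store) Create(name, phase string) error {
	if _, ok := s.objects[name]; ok {
		return ErrAlreadyExists
	}
	s.objects[name] = phase
	return nil
}

// ensureRestore creates the Restore keyed by run name, or, if a previous
// (possibly crashed) reconcile already created it, observes its phase.
func ensureRestore(s *store, runName string) (string, bool, error) {
	err := s.Create(runName, "New")
	switch {
	case err == nil:
		return "New", true, nil
	case errors.Is(err, ErrAlreadyExists):
		return s.objects[runName], false, nil
	default:
		return "", false, err
	}
}

func main() {
	s := newStore()
	fmt.Println(ensureRestore(s, "nightly-velero")) // first pass creates

	// Simulate Velero finishing while the operator was down.
	s.objects["nightly-velero"] = "Completed"
	fmt.Println(ensureRestore(s, "nightly-velero")) // restart only observes
}
```

Keying the Restore by run name is what makes the second call safe: the create cannot produce a duplicate, so re-entering the loop after a crash degrades to a read.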
Crash recovery scenario
Pod killed during a full-restore tick (Restore CR in flight): the
new pod resumes by polling Restore phase every 10s. If the
Restore reached Completed while the operator was down, the diff
runs and the run terminates Succeeded on the very next tick.
Edge cases
- Sandbox suffix. A `ValidatingAdmissionPolicy` requires the sandbox namespace name to end with `-bvsandbox`.
- Velero CRD missing. Backend startup fails fast: the run goes `Failed` with a finding `code=velero-crd-missing`.
- Backup not found. Same outcome: finding `code=velero-backup-not-found`, `severity=high` -> SE flips to Detection class.
- Worst severity tie-breakers. When two findings share the same severity, the highest-priority finding code wins for SE type selection but `worstSeverity` is still the shared value.
Full RBAC (ClusterRole)
```yaml
rules:
  - apiGroups: [security.ugallu.io]
    resources: [backupverifyruns, backupverifyruns/status, backupverifyresults, backupverifyresults/status]
    verbs: [get, list, watch, create, update, patch]
  - apiGroups: [security.ugallu.io]
    resources: [securityevents]
    verbs: [create]
  - apiGroups: [velero.io]
    resources: [backups, backupstoragelocations]
    verbs: [get, list, watch]
  - apiGroups: [velero.io]
    resources: [restores]
    verbs: [get, list, watch, create, delete]
  - apiGroups: [""]
    resources: [pods, configmaps, secrets, serviceaccounts]
    verbs: [get, list, watch]
  - apiGroups: [""]
    resources: [namespaces]
    verbs: [get, list, watch, delete]   # sandbox cleanup
  - apiGroups: [""]
    resources: [events]
    verbs: [create, patch]
  - apiGroups: [coordination.k8s.io]
    resources: [leases]
    verbs: [get, list, watch, create, update, patch, delete]
```

CRDs owned
- `BackupVerifyRun` - one CR per verification, immutable spec.
- `BackupVerifyResult` - one CR per run, populated by the controller; status carries `worstSeverity` (info/low/medium/high/critical).
Key flags
`--cluster-id`, `--cluster-name`, `--etcd-snapshot-dir`,
`--leader-election-namespace`.
Deployment
Deployment (1 replica) in ugallu-system, leader election on, mounts
the etcd-snapshot hostPath read-only on the control-plane nodes.
RBAC includes velero.io Backups/BackupStorageLocations read,
Restores CRUD, and namespaces delete (sandbox cleanup).
Telemetry
`ugallu_backup_verify_runs_total{backend,mode,outcome}`,
`ugallu_backup_verify_findings_total{severity}`,
`ugallu_backup_verify_full_restore_duration_seconds{outcome}`.