
backup-verify

ugallu-backup-verify runs scheduled or ad-hoc verification jobs against cluster backups. Two backends - Velero (via the velero.io API) and etcd-snapshot (hostPath mount) - both funnel into the same result CR + SecurityEvent contract, so downstream consumers (attestor, dashboards) don’t care which one ran.

| Mode | What it does |
| --- | --- |
| checksum-only | Recomputes SHA-256 over the backup payload and compares it against the recorded checksum. Cheap, idempotent. |
| full-restore | Velero only. Creates a temporary Restore CR into a sandbox namespace, watches its phase, diffs the sandbox against the backup manifest, then tears down the sandbox and the Restore CR. |

full-restore is the only mode that exercises the actual recovery path - checksum-only proves only that the bytes survived, not that the backup is restorable.
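The core of the checksum-only comparison can be sketched in a few lines of Go. This is a minimal illustration, not the operator's actual code: verifyChecksum is a hypothetical name, and it assumes the payload is readable as an io.Reader and the recorded digest is hex-encoded.

package backupverify

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// verifyChecksum streams the payload through SHA-256 and compares the
// result against the digest recorded at backup time. Illustrative sketch,
// not the operator's actual function.
func verifyChecksum(payload io.Reader, recordedHex string) error {
	h := sha256.New()
	if _, err := io.Copy(h, payload); err != nil {
		return fmt.Errorf("reading payload: %w", err)
	}
	if actual := hex.EncodeToString(h.Sum(nil)); actual != recordedHex {
		return fmt.Errorf("checksum mismatch: recorded %s, computed %s", recordedHex, actual)
	}
	return nil
}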

The reconcile flow:

  1. The reconciler picks up a BackupVerifyRun CR.
  2. Backend dispatch on spec.backend (for the velero full-restore branch, see the sketch after this list):
    • velero → reads the velero.io/v1.Backup referenced by spec.backupRef; for full-restore, creates a velero.io/v1.Restore with a namespaceMapping aimed at spec.sandboxNamespace.
    • etcd-snapshot → opens the file at --etcd-snapshot-dir/<ref>.
  3. Findings (each carrying its own severity) are merged into a BackupVerifyResult CR.
  4. worstSeverity is computed and persisted via the Status subresource.
  5. A SecurityEvent is emitted: BackupVerifyCompleted (Compliance class), or BackupVerifyMismatch (Detection class) when any finding is ≥ high.
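A sketch of the velero branch of step 2, building the Restore as an unstructured object. spec.backupName and spec.namespaceMapping are Velero v1 API fields; the run-derived name, the placement in the velero namespace, and the single-namespace mapping are illustrative assumptions (a real backup may span several namespaces, each of which would be mapped into the sandbox).

package backupverify

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// newSandboxRestore builds the velero.io/v1 Restore for a full-restore run,
// mapping a backed-up namespace into the sandbox so the restore never
// touches the original namespace.
func newSandboxRestore(runName, backupName, srcNS, sandboxNS string) *unstructured.Unstructured {
	r := &unstructured.Unstructured{}
	r.SetAPIVersion("velero.io/v1")
	r.SetKind("Restore")
	r.SetName(runName) // keyed by run name, so a re-Create is idempotent
	r.SetNamespace("velero")
	r.Object["spec"] = map[string]interface{}{
		"backupName":       backupName,
		"namespaceMapping": map[string]interface{}{srcNS: sandboxNS},
	}
	return r
}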
Example BackupVerifyRun:

apiVersion: security.ugallu.io/v1alpha1
kind: BackupVerifyRun
metadata: { name: nightly-velero, namespace: ugallu-system }
spec:
  backend: velero
  backupRef:
    name: nightly-2026-04-29
    namespace: velero
  mode: full-restore
  sandboxNamespace: bv-nightly-bvsandbox
  timeout: 10m

The -bvsandbox suffix is enforced by an admission policy - only namespaces ending in that suffix are accepted as restore targets, to keep accidental full-restores from clobbering production namespaces.

BackupVerifyRun.status.phase: Pending (implicit at create) → Running → Succeeded | Failed. No finalizer. The CR’s spec is immutable post-creation (admission policy).

A separate BackupVerifyResult CR is created by the controller once the run reaches a terminal phase. Its status.worstSeverity ranks the findings (critical > high > medium > low > info), persisted via the status subresource.
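The ranking itself is just a max over a fixed ordering. A minimal sketch, assuming findings expose their severity as a string; the function and variable names are illustrative:

package backupverify

// severityRank mirrors the documented ordering:
// critical > high > medium > low > info.
var severityRank = map[string]int{
	"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4,
}

// worstOf returns the highest-ranked severity among the findings,
// defaulting to "info" when there are none.
func worstOf(severities []string) string {
	worst := "info"
	for _, s := range severities {
		if severityRank[s] > severityRank[worst] {
			worst = s
		}
	}
	return worst
}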

Reconciler loop (pseudocode):

on each BackupVerifyRun event:
  run := Get(req)
  if run.Status.Phase in {Succeeded, Failed}:
    return                                    # terminal
  if run.Status.Phase == "":
    patch Status.Phase=Running, StartTime=now
    emitSE(BackupVerifyStarted, info)
    return Requeue
  # Phase == Running
  result, findings := executeBackend(run)     # velero or etcd-snapshot
  if mode == full-restore:
    if Restore CR not yet terminal:
      RequeueAfter: 10s
      return
  upsert BackupVerifyResult (Get-after-Create + Status().Update for worstSeverity)
  patch Status.Phase = (success ? Succeeded : Failed)
  emitSE(BackupVerifyCompleted | BackupVerifyMismatch | BackupVerifyFailed)
  cleanupSandbox(best-effort)                 # full-restore only

Phase=Running on operator restart means the previous verifier crashed mid-run. The new pod re-Gets the CR and re-enters executeBackend (idempotent against the same BackupRef). For full-restore the Restore CR is keyed by run name, so a re-Create returns AlreadyExists and the loop simply observes its phase. Result CR creation is tolerant of AlreadyExists (re-Get + Status update keeps worstSeverity correct).
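The Get-after-Create pattern might look like the following controller-runtime sketch. It uses an unstructured object to stay self-contained; the operator would use its generated BackupVerifyResult type, and upsertResult is an illustrative name.

package backupverify

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// upsertResult creates the result CR if absent; on AlreadyExists it re-Gets
// the live object (picking up the current resourceVersion) and replays the
// status, so worstSeverity converges no matter which pod got there first.
func upsertResult(ctx context.Context, c client.Client, desired *unstructured.Unstructured) error {
	// The API server drops .status on Create (status is a subresource), so
	// keep a copy to replay via Status().Update afterwards.
	status := desired.Object["status"]
	if err := c.Create(ctx, desired); err != nil {
		if !apierrors.IsAlreadyExists(err) {
			return err
		}
		// Lost the race (or a previous pod created it): re-Get for the
		// current resourceVersion, then update status on the live copy.
		live := desired.DeepCopy()
		if err := c.Get(ctx, client.ObjectKeyFromObject(desired), live); err != nil {
			return err
		}
		desired = live
	}
	desired.Object["status"] = status
	return c.Status().Update(ctx, desired)
}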

Sandbox + Restore cleanup is best-effort: failures are logged as findings but do not prevent the run from reaching a terminal phase.

Pod killed during a full-restore tick (Restore CR in flight): the new pod resumes by polling Restore phase every 10s. If the Restore reached Completed while the operator was down, the diff runs and the run terminates Succeeded on the very next tick.
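The resume path is just the normal requeue loop. A sketch, assuming Velero's v1 terminal Restore phases; the helper names are illustrative:

package backupverify

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// restoreTerminal reports whether a velero.io/v1 Restore phase is final.
// The phase set follows Velero's v1 API but is asserted here, not verified
// against a specific Velero version.
func restoreTerminal(phase string) bool {
	switch phase {
	case "Completed", "PartiallyFailed", "Failed", "FailedValidation":
		return true
	}
	return false
}

// pollRestore re-enters the reconcile loop every 10s until the Restore
// settles; after an operator restart this same path observes a Restore
// that completed while the pod was down, and the run terminates on that
// tick. The bool reports whether the Restore is terminal.
func pollRestore(phase string) (ctrl.Result, bool) {
	if !restoreTerminal(phase) {
		return ctrl.Result{RequeueAfter: 10 * time.Second}, false
	}
	return ctrl.Result{}, true
}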

Edge cases:

  • Sandbox suffix. A ValidatingAdmissionPolicy requires the sandbox namespace name to end with -bvsandbox.
  • Velero CRD missing. Backend startup fails fast: the run goes Failed with a finding code=velero-crd-missing.
  • Backup not found. Same outcome: finding code=velero-backup-not-found, severity=high → the SE flips to Detection class.
  • Worst severity tie-breakers. When two findings share the same severity, the highest-priority finding code wins for SE type selection, but worstSeverity is still the shared value (sketched below).
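A hedged sketch of that tie-break. The per-code priority table is assumed for illustration; the real ordering of finding codes is an operator-internal detail not documented here.

package backupverify

type finding struct {
	code     string
	severity string
}

// sevRank mirrors the documented severity ordering.
var sevRank = map[string]int{"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

// codePriority is an assumed, illustrative ordering of finding codes.
var codePriority = map[string]int{
	"velero-backup-not-found": 2,
	"velero-crd-missing":      1,
}

// dominant picks the finding that drives SE type selection: highest
// severity wins; on a tie, the higher-priority code wins. worstSeverity
// itself is unaffected by the tie-break.
func dominant(fs []finding) (finding, bool) {
	if len(fs) == 0 {
		return finding{}, false
	}
	best := fs[0]
	for _, f := range fs[1:] {
		if sevRank[f.severity] > sevRank[best.severity] ||
			(sevRank[f.severity] == sevRank[best.severity] && codePriority[f.code] > codePriority[best.code]) {
			best = f
		}
	}
	return best, true
}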
RBAC:

rules:
  - apiGroups: [security.ugallu.io]
    resources: [backupverifyruns, backupverifyruns/status,
                backupverifyresults, backupverifyresults/status]
    verbs: [get, list, watch, create, update, patch]
  - apiGroups: [security.ugallu.io]
    resources: [securityevents]
    verbs: [create]
  - apiGroups: [velero.io]
    resources: [backups, backupstoragelocations]
    verbs: [get, list, watch]
  - apiGroups: [velero.io]
    resources: [restores]
    verbs: [get, list, watch, create, delete]
  - apiGroups: [""]
    resources: [pods, configmaps, secrets, serviceaccounts]
    verbs: [get, list, watch]
  - apiGroups: [""]
    resources: [namespaces]
    verbs: [get, list, watch, delete]   # sandbox cleanup
  - apiGroups: [""]
    resources: [events]
    verbs: [create, patch]
  - apiGroups: [coordination.k8s.io]
    resources: [leases]
    verbs: [get, list, watch, create, update, patch, delete]
CRDs:

  • BackupVerifyRun
    • one CR per verification; immutable spec.
  • BackupVerifyResult
    • one CR per run, populated by the controller; status carries worstSeverity (info/low/medium/high/critical).

Flags: --cluster-id, --cluster-name, --etcd-snapshot-dir, --leader-election-namespace.

Deployment (1 replica) in ugallu-system, leader election on, mounts the etcd-snapshot hostPath read-only on the control-plane nodes. RBAC includes velero.io Backups/BackupStorageLocations read, Restores CRUD, and namespaces delete (sandbox cleanup).

Metrics: ugallu_backup_verify_runs_total{backend,mode,outcome}, ugallu_backup_verify_findings_total{severity}, ugallu_backup_verify_full_restore_duration_seconds{outcome}.
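These might be declared with client_golang roughly as follows; the metric names and label sets mirror the list above, while the Go variable names and help strings are illustrative.

package backupverify

import "github.com/prometheus/client_golang/prometheus"

var (
	runsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ugallu_backup_verify_runs_total",
		Help: "Verification runs by backend, mode, and outcome.",
	}, []string{"backend", "mode", "outcome"})

	findingsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ugallu_backup_verify_findings_total",
		Help: "Findings recorded across runs, labeled by severity.",
	}, []string{"severity"})

	fullRestoreDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "ugallu_backup_verify_full_restore_duration_seconds",
		Help: "Wall-clock duration of full-restore verifications.",
	}, []string{"outcome"})
)

func init() {
	prometheus.MustRegister(runsTotal, findingsTotal, fullRestoreDuration)
}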