forensics

ugallu-forensics watches SecurityEvent CRs, evaluates a trigger predicate, and runs an IR-as-code capture pipeline against the suspect Pod when it fires.

The operator is built around a single invariant: every side effect is an EventResponse CR that the attestor seals into an in-toto bundle. The pipeline itself is the audit chain - there is no separate event log.

ForensicsConfig.spec.trigger gates pipeline entry on:

  • classes (default [Detection, Anomaly])
  • minSeverities (default [high, critical])
  • whitelistedTypes - explicit opt-in (e.g. PrivilegedPodChange, ClusterAdminGranted, HostPathMount, ExecIntoPod, …). An empty whitelist matches nothing.
  • requireAttested - when true, the SecurityEvent (SE) must have status.phase=Attested (set by the attestor after the AttestationBundle is Sealed). This defends against forged, unattested SEs that would otherwise drive a freeze.
  • namespaceAllowlist - empty = match-all
  • implicit: subject.kind=Pod (anything else is a non_pod_subject skip)

Each miss increments ugallu_forensics_skipped_total{reason}, so dashboards show why an SE didn’t trigger.
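
The gates above can be read as a chain of early returns. The following Go sketch is illustrative only: the struct shapes, the order of the checks, and every skip reason other than non_pod_subject are assumptions, not the operator's actual code.

```go
package main

import "fmt"

// SecurityEvent carries only the fields the predicate reads (hypothetical shape).
type SecurityEvent struct {
	Class, Severity, Type, Phase string
	SubjectKind, Namespace       string
}

// Trigger mirrors the fields of ForensicsConfig.spec.trigger.
type Trigger struct {
	Classes, MinSeverities, WhitelistedTypes, NamespaceAllowlist []string
	RequireAttested                                              bool
}

func contains(list []string, v string) bool {
	for _, item := range list {
		if item == v {
			return true
		}
	}
	return false
}

// Evaluate returns ("", true) on a match, or a skip reason and false.
// Only "non_pod_subject" is named in the docs; the other reasons are invented.
func (t Trigger) Evaluate(se SecurityEvent) (string, bool) {
	switch {
	case se.SubjectKind != "Pod":
		return "non_pod_subject", false
	case !contains(t.Classes, se.Class):
		return "class", false
	case !contains(t.MinSeverities, se.Severity):
		return "severity", false
	case !contains(t.WhitelistedTypes, se.Type): // empty whitelist matches nothing
		return "type_not_whitelisted", false
	case t.RequireAttested && se.Phase != "Attested":
		return "not_attested", false
	case len(t.NamespaceAllowlist) > 0 && !contains(t.NamespaceAllowlist, se.Namespace):
		return "namespace", false // an empty allowlist matches every namespace
	}
	return "", true
}

func main() {
	trigger := Trigger{
		Classes:          []string{"Detection", "Anomaly"},
		MinSeverities:    []string{"high", "critical"},
		WhitelistedTypes: []string{"HostPathMount"},
		RequireAttested:  true,
	}
	se := SecurityEvent{Class: "Detection", Severity: "high", Type: "HostPathMount",
		Phase: "Attested", SubjectKind: "Pod", Namespace: "prod"}
	reason, ok := trigger.Evaluate(se)
	fmt.Println("triggered:", ok, "skip reason:", reason)
}
```

Returning the reason alongside the verdict keeps the metric label and the decision in one place, which is one way to guarantee that every skip is counted.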

Three steps, run sequentially per incident, each emitting its own EventResponse for attestation:

PodFreeze. Labels the suspect Pod with ugallu.io/frozen=<incident-uid> and applies a deny-all CiliumNetworkPolicy (Cilium clusters) or NetworkPolicy (vanilla CNI). Egress stays open only to DNS, the configured WORM endpoint, and the forensics workload namespace; without these exceptions the snapshot ephemeral container can’t resolve or upload.

FilesystemSnapshot. Injects an ephemeral container (ugallu-forensics-snapshot) into the suspect Pod and tees the process’s /proc/<pid>/root to S3 as a tar+gzip+sha256 stream. Capability scope: CAP_DAC_READ_SEARCH only - enough to read arbitrary inode trees without full root. The capability is still outside what the baseline Pod Security Standard allows, so the ephemeral container is admitted only when the suspect’s namespace is labelled pod-security.kubernetes.io/enforce=privileged. Baseline-only namespaces fall back to a pod-level memory-only capture.

EvidenceUpload. Builds a content-addressed manifest (sha256 over the canonical JSON), uploads it to s3://<bucket>/forensics/<incident>/manifest-<sha>.json with COMPLIANCE Object Lock, and references it from the IncidentCaptureCompleted SE as the sole evidence URL. Re-uploads of identical content are no-ops; divergent rewrites are rejected by Object Lock - that’s the audit guarantee.
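
Content addressing means the object key is a function of the manifest bytes, so identical content always maps to the same key. The Go sketch below assumes a flat string map for the manifest and uses encoding/json (which emits map keys in sorted order) as a stand-in for whatever canonical JSON form the operator uses; manifestKey is a hypothetical helper.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// manifestKey serialises the manifest deterministically and derives the
// object key forensics/<incident>/manifest-<sha>.json from its SHA-256.
func manifestKey(incident string, manifest map[string]string) (string, []byte, error) {
	body, err := json.Marshal(manifest) // map keys are written in sorted order
	if err != nil {
		return "", nil, err
	}
	sum := sha256.Sum256(body)
	key := fmt.Sprintf("forensics/%s/manifest-%s.json", incident, hex.EncodeToString(sum[:]))
	return key, body, nil
}

func main() {
	// Field names and values are placeholders for illustration.
	manifest := map[string]string{
		"step":     "filesystem-snapshot",
		"artifact": "snapshot.tar.gz",
	}
	key, body, err := manifestKey("0123456789abcdef", manifest)
	if err != nil {
		panic(err)
	}
	fmt.Println(key)
	fmt.Println(string(body))
}
```

Because the key changes whenever the content does, a retry after a crash either lands on the same key with the same bytes or on a different key entirely; it never needs to overwrite a locked object.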

The freeze backend is detected at startup via a Cilium CRD probe; the choice surfaces on ForensicsConfig.status.freezeBackend. The detector then re-probes every 10m so a CNI swap is reflected without a restart.

Each step EventResponse carries:

  • app.kubernetes.io/managed-by=ugallu-forensics
  • ugallu.io/incident-uid=<sha256(triggerSE.uid)[0:16]>
  • ugallu.io/parent-er=<previous-step-er-name> (chain back-link)
  • ugallu.io/step=<podfreeze|filesystem-snapshot|evidence-upload|podunfreeze>
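
As an illustration of how these labels could be derived, the Go sketch below hashes the trigger SE's UID and keeps the first 16 characters. It assumes the [0:16] slice is taken over the hex encoding of the digest (the docs do not say whether it is hex characters or raw bytes), and incidentUID and stepLabels are hypothetical helper names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// incidentUID returns the first 16 hex characters of sha256(triggerUID).
// Assumption: the slice is taken over the hex string, not the raw digest.
func incidentUID(triggerUID string) string {
	sum := sha256.Sum256([]byte(triggerUID))
	return hex.EncodeToString(sum[:])[:16]
}

// stepLabels builds the label set for one step's EventResponse.
// parentER is empty for the first step in the chain (an assumption).
func stepLabels(triggerUID, step, parentER string) map[string]string {
	labels := map[string]string{
		"app.kubernetes.io/managed-by": "ugallu-forensics",
		"ugallu.io/incident-uid":       incidentUID(triggerUID),
		"ugallu.io/step":               step,
	}
	if parentER != "" {
		labels["ugallu.io/parent-er"] = parentER
	}
	return labels
}

func main() {
	// The UID and ER name below are placeholders.
	labels := stepLabels("3f1c9a52-0000-4000-8000-000000000000", "filesystem-snapshot", "example-podfreeze-er")
	for k, v := range labels {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```

A 16-character hex value stays well inside the 63-character limit on Kubernetes label values, and deriving it from the SE UID lets a restarted operator recompute it without stored state.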

Manual ack. An authorized ServiceAccount stamps ugallu.io/incident-acknowledged=true on the IncidentCaptureCompleted SE. An admission policy gates this annotation by SA. The controller observes the flip and runs PodUnfreezeStep.

Auto-unfreeze (optional). When cleanup.autoUnfreezeAfter is positive, the controller computes the deadline as triggerSE.metadata.creationTimestamp + autoUnfreezeAfter (the grace period) and unfreezes when the wall clock crosses it. Durable by construction - the deadline is derived from the SE, not held in process memory, so a controller restart honours the grace.

Crash recovery. At startup the operator lists EventResponses it created, reconstructs the incident state machine, and resumes:

| Step | Recovery |
| --- | --- |
| PodFreeze / PodUnfreeze | idempotent re-apply (label + CNP are convergent) |
| FilesystemSnapshot | non-idempotent - if the ephemeral container is Terminated+success, salvage the logs and proceed; otherwise mark the ER Permanent so we don’t re-run a half-completed capture |
| EvidenceUpload | rebuild the manifest deterministically, retry the upload (Object Lock makes the second write a no-op when content matches) |
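
The per-step recovery rules can be expressed as a single decision function. The Go sketch below is a hedged reading of those rules: the Action values and the ContainerState type are invented for illustration, and the step names reuse the ugallu.io/step label values.

```go
package main

import "fmt"

// Action is the recovery decision for one interrupted step (hypothetical type).
type Action string

const (
	Reapply       Action = "reapply"
	SalvageLogs   Action = "salvage-logs-and-continue"
	MarkPermanent Action = "mark-permanent"
	RetryUpload   Action = "rebuild-manifest-and-retry"
)

// ContainerState summarises the snapshot ephemeral container as observed
// during the startup sweep.
type ContainerState struct {
	Terminated bool
	ExitCode   int
}

// recoveryAction maps an interrupted step to its recovery rule.
func recoveryAction(step string, snapshot ContainerState) (Action, error) {
	switch step {
	case "podfreeze", "podunfreeze":
		return Reapply, nil // convergent: safe to apply again
	case "filesystem-snapshot":
		if snapshot.Terminated && snapshot.ExitCode == 0 {
			return SalvageLogs, nil
		}
		return MarkPermanent, nil // never re-run a half-completed capture
	case "evidence-upload":
		return RetryUpload, nil // same content lands on the same key
	}
	return "", fmt.Errorf("unknown step %q", step)
}

func main() {
	action, err := recoveryAction("filesystem-snapshot", ContainerState{Terminated: true, ExitCode: 0})
	if err != nil {
		panic(err)
	}
	fmt.Println(action)
}
```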

The pipeline holds a semaphore (MaxConcurrent, default 5) and an in-flight set keyed on incident UID. Re-emits of the same incident return immediately as a no-op; a busy queue surfaces on ugallu_forensics_queue_size so the SE reconciler can back off with a RequeueAfter instead of looping.
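
One common way to combine a semaphore with an in-flight set in Go is a buffered channel guarded by a mutex. The sketch below is an assumption about the shape, not the operator's implementation; Pipeline, TryStart, and Done are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Pipeline limits concurrent incidents and de-duplicates re-emits.
type Pipeline struct {
	mu       sync.Mutex
	inFlight map[string]struct{}
	sem      chan struct{} // capacity = max concurrent incidents
}

func NewPipeline(maxConcurrent int) *Pipeline {
	return &Pipeline{
		inFlight: make(map[string]struct{}),
		sem:      make(chan struct{}, maxConcurrent),
	}
}

// TryStart never blocks. started is true when the caller now owns a slot;
// busy is true when all slots are taken and the caller should requeue.
// A duplicate of an in-flight incident returns (false, false): a no-op.
func (p *Pipeline) TryStart(uid string) (started, busy bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, dup := p.inFlight[uid]; dup {
		return false, false
	}
	select {
	case p.sem <- struct{}{}:
		p.inFlight[uid] = struct{}{}
		return true, false
	default:
		return false, true
	}
}

// Done releases the slot held by uid, if any.
func (p *Pipeline) Done(uid string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.inFlight[uid]; ok {
		delete(p.inFlight, uid)
		<-p.sem
	}
}

// InFlight reports how many incidents currently hold a slot.
func (p *Pipeline) InFlight() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.inFlight)
}

func main() {
	p := NewPipeline(1)
	fmt.Println(p.TryStart("incident-a")) // takes the only slot
	fmt.Println(p.TryStart("incident-a")) // duplicate of an in-flight incident
	fmt.Println(p.TryStart("incident-b")) // no free slot
	p.Done("incident-a")
	fmt.Println(p.TryStart("incident-b")) // slot is free again
}
```

The non-blocking select is what lets a reconciler return a RequeueAfter instead of parking a worker goroutine on a full semaphore.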

apiVersion: security.ugallu.io/v1alpha1
kind: ForensicsConfig
metadata:
  name: default
spec:
  trigger:
    classes: [Detection, Anomaly]
    minSeverities: [high, critical]
    whitelistedTypes:
      - PrivilegedPodChange
      - ClusterAdminGranted
      - HostPathMount
    requireAttested: true
    namespaceAllowlist: []
  cleanup:
    autoUnfreezeAfter: 4h
  evidence:
    bucket: ugallu-forensics
    objectLock: COMPLIANCE
    retainDays: 365

  • ugallu_forensics_incidents_total{outcome}
  • ugallu_forensics_steps_total{step,outcome}
  • ugallu_forensics_skipped_total{reason}
  • ugallu_forensics_queue_size
  • ugallu_forensics_cni_detect_failures_total
  • ugallu_forensics_recovery_total{outcome}
  • ugallu_forensics_auto_unfreeze_total{outcome}

ForensicsConfig is a singleton with no phase. The pipeline state lives on the EventResponse chain - each step’s status.phase walks Pending -> Running -> Succeeded | Failed. An incident is identified by ugallu.io/incident-uid=<sha256(triggerSE.uid)[:16]> which labels every ER it produces.

on each ForensicsConfig event or 30s tick:
    cfg := Get("default")
    patch Status.FreezeBackend = cniDetector.Backend()
    patch Status.InFlightIncidents = pipeline.InFlight()
    patch Status.LastConfigLoadAt = now
    RequeueAfter: 30s

on each SecurityEvent event:
    if !triggerPredicate(se): metric Skipped[reason]++; return
    if pipeline.InFlight() >= cfg.MaxConcurrentIncidents:
        RequeueAfter: 5s; return
    pipeline.Process(incidentFrom(se))  # async, returns immediately

# inside pipeline.Process (per-incident goroutine):
for step in [PodFreeze, FilesystemSnapshot, EvidenceUpload]:
    er := upsertEventResponse(step, parent=prev)
    err := step.Run(ctx, incident)
    if err: er.Status.Phase=Failed; emit IncidentCaptureFailed; rollback; return
emit IncidentCaptureCompleted
scheduleAutoUnfreeze(triggerSE.creationTimestamp + cfg.AutoUnfreezeAfter)

An operator restart triggers a startup sweep (leader-elected). The sweep lists every EventResponse with app.kubernetes.io/managed-by=ugallu-forensics and reconstructs the incident state machine. The auto-unfreeze deadline is derived from the trigger SE’s creationTimestamp plus the grace period (not held in process memory), so a controller restart honours the grace.

Pod killed during FilesystemSnapshotStep: new pod runs the startup sweep, observes the ephemeral container’s state. If Terminated+success, the snapshot logs are salvaged and the pipeline proceeds to EvidenceUploadStep. If the ephemeral container exited Failed during the gap, the snapshot ER is marked Permanent so the incident terminates Failed and forensics emits IncidentCaptureFailed.

  • CNI backend auto-detect runs every 10m so a CNI swap (Cilium <-> CoreV1) is reflected without a restart.
  • Sandbox PSA admission for the snapshot ephemeral container requires the suspect’s namespace labelled privileged. Baseline-only namespaces fall back to a memory-only capture.
  • Manual ack stamps ugallu.io/incident-acknowledged=true on IncidentCaptureCompleted; an admission policy gates the annotation by ServiceAccount.
  • WORM credentials are read via env vars so the operator does not need cluster-wide Secret list/watch.

# ClusterRole
rules:
  - apiGroups: [security.ugallu.io]
    resources:
      - securityevents
      - securityevents/status
      - forensicsconfigs
      - forensicsconfigs/status
      - eventresponses
      - eventresponses/status
    verbs: [get, list, watch, create, patch, update]
  - apiGroups: [""]
    resources: [pods, pods/log, pods/ephemeralcontainers]
    verbs: [get, list, watch, update, patch]
  - apiGroups: [""]
    resources: [secrets]
    verbs: [get, create] # capture credentials only
  - apiGroups: [networking.k8s.io]
    resources: [networkpolicies]
    verbs: [get, list, create, delete]
  - apiGroups: [cilium.io]
    resources: [ciliumnetworkpolicies]
    verbs: [get, list, create, delete]
# Namespaced Role
  - apiGroups: [coordination.k8s.io]
    resources: [leases]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [""]
    resources: [events]
    verbs: [create, patch]
  • ForensicsConfig - singleton trigger predicate + cleanup policy.
  • EventResponse - one CR per pipeline step; immutable, sealed by the attestor.

Helm subchart forensics. Runs in ugallu-system-privileged because the snapshot ephemeral container needs CAP_DAC_READ_SEARCH (and optionally CAP_SYS_PTRACE once memory snapshots are turned on).

WORM credentials are read from the umbrella’s worm.secret Secret via env vars (WORM_ACCESS_KEY / WORM_SECRET_KEY) so we don’t grant a cluster-wide Secret list/watch - controller-runtime’s cache for corev1.Secret is explicitly disabled and master credentials are resolved with a one-shot client at startup.