forensics
ugallu-forensics watches SecurityEvent CRs, evaluates a trigger
predicate, and runs an IR-as-code capture pipeline against the
suspect Pod when it fires.
The operator is built around a single invariant: every side effect
is an EventResponse CR that the attestor
seals into an in-toto bundle. The pipeline itself is the audit chain
- there is no separate event log.
Trigger predicate
ForensicsConfig.spec.trigger gates pipeline entry on:
- classes (default [Detection, Anomaly])
- minSeverities (default [high, critical])
- whitelistedTypes - explicit opt-in (e.g. PrivilegedPodChange, ClusterAdminGranted, HostPathMount, ExecIntoPod, …). An empty whitelist matches nothing.
- requireAttested - when true, the SE must have status.phase=Attested (set by the attestor after the AttestationBundle is Sealed). Defends against forged, unauthenticated SEs that would otherwise drive a freeze.
- namespaceAllowlist - empty = match-all
- implicit: subject.kind=Pod (anything else is a non_pod_subject skip)
Misses bump ugallu_forensics_skipped_total{reason} so dashboards
show why an SE didn’t trigger.
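The gate can be read as a chain of checks that either passes or yields a skip reason. The Python sketch below is illustrative only: the Trigger class, the dict shape of the SE, the order of the checks, and every reason string except non_pod_subject are assumptions rather than the operator's code. It also assumes minSeverities is a membership list, not a threshold.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Trigger:
    """Hypothetical in-memory mirror of ForensicsConfig.spec.trigger."""
    classes: Set[str] = field(default_factory=lambda: {"Detection", "Anomaly"})
    min_severities: Set[str] = field(default_factory=lambda: {"high", "critical"})
    whitelisted_types: Set[str] = field(default_factory=set)    # empty matches nothing
    require_attested: bool = False                               # default is an assumption
    namespace_allowlist: Set[str] = field(default_factory=set)  # empty matches all


def skip_reason(trigger: Trigger, se: dict) -> Optional[str]:
    """Return None if the SE enters the pipeline, else a skip-reason label."""
    if se["subject"]["kind"] != "Pod":
        return "non_pod_subject"
    if se["class"] not in trigger.classes:
        return "class"                  # hypothetical reason name
    if se["severity"] not in trigger.min_severities:
        return "severity"               # hypothetical reason name
    if se["type"] not in trigger.whitelisted_types:
        return "type_not_whitelisted"   # hypothetical reason name
    if trigger.require_attested and se.get("phase") != "Attested":
        return "not_attested"           # hypothetical reason name
    allow = trigger.namespace_allowlist
    if allow and se["subject"]["namespace"] not in allow:
        return "namespace"              # hypothetical reason name
    return None
```

A caller would increment the skipped counter with the returned label whenever the result is not None.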
Pipeline
Three steps, run sequentially per incident, each emitting its own
EventResponse for attestation:
1. PodFreezeStep
Labels the suspect Pod with ugallu.io/frozen=<incident-uid> and
applies a deny-all CiliumNetworkPolicy (Cilium clusters) or
NetworkPolicy (vanilla CNI). Egress is widened to DNS, the
configured WORM endpoint, and the forensics workload namespace -
without these the snapshot ephemeral container can’t resolve or
upload.
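For the vanilla-CNI case, the freeze policy could look roughly like the manifest built below. This is a sketch under stated assumptions: the policy name, the WORM endpoint expressed as a CIDR on TCP 443, and selecting the forensics namespace through the kubernetes.io/metadata.name label are illustrative choices, not taken from the operator.

```python
def freeze_policy(incident_uid: str, namespace: str,
                  worm_cidr: str, forensics_ns: str) -> dict:
    """Build a deny-all NetworkPolicy with three egress exceptions."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": f"ugallu-freeze-{incident_uid}",  # hypothetical naming scheme
            "namespace": namespace,
        },
        "spec": {
            # Selects only the Pod labelled by PodFreezeStep.
            "podSelector": {"matchLabels": {"ugallu.io/frozen": incident_uid}},
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [],  # no rules: all ingress is denied
            "egress": [
                # DNS to any destination.
                {"ports": [{"protocol": "UDP", "port": 53},
                           {"protocol": "TCP", "port": 53}]},
                # WORM endpoint (assumed reachable as a CIDR over HTTPS).
                {"to": [{"ipBlock": {"cidr": worm_cidr}}],
                 "ports": [{"protocol": "TCP", "port": 443}]},
                # Forensics workload namespace.
                {"to": [{"namespaceSelector": {"matchLabels": {
                    "kubernetes.io/metadata.name": forensics_ns}}}]},
            ],
        },
    }
```

Because the selector keys on the incident UID, re-applying the same manifest for the same incident converges to the same object.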
2. FilesystemSnapshotStep
Injects an ephemeral container (ugallu-forensics-snapshot) into
the suspect Pod and tees the process’s /proc/<pid>/root to S3 as
a tar+gzip+sha256 stream. Capability scope: CAP_DAC_READ_SEARCH
only - enough to read arbitrary inode trees without root, but not
in the capability set that pod-security.kubernetes.io/enforce=baseline
allows, so the ephemeral container is admitted only when the
suspect’s namespace is labelled privileged. Baseline-only
namespaces fall back to a pod-level memory-only capture.
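The tar+gzip+sha256 stream can be pictured as an archive writer feeding a sink that hashes bytes as they pass through. The sketch below is a minimal Python illustration, not the snapshot container's code: HashingWriter and snapshot are hypothetical names, and the sink stands in for whatever upload writer is actually used.

```python
import hashlib
import tarfile


class HashingWriter:
    """Write-only sink wrapper: forwards bytes and hashes them in passing."""

    def __init__(self, sink):
        self._sink = sink
        self._hash = hashlib.sha256()
        self.size = 0

    def write(self, data: bytes) -> int:
        self._hash.update(data)
        self.size += len(data)
        return self._sink.write(data)

    def hexdigest(self) -> str:
        return self._hash.hexdigest()


def snapshot(root: str, sink):
    """Stream `root` as tar+gzip into `sink`; return (sha256 hex, byte count)."""
    out = HashingWriter(sink)
    # "w|gz" is tarfile's streaming mode: it only ever calls write() on the
    # file object, so the sink does not need to be seekable.
    with tarfile.open(fileobj=out, mode="w|gz") as tar:
        tar.add(root, arcname=".")
    return out.hexdigest(), out.size
```

The digest covers the compressed bytes exactly as they were sent, so it can be compared against what the object store received without re-reading the filesystem.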
3. EvidenceUploadStep
Builds a content-addressed manifest (sha256 over the canonical
JSON), uploads it to
s3://<bucket>/forensics/<incident>/manifest-<sha>.json with
COMPLIANCE Object Lock, and references it from the
IncidentCaptureCompleted SE as the sole evidence URL. Re-uploads
of identical content are no-ops; divergent rewrites are rejected
by Object Lock - that’s the audit guarantee.
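Content addressing means the object key is a function of the manifest bytes alone. The sketch below shows one way to derive it in Python. The canonical form used here (sorted keys, compact separators, UTF-8) is an assumption, as are the function names and the manifest fields in the usage note; the actual canonicalisation rules are not specified in this page.

```python
import hashlib
import json
from typing import Tuple


def canonical_json(manifest: dict) -> bytes:
    """Assumed canonical form: sorted keys, no extra whitespace, UTF-8."""
    return json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def manifest_object(incident: str, manifest: dict) -> Tuple[str, bytes]:
    """Return (object key, body) for the content-addressed manifest."""
    body = canonical_json(manifest)
    digest = hashlib.sha256(body).hexdigest()
    return f"forensics/{incident}/manifest-{digest}.json", body
```

Two manifests with the same content but different key order map to the same key, which is why a retry is a no-op, while any change in content produces a different key rather than an overwrite.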
The freeze backend is detected at startup via a Cilium CRD probe;
the choice surfaces on ForensicsConfig.status.freezeBackend. The
detector re-probes every 10m so a CNI swap is reflected without a
restart.
Each step EventResponse carries:
- app.kubernetes.io/managed-by=ugallu-forensics
- ugallu.io/incident-uid=<sha256(triggerSE.uid)[0:16]>
- ugallu.io/parent-er=<previous-step-er-name> (chain back-link)
- ugallu.io/step=<podfreeze|filesystem-snapshot|evidence-upload|podunfreeze>
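The label set can be derived mechanically from the trigger SE's UID and the step name. The Python sketch below assumes the digest is hex-encoded before truncation to 16 characters and that the first step simply omits the parent-er label; both are assumptions, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, Optional

STEPS = {"podfreeze", "filesystem-snapshot", "evidence-upload", "podunfreeze"}


def incident_uid(trigger_se_uid: str) -> str:
    """First 16 hex characters of sha256 over the trigger SE's UID."""
    return hashlib.sha256(trigger_se_uid.encode("utf-8")).hexdigest()[0:16]


def step_labels(trigger_se_uid: str, step: str,
                parent_er: Optional[str] = None) -> Dict[str, str]:
    """Labels for one step's EventResponse."""
    if step not in STEPS:
        raise ValueError(f"unknown step: {step}")
    labels = {
        "app.kubernetes.io/managed-by": "ugallu-forensics",
        "ugallu.io/incident-uid": incident_uid(trigger_se_uid),
        "ugallu.io/step": step,
    }
    if parent_er is not None:
        # Back-link to the previous step's ER (assumed absent on the first step).
        labels["ugallu.io/parent-er"] = parent_er
    return labels
```

Sixteen hex characters stay well inside the 63-character limit on Kubernetes label values.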
Lifecycle
Manual ack. An authorized ServiceAccount stamps
ugallu.io/incident-acknowledged=true on the
IncidentCaptureCompleted SE. An admission policy gates this
annotation by SA. The controller observes the flip and runs
PodUnfreezeStep.
Auto-unfreeze (optional). When cleanup.autoUnfreezeAfter is
positive, the controller computes the deadline as
triggerSE.metadata.creationTimestamp + autoUnfreezeAfter (the
grace) and unfreezes when the wall clock crosses it. Durable by
construction - the deadline is derived from the SE, not held in
process memory, so a controller restart honours the grace.
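Because the deadline is a pure function of the SE's creation time and the configured grace, it can be recomputed after any restart. A minimal Python sketch follows; the function names are hypothetical, and the timestamp is assumed to be in the second-resolution RFC 3339 UTC form that Kubernetes writes for creationTimestamp.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def unfreeze_deadline(creation_timestamp: str,
                      grace: timedelta) -> Optional[datetime]:
    """Deadline = creationTimestamp + grace, or None when auto-unfreeze is off."""
    if grace <= timedelta(0):
        return None  # non-positive grace disables auto-unfreeze
    created = datetime.strptime(
        creation_timestamp, "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return created + grace


def should_unfreeze(now: datetime, deadline: Optional[datetime]) -> bool:
    """True once the wall clock has reached the deadline."""
    return deadline is not None and now >= deadline
```

With autoUnfreezeAfter set to 4h, an SE created at 10:00 UTC yields a 14:00 UTC deadline no matter when the controller last started.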
Crash recovery. At startup the operator lists EventResponses it created, reconstructs the incident state machine, and resumes:
| Step | Recovery |
|---|---|
| PodFreeze / PodUnfreeze | idempotent re-apply (label + CNP are convergent) |
| FilesystemSnapshot | non-idempotent - if the ephemeral container is Terminated+success, salvage the logs and proceed; otherwise mark the ER Permanent so we don’t re-run a half-completed capture |
| EvidenceUpload | rebuild the manifest deterministically, retry the upload (Object Lock makes the second write a no-op when content matches) |
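The recovery table reduces to a small decision function. The Python sketch below is illustrative: the action names are hypothetical, and "Terminated+success" is interpreted as a terminated container with exit code 0, which is an assumption.

```python
from typing import Optional


def recovery_action(step: str,
                    container_state: Optional[str] = None,
                    exit_code: Optional[int] = None) -> str:
    """Pick the recovery action for an interrupted step."""
    if step in ("podfreeze", "podunfreeze"):
        # Label and policy are convergent, so re-applying is safe.
        return "reapply"
    if step == "filesystem-snapshot":
        # Never re-run a capture; only salvage one that finished cleanly.
        if container_state == "Terminated" and exit_code == 0:
            return "salvage-logs"
        return "mark-permanent"
    if step == "evidence-upload":
        # A deterministic manifest makes the retry a no-op on matching content.
        return "rebuild-and-retry"
    raise ValueError(f"unknown step: {step}")
```

Only the snapshot branch needs to inspect runtime state; the other two steps are safe to repeat.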
Concurrency
The pipeline holds a semaphore (MaxConcurrent, default 5) and an
in-flight set keyed on incident UID. Re-emits of the same incident
return immediately as a no-op; a busy queue surfaces on
ugallu_forensics_queue_size so the SE reconciler can back off
with a RequeueAfter instead of looping.
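The combination of a bounded semaphore and an in-flight set can be sketched in Python as follows. The class shape, method names, and return strings are hypothetical; only the default of 5 and the keying on incident UID come from the text.

```python
import threading


class Pipeline:
    """Admission control: bounded concurrency plus per-incident dedup."""

    def __init__(self, max_concurrent: int = 5):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._in_flight = set()

    def try_start(self, incident_uid: str) -> str:
        """Return 'started', 'duplicate' (no-op), or 'busy' (caller requeues)."""
        with self._lock:
            if incident_uid in self._in_flight:
                return "duplicate"
            if not self._slots.acquire(blocking=False):
                return "busy"
            self._in_flight.add(incident_uid)
            return "started"

    def finish(self, incident_uid: str) -> None:
        """Release the slot held by a completed incident; unknown UIDs are ignored."""
        with self._lock:
            if incident_uid in self._in_flight:
                self._in_flight.remove(incident_uid)
                self._slots.release()

    def in_flight(self) -> int:
        with self._lock:
            return len(self._in_flight)
```

The non-blocking acquire is what lets the reconciler return quickly and requeue instead of parking a worker on a full pipeline.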
Example

    apiVersion: security.ugallu.io/v1alpha1
    kind: ForensicsConfig
    metadata:
      name: default
    spec:
      trigger:
        classes: [Detection, Anomaly]
        minSeverities: [high, critical]
        whitelistedTypes:
          - PrivilegedPodChange
          - ClusterAdminGranted
          - HostPathMount
        requireAttested: true
        namespaceAllowlist: []
      cleanup:
        autoUnfreezeAfter: 4h
      evidence:
        bucket: ugallu-forensics
        objectLock: COMPLIANCE
        retainDays: 365

Telemetry
- ugallu_forensics_incidents_total{outcome}
- ugallu_forensics_steps_total{step,outcome}
- ugallu_forensics_skipped_total{reason}
- ugallu_forensics_queue_size
- ugallu_forensics_cni_detect_failures_total
- ugallu_forensics_recovery_total{outcome}
- ugallu_forensics_auto_unfreeze_total{outcome}
Internals
State machine
ForensicsConfig is a singleton with no phase. The pipeline
state lives on the EventResponse chain - each step’s
status.phase walks Pending -> Running -> Succeeded | Failed.
An incident is identified by
ugallu.io/incident-uid=<sha256(triggerSE.uid)[:16]> which
labels every ER it produces.
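The per-step phase walk is small enough to express as a transition table. The Python sketch below is illustrative, with hypothetical names; it encodes only the transitions named in the text and treats Succeeded and Failed as terminal.

```python
ALLOWED = {
    "Pending": {"Running"},
    "Running": {"Succeeded", "Failed"},
    "Succeeded": set(),  # terminal
    "Failed": set(),     # terminal
}


def advance(current: str, nxt: str) -> str:
    """Return the next phase, rejecting transitions outside the documented walk."""
    if nxt not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal phase transition: {current} -> {nxt}")
    return nxt


def is_terminal(phase: str) -> bool:
    """True for phases with no outgoing transitions."""
    return phase in ALLOWED and not ALLOWED[phase]
```

A guard of this kind would stop a terminal phase from being moved back to Running.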
Reconcile loop (Config status)

    on each ForensicsConfig event or 30s tick:
        cfg := Get("default")
        patch Status.FreezeBackend = cniDetector.Backend()
        patch Status.InFlightIncidents = pipeline.InFlight()
        patch Status.LastConfigLoadAt = now
        RequeueAfter: 30s

Reconcile loop (incident pipeline)

    on each SecurityEvent event:
        if !triggerPredicate(se):
            metric Skipped[reason]++; return
        if pipeline.InFlight() >= cfg.MaxConcurrentIncidents:
            RequeueAfter: 5s; return
        pipeline.Process(incidentFrom(se))   # async, returns immediately

    # inside pipeline.Process (per-incident goroutine):
    for step in [PodFreeze, FilesystemSnapshot, EvidenceUpload]:
        er := upsertEventResponse(step, parent=prev)
        err := step.Run(ctx, incident)
        if err:
            er.Status.Phase=Failed; emit IncidentCaptureFailed; rollback; return
    emit IncidentCaptureCompleted
    scheduleAutoUnfreeze(triggerSE.creationTimestamp + cfg.AutoUnfreezeAfter)

Error recovery
Operator restart triggers a startup sweep (leader-elected). The
sweep lists every EventResponse with
app.kubernetes.io/managed-by=ugallu-forensics and reconstructs
the incident state machine. The auto-unfreeze deadline is derived
from the triggerSE’s creationTimestamp + grace (not held in
process memory), so a controller restart honours the grace.
Crash recovery scenario
Operator Pod killed during FilesystemSnapshotStep: the new Pod runs the
startup sweep, observes the ephemeral container’s state. If
Terminated+success, the snapshot logs are salvaged and the
pipeline proceeds to EvidenceUploadStep. If the ephemeral
container exited Failed during the gap, the snapshot ER is
marked Permanent so the incident terminates Failed and
forensics emits IncidentCaptureFailed.
Edge cases
- CNI backend auto-detect runs every 10m so a CNI swap (Cilium <-> CoreV1) is reflected without a restart.
- Sandbox PSA admission for the snapshot ephemeral container requires the suspect’s namespace to be labelled privileged. Baseline-only namespaces fall back to a memory-only capture.
- Manual ack stamps ugallu.io/incident-acknowledged=true on IncidentCaptureCompleted; an admission policy gates the annotation by ServiceAccount.
- WORM credentials are read via env vars so the operator does not need cluster-wide Secret list/watch.
Full RBAC

    # ClusterRole
    rules:
      - apiGroups: [security.ugallu.io]
        resources:
          - securityevents
          - securityevents/status
          - forensicsconfigs
          - forensicsconfigs/status
          - eventresponses
          - eventresponses/status
        verbs: [get, list, watch, create, patch, update]
      - apiGroups: [""]
        resources: [pods, pods/log, pods/ephemeralcontainers]
        verbs: [get, list, watch, update, patch]
      - apiGroups: [""]
        resources: [secrets]
        verbs: [get, create]   # capture credentials only
      - apiGroups: [networking.k8s.io]
        resources: [networkpolicies]
        verbs: [get, list, create, delete]
      - apiGroups: [cilium.io]
        resources: [ciliumnetworkpolicies]
        verbs: [get, list, create, delete]
    # Namespaced Role
      - apiGroups: [coordination.k8s.io]
        resources: [leases]
        verbs: [get, list, watch, create, update, patch, delete]
      - apiGroups: [""]
        resources: [events]
        verbs: [create, patch]

CRDs owned
- ForensicsConfig - singleton trigger predicate + cleanup policy.
- EventResponse - one CR per pipeline step; immutable, sealed by the attestor.
Deployment
Helm subchart forensics. Runs in ugallu-system-privileged because
the snapshot ephemeral container needs CAP_DAC_READ_SEARCH (and
optionally CAP_SYS_PTRACE once memory snapshots are turned on).
WORM credentials are read from the umbrella’s worm.secret Secret
via env vars (WORM_ACCESS_KEY / WORM_SECRET_KEY) so we don’t
grant a cluster-wide Secret list/watch - controller-runtime’s cache
for corev1.Secret is explicitly disabled and master credentials
are resolved with a one-shot client at startup.