Observability
Every ugallu operator exposes:
/metricson:9090- Prometheus text format./healthz+/readyzon:8081- HTTP health pair (200/503).- structured logs to
stderr- JSON, key-value attributes via controller-runtime’s logger.
The umbrella chart ships a monitoring subchart that wires
PrometheusRule, ServiceMonitor, and Grafana dashboards. The
Prometheus operator must be installed beforehand; the chart only
adds the rules / monitors, never the operator itself.
Metric naming
Section titled “Metric naming”Every operator’s metrics are namespaced under ugallu_<operator>_*.
That gives you a low-cardinality top-level grouping and lets a
single PromQL {__name__=~"ugallu_audit_.*"} survey one operator’s
state without a label join.
A few canonical metrics every operator exposes (added by the SDK, not the operator code):
ugallu_emitter_emits_total{class,outcome}- SE emission counter withoutcomein{ok, conflict_idempotent, conflict_real, error}. The split between idempotent and real conflicts is the metric you watch when the cluster is replaying audit traffic.ugallu_emitter_emit_latency_seconds- histogram of the apiserverCreateround-trip.ugallu_controller_runtime_*- the standard controller-runtime set (reconcile_total,reconcile_errors_total,workqueue_depth).
Alerts
Section titled “Alerts”The monitoring subchart’s PrometheusRule ships these alert
families:
| Alert family | What it means |
|---|---|
UgalluOperatorDown | An operator’s up time-series has been 0 for >2m. One alert per operator. |
UgalluEmitterErrors | ugallu_emitter_emits_total{outcome="error"} is non-zero - SE writes are failing, not just retried. |
UgalluReconcileFloods | controller_runtime_reconcile_total rate is >100/s for >5m on a single CR kind. Hint at a hot-loop. |
UgalluWorkqueueGrowing | workqueue_depth is monotonically increasing for >5m - the operator is falling behind on a CR. |
UgalluForensicsBacklog | ugallu_forensics_queue_size > forensicsConfig.maxConcurrent for >2m. Real incident pile-up. |
UgalluAttestorSealFailing | AttestationBundle.phase stuck off Sealed for >10m. Either Rekor or WORM is unreachable. |
Alerts route through Alertmanager; no in-cluster pager, by design.
Dashboards
Section titled “Dashboards”Grafana dashboards live under
charts/ugallu/charts/monitoring/dashboards/. The chart loads
them via the grafana_dashboard=1 ConfigMap label that the
grafana chart’s sidecar picks up. Three dashboards ship today:
- Overview - one row per operator with the
up,emit rate,emit error rate, andreconcile workqueue depthfor that operator. - Detection chain - the audit-detection -> attestor -> forensics
pipeline. Per-rule emit rate, attestor
phasedistribution, forensics queue depth. - Backup & compliance -
BackupVerifyResult.worstSeverityhistogram over time,ComplianceScanResult.summarystacked area.
Operators log in JSON via slog + controller-runtime. Standard
attributes:
controller=<crd-kind>- which controller produced the line.name,namespace- the reconciled object.req=<UUID>- per-reconcile correlation ID.level-debug,info,warn,error.
debug is off by default; flip with --zap-log-level=debug per
operator. Don’t ship debug to production - the SDK logs every
reconcile request, not just errors.
Tracing
Section titled “Tracing”OpenTelemetry tracing is wired but not exported by default. Set
umbrella.tracing.endpoint=otel-collector.observability:4317 to
turn it on; spans cover the SE emit pipeline (prepare, create,
status-patch).