Skip to content

Observability

Every ugallu operator exposes:

  • /metrics on :9090 - Prometheus text format.
  • /healthz + /readyz on :8081 - HTTP health pair (200 / 503).
  • structured logs to stderr - JSON, key-value attributes via controller-runtime’s logger.

The umbrella chart ships a monitoring subchart that wires PrometheusRule, ServiceMonitor, and Grafana dashboards. The Prometheus operator must be installed beforehand; the chart only adds the rules / monitors, never the operator itself.

Every operator’s metrics are namespaced under ugallu_<operator>_*. That gives you a low-cardinality top-level grouping and lets a single PromQL {__name__=~"ugallu_audit_.*"} survey one operator’s state without a label join.

A few canonical metrics every operator exposes (added by the SDK, not the operator code):

  • ugallu_emitter_emits_total{class,outcome} - SE emission counter with outcome in {ok, conflict_idempotent, conflict_real, error}. The split between idempotent and real conflicts is the metric you watch when the cluster is replaying audit traffic.
  • ugallu_emitter_emit_latency_seconds - histogram of the apiserver Create round-trip.
  • ugallu_controller_runtime_* - the standard controller-runtime set (reconcile_total, reconcile_errors_total, workqueue_depth).

The monitoring subchart’s PrometheusRule ships these alert families:

Alert familyWhat it means
UgalluOperatorDownAn operator’s up time-series has been 0 for >2m. One alert per operator.
UgalluEmitterErrorsugallu_emitter_emits_total{outcome="error"} is non-zero - SE writes are failing, not just retried.
UgalluReconcileFloodscontroller_runtime_reconcile_total rate is >100/s for >5m on a single CR kind. Hint at a hot-loop.
UgalluWorkqueueGrowingworkqueue_depth is monotonically increasing for >5m - the operator is falling behind on a CR.
UgalluForensicsBacklogugallu_forensics_queue_size > forensicsConfig.maxConcurrent for >2m. Real incident pile-up.
UgalluAttestorSealFailingAttestationBundle.phase stuck off Sealed for >10m. Either Rekor or WORM is unreachable.

Alerts route through Alertmanager; no in-cluster pager, by design.

Grafana dashboards live under charts/ugallu/charts/monitoring/dashboards/. The chart loads them via the grafana_dashboard=1 ConfigMap label that the grafana chart’s sidecar picks up. Three dashboards ship today:

  • Overview - one row per operator with the up, emit rate, emit error rate, and reconcile workqueue depth for that operator.
  • Detection chain - the audit-detection -> attestor -> forensics pipeline. Per-rule emit rate, attestor phase distribution, forensics queue depth.
  • Backup & compliance - BackupVerifyResult.worstSeverity histogram over time, ComplianceScanResult.summary stacked area.

Operators log in JSON via slog + controller-runtime. Standard attributes:

  • controller=<crd-kind> - which controller produced the line.
  • name, namespace - the reconciled object.
  • req=<UUID> - per-reconcile correlation ID.
  • level - debug, info, warn, error.

debug is off by default; flip with --zap-log-level=debug per operator. Don’t ship debug to production - the SDK logs every reconcile request, not just errors.

OpenTelemetry tracing is wired but not exported by default. Set umbrella.tracing.endpoint=otel-collector.observability:4317 to turn it on; spans cover the SE emit pipeline (prepare, create, status-patch).