Observability

Every ugallu operator exposes:

/metrics on :9090 - Prometheus text format.
/healthz + /readyz on :8081 - HTTP health pair (200 / 503).
structured logs to stderr - JSON, key-value attributes via controller-runtime’s logger.

The umbrella chart ships a monitoring subchart that wires PrometheusRule, ServiceMonitor, and Grafana dashboards. The Prometheus operator must be installed beforehand; the chart only adds the rules / monitors, never the operator itself.

Metric naming

Every operator’s metrics are namespaced under ugallu_<operator>_*. That gives you a low-cardinality top-level grouping and lets a single PromQL {__name__=~"ugallu_audit_.*"} survey one operator’s state without a label join.

A few canonical metrics every operator exposes (added by the SDK, not the operator code):

ugallu_emitter_emits_total{class,outcome} - SE emission counter with outcome in {ok, conflict_idempotent, conflict_real, error}. The split between idempotent and real conflicts is the metric you watch when the cluster is replaying audit traffic.
ugallu_emitter_emit_latency_seconds - histogram of the apiserver Create round-trip.
ugallu_controller_runtime_* - the standard controller-runtime set (reconcile_total, reconcile_errors_total, workqueue_depth).

Alerts

The monitoring subchart’s PrometheusRule ships these alert families:

Alert family	What it means
`UgalluOperatorDown`	An operator’s `up` time-series has been `0` for >2m. One alert per operator.
`UgalluEmitterErrors`	`ugallu_emitter_emits_total{outcome="error"}` is non-zero - SE writes are failing, not just retried.
`UgalluReconcileFloods`	`controller_runtime_reconcile_total` rate is >100/s for >5m on a single CR kind. Hint at a hot-loop.
`UgalluWorkqueueGrowing`	`workqueue_depth` is monotonically increasing for >5m - the operator is falling behind on a CR.
`UgalluForensicsBacklog`	`ugallu_forensics_queue_size > forensicsConfig.maxConcurrent` for >2m. Real incident pile-up.
`UgalluAttestorSealFailing`	`AttestationBundle.phase` stuck off `Sealed` for >10m. Either Rekor or WORM is unreachable.

Alerts route through Alertmanager; no in-cluster pager, by design.

Dashboards

Grafana dashboards live under charts/ugallu/charts/monitoring/dashboards/. The chart loads them via the grafana_dashboard=1 ConfigMap label that the grafana chart’s sidecar picks up. Three dashboards ship today:

Overview - one row per operator with the up, emit rate, emit error rate, and reconcile workqueue depth for that operator.
Detection chain - the audit-detection -> attestor -> forensics pipeline. Per-rule emit rate, attestor phase distribution, forensics queue depth.
Backup & compliance - BackupVerifyResult.worstSeverity histogram over time, ComplianceScanResult.summary stacked area.

Logs

Operators log in JSON via slog + controller-runtime. Standard attributes:

controller=<crd-kind> - which controller produced the line.
name, namespace - the reconciled object.
req=<UUID> - per-reconcile correlation ID.
level - debug, info, warn, error.

debug is off by default; flip with --zap-log-level=debug per operator. Don’t ship debug to production - the SDK logs every reconcile request, not just errors.

Tracing

OpenTelemetry tracing is wired but not exported by default. Set umbrella.tracing.endpoint=otel-collector.observability:4317 to turn it on; spans cover the SE emit pipeline (prepare, create, status-patch).