
webhook-auditor

ugallu-webhook-auditor watches every MutatingWebhookConfiguration and ValidatingWebhookConfiguration in the cluster, computes a risk score per webhook, and fires a SecurityEvent when the score crosses the configured threshold. Admission webhooks are among the highest-leverage targets in a cluster: a compromised mutating webhook can rewrite every request it is registered for, and webhook configurations tend to drift unnoticed.

The score is a weighted sum across these axes:

  • Reach. A webhook with * in rules[].resources / apiGroups / operations scores higher than a narrowly-scoped one.
  • Failure policy. failurePolicy: Ignore is higher risk than Fail - an attacker who DoSes the webhook gets traffic to bypass it entirely, so “Ignore” is paradoxically the dangerous setting.
  • CA bundle. A self-signed caBundle (no cert-manager.io/inject-ca-from annotation, no recognised issuer) scores higher than a properly rotated one.
  • Endpoint. Service refs in kube-system or unannotated namespaces score lower; URL-based endpoints (especially off-cluster) score higher.
  • Side effects. sideEffects: NoneOnDryRun or Some scores higher than None.
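The axes above can be illustrated with a small Go sketch that inspects a simplified webhook and records which axes it trips. The `Webhook`, `Rule`, and `Findings` types and the `inspect` and `offCluster` helpers are hypothetical names invented for this example, not the auditor's real types. The "off-cluster" heuristic (host does not end in `.svc` or `.svc.cluster.local`) is an assumption, and the CA bundle axis is left out because it needs the annotation lookup described further down.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Rule holds the rule fields that matter for the reach axis.
type Rule struct {
	APIGroups  []string
	Resources  []string
	Operations []string
}

// Webhook is a simplified, hypothetical view of one webhook entry.
type Webhook struct {
	Rules         []Rule
	FailurePolicy string // "Fail" or "Ignore"
	SideEffects   string // "None", "NoneOnDryRun", "Some", ...
	URL           string // empty when the webhook uses a Service reference
}

// Findings records which risk axes a webhook trips.
type Findings struct {
	WildcardReach  bool
	IgnoresFailure bool
	HasSideEffects bool
	OffClusterURL  bool
}

func contains(list []string, want string) bool {
	for _, v := range list {
		if v == want {
			return true
		}
	}
	return false
}

// offCluster is an assumed heuristic: a URL endpoint whose host is not an
// in-cluster Service DNS name is treated as off-cluster.
func offCluster(raw string) bool {
	if raw == "" {
		return false // Service reference, not a URL
	}
	u, err := url.Parse(raw)
	if err != nil {
		return true // unparseable endpoints are treated conservatively
	}
	host := u.Hostname()
	return !strings.HasSuffix(host, ".svc") && !strings.HasSuffix(host, ".svc.cluster.local")
}

// inspect maps a webhook onto the reach, failure policy, side effects,
// and endpoint axes.
func inspect(w Webhook) Findings {
	var f Findings
	for _, r := range w.Rules {
		if contains(r.APIGroups, "*") || contains(r.Resources, "*") || contains(r.Operations, "*") {
			f.WildcardReach = true
		}
	}
	f.IgnoresFailure = w.FailurePolicy == "Ignore"
	f.HasSideEffects = w.SideEffects != "None"
	f.OffClusterURL = offCluster(w.URL)
	return f
}

func main() {
	risky := Webhook{
		Rules:         []Rule{{APIGroups: []string{"*"}, Resources: []string{"*"}, Operations: []string{"*"}}},
		FailurePolicy: "Ignore",
		SideEffects:   "Some",
		URL:           "https://hooks.example.com/mutate",
	}
	fmt.Printf("%+v\n", inspect(risky))
}
```

The sketch only checks for a bare `*`. A fuller implementation would probably also treat patterns such as `*/*` in resources as wildcard reach.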

The exact weights live in WebhookAuditorConfig.spec.weights. The breakdown is published on each evaluation so dashboards can show “why is this webhook a 78”.
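As a rough illustration of the weighted sum and its published breakdown, the sketch below adds up the weight of every axis a webhook trips and keeps the per-axis contributions. The weight names and values are borrowed from the example configuration on this page. The `score` function and the map-based representation are assumptions made for the example.

```go
package main

import "fmt"

// score returns the total risk score and the per-axis breakdown.
// weights maps an axis name to its configured weight; findings marks
// which axes the webhook tripped. Both shapes are hypothetical.
func score(weights map[string]int, findings map[string]bool) (int, map[string]int) {
	breakdown := make(map[string]int)
	total := 0
	for axis, w := range weights {
		if findings[axis] {
			breakdown[axis] = w
			total += w
		}
	}
	return total, breakdown
}

func main() {
	weights := map[string]int{
		"wildcardResource":    20,
		"failurePolicyIgnore": 15,
		"selfSignedCA":        25,
		"offClusterURL":       20,
		"sideEffectsNone":     5,
	}
	findings := map[string]bool{
		"wildcardResource":    true,
		"failurePolicyIgnore": true,
		"selfSignedCA":        true,
		"offClusterURL":       true,
	}
	const alertOn = 70

	total, breakdown := score(weights, findings)
	fmt.Println("total:", total, "alert:", total >= alertOn)
	fmt.Println("breakdown:", breakdown)
}
```

With these inputs the four tripped axes contribute 20 + 15 + 25 + 20 = 80, which is above the example threshold of 70. The breakdown map is what a dashboard would need to answer "why is this webhook an 80".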

To distinguish “the caBundle is cert-manager rotated” from “the caBundle is hard-coded base64 from 2022”, the auditor follows the cert-manager.io/inject-ca-from annotation and reads the referenced Secret - but only from a configured allowlist of namespaces (spec.trustedCASources), so a tenant can’t lure the auditor into reading their Secret by setting the annotation.
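The allowlist gate might look roughly like the following sketch. It parses an annotation value of the form `namespace/name` and only accepts it when a trusted source matches both the namespace and the name prefix. The type and function names are hypothetical. The error strings reuse the failure reasons the auditor reports (annotation_parse_error, namespace_forbidden), but mapping a name-prefix mismatch to namespace_forbidden is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TrustedCASource mirrors one entry of spec.trustedCASources.
type TrustedCASource struct {
	Namespace  string
	NamePrefix string
}

var (
	ErrAnnotationParse    = errors.New("annotation_parse_error")
	ErrNamespaceForbidden = errors.New("namespace_forbidden")
)

// parseInjectCAFrom splits "namespace/name" into its two parts.
func parseInjectCAFrom(value string) (ns, name string, err error) {
	parts := strings.Split(value, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", ErrAnnotationParse
	}
	return parts[0], parts[1], nil
}

// checkAllowed returns nil only when the referenced object sits in an
// allowlisted namespace and its name carries the configured prefix.
// Nothing is read from the cluster before this check passes.
func checkAllowed(value string, sources []TrustedCASource) error {
	ns, name, err := parseInjectCAFrom(value)
	if err != nil {
		return err
	}
	for _, s := range sources {
		if s.Namespace == ns && strings.HasPrefix(name, s.NamePrefix) {
			return nil
		}
	}
	return ErrNamespaceForbidden
}

func main() {
	sources := []TrustedCASource{
		{Namespace: "cert-manager", NamePrefix: "webhook-cert-"},
		{Namespace: "kube-system", NamePrefix: "ugallu-"},
	}
	for _, v := range []string{"cert-manager/webhook-cert-api", "tenant-a/my-secret", "no-slash"} {
		fmt.Printf("%-32s -> %v\n", v, checkAllowed(v, sources))
	}
}
```

Running the check before any Secret read is the point of the design: a tenant-controlled annotation can at worst produce a counted failure, never a read outside the allowlist.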

apiVersion: security.ugallu.io/v1alpha1
kind: WebhookAuditorConfig
metadata: { name: default, namespace: ugallu-system }
spec:
  thresholds:
    alertOn: 70
  weights:
    wildcardResource: 20
    failurePolicyIgnore: 15
    selfSignedCA: 25
    offClusterURL: 20
    sideEffectsNone: 5
  trustedCASources:
    - { namespace: cert-manager, namePrefix: webhook-cert- }
    - { namespace: kube-system, namePrefix: ugallu- }

WebhookAuditorConfig is a singleton with no phase. The reconciler treats every Mutating/ValidatingWebhookConfiguration event as a re-evaluation request, debounced across rapid mutations. Status fields (observedWebhooks, lastConfigLoadAt, per-namespace caBundle resolution counters) are refreshed on a 30s tick.

on each WebhookAuditorConfig event or 30s tick:
    cfg := Get("default")
    patch Status.ObservedWebhooks = listMWC().Count + listVWC().Count
    patch Status.LastConfigLoadAt = now
    RequeueAfter: 30s

on each MWC / VWC event:
    if cfg.Ignore.Match(name):       metric Skipped[ignored]++;   return
    if debounce.Skip(name, spec):    metric Skipped[debounced]++; return
    caScore  := resolveCABundle(spec.caBundle, cfg.TrustedCASources)
    reach    := scoreReach(spec.rules)
    fp       := scoreFailurePolicy(spec.failurePolicy)
    endpoint := scoreEndpoint(spec.clientConfig)
    side     := scoreSideEffects(spec.sideEffects)
    total    := reach + fp + caScore + endpoint + side
    if total >= cfg.Thresholds.AlertOn:
        emitSE(MutatingWebhookHighRisk | ValidatingWebhookHighRisk)
    metric ScoreDistribution.Observe(total)

Stateless evaluator: each MWC/VWC event re-runs the full scoring pass. Operator restart simply re-Lists all MWCs/VWCs and re-evaluates each. The debounce cache is rebuilt on the fly. caBundle resolution failures are counted per-reason (annotation_parse_error / namespace_forbidden / resolve_error / resolver_disabled) and recorded in status.

If the pod is killed during a per-MWC evaluation, the new pod observes the MWC again on its next informer sync, re-runs the scoring, and emits the SE if the score still crosses the threshold. There is no state to recover.

  • Debounce. Rapid mutations to the same MWC (e.g. cert-manager hot-reload of the caBundle) collapse into a single evaluation.
  • CA allowlist. caBundle resolution via cert-manager.io/inject-ca-from is gated by namespace allowlist (trustedCASources); a tenant cannot lure the auditor into reading their Secret by setting the annotation on a webhook outside the allowlist.
  • Eval timeout budget. Per-MWC eval has a hard timeout that surfaces on ugallu_webhook_eval_timeouts_total and a WebhookEvalFailed SE.
  • Risk score histogram lets dashboards show the distribution across all webhooks (“how many MWCs sit between 60 and 80?”).
# ClusterRole
rules:
  - apiGroups: [admissionregistration.k8s.io]
    resources: [mutatingwebhookconfigurations, validatingwebhookconfigurations]
    verbs: [get, list, watch]
  - apiGroups: [apiextensions.k8s.io]
    resources: [customresourcedefinitions]
    verbs: [get, list, watch]   # subject kind mapping
  - apiGroups: [security.ugallu.io]
    resources: [webhookauditorconfigs, webhookauditorconfigs/status]
    verbs: [get, list, watch, update, patch]
  - apiGroups: [security.ugallu.io]
    resources: [securityevents]
    verbs: [create]
  - apiGroups: [""]
    resources: [events]
    verbs: [create, patch]
# Per-namespace Role(s) gated by trustedCASources
  - apiGroups: [""]
    resources: [secrets]
    verbs: [get, list, watch]
# Namespaced Role
  - apiGroups: [coordination.k8s.io]
    resources: [leases]
    verbs: [get, list, watch, create, update, patch, delete]
  • ugallu_webhook_score_total{type, severity}
  • ugallu_webhook_eval_total
  • ugallu_webhook_eval_skipped_total{reason} (ignored / debounced / missing_secret)
  • ugallu_webhook_drop_total{reason} (rate-limited)
  • ugallu_webhook_score_distribution (histogram 0-100)
  • ugallu_webhook_observed_count
  • ugallu_webhook_eval_timeouts_total
  • ugallu_webhook_ca_resolve_fallback_total{reason}

--cluster-id, --cluster-name, --config-name (default default).

Deployment (2 replicas) in ugallu-system, leader election on, priorityClassName=system-cluster-critical. RBAC: cluster-wide read on MutatingWebhookConfiguration / ValidatingWebhookConfiguration / CustomResourceDefinition, namespace-scoped read on Secret (limited to trustedCASources).

ugallu_webhook_auditor_score{name,kind}, ugallu_webhook_auditor_breaches_total, ugallu_webhook_auditor_ca_resolves_total{outcome}.