Notifications

OSAPI monitors component health through condition evaluation and notifies when conditions change. The notification system watches the registry KV bucket for condition transitions and dispatches events through a pluggable backend.

How It Works

Every component (agent, API server, NATS server) evaluates conditions on each heartbeat and writes them to the registry KV bucket. A watcher on the API server detects transitions:

  • Fired: a condition becomes active (e.g., DiskPressure crosses threshold)
  • Resolved: a condition becomes inactive (e.g., disk usage drops below threshold)
  • Unreachable: a component's heartbeat expires (TTL timeout)

When renotify_interval is set, active conditions are re-fired at that interval so they remain visible in logs and alerts.

Conditions

| Condition | Components | Description |
| --- | --- | --- |
| MemoryPressure | agent | Host memory usage exceeds threshold (default 90%) |
| HighLoad | agent | Load average exceeds CPU count × multiplier (default 2.0) |
| DiskPressure | agent | Any disk usage exceeds threshold (default 90%) |
| ProcessMemoryPressure | agent, api, nats | Process RSS exceeds configured byte threshold |
| ProcessHighCPU | agent, api, nats | Process CPU usage exceeds configured percent threshold |
| ComponentUnreachable | agent, api, nats | Heartbeat expired (TTL timeout) |

Host-level conditions are evaluated on agents only. Process-level conditions are evaluated on all components. ComponentUnreachable is emitted by the watcher when a heartbeat TTL expires — it does not appear on the component's registration because the component is already gone.
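
The host-level rules in the Conditions table can be sketched as a small evaluation function. This is an illustration only, not OSAPI's implementation: the names HostStats, Thresholds, DefaultThresholds, and EvaluateHost are hypothetical, and using the 1-minute load average for HighLoad is an assumption, since the table does not say which load average is compared.

```go
package main

import "fmt"

// HostStats is a hypothetical snapshot of host metrics taken on a heartbeat.
type HostStats struct {
	MemUsedPercent  float64
	Load1           float64            // assumption: 1-minute load average
	CPUCount        int
	DiskUsedPercent map[string]float64 // mount point -> percent used
}

// Thresholds holds the tunable limits for host-level conditions.
type Thresholds struct {
	MemoryPercent  float64
	LoadMultiplier float64
	DiskPercent    float64
}

// DefaultThresholds returns the defaults listed in the Conditions table.
func DefaultThresholds() Thresholds {
	return Thresholds{MemoryPercent: 90, LoadMultiplier: 2.0, DiskPercent: 90}
}

// EvaluateHost reports, per host-level condition, whether it is active.
func EvaluateHost(s HostStats, t Thresholds) map[string]bool {
	active := map[string]bool{
		"MemoryPressure": s.MemUsedPercent > t.MemoryPercent,
		"HighLoad":       s.Load1 > float64(s.CPUCount)*t.LoadMultiplier,
		"DiskPressure":   false,
	}
	// DiskPressure is active if any single disk exceeds the threshold.
	for _, used := range s.DiskUsedPercent {
		if used > t.DiskPercent {
			active["DiskPressure"] = true
			break
		}
	}
	return active
}

func main() {
	stats := HostStats{
		MemUsedPercent:  71.5,
		Load1:           9.2,
		CPUCount:        4,
		DiskUsedPercent: map[string]float64{"/": 92, "/var": 40},
	}
	fmt.Println(EvaluateHost(stats, DefaultThresholds()))
}
```

With four CPUs and the default multiplier of 2.0, the HighLoad limit in this sketch is a load average of 8.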

Notifier Backends

The notification system uses a pluggable Notifier interface. Currently one backend is available:

Log (default)

Writes condition events to the structured log. Fired conditions log at WARN level, resolved conditions at INFO:

WRN condition fired component=agent hostname=web-01 condition=DiskPressure active=true reason="/ 92% used"
INF condition resolved component=agent hostname=web-01 condition=DiskPressure active=false
WRN condition fired component=nats hostname=nats-01 condition=ComponentUnreachable active=true reason="heartbeat expired"

Future Backends

  • Slack — post to a webhook URL
  • Email — send via SMTP
  • Webhook — POST to a configurable URL

Re-notification

By default, a condition fires once when it becomes active and once when it resolves. To keep active conditions visible, configure renotify_interval:

notifications:
  enabled: true
  notifier: 'log'
  renotify_interval: '5m'

With renotify_interval: '5m', an active DiskPressure condition re-fires every 5 minutes until it resolves. The value uses Go duration format (for example 1m, 5m, or 1h). Set it to '0' to disable re-notification.

Configuration

notifications:
  enabled: true
  notifier: 'log'
  renotify_interval: '5m'

| Key | Env Variable | Description |
| --- | --- | --- |
| enabled | OSAPI_NOTIFICATIONS_ENABLED | Enable the watcher (default: false) |
| notifier | OSAPI_NOTIFICATIONS_NOTIFIER | Backend: "log" (default) |
| renotify_interval | OSAPI_NOTIFICATIONS_RENOTIFY_INTERVAL | Re-fire interval (default: "0") |

See Configuration for the full reference.

Architecture

The watcher runs as a background goroutine in the API server. It monitors the registry KV bucket using NATS KV Watch. On each update it compares the previous condition set to the current one and emits ConditionEvents for transitions.
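
The comparison step can be sketched without any NATS code. The example below is a simplified illustration under stated assumptions: each registry entry is reduced to a map of condition name to active flag, heartbeat expiry is passed in as a boolean however the watcher detects it, and the names EventKind, ConditionEvent, and Diff are hypothetical rather than OSAPI's real types.

```go
package main

import (
	"fmt"
	"sort"
)

// EventKind names the three transitions the watcher reports.
type EventKind string

const (
	Fired       EventKind = "fired"
	Resolved    EventKind = "resolved"
	Unreachable EventKind = "unreachable"
)

// ConditionEvent is a hypothetical, minimal event record.
type ConditionEvent struct {
	Kind      EventKind
	Condition string
}

// Diff compares the previous and current condition sets for one component
// and returns the transitions, sorted by condition name for stable output.
func Diff(prev, curr map[string]bool, expired bool) []ConditionEvent {
	if expired {
		// The component is gone, so there is no current set to compare.
		return []ConditionEvent{{Kind: Unreachable, Condition: "ComponentUnreachable"}}
	}
	var events []ConditionEvent
	for name, active := range curr {
		if active && !prev[name] {
			events = append(events, ConditionEvent{Kind: Fired, Condition: name})
		}
	}
	for name, wasActive := range prev {
		if wasActive && !curr[name] {
			events = append(events, ConditionEvent{Kind: Resolved, Condition: name})
		}
	}
	sort.Slice(events, func(i, j int) bool {
		return events[i].Condition < events[j].Condition
	})
	return events
}

func main() {
	prev := map[string]bool{"DiskPressure": true, "HighLoad": false}
	curr := map[string]bool{"DiskPressure": false, "HighLoad": true}
	for _, ev := range Diff(prev, curr, false) {
		fmt.Println(ev.Kind, ev.Condition)
	}
}
```

Keeping the diff a pure function of two snapshots is one way to preserve the property that the watcher depends only on KV access and a Notifier.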

The watcher is designed to be extractable into a separate process — its only dependency is NATS KV access and a Notifier implementation.

Permissions

No specific permissions are required for notifications. The watcher reads the same registry KV bucket that the health status endpoint uses.