Operations Runbook
Health Check
GET /api/health returns the status of the database and Redis:

    {
      "status": "ok",
      "version": "0.1.0",
      "uptime": 12345.67,
      "checks": {
        "database": { "status": "ok", "latencyMs": 2 },
        "redis": { "status": "ok", "latencyMs": 1 }
      }
    }

| Status | Meaning |
|---|---|
| ok | Both database and Redis are healthy |
| degraded | One or both dependencies are down |
Use this endpoint for load balancer health checks. A degraded status means the server is running but some functionality (sessions, job scheduling) may be impaired.
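As a rough sketch of how a monitoring script might interpret this payload, the snippet below lists the dependencies whose check is not ok. The type and function names are hypothetical, and treating any per-check status other than "ok" as failing is an assumption, since the runbook does not list the other possible values.

```typescript
// Shape of the /api/health payload, modelled on the sample response above.
interface DependencyCheck {
  status: string;
  latencyMs: number;
}

interface HealthResponse {
  status: string; // "ok" or "degraded"
  version: string;
  uptime: number;
  checks: Record<string, DependencyCheck>;
}

// Returns the names of dependencies whose check is anything other than "ok".
// Assumption: every non-"ok" value means the dependency is unhealthy.
function failingDependencies(health: HealthResponse): string[] {
  return Object.keys(health.checks).filter(
    (name) => health.checks[name]?.status !== "ok",
  );
}

// Produces a one-line summary suitable for an alert message.
function summarizeHealth(health: HealthResponse): string {
  const failing = failingDependencies(health);
  return failing.length === 0 ? "healthy" : `degraded: ${failing.join(", ")}`;
}
```

Because a degraded server is still running, a load balancer may reasonably keep it in rotation and leave the summary to alerting; that policy choice is deployment-specific.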
Stuck Run Recovery
Workflow runs that remain in running status without progress are automatically failed by a periodic job.
| Setting | Env var | Default |
|---|---|---|
| Timeout | STUCK_RUN_TIMEOUT_MINUTES | 30 |
| Check interval | — | Every 15 minutes |
A run is considered stuck if its updated_at timestamp is older than the timeout. The run is marked as failed with reason "Run timed out (no progress for N minutes)".
The timeout can also be configured via Admin > Security Settings in the web UI.
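The stuck-run rule can be sketched as a small predicate. This is an illustration only, not the server's actual implementation: the WorkflowRun shape and the function names are hypothetical, and only the updated_at comparison and the failure reason text come from the description above.

```typescript
// Hypothetical in-memory view of a workflow run row.
interface WorkflowRun {
  id: string;
  status: string; // e.g. "running" or "failed"
  updatedAt: Date; // mirrors the updated_at column
}

// A run is stuck when it is still running and updated_at is older than the timeout.
function isStuck(run: WorkflowRun, timeoutMinutes: number, now: Date): boolean {
  if (run.status !== "running") return false;
  const ageMs = now.getTime() - run.updatedAt.getTime();
  return ageMs > timeoutMinutes * 60 * 1000;
}

// Builds the failure reason in the format quoted in this section.
function stuckRunReason(timeoutMinutes: number): string {
  return `Run timed out (no progress for ${timeoutMinutes} minutes)`;
}
```

Passing the current time in as a parameter keeps the check deterministic, which makes it easy to test the boundary around the 30-minute default.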
BullMQ Job Processing
All background jobs use a single BullMQ queue named workflow-scheduler.
Retry Policy
Jobs are retried up to 3 times with exponential backoff (5s, 10s, 20s). Failed jobs are retained for inspection (up to 5000).
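The 5s, 10s, 20s sequence follows from doubling a 5-second base delay on each retry. The helper below reproduces that arithmetic; the function names are hypothetical and the code does not call BullMQ itself.

```typescript
// Delay in milliseconds before retry number `retry` (1-based), doubling each time.
function backoffDelayMs(retry: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, retry - 1);
}

// Full schedule of delays for the given number of retries.
function retrySchedule(maxRetries: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry <= maxRetries; retry++) {
    delays.push(backoffDelayMs(retry, baseDelayMs));
  }
  return delays;
}

// retrySchedule(3, 5000) yields [5000, 10000, 20000]
```

In BullMQ, a schedule like this is typically expressed with the job option backoff: { type: "exponential", delay: 5000 }; whether this project configures it exactly that way is an assumption.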
Job Types
| Job | Schedule | Purpose |
|---|---|---|
| role-expiry-check | Hourly | Revoke expired role assignments |
| document-expiry-check | Hourly | Revoke roles with expired documents |
| entitlement-reconciliation | Daily 02:00 UTC | Reconcile entitlements with connectors |
| run-orphan-cleanup | Daily 02:30 UTC | Clean up orphaned run artifacts |
| stuck-run-recovery | Every 15 minutes | Fail runs stuck in running state |
| audit-checkpoint | Configurable | Create signed audit checkpoint |
| deliver-scheduled-report | Per-schedule cron | Generate and email scheduled reports |
| trigger-workflow | Per-schedule cron | Start scheduled workflow runs |
| escalation-reminder | Delayed | Send approval reminders |
| escalation-reassignment | Delayed | Reassign overdue approvals |
Monitoring Failed Jobs
Use the BullMQ dashboard or query Redis directly to inspect failed jobs.
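A minimal sketch of a script-side alternative: once failed jobs have been fetched, grouping them by job name shows which job type is failing most. The FailedJobInfo interface and summarizeFailures function are hypothetical; the field names match properties that BullMQ's Job objects expose (name, failedReason, attemptsMade).

```typescript
// Minimal subset of a failed job's fields, defined here for illustration.
interface FailedJobInfo {
  name: string;
  failedReason?: string;
  attemptsMade: number;
}

interface FailureSummary {
  count: number;
  lastReason: string;
}

// Groups failed jobs by job name, keeping a count and the reason of the
// last job seen for each name (in input order).
function summarizeFailures(jobs: FailedJobInfo[]): Record<string, FailureSummary> {
  const summary: Record<string, FailureSummary> = {};
  for (const job of jobs) {
    const entry = summary[job.name] ?? { count: 0, lastReason: "" };
    entry.count += 1;
    entry.lastReason = job.failedReason ?? "unknown";
    summary[job.name] = entry;
  }
  return summary;
}
```

With BullMQ, the input could come from a Queue instance for workflow-scheduler, for example via await queue.getFailed(0, 99); the Redis connection settings needed to construct that queue are deployment-specific and not shown here.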
Graceful Shutdown
The server handles SIGTERM and SIGINT signals:
- Stops accepting new HTTP connections
- Completes in-flight requests
- Drains the BullMQ worker (waits for active jobs)
- Closes Redis and database connections
- Exits process
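Docker and Kubernetes send SIGTERM on container stop and follow up with SIGKILL once the grace period expires. Set the grace period to at least 30 seconds: Kubernetes defaults to 30 seconds (terminationGracePeriodSeconds), but docker stop defaults to 10 seconds.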
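The shutdown sequence above can be sketched as an ordered list of async steps. This is an illustration under stated assumptions, not the server's code: ShutdownStep and runShutdown are hypothetical names, and continuing past a failed step is a design choice made for this sketch.

```typescript
// One named shutdown step, e.g. "http", "worker", "redis", "database".
type ShutdownStep = { name: string; run: () => Promise<void> };

// Runs the steps strictly in order. A failing step is recorded but does not
// prevent later steps from running, so connections are still closed.
async function runShutdown(steps: ShutdownStep[]): Promise<string[]> {
  const results: string[] = [];
  for (const step of steps) {
    try {
      await step.run();
      results.push(step.name);
    } catch {
      results.push(`${step.name} (failed)`);
    }
  }
  return results;
}

// Hypothetical wiring (server, worker, redis, and db are placeholders):
//   const steps = [
//     { name: "http", run: () => closeHttpServer(server) },
//     { name: "worker", run: () => worker.close() },
//     { name: "redis", run: () => closeRedis(redis) },
//     { name: "database", run: () => closeDb(db) },
//   ];
//   process.once("SIGTERM", () => void runShutdown(steps).then(() => process.exit(0)));
```

Running the steps sequentially matters here: the worker should finish its active jobs before the Redis and database connections it depends on are closed.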
Database Pool Tuning
| Env var | Default | Description |
|---|---|---|
| DB_POOL_MAX | 10 | Maximum number of connections |
| DB_POOL_MIN | 2 | Minimum idle connections |
| DB_POOL_IDLE_TIMEOUT_MS | 30000 | Close idle connections after this time |
| DB_POOL_CONNECTION_TIMEOUT_MS | 5000 | Fail if connection cannot be acquired |
Monitor active connections with:
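The exact command depends on the database and driver in use. As a hedged sketch on the application side, assume a pool object that exposes totalCount, idleCount, and waitingCount counters (the node-postgres Pool does; this runbook does not state which driver the server uses). Utilization against DB_POOL_MAX can then be derived as follows; PoolCounters, PoolUsage, and poolUsage are hypothetical names.

```typescript
// Counters assumed to be readable from the connection pool.
interface PoolCounters {
  totalCount: number; // connections currently open
  idleCount: number; // open connections not checked out
  waitingCount: number; // callers queued for a connection
}

interface PoolUsage {
  active: number;
  utilization: number; // active connections as a fraction of poolMax
  saturated: boolean; // at the limit with callers waiting
}

// Derives active connections and utilization relative to DB_POOL_MAX.
function poolUsage(pool: PoolCounters, poolMax: number): PoolUsage {
  const active = pool.totalCount - pool.idleCount;
  return {
    active,
    utilization: active / poolMax,
    saturated: active >= poolMax && pool.waitingCount > 0,
  };
}
```

Sustained saturation suggests either raising DB_POOL_MAX or looking for queries that hold connections too long.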
Log Retention
System logs are stored in the system_log table. A daily cleanup job purges entries older than LOG_RETENTION_DAYS (default: 30).
Manual purge is available via Admin > Logging > Purge Now in the web UI.
Audit Checkpoints
Signed audit checkpoints are created on a configurable schedule (default: every 6 hours). Checkpoints are stored according to AUDIT_CHECKPOINT_STORE:
| Store | Env var | Description |
|---|---|---|
| file | AUDIT_CHECKPOINT_PATH | Local filesystem (default) |
| s3 | — | S3-compatible storage (planned) |
| siem | — | SIEM integration (planned) |
Configuration Validation
The server validates all configuration at startup. Invalid values (e.g., PORT=abc, DB_POOL_MAX=0) produce clear error messages and prevent startup.
In production, the server also validates that required secrets are not using default/fallback values. See Security for the full list.
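To illustrate the fail-fast behaviour, here is a minimal sketch of startup validation for a few integer settings. The defaults come from the tables in this runbook; the minimum values, the error message format, and the names IntRule, RULES, and validateEnv are assumptions made for the example, not the server's actual rules.

```typescript
type Env = Record<string, string | undefined>;

interface IntRule {
  name: string;
  defaultValue: number;
  min: number; // assumed lower bound, not taken from the runbook
}

const RULES: IntRule[] = [
  { name: "DB_POOL_MAX", defaultValue: 10, min: 1 },
  { name: "DB_POOL_MIN", defaultValue: 2, min: 0 },
  { name: "STUCK_RUN_TIMEOUT_MINUTES", defaultValue: 30, min: 1 },
  { name: "LOG_RETENTION_DAYS", defaultValue: 30, min: 1 },
];

// Collects every problem instead of stopping at the first one, so a single
// startup attempt reports all invalid settings.
function validateEnv(env: Env): { values: Record<string, number>; errors: string[] } {
  const values: Record<string, number> = {};
  const errors: string[] = [];
  for (const rule of RULES) {
    const raw = env[rule.name];
    if (raw === undefined || raw === "") {
      values[rule.name] = rule.defaultValue; // unset: fall back to the default
      continue;
    }
    const parsed = Number(raw);
    if (!Number.isInteger(parsed) || parsed < rule.min) {
      errors.push(`${rule.name} must be an integer >= ${rule.min}, got "${raw}"`);
      continue;
    }
    values[rule.name] = parsed;
  }
  return { values, errors };
}
```

A caller would typically print the collected errors and exit with a non-zero status when the list is not empty, which matches the "prevent startup" behaviour described above.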