Operations Runbook¶

Health Check¶

GET /api/health returns the status of the database and Redis:

{
  "status": "ok",
  "version": "0.1.0",
  "uptime": 12345.67,
  "checks": {
    "database": { "status": "ok", "latencyMs": 2 },
    "redis": { "status": "ok", "latencyMs": 1 }
  }
}

Status	Meaning
`ok`	Both database and Redis are healthy
`degraded`	One or both dependencies are down

Use this endpoint for load balancer health checks. A degraded status means the server is running but some functionality (sessions, job scheduling) may be impaired.
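When the overall status is degraded, it helps to know which dependency is responsible. The following is a minimal sketch that reads the payload shown above; the type names and the `failingDependencies` helper are hypothetical, and it assumes a failing dependency reports some status other than "ok" (the source only shows the healthy case).

```typescript
// Shape of the /api/health response, inferred from the sample payload.
interface DependencyCheck {
  status: string;
  latencyMs?: number;
}

interface HealthResponse {
  status: string; // "ok" or "degraded"
  version: string;
  uptime: number;
  checks: { [name: string]: DependencyCheck };
}

// Returns the names of dependencies whose status is anything other than "ok".
// Assumption: an unhealthy dependency uses a non-"ok" status string.
function failingDependencies(health: HealthResponse): string[] {
  return Object.keys(health.checks).filter(
    (name) => health.checks[name].status !== "ok"
  );
}
```

A monitoring script could call this on the parsed response body and alert on a non-empty result, while the load balancer only looks at the top-level status.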

Stuck Run Recovery¶

Workflow runs that remain in running status without progress are automatically failed by a periodic job.

Setting	Env var	Default
Timeout	`STUCK_RUN_TIMEOUT_MINUTES`	30
Check interval	—	Every 15 minutes

A run is considered stuck if its updated_at timestamp is older than the timeout. The run is marked as failed with reason "Run timed out (no progress for N minutes)".

The timeout can also be configured via Admin > Security Settings in the web UI.
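The stuck-run rule above can be sketched as a simple timestamp comparison. This is an illustration only: `isStuck` and `stuckReason` are hypothetical names, and whether the real job uses a strict or inclusive comparison is not stated in the source.

```typescript
const MS_PER_MINUTE = 60000;

// True when the run's updated_at is older than the timeout.
// The default mirrors STUCK_RUN_TIMEOUT_MINUTES (30).
function isStuck(updatedAt: Date, now: Date, timeoutMinutes: number = 30): boolean {
  return now.getTime() - updatedAt.getTime() > timeoutMinutes * MS_PER_MINUTE;
}

// Builds the failure reason in the format quoted above.
function stuckReason(timeoutMinutes: number): string {
  return `Run timed out (no progress for ${timeoutMinutes} minutes)`;
}
```

Because the check runs every 15 minutes, a run that stops progressing may stay in running status for up to the timeout plus one check interval (45 minutes with the defaults) before it is failed.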

BullMQ Job Processing¶

All background jobs use a single BullMQ queue named workflow-scheduler.

Retry Policy¶

Jobs are retried up to 3 times with exponential backoff (5s, 10s, 20s). Failed jobs are retained for inspection (up to 5000).
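The delays quoted above follow a doubling rule: each retry waits twice as long as the previous one, starting from 5 seconds. A minimal sketch of that arithmetic, plus an options object that would produce it, is shown below. The helper names are hypothetical, and the options object is an assumption about how the queue might be configured, not a copy of the actual settings. In BullMQ, `attempts` counts the first run as well, so three retries would correspond to `attempts: 4`.

```typescript
// Delay before retry number `retry` (1-based): base * 2^(retry - 1).
function backoffDelayMs(baseMs: number, retry: number): number {
  return baseMs * Math.pow(2, retry - 1);
}

// Full list of delays for the given number of retries.
function backoffScheduleMs(baseMs: number, retries: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry <= retries; retry++) {
    delays.push(backoffDelayMs(baseMs, retry));
  }
  return delays;
}

// Assumed job options matching the policy described above (not taken from the source).
const assumedJobOptions = {
  attempts: 4, // 1 initial attempt + 3 retries (assumption)
  backoff: { type: "exponential", delay: 5000 },
  removeOnFail: { count: 5000 }, // keep up to 5000 failed jobs
};
```

With a 5000 ms base and three retries, the schedule works out to 5, 10, and 20 seconds, so a job that keeps failing is given up on roughly 35 seconds of waiting after its first failure, plus execution time.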

Job Types¶

Job	Schedule	Purpose
`role-expiry-check`	Hourly	Revoke expired role assignments
`document-expiry-check`	Hourly	Revoke roles with expired documents
`entitlement-reconciliation`	Daily 02:00 UTC	Reconcile entitlements with connectors
`run-orphan-cleanup`	Daily 02:30 UTC	Clean up orphaned run artifacts
`stuck-run-recovery`	Every 15 minutes	Fail runs stuck in running state
`audit-checkpoint`	Configurable	Create signed audit checkpoint
`deliver-scheduled-report`	Per-schedule cron	Generate and email scheduled reports
`trigger-workflow`	Per-schedule cron	Start scheduled workflow runs
`escalation-reminder`	Delayed	Send approval reminders
`escalation-reassignment`	Delayed	Reassign overdue approvals

Monitoring Failed Jobs¶

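Use the BullMQ dashboard or query Redis directly to inspect failed jobs. BullMQ keeps the IDs of failed jobs in a sorted set at bull:<queue>:failed, so list them with ZRANGE rather than a KEYS pattern (KEYS also blocks Redis while it scans the keyspace):

redis-cli zrange "bull:workflow-scheduler:failed" 0 -1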

Graceful Shutdown¶

The server handles SIGTERM and SIGINT signals:

1. Stops accepting new HTTP connections
2. Completes in-flight requests
3. Drains the BullMQ worker (waits for active jobs)
4. Closes Redis and database connections
5. Exits the process

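Docker and Kubernetes send SIGTERM on container stop. Allow a grace period of at least 30 seconds so that active jobs can drain. Kubernetes defaults to 30 seconds (terminationGracePeriodSeconds); docker stop defaults to 10 seconds, so raise it with -t or, in Compose, stop_grace_period.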
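The shutdown sequence can be sketched as an ordered list of asynchronous steps raced against the grace period. This is a simplified illustration, not the server's actual code: `ShutdownStep` and `shutdown` are hypothetical names, and each step stands in for a real operation such as closing the HTTP server or the BullMQ worker.

```typescript
// One named step in the shutdown sequence.
interface ShutdownStep {
  name: string;
  run: () => Promise<void>;
}

// Runs the steps strictly in order and resolves with the names of the completed steps.
// Rejects if the grace period elapses first, so the caller can force an exit.
async function shutdown(steps: ShutdownStep[], graceMs: number): Promise<string[]> {
  const completed: string[] = [];
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`shutdown exceeded ${graceMs} ms`)), graceMs);
  });

  const sequence = (async () => {
    for (const step of steps) {
      await step.run(); // a later step never starts before the previous one finishes
      completed.push(step.name);
    }
    return completed;
  })();

  try {
    return await Promise.race([sequence, deadline]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

In a Node.js process this would typically be wired up with `process.on("SIGTERM", ...)` and `process.on("SIGINT", ...)`, calling `process.exit(0)` on success and a non-zero exit code if the deadline is hit. The internal deadline should be somewhat shorter than the container's grace period, so the process exits on its own before it is killed.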

Database Pool Tuning¶

Env var	Default	Description
`DB_POOL_MAX`	10	Maximum number of connections
`DB_POOL_MIN`	2	Minimum idle connections
`DB_POOL_IDLE_TIMEOUT_MS`	30000	Close idle connections after this time
`DB_POOL_CONNECTION_TIMEOUT_MS`	5000	Fail if a connection cannot be acquired within this time (ms)

Monitor active connections with:

SELECT count(*) FROM pg_stat_activity WHERE datname = 'floh';
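As an illustration of how the pool variables and their defaults fit together, the sketch below reads them from an environment map. The function names and the field names of `PoolSettings` are assumptions (the source does not name the database driver), and the check that the minimum does not exceed the maximum is an added sanity check rather than documented behavior.

```typescript
interface PoolSettings {
  max: number;
  min: number;
  idleTimeoutMillis: number;
  connectionTimeoutMillis: number;
}

// Reads an integer variable, falling back to the documented default when unset.
function readInt(env: { [key: string]: string | undefined }, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value)) {
    throw new Error(`${name} must be an integer, got "${raw}"`);
  }
  return value;
}

function poolSettingsFromEnv(env: { [key: string]: string | undefined }): PoolSettings {
  const settings: PoolSettings = {
    max: readInt(env, "DB_POOL_MAX", 10),
    min: readInt(env, "DB_POOL_MIN", 2),
    idleTimeoutMillis: readInt(env, "DB_POOL_IDLE_TIMEOUT_MS", 30000),
    connectionTimeoutMillis: readInt(env, "DB_POOL_CONNECTION_TIMEOUT_MS", 5000),
  };
  if (settings.min > settings.max) {
    throw new Error("DB_POOL_MIN must not exceed DB_POOL_MAX");
  }
  return settings;
}
```

When sizing `DB_POOL_MAX`, keep in mind that every server instance opens its own pool, so the count reported by the query above grows with the number of instances.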

Log Retention¶

System logs are stored in the system_log table. A daily cleanup job purges entries older than LOG_RETENTION_DAYS (default: 30).

Manual purge is available via Admin > Logging > Purge Now in the web UI.

Audit Checkpoints¶

Signed audit checkpoints are created on a configurable schedule (default: every 6 hours). Checkpoints are stored according to AUDIT_CHECKPOINT_STORE:

Store	Env var	Description
`file`	`AUDIT_CHECKPOINT_PATH`	Local filesystem (default)
`s3`	—	S3-compatible storage (planned)
`siem`	—	SIEM integration (planned)

Configuration Validation¶

The server validates all configuration at startup. Invalid values (e.g., PORT=abc, DB_POOL_MAX=0) produce clear error messages and prevent startup.

In production, the server also validates that required secrets are not using default/fallback values. See Security for the full list.
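The startup validation described above can be pictured as a pass that collects every problem before refusing to start, so that an operator sees all errors at once. The sketch below covers only the two examples given (PORT and DB_POOL_MAX); the function names, accepted ranges, and message wording are assumptions, not the server's actual rules.

```typescript
type Env = { [key: string]: string | undefined };

// Appends an error if the variable is set but is not an integer within [min, max].
function checkIntInRange(env: Env, name: string, min: number, max: number, errors: string[]): void {
  const raw = env[name];
  if (raw === undefined) return; // unset variables fall back to defaults
  const value = Number(raw);
  if (raw.trim() === "" || !Number.isInteger(value) || value < min || value > max) {
    errors.push(`${name} must be an integer between ${min} and ${max}, got "${raw}"`);
  }
}

// Returns all validation errors; an empty array means the configuration is acceptable.
function validateConfig(env: Env): string[] {
  const errors: string[] = [];
  checkIntInRange(env, "PORT", 1, 65535, errors);
  checkIntInRange(env, "DB_POOL_MAX", 1, 1000, errors); // upper bound is an arbitrary assumption
  return errors;
}
```

A startup routine would print each message and exit with a non-zero status if the returned list is not empty.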