# Architecture Decision Records
This document captures key technical decisions made in the Floh project. Each entry records the context, the decision, and its consequences so that future contributors understand not just what was chosen but why.
## ADR-001: pnpm Monorepo with Fixed Versioning
Status: Accepted
Context: Floh consists of multiple packages (server, two frontends, a BFF, shared types, and an MCP server) that are developed and released together. We needed a workspace strategy that keeps dependencies in sync and avoids version drift across packages.
Decision: Use a pnpm workspace monorepo with @changesets/cli configured in fixed mode so all packages share a single version number.
Consequences:
- All packages are versioned in lockstep, simplifying deployment and compatibility reasoning.
- pnpm's strict dependency resolution prevents phantom dependencies.
- Changesets automate version bumps, changelog generation, and git tagging via CI.
- Trade-off: a patch fix in one package bumps the version of all packages, even unchanged ones.
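As a sketch, fixed mode is enabled in `.changeset/config.json` roughly like this (the `@floh/*` package glob is an assumption about the workspace's package naming, not confirmed by this document):

```json
{
  "fixed": [["@floh/*"]],
  "changelog": "@changesets/cli/changelog",
  "access": "restricted"
}
```

Any package matched by the `fixed` group is bumped to the same version whenever one member receives a changeset.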
## ADR-002: Fastify over Express
Status: Accepted
Context: The backend needed a Node.js HTTP framework with strong TypeScript support, built-in schema validation, and good performance for a workflow engine that processes synchronous step chains.
Decision: Use Fastify 5 as the HTTP framework.
Consequences:
- JSON Schema validation is built into the route definition via @fastify/type-provider-typebox, reducing hand-written validation code.
- The plugin/decorator system provides a clean dependency injection pattern without a DI container.
- Fastify's JSON serialization is significantly faster than Express's for JSON-heavy APIs.
- The ecosystem is smaller than Express's, occasionally requiring custom solutions for middleware.
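To illustrate the validation style this decision enables, here is a sketch of a schema living in the route definition. It is shown as a plain JSON Schema object; with `@fastify/type-provider-typebox` the same shape would be built with `Type.Object(...)`. The route path and field names are illustrative, not Floh's actual API.

```typescript
// A body schema Fastify would compile and enforce before the handler runs.
const createWorkflowSchema = {
  body: {
    type: "object",
    required: ["name"],
    properties: {
      name: { type: "string", minLength: 1 },
      description: { type: "string" },
    },
    additionalProperties: false,
  },
} as const;

// In a Fastify route (requires the fastify package, omitted here):
// app.post("/workflows", { schema: createWorkflowSchema }, async (req, reply) => { ... });
```

Because the schema is attached to the route, invalid requests are rejected with a 400 before any hand-written validation code runs.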
## ADR-003: Kysely over Prisma and Drizzle
Status: Accepted
Context: We needed a database layer that supports both PostgreSQL and MySQL, provides full TypeScript type safety, and allows writing complex queries (joins, subqueries, CTEs) without fighting the abstraction.
Decision: Use Kysely as a type-safe SQL query builder with hand-written schema types.
Consequences:
- No code generation step: schema types are authored manually and stay in sync with migrations.
- Complex queries (advanced filters, pagination, soft-delete scoping) are expressed naturally in SQL without ORM escape hatches.
- Dual database support (Postgres + MySQL) is achieved by writing dialect-agnostic queries.
- Trade-off: schema types must be updated manually when migrations change the schema, which can drift if developers forget.
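Hand-written schema types in the style Kysely expects look roughly like this: one interface per table, collected into a `Database` interface keyed by table name. The table and column names below are illustrative; real code would also use Kysely's `ColumnType`/`Generated` helpers to mark defaults and generated columns.

```typescript
// Hand-authored table shape, kept in sync with migrations by hand.
interface WorkflowTable {
  id: string;
  name: string;
  definition: string; // JSON serialized into a TEXT column (see ADR-009)
  deleted_at: string; // sentinel-based soft delete (see ADR-011)
}

// The database interface Kysely is parameterized with:
// const db = new Kysely<Database>({ dialect });
// db.selectFrom("workflow").select(["id", "name"]) is then fully typed.
interface Database {
  workflow: WorkflowTable;
}

// A row satisfying the hand-written type.
const sample: WorkflowTable = {
  id: "wf_1",
  name: "Onboarding",
  definition: "{}",
  deleted_at: "9999-12-31",
};
```

The trade-off named above is visible here: nothing but discipline keeps `WorkflowTable` aligned with the actual migrations.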
## ADR-004: Angular with Standalone Components and Signals
Status: Accepted
Context: Both the admin and portal frontends needed a component framework with strong TypeScript integration, a mature ecosystem for enterprise UI components (tables, forms, dialogs), and a clear upgrade path.
Decision: Use Angular 21 with standalone components, signals for reactive state, and PrimeNG as the component library. No global state management library (NgRx, etc.) is used.
Consequences:
- Standalone components eliminate NgModule boilerplate and simplify lazy loading.
- Signals provide fine-grained reactivity without RxJS complexity for UI state; services expose signal() and computed() values.
- PrimeNG provides a comprehensive set of enterprise components (data tables, workflow designer inputs, charts) with consistent theming.
- Trade-off: without a formal state management library, complex cross-feature state coordination relies on service-level signals and can be harder to trace in larger features.
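The service-level signal pattern can be illustrated with a deliberately tiny pure-TypeScript stand-in. This is not Angular's implementation (Angular's `signal()`/`computed()` memoize and track dependencies); it only shows the shape of state held in a service rather than a store. All names are illustrative.

```typescript
// Minimal stand-in for a writable signal: a readable function with a setter.
function signal<T>(initial: T): { (): T; set: (v: T) => void } {
  let value = initial;
  const read = () => value;
  return Object.assign(read, { set: (v: T) => { value = v; } });
}

// Stand-in for computed(): here just recompute-on-read, no memoization.
function computed<T>(fn: () => T): () => T {
  return fn;
}

// Service-style state, as the ADR describes: the service exposes
// signal() and computed() values, and components read them directly.
const tasks = signal<string[]>([]);
const openCount = computed(() => tasks().length);

tasks.set(["review", "approve"]);
```

Components consuming `openCount()` re-render when `tasks` changes; without a store, tracing who calls `tasks.set(...)` across features is the cost named in the trade-off.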
## ADR-005: BullMQ for Background Jobs
Status: Accepted
Context: The platform needs cron-triggered workflows, delayed escalation timers, periodic reconciliation, and report generation — all as background jobs that survive process restarts.
Decision: Use a single BullMQ queue (workflow-scheduler) with job-type-based dispatch. The worker can run in-process (combined mode) or as a separate process (WORKER_MODE=separate).
Consequences:
- A single queue simplifies operations — one place to monitor, one set of retry/backoff settings.
- Job-type dispatch (switch on job name) keeps the routing logic centralized in worker-handlers.ts.
- Separate worker mode allows independent scaling of job processing without affecting API latency.
- Redis is already required for sessions, so BullMQ adds no new infrastructure dependency.
- Trade-off: a single queue means a slow job type (e.g., report generation) can delay others. Future extraction into separate queues is documented in Service Architecture.
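The job-type dispatch can be sketched as a single switch on the job name, as the ADR describes for worker-handlers.ts. The handler names below are illustrative, not Floh's actual code, and real BullMQ handlers would be async.

```typescript
// Minimal shape of a queued job: one queue, routing decided by name.
interface Job {
  name: string;
  data: Record<string, unknown>;
}

// Centralized dispatch: every job type the queue carries is routed here.
function dispatch(job: Job): string {
  switch (job.name) {
    case "cron-trigger":
      return runCronTrigger(job.data);
    case "escalation-timer":
      return runEscalation(job.data);
    case "reconciliation":
      return runReconciliation(job.data);
    default:
      // Unknown names fail fast instead of being silently dropped.
      throw new Error(`unknown job type: ${job.name}`);
  }
}

function runCronTrigger(_data: Record<string, unknown>): string { return "cron"; }
function runEscalation(_data: Record<string, unknown>): string { return "escalation"; }
function runReconciliation(_data: Record<string, unknown>): string { return "reconciliation"; }
```

Because all routing lives in one function, extracting a slow job type into its own queue later only means moving a `case` branch.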
## ADR-006: Portal BFF as a Stateless Proxy
Status: Accepted
Context: External users (invitees, task assignees, approvers) need access to a subset of the API without exposing the full admin surface. The portal must be deployable outside the internal firewall.
Decision: Introduce a dedicated Backend-for-Frontend (portal-bff) that acts as a stateless HTTP proxy with route whitelisting and scope enforcement. It has no direct database, Redis, or OIDC connections.
Consequences:
- Minimal attack surface — the BFF only forwards whitelisted routes and strips scope=all from task/approval queries.
- Stateless design means it can be horizontally scaled without coordination.
- The BFF injects X-Portal-Origin so the server redirects to the correct frontend after authentication.
- Trade-off: adds a network hop for portal requests and requires maintaining the route whitelist when new portal-facing endpoints are added.
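The two checks above can be sketched as pure functions. The whitelist entries and query-parameter handling are assumptions for illustration, not Floh's actual route table.

```typescript
// Hypothetical portal whitelist: only these path families are forwarded.
const PORTAL_WHITELIST = [/^\/api\/tasks(\/|$)/, /^\/api\/approvals(\/|$)/];

function allowRoute(path: string): boolean {
  return PORTAL_WHITELIST.some((re) => re.test(path));
}

// Strip scope=all so a portal user can never widen a task/approval query
// to the full admin view.
function sanitizeQuery(query: URLSearchParams): URLSearchParams {
  const q = new URLSearchParams(query);
  if (q.get("scope") === "all") q.delete("scope");
  return q;
}
```

A non-whitelisted path is rejected before any request leaves the BFF, which is what keeps the attack surface minimal.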
## ADR-007: QuickJS Sandbox for Script Connectors
Status: Accepted
Context: Users can author custom connector scripts that execute arbitrary logic. These scripts must be isolated from the server process to prevent resource exhaustion, filesystem access, or network abuse.
Decision: Run script and OAS-derived connectors inside quickjs-emscripten VMs in Node.js worker_threads. Scripts communicate with the host through a controlled floh.* API surface.
Consequences:
- Memory limits (default 16 MB) and execution timeouts (default 30s) prevent runaway scripts.
- No direct filesystem or network access — HTTP calls are proxied through floh.http.* to the main thread.
- QuickJS supports ES2023 syntax, sufficient for connector logic without requiring a full V8 isolate.
- Trade-off: worker thread creation has overhead, and the floh.* API surface must be explicitly extended for each new capability scripts need.
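The host side of the floh.* surface can be sketched as a capability table: the sandbox sends a request message, and only explicitly exposed capabilities are dispatched. This is a hypothetical illustration of the pattern, not Floh's actual message protocol; the real implementation would proxy an HTTP client on the main thread rather than return a stub.

```typescript
// Message a sandboxed script sends to the host (shape is an assumption).
type HostRequest = { id: number; api: string; args: unknown[] };

// Only what is listed here is reachable from inside the VM.
const capabilities: Record<string, (...args: unknown[]) => unknown> = {
  "http.get": (url) => ({ proxied: true, url }), // stub standing in for a real proxied fetch
};

function handleSandboxRequest(req: HostRequest): { id: number; result: unknown } {
  const fn = capabilities[req.api];
  if (!fn) throw new Error(`capability not exposed: ${req.api}`); // e.g. "fs.read"
  return { id: req.id, result: fn(...req.args) };
}
```

This is the trade-off stated above in code form: every new capability scripts need is a new entry in the table, never an implicit grant.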
## ADR-008: OIDC with Server-Side Sessions
Status: Accepted
Context: Authentication must support any OIDC-compliant identity provider. We needed to decide between stateless JWT-based auth and server-side sessions.
Decision: Use server-side OIDC authorization code flow with encrypted sessions stored in Redis. A floh_sid cookie identifies the session. CSRF is mitigated via the double-submit cookie pattern.
Consequences:
- Sessions can be revoked instantly (delete from Redis), unlike JWTs, which remain valid until expiry.
- Token refresh is handled server-side; the client never sees IdP tokens.
- Session data (user info, access/refresh tokens) is encrypted at rest in Redis.
- CSRF protection is automatic for all mutating requests when OIDC is enabled.
- Trade-off: requires Redis availability for authentication, adding a hard dependency on Redis uptime.
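The double-submit check can be sketched as follows: a random token is issued as a cookie readable by the frontend, which mirrors it back in a header on mutating requests, and the server compares the two in constant time. The framing is a sketch; cookie and header names are assumptions.

```typescript
import { randomBytes, timingSafeEqual } from "node:crypto";

// Issued once per session and set as a cookie the frontend can read.
function issueCsrfToken(): string {
  return randomBytes(32).toString("hex");
}

// On mutating requests: the cookie value and the mirrored header value
// must match. timingSafeEqual avoids leaking the token via timing.
function csrfOk(cookieToken: string | undefined, headerToken: string | undefined): boolean {
  if (!cookieToken || !headerToken) return false;
  const a = Buffer.from(cookieToken);
  const b = Buffer.from(headerToken);
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Because an attacker's cross-site request cannot read the cookie, it cannot supply a matching header, which is the entire guarantee of the double-submit pattern.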
## ADR-009: JSON-in-TEXT Columns for Flexible Data
Status: Accepted
Context: Workflow definitions, step configurations, run variables, and connector configs are deeply nested, schema-flexible structures that vary between workflow types and connector implementations.
Decision: Store these structures as serialized JSON in TEXT columns. The repository layer handles serialization/deserialization and the application layer enforces structure via TypeScript types.
Consequences:
- Schema changes to workflow definitions or connector configs do not require database migrations.
- The same schema works across PostgreSQL (which has native JSONB) and MySQL (which has JSON but with limitations) by using plain TEXT.
- Complex structures (step arrays, nested variable definitions, transition maps) are stored and retrieved atomically.
- Trade-off: JSON data in TEXT columns cannot be indexed or queried at the database level; all filtering and searching happens in the application layer after deserialization.
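A minimal sketch of the repository-layer (de)serialization described above. The `WorkflowDefinition` shape is illustrative; the point is that the database only ever sees an opaque string, while the application sees a typed structure.

```typescript
// Illustrative nested structure stored in a TEXT column.
interface WorkflowDefinition {
  steps: Array<{ id: string; type: string; config: Record<string, unknown> }>;
}

// Repository write path: the whole structure goes into one column atomically.
function toColumn(def: WorkflowDefinition): string {
  return JSON.stringify(def);
}

// Repository read path: structure is enforced by TypeScript types,
// not by the database.
function fromColumn(text: string): WorkflowDefinition {
  return JSON.parse(text) as WorkflowDefinition;
}
```

Adding a field to `WorkflowDefinition` changes only the TypeScript type, not the database schema, which is the migration-free property the ADR trades indexability for.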
## ADR-010: AES-256-GCM Encryption for Secrets at Rest
Status: Accepted
Context: Connector configurations contain sensitive credentials (API keys, OAuth tokens, service account keys). Session data stored in Redis contains IdP tokens. These must be encrypted at rest.
Decision: Use AES-256-GCM authenticated encryption for connector secrets, session data, and audit checkpoints. Secret fields are identified via configSchema (fields with secret: true). Keys are rotatable with a CONNECTOR_ENCRYPTION_KEY_PREVIOUS fallback.
Consequences:
- Secrets are encrypted before storage and decrypted only at execution time, limiting exposure.
- Key rotation is supported without downtime; the previous key is tried on decryption failure.
- Authenticated encryption (GCM) detects tampering, not just eavesdropping.
- Trade-off: losing the encryption key means permanent loss of all encrypted secrets. Key management is critical and documented in Encryption Keys.
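A sketch of AES-256-GCM with the previous-key fallback, using Node's built-in crypto. The `iv:tag:ciphertext` framing is an assumption for illustration, not Floh's actual storage format.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt with a 32-byte key; GCM uses a 12-byte IV and produces an auth tag.
function encrypt(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return [iv, cipher.getAuthTag(), ct].map((b) => b.toString("base64")).join(":");
}

function decryptWith(payload: string, key: Buffer): string {
  const [iv, tag, ct] = payload.split(":").map((s) => Buffer.from(s, "base64"));
  const d = createDecipheriv("aes-256-gcm", key, iv);
  d.setAuthTag(tag); // GCM: final() throws if the payload was tampered with
  return Buffer.concat([d.update(ct), d.final()]).toString("utf8");
}

// Rotation: try the current key, fall back to the previous one
// (the CONNECTOR_ENCRYPTION_KEY_PREVIOUS behavior the ADR describes).
function decrypt(payload: string, current: Buffer, previous?: Buffer): string {
  try {
    return decryptWith(payload, current);
  } catch (err) {
    if (previous) return decryptWith(payload, previous);
    throw err;
  }
}
```

Rotation then means: deploy the new key as current and the old key as previous, and re-encrypt lazily as secrets are next written.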
## ADR-011: Soft Deletes with Sentinel Timestamps
Status: Accepted
Context: Many entities (workflows, connectors, users) should be recoverable after deletion. We needed a soft-delete pattern that works efficiently with database indexes across both PostgreSQL and MySQL.
Decision: Use a deleted_at TIMESTAMP column with a far-future sentinel value (9999-12-31) instead of NULL for non-deleted rows. A SOFT_DELETE_SENTINEL constant is used in queries.
Consequences:
- Indexes on deleted_at are effective because the column is never NULL — the sentinel value is a concrete, indexable value.
- Queries filter with WHERE deleted_at = SOFT_DELETE_SENTINEL rather than WHERE deleted_at IS NULL, which has more predictable index behavior across database engines.
- Restoration is a simple update back to the sentinel value.
- Trade-off: the sentinel value is unconventional and requires documentation. Developers unfamiliar with the pattern may be confused by deleted_at = '9999-12-31' in the database.
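The pattern can be sketched over an in-memory array (in Floh this would be a where-clause in the Kysely query layer). The sentinel date matches the ADR; the row shape is illustrative.

```typescript
// Far-future sentinel instead of NULL, so deleted_at is always indexable.
const SOFT_DELETE_SENTINEL = "9999-12-31";

interface Row {
  id: number;
  deleted_at: string;
}

// Equivalent of: WHERE deleted_at = SOFT_DELETE_SENTINEL
function liveRows(rows: Row[]): Row[] {
  return rows.filter((r) => r.deleted_at === SOFT_DELETE_SENTINEL);
}

// Soft delete: stamp the deletion time.
function softDelete(row: Row, now: string): Row {
  return { ...row, deleted_at: now };
}

// Restore: a simple update back to the sentinel.
function restore(row: Row): Row {
  return { ...row, deleted_at: SOFT_DELETE_SENTINEL };
}
```

Every query in the codebase filters on equality with the sentinel rather than `IS NULL`, which is exactly the property that keeps index behavior predictable across engines.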
## ADR-012: MCP Server for AI Integration
Status: Accepted
Context: AI coding assistants (Cursor, Claude, etc.) benefit from structured access to workflow definitions, run diagnostics, and connector schemas. We needed a way to expose Floh's capabilities to these tools without building custom integrations for each one.
Decision: Create a dedicated @floh/mcp package implementing the Model Context Protocol (MCP). It exposes tools (CRUD operations, run diagnostics), resources (schemas, examples), and prompts (guided workflows) over the standardized MCP transport.
Consequences:
- Any MCP-compatible AI tool can interact with Floh without custom integration work.
- The MCP server is a standalone process that authenticates via API token or OIDC refresh token, keeping it decoupled from the main server.
- Guided prompts (create-workflow, diagnose-failure, document-workflow) encode domain knowledge for AI assistants.
- Trade-off: the MCP specification is still evolving, and breaking changes in the protocol may require updates to the package.