Observability Plan

Status: Canonical
Last Updated: 2026-02-06
Owner: Engineering

Purpose

This document defines the observability requirements for CAIRL.

It establishes:

What MUST be observable
What MUST be logged, measured, and alerted on
How observability supports security, compliance, and reliability
Boundaries between application, database, and provider visibility

This document defines architectural expectations. Implementation details may evolve, but guarantees MUST hold.

Scope

This plan applies to:

All application services
All server-side code paths
All database access patterns
All third-party provider integrations
All compliance-relevant operations

Client-side observability is explicitly out of scope.

Core Observability Principles (Invariants)

Security- and compliance-relevant actions are always observable.
Sensitive data is never logged.
Observability failures are treated as defects.
Logs are structured, queryable, and retained intentionally.
Access to observability data is restricted.

Observability Pillars

CAIRL observability is structured around three pillars:

Logs – What happened
Metrics – How often and how severe
Traces – How a request flowed through the system

All three pillars MUST exist for critical paths.

Logging Requirements

What MUST Be Logged

The following events MUST be logged server-side:

Authentication and session lifecycle events
Authorization failures
RLS access denials
Access to HIPAA-regulated data
Admin actions and overrides
Rate limit and abuse guardrail triggers
Account suspension and restoration
Data deletion and retention actions
Provider errors and throttling events
Background job failures

What MUST NOT Be Logged

The following MUST NOT appear in logs:

PHI or HIPAA document contents
Biometric image data
Authentication secrets or tokens
Full request or response payloads from providers
Raw user-generated content unless explicitly approved

Logs MUST prefer identifiers over payloads.
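One way to enforce "identifiers over payloads" is an allow-list applied before an event reaches the logger. The following is a minimal sketch, not a prescribed implementation; the field names in `ALLOWED_LOG_FIELDS` and the helper names are hypothetical and are not defined by this plan.

```python
# Hypothetical allow-list: only identifier-style fields may reach the logs.
ALLOWED_LOG_FIELDS = frozenset({
    "request_id", "user_id", "admin_id", "document_id",
    "provider", "provider_request_id", "action", "outcome", "reason_code",
})


def to_loggable(event: dict) -> dict:
    """Return a copy of `event` containing only allow-listed fields.

    Anything not explicitly allowed (payloads, tokens, document contents)
    is dropped rather than masked, so a new sensitive field fails closed.
    """
    return {k: v for k, v in event.items() if k in ALLOWED_LOG_FIELDS}


def dropped_fields(event: dict) -> list:
    """Names (never values) of the fields that were removed."""
    return sorted(k for k in event if k not in ALLOWED_LOG_FIELDS)
```

An allow-list is used here instead of a deny-list because a deny-list silently passes any sensitive field nobody thought to name.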

Log Structure

All logs MUST include:

Timestamp (UTC)
Environment (dev, staging, prod)
Request or operation identifier
Actor identifier (user_id, admin_id, system)
Action type
Outcome (success, failure, blocked)
Reason code (where applicable)

Logs SHOULD be machine-parseable.
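The required fields above could be emitted as one JSON object per line. This is a sketch under stated assumptions: the function name `build_log_record`, the JSON key names, and the choice of JSON itself are illustrative, and `now` is assumed to be a timezone-aware datetime when supplied.

```python
import json
from datetime import datetime, timezone

ENVIRONMENTS = {"dev", "staging", "prod"}
OUTCOMES = {"success", "failure", "blocked"}


def build_log_record(environment, request_id, actor, action, outcome,
                     reason_code=None, now=None):
    """Build one JSON log line carrying the required fields."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    # Normalize to UTC so every record uses the same timestamp basis.
    timestamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "timestamp": timestamp.isoformat(),
        "environment": environment,
        "request_id": request_id,
        "actor": actor,
        "action": action,
        "outcome": outcome,
    }
    # The reason code is included only where applicable.
    if reason_code is not None:
        record["reason_code"] = reason_code
    return json.dumps(record, sort_keys=True)
```

Rejecting unknown environments and outcomes at build time keeps the values queryable, since downstream filters can rely on a closed vocabulary.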

Metrics Requirements

Core Metrics

The system MUST emit metrics for:

Request volume and error rates
Authorization and RLS denials
Rate limit hits and guardrail activations
Provider API latency and failure rates
Background job success and failure counts
Email forwarding caps reached
Phone guardrail blocks
Partner API usage and deduplication events

Metrics MUST be aggregated and non-identifying.
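To keep metrics aggregated and non-identifying, label names can be restricted to a small, low-cardinality set. A minimal in-memory sketch follows; `MetricRegistry`, the metric name, and the label names are hypothetical, and a real deployment would delegate to whatever metrics backend is in use.

```python
from collections import Counter

# Hypothetical allow-list of low-cardinality, non-identifying label names.
ALLOWED_LABELS = frozenset({"route", "provider", "outcome", "reason_code"})


class MetricRegistry:
    """Counter store that refuses labels such as user_id or email."""

    def __init__(self):
        self._counts = Counter()

    def increment(self, name, **labels):
        unknown = set(labels) - ALLOWED_LABELS
        if unknown:
            raise ValueError(f"labels not allowed: {sorted(unknown)}")
        # Sort labels so the same label set always maps to the same series.
        self._counts[(name, tuple(sorted(labels.items())))] += 1

    def value(self, name, **labels):
        return self._counts[(name, tuple(sorted(labels.items())))]
```

For example, `registry.increment("rls_denials_total", reason_code="POLICY")` is accepted, while passing `user_id="u1"` raises `ValueError`.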

Compliance Metrics

The following MUST be measurable:

HIPAA data access counts
Admin access frequency to regulated data
Retention job execution and failures
Deletion completion rates
Audit log growth and retention

Tracing Requirements

Request Tracing

For server-side requests:

A unique trace identifier MUST be generated
The identifier MUST propagate through:
- Application logic
- Database operations
- Provider calls (where possible)

Tracing MUST support:

Latency analysis
Failure attribution
Dependency mapping
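Trace identifier generation and propagation could look roughly like the sketch below, which uses Python's `contextvars` so the identifier follows a request through application code. The `X-Trace-Id` header name and the SQL-comment convention are assumptions for illustration; the plan does not specify either.

```python
import contextvars
import re
import uuid

_trace_id = contextvars.ContextVar("trace_id", default=None)
_VALID_ID = re.compile(r"[0-9a-f]{32}")


def start_trace(incoming_id=None):
    """Adopt a well-formed upstream ID, otherwise generate a fresh one."""
    if incoming_id and _VALID_ID.fullmatch(incoming_id):
        trace_id = incoming_id
    else:
        # Malformed upstream IDs are discarded so they can never be
        # interpolated into headers or SQL comments.
        trace_id = uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def current_trace_id():
    return _trace_id.get()


def provider_headers():
    """Headers for an outbound provider call (header name is hypothetical)."""
    trace_id = current_trace_id()
    return {"X-Trace-Id": trace_id} if trace_id else {}


def sql_comment():
    """Comment that can be prefixed to a query to tag it with the trace."""
    trace_id = current_trace_id()
    return f"/* trace_id={trace_id} */" if trace_id else ""
```

Validating the incoming identifier matters because the same value is later embedded in database statements and outbound requests.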

Background and Async Tracing

Background jobs MUST emit:

Start and completion events
Failure events with reason
Correlation to triggering action where applicable
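The three job events above can be produced by a single wrapper around the job body. This is a sketch only: `run_job`, the event names, and the `emit` callback are hypothetical stand-ins for whatever job runner and log sink are in use.

```python
import uuid


def run_job(name, func, emit, triggered_by=None):
    """Run `func`, emitting start, completion, and failure events.

    `triggered_by` carries the request or operation identifier of the
    action that scheduled the job, when there is one.
    """
    base = {"job": name, "job_id": uuid.uuid4().hex, "triggered_by": triggered_by}
    emit({**base, "event": "job.started"})
    try:
        result = func()
    except Exception as exc:
        # Record only the exception type: messages may contain sensitive data.
        emit({**base, "event": "job.failed", "reason": type(exc).__name__})
        raise
    emit({**base, "event": "job.completed"})
    return result
```

The failure is re-raised after it is recorded, so the job runner's own retry and alerting behavior is unaffected.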

RLS and Database Observability

The following MUST be observable:

RLS policy denials
Use of service-role credentials
Access to compliance-regulated tables
Elevated access usage

Database observability MUST NOT expose sensitive row contents.

Provider Observability

For each third-party provider integration:

Request counts
Error rates
Throttling or suppression events
Guardrail activations
Retry behavior

Provider identifiers SHOULD be logged, not full payloads.
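The per-provider signals listed above can be captured at a single call site that wraps every provider request. A minimal sketch, assuming a hypothetical `ThrottledError` exception and a plain `Counter` as the metrics sink; backoff delays are omitted for brevity.

```python
from collections import Counter


class ThrottledError(Exception):
    """Hypothetical error raised when a provider signals throttling."""


def call_provider(provider, operation, stats, max_attempts=3):
    """Call `operation`, counting requests, throttling, errors, and retries."""
    for attempt in range(1, max_attempts + 1):
        stats[(provider, "requests")] += 1
        try:
            return operation()
        except ThrottledError:
            stats[(provider, "throttled")] += 1
        except Exception:
            # Only the count is recorded; the payload never reaches the logs.
            stats[(provider, "errors")] += 1
        if attempt < max_attempts:
            stats[(provider, "retries")] += 1
    raise RuntimeError(f"{provider}: all {max_attempts} attempts failed")
```

Counts are keyed by provider name only, which keeps them consistent with the rule that metrics are aggregated and non-identifying.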

Alerting Requirements

Mandatory Alerts

Alerts MUST exist for:

Sustained authorization or RLS failures
HIPAA access anomalies
Abuse guardrail spikes
Provider outages or throttling
Background job failures affecting retention or deletion
Unexpected growth in regulated data stores

Alerts MUST be actionable and routed to on-call owners.
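"Sustained" can be made concrete as a threshold that must be exceeded for several consecutive sampling intervals. The sketch below is illustrative only; the function name, the per-minute sampling, and any threshold values are assumptions, not values set by this plan.

```python
def is_sustained(counts_per_minute, threshold, minutes):
    """True when the most recent `minutes` samples all exceed `threshold`."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    if len(counts_per_minute) < minutes:
        # Not enough history yet to call the condition sustained.
        return False
    return all(count > threshold for count in counts_per_minute[-minutes:])
```

Requiring consecutive breaches means a single spike does not page anyone, while a persistent pattern of denials does.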

Alert Hygiene

Alerts SHOULD be tuned to minimize noise
Repeated alerts MUST be aggregated
Alert thresholds MUST be reviewed periodically
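Aggregating repeated alerts can be as simple as notifying once per window and counting the repeats. A minimal sketch, assuming a hypothetical `AlertAggregator` class, a 300-second default window, and timestamps supplied by the caller in seconds:

```python
class AlertAggregator:
    """Notify once per window per alert key; count suppressed repeats."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._open = {}  # alert key -> [window_start, suppressed_count]

    def should_notify(self, key, now):
        entry = self._open.get(key)
        if entry is None or now - entry[0] >= self.window_seconds:
            # First alert for this key, or the previous window has expired.
            self._open[key] = [now, 0]
            return True
        entry[1] += 1
        return False

    def suppressed(self, key):
        """Repeats swallowed in the current window, for the alert summary."""
        entry = self._open.get(key)
        return entry[1] if entry else 0
```

The suppressed count can be attached to the next notification so that aggregation reduces noise without hiding how often the condition fired.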

Access Control for Observability Data

Observability data is restricted to authorized roles
HIPAA-related logs are admin-only
Partner-related logs are scoped appropriately
Access to logs MUST itself be logged
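The rules above can be combined in one read path that checks the caller's role and records the attempt whether or not it is allowed. This is a sketch under assumptions: the role names, the `ROLE_RANK` ordering, and the log categories are hypothetical, and the authoritative roles are defined in the authorization contract.

```python
# Hypothetical role names and ranks; the real roles live in the authz contract.
ROLE_RANK = {"engineer": 1, "admin": 2}
REQUIRED_ROLE = {"general": "engineer", "hipaa": "admin"}


def read_logs(actor_id, role, category, fetch, audit):
    """Fetch logs for `category`, recording the access decision either way."""
    required = REQUIRED_ROLE[category]
    allowed = ROLE_RANK.get(role, 0) >= ROLE_RANK[required]
    # The audit record is written before any data is returned, so a
    # permitted read is always recorded even if `fetch` later fails.
    audit({
        "actor": actor_id,
        "action": "logs.read",
        "category": category,
        "outcome": "success" if allowed else "blocked",
    })
    if not allowed:
        raise PermissionError(f"role {role!r} may not read {category} logs")
    return fetch(category)
```

Unknown roles rank as 0 and are therefore denied by default.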

Retention of Observability Data

Observability data retention MUST comply with the data retention contract (docs/contracts/data-retention.new.md).

Minimum expectations:

Security and abuse logs: per the rate limits and abuse contract (docs/contracts/rate-limits-and-abuse.new.md)
HIPAA access logs: 6 years
General application logs: finite, documented window
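A retention job needs a precise rule for when a record becomes eligible for purging. The sketch below treats the 6-year HIPAA figure as a minimum retention period; the function names and the handling of leap-day records are assumptions for illustration.

```python
from datetime import date

HIPAA_ACCESS_LOG_YEARS = 6  # minimum retention stated in this plan


def earliest_purge_date(created: date, years: int) -> date:
    """First date on which a record created on `created` may be purged."""
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # Created on Feb 29 and the target year is not a leap year:
        # round up to Mar 1 so the record is never purged early.
        return date(created.year + years, 3, 1)


def may_purge(created: date, today: date, years: int = HIPAA_ACCESS_LOG_YEARS) -> bool:
    return today >= earliest_purge_date(created, years)
```

Rounding leap-day records forward instead of back errs on the side of keeping data slightly longer than the minimum.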

Incident Support

Observability MUST support:

Root cause analysis
Audit evidence generation
Compliance reporting
Partner dispute resolution

Lack of observability is grounds to halt deployments.

Non-Negotiable Rules

Compliance visibility is mandatory.
Sensitive data is never logged.
Observability gaps are defects.
Logs are access-controlled.
Instrumentation is not optional.

References

docs/governance/doc-authority.new.md
docs/contracts/authz-and-roles.new.md
docs/contracts/data-retention.new.md
docs/contracts/rate-limits-and-abuse.new.md
docs/contracts/rls-standard.new.md
docs/architecture/system-overview.new.md

End of Document