Observability Plan
Status: Canonical
Last Updated: 2026-02-06
Owner: Engineering
Purpose
This document defines the observability requirements for CAIRL.
It establishes:
- What MUST be observable
- What MUST be logged, measured, and alerted on
- How observability supports security, compliance, and reliability
- Boundaries between application, database, and provider visibility
This document defines architectural expectations. Implementation details may evolve, but guarantees MUST hold.
Scope
This plan applies to:
- All application services
- All server-side code paths
- All database access patterns
- All third-party provider integrations
- All compliance-relevant operations
Client-side observability is explicitly out of scope.
Core Observability Principles (Invariants)
- Security- and compliance-relevant actions are always observable.
- Sensitive data is never logged.
- Observability failures are treated as defects.
- Logs are structured, queryable, and retained intentionally.
- Access to observability data is restricted.
Observability Pillars
CAIRL observability is structured around three pillars:
- Logs – What happened
- Metrics – How often and how severe
- Traces – How a request flowed through the system
All three pillars MUST exist for critical paths.
Logging Requirements
What MUST Be Logged
The following events MUST be logged server-side:
- Authentication and session lifecycle events
- Authorization failures
- RLS access denials
- Access to HIPAA-regulated data
- Admin actions and overrides
- Rate limit and abuse guardrail triggers
- Account suspension and restoration
- Data deletion and retention actions
- Provider errors and throttling events
- Background job failures
What MUST NOT Be Logged
The following MUST NOT appear in logs:
- PHI or HIPAA document contents
- Biometric image data
- Authentication secrets or tokens
- Full request or response payloads from providers
- Raw user-generated content unless explicitly approved
Logs MUST prefer identifiers over payloads.
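A minimal, non-normative sketch of the "identifiers over payloads" rule, written as an allow-list filter applied before an event reaches the logger. The field names in ALLOWED_LOG_FIELDS and the function name to_loggable are illustrative assumptions, not part of CAIRL.

```python
from typing import Dict, Mapping

# Hypothetical allow-list: only identifier-like fields may reach the logger.
ALLOWED_LOG_FIELDS = frozenset({
    "request_id", "actor_id", "document_id", "provider",
    "provider_request_id", "action", "outcome", "reason_code",
})


def to_loggable(event: Mapping[str, object]) -> Dict[str, object]:
    """Keep allow-listed scalar fields; drop everything else.

    Nested structures are dropped even under an allowed key, so a payload
    cannot be smuggled in as the value of an identifier field.
    """
    return {
        key: value
        for key, value in event.items()
        if key in ALLOWED_LOG_FIELDS and isinstance(value, (str, int, bool))
    }
```

An allow-list is used here rather than a deny-list because a newly introduced sensitive field is then excluded by default instead of logged by default.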
Log Structure
All logs MUST include:
- Timestamp (UTC)
- Environment (dev, staging, prod)
- Request or operation identifier
- Actor identifier (user_id, admin_id, system)
- Action type
- Outcome (success, failure, blocked)
- Reason code (where applicable)
Logs SHOULD be machine-parseable.
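The required fields above could be emitted as one JSON object per line. The following is a sketch under that assumption; the function name build_log_record and the JSON key names are hypothetical, and the choice of JSON is one possible machine-parseable format, not a stated requirement.

```python
import json
from datetime import datetime, timezone
from typing import Optional

ENVIRONMENTS = {"dev", "staging", "prod"}
OUTCOMES = {"success", "failure", "blocked"}


def build_log_record(
    *,
    environment: str,
    request_id: str,
    actor_id: str,
    action: str,
    outcome: str,
    reason_code: Optional[str] = None,
    now: Optional[datetime] = None,
) -> str:
    """Return one JSON log line carrying the required fields."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    # Normalize to UTC; `now` is injectable so tests can pin the clock.
    timestamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "timestamp": timestamp.isoformat(),
        "environment": environment,
        "request_id": request_id,
        "actor_id": actor_id,
        "action": action,
        "outcome": outcome,
    }
    if reason_code is not None:  # reason code only where applicable
        record["reason_code"] = reason_code
    return json.dumps(record, sort_keys=True)
```

Rejecting unknown environment and outcome values at construction time keeps the log queryable on those fields.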
Metrics Requirements
Core Metrics
The system MUST emit metrics for:
- Request volume and error rates
- Authorization and RLS denials
- Rate limit hits and guardrail activations
- Provider API latency and failure rates
- Background job success and failure counts
- Email forwarding caps reached
- Phone guardrail blocks
- Partner API usage and deduplication events
Metrics MUST be aggregated and non-identifying.
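One way to enforce "aggregated and non-identifying" at the point of emission is to reject identifying labels outright. This is an illustrative in-memory sketch; MetricRegistry, its methods, and the contents of IDENTIFYING_LABELS are assumptions and do not describe any particular metrics backend.

```python
from collections import Counter

# Hypothetical set of label names that would identify an individual.
IDENTIFYING_LABELS = frozenset({"user_id", "admin_id", "email", "phone", "ip"})


class MetricRegistry:
    """In-memory counters keyed by metric name plus a small label set."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    @staticmethod
    def _key(name: str, labels: dict) -> tuple:
        # Sort labels so the same label set always maps to the same key.
        return (name, tuple(sorted(labels.items())))

    def increment(self, name: str, **labels: str) -> None:
        bad = IDENTIFYING_LABELS.intersection(labels)
        if bad:
            raise ValueError(f"identifying labels not allowed: {sorted(bad)}")
        self._counts[self._key(name, labels)] += 1

    def value(self, name: str, **labels: str) -> int:
        return self._counts[self._key(name, labels)]
```

For example, `registry.increment("rls_denials", table="documents")` is accepted, while passing `user_id=...` as a label raises instead of silently creating a per-user series.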
Compliance Metrics
The following MUST be measurable:
- HIPAA data access counts
- Admin access frequency to regulated data
- Retention job execution and failures
- Deletion completion rates
- Audit log growth and retention
Tracing Requirements
Request Tracing
For server-side requests:
- A unique trace identifier MUST be generated
- The identifier MUST propagate through:
  - Application logic
  - Database operations
  - Provider calls (where possible)
Tracing MUST support:
- Latency analysis
- Failure attribution
- Dependency mapping
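Trace identifier generation and propagation can be sketched with Python's standard contextvars module, which keeps a per-request value available to application code without threading it through every call. The header name X-Trace-Id, the SQL comment convention, and all function names are hypothetical choices for illustration.

```python
import contextvars
import uuid
from typing import Dict

# Holds the trace identifier for the current request context.
_trace_id = contextvars.ContextVar("trace_id", default=None)


def start_trace() -> str:
    """Generate a unique trace identifier at the request boundary."""
    trace_id = uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def current_trace_id() -> str:
    trace_id = _trace_id.get()
    if trace_id is None:
        raise RuntimeError("no active trace; call start_trace() first")
    return trace_id


def provider_headers() -> Dict[str, str]:
    """Headers to attach to outbound provider calls (where possible)."""
    return {"X-Trace-Id": current_trace_id()}


def tag_sql(sql: str) -> str:
    """Prefix a statement with a comment so database logs can be correlated."""
    return f"/* trace_id={current_trace_id()} */ {sql}"
```

Raising when no trace is active, rather than generating an identifier lazily, makes a missing request-boundary hook visible as a defect.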
Background and Async Tracing
Background jobs MUST emit:
- Start and completion events
- Failure events with reason
- Correlation to triggering action where applicable
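The three job events above can be sketched as a small wrapper around the job body. The function run_job, the event names, and the trigger_id field are illustrative assumptions; emit stands in for whatever logging or tracing sink is actually used.

```python
from typing import Callable, Dict, Optional


def run_job(
    job_name: str,
    work: Callable[[], None],
    emit: Callable[[Dict[str, object]], None],
    trigger_id: Optional[str] = None,
) -> bool:
    """Run `work`, emitting start, completion, and failure events.

    `trigger_id` correlates the job to the request or action that caused it,
    where one exists. Returns True on success, False on failure.
    """
    base = {"job": job_name, "trigger_id": trigger_id}
    emit({**base, "event": "job.started"})
    try:
        work()
    except Exception as exc:
        # Record the exception type as the reason; the message is omitted
        # because it may contain sensitive values.
        emit({**base, "event": "job.failed", "reason": type(exc).__name__})
        return False
    emit({**base, "event": "job.completed"})
    return True
```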
RLS and Database Observability
The following MUST be observable:
- RLS policy denials
- Use of service-role credentials
- Access to compliance-regulated tables
- Elevated access usage
Database observability MUST NOT expose sensitive row contents.
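A sketch of a database access event that records who touched which table and whether the access was elevated or denied, while carrying no row data. The table name hipaa_documents, the role name service_role, and the event field names are hypothetical placeholders.

```python
from typing import Dict, Optional

# Hypothetical names; the real regulated tables and elevated roles are
# defined elsewhere and are not stated in this plan.
REGULATED_TABLES = frozenset({"hipaa_documents"})
ELEVATED_ROLES = frozenset({"service_role"})


def db_access_event(
    *,
    actor_id: str,
    role: str,
    table: str,
    operation: str,
    denied_by_policy: Optional[str] = None,
) -> Dict[str, object]:
    """Describe a database access without including any row contents."""
    return {
        "action": "db.access",
        "actor_id": actor_id,
        "role": role,
        "table": table,
        "operation": operation,
        "outcome": "blocked" if denied_by_policy else "success",
        "reason_code": denied_by_policy,  # name of the denying policy, if any
        "elevated": role in ELEVATED_ROLES,
        "regulated": table in REGULATED_TABLES,
    }
```

The function signature has no parameter through which row values could be passed, which is the point of the design.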
Provider Observability
For each third-party provider integration, the following are tracked:
- Request counts
- Error rates
- Throttling or suppression events
- Guardrail activations
- Retry behavior
Provider identifiers SHOULD be logged, not full payloads.
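The provider signals above could be collected by a thin wrapper around each outbound call. This sketch counts requests, throttling, errors, and retries per provider; the exception classes, the call_provider function, and the counter key layout are assumptions, and backoff between attempts is omitted for brevity.

```python
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")


class ProviderThrottled(Exception):
    """Raised by a (hypothetical) client when the provider rate-limits us."""


class ProviderUnavailable(Exception):
    """Raised when every attempt against a provider has failed."""


def call_provider(
    provider: str,
    request: Callable[[], T],
    stats: Counter,
    max_attempts: int = 3,
) -> T:
    """Invoke `request`, recording per-provider counts in `stats`."""
    for attempt in range(1, max_attempts + 1):
        stats[(provider, "requests")] += 1
        try:
            return request()
        except ProviderThrottled:
            stats[(provider, "throttled")] += 1
        except Exception:
            stats[(provider, "errors")] += 1
        if attempt < max_attempts:
            stats[(provider, "retries")] += 1
    # Only the provider identifier is surfaced, never the payload.
    raise ProviderUnavailable(provider)
```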
Alerting Requirements
Mandatory Alerts
Alerts MUST exist for:
- Sustained authorization or RLS failures
- HIPAA access anomalies
- Abuse guardrail spikes
- Provider outages or throttling
- Background job failures affecting retention or deletion
- Unexpected growth in regulated data stores
Alerts MUST be actionable and routed to on-call owners.
Alert Hygiene
- Alerts SHOULD avoid noise
- Repeated alerts MUST be aggregated
- Alert thresholds MUST be reviewed periodically
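Aggregation of repeated alerts can be sketched as a suppression window per alert key: the first occurrence notifies, and repeats inside the window are counted rather than sent. The class AlertAggregator, its method names, and the window length used in the example are illustrative assumptions.

```python
from collections import Counter
from typing import Dict


class AlertAggregator:
    """Suppress repeats of the same alert within a fixed time window."""

    def __init__(self, window_seconds: float) -> None:
        self._window = window_seconds
        self._last_sent: Dict[str, float] = {}
        self.suppressed: Counter = Counter()

    def should_notify(self, alert_key: str, now: float) -> bool:
        """Return True if on-call should be notified for this occurrence."""
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self._window:
            # Fold the repeat into a count that the next notification
            # (or a dashboard) can report.
            self.suppressed[alert_key] += 1
            return False
        self._last_sent[alert_key] = now
        return True
```

With a 300-second window, an alert at t=0 notifies, a repeat at t=10 is counted and suppressed, and an occurrence at t=400 notifies again.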
Access Control for Observability Data
- Observability data is restricted to authorized roles
- HIPAA-related logs are admin-only
- Partner-related logs are scoped appropriately
- Access to logs MUST itself be logged
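A sketch of role-gated log access in which every read attempt, allowed or not, is itself recorded. The stream names, the engineer role, and the read_logs function are hypothetical; only the rule that HIPAA-related logs are admin-only comes from this plan.

```python
from typing import Dict, List

# Hypothetical mapping of log stream to roles allowed to read it.
STREAM_ROLES = {
    "hipaa": frozenset({"admin"}),
    "application": frozenset({"admin", "engineer"}),
}


def read_logs(
    *,
    actor_id: str,
    role: str,
    stream: str,
    store: Dict[str, List[str]],
    audit: List[Dict[str, str]],
) -> List[str]:
    """Return log lines for `stream`, auditing the attempt before the check."""
    allowed = role in STREAM_ROLES.get(stream, frozenset())
    audit.append({
        "action": "logs.read",
        "actor_id": actor_id,
        "stream": stream,
        "outcome": "success" if allowed else "blocked",
    })
    if not allowed:
        raise PermissionError(f"role {role!r} may not read {stream!r} logs")
    return list(store.get(stream, []))
```

Writing the audit entry before enforcing the check means denied attempts are recorded as well as successful ones.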
Retention of Observability Data
Observability data retention MUST comply with the data retention contract (docs/contracts/data-retention.new.md).
Minimum expectations:
- Security and abuse logs: per the abuse contract (docs/contracts/rate-limits-and-abuse.new.md)
- HIPAA access logs: 6 years
- General application logs: finite, documented window
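The 6-year minimum for HIPAA access logs can be expressed as a purge-eligibility check. This sketch covers only that one category, since the other windows are defined by the referenced contracts and are not stated here; the function names are assumptions.

```python
from datetime import datetime, timezone

HIPAA_ACCESS_LOG_RETENTION_YEARS = 6  # minimum stated in this plan


def years_before(moment: datetime, years: int) -> datetime:
    """Return the same calendar instant `years` earlier."""
    try:
        return moment.replace(year=moment.year - years)
    except ValueError:
        # Feb 29 with a non-leap target year: fall back to Feb 28.
        return moment.replace(year=moment.year - years, day=28)


def hipaa_log_purge_eligible(created_at: datetime, now: datetime) -> bool:
    """True only once the entry is older than the minimum retention period."""
    return created_at < years_before(now, HIPAA_ACCESS_LOG_RETENTION_YEARS)
```

Both arguments are expected to be timezone-aware UTC datetimes, matching the UTC timestamps required of log records.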
Incident Support
Observability MUST support:
- Root cause analysis
- Audit evidence generation
- Compliance reporting
- Partner dispute resolution
Lack of observability is grounds to halt deployments.
Non-Negotiable Rules
- Compliance visibility is mandatory.
- Sensitive data is never logged.
- Observability gaps are defects.
- Logs are access-controlled.
- Instrumentation is not optional.
References
- docs/governance/doc-authority.new.md
- docs/contracts/authz-and-roles.new.md
- docs/contracts/data-retention.new.md
- docs/contracts/rate-limits-and-abuse.new.md
- docs/contracts/rls-standard.new.md
- docs/architecture/system-overview.new.md