Provider Outage Runbook

Status: Canonical (Draft)
Last Updated: 2026-02-06
Owner: Operations / Engineering

Purpose

This runbook defines how CAIRL responds to outages or degradation in third-party providers.

Providers are treated as external dependencies. Outages must be handled safely, predictably, and auditably.

Primary goals:

Protect user data
Prevent cascading failures
Preserve compliance guarantees
Communicate clearly and minimally

Scope

This runbook applies to outages involving:

Email providers (MailGap, transactional email)
Authentication providers
Supabase Postgres
AWS S3
Payment providers
Phone providers
Partner API dependencies

Out of scope:

Internal code bugs (separate incident process)
Planned maintenance (change management)

Core Principles

Fail safe, not open
Degrade functionality, not security
Preserve data integrity
Communicate clearly, avoid speculation
Confirm stability before restoring features

Outage Classification

Severity Levels

Severity	Description
SEV-1	Critical outage affecting core functionality or compliance
SEV-2	Partial degradation with workarounds
SEV-3	Minor impact, non-critical features

Severity determines response urgency and communication scope.

Detection and Confirmation

Outages may be detected via:

Automated alerts
Provider status pages
Error spikes
User reports

Before acting:

Confirm the outage via at least two independent signals
Identify affected provider
Determine blast radius
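
The confirmation step above can be sketched as a small check. This is a minimal illustration, not the CAIRL implementation: the Signal type, the source labels, and the provider names are hypothetical, and "two signals" is interpreted here as two distinct signal sources rather than two alerts of the same kind.

```python
from dataclasses import dataclass

# The four detection channels listed in this runbook; the labels are illustrative.
SIGNAL_SOURCES = {"automated_alert", "status_page", "error_spike", "user_report"}


@dataclass(frozen=True)
class Signal:
    source: str    # one of SIGNAL_SOURCES
    provider: str  # hypothetical provider label, e.g. "email" or "auth"


def confirmed_providers(signals: list[Signal], min_sources: int = 2) -> set[str]:
    """Return providers reported by at least `min_sources` distinct sources."""
    sources_by_provider: dict[str, set[str]] = {}
    for s in signals:
        if s.source not in SIGNAL_SOURCES:
            continue  # ignore signal types this sketch does not know about
        sources_by_provider.setdefault(s.provider, set()).add(s.source)
    return {p for p, srcs in sources_by_provider.items() if len(srcs) >= min_sources}
```

Two alerts from the same channel do not confirm an outage under this reading; an alert plus a matching status page entry does.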

General Response Checklist

Acknowledge incident
Identify provider and scope
Set severity level
Freeze deployments if needed
Activate relevant mitigation steps
Monitor continuously
Communicate status
Document actions

Provider-Specific Response

Email Provider Outage (MailGap or Transactional)

Symptoms:

Inbound emails not received
Replies failing
SMTP errors
Webhook failures

Immediate Actions:

Pause outbound replies
Preserve inbound messages if queued
Prevent credit consumption on failures
Log all failed attempts
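
One way to read the actions above is that a credit is deducted only after a send is confirmed, and every held or failed attempt is logged. The sketch below assumes a hypothetical send_fn callable and an in-memory balance dictionary; the real credit ledger and mail client are not described in this runbook.

```python
import logging

log = logging.getLogger("outage.email")


def send_reply(message_id: str, send_fn, balance: dict, paused: bool) -> bool:
    """Attempt one reply; deduct a credit only after a confirmed send."""
    if paused:
        # Outbound replies are paused: hold the message, charge nothing.
        log.warning("reply %s held: outbound replies paused", message_id)
        return False
    try:
        send_fn(message_id)
    except Exception as exc:  # broad on purpose in this sketch: any failure is free
        log.error("reply %s failed: %s", message_id, exc)
        return False
    balance["credits"] -= 1  # charged only on success
    return True
```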

User Impact:

Show non-destructive error messages
Do not allow retries that risk duplication
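
A common way to make retries safe against duplication is an idempotency key. The following is a sketch under assumptions: the key fields (user, thread, body) and the in-memory RetryGuard are hypothetical, and a real deployment would need a persistent store shared across workers.

```python
import hashlib


def idempotency_key(user_id: str, thread_id: str, body: str) -> str:
    """Derive a stable key so a retry of the same reply maps to the same key."""
    raw = f"{user_id}\x00{thread_id}\x00{body}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class RetryGuard:
    """Remembers keys already accepted (in memory only, for illustration)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_send(self, key: str) -> bool:
        """Return True the first time a key is seen, False for any repeat."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With this shape, a user pressing "retry" on the same reply produces the same key and is rejected as a duplicate, while an edited reply produces a new key.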

Recovery:

Resume sending gradually
Monitor bounces and complaints
Validate suppression integrity
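
Gradual resumption can be approximated with a deterministic percentage gate. This is an illustrative sketch only: the function name and the idea of ramping by recipient identifier are assumptions, and the ramp steps (for example 10, then 50, then 100 percent) would be chosen by the incident owner.

```python
import zlib


def in_rollout(recipient_id: str, percent: int) -> bool:
    """Deterministically admit roughly `percent`% of recipients."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # CRC32 gives a stable bucket in 0..99 for each recipient.
    bucket = zlib.crc32(recipient_id.encode("utf-8")) % 100
    return bucket < percent
```

Because the bucket is stable, anyone admitted at a lower percentage stays admitted as the percentage is raised, which keeps bounce and complaint monitoring comparable between steps.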

Authentication Provider Outage

Symptoms:

Login failures
Session validation errors

Immediate Actions:

Block new logins
Preserve existing sessions if valid
Disable sensitive actions requiring re-auth

User Impact:

Show authentication unavailable message

Recovery:

Re-enable login
Monitor session anomalies

Supabase Postgres Outage

Symptoms:

Query failures
RLS errors
Connection timeouts

Immediate Actions:

Disable write operations
Block user actions requiring persistence
Preserve read-only operations if safe
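
Disabling writes while keeping reads can be expressed as a guard around every persistence call. The names below (DegradedMode, guarded_write, ReadOnlyModeError) are hypothetical and do not refer to any Supabase API; how the flag is stored and distributed is left open.

```python
class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while writes are disabled."""


class DegradedMode:
    """Holds the read-only flag; a shared flag store is assumed in practice."""

    def __init__(self) -> None:
        self.read_only = False


def guarded_write(mode: DegradedMode, write_fn, *args, **kwargs):
    """Run `write_fn` only when writes are enabled; otherwise fail closed."""
    if mode.read_only:
        raise ReadOnlyModeError("writes are disabled during the outage")
    return write_fn(*args, **kwargs)
```

Callers can catch ReadOnlyModeError and show the degraded mode message instead of attempting the write.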

User Impact:

Show maintenance or degraded mode message

Recovery:

Validate data integrity
Verify RLS enforcement
Resume writes cautiously

AWS S3 Outage

Symptoms:

Attachment uploads failing
Downloads unavailable

Immediate Actions:

Block uploads
Preserve metadata creation
Prevent partial writes

User Impact:

Disable upload UI
Allow message viewing without attachments

Recovery:

Resume uploads
Validate object consistency

Payment Provider Outage

Symptoms:

Checkout failures
Billing webhooks delayed

Immediate Actions:

Pause subscription changes
Preserve billing events for replay
Prevent duplicate charges
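
Preserving events for replay while preventing duplicates can be sketched as a buffer keyed by event ID. The class below is an assumption for illustration: it keeps state in memory, whereas a real implementation would persist events, and the event ID is assumed to be supplied by the payment provider's webhook.

```python
from collections import OrderedDict


class BillingEventBuffer:
    """Holds billing events during an outage and replays each one once."""

    def __init__(self) -> None:
        self._pending: OrderedDict = OrderedDict()  # event_id -> payload
        self._applied: set[str] = set()

    def record(self, event_id: str, payload: dict) -> bool:
        """Store an event; return False if it was already seen."""
        if event_id in self._pending or event_id in self._applied:
            return False
        self._pending[event_id] = payload
        return True

    def replay(self, handler) -> int:
        """Apply pending events in arrival order; return how many were applied.

        If `handler` raises, the exception propagates and that event stays pending.
        """
        applied = 0
        for event_id in list(self._pending):
            handler(event_id, self._pending[event_id])
            del self._pending[event_id]
            self._applied.add(event_id)
            applied += 1
        return applied
```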

User Impact:

Show billing unavailable message

Recovery:

Reconcile events
Validate subscription states

Phone Provider Outage

Symptoms:

Calls or SMS failing
Webhook delays

Immediate Actions:

Block call initiation
Preserve voicemail where possible
Prevent retries that cause duplication

User Impact:

Disable phone actions

Recovery:

Resume gradually
Monitor fraud guardrails

Partner API Dependency Outage

Symptoms:

External eligibility or verification failing

Immediate Actions:

Freeze partner calls
Return safe degraded responses
Do not count failed or degraded partner calls as billing events
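
Freezing partner calls and returning a safe degraded response resembles a circuit breaker. The sketch below is one possible shape, not the existing implementation: PartnerBreaker, its thresholds, and the (result, billable) return pair are all hypothetical.

```python
import time


class PartnerBreaker:
    """Stops calling a failing partner and returns a degraded response instead."""

    def __init__(self, failure_threshold=3, cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, partner_fn, degraded):
        """Return (result, billable); degraded responses are never billable."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                return degraded, False  # frozen: no partner call is made
            self._opened_at = None      # cooldown over: allow one trial call
        try:
            result = partner_fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # freeze (or re-freeze) calls
            return degraded, False
        self._failures = 0
        return result, True
```

The billable flag lets the caller skip usage metering whenever the degraded response was served.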

User Impact:

Show temporary unavailability message

Recovery:

Resume calls
Monitor deduplication and guardrails

Communication Guidelines

Internal Communication

Notify on-call engineering
Update incident channel
Record timestamps and decisions

User Communication

User-facing messages MUST:

Be factual
Avoid provider names unless required
Avoid timelines unless confirmed

Example: “Some features are temporarily unavailable. We’re monitoring the situation.”

Observability and Logging

During an outage:

Increase log sampling if safe
Track error rates per provider
Log all mitigation actions

All actions MUST be auditable.
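
An auditable mitigation log can be as simple as one structured record per action. The field names below (actor, provider, action, reason) are assumptions chosen for illustration; the canonical schema, if one exists, would come from the observability plan.

```python
import json
from datetime import datetime, timezone


def audit_record(actor: str, provider: str, action: str, reason: str, now=None) -> str:
    """Build one JSON line describing a mitigation action, timestamped in UTC."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    entry = {
        "ts": ts,
        "actor": actor,
        "provider": provider,
        "action": action,
        "reason": reason,
    }
    return json.dumps(entry, sort_keys=True)
```

Each line can be appended to the incident log as actions are taken, which also gives the postmortem timeline a ready source.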

Recovery and Post-Incident

After recovery:

Confirm service stability
Gradually re-enable features
Validate compliance invariants
Close incident
Write postmortem

Postmortem Requirements

A postmortem MUST include:

Timeline
Root cause
Impact assessment
Mitigations taken
Preventive actions

Postmortems are blameless.

Non-Negotiable Rules

Safety overrides availability
Do not bypass contracts
Do not disable logging
Do not speculate publicly
Every action is documented

References

docs/architecture/observability-plan.new.md
docs/contracts/rate-limits-and-abuse.new.md
docs/contracts/data-retention.new.md
docs/contracts/rls-standard.new.md
docs/specs/mailgap/mailgap-proxy-email-service.spec.new.md
