Provider Outage Runbook

Status: Canonical (Draft)
Last Updated: 2026-02-06
Owner: Operations / Engineering


Purpose

This runbook defines how CAIRL responds to outages or degradation in third-party providers.

Providers are treated as external dependencies. Outages must be handled safely, predictably, and auditably.

Primary goals:

  • Protect user data
  • Prevent cascading failures
  • Preserve compliance guarantees
  • Communicate clearly and minimally

Scope

This runbook applies to outages involving:

  • Email providers (MailGap, transactional email)
  • Authentication providers
  • Supabase Postgres
  • AWS S3
  • Payment providers
  • Phone providers
  • Partner API dependencies

Out of scope:

  • Internal code bugs (separate incident process)
  • Planned maintenance (change management)

Core Principles

  1. Fail safe, not open
  2. Degrade functionality, not security
  3. Preserve data integrity
  4. Communicate clearly, avoid speculation
  5. Confirm stability before restoring features

Outage Classification

Severity Levels

Severity   Description
SEV-1      Critical outage affecting core functionality or compliance
SEV-2      Partial degradation with workarounds
SEV-3      Minor impact, non-critical features

Severity determines response urgency and communication scope.


Detection and Confirmation

Outages may be detected via:

  • Automated alerts
  • Provider status pages
  • Error spikes
  • User reports

Before acting:

  • Confirm via at least two signals
  • Identify affected provider
  • Determine blast radius
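
The two-signal rule above can be sketched as a small check. This is an illustration only: the `Signal` type, the `SignalSource` names, and the provider labels are hypothetical and do not describe an existing CAIRL module. The sketch counts distinct sources, so two error spikes for the same provider do not confirm an outage on their own.

```python
from dataclasses import dataclass
from enum import Enum


class SignalSource(Enum):
    # The four detection channels listed in this section.
    AUTOMATED_ALERT = "automated_alert"
    STATUS_PAGE = "status_page"
    ERROR_SPIKE = "error_spike"
    USER_REPORT = "user_report"


@dataclass(frozen=True)
class Signal:
    provider: str          # hypothetical label, e.g. "email" or "s3"
    source: SignalSource


def confirmed_outages(signals: list[Signal], min_sources: int = 2) -> set[str]:
    """Return providers reported by at least `min_sources` distinct sources."""
    sources_by_provider: dict[str, set[SignalSource]] = {}
    for signal in signals:
        sources_by_provider.setdefault(signal.provider, set()).add(signal.source)
    return {
        provider
        for provider, sources in sources_by_provider.items()
        if len(sources) >= min_sources
    }
```

Called with two error spikes for "email" plus a status-page entry and a user report for "s3", the function confirms only "s3".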

General Response Checklist

  1. Acknowledge incident
  2. Identify provider and scope
  3. Set severity level
  4. Freeze deployments if needed
  5. Activate relevant mitigation steps
  6. Monitor continuously
  7. Communicate status
  8. Document actions

Provider-Specific Response


Email Provider Outage (MailGap or Transactional)

Symptoms:

  • Inbound emails not received
  • Replies failing
  • SMTP errors
  • Webhook failures

Immediate Actions:

  • Pause outbound replies
  • Preserve inbound messages if queued
  • Prevent credit consumption on failures
  • Log all failed attempts
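
One way to combine the immediate actions above is a thin wrapper around the send call. The sketch below is hypothetical: `OutboundMailer`, its `send` callable, and the integer credit counter are assumptions made for illustration, not the actual MailGap integration. It holds replies while paused, logs every failure, and deducts a credit only after the provider call returns without error.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable

logger = logging.getLogger("outage.email")


@dataclass
class OutboundMailer:
    send: Callable[[str], None]   # hypothetical provider call; raises on failure
    credits: int                  # hypothetical per-account credit balance
    paused: bool = False
    held: list[str] = field(default_factory=list)     # not attempted while paused
    failed: list[str] = field(default_factory=list)   # attempted and failed

    def reply(self, message_id: str) -> bool:
        """Return True only if the provider accepted the message."""
        if self.paused:
            # Outbound is paused: hold the reply, do not send, do not charge.
            self.held.append(message_id)
            return False
        try:
            self.send(message_id)
        except Exception as exc:
            # Log the failure and leave the credit balance untouched.
            logger.warning("send failed for %s: %s", message_id, exc)
            self.failed.append(message_id)
            return False
        self.credits -= 1   # charge only after a successful send
        return True
```

Failed messages are recorded separately from held ones because a failed attempt may still have reached the provider. Resending it without a duplication check would conflict with the retry rule under User Impact.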

User Impact:

  • Show non-destructive error messages
  • Do not allow retries that risk duplication

Recovery:

  • Resume sending gradually
  • Monitor bounces and complaints
  • Validate suppression integrity

Authentication Provider Outage

Symptoms:

  • Login failures
  • Session validation errors

Immediate Actions:

  • Block new logins
  • Preserve existing sessions if valid
  • Disable sensitive actions requiring re-auth
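
The three actions above amount to a small decision table, sketched below. The sketch assumes that an existing session can be validated locally (here, by its expiry time) without calling the provider. That may not hold for every session design, and the `Session` and `Decision` names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_LOGIN = "require_login"
    UNAVAILABLE = "unavailable"   # show the "authentication unavailable" message


@dataclass(frozen=True)
class Session:
    user_id: str
    expires_at: datetime


def decide(session: Optional[Session], requires_reauth: bool,
           provider_up: bool, now: datetime) -> Decision:
    """Decide how to treat a request given the auth provider's state."""
    has_valid_session = session is not None and session.expires_at > now
    if has_valid_session and not requires_reauth:
        return Decision.ALLOW   # preserve existing valid sessions
    # A new login or a re-auth is needed, and both depend on the provider.
    return Decision.REQUIRE_LOGIN if provider_up else Decision.UNAVAILABLE
```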

User Impact:

  • Show authentication unavailable message

Recovery:

  • Re-enable login
  • Monitor session anomalies

Supabase Postgres Outage

Symptoms:

  • Query failures
  • RLS errors
  • Connection timeouts

Immediate Actions:

  • Disable write operations
  • Block user actions requiring persistence
  • Preserve read-only operations if safe
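
A read-only degraded mode can be enforced at the application layer with a single switch that every write path checks. The sketch below is a hypothetical illustration: `DatabaseGate`, `save_note`, and the in-memory `store` stand in for real persistence code, and none of them is a Supabase API.

```python
from enum import Enum
from functools import wraps
from typing import Optional


class Mode(Enum):
    NORMAL = "normal"
    READ_ONLY = "read_only"


class DegradedModeError(RuntimeError):
    """Raised when a write is attempted while writes are disabled."""


class DatabaseGate:
    def __init__(self) -> None:
        self.mode = Mode.NORMAL

    def writes(self, func):
        """Decorator: reject the call before it runs while the gate is read-only."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            if self.mode is Mode.READ_ONLY:
                raise DegradedModeError("Writes are temporarily disabled.")
            return func(*args, **kwargs)
        return wrapper


gate = DatabaseGate()
store: dict[str, str] = {}   # stand-in for the database


@gate.writes
def save_note(key: str, text: str) -> None:
    store[key] = text


def read_note(key: str) -> Optional[str]:
    # Reads are not gated; they remain available in read-only mode.
    return store.get(key)
```

Setting `gate.mode = Mode.READ_ONLY` makes `save_note` raise before touching the store, while `read_note` keeps working. Rejecting the write up front, rather than letting it time out, keeps the failure explicit and avoids partially applied changes.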

User Impact:

  • Show maintenance or degraded mode message

Recovery:

  • Validate data integrity
  • Verify RLS enforcement
  • Resume writes cautiously

AWS S3 Outage

Symptoms:

  • Attachment uploads failing
  • Downloads unavailable

Immediate Actions:

  • Block uploads
  • Preserve metadata creation
  • Prevent partial writes

User Impact:

  • Disable upload UI
  • Allow message viewing without attachments

Recovery:

  • Resume uploads
  • Validate object consistency

Payment Provider Outage

Symptoms:

  • Checkout failures
  • Billing webhooks delayed

Immediate Actions:

  • Pause subscription changes
  • Preserve billing events for replay
  • Prevent duplicate charges

User Impact:

  • Show billing unavailable message

Recovery:

  • Reconcile events
  • Validate subscription states
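
Preserving events for replay and preventing duplicate charges can be handled together by buffering raw events and deduplicating on a stable event ID at replay time. The sketch below assumes each event is a dict with a unique `"id"` key. That shape and the `BillingEventBuffer` name are hypothetical, not a specific payment provider's format.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BillingEventBuffer:
    pending: list[dict] = field(default_factory=list)
    applied_ids: set[str] = field(default_factory=set)

    def record(self, event: dict) -> None:
        """Store a raw event during the outage without applying it."""
        self.pending.append(event)

    def replay(self, apply: Callable[[dict], None]) -> int:
        """Apply buffered events in arrival order, skipping IDs already applied.

        Returns the number of events applied in this call.
        """
        applied = 0
        for event in self.pending:
            if event["id"] in self.applied_ids:
                continue   # duplicate delivery: never apply twice
            apply(event)
            self.applied_ids.add(event["id"])
            applied += 1
        self.pending.clear()
        return applied
```

If `apply` raises partway through, the buffer is left uncleared and the IDs already applied stay recorded, so a second `replay` call resumes without reapplying earlier events.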

Phone Provider Outage

Symptoms:

  • Calls or SMS failing
  • Webhook delays

Immediate Actions:

  • Block call initiation
  • Preserve voicemail where possible
  • Prevent retries that cause duplication

User Impact:

  • Disable phone actions

Recovery:

  • Resume gradually
  • Monitor fraud guardrails

Partner API Dependency Outage

Symptoms:

  • External eligibility or verification failing

Immediate Actions:

  • Freeze partner calls
  • Return safe degraded responses
  • Do not count billing events

User Impact:

  • Show temporary unavailability message

Recovery:

  • Resume calls
  • Monitor deduplication and guardrails

Communication Guidelines

Internal Communication

  • Notify on-call engineering
  • Update incident channel
  • Record timestamps and decisions

User Communication

User-facing messages MUST:

  • Be factual
  • Avoid provider names unless required
  • Avoid timelines unless confirmed

Example: “Some features are temporarily unavailable. We’re monitoring the situation.”


Observability and Logging

During an outage:

  • Increase log sampling if safe
  • Track error rates per provider
  • Log all mitigation actions

All actions MUST be auditable.
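
A minimal sketch of an auditable mitigation record follows, assuming one JSON line per action. The field names (`incident_id`, `provider`, `action`, `actor`) are illustrative and are not taken from the observability plan.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(incident_id: str, provider: str, action: str, actor: str,
                 now: Optional[datetime] = None) -> str:
    """Build one JSON line describing a mitigation action, timestamped in UTC."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {
            "timestamp": timestamp,
            "incident_id": incident_id,
            "provider": provider,
            "action": action,
            "actor": actor,
        },
        sort_keys=True,
    )
```

For example, `audit_record("INC-1", "email", "pause_outbound_replies", "on-call")` yields a single line that can be appended to the incident log. Sorted keys and a UTC timestamp make records easier to compare and to order when building the postmortem timeline.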


Recovery and Post-Incident

After recovery:

  1. Confirm service stability
  2. Gradually re-enable features
  3. Validate compliance invariants
  4. Close incident
  5. Write postmortem
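
Step 2 can be sketched as a stepped ramp that advances only while the error rate stays low. The step fractions and the 5% error threshold below are placeholder assumptions, not CAIRL policy, and `next_traffic_fraction` is a hypothetical helper.

```python
def next_traffic_fraction(current: float, error_rate: float,
                          max_error_rate: float = 0.05,
                          steps: tuple[float, ...] = (0.1, 0.25, 0.5, 1.0)) -> float:
    """Return the fraction of traffic to allow in the next observation window.

    Advances one step when errors are within budget; drops back to the
    smallest step when they are not.
    """
    if error_rate > max_error_rate:
        return steps[0]
    for step in steps:
        if step > current:
            return step
    return steps[-1]   # already fully restored
```

Starting from 0.0 with no errors, successive calls return 0.1, 0.25, 0.5, and 1.0. An error rate above the threshold at any point returns the ramp to 0.1.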

Postmortem Requirements

A postmortem MUST include:

  • Timeline
  • Root cause
  • Impact assessment
  • Mitigations taken
  • Preventive actions

Postmortems are blameless.


Non-Negotiable Rules

  • Safety overrides availability
  • Do not bypass contracts
  • Do not disable logging
  • Do not speculate publicly
  • Every action is documented

References

  • docs/architecture/observability-plan.new.md
  • docs/contracts/rate-limits-and-abuse.new.md
  • docs/contracts/data-retention.new.md
  • docs/contracts/rls-standard.new.md
  • docs/specs/mailgap/mailgap-proxy-email-service.spec.new.md

End of Document