Provider Outage Runbook
Status: Canonical (Draft)
Last Updated: 2026-02-06
Owner: Operations / Engineering
Purpose
This runbook defines how CAIRL responds to outages or degradation in third-party providers.
Providers are treated as external dependencies. Outages must be handled safely, predictably, and auditably.
Primary goals:
- Protect user data
- Prevent cascading failures
- Preserve compliance guarantees
- Communicate clearly and minimally
Scope
This runbook applies to outages involving:
- Email providers (MailGap, transactional email)
- Authentication providers
- Supabase Postgres
- AWS S3
- Payment providers
- Phone providers
- Partner API dependencies
Out of scope:
- Internal code bugs (separate incident process)
- Planned maintenance (change management)
Core Principles
- Fail safe, not open
- Degrade functionality, not security
- Preserve data integrity
- Communicate clearly, avoid speculation
- Restore stability before re-enabling features
Outage Classification
Severity Levels
| Severity | Description |
|---|---|
| SEV-1 | Critical outage affecting core functionality or compliance |
| SEV-2 | Partial degradation with workarounds |
| SEV-3 | Minor impact, non-critical features |
Severity determines response urgency and communication scope.
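The severity table can be read as a small decision rule. The sketch below is one possible interpretation, not an agreed policy: the three triage questions (`affects_core`, `affects_compliance`, `has_workaround`) and the function name `classify` are assumptions introduced for illustration.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"  # critical: core functionality or compliance affected
    SEV2 = "SEV-2"  # partial degradation with workarounds
    SEV3 = "SEV-3"  # minor impact on non-critical features


def classify(affects_core: bool, affects_compliance: bool, has_workaround: bool) -> Severity:
    """Map three yes/no triage answers onto the severity table (hypothetical rule)."""
    # Any compliance impact, or a core outage with no workaround, is treated as critical.
    if affects_compliance or (affects_core and not has_workaround):
        return Severity.SEV1
    # Core functionality is degraded but a workaround exists.
    if affects_core:
        return Severity.SEV2
    # Only non-critical features are affected.
    return Severity.SEV3
```

When the answers are uncertain, choosing the higher severity is consistent with the "fail safe" principle; the incident can be downgraded once the blast radius is known.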
Detection and Confirmation
Outages may be detected via:
- Automated alerts
- Provider status pages
- Error spikes
- User reports
Before acting:
- Confirm via at least two independent signals (for example, an automated alert plus the provider status page)
- Identify affected provider
- Determine blast radius
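The two-signal rule above could be checked mechanically. This is a minimal sketch, assuming each detection is recorded as a `Signal` with a source type and a provider name; both the class and the source labels are hypothetical. Repeated signals from the same source count once, so two alerts alone do not confirm an outage.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Signal:
    source: str    # e.g. "alert", "status_page", "error_spike", "user_report"
    provider: str  # e.g. "email", "storage"


def confirmed_providers(signals: Iterable[Signal], min_sources: int = 2) -> Set[str]:
    """Return providers reported by at least `min_sources` distinct signal sources."""
    sources_by_provider = {}
    for signal in signals:
        # A set de-duplicates repeated signals from the same source.
        sources_by_provider.setdefault(signal.provider, set()).add(signal.source)
    return {
        provider
        for provider, sources in sources_by_provider.items()
        if len(sources) >= min_sources
    }
```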
General Response Checklist
- Acknowledge incident
- Identify provider and scope
- Set severity level
- Freeze deployments if needed
- Activate relevant mitigation steps
- Monitor continuously
- Communicate status
- Document actions
Provider-Specific Response
Email Provider Outage (MailGap or Transactional)
Symptoms:
- Inbound emails not received
- Replies failing
- SMTP errors
- Webhook failures
Immediate Actions:
- Pause outbound replies
- Preserve inbound messages if queued
- Do not consume credits for failed sends
- Log all failed attempts
User Impact:
- Show non-destructive error messages
- Do not allow retries that risk duplication
Recovery:
- Resume sending gradually
- Monitor bounces and complaints
- Validate suppression integrity
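The email actions above (pause outbound, no credit consumption on failure, log failures, no duplicate sends) can be sketched as a thin wrapper around the send call. Everything here is an assumption for illustration: `OutboundMailer`, the `send` callable, the idempotency `key`, and the in-memory state stand in for whatever the real service uses.

```python
import logging
from typing import Callable, List, Set

logger = logging.getLogger("outage.email")


class OutboundMailer:
    """Hypothetical wrapper: charge a credit only after a confirmed send."""

    def __init__(self, send: Callable[[str, str], None], credits: int) -> None:
        self._send = send
        self.credits = credits
        self.paused = False           # flipped by the operator during an outage
        self.sent_keys: Set[str] = set()
        self.failed: List[str] = []

    def deliver(self, key: str, body: str) -> str:
        if key in self.sent_keys:
            return "duplicate"        # already delivered; never send twice
        if self.paused:
            return "paused"           # outbound is paused; nothing is attempted
        try:
            self._send(key, body)
        except Exception as exc:
            # Log the failed attempt and leave the credit balance untouched.
            logger.warning("send failed key=%s error=%s", key, exc)
            self.failed.append(key)
            return "failed"
        self.sent_keys.add(key)
        self.credits -= 1             # charged only on success
        return "sent"
```

This sketch treats every exception as a definite non-delivery. A timeout is ambiguous, because the provider may have accepted the message, so a retry after a timeout is only safe if the provider itself de-duplicates on the same key.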
Authentication Provider Outage
Symptoms:
- Login failures
- Session validation errors
Immediate Actions:
- Block new logins
- Preserve existing sessions if valid
- Disable sensitive actions requiring re-auth
User Impact:
- Show authentication unavailable message
Recovery:
- Re-enable login
- Monitor session anomalies
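The authentication rules above amount to a small allow/deny decision. The sketch below assumes that a session can be validated locally from its expiry timestamp, which may not hold for every provider; the action names in `SENSITIVE_ACTIONS` and the function `is_allowed` are hypothetical.

```python
from typing import Optional

# Hypothetical names for actions that normally require re-authentication.
SENSITIVE_ACTIONS = {"change_password", "delete_account"}


def is_allowed(action: str, provider_down: bool,
               session_expires_at: Optional[float], now: float) -> bool:
    """Decide whether an action may proceed while the auth provider is degraded."""
    if action == "login":
        # New logins need the provider, so they are blocked during the outage.
        return not provider_down
    has_valid_session = session_expires_at is not None and session_expires_at > now
    if not has_valid_session:
        return False
    if provider_down and action in SENSITIVE_ACTIONS:
        # Re-authentication is impossible right now, so fail closed.
        return False
    return True
```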
Supabase Postgres Outage
Symptoms:
- Query failures
- RLS errors
- Connection timeouts
Immediate Actions:
- Disable write operations
- Block user actions requiring persistence
- Preserve read-only operations if safe
User Impact:
- Show maintenance or degraded mode message
Recovery:
- Validate data integrity
- Verify RLS enforcement
- Resume writes cautiously
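One way to disable writes while keeping reads available is a read-only switch checked before every write path. This is a sketch under stated assumptions: `DatabaseMode`, the `writes` decorator, and `save_note` are illustrative names, and the dictionary stands in for the real database.

```python
import functools


class ReadOnlyModeError(RuntimeError):
    """Raised when a write is attempted while the database is degraded."""


class DatabaseMode:
    def __init__(self) -> None:
        self.read_only = False  # set to True by the operator during an outage


def writes(mode: DatabaseMode):
    """Decorator marking a function as a write; it is rejected in read-only mode."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if mode.read_only:
                raise ReadOnlyModeError(f"{func.__name__} blocked: database is read-only")
            return func(*args, **kwargs)
        return wrapper
    return decorator


mode = DatabaseMode()


@writes(mode)
def save_note(store: dict, key: str, text: str) -> None:
    store[key] = text  # stand-in for a real INSERT or UPDATE
```

Rejecting the write before it starts, rather than letting it time out, keeps the failure explicit and gives the UI a single error type to map to the degraded-mode message.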
AWS S3 Outage
Symptoms:
- Attachment uploads failing
- Downloads unavailable
Immediate Actions:
- Block uploads
- Preserve metadata creation
- Prevent partial writes
User Impact:
- Disable upload UI
- Allow message viewing without attachments
Recovery:
- Resume uploads
- Validate object consistency
Payment Provider Outage
Symptoms:
- Checkout failures
- Billing webhooks delayed
Immediate Actions:
- Pause subscription changes
- Preserve billing events for replay
- Prevent duplicate charges
User Impact:
- Show billing unavailable message
Recovery:
- Reconcile events
- Validate subscription states
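Preserving billing events for replay without double-applying them can be sketched as a buffer keyed by event ID. The class `BillingEventBuffer`, the `event_id` field, and the in-memory storage are assumptions; a real implementation would need durable storage so events survive a restart.

```python
from collections import OrderedDict
from typing import Callable, Set


class BillingEventBuffer:
    """Hypothetical buffer: hold events during an outage, replay each exactly once."""

    def __init__(self) -> None:
        self._pending: "OrderedDict[str, dict]" = OrderedDict()
        self._applied: Set[str] = set()

    def record(self, event_id: str, payload: dict) -> bool:
        """Store an event once; repeat deliveries of the same ID are ignored."""
        if event_id in self._applied or event_id in self._pending:
            return False
        self._pending[event_id] = payload
        return True

    def replay(self, apply: Callable[[dict], None]) -> int:
        """Apply pending events in arrival order and return how many were applied."""
        applied = 0
        while self._pending:
            event_id, payload = next(iter(self._pending.items()))
            apply(payload)  # an exception here leaves the event pending for a later replay
            del self._pending[event_id]
            self._applied.add(event_id)
            applied += 1
        return applied
```

Because applied IDs are remembered, a webhook that the provider re-delivers after recovery is dropped by `record` instead of producing a second charge or state change.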
Phone Provider Outage
Symptoms:
- Calls or SMS failing
- Webhook delays
Immediate Actions:
- Block call initiation
- Preserve voicemail where possible
- Prevent retries that cause duplication
User Impact:
- Disable phone actions
Recovery:
- Resume gradually
- Monitor fraud guardrails
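"Resume gradually" can be made concrete with a percentage rollout that re-enables the feature for a stable subset of users first. The sketch below is one common approach, not a documented CAIRL mechanism; `in_rollout` and the use of a user ID as the bucketing key are assumptions.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hashing gives each user a stable bucket in 0..99 across processes and restarts.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because each user's bucket is fixed, raising the percentage only adds users and never removes ones already re-enabled. The operator can step the value up (for example 10, then 50, then 100; these steps are illustrative) while watching error rates and fraud guardrails between steps.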
Partner API Dependency Outage
Symptoms:
- External eligibility or verification failing
Immediate Actions:
- Freeze partner calls
- Return safe degraded responses
- Do not count billing events for failed or degraded calls
User Impact:
- Show temporary unavailability message
Recovery:
- Resume calls
- Monitor deduplication and guardrails
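Freezing partner calls and returning a safe degraded response resembles a circuit breaker. The following is a simplified sketch: `PartnerBreaker`, the failure threshold, and the `DEGRADED` response shape with its `billable` flag are all hypothetical, and recovery is manual rather than time-based.

```python
from typing import Callable

# Hypothetical degraded response; it is never marked billable.
DEGRADED = {"status": "unavailable", "billable": False}


class PartnerBreaker:
    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.frozen = False  # set by the breaker itself or manually by an operator

    def call(self, request: Callable[[], dict]) -> dict:
        if self.frozen:
            return dict(DEGRADED)  # no partner call is made while frozen
        try:
            result = request()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.frozen = True
            return dict(DEGRADED)
        self.consecutive_failures = 0
        return {**result, "billable": True}  # only successful calls count for billing

    def resume(self) -> None:
        """Operator action once the partner has recovered."""
        self.frozen = False
        self.consecutive_failures = 0
```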
Communication Guidelines
Internal Communication
- Notify on-call engineering
- Update incident channel
- Record timestamps and decisions
User Communication
User-facing messages MUST:
- Be factual
- Avoid provider names unless required
- Avoid timelines unless confirmed
Example: “Some features are temporarily unavailable. We’re monitoring the situation.”
Observability and Logging
During an outage:
- Increase log sampling if safe
- Track error rates per provider
- Log all mitigation actions
All actions MUST be auditable.
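Tracking per-provider error rates and keeping an auditable record of mitigation actions could look like the sketch below. `OutageLedger`, its field names, and the in-memory list are assumptions; in practice the audit lines would go to an append-only log store.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from typing import List


class OutageLedger:
    """Hypothetical in-memory ledger of provider call outcomes and operator actions."""

    def __init__(self) -> None:
        self._calls = defaultdict(lambda: [0, 0])  # provider -> [errors, total]
        self.audit_lines: List[str] = []

    def record_call(self, provider: str, ok: bool) -> None:
        counts = self._calls[provider]
        counts[0] += 0 if ok else 1
        counts[1] += 1

    def error_rate(self, provider: str) -> float:
        # .get avoids creating an entry for a provider that was never called.
        errors, total = self._calls.get(provider, (0, 0))
        return errors / total if total else 0.0

    def record_action(self, actor: str, action: str, provider: str) -> None:
        """Append one structured, timestamped line per mitigation action."""
        entry = {
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "provider": provider,
        }
        self.audit_lines.append(json.dumps(entry, sort_keys=True))
```

Recording who did what, to which provider, and when, as one line per action gives the postmortem timeline a direct source.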
Recovery and Post-Incident
After recovery:
- Confirm service stability
- Gradually re-enable features
- Validate compliance invariants
- Close incident
- Write postmortem
Postmortem Requirements
A postmortem MUST include:
- Timeline
- Root cause
- Impact assessment
- Mitigations taken
- Preventive actions
Postmortems are blameless.
Non-Negotiable Rules
- Safety overrides availability
- Do not bypass contracts
- Do not disable logging
- Do not speculate publicly
- Every action is documented
References
- docs/architecture/observability-plan.new.md
- docs/contracts/rate-limits-and-abuse.new.md
- docs/contracts/data-retention.new.md
- docs/contracts/rls-standard.new.md
- docs/specs/mailgap/mailgap-proxy-email-service.spec.new.md