
Incident Response Plan

Overview

This runbook defines the incident response process for the Agentix platform. It covers severity classification, detection methods, escalation paths, response procedures, communication templates, and the post-mortem process. The goal is to minimize downtime, communicate clearly with affected users, and learn from every incident to prevent recurrence. Related runbooks are listed in Section 7. Support contact: support@agentixx.io

1. Severity Levels

All incidents are classified into one of four severity levels. Severity determines response time, escalation path, and communication requirements.
| Level | Name | Criteria | Response Time | Examples |
|-------|------|----------|---------------|----------|
| P1 | Critical | Complete service outage or data loss affecting all tenants | 15 minutes | Database down; webhook endpoint unreachable; all workflow executions failing; data corruption |
| P2 | Major | Significant degradation affecting multiple tenants | 1 hour | WhatsApp message delivery delays >5 min; AI node failures >10%; inbox not loading for a subset of users; BullMQ workers stuck |
| P3 | Minor | Limited impact with a workaround available | 4 hours | Single-tenant workflow failure; analytics not tracking; non-critical UI bug affecting one page; CSV export timing out |
| P4 | Low | Cosmetic or improvement-level issue | Next business day | Typo in UI; non-blocking warning in logs; minor styling inconsistency; slow but functional query |

Severity Decision Guide

When in doubt, escalate up one level. It is better to over-classify and de-escalate than to under-classify and miss response windows.
  • Affecting all tenants? -> P1
  • Affecting multiple tenants or a core feature (inbox, builder, webhooks)? -> P2
  • Affecting one tenant or a non-core feature? -> P3
  • No user impact, cosmetic only? -> P4
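The decision guide above can be expressed as a small helper. This is a sketch only; the impact flags and function name are our own, not part of any Agentix API:

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";

// Hypothetical impact summary filled in by the responder during triage.
interface Impact {
  allTenants: boolean;        // every tenant affected
  multiTenantOrCore: boolean; // multiple tenants, or a core feature (inbox, builder, webhooks)
  userVisible: boolean;       // any user-facing impact at all
}

function classify(impact: Impact): Severity {
  if (impact.allTenants) return "P1";
  if (impact.multiTenantOrCore) return "P2";
  if (impact.userVisible) return "P3";
  return "P4"; // cosmetic only; when in doubt, escalate one level up
}
```

Remember the over-classification rule: if two flags seem arguable, set the more severe one and de-escalate later.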

2. Detection

Incidents are discovered through the following channels, grouped as automated and manual:

Automated Alerts

  1. BetterStack uptime monitoring — monitors /health endpoint on the API service. Alerts on downtime or degraded response (>5s latency).
  2. Sentry error spike alerts — 6 alert rules configured across agentix-api and agentix-web projects. Triggers on error count thresholds, new issue types, and error rate spikes.
  3. Railway deployment alerts — notifies on failed builds or crashed services.

Manual Detection

  1. User reports — via support@agentixx.io or in-app feedback popover.
  2. Manual observation — during routine monitoring checks or after deployments.

Initial Assessment Checklist

When an alert fires or a report comes in, run through this checklist to assess scope:
  • Check BetterStack status page: Is the API health endpoint responding? What is the current uptime percentage?
  • Check Sentry dashboards: Are there new errors or error spikes? Which service (API or web)? How many affected users?
  • Check Railway logs: Is the API service running? Are BullMQ workers processing jobs? Any crash loops or OOM kills?
  • Check Vercel deployment status: Is the latest frontend deployment healthy? Any build failures?
  • Check recent deploys: Was there a deployment in the last 2 hours? What changed? (Use git log --oneline -10)
  • Check Redis: Is Redis reachable? Are BullMQ queues backing up? (DBSIZE, INFO clients)
  • Check PostgreSQL: Is the database responding? Connection pool exhaustion? Check Railway PostgreSQL metrics.
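The first checklist item can be partly automated. Below is a minimal probe of the /health endpoint using the same >5s degraded threshold as the BetterStack monitor; the helper names are assumptions, not existing tooling:

```typescript
const DEGRADED_MS = 5_000; // BetterStack flags responses slower than 5s

// Pure classification so the threshold logic is testable in isolation.
function healthStatus(ok: boolean, ms: number): "up" | "degraded" | "down" {
  if (!ok) return "down";
  return ms > DEGRADED_MS ? "degraded" : "up";
}

// Probe the API health endpoint and classify the result.
async function probeHealth(url = "https://api.agentix.app/health") {
  const start = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    return { status: healthStatus(res.ok, Date.now() - start) };
  } catch {
    return { status: "down" as const }; // network error or timeout
  }
}
```

A "degraded" result here means the endpoint answered but slower than the alert threshold; treat it the same as a BetterStack degraded alert when assessing scope.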

3. Escalation Paths

| Severity | First Responder | Escalate To | Escalation Trigger |
|----------|-----------------|-------------|--------------------|
| P1 | On-call engineer | CTO/founder | If not mitigated within 15 minutes |
| P2 | On-call engineer | Team lead | If not mitigated within 1 hour |
| P3 | Assigned engineer | On-call engineer | If blocked or no progress in 4 hours |
| P4 | Assigned engineer | None | No escalation required |

On-Call Definition

For a small team, the on-call engineer is the person who pushed the last deploy to main. They have the most context about recent changes and are best positioned to diagnose regressions.

Escalation Process

  1. Attempt resolution within the response time for the assigned severity level.
  2. If resolution is not possible within the response window: escalate to the next person in the escalation path.
  3. When escalating: provide a brief summary including severity, impact scope, what has been tried, and current hypothesis.
  4. P1 incidents automatically escalate to all available team members regardless of on-call schedule.

4. Incident Response Steps

Follow these steps sequentially for any P1 or P2 incident. For P3/P4, steps 1-2 and 5-7 are sufficient.

Step 1: Acknowledge

  • Confirm the incident is real (not a false positive from monitoring).
  • Assign a severity level using the criteria in Section 1.
  • Create a tracking thread (Slack channel, GitHub issue, or shared doc).
  • Assign a first responder (or claim it yourself).
  • Note the detection time: ____-__-__T__:__:__Z

Step 2: Assess

  • Determine scope: How many tenants are affected? Which services? Which features?
  • Check recent deployments: Was anything deployed in the last 2 hours?
  • Review error logs in Sentry and Railway for root cause clues.
  • Identify the affected component(s): API, web, workers, database, Redis, external services (WhatsApp API, OpenAI).

Step 3: Contain

Choose the appropriate containment strategy:

If deploy-related:
  • Roll back the deployment. See Deployment Runbook — Rollback Procedures for step-by-step instructions.
  • For Vercel: redeploy the previous commit from the Deployments tab.
  • For Railway: redeploy the previous deployment or revert the commit.
If data-related:
  • Pause affected BullMQ queues to prevent further data corruption.
  • Do NOT attempt manual database fixes without a backup. See Database Backup & Restore.
If external service failure (WhatsApp API, OpenAI):
  • Confirm the outage on the provider’s status page.
  • If WhatsApp API is down: workflows will queue messages; no action needed unless queues are filling up.
  • If OpenAI is down: AI nodes will fail with error responses; users see fallback messages.

Step 4: Communicate

  • For P1/P2: Send initial notification to affected users within 30 minutes of detection (use templates in Section 5).
  • For P3/P4: No external communication required unless specifically requested.
  • Update the tracking thread with findings every 30 minutes (P1) or every hour (P2).

Step 5: Resolve

  • Implement the fix (hotfix commit, configuration change, or manual intervention).
  • Verify the fix resolves the root cause, not just the symptom.
  • If a code change is needed: push to main, monitor the deployment, verify in production.

Step 6: Verify

  • Run the health check endpoint: curl https://api.agentix.app/health
  • Confirm Sentry error rate has returned to baseline.
  • Confirm BetterStack shows the service as UP.
  • Monitor for 30 minutes after resolution to ensure the fix holds.
  • If the incident involved message delivery: verify a test message flows end-to-end.
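The health check and the 30-minute watch above can be sketched as a single polling loop. The URL is the endpoint from this runbook; the helper name and cadence wiring are ours:

```typescript
// Post-resolution watch (sketch): poll /health once a minute for the given
// number of minutes and report whether the service stayed healthy throughout.
async function watchAfterFix(
  url = "https://api.agentix.app/health",
  minutes = 30,
): Promise<boolean> {
  for (let i = 0; i < minutes; i++) {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
      if (!res.ok) return false; // regression: the fix did not hold
    } catch {
      return false; // endpoint unreachable
    }
    await new Promise((r) => setTimeout(r, 60_000)); // wait one minute
  }
  return true;
}
```

If this returns false at any point, treat it as a re-opened incident rather than silently retrying.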

Step 7: Close

  • Update the tracking thread with resolution details and timeline.
  • Mark the incident as resolved.
  • For P1/P2: Schedule a post-mortem within 48 hours (see Section 6).
  • For P3/P4: Document the root cause and fix in the tracking thread. No formal post-mortem required.

5. Communication Templates

Use these templates when communicating with affected users about incidents. Send via email from support@agentixx.io.

Template 1: Initial Notification

Subject: [Agentix] Service Disruption — We’re Investigating

We are aware of [issue description] affecting [scope — e.g., message delivery, workflow execution, inbox access]. Our team is actively investigating. We will provide an update within [timeframe — e.g., 30 minutes, 1 hour].

If you have questions, reply to this email or contact us at support@agentixx.io.

— The Agentix Team

Template 2: Status Update

Subject: [Agentix] Update on Service Disruption

Update on [issue]:

Current status: [e.g., We have identified the root cause and are deploying a fix.]

What we’ve done: [e.g., Rolled back the problematic deployment; paused affected queues to prevent further impact.]

We expect to provide the next update in [timeframe].

— The Agentix Team

Template 3: Resolution Notification

Subject: [Agentix] Service Restored

The issue affecting [scope] has been resolved as of [time, e.g., 2026-03-15 14:30 UTC].

Root cause: [Brief, non-technical explanation — e.g., A deployment introduced a configuration error that prevented message processing.]

Impact: [e.g., Message delivery was delayed by approximately 45 minutes for all tenants.]

What we’re doing to prevent this: [e.g., We are adding automated checks to our deployment pipeline to catch this class of error.]

We will publish a detailed post-mortem within [N] business days. We apologize for the disruption. If you notice any remaining issues, please contact us at support@agentixx.io.

— The Agentix Team

Communication Timing

| Severity | Initial Notification | Updates | Resolution |
|----------|----------------------|---------|------------|
| P1 | Within 30 minutes | Every 30 minutes | Immediately on resolution |
| P2 | Within 1 hour | Every hour | Within 1 hour of resolution |
| P3 | Not required | Not required | Not required |
| P4 | Not required | Not required | Not required |

6. Post-Mortem Process

Post-mortems are required for all P1 and P2 incidents and optional for P3 incidents that reveal systemic issues.

Timeline

  1. Schedule the post-mortem within 48 hours of incident resolution.
  2. Complete the write-up within 5 business days of the incident.
  3. Review action items in the next team meeting.

Post-Mortem Template

Store post-mortem documents in docs/runbooks/post-mortems/ with the filename format: YYYY-MM-DD-brief-title.md.
# Post-Mortem: [Incident Title]

**Date:** [YYYY-MM-DD]
**Severity:** [P1/P2]
**Duration:** [detection to resolution, e.g., 2h 15m]
**Author:** [name]

## Impact

- **Tenants affected:** [number or "all"]
- **Messages delayed/lost:** [count or estimate]
- **Feature(s) impacted:** [e.g., workflow execution, inbox, message delivery]
- **Revenue impact:** [if applicable]

## Timeline

All times in UTC.

| Time | Event |
|------|-------|
| HH:MM | [First alert / detection] |
| HH:MM | [Incident acknowledged, severity assigned] |
| HH:MM | [Investigation started] |
| HH:MM | [Root cause identified] |
| HH:MM | [Fix deployed] |
| HH:MM | [Verification complete, incident closed] |

## Root Cause

[Detailed technical explanation of what went wrong.]

### 5 Whys Analysis

1. **Why did the incident occur?** [answer]
2. **Why did [answer 1] happen?** [answer]
3. **Why did [answer 2] happen?** [answer]
4. **Why did [answer 3] happen?** [answer]
5. **Why did [answer 4] happen?** [root cause]

## What Went Well

- [e.g., Alerts fired within 2 minutes of the issue starting]
- [e.g., Rollback procedure worked as documented]

## What Could Be Improved

- [e.g., Detection took 15 minutes because the specific error type was not covered by alerts]
- [e.g., Rollback process required manual steps that could be automated]

## Action Items

| Action | Owner | Deadline | Status |
|--------|-------|----------|--------|
| [e.g., Add alert rule for this error class] | [name] | [date] | [ ] Open |
| [e.g., Automate rollback for this scenario] | [name] | [date] | [ ] Open |
| [e.g., Add integration test for this path] | [name] | [date] | [ ] Open |
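The filename convention above (docs/runbooks/post-mortems/YYYY-MM-DD-brief-title.md) can be generated with a small helper. A sketch only; the function name and slug rules are our own:

```typescript
// Build a post-mortem file path in the documented format:
// docs/runbooks/post-mortems/YYYY-MM-DD-brief-title.md
function postMortemPath(date: Date, title: string): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD (UTC)
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of non-alphanumerics
    .replace(/^-|-$/g, "");      // trim leading/trailing hyphens
  return `docs/runbooks/post-mortems/${day}-${slug}.md`;
}
```

For example, an incident titled "Database Outage" resolved on 2026-03-15 would be written up as docs/runbooks/post-mortems/2026-03-15-database-outage.md.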

Post-Mortem Principles

  • Blameless: Focus on systems and processes, not individuals. “The deploy process allowed a bad configuration” not “Person X deployed bad config.”
  • Action-oriented: Every post-mortem must produce at least one actionable item with an owner and deadline.
  • Shared: Post-mortems are visible to the entire team. Transparency builds trust and collective learning.

7. Runbook Quick Reference

When responding to an incident, use this table to find the relevant runbook for the affected system:
| Scenario | Runbook | Key Section |
|----------|---------|-------------|
| Service outage after deploy | Deployment Runbook | Rollback Procedures |
| Database corruption or data loss | Database Backup & Restore | Manual Restore |
| Redis failure or queue backup | Redis Persistence | Verification Steps |
| Email delivery failure | Email Deliverability Setup | Troubleshooting |
| Monitoring alerts configuration | Uptime Monitoring Setup | Alert Configuration |
| Staging environment issues | Staging Environment Setup | Verification |

Last updated: 2026-03-27