
Incident Response Plan

Overview

This runbook defines the incident response process for the Agentix platform. It covers severity classification, detection methods, escalation paths, response procedures, communication templates, and the post-mortem process. The goal is to minimize downtime, communicate clearly with affected users, and learn from every incident to prevent recurrence. Related runbooks are listed in Section 7. Support contact: support@agentixx.io

1. Severity Levels

All incidents are classified into one of four severity levels. Severity determines response time, escalation path, and communication requirements.
| Level | Name | Criteria | Response Time | Examples |
|-------|------|----------|---------------|----------|
| P1 | Critical | Complete service outage or data loss affecting all tenants | 15 minutes | Database down; webhook endpoint unreachable; all workflow executions failing; data corruption |
| P2 | Major | Significant degradation affecting multiple tenants | 1 hour | WhatsApp message delivery delays >5 min; AI node failures >10%; inbox not loading for a subset of users; BullMQ workers stuck |
| P3 | Minor | Limited impact with a workaround available | 4 hours | Single-tenant workflow failure; analytics not tracking; non-critical UI bug affecting one page; CSV export timing out |
| P4 | Low | Cosmetic or improvement-level issue | Next business day | Typo in UI; non-blocking warning in logs; minor styling inconsistency; slow but functional query |

Severity Decision Guide

When in doubt, escalate up one level. It is better to over-classify and de-escalate than to under-classify and miss response windows.
  • Affecting all tenants? -> P1
  • Affecting multiple tenants or a core feature (inbox, builder, webhooks)? -> P2
  • Affecting one tenant or a non-core feature? -> P3
  • No user impact, cosmetic only? -> P4
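The decision guide above can be expressed as a small helper. This is a sketch only; the impact flags and function name are our own, not part of any Agentix API:

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";

// Hypothetical impact summary filled in by the responder during triage.
interface Impact {
  allTenants: boolean;        // every tenant affected
  multiTenantOrCore: boolean; // multiple tenants, or a core feature (inbox, builder, webhooks)
  userVisible: boolean;       // any user-facing impact at all
}

function classify(impact: Impact): Severity {
  if (impact.allTenants) return "P1";
  if (impact.multiTenantOrCore) return "P2";
  if (impact.userVisible) return "P3";
  return "P4"; // cosmetic only; when in doubt, escalate one level up
}
```

Remember the over-classification rule: if two flags seem arguable, set the more severe one and de-escalate later.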

2. Detection

Incidents are discovered through the following channels, grouped as automated and manual:

Automated Alerts

  1. BetterStack uptime monitoring — monitors /health endpoint on the API service. Alerts on downtime or degraded response (>5s latency).
  2. Sentry error spike alerts — 6 alert rules configured across agentix-api and agentix-web projects. Triggers on error count thresholds, new issue types, and error rate spikes.
  3. Railway deployment alerts — notifies on failed builds or crashed services.

Manual Detection

  1. User reports — via support@agentixx.io or in-app feedback popover.
  2. Manual observation — during routine monitoring checks or after deployments.

Initial Assessment Checklist

When an alert fires or a report comes in, run through this checklist to assess scope:
  • Check BetterStack status page: Is the API health endpoint responding? What is the current uptime percentage?
  • Check Sentry dashboards: Are there new errors or error spikes? Which service (API or web)? How many affected users?
  • Check Railway logs: Is the API service running? Are BullMQ workers processing jobs? Any crash loops or OOM kills?
  • Check Vercel deployment status: Is the latest frontend deployment healthy? Any build failures?
  • Check recent deploys: Was there a deployment in the last 2 hours? What changed? (Use git log --oneline -10)
  • Check Redis: Is Redis reachable? Are BullMQ queues backing up? (DBSIZE, INFO clients)
  • Check PostgreSQL: Is the database responding? Connection pool exhaustion? Check Railway PostgreSQL metrics.
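The first checklist item can be partly automated. Below is a minimal probe of the /health endpoint using the same >5s degraded threshold as the BetterStack monitor; the helper names are assumptions, not existing tooling:

```typescript
const DEGRADED_MS = 5_000; // BetterStack flags responses slower than 5s

// Pure classification so the threshold logic is testable in isolation.
function healthStatus(ok: boolean, ms: number): "up" | "degraded" | "down" {
  if (!ok) return "down";
  return ms > DEGRADED_MS ? "degraded" : "up";
}

// Probe the API health endpoint and classify the result.
async function probeHealth(url = "https://api.agentix.app/health") {
  const start = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    return { status: healthStatus(res.ok, Date.now() - start) };
  } catch {
    return { status: "down" as const }; // network error or timeout
  }
}
```

A "degraded" result here means the endpoint answered but slower than the alert threshold; treat it the same as a BetterStack degraded alert when assessing scope.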

3. Escalation Paths

| Severity | First Responder | Escalate To | Escalation Trigger |
|----------|-----------------|-------------|--------------------|
| P1 | On-call engineer | CTO/founder | If not mitigated within 15 minutes |
| P2 | On-call engineer | Team lead | If not mitigated within 1 hour |
| P3 | Assigned engineer | On-call engineer | If blocked or no progress in 4 hours |
| P4 | Assigned engineer | None | No escalation required |

On-Call Definition

For a small team, the on-call engineer is the person who pushed the last deploy to main. They have the most context about recent changes and are best positioned to diagnose regressions.

Escalation Process

  1. Attempt resolution within the response time for the assigned severity level.
  2. If resolution is not possible within the response window: escalate to the next person in the escalation path.
  3. When escalating: provide a brief summary including severity, impact scope, what has been tried, and current hypothesis.
  4. P1 incidents automatically escalate to all available team members regardless of on-call schedule.

4. Incident Response Steps

Follow these steps sequentially for any P1 or P2 incident. For P3/P4, steps 1-2 and 5-7 are sufficient.

Step 1: Acknowledge

  • Confirm the incident is real (not a false positive from monitoring).
  • Assign a severity level using the criteria in Section 1.
  • Create a tracking thread (Slack channel, GitHub issue, or shared doc).
  • Assign a first responder (or claim it yourself).
  • Note the detection time: ____-__-__T__:__:__Z

Step 2: Assess

  • Determine scope: How many tenants are affected? Which services? Which features?
  • Check recent deployments: Was anything deployed in the last 2 hours?
  • Review error logs in Sentry and Railway for root cause clues.
  • Identify the affected component(s): API, web, workers, database, Redis, external services (WhatsApp API, OpenAI).

Step 3: Contain

Choose the appropriate containment strategy:

If deploy-related:
  • Roll back the deployment. See Deployment Runbook — Rollback Procedures for step-by-step instructions.
  • For Vercel: redeploy the previous commit from the Deployments tab.
  • For Railway: redeploy the previous deployment or revert the commit.
If data-related:
  • Pause affected BullMQ queues to prevent further data corruption.
  • Do NOT attempt manual database fixes without a backup. See Database Backup & Restore.
If external service failure (WhatsApp API, OpenAI):
  • Confirm the outage on the provider’s status page.
  • If WhatsApp API is down: workflows will queue messages; no action needed unless queues are filling up.
  • If OpenAI is down: AI nodes will fail with error responses; users see fallback messages.

Step 4: Communicate

  • For P1/P2: Send initial notification to affected users within 30 minutes of detection (use templates in Section 5).
  • For P3/P4: No external communication required unless specifically requested.
  • Update the tracking thread with findings every 30 minutes (P1) or every hour (P2).

Step 5: Resolve

  • Implement the fix (hotfix commit, configuration change, or manual intervention).
  • Verify the fix resolves the root cause, not just the symptom.
  • If a code change is needed: push to main, monitor the deployment, verify in production.

Step 6: Verify

  • Run the health check endpoint: curl https://api.agentix.app/health
  • Confirm Sentry error rate has returned to baseline.
  • Confirm BetterStack shows the service as UP.
  • Monitor for 30 minutes after resolution to ensure the fix holds.
  • If the incident involved message delivery: verify a test message flows end-to-end.
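The health check and the 30-minute watch above can be sketched as a single polling loop. The URL is the endpoint from this runbook; the helper name and cadence wiring are ours:

```typescript
// Post-resolution watch (sketch): poll /health once a minute for the given
// number of minutes and report whether the service stayed healthy throughout.
async function watchAfterFix(
  url = "https://api.agentix.app/health",
  minutes = 30,
): Promise<boolean> {
  for (let i = 0; i < minutes; i++) {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
      if (!res.ok) return false; // regression: the fix did not hold
    } catch {
      return false; // endpoint unreachable
    }
    await new Promise((r) => setTimeout(r, 60_000)); // wait one minute
  }
  return true;
}
```

If this returns false at any point, treat it as a re-opened incident rather than silently retrying.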

Step 7: Close

  • Update the tracking thread with resolution details and timeline.
  • Mark the incident as resolved.
  • For P1/P2: Schedule a post-mortem within 48 hours (see Section 6).
  • For P3/P4: Document the root cause and fix in the tracking thread. No formal post-mortem required.

5. Communication Templates

Use these templates when communicating with affected users about incidents. Send via email from support@agentixx.io.

Template 1: Initial Notification

Subject: [Agentix] Service Disruption — We’re Investigating

We are aware of [issue description] affecting [scope — e.g., message delivery, workflow execution, inbox access]. Our team is actively investigating. We will provide an update within [timeframe — e.g., 30 minutes, 1 hour].

If you have questions, reply to this email or contact us at support@agentixx.io.

— The Agentix Team

Template 2: Status Update

Subject: [Agentix] Update on Service Disruption

Update on [issue]:

Current status: [e.g., We have identified the root cause and are deploying a fix.]

What we’ve done: [e.g., Rolled back the problematic deployment; paused affected queues to prevent further impact.]

We expect to provide the next update in [timeframe].

— The Agentix Team

Template 3: Resolution Notification

Subject: [Agentix] Service Restored

The issue affecting [scope] has been resolved as of [time, e.g., 2026-03-15 14:30 UTC].

Root cause: [Brief, non-technical explanation — e.g., A deployment introduced a configuration error that prevented message processing.]

Impact: [e.g., Message delivery was delayed by approximately 45 minutes for all tenants.]

What we’re doing to prevent this: [e.g., We are adding automated checks to our deployment pipeline to catch this class of error.]

We will publish a detailed post-mortem within [N] business days. We apologize for the disruption. If you notice any remaining issues, please contact us at support@agentixx.io.

— The Agentix Team

Communication Timing

| Severity | Initial Notification | Updates | Resolution |
|----------|----------------------|---------|------------|
| P1 | Within 30 minutes | Every 30 minutes | Immediately on resolution |
| P2 | Within 1 hour | Every hour | Within 1 hour of resolution |
| P3 | Not required | Not required | Not required |
| P4 | Not required | Not required | Not required |

6. Post-Mortem Process

Post-mortems are required for all P1 and P2 incidents and optional for P3 incidents that reveal systemic issues.

Timeline

  1. Schedule the post-mortem within 48 hours of incident resolution.
  2. Complete the write-up within 5 business days of the incident.
  3. Review action items in the next team meeting.

Post-Mortem Template

Store post-mortem documents in docs/runbooks/post-mortems/ with the filename format: YYYY-MM-DD-brief-title.md.
# Post-Mortem: [Incident Title]

**Date:** [YYYY-MM-DD]
**Severity:** [P1/P2]
**Duration:** [detection to resolution, e.g., 2h 15m]
**Author:** [name]

## Impact

- **Tenants affected:** [number or "all"]
- **Messages delayed/lost:** [count or estimate]
- **Feature(s) impacted:** [e.g., workflow execution, inbox, message delivery]
- **Revenue impact:** [if applicable]

## Timeline

All times in UTC.

| Time | Event |
|------|-------|
| HH:MM | [First alert / detection] |
| HH:MM | [Incident acknowledged, severity assigned] |
| HH:MM | [Investigation started] |
| HH:MM | [Root cause identified] |
| HH:MM | [Fix deployed] |
| HH:MM | [Verification complete, incident closed] |

## Root Cause

[Detailed technical explanation of what went wrong.]

### 5 Whys Analysis

1. **Why did the incident occur?** [answer]
2. **Why did [answer 1] happen?** [answer]
3. **Why did [answer 2] happen?** [answer]
4. **Why did [answer 3] happen?** [answer]
5. **Why did [answer 4] happen?** [root cause]

## What Went Well

- [e.g., Alerts fired within 2 minutes of the issue starting]
- [e.g., Rollback procedure worked as documented]

## What Could Be Improved

- [e.g., Detection took 15 minutes because the specific error type was not covered by alerts]
- [e.g., Rollback process required manual steps that could be automated]

## Action Items

| Action | Owner | Deadline | Status |
|--------|-------|----------|--------|
| [e.g., Add alert rule for this error class] | [name] | [date] | [ ] Open |
| [e.g., Automate rollback for this scenario] | [name] | [date] | [ ] Open |
| [e.g., Add integration test for this path] | [name] | [date] | [ ] Open |
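The filename convention above (docs/runbooks/post-mortems/YYYY-MM-DD-brief-title.md) can be generated with a small helper. A sketch only; the function name and slug rules are our own:

```typescript
// Build a post-mortem file path in the documented format:
// docs/runbooks/post-mortems/YYYY-MM-DD-brief-title.md
function postMortemPath(date: Date, title: string): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD (UTC)
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of non-alphanumerics
    .replace(/^-|-$/g, "");      // trim leading/trailing hyphens
  return `docs/runbooks/post-mortems/${day}-${slug}.md`;
}
```

For example, an incident titled "Database Outage" resolved on 2026-03-15 would be written up as docs/runbooks/post-mortems/2026-03-15-database-outage.md.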

Post-Mortem Principles

  • Blameless: Focus on systems and processes, not individuals. “The deploy process allowed a bad configuration” not “Person X deployed bad config.”
  • Action-oriented: Every post-mortem must produce at least one actionable item with an owner and deadline.
  • Shared: Post-mortems are visible to the entire team. Transparency builds trust and collective learning.

7. Runbook Quick Reference

When responding to an incident, use this table to find the relevant runbook for the affected system:
| Scenario | Runbook | Key Section |
|----------|---------|-------------|
| Service outage after deploy | Deployment Runbook | Rollback Procedures |
| Database corruption or data loss | Database Backup & Restore | Manual Restore |
| Redis failure or queue backup | Redis Persistence | Verification Steps |
| Email delivery failure | Email Deliverability Setup | Troubleshooting |
| Monitoring alerts configuration | Uptime Monitoring Setup | Alert Configuration |
| Staging environment issues | Staging Environment Setup | Verification |

Last updated: 2026-03-27