Incident Response Plan
Overview
This runbook defines the incident response process for the Agentix platform. It covers severity classification, detection methods, escalation paths, response procedures, communication templates, and the post-mortem process. The goal is to minimize downtime, communicate clearly with affected users, and learn from every incident to prevent recurrence.

Related runbooks:
- Deployment Runbook — rollback procedures for deploy-related incidents
- Database Backup & Restore — data recovery procedures
- Redis Persistence — Redis state verification after failures
- Email Deliverability Setup — email delivery troubleshooting
- Uptime Monitoring Setup — alert configuration and monitoring
1. Severity Levels
All incidents are classified into one of four severity levels. Severity determines response time, escalation path, and communication requirements.

| Level | Name | Criteria | Response Time | Example |
|---|---|---|---|---|
| P1 | Critical | Complete service outage or data loss affecting all tenants | 15 minutes | Database down; webhook endpoint unreachable; all workflow executions failing; data corruption |
| P2 | Major | Significant degradation affecting multiple tenants | 1 hour | WhatsApp message delivery delays >5 min; AI node failures >10%; inbox not loading for subset of users; BullMQ workers stuck |
| P3 | Minor | Limited impact with workaround available | 4 hours | Single tenant workflow failure; analytics not tracking; non-critical UI bug affecting one page; CSV export timing out |
| P4 | Low | Cosmetic or improvement-level issue | Next business day | Typo in UI; non-blocking warning in logs; minor styling inconsistency; slow but functional query |
Severity Decision Guide
When in doubt, escalate up one level. It is better to over-classify and de-escalate than to under-classify and miss response windows.

- Affecting all tenants? -> P1
- Affecting multiple tenants or a core feature (inbox, builder, webhooks)? -> P2
- Affecting one tenant or a non-core feature? -> P3
- No user impact, cosmetic only? -> P4
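The decision guide above can be encoded as a small helper for triage scripts. This is a sketch; the function name and scope labels are illustrative, not an official tool:

```shell
#!/usr/bin/env bash
# classify_severity: map an impact scope to a severity level, following
# the decision guide above.
#   $1 = scope: all-tenants | multi-tenant | core-feature | single-tenant | cosmetic
classify_severity() {
  case "$1" in
    all-tenants)               echo "P1" ;;
    multi-tenant|core-feature) echo "P2" ;;
    single-tenant)             echo "P3" ;;
    cosmetic)                  echo "P4" ;;
    # When in doubt, escalate up one level: unknown scopes default to P2.
    *)                         echo "P2" ;;
  esac
}
```

For example, `classify_severity core-feature` prints `P2`.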
2. Detection
Incidents are discovered through the following channels:

Automated Alerts
- BetterStack uptime monitoring — monitors the /health endpoint on the API service. Alerts on downtime or degraded response (>5s latency).
- Sentry error spike alerts — 6 alert rules configured across the agentix-api and agentix-web projects. Triggers on error count thresholds, new issue types, and error rate spikes.
- Railway deployment alerts — notifies on failed builds or crashed services.
Manual Detection
- User reports — via support@agentixx.io or in-app feedback popover.
- Manual observation — during routine monitoring checks or after deployments.
Initial Assessment Checklist
When an alert fires or a report comes in, run through this checklist to assess scope:

- Check BetterStack status page: Is the API health endpoint responding? What is the current uptime percentage?
- Check Sentry dashboards: Are there new errors or error spikes? Which service (API or web)? How many affected users?
- Check Railway logs: Is the API service running? Are BullMQ workers processing jobs? Any crash loops or OOM kills?
- Check Vercel deployment status: Is the latest frontend deployment healthy? Any build failures?
- Check recent deploys: Was there a deployment in the last 2 hours? What changed? (Use git log --oneline -10)
- Check Redis: Is Redis reachable? Are BullMQ queues backing up? (DBSIZE, INFO clients)
- Check PostgreSQL: Is the database responding? Connection pool exhaustion? Check Railway PostgreSQL metrics.
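The first checklist item can be automated with a plain curl probe. This is a sketch: the 5-second threshold mirrors the BetterStack degraded-latency rule described above, and the URL is whatever you pass in:

```shell
#!/usr/bin/env bash
# check_health: probe a health endpoint and report UP / DEGRADED / DOWN.
check_health() {
  local url="$1" secs
  # %{time_total} is curl's total transfer time in seconds.
  if secs=$(curl -fsS -o /dev/null --max-time 10 -w '%{time_total}' "$url" 2>/dev/null); then
    # Anything over 5s counts as degraded, matching the alert rule above.
    if awk -v t="$secs" 'BEGIN { exit !(t > 5) }'; then
      echo "DEGRADED (${secs}s)"
    else
      echo "UP (${secs}s)"
    fi
  else
    echo "DOWN"
  fi
}
```

Usage: `check_health https://api.agentix.app/health`.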
3. Escalation Paths
| Severity | First Responder | Escalate To | Escalation Trigger |
|---|---|---|---|
| P1 | On-call engineer | CTO/founder | If not mitigated within 15 minutes |
| P2 | On-call engineer | Team lead | If not mitigated within 1 hour |
| P3 | Assigned engineer | On-call engineer | If blocked or no progress in 4 hours |
| P4 | Assigned engineer | None | No escalation required |
On-Call Definition
For a small team, the on-call engineer is the person who pushed the last deploy to main. They have the most context about recent changes and are best positioned to diagnose regressions.
Escalation Process
- Attempt resolution within the response time for the assigned severity level.
- If resolution is not possible within the response window: escalate to the next person in the escalation path.
- When escalating: provide a brief summary including severity, impact scope, what has been tried, and current hypothesis.
- P1 incidents automatically escalate to all available team members regardless of on-call schedule.
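The response windows from the escalation table can be encoded in a helper that tells the responder whether an unmitigated incident is overdue for escalation. A sketch; the minute values come from Section 1, and "next business day" is approximated as 24 hours:

```shell
#!/usr/bin/env bash
# response_window_minutes: response time per severity, from the table above.
response_window_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo 1440 ;;  # next business day, approximated as 24h
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# should_escalate SEVERITY ELAPSED_MINUTES -> prints "escalate" or "hold"
should_escalate() {
  local window
  window=$(response_window_minutes "$1") || return 1
  if [ "$2" -ge "$window" ]; then
    echo "escalate"
  else
    echo "hold"
  fi
}
```

For example, `should_escalate P1 20` prints `escalate` because the P1 window is 15 minutes.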
4. Incident Response Steps
Follow these steps sequentially for any P1 or P2 incident. For P3/P4, steps 1-2 and 5-7 are sufficient.

Step 1: Acknowledge
- Confirm the incident is real (not a false positive from monitoring).
- Assign a severity level using the criteria in Section 1.
- Create a tracking thread (Slack channel, GitHub issue, or shared doc).
- Assign a first responder (or claim it yourself).
- Note the detection time: ____-__-__T__:__:__Z
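The blank above expects an ISO 8601 UTC timestamp, which `date` can produce directly:

```shell
# Print the current time in the ____-__-__T__:__:__Z format used above
# (ISO 8601, UTC).
date -u +"%Y-%m-%dT%H:%M:%SZ"
```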
Step 2: Assess
- Determine scope: How many tenants are affected? Which services? Which features?
- Check recent deployments: Was anything deployed in the last 2 hours?
- Review error logs in Sentry and Railway for root cause clues.
- Identify the affected component(s): API, web, workers, database, Redis, external services (WhatsApp API, OpenAI).
Step 3: Contain
Choose the appropriate containment strategy:

If deploy-related:
- Roll back the deployment. See Deployment Runbook — Rollback Procedures for step-by-step instructions.
- For Vercel: redeploy the previous commit from the Deployments tab.
- For Railway: redeploy the previous deployment or revert the commit.

If data-related:
- Pause affected BullMQ queues to prevent further data corruption.
- Do NOT attempt manual database fixes without a backup. See Database Backup & Restore.

If caused by an external service:
- Confirm the outage on the provider’s status page.
- If WhatsApp API is down: workflows will queue messages; no action needed unless queues are filling up.
- If OpenAI is down: AI nodes will fail with error responses; users see fallback messages.
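To judge whether queues are filling up, the waiting-job depth can be read straight from Redis. This is a sketch: the `bull:<queue>:wait` key layout is BullMQ's default prefix, the queue name is illustrative, and `REDIS_CLI` is parameterized so the check can be pointed at any Redis instance:

```shell
#!/usr/bin/env bash
# queue_backlog: report the waiting-job count for a BullMQ queue.
# Override REDIS_CLI to target a specific host, e.g.
# REDIS_CLI="redis-cli -h my-redis-host".
REDIS_CLI=${REDIS_CLI:-redis-cli}

queue_backlog() {
  local queue="$1" depth
  # LLEN on the waiting list gives the backlog for this queue.
  depth=$($REDIS_CLI LLEN "bull:${queue}:wait") || { echo "redis unreachable" >&2; return 1; }
  echo "${queue}: ${depth} waiting"
}
```

Usage: `queue_backlog workflows` (the `workflows` queue name is a placeholder; use your actual queue names).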
Step 4: Communicate
- For P1/P2: Send initial notification to affected users within 30 minutes of detection (use templates in Section 5).
- For P3/P4: No external communication required unless specifically requested.
- Update the tracking thread with findings every 30 minutes (P1) or every hour (P2).
Step 5: Resolve
- Implement the fix (hotfix commit, configuration change, or manual intervention).
- Verify the fix resolves the root cause, not just the symptom.
- If a code change is needed: push to main, monitor the deployment, verify in production.
Step 6: Verify
- Run the health check endpoint: curl https://api.agentix.app/health
- Confirm Sentry error rate has returned to baseline.
- Confirm BetterStack shows the service as UP.
- Monitor for 30 minutes after resolution to ensure the fix holds.
- If the incident involved message delivery: verify a test message flows end-to-end.
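The health check and the 30-minute hold can be combined into one monitoring loop. A sketch using plain curl, with attempts, interval, and URL as parameters (nothing here is official tooling):

```shell
#!/usr/bin/env bash
# monitor_health URL ATTEMPTS INTERVAL_SECONDS
# Probes the health endpoint repeatedly and reports how many probes
# failed; returns nonzero if any probe failed. For the 30-minute hold
# described above you might run:
#   monitor_health https://api.agentix.app/health 30 60
monitor_health() {
  local url="$1" attempts="$2" interval="$3" i=1 failures=0
  while [ "$i" -le "$attempts" ]; do
    if ! curl -fsS -o /dev/null --max-time 10 "$url" 2>/dev/null; then
      failures=$((failures + 1))
    fi
    # Skip the final sleep; there is no probe after it.
    if [ "$i" -lt "$attempts" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
  echo "${failures}/${attempts} probes failed"
  [ "$failures" -eq 0 ]
}
```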
Step 7: Close
- Update the tracking thread with resolution details and timeline.
- Mark the incident as resolved.
- For P1/P2: Schedule a post-mortem within 48 hours (see Section 6).
- For P3/P4: Document the root cause and fix in the tracking thread. No formal post-mortem required.
5. Communication Templates
Use these templates when communicating with affected users about incidents. Send via email from support@agentixx.io.

Template 1: Initial Notification
Subject: [Agentix] Service Disruption — We’re Investigating

We are aware of [issue description] affecting [scope — e.g., message delivery, workflow execution, inbox access]. Our team is actively investigating.

We will provide an update within [timeframe — e.g., 30 minutes, 1 hour].

If you have questions, reply to this email or contact us at support@agentixx.io.

— The Agentix Team
Template 2: Status Update
Subject: [Agentix] Update on Service Disruption

Update on [issue]:

Current status: [e.g., We have identified the root cause and are deploying a fix.]

What we’ve done: [e.g., Rolled back the problematic deployment; paused affected queues to prevent further impact.]

We expect to provide the next update in [timeframe].

— The Agentix Team
Template 3: Resolution Notification
Subject: [Agentix] Service Restored

The issue affecting [scope] has been resolved as of [time, e.g., 2026-03-15 14:30 UTC].

Root cause: [Brief, non-technical explanation — e.g., A deployment introduced a configuration error that prevented message processing.]

Impact: [e.g., Message delivery was delayed by approximately 45 minutes for all tenants.]

What we’re doing to prevent this: [e.g., We are adding automated checks to our deployment pipeline to catch this class of error.]

We will publish a detailed post-mortem within [N] business days. We apologize for the disruption. If you notice any remaining issues, please contact us at support@agentixx.io.

— The Agentix Team
Communication Timing
| Severity | Initial Notification | Updates | Resolution |
|---|---|---|---|
| P1 | Within 30 minutes | Every 30 minutes | Immediately on resolution |
| P2 | Within 1 hour | Every hour | Within 1 hour of resolution |
| P3 | Not required | Not required | Not required |
| P4 | Not required | Not required | Not required |
6. Post-Mortem Process
Post-mortems are required for all P1 and P2 incidents and optional for P3 incidents that reveal systemic issues.

Timeline
- Schedule the post-mortem within 48 hours of incident resolution.
- Complete the write-up within 5 business days of the incident.
- Review action items in the next team meeting.
Post-Mortem Template
Store post-mortem documents in docs/runbooks/post-mortems/ with the filename format: YYYY-MM-DD-brief-title.md.
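A filename in that format can be generated from the incident title. The slug rules here (lowercase, non-alphanumerics collapsed to dashes) are a convention choice beyond the format string given above:

```shell
#!/usr/bin/env bash
# postmortem_filename "Brief Title" -> YYYY-MM-DD-brief-title.md
# Date is today in UTC.
postmortem_filename() {
  local slug
  slug=$(printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+|-+$//g')
  printf '%s-%s.md\n' "$(date -u +%F)" "$slug"
}
```

For example, `postmortem_filename "Redis Queue Backup!"` yields something like `2026-03-15-redis-queue-backup.md`.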
Post-Mortem Principles
- Blameless: Focus on systems and processes, not individuals. “The deploy process allowed a bad configuration” not “Person X deployed bad config.”
- Action-oriented: Every post-mortem must produce at least one actionable item with an owner and deadline.
- Shared: Post-mortems are visible to the entire team. Transparency builds trust and collective learning.
7. Runbook Quick Reference
When responding to an incident, use this table to find the relevant runbook for the affected system:

| Scenario | Runbook | Key Section |
|---|---|---|
| Service outage after deploy | Deployment Runbook | Rollback Procedures |
| Database corruption or data loss | Database Backup & Restore | Manual Restore |
| Redis failure or queue backup | Redis Persistence | Verification Steps |
| Email delivery failure | Email Deliverability Setup | Troubleshooting |
| Monitoring alerts configuration | Uptime Monitoring Setup | Alert Configuration |
| Staging environment issues | Staging Environment Setup | Verification |
Last updated: 2026-03-27