Uptime Monitoring Setup — BetterStack External Health Checks
1. Overview
Current setup: The Agentix API exposes a /health endpoint that verifies database (PostgreSQL) and Redis connectivity, returning HTTP 200 when healthy and HTTP 503 when either dependency is down.
Target: External uptime monitoring via BetterStack that pings /health every 60 seconds and alerts the team via email (and optionally Slack) when the service is degraded.
Why this matters:
- Internal health checks only help if the server is reachable — external monitoring catches network, DNS, and infrastructure failures
- 60-second intervals ensure issues are detected within 1-2 minutes
- Automated alerts reduce mean-time-to-detection (MTTD) from hours to minutes
- A public status page builds trust with tenants
/health endpoint checks:
- PostgreSQL connectivity (runs a lightweight query)
- Redis connectivity (sends a PING command)
- Returns 200 OK with { "status": "healthy" } if both pass
- Returns 503 Service Unavailable with { "status": "unhealthy", "details": {...} } if either fails
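The mapping from dependency checks to the HTTP response can be sketched as below. This is a minimal illustration, not the code in apps/api/src/routes/health.ts: the names `CheckResults` and `buildHealthResponse` are hypothetical, and the shape of `details` is an assumption because the real payload is not shown here.

```typescript
// Hypothetical per-dependency results; the real "details" payload may differ.
type DependencyName = "postgres" | "redis";
type CheckResults = Record<DependencyName, boolean>;

interface HealthResponse {
  statusCode: 200 | 503;
  body:
    | { status: "healthy" }
    | { status: "unhealthy"; details: Record<DependencyName, "up" | "down"> };
}

// Returns 200 only when every dependency check passed, otherwise 503
// with a per-dependency breakdown (assumed format).
function buildHealthResponse(checks: CheckResults): HealthResponse {
  if (checks.postgres && checks.redis) {
    return { statusCode: 200, body: { status: "healthy" } };
  }
  return {
    statusCode: 503,
    body: {
      status: "unhealthy",
      details: {
        postgres: checks.postgres ? "up" : "down",
        redis: checks.redis ? "up" : "down",
      },
    },
  };
}
```

Because the monitor only looks at the status code, a single failing dependency is enough to mark the service as down.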
2. Prerequisites
- BetterStack account — sign up at https://betterstack.com (free tier includes 5 monitors with 3-minute checks; upgrade for 60-second intervals)
- Production API URL (e.g., https://api.agentix.app)
- Team email addresses for alert recipients
3. Step 1 — Create a Monitor
- Sign into the BetterStack dashboard
- Navigate to Monitors in the left sidebar
- Click Create Monitor
- Configure the monitor settings:
| Setting | Value |
|---|---|
| Monitor type | HTTP(s) |
| URL | https://api.agentix.app/health (substitute your actual production URL) |
| Check frequency | Every 60 seconds |
| Request method | GET |
| Expected status code | 200 |
| Confirmation period | 2 checks (waits for 2 consecutive failures before alerting — avoids false alarms on transient blips) |
| Request timeout | 10 seconds |
| Monitor name | Agentix API — Health (or any descriptive name) |
- Click Save to create the monitor
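To see how the check frequency, confirmation period, and timeout interact, the sketch below models the alerting rule from the table. It is a simplified model under stated assumptions, not BetterStack's actual scheduler: checks are assumed to start on a fixed interval, and the function names `shouldAlert` and `worstCaseDetectionSeconds` are hypothetical.

```typescript
// Returns true once the most recent `confirmationChecks` results are all failures.
// `results` is ordered oldest to newest; true means the check passed.
function shouldAlert(results: boolean[], confirmationChecks: number): boolean {
  if (confirmationChecks < 1 || results.length < confirmationChecks) {
    return false;
  }
  return results.slice(-confirmationChecks).every((passed) => !passed);
}

// Rough upper bound on the time from outage start to alert. The outage may
// begin just after a successful check, so up to one full interval passes
// before the first failure; the remaining confirmations follow one interval
// apart, and the last check may take the full timeout to fail.
function worstCaseDetectionSeconds(
  intervalSeconds: number,
  confirmationChecks: number,
  timeoutSeconds: number,
): number {
  return intervalSeconds * confirmationChecks + timeoutSeconds;
}

console.log(shouldAlert([true, false, false], 2)); // true: two consecutive failures
console.log(shouldAlert([false, true, false], 2)); // false: a transient blip
console.log(worstCaseDetectionSeconds(60, 2, 10)); // 130
```

Under these assumptions, the settings above surface an outage in roughly two minutes, while a single failed check never triggers an alert.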
4. Step 2 — Configure Email Alerts
BetterStack sends alerts to people added to your escalation policy.
- Navigate to On-call > People in the left sidebar
- Click Invite team member and add each recipient’s email address
- Navigate to On-call > Escalation policies
- Edit the default escalation policy (or create a new one):
  - Step 1: Notify the team immediately on incident creation
  - Add all relevant team members
- Return to your monitor and verify the escalation policy is assigned
- BetterStack sends a welcome email when you invite team members
- If no welcome email arrives, check spam/junk folders and verify the email address
5. Step 3 — Configure Slack Alerts (Optional)
For faster response times, add Slack notifications alongside email.
- Navigate to Integrations in the left sidebar
- Find Slack and click Connect
- Authorize BetterStack to post to your Slack workspace
- Select the channel for alerts (e.g., #ops-alerts or #engineering)
- Return to On-call > Escalation policies
- Add a Slack notification step to your escalation policy:
  - Step 1: Notify via Slack channel immediately
  - Step 2: Notify team members via email (if not acknowledged within 5 minutes)
6. Step 4 — Create a Status Page (Optional)
A public status page communicates uptime to tenants without them needing to contact support.
- Navigate to Status pages in the left sidebar
- Click Create status page
- Configure:
  - Name: Agentix Status
  - Subdomain: status.agentix.app (or use BetterStack’s default subdomain)
  - Resources: Add the Agentix API — Health monitor
- Click Save
- Share the status page URL with tenants or link it from the product
- Add a CNAME record in your DNS pointing status.agentix.app to BetterStack’s status page domain
- Configure the custom domain in BetterStack’s status page settings
7. Verification
After creating the monitor, verify everything is working:
- Wait 2-3 minutes for the first few checks to complete
- In the BetterStack dashboard, confirm the monitor shows Up status with a green indicator
- Check that the response time graph is populating
- Temporarily change the monitor URL to a non-existent path (e.g., https://api.agentix.app/health-test-invalid)
- Wait for 2 check cycles (about 2 minutes at 60-second intervals, about 6 minutes at 3-minute intervals on the free tier)
- Confirm an alert email arrives (check spam if not in inbox)
- Confirm Slack notification arrives (if configured)
- Immediately revert the monitor URL back to https://api.agentix.app/health
- Confirm the monitor recovers and shows Up status
- Confirm a recovery notification is sent
8. Verification Checklist
- Monitor exists in BetterStack dashboard with Up status
- Check frequency is set to 60 seconds (or 3 minutes on free tier)
- Expected status code is 200
- Confirmation period is 2 checks
- Request timeout is 10 seconds
- At least one team member is configured in the escalation policy
- Test alert was received via email
- (Optional) Slack integration is connected and test alert received
- (Optional) Status page is created and accessible
9. Troubleshooting
Monitor shows “Down” but the app works in browser
- CORS or auth blocking: The /health endpoint should not require authentication or set CORS restrictions. Verify by requesting the endpoint without credentials from a machine outside your network; the expected status code is 200.
- Firewall or WAF: If using Cloudflare or another WAF, ensure BetterStack’s IP ranges are not blocked. BetterStack publishes their monitoring IP ranges in their documentation.
- DNS resolution: The monitor URL must be publicly resolvable. If the API is behind a private network, external monitoring cannot reach it.
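One way to reproduce what the external monitor sees is an unauthenticated GET with the same 10-second timeout. The sketch below is an assumption-laden illustration: `probeHealth`, `ProbeResult`, and `FetchLike` are hypothetical names, and it relies on the global `fetch` and `AbortController` available in Node.js 18 or later.

```typescript
// Minimal fetch signature used here, so a stub can replace fetch in tests.
type FetchLike = (
  url: string,
  init?: { method?: string; signal?: AbortSignal },
) => Promise<{ status: number }>;

interface ProbeResult {
  reachable: boolean;
  statusCode: number | null;
  elapsedMs: number;
}

// Sends a GET with no Authorization header or cookies, which mirrors
// what an external monitor sends, and aborts after `timeoutMs`.
async function probeHealth(
  url: string,
  timeoutMs = 10_000,
  fetchImpl: FetchLike = fetch,
): Promise<ProbeResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const started = Date.now();
  try {
    const response = await fetchImpl(url, { method: "GET", signal: controller.signal });
    return { reachable: true, statusCode: response.status, elapsedMs: Date.now() - started };
  } catch {
    // Network error, DNS failure, or timeout: the monitor would report "Down".
    return { reachable: false, statusCode: null, elapsedMs: Date.now() - started };
  } finally {
    clearTimeout(timer);
  }
}
```

Calling `probeHealth("https://api.agentix.app/health")` from a machine outside your network should resolve with `statusCode` 200. A 401 or 403 points to auth or WAF rules, `reachable: false` points to DNS, firewall, or timeout problems, and a high `elapsedMs` hints at cold-start latency.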
Alerts not arriving
- Email: Check spam/junk folders. Verify the email address in On-call > People. Ensure the escalation policy is assigned to the monitor.
- Slack: Verify the Slack integration is still authorized (tokens can expire). Reconnect if needed.
- Escalation policy: Ensure the policy has at least one active step with team members assigned.
False alarms (intermittent “Down” alerts)
- Increase the confirmation period from 2 to 3 checks
- Increase the request timeout from 10 to 15 seconds
- Check if the API has cold-start latency (Railway sleeps inactive services on some plans)
Health endpoint returns 503
The /health endpoint returns 503 when PostgreSQL or Redis is unreachable. This is a real issue that requires investigation:
- Check Railway dashboard for database/Redis service status
- Check PostgreSQL connection limits (max_connections)
- Check Redis memory usage and eviction policy
- Review API logs in Railway for connection errors
10. Ongoing Maintenance
- Review monthly: Check the uptime percentage in BetterStack dashboard. Aim for 99.9%+ uptime.
- DMARC upgrade path: None needed — BetterStack alerts come from BetterStack’s own domain.
- Escalation policy updates: When team members join or leave, update On-call > People and any escalation policies that reference them.
- Monitor updates: If the API URL changes (e.g., domain migration), update the monitor URL immediately.
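For the monthly review, it helps to translate the 99.9% target into a downtime budget. The sketch below is plain arithmetic; the function names are hypothetical and the 30-day window is an assumption about how the review period is defined.

```typescript
// Minutes of downtime permitted in a window while still meeting the target.
// `uptimeTarget` is a fraction, e.g. 0.999 for 99.9%.
function allowedDowntimeMinutes(uptimeTarget: number, windowDays: number): number {
  if (uptimeTarget < 0 || uptimeTarget > 1) {
    throw new RangeError("uptimeTarget must be between 0 and 1");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - uptimeTarget) * windowMinutes;
}

// Uptime fraction implied by the total downtime recorded in the window.
function observedUptime(downtimeMinutes: number, windowDays: number): number {
  return 1 - downtimeMinutes / (windowDays * 24 * 60);
}

console.log(allowedDowntimeMinutes(0.999, 30).toFixed(1)); // "43.2"
```

In other words, a 99.9% target over 30 days leaves about 43 minutes of total downtime, so a single hour-long incident already misses it.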
References
- BetterStack Uptime Documentation
- BetterStack Monitoring IP Ranges
- BetterStack Status Pages
- Code reference: apps/api/src/routes/health.ts (health endpoint implementation)
- Code reference: apps/api/src/index.ts (health route registration)