Rapid incident classification, severity assessment, and response coordination.
Install via CLI
openskills install jmagly/ai-writing-guide# incident-triage
Rapid incident classification, severity assessment, and response coordination.
## Triggers
- "production incident"
- "system is down"
- "critical issue"
- "triage incident"
- "incident severity"
- "outage"
- "P0" / "P1" / "SEV1"
## Purpose
This skill provides rapid incident response coordination by:
- Classifying incident type and severity
- Assembling response team
- Coordinating initial response actions
- Tracking timeline and status
- Facilitating communication
- Preparing post-incident review
## Behavior
When triggered, this skill:
1. **Gathers incident details**:
- What is happening?
- When did it start?
- Who/what is affected?
- What changed recently?
2. **Classifies severity**:
- Assess customer impact
- Determine scope
- Assign severity level
- Calculate business impact
3. **Assembles response team**:
- Identify required responders
- Notify on-call personnel
- Establish incident commander
4. **Initiates response**:
- Create incident channel/bridge
- Start timeline documentation
- Coordinate initial diagnosis
5. **Manages communication**:
- Internal status updates
- Customer communication (if needed)
- Executive notifications (for high severity)
6. **Tracks resolution**:
- Document actions taken
- Track mitigation progress
- Confirm resolution
- Schedule post-incident review
## Severity Levels
### SEV1 / P0 - Critical
```yaml
sev1:
name: Critical
alias: [P0, SEV1, Critical]
criteria:
- Complete service outage
- Data loss or corruption
- Security breach
- >50% customers affected
- Revenue-impacting
response:
response_time: 15 minutes
update_frequency: 15 minutes
executive_notification: immediate
customer_communication: within 30 minutes
escalation:
- incident_commander: required
- engineering_manager: required
- vp_engineering: within 30 minutes
- cto: within 1 hour (if unresolved)
target_resolution: 4 hours
```
### SEV2 / P1 - High
```yaml
sev2:
name: High
alias: [P1, SEV2, High]
criteria:
- Major feature unavailable
- Significant degradation
- 10-50% customers affected
- Workaround exists but painful
response:
response_time: 30 minutes
update_frequency: 30 minutes
executive_notification: within 1 hour
customer_communication: within 2 hours (if extended)
escalation:
- incident_commander: required
- engineering_manager: within 1 hour
target_resolution: 8 hours
```
### SEV3 / P2 - Medium
```yaml
sev3:
name: Medium
alias: [P2, SEV3, Medium]
criteria:
- Feature partially degraded
- <10% customers affected
- Workaround available
- Non-critical path affected
response:
response_time: 2 hours
update_frequency: 2 hours
executive_notification: daily summary
customer_communication: as needed
escalation:
- team_lead: within 4 hours
target_resolution: 24 hours
```
### SEV4 / P3 - Low
```yaml
sev4:
name: Low
alias: [P3, SEV4, Low]
criteria:
- Minor issue
- Cosmetic problem
- Edge case affected
- Easy workaround
response:
response_time: next business day
update_frequency: daily
executive_notification: weekly summary
escalation: standard ticket flow
target_resolution: 1 week
```
## Incident Response Flow
```
┌─────────────────────────────────────────────────────────────┐
│ 1. DETECTION & TRIAGE │
│ • Alert received or issue reported │
│ • Gather initial details │
│ • Classify severity │
│ • Create incident record │
│ • Time: <15 minutes │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. MOBILIZATION │
│ • Page on-call responders │
│ • Establish incident commander │
│ • Create communication channel │
│ • Notify stakeholders per severity │
│ • Time: <5 minutes after triage │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. INVESTIGATION │
│ • Review recent changes │
│ • Check monitoring/logs │
│ • Identify affected components │
│ • Form hypothesis │
│ • Time: ongoing, status updates per SLA │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. MITIGATION │
│ • Implement workaround if available │
│ • Rollback if change-related │
│ • Scale resources if capacity issue │
│ • Isolate affected components │
│ • Goal: Reduce customer impact │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. RESOLUTION │
│ • Implement permanent fix │
│ • Verify fix is effective │
│ • Monitor for recurrence │
│ • Update status to resolved │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 6. POST-INCIDENT │
│ • Schedule post-incident review │
│ • Document timeline and actions │
│ • Identify root cause │
│ • Create follow-up action items │
│ • Update runbooks/documentation │
└─────────────────────────────────────────────────────────────┘
```
## Incident Record Format
```markdown
# Incident Report: INC-2025-001234
## Summary
| Field | Value |
|-------|-------|
| Title | Database connection pool exhaustion |
| Severity | SEV1 (Critical) |
| Status | Resolved |
| Start Time | 2025-12-08 14:32 UTC |
| Detected | 2025-12-08 14:35 UTC |
| Resolved | 2025-12-08 15:47 UTC |
| Duration | 1h 15m |
| Impact | 100% of API requests failing |
| Customers Affected | ~45,000 |
## Incident Commander
**Name**: Sarah Chen
**Role**: Senior SRE
## Response Team
| Role | Name | Joined |
|------|------|--------|
| Incident Commander | Sarah Chen | 14:38 |
| Backend Lead | David Kim | 14:40 |
| DBA | Elena Rodriguez | 14:45 |
| Comms Lead | James Wilson | 14:50 |
## Impact Assessment
### Customer Impact
- **Scope**: All customers using web and mobile apps
- **Severity**: Complete service outage
- **Duration**: 1h 15m
- **Affected Features**: All authenticated features
### Business Impact
- **Revenue Loss**: Estimated $XX,XXX
- **SLA Breach**: Yes (99.9% monthly target affected)
- **Customer Complaints**: 127 support tickets
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:32 | First customer reports of errors |
| 14:35 | PagerDuty alert for 5xx spike |
| 14:38 | Incident declared, Sarah Chen IC |
| 14:40 | Investigation begins |
| 14:45 | Identified: DB connection pool exhausted |
| 14:52 | Root cause: Runaway query from batch job |
| 15:00 | Mitigation: Batch job killed |
| 15:10 | Connection pool recovering |
| 15:30 | 50% traffic restored |
| 15:47 | Full service restored |
| 15:50 | Monitoring confirms stable |
| 16:00 | Incident closed |
## Root Cause
**Summary**: A scheduled batch job contained an inefficient query that held database connections indefinitely, exhausting the connection pool.
**Details**:
- Batch job deployed at 14:00 with new query
- Query had missing index, causing full table scan
- Each scan held connection for 30+ seconds
- 100 concurrent requests × 30s = pool exhausted
- New requests could not get connections → 5xx errors
**Contributing Factors**:
1. Missing index migration in batch job deploy
2. No query timeout configured
3. Connection pool size not tuned for load
4. Batch job ran during peak hours
## Resolution
**Immediate Actions**:
1. Killed runaway batch job
2. Restarted application servers to reset connections
3. Verified service restoration
**Permanent Fixes** (follow-ups):
- [ ] Add missing index (INC-001-01)
- [ ] Configure query timeouts (INC-001-02)
- [ ] Increase connection pool size (INC-001-03)
- [ ] Move batch jobs to off-peak hours (INC-001-04)
- [ ] Add connection pool monitoring alerts (INC-001-05)
## Communication Log
| Time | Channel | Message |
|------|---------|---------|
| 14:45 | #incident-2025-001234 | Incident declared, investigating API failures |
| 15:00 | Status Page | Investigating service disruption |
| 15:15 | Status Page | Identified cause, implementing fix |
| 15:30 | #incident-2025-001234 | Service recovering, 50% restored |
| 15:50 | Status Page | Service fully restored |
| 16:00 | Email to customers | Incident resolved, apology + explanation |
## Post-Incident Review
**Scheduled**: 2025-12-10 10:00 UTC
**Attendees**: Response team + Engineering Manager
**Document**: .aiwg/incidents/INC-2025-001234-pir.md
## Lessons Learned
### What Went Well
- Fast detection (3 minutes)
- Clear incident commander
- Good team coordination
### What Could Improve
- Batch job should have been tested with prod-like data
- Missing connection pool alerts
- No query timeout safety net
## Action Items
| ID | Action | Owner | Due | Status |
|----|--------|-------|-----|--------|
| INC-001-01 | Add missing index | Elena | 2025-12-09 | Done |
| INC-001-02 | Configure query timeouts | David | 2025-12-10 | In Progress |
| INC-001-03 | Tune connection pool | Sarah | 2025-12-12 | Planned |
| INC-001-04 | Reschedule batch jobs | David | 2025-12-15 | Planned |
| INC-001-05 | Add pool monitoring | Sarah | 2025-12-12 | Planned |
```
## Usage Examples
### Declare Incident
```
User: "Production is down, API returning 500 errors"
Skill initiates:
1. Gather details
2. Classify severity
3. Create incident record
4. Notify responders
Output:
"🚨 INCIDENT DECLARED: INC-2025-001234
Severity: SEV1 (Critical) - Complete API failure
Impact: All customers affected
Immediate Actions:
1. ✅ Incident record created
2. ✅ Paging on-call: Sarah Chen (SRE)
3. ✅ Slack channel: #incident-2025-001234
4. ⏳ Assembling response team
You are Incident Commander until relieved.
Quick Checklist:
- [ ] What changed in last 2 hours?
- [ ] Check: logs, metrics, recent deploys
- [ ] Status page update needed?
Update command: 'incident update <message>'"
```
### Severity Assessment
```
User: "Is this SEV1 or SEV2? Users can't checkout but browsing works"
Skill evaluates:
- Core function affected: Yes (checkout = revenue)
- Workaround: No (can't complete purchase)
- Customer %: All attempting checkout
- Revenue impact: Direct
Output:
"Severity Assessment: SEV1 (Critical)
Rationale:
✓ Revenue-impacting feature down
✓ No workaround for affected flow
✓ Checkout is business-critical path
✓ Affects all customers attempting purchase
SEV1 Criteria Met:
- Complete feature outage: checkout
- Revenue impact: direct
- No workaround: cannot complete purchases
Response Requirements:
- 15-minute update cadence
- Executive notification: immediate
- Customer communication: within 30 minutes
- Target resolution: 4 hours"
```
### Status Update
```
User: "Incident update: found the issue, deploying fix"
Skill logs:
1. Add to timeline
2. Update status
3. Notify stakeholders
Output:
"Incident Update: INC-2025-001234
Status: Mitigating
Time: 15:15 UTC
Update Logged:
'Found root cause, deploying fix'
Next Actions:
- [ ] Update status page
- [ ] Notify executive stakeholders
- [ ] Continue timeline documentation
Time Since Start: 43 minutes
Next Update Due: 15:30 UTC"
```
## Integration
This skill uses:
- `project-awareness`: Context for system topology
- `artifact-metadata`: Track incident artifacts
## Agent Orchestration
```yaml
agents:
incident_commander:
agent: incident-responder
focus: Overall coordination and decisions
technical_lead:
agent: debugger
focus: Root cause investigation
reliability:
agent: reliability-engineer
focus: System stability and monitoring
communications:
agent: support-lead
focus: Customer and stakeholder communication
```
## Configuration
### Notification Channels
```yaml
notifications:
sev1:
pagerduty: true
slack: "#incidents-critical"
email: [engineering-leads, on-call-manager]
sms: [incident-commander, vp-engineering]
sev2:
pagerduty: true
slack: "#incidents"
email: [engineering-leads]
sev3:
slack: "#incidents"
email: [team-lead]
sev4:
slack: "#incidents-low"
```
### Escalation Paths
```yaml
escalation:
sev1:
- {time: 0, to: on-call-engineer}
- {time: 15m, to: engineering-manager}
- {time: 30m, to: vp-engineering}
- {time: 1h, to: cto}
sev2:
- {time: 0, to: on-call-engineer}
- {time: 1h, to: engineering-manager}
- {time: 4h, to: vp-engineering}
```
## Output Locations
- Incident records: `.aiwg/incidents/INC-{year}-{id}.md`
- Post-incident reviews: `.aiwg/incidents/INC-{year}-{id}-pir.md`
- Action items: `.aiwg/incidents/action-items.md`
- Metrics: `.aiwg/incidents/metrics/`
## References
- Incident response template: templates/operations/incident-template.md
- Post-incident review template: templates/operations/pir-template.md
- On-call schedule: .aiwg/team/on-call.yaml
- Runbooks: .aiwg/deployment/runbooks/
No comments yet. Be the first to comment!