
Scalability Advisor

Guidance for scaling systems from startup to enterprise scale. Use when planning for growth, diagnosing bottlenecks, or designing systems that need to handle 10x-1000x current load.

32 stars
Added 12/19/2025
Tags: data-ai, python, sql, node, api, database, performance

Works with

api

Install via CLI

$ openskills install alirezarezvani/claude-cto-team
Files
SKILL.md
---
name: scalability-advisor
description: Guidance for scaling systems from startup to enterprise scale. Use when planning for growth, diagnosing bottlenecks, or designing systems that need to handle 10x-1000x current load.
---

# Scalability Advisor

Provides systematic guidance for scaling systems at different growth stages, identifying bottlenecks, and designing for horizontal scalability.

## When to Use

- Planning for 10x, 100x, or 1000x growth
- Diagnosing current performance bottlenecks
- Designing new systems for scale
- Evaluating scaling strategies (vertical vs. horizontal)
- Capacity planning and infrastructure sizing

## Scaling Stages Framework

### Stage Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                    SCALING JOURNEY                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Stage 1        Stage 2         Stage 3         Stage 4             │
│  Startup        Growth          Scale           Enterprise          │
│  0-10K users    10K-100K        100K-1M         1M+ users           │
│                                                                     │
│  Single         Add caching,    Horizontal      Global,             │
│  server         read replicas   scaling         multi-region        │
│                                                                     │
│  $100/mo        $1K/mo          $10K/mo         $100K+/mo           │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Stage 1: Startup (0-10K Users)

### Architecture

```
┌────────────────────────────────────────┐
│           Single Server                │
│  ┌──────────────────────────────────┐  │
│  │  App Server (Node/Python/etc)    │  │
│  │  + Database (PostgreSQL)         │  │
│  │  + File Storage (local/S3)       │  │
│  └──────────────────────────────────┘  │
└────────────────────────────────────────┘
```

### Key Metrics

| Metric | Target | Warning |
|--------|--------|---------|
| Response time (P95) | < 500ms | > 1s |
| Database queries/request | < 10 | > 20 |
| Server CPU | < 70% | > 85% |
| Database connections | < 50% pool | > 80% pool |

### What to Focus On

**DO**:
- Write clean, maintainable code
- Use database indexes on frequently queried columns
- Implement basic monitoring (uptime, errors)
- Keep architecture simple (monolith is fine)

**DON'T**:
- Over-engineer for scale you don't have
- Add caching before you need it
- Split into microservices prematurely
- Worry about multi-region yet

### When to Move to Stage 2

- Database CPU consistently > 70%
- Response times degrading
- Single queries taking > 100ms
- Server resources maxed

---

## Stage 2: Growth (10K-100K Users)

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│    ┌─────────┐      ┌─────────────────────────────────┐     │
│    │   CDN   │      │      Load Balancer              │     │
│    └────┬────┘      └──────────────┬──────────────────┘     │
│         │                          │                        │
│         │           ┌──────────────┼──────────────┐         │
│         │           │              │              │         │
│         ▼           ▼              ▼              ▼         │
│    ┌─────────┐ ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│    │ Static  │ │ App 1   │   │ App 2   │   │ App 3   │      │
│    │ Assets  │ └────┬────┘   └────┬────┘   └────┬────┘      │
│    └─────────┘      │             │             │           │
│                     └──────────────┼────────────┘           │
│                                    │                        │
│                     ┌──────────────┼──────────────┐         │
│                     │              │              │         │
│                     ▼              ▼              ▼         │
│               ┌─────────┐   ┌─────────┐   ┌─────────┐       │
│               │ Primary │   │  Read   │   │  Redis  │       │
│               │   DB    │───│ Replica │   │  Cache  │       │
│               └─────────┘   └─────────┘   └─────────┘       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Key Additions

| Component | Purpose | When to Add |
|-----------|---------|-------------|
| **CDN** | Static asset caching | Images, JS, CSS taking > 20% bandwidth |
| **Load Balancer** | Distribute traffic | Single server CPU > 70% |
| **Read Replicas** | Offload reads | > 80% database ops are reads |
| **Redis Cache** | Application caching | Same queries repeated frequently |
| **Job Queue** | Async processing | Background tasks blocking requests |

### Caching Strategy

```
Request Flow with Caching:

1. Check CDN (static assets)          ─► HIT: return cached response
          │ MISS
          ▼
2. Check application cache (Redis)    ─► HIT: return cached response
          │ MISS
          ▼
3. Query database                     ─► return result and populate caches
```

**What to Cache**:
- Session data (TTL: session duration)
- User profile data (TTL: 5-15 minutes)
- API responses (TTL: varies by freshness needs)
- Database query results (TTL: 1-5 minutes)
- Computed values (TTL: based on computation cost)
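
A common way to implement the Redis layer is the cache-aside pattern: read from the cache, fall back to the database on a miss, and write the result back with a TTL matched to the freshness needs above. A minimal sketch, assuming the `redis-py` client; `db_fetch_user` is a placeholder for the real query:

```python
import json
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379, db=0)

PROFILE_TTL_SECONDS = 600  # user profile data: 5-15 minute TTL per the list above


def db_fetch_user(user_id: int) -> dict:
    # Placeholder for the real database query.
    return {"id": user_id, "name": "example"}


def get_user_profile(user_id: int) -> dict:
    """Cache-aside read: try Redis first, fall back to the database on a miss."""
    key = f"user:profile:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit

    profile = db_fetch_user(user_id)       # cache miss: hit the database
    r.setex(key, PROFILE_TTL_SECONDS, json.dumps(profile))  # populate with TTL
    return profile
```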

### Database Optimization

```sql
-- Find slow queries (requires the pg_stat_statements extension;
-- columns are total_time/mean_time before PostgreSQL 13)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Find tables that may be missing indexes (heavy sequential scans)
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 1000
ORDER BY seq_tup_read DESC;

-- Find unused indexes (never chosen by the planner)
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
```

### When to Move to Stage 3

- Write traffic overwhelming single primary
- Cache hit rate plateauing despite optimization
- Replication lag growing because read replicas can't keep up with write volume
- Need independent scaling of components

---

## Stage 3: Scale (100K-1M Users)

### Architecture

```
┌──────────────────────────────────────────────────────────────────────┐
│                           CDN / Edge                                 │
└──────────────────────────────────────────────────────────────────────┘
                                    │
┌──────────────────────────────────────────────────────────────────────┐
│                        API Gateway                                   │
│              (Rate limiting, Auth, Routing)                          │
└──────────────────────────────────────────────────────────────────────┘
                                    │
        ┌───────────────────────────┼───────────────────────────┐
        │                           │                           │
        ▼                           ▼                           ▼
┌───────────────┐          ┌───────────────┐          ┌───────────────┐
│   Service A   │          │   Service B   │          │   Service C   │
│   (Users)     │          │   (Orders)    │          │   (Search)    │
│   Auto-scale  │          │   Auto-scale  │          │   Auto-scale  │
└───────┬───────┘          └───────┬───────┘          └───────┬───────┘
        │                          │                          │
        ▼                          ▼                          ▼
┌───────────────┐          ┌───────────────┐          ┌───────────────┐
│   User DB     │          │   Order DB    │          │ Elasticsearch │
│   (Sharded)   │          │   (Sharded)   │          │   (Cluster)   │
└───────────────┘          └───────────────┘          └───────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────┐
                    │     Message Queue         │
                    │     (Kafka / SQS)         │
                    └───────────────────────────┘
```

### Key Patterns

#### Database Sharding

```
Sharding Strategies:

1. Hash-based (user_id % num_shards)
   PRO: Even distribution
   CON: Hard to add shards

2. Range-based (user_id 1-1M → shard 1)
   PRO: Easy to add shards
   CON: Hotspots possible

3. Directory-based (lookup table)
   PRO: Flexible
   CON: Lookup overhead
```
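
As a concrete illustration of the first strategy, hash-based routing can be as small as a modulo over a stable hash of the shard key. The `SHARD_DSNS` list here is a hypothetical configuration; in practice consistent hashing or a directory service is used to soften the "hard to add shards" drawback. A minimal sketch:

```python
import hashlib

# Hypothetical connection strings, one per shard.
SHARD_DSNS = [
    "postgres://db-shard-0/app",
    "postgres://db-shard-1/app",
    "postgres://db-shard-2/app",
    "postgres://db-shard-3/app",
]


def shard_for(user_id: int) -> str:
    """Hash-based sharding: a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    index = int(digest, 16) % len(SHARD_DSNS)
    return SHARD_DSNS[index]


# All queries for a given user consistently route to the same shard.
print(shard_for(42))  # always the same DSN for user 42
print(shard_for(42))  # same shard every time
```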

#### Event-Driven Architecture

```
Synchronous → Asynchronous

Before:
  API → Service A → Service B → Service C → Response (slow)

After:
  API → Service A → Queue → Response (fast)
                      ↓
              Service B, C process async
```
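
Below is a minimal in-process sketch of the same shift, using only Python's standard library. In a Stage 3 system the queue would be Kafka or SQS and the worker a separate service, but the request-path change is the same: enqueue and return immediately instead of waiting on downstream work.

```python
import queue
import threading
import time

work_queue: "queue.Queue[dict]" = queue.Queue()


def worker() -> None:
    # Stands in for Service B/C consuming events asynchronously.
    while True:
        event = work_queue.get()
        time.sleep(0.5)                      # simulate slow downstream processing
        print(f"processed order {event['order_id']}")
        work_queue.task_done()


threading.Thread(target=worker, daemon=True).start()


def handle_create_order(order_id: int) -> dict:
    """API handler: enqueue the event and respond immediately."""
    work_queue.put({"order_id": order_id})
    return {"status": "accepted", "order_id": order_id}   # fast response


print(handle_create_order(1001))
work_queue.join()                            # wait for background processing (demo only)
```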

### Scaling Checklist

- [ ] Stateless application servers (no local state)
- [ ] Database read/write separation
- [ ] Asynchronous processing for non-critical paths
- [ ] Circuit breakers between services (a minimal sketch follows this checklist)
- [ ] Distributed tracing implemented
- [ ] Auto-scaling configured with proper metrics
- [ ] Database connection pooling (PgBouncer, ProxySQL)
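
Of these, circuit breakers are the least obvious to build from scratch. The sketch below shows the idea (libraries such as `pybreaker` exist; the thresholds are illustrative): after a run of failures the breaker opens and fails fast instead of piling load onto a struggling dependency, then allows a trial call after a cooldown.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Fail fast after repeated errors instead of hammering a struggling dependency."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None            # cooldown elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                    # success closes the circuit again
        return result


# Usage: wrap calls to a downstream service.
# breaker = CircuitBreaker()
# data = breaker.call(fetch_from_orders_service, order_id=42)
```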

### When to Move to Stage 4

- Need geographic distribution for latency
- Regulatory requirements (data residency)
- Single region can't handle failover
- Global user base with latency requirements

---

## Stage 4: Enterprise (1M+ Users)

### Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                          Global Load Balancer                           │
│                       (GeoDNS, Anycast, Route53)                        │
└─────────────────────────────────────────────────────────────────────────┘
                    │                                │
           ┌────────┴────────┐              ┌───────┴────────┐
           │                 │              │                │
           ▼                 ▼              ▼                ▼
    ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
    │  US-East     │  │  US-West     │  │  EU-West     │  │  AP-South    │
    │  Region      │  │  Region      │  │  Region      │  │  Region      │
    │  ┌────────┐  │  │  ┌────────┐  │  │  ┌────────┐  │  │  ┌────────┐  │
    │  │Services│  │  │  │Services│  │  │  │Services│  │  │  │Services│  │
    │  └────────┘  │  │  └────────┘  │  │  └────────┘  │  │  └────────┘  │
    │  ┌────────┐  │  │  ┌────────┐  │  │  ┌────────┐  │  │  ┌────────┐  │
    │  │Database│  │  │  │Database│  │  │  │Database│  │  │  │Database│  │
    │  │Primary │  │  │  │Replica │  │  │  │Primary │  │  │  │Replica │  │
    │  └────────┘  │  │  └────────┘  │  │  └────────┘  │  │  └────────┘  │
    └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘
                              │                   │
                              └─────────┬─────────┘
                                        │
                              Cross-Region Replication
```

### Multi-Region Patterns

| Pattern | Consistency | Latency | Complexity |
|---------|-------------|---------|------------|
| **Active-Passive** | Strong | Low in primary region; slow failover | Low |
| **Active-Active** | Eventual | Low | High |
| **Follow-the-Sun** | Strong per region | Medium | Medium |

### Data Consistency Strategies

```
CAP Theorem Trade-offs:

Strong Consistency (CP):
- All regions see same data
- Higher latency for writes
- Use for: Financial transactions, inventory

Eventual Consistency (AP):
- Regions may have stale data briefly
- Low latency always
- Use for: Social feeds, analytics, non-critical

Causal Consistency:
- Related operations ordered correctly
- Balance of latency and correctness
- Use for: Messaging, collaboration
```

### Enterprise Checklist

- [ ] Multi-region deployment
- [ ] Cross-region data replication
- [ ] Global CDN with edge caching
- [ ] Disaster recovery tested
- [ ] Compliance (SOC 2, GDPR, data residency)
- [ ] 99.99% SLA architecture
- [ ] Zero-downtime deployments
- [ ] Chaos engineering practice

---

## Bottleneck Diagnosis Guide

### Finding the Bottleneck

```
Systematic Diagnosis:

1. Where is time spent?
   └─► Distributed tracing (Jaeger, Datadog)

2. Is it the database?
   └─► Check slow query logs, connection pool

3. Is it the application?
   └─► CPU profiling, memory analysis

4. Is it the network?
   └─► Latency between services, DNS resolution

5. Is it external services?
   └─► Third-party API latency, rate limits
```
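
For step 1, a span around each stage of a request makes the time split visible. A minimal sketch using the OpenTelemetry Python SDK (assumes `opentelemetry-sdk` is installed; the span names and the console exporter are illustrative, and in production you would export to Jaeger or Datadog instead):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout for the example; swap in an OTLP/Jaeger exporter in production.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")


def handle_request() -> None:
    with tracer.start_as_current_span("handle_request"):
        with tracer.start_as_current_span("db_query"):
            ...                      # time spent in the database shows up here
        with tracer.start_as_current_span("payment_api_call"):
            ...                      # external service latency shows up here


handle_request()
```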

### Common Bottlenecks by Layer

| Layer | Symptoms | Solutions |
|-------|----------|-----------|
| **Database** | Slow queries, high CPU | Indexing, read replicas, caching |
| **Application** | High CPU, memory | Optimize code, scale horizontally |
| **Network** | High latency, timeouts | CDN, edge caching, connection pooling |
| **Storage** | Slow I/O, high wait | SSD, object storage, caching |
| **External APIs** | Timeouts, rate limits | Circuit breakers, caching, fallbacks |

### Database Bottleneck Checklist

```markdown
## Quick Database Health Check

1. Connection Pool
   - Current connections vs max?
   - Connection wait time?
   - Pool exhaustion events?

2. Query Performance
   - Slowest queries (pg_stat_statements)?
   - Missing indexes (seq scans > 10K)?
   - Lock contention?

3. Replication
   - Replica lag?
   - Write throughput?
   - Read distribution?

4. Storage
   - Disk I/O wait?
   - Table/index bloat?
   - WAL write latency?
```
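
The checklist above can be partly automated. A minimal sketch with `psycopg2`, assuming the `pg_stat_statements` extension is installed and the lag check runs against a replica; the DSN is a placeholder:

```python
import psycopg2

conn = psycopg2.connect("postgresql://monitor@localhost:5432/app")  # placeholder DSN

with conn.cursor() as cur:
    # 1. Connection pool pressure: connections in use vs. the server limit.
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
    cur.execute("SHOW max_connections;")
    max_conns = int(cur.fetchone()[0])
    print(f"connections: {in_use}/{max_conns} ({in_use / max_conns:.0%})")

    # 2. Slowest queries by total time (PostgreSQL 13+ column names).
    cur.execute(
        """SELECT left(query, 60), calls, round(mean_exec_time::numeric, 1)
           FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"""
    )
    for query, calls, mean_ms in cur.fetchall():
        print(f"{mean_ms:>8} ms avg  {calls:>8} calls  {query}")

    # 3. Replica lag in seconds (NULL when run on a primary).
    cur.execute(
        "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));"
    )
    lag = cur.fetchone()[0]
    print(f"replica lag: {lag if lag is not None else 'n/a (primary)'}")

conn.close()
```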

---

## Scaling Calculations

### Capacity Planning Formula

```
Required Capacity = Peak Traffic × Growth Factor × Safety Margin

Example:
- Current peak: 1,000 req/sec
- Expected growth: 3x in 12 months
- Safety margin: 1.5x

Required: 1,000 × 3 × 1.5 = 4,500 req/sec capacity
```
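
The same calculation in code, for dropping into a capacity-planning notebook (the numbers are the example's, not recommendations):

```python
def required_capacity(peak_rps: float, growth_factor: float, safety_margin: float) -> float:
    """Required capacity = peak traffic x growth factor x safety margin."""
    return peak_rps * growth_factor * safety_margin


print(required_capacity(peak_rps=1_000, growth_factor=3, safety_margin=1.5))  # 4500.0
```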

### Database Sizing

```
Connection Pool Size:
  connections = (num_cores × 2) + effective_spindle_count

  Example: 8 cores, SSD
  connections = (8 × 2) + 1 = 17 connections per instance

Read Replica Sizing:
  replicas = ceiling(read_traffic / single_replica_capacity)

  Example: 10,000 reads/sec, 3,000/replica capacity
  replicas = ceiling(10,000 / 3,000) = 4 replicas
```
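
In code (the pool formula is the common PostgreSQL heuristic; a single SSD is treated as one effective spindle):

```python
import math


def pool_size(num_cores: int, effective_spindles: int = 1) -> int:
    """Connection pool heuristic: (cores x 2) + effective spindle count."""
    return num_cores * 2 + effective_spindles


def replicas_needed(read_rps: float, per_replica_rps: float) -> int:
    """Round up: a partially loaded replica still has to exist."""
    return math.ceil(read_rps / per_replica_rps)


print(pool_size(num_cores=8))                                   # 17
print(replicas_needed(read_rps=10_000, per_replica_rps=3_000))  # 4
```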

### Cache Sizing

```
Cache Size:
  memory = working_set_size × overhead_factor

  Working set = frequently accessed data (usually 10-20% of total)
  Overhead factor ≈ 1.5x to account for Redis data structure overhead

  Example: 10GB working set
  Redis memory = 10GB × 1.5 = 15GB
```
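
And the cache estimate (the 1.5x overhead factor is the rule of thumb quoted above; verify real usage with Redis's `INFO memory` once data is loaded):

```python
def redis_memory_gb(working_set_gb: float, overhead_factor: float = 1.5) -> float:
    """Cache size = working set x overhead factor (~1.5x for Redis structures)."""
    return working_set_gb * overhead_factor


print(redis_memory_gb(working_set_gb=10))  # 15.0
```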

---

## Quick Reference

### Scaling Decision Matrix

| Symptom | First Try | Then Try | Finally |
|---------|-----------|----------|---------|
| Slow page loads | Add caching | CDN | Edge compute |
| Database slow | Add indexes | Read replicas | Sharding |
| API timeouts | Async processing | Circuit breakers | Event-driven |
| High server CPU | Vertical scale | Horizontal scale | Optimize code |
| High memory | Increase RAM | Fix memory leaks | Redesign data structures |

### Infrastructure Cost at Scale

| Users | Architecture | Monthly Cost |
|-------|-------------|--------------|
| 10K | Single server | $100-300 |
| 100K | Load balanced + cache | $1,000-3,000 |
| 1M | Microservices + sharding | $10,000-30,000 |
| 10M | Multi-region | $100,000+ |

---

## References

- [Bottleneck Diagnosis Guide](bottleneck-diagnosis.md) - Detailed troubleshooting
- [Capacity Planning Calculator](capacity-calculator.md) - Sizing formulas
