# ProductGraph Architecture Scaling Guide
This document outlines the recommended architecture at different stages of scale and revenue, including trade-offs between simplicity and performance.
## Architecture Overview
ProductGraph can be deployed in three configurations:
| Configuration | Users | Events/Month | Revenue | Complexity |
|---|---|---|---|---|
| Starter | 1-1,000 | <50M | $0-50K ARR | Low |
| Growth | 1,000-10,000 | 50M-500M | $50K-500K ARR | Medium |
| Scale | 10,000+ | 500M+ | $500K+ ARR | High |
## Starter Architecture (v0.1.0)

Recommended for: early-stage products, proofs of concept, and the first 1,000 paying users.

```
┌─────────────────────────────────────────────────────────────────┐
│ ProductGraph Service │
│ (Single Binary) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Ingestion │ │ GraphQL API │ │ Background Workers │ │
│ │ /v1/events │ │ │ │ (Session aggregation) │ │
│ └──────┬───────┘ └──────┬───────┘ └───────────┬────────────┘ │
│ │ │ │ │
│ └─────────────────┴──────────────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Ent ORM │ │
│ └──────┬──────┘ │
└───────────────────────────┼─────────────────────────────────────┘
                            │
                            ▼
              ┌────────────────────────┐
              │ PostgreSQL 16+ (RLS)   │
              │                        │
              │ • Events (BRIN index)  │
              │ • Sessions             │
              │ • Journeys             │
              │ • Projects/Orgs        │
              └────────────────────────┘
```
### Components
| Component | Technology | Purpose |
|---|---|---|
| Database | PostgreSQL 16+ | All data storage with RLS |
| ORM | Ent | Type-safe queries, migrations |
| API | Go + Chi | HTTP ingestion + GraphQL |
### Trade-offs

| Pros | Cons |
|---|---|
| Single database to manage | Analytics queries can exceed 2s at scale |
| Simple deployment (single binary + DB) | No real-time streaming |
| Low operational cost | Limited horizontal scaling |
| Easy debugging | Session aggregation runs in-process |
| RLS provides secure multi-tenancy | Requires careful indexing for performance |
### PostgreSQL Optimization Tips

```sql
-- Use BRIN indexes for time-series event data
CREATE INDEX events_timestamp_brin ON events
    USING BRIN (org_id, project_id, timestamp);

-- Partition by month for large event tables
CREATE TABLE events (
    ...
) PARTITION BY RANGE (timestamp);
```

Beyond indexing and partitioning:

- Use connection pooling (e.g. PgBouncer) rather than opening a connection per request.
- Tune `shared_buffers` and `effective_cache_size` to match the instance's memory.
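As a starting point for the `shared_buffers` and `effective_cache_size` tuning mentioned above, a small-instance configuration might look like the fragment below. The values assume roughly 8 GB of RAM and are illustrative defaults, not ProductGraph recommendations; measure before and after changing them.

```ini
# postgresql.conf -- illustrative values for an ~8 GB instance
shared_buffers = 2GB            # ~25% of RAM is a common starting point
effective_cache_size = 6GB      # ~75% of RAM; a planner hint, not an allocation
work_mem = 32MB                 # applies per sort/hash node, so keep it modest
max_connections = 100           # keep low and pool through PgBouncer instead
```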
### When to Upgrade
Migrate to Growth architecture when:
- [ ] Event ingestion latency p99 > 200ms
- [ ] Analytics queries consistently > 2s
- [ ] Database CPU > 70% sustained
- [ ] Storage costs exceed compute savings
## Growth Architecture

Recommended for: Scaling to 10,000 users, $50K-500K ARR.

```
┌─────────────────────────────────────────────────────────────────┐
│ API Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Ingestion │ │ GraphQL API │ │ WebSocket │ │
│ │ /v1/events │ │ │ │ (Real-time updates) │ │
│ └──────┬───────┘ └──────┬───────┘ └───────────┬────────────┘ │
└─────────┼─────────────────┼──────────────────────┼──────────────┘
          │                 │                      │
          ▼                 │                      │
┌─────────────────┐         │                      │
│ Kafka           │         │                      │
│ (Event Stream)  │         │                      │
└────────┬────────┘         │                      │
         │                  │                      │
         ▼                  │                      │
┌─────────────────┐         │                      │
│ Processors      │         │                      │
│ • Session build │         │                      │
│ • Journey match │         │                      │
└────────┬────────┘         │                      │
         │                  │                      │
         ▼                  ▼                      ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ ClickHouse      │ │ PostgreSQL      │ │ Redis           │
│ (Events)        │ │ (Metadata)      │ │ (Sessions)      │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
### New Components
| Component | Technology | Purpose |
|---|---|---|
| Event Stream | Kafka | Decouple ingestion from processing |
| Analytics DB | ClickHouse | Fast columnar analytics |
| Cache | Redis | Real-time session state |
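One detail worth getting right when introducing Kafka is the message key: keying events by tenant keeps each org/project's events ordered within a partition, which the session and journey processors depend on. The key format below is an assumption for illustration, and the FNV hash mimics how a hash-based partitioner maps keys to partitions:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// messageKey builds the Kafka message key. Keying by org and project is an
// assumption here; it pins each tenant's events to one partition so they
// stay in order for downstream consumers.
func messageKey(orgID, projectID string) string {
	return orgID + ":" + projectID
}

// partitionFor mirrors a hash-based partitioner: the same key always lands
// on the same partition, preserving per-tenant ordering.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	key := messageKey("acme", "web")
	fmt.Println(key, partitionFor(key, 12))
}
```

Note that the partition count becomes part of this contract: resizing the topic remaps keys, so plan partition counts with headroom.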
### Migration Path from Starter

1. Add Kafka (Week 1-2)
   - Deploy Kafka cluster
   - Update ingestion to publish to Kafka
   - Add consumer for PostgreSQL writes (temporary)
2. Add ClickHouse (Week 3-4)
   - Deploy ClickHouse
   - Add consumer to write events to ClickHouse
   - Migrate analytics queries to ClickHouse
   - Keep recent events in PostgreSQL for joins
3. Add Redis (Week 5-6)
   - Deploy Redis
   - Move session state to Redis
   - Add WebSocket support for real-time
### Trade-offs
| Pros | Cons |
|---|---|
| Horizontal scaling for ingestion | 3 databases to manage |
| Sub-second analytics queries | More complex deployment |
| Real-time capabilities | Higher operational cost |
| Kafka replay for reprocessing | Need Kafka expertise |
### Cost Estimate (AWS)

| Component | Instance | Monthly Cost |
|---|---|---|
| Kafka (MSK) | kafka.m5.large x3 | ~$600 |
| ClickHouse | r6g.xlarge x2 | ~$400 |
| PostgreSQL (RDS) | db.r6g.large | ~$200 |
| Redis (ElastiCache) | cache.r6g.large | ~$150 |
| Total | | ~$1,350/mo |
## Scale Architecture

Recommended for: 10,000+ users, $500K+ ARR.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Load Balancer │
└─────────────────────────────────────┬───────────────────────────────────────┘
                                      │
┌─────────────────────────────────────┴───────────────────────────────────────┐
│ API Gateway (Kong) │
│ Rate limiting, Auth, Routing │
└─────────────────────────────────────┬───────────────────────────────────────┘
                                      │
         ┌────────────────────────────┼────────────────────────────┐
         │                            │                            │
         ▼                            ▼                            ▼
┌─────────────────┐         ┌─────────────────┐         ┌─────────────────┐
│ Ingestion       │         │ GraphQL API     │         │ WebSocket       │
│ (Replicas)      │         │ (Replicas)      │         │ (Replicas)      │
└────────┬────────┘         └────────┬────────┘         └────────┬────────┘
         │                           │                           │
         ▼                           │                           │
┌─────────────────────────────────────┼────────────────────────────┼─────────┐
│ Kafka Cluster (Multi-AZ)            │                            │         │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐                  │         │
│ │ events  │ │sessions │ │journeys │ │ alerts  │                  │         │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘                  │         │
└─────────────────────────────────────┬───────────────────────────┴─────────┘
                                      │
         ┌────────────────────────────┼────────────────────────────┐
         │                            │                            │
         ▼                            ▼                            ▼
┌─────────────────┐         ┌─────────────────┐         ┌─────────────────┐
│ Session         │         │ Journey         │         │ Alert           │
│ Processor       │         │ Processor       │         │ Processor       │
│ (Consumer Grp)  │         │ (Consumer Grp)  │         │ (Consumer Grp)  │
└────────┬────────┘         └────────┬────────┘         └────────┬────────┘
         │                           │                           │
         └────────────────────────────┼────────────────────────────┘
                                      │
         ┌────────────────────────────┼────────────────────────────┐
         │                            │                            │
         ▼                            ▼                            ▼
┌─────────────────┐         ┌─────────────────┐         ┌─────────────────┐
│ ClickHouse      │         │ PostgreSQL      │         │ Redis           │
│ Cluster         │         │ (Primary +      │         │ Cluster         │
│ (Sharded)       │         │ Replicas)       │         │                 │
└─────────────────┘         └─────────────────┘         └─────────────────┘
         │
         ▼
┌─────────────────┐
│ S3 / R2         │
│ (Snapshots,     │
│ Exports)        │
└─────────────────┘
```
### Additional Components
| Component | Technology | Purpose |
|---|---|---|
| API Gateway | Kong / Traefik | Rate limiting, auth, routing |
| Object Storage | S3 / Cloudflare R2 | Screenshots, exports |
| Monitoring | Prometheus + Grafana | Observability |
| Tracing | Jaeger / Tempo | Distributed tracing |
### Trade-offs
| Pros | Cons |
|---|---|
| Unlimited horizontal scaling | Complex operations |
| Multi-region capable | Requires dedicated SRE |
| High availability | Higher infrastructure cost |
| Feature-rich (alerts, exports) | Longer development cycles |
### Cost Estimate (AWS)

| Component | Configuration | Monthly Cost |
|---|---|---|
| Kafka (MSK) | kafka.m5.xlarge x6 | ~$2,000 |
| ClickHouse | r6g.2xlarge x4 (sharded) | ~$2,000 |
| PostgreSQL (RDS) | db.r6g.xlarge + replica | ~$600 |
| Redis (ElastiCache) | cache.r6g.xlarge cluster | ~$500 |
| S3 | 1TB storage + transfer | ~$100 |
| Kong | t3.medium x2 | ~$100 |
| Monitoring | Managed Prometheus | ~$200 |
| Total | | ~$5,500/mo |
## Decision Matrix
Use this matrix to decide which architecture fits your needs:
| Factor | Starter | Growth | Scale |
|---|---|---|---|
| Setup Time | 1 day | 2-4 weeks | 2-3 months |
| Team Size | 1 dev | 2-3 devs | 5+ devs + SRE |
| Monthly Infra | $50-200 | $1,000-2,000 | $5,000+ |
| Query Latency | 200ms-2s | 50-200ms | <50ms |
| Ingestion Rate | 1K/sec | 10K/sec | 100K+/sec |
| Data Retention | 90 days | 1 year | Unlimited |
| Real-time | No | Basic | Full |
| Multi-region | No | No | Yes |
## Upgrade Triggers

### Starter → Growth
| Metric | Threshold | Action |
|---|---|---|
| Events/day | >1M | Add Kafka |
| Query p99 | >2s | Add ClickHouse |
| Active sessions | >10K concurrent | Add Redis |
| Team size | >3 engineers | Worth the complexity |
### Growth → Scale
| Metric | Threshold | Action |
|---|---|---|
| Events/day | >50M | Shard ClickHouse |
| Ingestion latency | >100ms p99 | Scale Kafka partitions |
| Revenue | >$500K ARR | Can afford dedicated SRE |
| Uptime SLA | >99.9% | Multi-region deployment |
## Migration Checklist

### Starter → Growth
- [ ] Deploy Kafka cluster
- [ ] Add Kafka producer to ingestion service
- [ ] Deploy Kafka consumers for processing
- [ ] Deploy ClickHouse
- [ ] Migrate event writes to ClickHouse
- [ ] Update analytics queries to use ClickHouse
- [ ] Deploy Redis
- [ ] Migrate session state to Redis
- [ ] Add WebSocket support
- [ ] Update monitoring dashboards
- [ ] Update runbooks
### Growth → Scale
- [ ] Deploy API Gateway
- [ ] Configure rate limiting and auth
- [ ] Shard ClickHouse by project_id
- [ ] Add PostgreSQL read replicas
- [ ] Deploy Redis cluster
- [ ] Set up S3 for snapshots
- [ ] Configure multi-AZ for all components
- [ ] Add distributed tracing
- [ ] Create runbooks for each component
- [ ] Train team on operations
## Summary

Start with the Starter architecture. It is sufficient for most early-stage products and can handle significant scale with proper PostgreSQL tuning. Only upgrade when you hit specific performance triggers: premature optimization adds complexity without benefit.
The cost savings of the simpler architectures are significant:
| Architecture | Monthly Cost | Annual Savings vs Scale |
|---|---|---|
| Starter | ~$150 | $64,200 |
| Growth | ~$1,350 | $49,800 |
| Scale | ~$5,500 | - |
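The annual-savings column is simply the monthly delta versus Scale, times twelve:

```go
package main

import "fmt"

// annualSavings compares a configuration's monthly cost against Scale's
// monthly cost and annualizes the difference.
func annualSavings(monthlyCost, scaleMonthlyCost int) int {
	return (scaleMonthlyCost - monthlyCost) * 12
}

func main() {
	fmt.Println(annualSavings(150, 5500))  // Starter vs Scale
	fmt.Println(annualSavings(1350, 5500)) // Growth vs Scale
}
```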
Invest those savings in product development until scale demands otherwise.