System Design Patterns: Building Resilient Distributed Systems
Designing distributed systems is challenging. You're dealing with network failures, partial failures, consistency issues, and more. Over the years, I've learned that certain patterns consistently help build resilient systems. Let me share the patterns that have proven most valuable.
The Challenges of Distributed Systems
Before diving into patterns, let's acknowledge what we're up against:
- The network is unreliable - Messages can be lost, delayed, or duplicated
- Clocks aren't synchronized - Logic that compares timestamps across machines is error-prone
- Partial failures - Some components keep working while others fail
- Concurrency - Multiple operations run simultaneously and can interleave
- State management - Data must stay consistent across services
Core Patterns
1. Circuit Breaker
Prevent cascading failures by stopping calls to failing services:
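class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold; // consecutive failures before the circuit opens
    this.timeout = timeout; // ms to wait before letting a trial call through
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }
  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        // Fail fast instead of calling the struggling service
        throw new Error('Circuit breaker is OPEN');
      }
      // Cool-down elapsed: let a trial call through
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  onSuccess() {
    // Any success closes the circuit and resets the count
    this.failureCount = 0;
    this.state = 'CLOSED';
  }
  onFailure() {
    this.failureCount++;
    // A failed trial call reopens the circuit immediately, because the
    // count is still at or above the threshold
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}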
// Usage
const breaker = new CircuitBreaker();
async function chargeCustomer(userId, amount) {
  try {
    return await breaker.call(() => paymentService.charge(userId, amount));
  } catch (error) {
    // Handle failure gracefully
    return { success: false, message: 'Payment service unavailable' };
  }
}
2. Retry with Exponential Backoff
Handle transient failures by retrying, doubling the wait between attempts so a struggling service gets time to recover:
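// Resolves after `ms` milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// Calls fn up to maxRetries times in total, doubling the delay after each failure
async function retryWithBackoff(fn, maxRetries = 3, initialDelay = 1000) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Out of attempts: surface the last error to the caller
      if (attempt === maxRetries - 1) throw error;
      const delay = initialDelay * Math.pow(2, attempt); // 1000, 2000, 4000, ...
      await sleep(delay);
    }
  }
}
// Usage
const result = await retryWithBackoff(
  () => api.fetchUser(userId),
  3, // total attempts (the first call plus two retries)
  1000 // initial delay in ms
);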
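One refinement worth considering: when many clients fail at the same moment, a fixed doubling schedule makes them all retry at the same moment too. Adding random jitter spreads the retries out. The sketch below is one possible variant, not part of the function above; the names `jitteredDelay` and `retryWithJitter`, the 30-second cap, and the injectable `sleep` and `random` options are my own assumptions for illustration.

```javascript
// Default sleep helper: resolves after `ms` milliseconds
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// "Full jitter": pick a random delay in [0, min(capMs, baseMs * 2^attempt))
function jitteredDelay(attempt, baseMs = 1000, capMs = 30000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Retry loop with jittered, capped delays. `sleep` and `random` are injectable
// (an assumption of this sketch) so the behavior can be tested without waiting.
async function retryWithJitter(fn, options = {}) {
  const {
    maxAttempts = 3,
    baseMs = 1000,
    capMs = 30000,
    sleep = defaultSleep,
    random = Math.random,
  } = options;
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Only wait if another attempt is coming
      if (attempt < maxAttempts - 1) {
        await sleep(jitteredDelay(attempt, baseMs, capMs, random));
      }
    }
  }
  throw lastError;
}
```

The cap matters as much as the jitter: without it, a long outage pushes delays into minutes or hours.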
3. Bulkhead Pattern
Isolate failures to prevent them from spreading:
// Separate worker pools for different operations.
// ThreadPool is a placeholder: JavaScript has no built-in thread pool,
// so substitute a worker pool or a concurrency limiter.
class BulkheadExecutor {
  constructor() {
    // Critical operations get more resources
    this.criticalPool = new ThreadPool(10);
    // Non-critical operations get fewer resources
    this.nonCriticalPool = new ThreadPool(2);
  }
  async executeCritical(task) {
    return this.criticalPool.execute(task);
  }
  async executeNonCritical(task) {
    return this.nonCriticalPool.execute(task);
  }
}
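In single-threaded JavaScript, the same isolation can be approximated by capping how many async tasks of each kind run at once. Here's a minimal sketch using a counting semaphore; the `Bulkhead` class and its queueing behavior are assumptions for illustration, not a specific library's API.

```javascript
// At most `limit` tasks run at once; extra tasks wait in a FIFO queue.
class Bulkhead {
  constructor(limit) {
    this.limit = limit;
    this.active = 0;
    this.queue = [];
  }

  async execute(task) {
    if (this.active >= this.limit) {
      // All slots taken: wait until a finishing task hands its slot over
      await new Promise((resolve) => this.queue.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) {
        next(); // transfer the slot to the next waiter (active stays the same)
      } else {
        this.active--; // nobody waiting: free the slot
      }
    }
  }
}

// Mirrors the pool sizes above: non-critical work can never take more than 2 slots
const criticalBulkhead = new Bulkhead(10);
const nonCriticalBulkhead = new Bulkhead(2);
```

A production version would probably also bound the queue length and reject when it's full, so a slow dependency can't pile up unbounded waiting work.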
4. Saga Pattern
Manage distributed transactions as a sequence of local steps, each paired with a compensating action that undoes it:
class OrderSaga {
  async execute(order) {
    const steps = [
      { name: 'reserveInventory', compensate: 'releaseInventory' },
      { name: 'chargePayment', compensate: 'refundPayment' },
      { name: 'createShipment', compensate: 'cancelShipment' }
    ];
    const executed = [];
    try {
      for (const step of steps) {
        await this[step.name](order);
        executed.push(step);
      }
      return { success: true };
    } catch (error) {
      // Compensate in reverse order
      for (const step of executed.reverse()) {
        try {
          await this[step.compensate](order);
        } catch (compensationError) {
          // Keep going so one failed compensation doesn't block the rest;
          // a real system would record this for manual follow-up
          console.error(`Compensation ${step.compensate} failed`, compensationError);
        }
      }
      throw error;
    }
  }
  async reserveInventory(order) {
    // Reserve inventory
  }
  async releaseInventory(order) {
    // Release inventory
  }
  // ... other methods
}
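To see the compensation order in action, here's a small runnable sketch. It swaps the class above for a plain function, and the steps are hypothetical in-memory stand-ins that only write to a log; `runSaga`, `buildOrderSteps`, and the `failAt` option are names I made up for this example.

```javascript
// Runs steps in order; on failure, undoes the completed steps in reverse order.
async function runSaga(steps, context) {
  const completed = [];
  try {
    for (const step of steps) {
      await step.action(context);
      completed.push(step);
    }
    return { success: true };
  } catch (error) {
    for (const step of completed.reverse()) {
      await step.compensate(context);
    }
    return { success: false, failedWith: error.message };
  }
}

// Hypothetical steps that record what happened instead of calling services.
// `failAt` names the step that should throw.
function buildOrderSteps(log, { failAt } = {}) {
  const step = (name, undoName) => ({
    name,
    action: async () => {
      if (name === failAt) throw new Error(`${name} failed`);
      log.push(name);
    },
    compensate: async () => {
      log.push(undoName);
    },
  });
  return [
    step('reserveInventory', 'releaseInventory'),
    step('chargePayment', 'refundPayment'),
    step('createShipment', 'cancelShipment'),
  ];
}

async function demo() {
  const log = [];
  const outcome = await runSaga(buildOrderSteps(log, { failAt: 'createShipment' }), {});
  // log: reserveInventory, chargePayment, refundPayment, releaseInventory
  return { outcome, log };
}
```

Note that the failed step itself is not compensated, only the steps that completed before it. That's why each step should either fully succeed or leave nothing behind.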
5. Event Sourcing
Store all changes as a sequence of events:
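class EventStore {
  constructor() {
    this.events = [];
  }
  append(aggregateId, event) {
    this.events.push({
      aggregateId,
      event,
      timestamp: Date.now(),
      version: this.getNextVersion(aggregateId)
    });
  }
  // Versions count up from 1 within each aggregate
  getNextVersion(aggregateId) {
    return this.getEvents(aggregateId).length + 1;
  }
  getEvents(aggregateId) {
    return this.events.filter(e => e.aggregateId === aggregateId);
  }
  reconstruct(aggregateId) {
    // Replay the aggregate's events in order to rebuild its current state
    return this.getEvents(aggregateId).reduce((state, record) => {
      return this.applyEvent(state, record.event);
    }, {});
  }
  // Illustrative placeholder: merges each event's data into the state.
  // A real aggregate would switch on event.type.
  applyEvent(state, event) {
    return { ...state, ...event.data };
  }
}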
// Usage
const store = new EventStore();
store.append('order-123', { type: 'OrderCreated', data: { ... } });
store.append('order-123', { type: 'ItemAdded', data: { ... } });
const order = store.reconstruct('order-123');
6. CQRS (Command Query Responsibility Segregation)
Separate read and write models:
// Write model (commands)
class OrderCommandHandler {
  constructor(eventStore) {
    this.eventStore = eventStore;
  }
  async createOrder(data) {
    const order = new Order(data);
    // Key events by the order's own ID so reconstruct(orderId) can find them
    await this.eventStore.append(order.id, order.createdEvent());
    return order.id;
  }
  async addItem(orderId, item) {
    // Assumes reconstruct returns an Order aggregate with domain methods
    const order = await this.eventStore.reconstruct(orderId);
    order.addItem(item);
    await this.eventStore.append(orderId, order.itemAddedEvent());
  }
}
// Read model (queries)
class OrderQueryHandler {
  constructor(readDb) {
    this.readDb = readDb;
  }
  async getOrder(orderId) {
    // Optimized read model (denormalized)
    return this.readDb.query('SELECT * FROM order_view WHERE id = ?', [orderId]);
  }
  async getUserOrders(userId) {
    return this.readDb.query('SELECT * FROM order_view WHERE user_id = ?', [userId]);
  }
}
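The piece connecting the two sides is a projector: something that listens to events from the write model and keeps the read view up to date. Here's a rough sketch with an in-memory `Map` standing in for the `order_view` table. The event shapes (`orderId`, `userId`, `item.price`, `item.quantity`) and the `OrderViewProjector` name are assumptions for illustration.

```javascript
// Folds order events into a denormalized view keyed by order ID.
class OrderViewProjector {
  constructor() {
    this.orderView = new Map(); // stands in for the order_view table
  }

  handle(event) {
    switch (event.type) {
      case 'OrderCreated':
        this.orderView.set(event.orderId, {
          id: event.orderId,
          userId: event.userId,
          items: [],
          total: 0,
        });
        break;
      case 'ItemAdded': {
        const view = this.orderView.get(event.orderId);
        if (!view) break; // event for an unknown order: ignored in this sketch
        view.items.push(event.item);
        // The total is precomputed here so queries never have to sum items
        view.total += event.item.price * event.item.quantity;
        break;
      }
      default:
        break; // events this view doesn't care about
    }
  }

  getOrder(orderId) {
    return this.orderView.get(orderId);
  }

  getUserOrders(userId) {
    return [...this.orderView.values()].filter((o) => o.userId === userId);
  }
}
```

Because the view is updated after the event is stored, reads can briefly lag behind writes. That eventual consistency is the main trade-off to weigh before adopting CQRS.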
7. API Gateway Pattern
Single entry point for clients:
class APIGateway {
  constructor() {
    // Exact-match lookup: '/orders/42' would not match '/orders'
    this.routes = {
      '/users': userService,
      '/orders': orderService,
      '/payments': paymentService
    };
  }
  async handleRequest(path, method, body) {
    const service = this.routes[path];
    if (!service) {
      return { status: 404, body: 'Not found' };
    }
    // Add cross-cutting concerns
    const requestId = this.generateRequestId();
    this.logRequest(requestId, path, method);
    try {
      const result = await service.handle(method, body);
      return { status: 200, body: result };
    } catch (error) {
      // Log the details, but don't leak them to the client
      this.logError(requestId, error);
      return { status: 500, body: 'Internal server error' };
    }
  }
}
Data Patterns
1. Database Per Service
Each service has its own database:
User Service → User Database
Order Service → Order Database
Payment Service → Payment Database
Benefits:
- Service independence
- Technology flexibility
- Scalability
Challenges:
- Data consistency
- Cross-service queries
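One common answer to the cross-service query problem is API composition: ask each service through its own API and join the results in memory. A minimal sketch, assuming hypothetical `userService.getUser` and `orderService.getOrdersByUser` client methods:

```javascript
// Joins data from two services without touching either service's database.
// The service clients are passed in; their method names are assumptions.
async function getUserWithOrders(userId, { userService, orderService }) {
  // The two calls are independent, so run them in parallel
  const [user, orders] = await Promise.all([
    userService.getUser(userId),
    orderService.getOrdersByUser(userId),
  ]);
  return { ...user, orders };
}
```

This works well for small joins. For heavy reporting queries, a dedicated read model fed by events tends to scale better.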
2. Shared Database Anti-Pattern
Avoid this:
User Service    ──┐
Order Service   ──┼──→ Shared Database
Payment Service ──┘
Why it's bad:
- Tight coupling
- Can't scale independently
- Schema changes affect all services
3. Event-Driven Communication
Services communicate via events:
// Publisher
class OrderService {
  async createOrder(data) {
    const order = await this.saveOrder(data);
    // Publish event
    await eventBus.publish('OrderCreated', {
      orderId: order.id,
      userId: order.userId,
      items: order.items, // subscribers need the items to reserve stock
      total: order.total
    });
    return order;
  }
}
// Subscriber
class InventoryService {
  constructor() {
    eventBus.subscribe('OrderCreated', this.handleOrderCreated.bind(this));
  }
  async handleOrderCreated(event) {
    await this.reserveInventory(event.orderId, event.items);
  }
}
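The snippets above take an `eventBus` with `publish` and `subscribe` for granted. In production that would usually be a message broker, but for local experiments an in-memory stand-in is enough. Here's one possible sketch; the `InMemoryEventBus` class and its choice to return subscriber errors from `publish` are my assumptions, not a real broker's API.

```javascript
// In-process event bus: delivers each event to every subscriber of its type.
class InMemoryEventBus {
  constructor() {
    this.handlers = new Map(); // event type -> array of handler functions
  }

  subscribe(eventType, handler) {
    if (!this.handlers.has(eventType)) {
      this.handlers.set(eventType, []);
    }
    this.handlers.get(eventType).push(handler);
  }

  // Resolves to the list of errors thrown by subscribers (empty if all succeeded)
  async publish(eventType, payload) {
    const handlers = this.handlers.get(eventType) || [];
    // allSettled: one failing subscriber must not stop the others.
    // The async wrapper turns synchronous throws into rejections.
    const results = await Promise.allSettled(
      handlers.map(async (handler) => handler(payload))
    );
    return results
      .filter((result) => result.status === 'rejected')
      .map((result) => result.reason);
  }
}
```

Unlike a real broker, this gives you no persistence or redelivery: if the process dies mid-publish, the event is gone.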
Resilience Patterns
1. Timeout Pattern
Never wait forever:
async function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    // Don't leave the timer running once the race has settled
    clearTimeout(timer);
  }
}
// Usage
try {
  const result = await withTimeout(
    api.fetchData(),
    5000 // 5 second timeout
  );
} catch (error) {
  if (error.message === 'Timeout') {
    // Handle timeout
  }
}
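One caveat with racing promises: the caller stops waiting, but the underlying work keeps running. If the operation supports cancellation through an `AbortSignal`, you can stop it for real. Here's a sketch of that idea; `withAbortableTimeout` and `slowOperation` are hypothetical names, and it only works when the task actually listens to the signal.

```javascript
// Runs task(signal) and aborts the signal if the task takes longer than timeoutMs.
// The task is responsible for rejecting when its signal is aborted.
async function withAbortableTimeout(task, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical cancellable operation, standing in for a real network call
function slowOperation(durationMs, signal) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve('done'), durationMs);
    signal.addEventListener(
      'abort',
      () => {
        clearTimeout(timer); // actually stop the work
        reject(new Error('Timeout'));
      },
      { once: true }
    );
  });
}

// Usage: give the operation at most 5 seconds
function fetchWithDeadline() {
  return withAbortableTimeout((signal) => slowOperation(10000, signal), 5000);
}
```

The same wrapper works with `fetch`, which accepts a `signal` option: `withAbortableTimeout((signal) => fetch(url, { signal }), 5000)`.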
2. Bulkhead Isolation
Isolate resources:
// Separate connection pools (a node-postgres-style Pool is assumed here)
const criticalPool = new Pool({ max: 20 });
const nonCriticalPool = new Pool({ max: 5 });
// Critical operations use the critical pool
async function processPayment(data) {
  const client = await criticalPool.connect();
  try {
    // ...
  } finally {
    client.release(); // always hand the connection back to its pool
  }
}
// Non-critical operations use the non-critical pool
async function sendEmail(data) {
  const client = await nonCriticalPool.connect();
  try {
    // ...
  } finally {
    client.release();
  }
}
Real-World Example: E-commerce Platform
Architecture:
API Gateway
├── User Service (PostgreSQL)
├── Product Service (MongoDB)
├── Order Service (PostgreSQL)
├── Payment Service (PostgreSQL)
└── Notification Service (Redis)
Patterns Used:
- Circuit Breaker - Payment service calls
- Saga - Order creation flow
- Event Sourcing - Order history
- CQRS - Product catalog (write: normalized, read: denormalized)
- API Gateway - Single entry point
- Database per Service - Each service has its own DB
Result: The system handles 10,000+ orders/day with 99.9% uptime.
Best Practices
- Design for failure - Assume everything will fail
- Idempotency - Operations should be safe to retry
- Monitoring - You can't fix what you can't see
- Graceful degradation - The system should keep working with reduced functionality when a dependency fails
- Version APIs - Allow evolution without breaking changes
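Idempotency is the practice that makes retries safe, so it deserves a concrete example. A common approach is to have the client send a unique key with each logical request and have the server remember the result per key. Here's a minimal in-memory sketch; the `IdempotentExecutor` class is hypothetical, and a real service would keep the keys in a shared store with an expiry.

```javascript
// Remembers the result for each idempotency key, so a retried request
// gets the stored result instead of repeating the side effect.
class IdempotentExecutor {
  constructor() {
    this.results = new Map(); // key -> promise of the operation's result
  }

  async execute(key, operation) {
    if (!this.results.has(key)) {
      // Store the promise itself so concurrent duplicates share one run
      const pending = Promise.resolve().then(operation);
      this.results.set(key, pending);
      // Forget failures so the caller can retry with the same key
      pending.catch(() => {
        if (this.results.get(key) === pending) {
          this.results.delete(key);
        }
      });
    }
    return this.results.get(key);
  }
}

// Usage: the retry reuses the key, so the card is charged at most once
async function chargeOnce(executor, requestId, charge) {
  return executor.execute(`charge:${requestId}`, charge);
}
```

Only successful results are cached here. Whether a failed attempt should be retried or remembered depends on the operation, so treat that choice as part of your API contract.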
Conclusion
Building resilient distributed systems requires understanding these patterns and applying them appropriately. The keys are:
- Start simple - Don't over-engineer
- Add resilience gradually - Based on actual needs
- Monitor everything - Know when patterns are needed
- Test failure scenarios - Chaos engineering
Remember: Perfect is the enemy of good. A system that's 95% reliable and ships is better than a perfect system that never ships.
What distributed systems challenges have you faced? Which patterns have been most helpful?