System Design Patterns: Building Resilient Distributed Systems

By Sanjay Goraniya

Designing distributed systems is challenging. You're dealing with network failures, partial failures, consistency issues, and more. Over the years, I've learned that certain patterns consistently help build resilient systems. Let me share the patterns that have proven most valuable.

The Challenges of Distributed Systems

Before diving into patterns, let's acknowledge what we're up against:

  1. Network is unreliable - Messages can be lost, delayed, or duplicated
  2. Clocks aren't synchronized - Time-based logic is tricky
  3. Partial failures - Some parts work while others fail
  4. Concurrency - Multiple things happening simultaneously
  5. State management - Keeping data consistent across services

Core Patterns

1. Circuit Breaker

Prevent cascading failures by stopping calls to failing services:

Code
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}

// Usage (chargeCustomer is an illustrative wrapper so that `return` is valid)
const breaker = new CircuitBreaker();

async function chargeCustomer(userId, amount) {
  try {
    return await breaker.call(() => paymentService.charge(userId, amount));
  } catch (error) {
    // Handle failure gracefully
    return { success: false, message: 'Payment service unavailable' };
  }
}

2. Retry with Exponential Backoff

Handle transient failures:

Code
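// Resolve after `ms` milliseconds
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Note: maxRetries is the total number of attempts, including the first call
async function retryWithBackoff(fn, maxRetries = 3, initialDelay = 1000) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Out of attempts: surface the last error to the caller
      if (attempt === maxRetries - 1) throw error;

      // Wait 1x, 2x, 4x, ... the initial delay before the next attempt
      const delay = initialDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
}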

// Usage
const result = await retryWithBackoff(
  () => api.fetchUser(userId),
  3, // max attempts, including the first call
  1000 // initial delay in ms
);
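
Two refinements are common with this pattern: adding random jitter so that many clients don't retry in lockstep, and retrying only errors that are actually transient. Here's a minimal sketch of both. The option names (`maxAttempts`, `maxDelay`, `isRetryable`) and the `pause` helper are my own assumptions for illustration, not part of any library:

```javascript
// Sketch: exponential backoff with "full jitter" and a retryable-error filter.
// All option names here are illustrative assumptions.
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithJitter(fn, options = {}) {
  const {
    maxAttempts = 3,            // total attempts, including the first call
    initialDelay = 1000,        // base delay in ms
    maxDelay = 30000,           // upper bound for any single wait
    isRetryable = () => true,   // caller decides which errors are transient
  } = options;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const lastAttempt = attempt === maxAttempts - 1;
      if (lastAttempt || !isRetryable(error)) throw error;

      // Cap the exponential delay, then wait a random time below the cap
      const cap = Math.min(maxDelay, initialDelay * 2 ** attempt);
      await pause(Math.random() * cap);
    }
  }
}
```

A validation error or a 4xx response usually shouldn't be retried, which is what the `isRetryable` hook is for.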

3. Bulkhead Pattern

Isolate failures to prevent them from spreading:

Code
// Separate thread pools for different operations
// (ThreadPool is a placeholder; JavaScript has no built-in thread pool class)
class BulkheadExecutor {
  constructor() {
    // Critical operations get more resources
    this.criticalPool = new ThreadPool(10);
    // Non-critical operations get fewer resources
    this.nonCriticalPool = new ThreadPool(2);
  }

  async executeCritical(task) {
    return this.criticalPool.execute(task);
  }

  async executeNonCritical(task) {
    return this.nonCriticalPool.execute(task);
  }
}
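
In a Node.js service, the same isolation is often expressed as a cap on in-flight async tasks rather than as real thread pools. Here's a minimal sketch under that assumption. The `Bulkhead` class and its `maxQueued` parameter are hypothetical names of mine:

```javascript
// Sketch: a bulkhead as a concurrency limiter for async tasks.
class Bulkhead {
  constructor(maxConcurrent, maxQueued = Infinity) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueued = maxQueued;
    this.active = 0;   // slots currently in use
    this.queue = [];   // resolvers for callers waiting on a slot
  }

  async execute(task) {
    if (this.active >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueued) {
        // Fail fast instead of letting the backlog grow without bound
        throw new Error('Bulkhead is full');
      }
      // Wait until a finishing task hands over its slot
      await new Promise((resolve) => this.queue.push(resolve));
    } else {
      this.active++;
    }

    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) next();       // pass the slot straight to the next waiter
      else this.active--;     // nobody waiting: free the slot
    }
  }
}

// One bulkhead per workload, so a flood of emails can't starve payments
const criticalBulkhead = new Bulkhead(10);
const nonCriticalBulkhead = new Bulkhead(2, 100);
```

Rejecting work once the queue is full is a design choice: it keeps a slow dependency from turning into unbounded memory growth.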

4. Saga Pattern

Manage distributed transactions:

Code
class OrderSaga {
  async execute(order) {
    const steps = [
      { name: 'reserveInventory', compensate: 'releaseInventory' },
      { name: 'chargePayment', compensate: 'refundPayment' },
      { name: 'createShipment', compensate: 'cancelShipment' }
    ];

    const executed = [];
    
    try {
      for (const step of steps) {
        await this[step.name](order);
        executed.push(step);
      }
      return { success: true };
    } catch (error) {
      // Compensate in reverse order
      for (const step of executed.reverse()) {
        await this[step.compensate](order);
      }
      throw error;
    }
  }

  async reserveInventory(order) {
    // Reserve inventory
  }

  async releaseInventory(order) {
    // Release inventory
  }

  // ... other methods
}
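
One gap worth noting in the loop above: if a compensation itself throws, the remaining compensations are skipped. Here's a minimal sketch of a generic runner that keeps compensating and reports what could not be undone. The `runSaga` function and the `{ name, action, compensate }` step shape are assumptions for illustration:

```javascript
// Sketch: saga runner that survives failing compensations.
// Each step is { name, action(context), compensate(context) } (assumed shape).
async function runSaga(steps, context) {
  const completed = [];
  try {
    for (const step of steps) {
      await step.action(context);
      completed.push(step);
    }
    return { success: true, compensationErrors: [] };
  } catch (error) {
    const compensationErrors = [];
    // Undo completed steps newest-first; keep going if one undo fails
    for (const step of [...completed].reverse()) {
      try {
        await step.compensate(context);
      } catch (compensationError) {
        compensationErrors.push({ step: step.name, error: compensationError });
      }
    }
    return { success: false, error, compensationErrors };
  }
}
```

Anything left in `compensationErrors` needs follow-up, typically a retry queue or a manual review.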

5. Event Sourcing

Store all changes as a sequence of events:

Code
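class EventStore {
  constructor() {
    this.events = [];
  }

  append(aggregateId, event) {
    this.events.push({
      aggregateId,
      event,
      timestamp: Date.now(),
      version: this.getNextVersion(aggregateId)
    });
  }

  // Versions count up from 1 within each aggregate
  getNextVersion(aggregateId) {
    return this.getEvents(aggregateId).length + 1;
  }

  getEvents(aggregateId) {
    return this.events.filter(e => e.aggregateId === aggregateId);
  }

  reconstruct(aggregateId) {
    // Replay the stored events in order to rebuild the current state
    return this.getEvents(aggregateId).reduce(
      (state, record) => this.applyEvent(state, record.event),
      {}
    );
  }

  // Placeholder reducer (an assumption): merge each event's data into the state.
  // A real aggregate would switch on event.type instead.
  applyEvent(state, event) {
    return { ...state, ...event.data };
  }
}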

// Usage
const store = new EventStore();
store.append('order-123', { type: 'OrderCreated', data: { /* ... */ } });
store.append('order-123', { type: 'ItemAdded', data: { /* ... */ } });

const order = store.reconstruct('order-123');

6. CQRS (Command Query Responsibility Segregation)

Separate read and write models:

Code
// Write model (commands)
class OrderCommandHandler {
  constructor(eventStore) {
    this.eventStore = eventStore;
  }

  async createOrder(data) {
    const order = new Order(data);
    // Key events by the order's own id so reconstruct(orderId) can find them
    await this.eventStore.append(order.id, order.createdEvent());
    return order.id;
  }

  async addItem(orderId, item) {
    const order = await this.eventStore.reconstruct(orderId);
    order.addItem(item);
    await this.eventStore.append(orderId, order.itemAddedEvent());
  }
}

// Read model (queries)
class OrderQueryHandler {
  async getOrder(orderId) {
    // Optimized read model (denormalized)
    return this.readDb.query('SELECT * FROM order_view WHERE id = ?', [orderId]);
  }

  async getUserOrders(userId) {
    return this.readDb.query('SELECT * FROM order_view WHERE user_id = ?', [userId]);
  }
}
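
What the handlers above leave out is how `order_view` gets populated. Usually a projection consumes the events from the write side and updates the denormalized view. Here's a minimal in-memory sketch. The `OrderProjection` class and the event shapes (`orderId`, `userId`, `item.price`, `item.quantity`) are assumptions, and a `Map` stands in for the read database:

```javascript
// Sketch: a projection that builds the read model from write-side events.
class OrderProjection {
  constructor() {
    this.orderView = new Map(); // stand-in for the order_view table
  }

  // Apply one event to the read model
  handle(event) {
    switch (event.type) {
      case 'OrderCreated':
        this.orderView.set(event.orderId, {
          id: event.orderId,
          userId: event.userId,
          items: [],
          total: 0,
        });
        break;
      case 'ItemAdded': {
        const view = this.orderView.get(event.orderId);
        if (!view) break; // event for an unknown order: ignored in this sketch
        view.items.push(event.item);
        view.total += event.item.price * event.item.quantity;
        break;
      }
      default:
        break; // events this view doesn't care about
    }
  }

  getOrder(orderId) {
    return this.orderView.get(orderId);
  }

  getUserOrders(userId) {
    return [...this.orderView.values()].filter((o) => o.userId === userId);
  }
}
```

Because the view is updated after the write, reads can briefly lag behind commands. That eventual consistency is the main trade-off of CQRS.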

7. API Gateway Pattern

Single entry point for clients:

Code
class APIGateway {
  constructor() {
    this.routes = {
      '/users': userService,
      '/orders': orderService,
      '/payments': paymentService
    };
  }

  async handleRequest(path, method, body) {
    const service = this.routes[path];
    if (!service) {
      return { status: 404, body: 'Not found' };
    }

    // Add cross-cutting concerns
    const requestId = this.generateRequestId();
    this.logRequest(requestId, path, method);

    try {
      const result = await service.handle(method, body);
      return { status: 200, body: result };
    } catch (error) {
      this.logError(requestId, error);
      return { status: 500, body: 'Internal server error' };
    }
  }
}

Data Patterns

1. Database Per Service

Each service has its own database:

Code
User Service → User Database
Order Service → Order Database
Payment Service → Payment Database

Benefits:

  • Service independence
  • Technology flexibility
  • Scalability

Challenges:

  • Data consistency
  • Cross-service queries
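
One common answer to the cross-service query problem is API composition: a caller fetches from each owning service and joins the results in memory. Here's a minimal sketch. The function and the service client methods (`getOrder`, `getUser`, `getPaymentForOrder`) are hypothetical names, passed in as dependencies:

```javascript
// Sketch: API composition instead of a cross-database join.
// The service clients are hypothetical; each service owns its own database.
async function getOrderDetails(orderId, { orderService, userService, paymentService }) {
  const order = await orderService.getOrder(orderId);
  if (!order) return null;

  // The remaining lookups are independent, so run them in parallel.
  // allSettled lets the response degrade if one service is down.
  const [user, payment] = await Promise.allSettled([
    userService.getUser(order.userId),
    paymentService.getPaymentForOrder(orderId),
  ]);

  return {
    ...order,
    user: user.status === 'fulfilled' ? user.value : null,
    payment: payment.status === 'fulfilled' ? payment.value : null,
  };
}
```

Returning `null` for an unavailable part is a choice made for this sketch. Some callers would rather fail the whole request.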

2. Shared Database Anti-Pattern

Avoid this:

Code
User Service ──┐
Order Service ──┼──→ Shared Database
Payment Service ─┘

Why it's bad:

  • Tight coupling
  • Can't scale independently
  • Schema changes affect all services

3. Event-Driven Communication

Services communicate via events:

Code
// Publisher
class OrderService {
  async createOrder(data) {
    const order = await this.saveOrder(data);
    
    // Publish event (include items, which the inventory subscriber reads)
    await eventBus.publish('OrderCreated', {
      orderId: order.id,
      userId: order.userId,
      items: order.items,
      total: order.total
    });
    
    return order;
  }
}

// Subscriber
class InventoryService {
  constructor() {
    eventBus.subscribe('OrderCreated', this.handleOrderCreated.bind(this));
  }

  async handleOrderCreated(event) {
    await this.reserveInventory(event.orderId, event.items);
  }
}
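
The `eventBus` used above is left undefined. In production it would typically be backed by a message broker. For local experiments and tests, an in-memory stand-in with the same `publish`/`subscribe` shape is enough. Here's a minimal sketch (the `InMemoryEventBus` class is my assumption, and it offers no persistence or delivery guarantees):

```javascript
// Sketch: in-memory event bus for local testing only.
class InMemoryEventBus {
  constructor() {
    this.handlers = new Map(); // event type -> array of handler functions
  }

  subscribe(eventType, handler) {
    if (!this.handlers.has(eventType)) this.handlers.set(eventType, []);
    this.handlers.get(eventType).push(handler);
  }

  // Deliver to every subscriber; resolve with the errors of those that failed
  async publish(eventType, payload) {
    const handlers = this.handlers.get(eventType) || [];
    const results = await Promise.allSettled(
      // The async wrapper also turns synchronous throws into rejections
      handlers.map(async (handler) => handler(payload))
    );
    return results
      .filter((result) => result.status === 'rejected')
      .map((result) => result.reason);
  }
}
```

Using `Promise.allSettled` means one failing subscriber doesn't prevent the others from receiving the event.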

Resilience Patterns

1. Timeout Pattern

Never wait forever:

Code
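async function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    // Clear the timer so it doesn't linger after the race settles
    clearTimeout(timer);
  }
}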

// Usage
try {
  const result = await withTimeout(
    api.fetchData(),
    5000 // 5 second timeout
  );
} catch (error) {
  if (error.message === 'Timeout') {
    // Handle timeout
  }
}
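
Keep in mind that racing against a timer only stops the waiting. The underlying operation keeps running in the background. Where the operation accepts an `AbortSignal`, the work itself can be cancelled as well. Here's a minimal sketch. The `withAbortTimeout` name is an assumption of mine, and it only helps if the task actually honors the signal:

```javascript
// Sketch: a timeout that also cancels the work via AbortController.
// `task` receives an AbortSignal and is expected to stop when it fires.
async function withAbortTimeout(task, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(
    () => controller.abort(new Error('Timeout')),
    timeoutMs
  );
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer); // don't leave the timer pending after completion
  }
}

// Example task that honors the signal (simulated slow work)
function slowOperation(signal, durationMs) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve('finished'), durationMs);
    signal.addEventListener('abort', () => {
      clearTimeout(timer);   // stop the work itself
      reject(signal.reason); // surface the reason passed to abort()
    });
  });
}
```

Signal-aware APIs such as `fetch(url, { signal })` can be passed in the same way as `slowOperation`.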

2. Bulkhead Isolation

Isolate resources:

Code
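// Separate connection pools
// (assumes a node-postgres style Pool: connect() resolves to a client with release())
const criticalPool = new Pool({ max: 20 });
const nonCriticalPool = new Pool({ max: 5 });

// Critical operations use critical pool
async function processPayment(data) {
  const client = await criticalPool.connect();
  try {
    // ...
  } finally {
    client.release(); // always return the connection to its pool
  }
}

// Non-critical operations use non-critical pool
async function sendEmail(data) {
  const client = await nonCriticalPool.connect();
  try {
    // ...
  } finally {
    client.release(); // always return the connection to its pool
  }
}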

Real-World Example: E-commerce Platform

Architecture:

Code
API Gateway
├── User Service (PostgreSQL)
├── Product Service (MongoDB)
├── Order Service (PostgreSQL)
├── Payment Service (PostgreSQL)
└── Notification Service (Redis)

Patterns Used:

  1. Circuit Breaker - Payment service calls
  2. Saga - Order creation flow
  3. Event Sourcing - Order history
  4. CQRS - Product catalog (write: normalized, read: denormalized)
  5. API Gateway - Single entry point
  6. Database per Service - Each service has its own DB

Result: System handles 10,000+ orders/day with 99.9% uptime.

Best Practices

  1. Design for failure - Assume everything will fail
  2. Idempotency - Operations should be safe to retry
  3. Monitoring - You can't fix what you can't see
  4. Graceful degradation - The system should keep working, with reduced functionality, when a dependency fails
  5. Version APIs - Allow evolution without breaking changes
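
Idempotency is the practice that makes retries safe, so it deserves a concrete shape. Here's a minimal sketch based on idempotency keys: the caller supplies a unique key per logical operation, and repeated calls with the same key reuse the first result. The `IdempotentExecutor` class is a hypothetical name, and the in-memory `Map` stands in for what would normally be a shared store with expiry:

```javascript
// Sketch: deduplicate retried operations by idempotency key.
class IdempotentExecutor {
  constructor() {
    this.results = new Map(); // key -> promise of the operation's result
  }

  execute(idempotencyKey, operation) {
    if (!this.results.has(idempotencyKey)) {
      // Store the promise itself so concurrent duplicates share one execution
      const pending = Promise.resolve()
        .then(operation)
        .catch((error) => {
          // Forget failures so the caller can retry with the same key
          this.results.delete(idempotencyKey);
          throw error;
        });
      this.results.set(idempotencyKey, pending);
    }
    return this.results.get(idempotencyKey);
  }
}
```

With something like this in front of a payment call, a retry after a lost response returns the original charge result instead of charging twice.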

Conclusion

Building resilient distributed systems requires understanding these patterns and applying them appropriately. The key is:

  • Start simple - Don't over-engineer
  • Add resilience gradually - Based on actual needs
  • Monitor everything - Know when patterns are needed
  • Test failure scenarios - Chaos engineering

Remember: Perfect is the enemy of good. A system that's 95% reliable and ships is better than a perfect system that never ships.

What distributed systems challenges have you faced? Which patterns have been most helpful?
