Building Resilient APIs: Retry Strategies and Circuit Breakers

By Sanjay Goraniya

APIs fail. Networks are unreliable. Services go down. The question isn't whether failures will happen—it's how your system handles them. After building APIs that handle millions of requests daily, I've learned that resilience isn't optional—it's essential.

The Challenge

Common Failure Scenarios

  1. Network failures - Timeouts, connection errors
  2. Service overload - Too many requests
  3. Partial failures - Some endpoints work, others don't
  4. Cascading failures - One failure causes others

Retry Strategies

Simple Retry

Code
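// Resolve after `ms` milliseconds; used by the retry helpers in this article
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// maxRetries is the total number of attempts, including the first call
async function retry(fn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Out of attempts: surface the last error to the caller
      if (attempt === maxRetries - 1) throw error;
      await sleep(1000 * (attempt + 1)); // Linear delay: 1s, 2s, ...
    }
  }
}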

Exponential Backoff

Code
async function retryWithBackoff(fn, maxRetries = 3, initialDelay = 1000) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      
      const delay = initialDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
}
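
To make the growth of the delays concrete, the sketch below lists the waits that `retryWithBackoff` would sleep through if every attempt failed. The `backoffSchedule` helper is hypothetical and exists only for illustration. Note that the final attempt throws instead of sleeping, so `maxRetries = 3` yields two waits, not three.

```javascript
// Hypothetical helper: lists the delays retryWithBackoff would use
// if every attempt failed. The last attempt throws instead of sleeping.
function backoffSchedule(maxRetries = 3, initialDelay = 1000) {
  const delays = [];
  for (let attempt = 0; attempt < maxRetries - 1; attempt++) {
    delays.push(initialDelay * Math.pow(2, attempt));
  }
  return delays;
}

console.log(backoffSchedule(3));      // [ 1000, 2000 ]
console.log(backoffSchedule(5, 500)); // [ 500, 1000, 2000, 4000 ]
```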

Jitter

Add randomness so that clients that failed at the same moment don't all retry at the same moment (the thundering herd problem):

Code
function addJitter(delay) {
  const jitter = delay * 0.1 * Math.random();
  return delay + jitter;
}

async function retryWithJitter(fn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      
      const baseDelay = 1000 * Math.pow(2, attempt);
      const delay = addJitter(baseDelay);
      await sleep(delay);
    }
  }
}
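
The `addJitter` function above spreads retries over only 10% of the base delay, so clients that failed together still retry in a fairly narrow window. A variant often called "full jitter" draws the whole delay at random between zero and the exponential value. A minimal sketch, where the `fullJitterDelay` name and the `maxDelay` cap are assumptions rather than part of the original code:

```javascript
// Full jitter: pick a random delay in [0, exponential), where the
// exponential value is capped at maxDelay (an assumed upper bound).
function fullJitterDelay(attempt, baseDelay = 1000, maxDelay = 30000) {
  const exponential = Math.min(maxDelay, baseDelay * Math.pow(2, attempt));
  return Math.random() * exponential;
}
```

It could replace the `addJitter(baseDelay)` call in `retryWithJitter` with `fullJitterDelay(attempt)`. The trade-off is that some retries fire almost immediately, in exchange for a much wider spread.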

Retry Only on Transient Errors

Code
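function isRetryableError(error) {
  // Assumes errors shaped like axios errors: `code` for network
  // failures, `response.status` for HTTP failures
  const status = error.response?.status;

  // Retry on network errors, timeouts, 5xx
  if (error.code === 'ECONNREFUSED') return true;
  if (error.code === 'ETIMEDOUT') return true;
  if (status >= 500) return true;

  // 408 (Request Timeout) and 429 (Too Many Requests) are transient
  if (status === 408 || status === 429) return true;

  // Don't retry on other 4xx (client errors)
  if (status >= 400 && status < 500) {
    return false;
  }

  // Anything else is unknown; this sketch treats it as transient
  return true;
}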

async function retryOnTransientErrors(fn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (!isRetryableError(error)) throw error;
      if (attempt === maxRetries - 1) throw error;
      
      const delay = 1000 * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
}

Circuit Breaker Pattern

Basic Implementation

Code
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.successCount = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.state = 'CLOSED';
    }
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }

  getState() {
    return {
      state: this.state,
      failureCount: this.failureCount,
      nextAttempt: this.nextAttempt
    };
  }
}

Usage

Code
const breaker = new CircuitBreaker(5, 60000);

async function callExternalAPI() {
  try {
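    return await breaker.call(async () => {
      const response = await fetch('https://api.example.com/data');
      // fetch resolves on HTTP errors, so count 5xx responses as failures
      if (response.status >= 500) throw new Error(`HTTP ${response.status}`);
      return response;
    });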
  } catch (error) {
    if (error.message === 'Circuit breaker is OPEN') {
      // Return cached data or default response
      return getCachedData();
    }
    throw error;
  }
}

Timeout Pattern

Never Wait Forever

Code
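async function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
  });
  try {
    // Whichever settles first wins; the losing promise is not cancelled
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // Don't leave the timer pending after the race
  }
}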

// Usage
try {
  const result = await withTimeout(
    fetch('https://api.example.com/data'),
    5000 // 5 second timeout
  );
} catch (error) {
  if (error.message === 'Timeout') {
    // Handle timeout
  }
}
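
The `withTimeout` helper above stops waiting, but the underlying request keeps running in the background. When the operation accepts an `AbortSignal`, as `fetch` does, the timeout can cancel it as well. A minimal sketch, assuming a hypothetical `withAbortTimeout` helper whose `task` callback receives the signal and honors it:

```javascript
// Runs task(signal) and aborts it if it takes longer than timeoutMs.
// Only works if the task actually listens to the signal.
async function withAbortTimeout(task, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(
    () => controller.abort(new Error('Timeout')),
    timeoutMs
  );
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer); // Clean up whether the task succeeded or failed
  }
}

// Usage: fetch rejects with the abort reason once the signal fires.
// const response = await withAbortTimeout(
//   (signal) => fetch('https://api.example.com/data', { signal }),
//   5000
// );
```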

Bulkhead Pattern

Isolate Failures

Code
class BulkheadExecutor {
  constructor(maxConcurrency = 10) {
    this.semaphore = maxConcurrency;
    this.queue = [];
  }

  async execute(fn) {
    return new Promise((resolve, reject) => {
      this.queue.push({ fn, resolve, reject });
      this.process();
    });
  }

  async process() {
    if (this.semaphore === 0 || this.queue.length === 0) {
      return;
    }

    this.semaphore--;
    const { fn, resolve, reject } = this.queue.shift();

    try {
      const result = await fn();
      resolve(result);
    } catch (error) {
      reject(error);
    } finally {
      this.semaphore++;
      this.process();
    }
  }
}

// Usage
const executor = new BulkheadExecutor(5);

// Multiple requests, but only 5 concurrent
const promises = requests.map(req => executor.execute(() => fetch(req)));
const results = await Promise.all(promises);
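
A single executor caps concurrency for one pool. The isolation that gives the pattern its name comes from giving each dependency its own pool, so a slow dependency can exhaust only its own slots. A minimal sketch, assuming two hypothetical dependencies (`payments` and `search`) and a compact `createLimiter` helper that is not part of the original code:

```javascript
// Returns a function that runs tasks with at most maxConcurrency in flight.
function createLimiter(maxConcurrency) {
  let active = 0;
  const queue = [];

  const next = () => {
    if (active >= maxConcurrency || queue.length === 0) return;
    active++;
    const { fn, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(fn)
      .then(resolve, reject)
      .finally(() => {
        active--; // Free the slot, then start the next queued task
        next();
      });
  };

  return (fn) =>
    new Promise((resolve, reject) => {
      queue.push({ fn, resolve, reject });
      next();
    });
}

// One pool per dependency (names and sizes are illustrative)
const bulkheads = {
  payments: createLimiter(5),
  search: createLimiter(20),
};

// Usage: a backlog of search calls cannot take slots away from payments.
// bulkheads.search(() => fetch('https://api.example.com/search'));
// bulkheads.payments(() => fetch('https://api.example.com/charge'));
```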

Fallback Strategies

Graceful Degradation

Code
async function getDataWithFallback() {
  try {
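    const response = await fetch('https://api.example.com/data');
    // fetch resolves on HTTP errors, so treat non-2xx as a failure
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return await response.json(); // Assumes the API returns JSON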
  } catch (error) {
    // Fallback to cache
    const cached = await getCachedData();
    if (cached) {
      return cached;
    }
    
    // Fallback to default
    return getDefaultData();
  }
}

Multiple Fallbacks

Code
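// Assumes the APIs return JSON. Rejects on non-2xx responses so the
// loop below moves on to the next source.
async function fetchJson(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response.json();
}

async function getDataWithMultipleFallbacks() {
  const sources = [
    () => fetchJson('https://api1.example.com/data'),
    () => fetchJson('https://api2.example.com/data'),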
    () => getCachedData(),
    () => getDefaultData()
  ];

  for (const source of sources) {
    try {
      const result = await source();
      if (result) return result;
    } catch (error) {
      continue; // Try next source
    }
  }

  throw new Error('All sources failed');
}

Combining Patterns

Complete Resilient Client

Code
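class ResilientAPIClient {
  constructor(options = {}) {
    // Two separate timeouts: how long the breaker stays open,
    // and how long a single request may take
    this.breaker = new CircuitBreaker(
      options.failureThreshold || 5,
      options.resetTimeout || 60000
    );
    this.maxRetries = options.maxRetries || 3;
    this.requestTimeout = options.requestTimeout || 5000;
  }

  async request(url, options = {}) {
    return this.breaker.call(async () => {
      return this.retryWithBackoff(async () => {
        const response = await this.withTimeout(
          fetch(url, options),
          this.requestTimeout
        );
        // fetch resolves on HTTP errors, so surface 5xx as failures
        if (response.status >= 500) {
          const error = new Error(`HTTP ${response.status}`);
          error.response = response;
          throw error;
        }
        return response;
      });
    });
  }

  async retryWithBackoff(fn) {
    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      try {
        return await fn();
      } catch (error) {
        if (!this.isRetryable(error)) throw error;
        if (attempt === this.maxRetries - 1) throw error;

        const delay = 1000 * Math.pow(2, attempt);
        await sleep(delay);
      }
    }
  }

  async withTimeout(promise, timeoutMs) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
    });
    try {
      return await Promise.race([promise, timeout]);
    } finally {
      clearTimeout(timer); // Don't leave the timer pending after the race
    }
  }

  isRetryable(error) {
    if (error.message === 'Timeout') return true;
    if (error instanceof TypeError) return true; // fetch network failure
    if (error.response?.status >= 500) return true;
    return false;
  }
}

// Usage
const client = new ResilientAPIClient({
  failureThreshold: 5,
  resetTimeout: 60000,  // Keep the breaker open for 60 seconds
  requestTimeout: 5000, // Give up on a single request after 5 seconds
  maxRetries: 3
});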

try {
  const data = await client.request('https://api.example.com/data');
} catch (error) {
  // All retries failed, circuit breaker open, or timeout
  // Use fallback
}

Real-World Example

Challenge: A payment service calling an external payment gateway was seeing frequent timeouts and failures.

Solution:

  1. Circuit breaker - Stop calling when gateway is down
  2. Retry with backoff - Handle transient failures
  3. Timeout - Don't wait forever
  4. Fallback - Queue payment for later processing

Implementation:

Code
const paymentBreaker = new CircuitBreaker(3, 60000);

async function processPayment(order) {
  try {
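    return await paymentBreaker.call(async () => {
      // Retrying a charge is only safe if the gateway deduplicates
      // repeated requests (for example, via an idempotency key)
      return await retryWithBackoff(async () => {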
        return await withTimeout(
          paymentGateway.charge(order),
          10000
        );
      }, 3);
    });
  } catch (error) {
    // All strategies failed - queue for later
    await queuePaymentForRetry(order);
    return { queued: true, orderId: order.id };
  }
}

Result:

  • Payment success rate: 85% → 98%
  • Timeout errors: Reduced by 90%
  • System stability: Improved significantly

Best Practices

  1. Retry transient errors - Network issues, timeouts
  2. Don't retry client errors - Most 4xx errors (408 and 429 aside) will fail the same way again
  3. Use exponential backoff - Prevent overwhelming services
  4. Add jitter - Prevent thundering herd
  5. Set timeouts - Never wait forever
  6. Use circuit breakers - Prevent cascading failures
  7. Implement fallbacks - Graceful degradation
  8. Monitor everything - Know when patterns trigger
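
For the monitoring point, one lightweight approach is to give each pattern a hook that fires when it triggers, so retries show up in logs or metrics instead of happening silently. A minimal sketch, assuming a hypothetical `retryWithHook` function and `onRetry` callback that are not part of the code above:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Exponential backoff that reports every retry through onRetry.
async function retryWithHook(fn, options = {}) {
  const { maxRetries = 3, initialDelay = 1000, onRetry = () => {} } = options;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      const delay = initialDelay * Math.pow(2, attempt);
      // Report which attempt failed, how long we will wait, and why
      onRetry({ attempt: attempt + 1, delay, error });
      await sleep(delay);
    }
  }
}

// Usage: log each retry; a metrics counter would work the same way.
// await retryWithHook(() => fetch('https://api.example.com/data'), {
//   onRetry: ({ attempt, delay, error }) =>
//     console.warn(`attempt ${attempt} failed (${error.message}), retrying in ${delay}ms`),
// });
```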

Conclusion

Building resilient APIs requires multiple strategies working together:

  • Retry - Handle transient failures
  • Circuit breaker - Prevent cascading failures
  • Timeout - Don't wait forever
  • Fallback - Graceful degradation

Remember: Failures are inevitable. Resilience is how you handle them. Design for failure, and your systems will be more reliable.

What resilience challenges have you faced? What patterns have worked best for your APIs?
