Building Resilient APIs: Retry Strategies and Circuit Breakers
APIs fail. Networks are unreliable. Services go down. The question isn't whether failures will happen—it's how your system handles them. After building APIs that handle millions of requests daily, I've learned that resilience isn't optional—it's essential.
The Challenge
Common Failure Scenarios
- Network failures - Timeouts, connection errors
- Service overload - Too many requests
- Partial failures - Some endpoints work, others don't
- Cascading failures - One failure causes others
Retry Strategies
Simple Retry
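// Resolves after `ms` milliseconds; the retry helpers in this post rely on it.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Note: maxRetries is the total number of attempts, including the first one.
async function retry(fn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Out of attempts: surface the last error to the caller.
      if (attempt === maxRetries - 1) throw error;
      // Linear delay: 1s, 2s, 3s, ...
      await sleep(1000 * (attempt + 1));
    }
  }
}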
Exponential Backoff
async function retryWithBackoff(fn, maxRetries = 3, initialDelay = 1000) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      // Delay doubles after each failure: 1s, 2s, 4s, ...
      const delay = initialDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
}
Jitter
Add randomness to the retry delay so that clients that failed at the same moment do not all retry at the same moment (the thundering herd problem):
// Adds up to 10% random extra delay on top of the base delay.
function addJitter(delay) {
  const jitter = delay * 0.1 * Math.random();
  return delay + jitter;
}

async function retryWithJitter(fn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      const baseDelay = 1000 * Math.pow(2, attempt);
      const delay = addJitter(baseDelay);
      await sleep(delay);
    }
  }
}
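The 10% jitter above still leaves retries clustered around the same instants. A common alternative, often called full jitter, picks each delay uniformly between zero and a capped exponential ceiling. The sketch below is an illustration rather than part of the helpers above; `fullJitterDelay`, `baseMs`, `capMs`, and the injectable `random` parameter are hypothetical names.

```javascript
// Hypothetical helper: delay for a given attempt using "full jitter".
// `random` is injectable so the function can be tested deterministically.
function fullJitterDelay(attempt, baseMs = 1000, capMs = 30000, random = Math.random) {
  // Exponential growth, capped so late attempts never wait longer than capMs.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Uniform in [0, ceiling): spreads clients across the whole window.
  return random() * ceiling;
}

// Upper bounds for the first six attempts with the defaults
// (random fixed at 1 to show the ceiling itself).
const ceilings = [0, 1, 2, 3, 4, 5].map((a) => fullJitterDelay(a, 1000, 30000, () => 1));
console.log(ceilings); // 1000, 2000, 4000, 8000, 16000, 30000
```

The cap matters as much as the jitter: without it, a long outage pushes later attempts to wait minutes between tries.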
Retry Only on Transient Errors
// Assumes errors shaped like Node.js system errors (`code`) or
// axios-style HTTP errors (`response.status`).
function isRetryableError(error) {
  const status = error.response?.status;
  // Connection refused and timeouts are usually transient.
  if (error.code === 'ECONNREFUSED' || error.code === 'ETIMEDOUT') return true;
  // 5xx: the server failed; a later attempt may succeed.
  if (status >= 500) return true;
  // 4xx: the request itself is wrong; retrying will not help.
  if (status >= 400 && status < 500) return false;
  // Anything else (no code, no response) is treated as retryable.
  return true;
}

async function retryOnTransientErrors(fn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Fail immediately on errors that a retry cannot fix.
      if (!isRetryableError(error)) throw error;
      if (attempt === maxRetries - 1) throw error;
      const delay = 1000 * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
}
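Some servers tell clients how long to wait through the HTTP `Retry-After` header, typically on 429 or 503 responses. A retry loop could use that value in place of its computed backoff delay. Below is a minimal sketch of parsing the header; `retryAfterMs` and its `now` parameter are hypothetical names, and whether to retry a 429 at all is a policy decision this sketch does not make.

```javascript
// Hypothetical helper: converts a Retry-After header value to milliseconds.
// The header is either a number of seconds or an HTTP date.
// Returns null when the value is missing or cannot be parsed.
function retryAfterMs(headerValue, now = Date.now()) {
  if (headerValue == null) return null;
  const trimmed = String(headerValue).trim();
  if (trimmed === '') return null;
  // Form 1: delay in seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const when = Date.parse(trimmed);
  if (Number.isNaN(when)) return null;
  // A date in the past means the client may retry right away.
  return Math.max(0, when - now);
}
```

In a retry loop this might read `const delay = retryAfterMs(header) ?? computedBackoff;`, so the server's hint wins when present and the backoff applies otherwise.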
Circuit Breaker Pattern
Basic Implementation
class CircuitBreaker {
  // threshold: consecutive failures before the circuit opens.
  // timeout: how long (ms) the circuit stays open before a trial call is allowed.
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      // Fail fast until the cool-down period has passed.
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      // Cool-down over: let this call through as a trial.
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.state = 'CLOSED';
    }
  }

  onFailure() {
    this.failureCount++;
    // A failed trial call in HALF_OPEN re-opens the circuit at once,
    // because failureCount is still at or above the threshold.
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }

  getState() {
    return {
      state: this.state,
      failureCount: this.failureCount,
      nextAttempt: this.nextAttempt
    };
  }
}
Usage
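const breaker = new CircuitBreaker(5, 60000);

async function callExternalAPI() {
  try {
    return await breaker.call(async () => {
      const response = await fetch('https://api.example.com/data');
      // fetch only rejects on network errors, so surface server errors
      // explicitly; otherwise the breaker would count a 503 as a success.
      if (response.status >= 500) throw new Error(`HTTP ${response.status}`);
      return response;
    });
  } catch (error) {
    if (error.message === 'Circuit breaker is OPEN') {
      // Return cached data or default response
      return getCachedData();
    }
    throw error;
  }
}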
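Matching on `error.message` is brittle, since any change to the wording silently disables the fallback. One option is a dedicated error class checked with `instanceof`. The sketch below uses a deliberately stripped-down breaker to stay self-contained; `CircuitOpenError`, `MiniBreaker`, `callWithFallback`, and the injectable `now` clock are hypothetical names, not the class defined above.

```javascript
// Hypothetical error type so callers can detect an open circuit reliably.
class CircuitOpenError extends Error {
  constructor(retryAt) {
    super('Circuit breaker is OPEN');
    this.name = 'CircuitOpenError';
    this.retryAt = retryAt; // timestamp (ms) when a trial call will be allowed
  }
}

// Stripped-down breaker for illustration; `now` is injectable for testing.
class MiniBreaker {
  constructor(threshold = 2, resetMs = 1000, now = Date.now) {
    this.threshold = threshold;
    this.resetMs = resetMs;
    this.now = now;
    this.failures = 0;
    this.openUntil = 0;
  }

  async call(fn) {
    // Open: enough failures recorded and the cool-down has not elapsed.
    if (this.failures >= this.threshold && this.now() < this.openUntil) {
      throw new CircuitOpenError(this.openUntil);
    }
    try {
      const result = await fn();
      this.failures = 0; // any success closes the circuit
      return result;
    } catch (error) {
      this.failures++;
      if (this.failures >= this.threshold) {
        this.openUntil = this.now() + this.resetMs;
      }
      throw error;
    }
  }
}

// Fall back only when the circuit is open; rethrow everything else.
async function callWithFallback(breaker, fn, fallback) {
  try {
    return await breaker.call(fn);
  } catch (error) {
    if (error instanceof CircuitOpenError) return fallback();
    throw error;
  }
}
```

Carrying `retryAt` on the error also lets the caller tell users or upstream services when a retry is worth attempting.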
Timeout Pattern
Never Wait Forever
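// Rejects with 'Timeout' if `promise` has not settled within timeoutMs.
// This only stops waiting; the underlying operation keeps running.
async function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    // Clear the timer so it does not linger after the race settles.
    clearTimeout(timer);
  }
}

// Usage
try {
  const result = await withTimeout(
    fetch('https://api.example.com/data'),
    5000 // 5 second timeout
  );
} catch (error) {
  if (error.message === 'Timeout') {
    // Handle timeout
  }
}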
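`Promise.race` only stops waiting; the request itself keeps running in the background. With `fetch`, an `AbortController` can cancel the request as well. A minimal sketch, assuming a runtime where `fetch` and `AbortController` are available globally (modern browsers, Node.js 18+); `fetchWithTimeout` is a hypothetical name, and its `fetchImpl` parameter exists only so the function can be exercised without a network.

```javascript
// Hypothetical helper: aborts the request if it takes longer than timeoutMs.
async function fetchWithTimeout(url, timeoutMs, fetchImpl = fetch) {
  const controller = new AbortController();
  // When the timer fires, the signal aborts and the pending fetch rejects.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl(url, { signal: controller.signal });
  } finally {
    // Always clear the timer, whether the request succeeded or failed.
    clearTimeout(timer);
  }
}

// Usage (not executed here):
// try {
//   const response = await fetchWithTimeout('https://api.example.com/data', 5000);
// } catch (error) {
//   if (error.name === 'AbortError') { /* handle timeout */ }
// }
```

The timer is cleared once the response headers arrive, so reading a large body afterwards is not covered by this timeout.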
Bulkhead Pattern
Isolate Failures
class BulkheadExecutor {
  constructor(maxConcurrency = 10) {
    this.semaphore = maxConcurrency; // free slots remaining
    this.queue = [];                 // tasks waiting for a slot
  }

  async execute(fn) {
    return new Promise((resolve, reject) => {
      this.queue.push({ fn, resolve, reject });
      this.process();
    });
  }

  async process() {
    // Nothing to do if every slot is busy or no task is waiting.
    if (this.semaphore === 0 || this.queue.length === 0) {
      return;
    }
    this.semaphore--;
    const { fn, resolve, reject } = this.queue.shift();
    try {
      const result = await fn();
      resolve(result);
    } catch (error) {
      reject(error);
    } finally {
      // Free the slot and start the next waiting task, if any.
      this.semaphore++;
      this.process();
    }
  }
}

// Usage
const executor = new BulkheadExecutor(5);
// Multiple requests, but only 5 concurrent
const promises = requests.map(req => executor.execute(() => fetch(req)));
const results = await Promise.all(promises);
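A single shared executor caps total concurrency, but the isolation benefit comes from giving each downstream dependency its own pool, so a slow dependency can only exhaust its own slots. A compact sketch of that idea follows; `createLimiter`, `limiters`, and the dependency names `payments` and `reports` are hypothetical.

```javascript
// Hypothetical compact limiter: at most `max` tasks run at once, the rest wait.
function createLimiter(max) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= max || waiting.length === 0) return;
    active++;
    const { fn, resolve, reject } = waiting.shift();
    // Promise.resolve().then(fn) also captures synchronous throws from fn.
    Promise.resolve()
      .then(fn)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next();
      });
  };

  return (fn) =>
    new Promise((resolve, reject) => {
      waiting.push({ fn, resolve, reject });
      next();
    });
}

// One pool per dependency: a stalled "reports" service cannot use up
// the slots reserved for "payments".
const limiters = {
  payments: createLimiter(5),
  reports: createLimiter(2),
};
```

A call then looks like `limiters.reports(() => fetch(reportUrl))`, with the pool sizes tuned per dependency rather than shared.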
Fallback Strategies
Graceful Degradation
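// Assumes the API returns JSON and that getCachedData() and getDefaultData()
// return data in the same parsed shape.
async function getDataWithFallback() {
  try {
    const response = await fetch('https://api.example.com/data');
    // fetch resolves on HTTP errors, so treat them as failures explicitly;
    // otherwise the fallback path would never run for a 500.
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return await response.json();
  } catch (error) {
    // Fallback to cache
    const cached = await getCachedData();
    if (cached) {
      return cached;
    }
    // Fallback to default
    return getDefaultData();
  }
}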
Multiple Fallbacks
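// Helper introduced for this example: fetches a URL and parses the body
// as JSON, throwing on HTTP error statuses so the next source gets tried.
async function fetchJson(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response.json();
}

async function getDataWithMultipleFallbacks() {
  // Ordered from most to least preferred.
  const sources = [
    () => fetchJson('https://api1.example.com/data'),
    () => fetchJson('https://api2.example.com/data'),
    () => getCachedData(),
    () => getDefaultData()
  ];
  let lastError;
  for (const source of sources) {
    try {
      const result = await source();
      if (result) return result;
    } catch (error) {
      lastError = error; // Remember why this source failed, then try the next one
    }
  }
  throw new Error('All sources failed', { cause: lastError });
}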
Combining Patterns
Complete Resilient Client
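class ResilientAPIClient {
  constructor(options = {}) {
    // The breaker's cool-down and the per-request timeout are separate settings.
    // Sharing one `timeout` option would turn a 60 s cool-down into a 60 s request timeout.
    this.breaker = new CircuitBreaker(
      options.failureThreshold ?? 5,
      options.resetTimeout ?? 60000
    );
    this.maxRetries = options.maxRetries ?? 3;
    this.requestTimeout = options.requestTimeout ?? 5000;
  }

  async request(url, options = {}) {
    // Breaker outside, retries inside: one exhausted retry sequence
    // counts as a single breaker failure.
    return this.breaker.call(() =>
      this.retryWithBackoff(async () => {
        const response = await this.withTimeout(fetch(url, options), this.requestTimeout);
        // fetch resolves on HTTP errors, so turn 5xx into an error isRetryable can see.
        if (response.status >= 500) {
          const error = new Error(`HTTP ${response.status}`);
          error.response = response;
          throw error;
        }
        return response;
      })
    );
  }

  async retryWithBackoff(fn) {
    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      try {
        return await fn();
      } catch (error) {
        if (!this.isRetryable(error)) throw error;
        if (attempt === this.maxRetries - 1) throw error;
        const delay = 1000 * Math.pow(2, attempt);
        await sleep(delay);
      }
    }
  }

  async withTimeout(promise, timeoutMs) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
    });
    try {
      return await Promise.race([promise, timeout]);
    } finally {
      clearTimeout(timer); // avoid leaving a pending timer behind
    }
  }

  isRetryable(error) {
    if (error.message === 'Timeout') return true;
    if (error.response?.status >= 500) return true;
    return false;
  }
}

// Usage
const client = new ResilientAPIClient({
  failureThreshold: 5,
  resetTimeout: 60000,  // breaker cool-down
  requestTimeout: 5000, // per-request timeout
  maxRetries: 3
});

try {
  const data = await client.request('https://api.example.com/data');
} catch (error) {
  // Retries exhausted, circuit breaker open, or a non-retryable error
  // Use fallback
}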
Real-World Example
Challenge: A payment service calling an external payment gateway was seeing frequent timeouts and failures.
Solution:
- Circuit breaker - Stop calling when gateway is down
- Retry with backoff - Handle transient failures
- Timeout - Don't wait forever
- Fallback - Queue payment for later processing
Implementation:
const paymentBreaker = new CircuitBreaker(3, 60000);

async function processPayment(order) {
  try {
    return await paymentBreaker.call(() =>
      // Retrying a charge is only safe if paymentGateway.charge is idempotent
      // (for example, keyed on the order ID); otherwise a retry can double-charge.
      retryWithBackoff(
        () => withTimeout(paymentGateway.charge(order), 10000),
        3
      )
    );
  } catch (error) {
    // All strategies failed - queue for later
    await queuePaymentForRetry(order);
    return { queued: true, orderId: order.id };
  }
}
Result:
- Payment success rate: 85% → 98%
- Timeout errors: Reduced by 90%
- System stability: Improved significantly
Best Practices
- Retry transient errors - Network issues, timeouts
- Don't retry client errors - 4xx errors
- Use exponential backoff - Prevent overwhelming services
- Add jitter - Prevent thundering herd
- Set timeouts - Never wait forever
- Use circuit breakers - Prevent cascading failures
- Implement fallbacks - Graceful degradation
- Monitor everything - Know when patterns trigger
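Monitoring can start very small. The sketch below wraps any async operation and counts outcomes, so a dependency that starts failing shows up as a change in the numbers. `instrument` and the shape of the `metrics` object are hypothetical; a real system would more likely forward these counts to whatever metrics library it already uses.

```javascript
// Hypothetical wrapper: records success/failure counts and total latency for `fn`.
// `now` is injectable so the timing can be tested deterministically.
function instrument(name, fn, metrics = {}, now = Date.now) {
  // Reuse the existing entry for `name`, or create one on first use.
  const stats = (metrics[name] ??= { success: 0, failure: 0, totalMs: 0 });
  return async (...args) => {
    const start = now();
    try {
      const result = await fn(...args);
      stats.success++;
      return result;
    } catch (error) {
      stats.failure++;
      throw error; // the wrapper observes failures, it does not swallow them
    } finally {
      stats.totalMs += now() - start;
    }
  };
}
```

Wrapping the function passed to a retry helper counts individual attempts, while wrapping the call to the retry helper counts final outcomes; comparing the two shows how often retries are rescuing requests.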
Conclusion
Building resilient APIs requires multiple strategies working together:
- Retry - Handle transient failures
- Circuit breaker - Prevent cascading failures
- Timeout - Don't wait forever
- Fallback - Graceful degradation
Remember: Failures are inevitable. Resilience is how you handle them. Design for failure, and your systems will be more reliable.
What resilience challenges have you faced? What patterns have worked best for your APIs?