Debugging Complex Production Issues: A Senior Engineer's Toolkit

By Sanjay Goraniya

Production issues are inevitable. When they happen, you need to debug quickly and effectively. After debugging hundreds of production incidents, I've learned that having a systematic approach and the right tools makes all the difference.

The Debugging Mindset

Stay Calm

Panic leads to poor decisions. Take a deep breath and approach the problem systematically.

Gather Facts First

Don't jump to conclusions. Gather data, then form hypotheses.

Work Backwards

Start from the symptom and work backwards to the cause.

The Debugging Process

1. Reproduce the Issue

If you can't reproduce it, you can't confirm that you've fixed it.

Code
// Try to reproduce locally
// Use production data (sanitized)
// Match production environment
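
One way to get sanitized production data is to mask sensitive fields before a record leaves production. The sketch below is a minimal illustration, assuming records are plain objects; the field list and the `sanitizeRecord` helper are hypothetical names, not part of any library.

```javascript
const crypto = require('node:crypto');

// Hypothetical list of fields to mask; adjust it to your own schema.
const SENSITIVE_FIELDS = ['email', 'name', 'phone'];

// Replace each sensitive value with a short, stable hash so that
// relationships between records survive but the raw value does not.
function sanitizeRecord(record) {
  const clean = { ...record };
  for (const field of SENSITIVE_FIELDS) {
    if (clean[field] != null) {
      const digest = crypto
        .createHash('sha256')
        .update(String(clean[field]))
        .digest('hex');
      clean[field] = `${field}_${digest.slice(0, 8)}`;
    }
  }
  return clean;
}

console.log(sanitizeRecord({ id: 1, email: 'jane@example.com', plan: 'pro' }));
```

Because the same input always maps to the same masked value, joins across sanitized tables still line up. A plain hash of a guessable value can be brute-forced, so real data may call for a keyed hash (HMAC) instead.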

2. Gather Information

Collect as much information as possible:

  • Error messages - What exactly failed?
  • Logs - What was happening?
  • Metrics - CPU, memory, request rates
  • Timeline - When did it start?
  • Scope - Who/what is affected?

3. Form Hypotheses

Based on the information, form hypotheses:

  • "The database connection pool is exhausted"
  • "Memory leak in the image processing service"
  • "Race condition in the payment flow"

4. Test Hypotheses

Test each hypothesis:

Code
-- Check open connections (SQL, PostgreSQL)
SELECT count(*) FROM pg_stat_activity;

// Check memory usage (Node.js)
console.log(process.memoryUsage());

// Add logging to test race condition
console.log('Payment started', { userId, orderId, timestamp: Date.now() });

5. Fix and Verify

Fix the issue and verify it's resolved:

  • Fix - Implement the solution
  • Test - Verify the fix works
  • Monitor - Watch for recurrence
  • Document - Record what happened

Essential Tools

1. Logging

Structured logging:

Code
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.json(),
  defaultMeta: { service: 'user-service' },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' })
  ]
});

// Use context
logger.info('User created', {
  userId: user.id,
  email: user.email,
  requestId: req.id,
  timestamp: new Date().toISOString()
});

What to log:

  • Request/response IDs
  • User IDs (when relevant)
  • Timestamps
  • Error details
  • Performance metrics
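
Request IDs are only useful if every log line carries them. One possible approach in Node.js is the built-in `AsyncLocalStorage`, which lets code deep in the call stack read the ID without it being passed as an argument. This is a sketch under that assumption; `runWithRequestId` and `logEntry` are hypothetical helpers.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

const requestContext = new AsyncLocalStorage();

// Run `handler` with a request ID that any code it calls can read back.
function runWithRequestId(handler, requestId = randomUUID()) {
  return requestContext.run({ requestId }, handler);
}

// Build a structured log entry that carries the current request ID, if any.
function logEntry(message, fields = {}) {
  const store = requestContext.getStore();
  return {
    message,
    requestId: store ? store.requestId : undefined,
    timestamp: new Date().toISOString(),
    ...fields
  };
}

runWithRequestId(() => {
  console.log(logEntry('User created', { userId: 42 }));
});
```

The returned object can be passed to whatever logger is in use; the sketch only builds the entry so that it stays independent of a specific logging library.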

What NOT to log:

  • Passwords
  • Credit card numbers
  • Personal information (unless necessary)
  • Secrets/API keys
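
A deny-list filter applied before logging is one way to enforce this. The sketch below assumes log payloads are plain JSON-like objects; the key names in `REDACTED_KEYS` and the `redact` helper are illustrative, not a complete list.

```javascript
// Hypothetical deny-list of key names (lowercase); extend it for your payloads.
const REDACTED_KEYS = new Set(['password', 'cardnumber', 'apikey', 'secret', 'token']);

// Return a deep copy of `value` with sensitive keys masked.
// Assumes plain objects, arrays, and primitives only.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== 'object') return value;

  const out = {};
  for (const [key, inner] of Object.entries(value)) {
    out[key] = REDACTED_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
  }
  return out;
}

console.log(redact({ userId: 7, password: 'hunter2', card: { cardNumber: '4111', last4: '1111' } }));
```

Calling it at the logging boundary, for example `logger.info('User created', redact(payload))`, keeps the rule in one place instead of relying on every call site to remember it.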

2. Monitoring and Alerting

Key metrics to monitor:

  • Error rates - 4xx, 5xx responses
  • Response times - P50, P95, P99
  • Throughput - Requests per second
  • Resource usage - CPU, memory, disk
  • Business metrics - Orders, payments, etc.

Set up alerts:

Code
// Alert on error rate spike
if (errorRate > 0.05) {
  alert('Error rate above 5%');
}

// Alert on slow responses
if (p95ResponseTime > 1000) {
  alert('P95 response time above 1s');
}
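
The values compared in those checks have to be computed from raw samples first. A minimal sketch, assuming a window of recent status codes and latencies is already available in memory; it counts only 5xx responses as errors and uses the nearest-rank percentile definition, both of which are choices made here for illustration.

```javascript
// Fraction of responses in the window with a 5xx status code.
function errorRate(statusCodes) {
  if (statusCodes.length === 0) return 0;
  const errors = statusCodes.filter((code) => code >= 500).length;
  return errors / statusCodes.length;
}

// Nearest-rank percentile: the smallest sample with at least p% of
// all samples at or below it.
function percentile(samples, p) {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const latenciesMs = [120, 80, 950, 200, 1500, 110, 90, 100, 130, 140];
console.log(errorRate([200, 200, 500, 404]));
console.log(percentile(latenciesMs, 95));
```

In practice a monitoring system usually computes these for you; the point of the sketch is to make clear what a "P95 above 1s" alert is actually measuring.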

3. Distributed Tracing

For microservices, distributed tracing is essential:

Code
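const { trace, SpanStatusCode } = require('@opentelemetry/api');

function processOrder(order) {
  // getActiveSpan() returns undefined when no span is active,
  // so guard each call with optional chaining
  const span = trace.getActiveSpan();
  span?.setAttribute('order.id', order.id);
  span?.setAttribute('order.total', order.total);

  try {
    // Process order
    span?.addEvent('Order processed');
  } catch (error) {
    // Record the failure on the span before rethrowing
    span?.recordException(error);
    span?.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  }
}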

4. Debugging Tools

Chrome DevTools (for Node.js):

Code
node --inspect server.js
# Then open chrome://inspect in Chrome and attach to the process

VS Code Debugger:

Code
{
  "type": "node",
  "request": "attach",
  "name": "Attach to Process",
  "port": 9229
}

Postman/Insomnia - Test APIs

Database tools - Query analyzers, explain plans

Common Production Issues

1. Memory Leaks

Symptoms:

  • Memory usage grows over time
  • Application slows down
  • Crashes after running for a while

How to debug:

Code
// Monitor memory
setInterval(() => {
  const usage = process.memoryUsage();
  console.log({
    rss: `${Math.round(usage.rss / 1024 / 1024)}MB`,
    heapUsed: `${Math.round(usage.heapUsed / 1024 / 1024)}MB`,
    heapTotal: `${Math.round(usage.heapTotal / 1024 / 1024)}MB`
  });
}, 5000);

// Use heap snapshots
// Chrome DevTools > Memory > Take heap snapshot

Common causes:

  • Event listeners not removed
  • Closures holding references
  • Caches without limits
  • Global variables
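
Of these causes, an unbounded cache is often the simplest to fix: give it a size limit and an eviction rule. The sketch below shows one way to do that with a least-recently-used policy built on `Map`; the `BoundedCache` class and its default limit are illustrative.

```javascript
// A small least-recently-used cache with a hard limit on entries.
class BoundedCache {
  constructor(maxEntries = 1000) {
    this.maxEntries = maxEntries;
    this.entries = new Map(); // Map iterates in insertion order
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }

  get size() {
    return this.entries.size;
  }
}

const cache = new BoundedCache(2);
cache.set('a', 1);
cache.set('b', 2);
cache.set('c', 3); // evicts 'a'
console.log(cache.size);
```

Memory use is now capped by `maxEntries` rather than by how long the process has been running.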

2. Database Connection Issues

Symptoms:

  • "Too many connections" errors
  • Slow queries
  • Timeouts

How to debug:

Code
-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Check the server's connection limit
SHOW max_connections;

-- Find long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';

Solutions:

  • Increase connection pool size
  • Use connection pooling
  • Close connections properly
  • Optimize slow queries
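
"Close connections properly" usually means returning a client to the pool on every code path, including failures. A minimal sketch, assuming a pool whose `connect()` resolves to a client with `query()` and `release()` (the shape node-postgres pools use); `withClient` is a hypothetical helper.

```javascript
// Borrow a client, run `work` with it, and always give it back,
// even when `work` throws.
async function withClient(pool, work) {
  const client = await pool.connect();
  try {
    return await work(client);
  } finally {
    client.release();
  }
}

// Stand-in pool for demonstration; a real one would come from a driver.
const demoPool = {
  connect: async () => ({
    query: async () => ({ rows: [{ id: 1 }] }),
    release: () => console.log('client released')
  })
};

withClient(demoPool, (client) => client.query('SELECT 1')).then((result) => {
  console.log(result.rows.length);
});
```

Routing all database access through one helper like this makes a leaked connection a bug in one place rather than a risk at every call site.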

3. Race Conditions

Symptoms:

  • Intermittent bugs
  • Data inconsistencies
  • Works sometimes, fails other times

How to debug:

Code
// Add detailed logging
async function processPayment(orderId, amount) {
  const logId = `${orderId}-${Date.now()}-${Math.random()}`;
  logger.info('Payment started', { logId, orderId, amount });
  
  const existing = await db.query('SELECT * FROM payments WHERE order_id = $1', [orderId]);
  logger.info('Existing payment check', { logId, existing: existing.rows.length });
  
  if (existing.rows.length > 0) {
    logger.warn('Duplicate payment attempt', { logId, orderId });
    return;
  }
  
  // Process payment
  logger.info('Payment processed', { logId, orderId });
}

Solutions:

  • Use database transactions
  • Implement locking
  • Use idempotency keys
  • Add proper error handling
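
To illustrate the idempotency-key idea, the sketch below makes concurrent calls with the same key share a single execution. It keeps state in process memory purely for illustration; a real service would need a store shared across instances, such as a database table with a unique constraint on the key. `runOnce` is a hypothetical helper.

```javascript
// Maps an idempotency key to the promise of its (single) execution.
const executions = new Map();

function runOnce(idempotencyKey, operation) {
  if (executions.has(idempotencyKey)) {
    return executions.get(idempotencyKey);
  }
  // Wrapping in a Promise turns synchronous throws into rejections.
  const promise = new Promise((resolve) => resolve(operation()));
  executions.set(idempotencyKey, promise);
  // Forget failed attempts so the caller can retry with the same key.
  promise.catch(() => executions.delete(idempotencyKey));
  return promise;
}

let charges = 0;
const chargeCard = async () => {
  charges += 1;
  return 'charged';
};

// Two racing requests for the same order result in one charge.
Promise.all([runOnce('order-42', chargeCard), runOnce('order-42', chargeCard)]).then(() => {
  console.log(charges);
});
```

The key point is that the check and the reservation happen in the same synchronous step, so there is no gap between "no payment exists" and "payment started" for a second request to slip through.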

4. Performance Degradation

Symptoms:

  • Slow response times
  • High CPU usage
  • Timeouts

How to debug:

Code
// Profile code
const { performance } = require('perf_hooks');

async function slowFunction() {
  const start = performance.now();
  
  // Your code here
  
  const duration = performance.now() - start;
  if (duration > 1000) {
    logger.warn('Slow function', { duration, function: 'slowFunction' });
  }
}

// Use APM tools
// New Relic, Datadog, etc.

Common causes:

  • N+1 queries
  • Missing indexes
  • Inefficient algorithms
  • Resource contention
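
The N+1 pattern is easiest to recognize side by side with its batched equivalent. The sketch below assumes a `db.query(sql, params)` interface that resolves to `{ rows }`, as in the earlier snippets, and a hypothetical `order_items` table; the fake `db` at the end exists only to count queries.

```javascript
// N+1: one query per order ID.
async function loadItemsNPlusOne(db, orderIds) {
  const itemsByOrder = new Map();
  for (const id of orderIds) {
    const { rows } = await db.query(
      'SELECT * FROM order_items WHERE order_id = $1', [id]);
    itemsByOrder.set(id, rows);
  }
  return itemsByOrder;
}

// Batched: a single query for all IDs, grouped in memory afterwards.
async function loadItemsBatched(db, orderIds) {
  const { rows } = await db.query(
    'SELECT * FROM order_items WHERE order_id = ANY($1)', [orderIds]);
  const itemsByOrder = new Map(orderIds.map((id) => [id, []]));
  for (const row of rows) {
    itemsByOrder.get(row.order_id).push(row);
  }
  return itemsByOrder;
}

// Fake database that counts how many queries it receives.
let queryCount = 0;
const countingDb = {
  query: async (sql, params) => {
    queryCount += 1;
    const ids = Array.isArray(params[0]) ? params[0] : [params[0]];
    return { rows: ids.map((id) => ({ order_id: id, sku: 'demo' })) };
  }
};

loadItemsBatched(countingDb, [1, 2, 3]).then(() => console.log(queryCount));
```

With 3 orders the difference is 3 queries versus 1; with 3,000 orders it is the difference between a fast page and a timeout.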

Debugging Strategies

1. Narrow Down the Problem

Rule out one layer at a time until the fault is isolated:

Code
Is it the frontend? → No
Is it the API? → Yes
Is it authentication? → No
Is it the database? → Yes
Is it a specific query? → Yes
Found it!
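
When the suspects form an ordered list, such as the deploys since the last known-good release, the narrowing can be done by halving the range instead of asking about each item in turn. This is a related technique rather than the exact flow above. The sketch assumes every item before the culprit is good and every item from it onward is bad; `isBad` is a hypothetical check you would supply.

```javascript
// Returns the index of the first bad item, or -1 if none is bad.
// Assumes that once an item is bad, all later items are bad too.
function findFirstBad(items, isBad) {
  let low = 0;
  let high = items.length; // the answer, if any, lies in [low, high)
  while (low < high) {
    const mid = Math.floor((low + high) / 2);
    if (isBad(items[mid])) {
      high = mid; // culprit is at mid or earlier
    } else {
      low = mid + 1; // culprit is after mid
    }
  }
  return low < items.length ? low : -1;
}

const deploys = ['v1', 'v2', 'v3', 'v4', 'v5'];
const culprit = findFirstBad(deploys, (deploy) => deploy >= 'v4');
console.log(deploys[culprit]);
```

Each check halves the remaining range, so even a long list of suspects needs only a handful of checks.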

2. Add Logging

When in doubt, add more logging:

Code
// Add strategic logging
logger.debug('Entering function', { params });
logger.debug('After step 1', { result });
logger.debug('After step 2', { result });
logger.debug('Exiting function', { result });

3. Isolate the Problem

Break down complex systems:

Code
// Test components independently
// Is it the service?
const serviceResult = await userService.getUser(id);

// Is it the database?
const dbResult = await db.query('SELECT * FROM users WHERE id = $1', [id]);

// Is it the network?
const response = await fetch('http://api.example.com/users/123');

4. Compare Working vs. Broken

Code
// What's different?
// Working request
GET /api/users/123
Headers: { Authorization: 'Bearer token1' }

// Broken request
GET /api/users/456
Headers: { Authorization: 'Bearer token2' }

// What's different? User? Token? Data?
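
The comparison itself can be automated once both requests are captured as objects. A minimal sketch, assuming each request is a plain JSON-like object; `diffFields` is a hypothetical helper that compares top-level fields by their JSON form, so it is sensitive to key order inside nested objects.

```javascript
// List the top-level fields whose values differ between two objects.
function diffFields(working, broken) {
  const keys = new Set([...Object.keys(working), ...Object.keys(broken)]);
  const differences = [];
  for (const key of keys) {
    if (JSON.stringify(working[key]) !== JSON.stringify(broken[key])) {
      differences.push({ field: key, working: working[key], broken: broken[key] });
    }
  }
  return differences;
}

const workingRequest = {
  method: 'GET',
  path: '/api/users/123',
  headers: { Authorization: 'Bearer token1' }
};
const brokenRequest = {
  method: 'GET',
  path: '/api/users/456',
  headers: { Authorization: 'Bearer token2' }
};

console.log(diffFields(workingRequest, brokenRequest).map((d) => d.field));
```

Fields that match drop out of the picture, which leaves a short list of candidates to test one at a time.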

Real-World Example

Issue: Intermittent 500 errors, 2% of requests failing

Debugging Process:

  1. Tried to reproduce: Happened randomly, couldn't reproduce locally
  2. Gathered info:
    • Error logs showed "Database connection timeout"
    • Happened during peak hours
    • Affected random users
  3. Hypothesis: Connection pool exhausted
  4. Tested:
    • Checked connection pool usage: 95% utilized
    • Found long-running queries holding connections
  5. Fixed:
    • Added query timeouts
    • Increased pool size
    • Optimized slow queries
  6. Verified: Error rate dropped to 0.01%

Best Practices

  1. Log generously - You'll need it later, but never log secrets or sensitive data
  2. Monitor proactively - Catch issues before users do
  3. Use structured logging - Easier to search and analyze
  4. Add context - Request IDs, user IDs, timestamps
  5. Test in production-like environments - Staging should mirror production
  6. Document incidents - Learn from each one
  7. Build runbooks - Document common issues and fixes

Conclusion

Debugging production issues is a skill that improves with experience. The key is to:

  • Stay systematic - Don't panic, follow a process
  • Use the right tools - Logging, monitoring, tracing
  • Gather data - Facts before hypotheses
  • Learn from each incident - Build knowledge over time

Remember: Every production issue is a learning opportunity. Document what happened, why it happened, and how you fixed it. This knowledge is invaluable.

What production debugging challenges have you faced? What tools and techniques have been most helpful?
