Observability in Modern Applications: Logging, Metrics, and Tracing
You can't fix what you can't see. In modern distributed systems, observability isn't optional—it's essential. After building observable systems that handle millions of requests, I've learned that good observability is the difference between debugging for hours and debugging for minutes.
The Three Pillars of Observability
1. Logging
What happened? - Events and errors
2. Metrics
How much? - Quantitative measurements
3. Tracing
Where? - Request flow through systems
Logging
Structured Logging
// Bad: Unstructured
console.log('User created', userId, email);
// Good: Structured
logger.info('User created', {
  userId: user.id,
  email: user.email,
  requestId: req.id,
  service: 'user-service'
});
// Most loggers add the level and timestamp for you; don't duplicate them in metadata
Log Levels
logger.error('Critical error', { error, context });
logger.warn('Potential issue', { warning, context });
logger.info('Important event', { event, context });
logger.debug('Debug information', { debug, context });
What to Log
Do log:
- Request/response IDs
- User actions (with privacy in mind)
- Errors with context
- Performance metrics
- Business events
Don't log:
- Passwords or secrets
- Credit card numbers
- Personal information (unless necessary)
- Excessive debug info in production
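One way to enforce that "don't log" list is to scrub sensitive fields before they reach the logger. A minimal sketch, assuming a flat set of forbidden key names (the `SENSITIVE_KEYS` set and `redact` helper are illustrative, not part of any logging library):

```javascript
// Keys that must never appear in logs (extend for your domain)
const SENSITIVE_KEYS = new Set(['password', 'secret', 'token', 'creditCard', 'ssn']);

// Recursively replace sensitive values with a placeholder before logging
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== 'object') return value;
  const clean = {};
  for (const [key, val] of Object.entries(value)) {
    clean[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redact(val);
  }
  return clean;
}

// Usage: logger.info('User created', redact({ userId: 1, password: 'hunter2' }));
```

Wiring this in as a logger format (for example, a Winston custom format) keeps the redaction in one place instead of at every call site.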
Log Aggregation
// Using Winston with transports
const winston = require('winston');
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.json(),
  defaultMeta: { service: 'api' },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
    new winston.transports.Console({
      format: winston.format.simple()
    })
  ]
});
// Send to external service (Datadog, ELK, etc.)
if (process.env.NODE_ENV === 'production') {
  logger.add(new winston.transports.Http({
    host: 'logs.example.com',
    port: 443,
    path: '/logs',
    ssl: true
  }));
}
Metrics
Types of Metrics
1. Counters
Things that only go up:
const prometheus = require('prom-client');
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status', 'endpoint']
});
// Increment
httpRequestsTotal.inc({ method: 'GET', status: '200', endpoint: '/users' });
2. Gauges
Values that go up and down:
const activeConnections = new prometheus.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});
// Set value
activeConnections.set(42);
// Increment/decrement
activeConnections.inc();
activeConnections.dec();
3. Histograms
Distribution of values:
const requestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  buckets: [0.1, 0.5, 1, 2, 5]
});
// Record duration
const end = requestDuration.startTimer();
// ... do work ...
end();
Key Metrics to Track
// Application metrics
- Request rate (requests/second)
- Error rate (errors/second)
- Response time (P50, P95, P99)
- Throughput (items processed/second)
// System metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
// Business metrics
- Orders per minute
- Revenue per hour
- Active users
- Conversion rate
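The percentile latencies above (P50, P95, P99) are just order statistics over observed durations. A quick sketch using the nearest-rank method (in practice, monitoring systems like Prometheus estimate percentiles from histogram buckets rather than raw samples):

```javascript
// Nearest-rank percentile over raw samples
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Response times in milliseconds
const durationsMs = [120, 80, 400, 95, 110, 900, 150, 130, 100, 200];
percentile(durationsMs, 50); // P50 (median): 120
percentile(durationsMs, 95); // P95: 900 (one slow outlier dominates the tail)
```

This is why P95/P99 matter: the average of the samples above hides the 900 ms outlier that one in twenty users actually experiences.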
Metrics Collection
// Express middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestsTotal.inc({
      method: req.method,
      status: res.statusCode,
      // Prefer the matched route pattern (req.route?.path) over req.path
      // so IDs in URLs don't explode label cardinality
      endpoint: req.path
    });
    requestDuration.observe(duration / 1000);
  });
  next();
});
Distributed Tracing
The Problem
In microservices, a request flows through multiple services:
Client → API Gateway → Auth Service → User Service → Database
→ Order Service → Payment Service
How do you trace a request through all these services?
OpenTelemetry
// Package names for the current OpenTelemetry JS SDK
// (older guides use @opentelemetry/node and @opentelemetry/tracing)
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new JaegerExporter({ endpoint: 'http://localhost:14268/api/traces' })
  )
);
provider.register();
const tracer = provider.getTracer('my-service');
Creating Spans
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');
async function processOrder(orderId) {
  const tracer = trace.getTracer('order-service');
  const span = tracer.startSpan('processOrder');
  span.setAttribute('order.id', orderId);
  // Make this span the active one so child spans nest under it
  // (the OTel JS API takes a context argument, not a `parent` option)
  const ctx = trace.setSpan(context.active(), span);
  try {
    // Step 1: Validate order
    const validateSpan = tracer.startSpan('validateOrder', {}, ctx);
    const order = await validateOrder(orderId);
    validateSpan.end();
    // Step 2: Process payment
    const paymentSpan = tracer.startSpan('processPayment', {}, ctx);
    const payment = await processPayment(order);
    paymentSpan.setAttribute('payment.amount', payment.amount);
    paymentSpan.end();
    span.setStatus({ code: SpanStatusCode.OK });
    return order;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}
Propagating Trace Context
// Client sends request
const { context, propagation } = require('@opentelemetry/api');
const headers = {};
propagation.inject(context.active(), headers);
fetch('http://api.example.com/orders', {
  headers
});
// Server receives request
const extractedContext = propagation.extract(context.active(), req.headers);
const span = tracer.startSpan('handleRequest', {}, extractedContext);
Putting It All Together
Example: Observable API Endpoint
const express = require('express');
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const prometheus = require('prom-client');
const app = express();
const tracer = trace.getTracer('api-service');
// Metrics
const requestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests',
  labelNames: ['method', 'route', 'status']
});
// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  const span = tracer.startSpan('http_request');
  span.setAttribute('http.method', req.method);
  span.setAttribute('http.route', req.path);
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Metrics
    requestDuration.observe(
      { method: req.method, route: req.path, status: res.statusCode },
      duration
    );
    // Logging
    logger.info('Request completed', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration,
      requestId: req.id
    });
    // Tracing
    span.setAttribute('http.status_code', res.statusCode);
    span.setAttribute('http.duration', duration);
    span.end();
  });
  next();
});
// Endpoint
app.get('/api/users/:id', async (req, res) => {
  const span = tracer.startSpan('getUser');
  span.setAttribute('user.id', req.params.id);
  try {
    const user = await getUser(req.params.id);
    logger.info('User retrieved', {
      userId: user.id,
      requestId: req.id
    });
    span.setStatus({ code: SpanStatusCode.OK });
    res.json(user);
  } catch (error) {
    logger.error('Failed to get user', {
      userId: req.params.id,
      error: error.message,
      requestId: req.id
    });
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    res.status(500).json({ error: 'Internal server error' });
  } finally {
    span.end();
  }
});
Alerting
Setting Up Alerts
// Pseudocode: in practice these rules usually live in your alerting
// system (Prometheus Alertmanager, Datadog monitors, etc.), not app code
// Alert on error rate
if (errorRate > 0.05) {
  alert({
    severity: 'high',
    message: 'Error rate above 5%',
    metric: 'error_rate',
    value: errorRate
  });
}
// Alert on slow responses
if (p95ResponseTime > 1000) {
  alert({
    severity: 'medium',
    message: 'P95 response time above 1s',
    metric: 'p95_response_time',
    value: p95ResponseTime
  });
}
// Alert on resource usage
if (memoryUsage > 0.9) {
  alert({
    severity: 'high',
    message: 'Memory usage above 90%',
    metric: 'memory_usage',
    value: memoryUsage
  });
}
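To make the error-rate check concrete: since counters only go up, the rate is the delta in errors divided by the delta in requests between two snapshots one window apart. A minimal sketch with hypothetical snapshots (`errorRate` and `checkErrorRate` are illustrative names, not a real alerting API):

```javascript
// Derive an error rate from two counter snapshots taken one window apart
function errorRate(prev, curr) {
  const requests = curr.totalRequests - prev.totalRequests;
  const errors = curr.totalErrors - prev.totalErrors;
  return requests === 0 ? 0 : errors / requests;
}

// Map a rate to an alert, mirroring the 5% threshold above
function checkErrorRate(rate) {
  if (rate > 0.05) {
    return { severity: 'high', message: 'Error rate above 5%', value: rate };
  }
  return null; // below threshold, no alert
}

const prev = { totalRequests: 1000, totalErrors: 10 };
const curr = { totalRequests: 1200, totalErrors: 30 };
const rate = errorRate(prev, curr); // 20 errors / 200 requests = 0.1
```

Working on deltas rather than absolute counts is what lets counters survive process restarts without skewing the alert.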
Tools and Platforms
Logging Tools
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Loki (Grafana)
- Datadog Logs
- CloudWatch Logs
Metrics Tools
- Prometheus + Grafana
- Datadog
- New Relic
- CloudWatch Metrics
Tracing Tools
- Jaeger
- Zipkin
- Datadog APM
- New Relic
Best Practices
- Log in JSON - Easier to parse and search
- Include context - Request IDs, user IDs, timestamps
- Set log levels appropriately - Don't log debug in production
- Track key metrics - Error rate, latency, throughput
- Use distributed tracing - Essential for microservices
- Set up alerts - Know when things go wrong
- Retain appropriately - Balance cost and usefulness
- Protect sensitive data - Don't log secrets
Real-World Example
System: E-commerce platform with 10 microservices
Observability Setup:
- Logging: All services log to centralized ELK stack
- Metrics: Prometheus collects metrics, Grafana for visualization
- Tracing: Jaeger for distributed tracing
- Alerting: PagerDuty for critical alerts
Benefits:
- Debug time: 2 hours → 15 minutes
- Mean time to detect: 5 minutes
- Mean time to resolve: 30 minutes
Conclusion
Observability is an investment that pays dividends. The three pillars work together:
- Logs tell you what happened
- Metrics tell you how much
- Traces tell you where
Start with logging, add metrics, then tracing. You don't need everything at once, but you do need visibility into your systems.
Remember: You can't improve what you can't measure. Good observability is the foundation of reliable systems.
What observability challenges have you faced? What tools have worked best for your systems?