Observability in Modern Applications: Logging, Metrics, and Tracing

By Sanjay Goraniya

You can't fix what you can't see. In modern distributed systems, observability isn't optional; it's essential. After building observable systems that handle millions of requests, I've learned that good observability is the difference between debugging for hours and debugging for minutes.

The Three Pillars of Observability

1. Logging

What happened? - Events and errors

2. Metrics

How much? - Quantitative measurements

3. Tracing

Where? - Request flow through systems

Logging

Structured Logging

Code
// Bad: Unstructured
console.log('User created', userId, email);

// Good: Structured
// The email is left out on purpose: personal data should stay out of logs
// unless it is really needed. The level is already set by calling logger.info.
logger.info('User created', {
  userId: user.id,
  requestId: req.id,
  timestamp: new Date().toISOString(),
  service: 'user-service'
});
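To make the difference concrete, here is a minimal sketch of what a structured logger does under the hood, using only Node's standard library. The `formatLogEntry` helper and the sample field values are assumptions made for illustration; they are not part of Winston or any other library.

```javascript
// Hypothetical helper: builds one JSON log line from a level, a message, and extra fields.
function formatLogEntry(level, message, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields
  });
}

const line = formatLogEntry('info', 'User created', {
  userId: 'u_123',
  requestId: 'req_456',
  service: 'user-service'
});

console.log(line);

// Because the line is JSON, a log pipeline can filter on individual fields
// instead of running regular expressions over free text.
const parsed = JSON.parse(line);
console.log(parsed.userId, parsed.requestId);
```

Every entry carries the same keys, so a query such as "all entries with this requestId" becomes a field lookup rather than a text search.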

Log Levels

Code
logger.error('Critical error', { error, context });
logger.warn('Potential issue', { warning, context });
logger.info('Important event', { event, context });
logger.debug('Debug information', { debug, context });
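Levels matter because the logger compares each message against a configured threshold and drops anything less severe. The sketch below shows that comparison in isolation. The numeric ordering and the `shouldLog` helper are simplified assumptions; real libraries define more levels.

```javascript
// Assumed severity ordering: a lower number means a more severe message.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

// Returns true when a message at `level` should be emitted
// by a logger whose threshold is `threshold`.
function shouldLog(level, threshold) {
  return LEVELS[level] <= LEVELS[threshold];
}

// With a production threshold of 'info', debug messages are dropped.
for (const level of Object.keys(LEVELS)) {
  console.log(`${level}: ${shouldLog(level, 'info') ? 'emitted' : 'dropped'}`);
}
```

Raising the threshold to 'debug' in a development environment emits everything, which is why the same code can be verbose locally and quiet in production.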

What to Log

Do log:

  • Request/response IDs
  • User actions (with privacy in mind)
  • Errors with context
  • Performance metrics
  • Business events

Don't log:

  • Passwords or secrets
  • Credit card numbers
  • Personal information (unless necessary)
  • Excessive debug info in production
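One way to enforce the "don't log" list is to mask sensitive fields before they reach the logger. This is a minimal sketch: the key list and the `redact` helper are hypothetical, and it only handles plain objects and arrays.

```javascript
// Hypothetical list of field names (lowercase) that should never reach the logs.
const SENSITIVE_KEYS = new Set(['password', 'token', 'secret', 'cardnumber', 'authorization']);

// Returns a copy of `fields` with sensitive values masked, recursing into
// nested objects and arrays. The input object is not modified.
function redact(fields) {
  if (Array.isArray(fields)) return fields.map(redact);
  if (fields === null || typeof fields !== 'object') return fields;

  const clean = {};
  for (const [key, value] of Object.entries(fields)) {
    clean[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(value);
  }
  return clean;
}

const safe = redact({
  userId: 'u_123',
  password: 'hunter2',
  payment: { cardNumber: '4111111111111111', amount: 42 }
});

console.log(JSON.stringify(safe));
```

A deny-list like this is only a safety net. It catches known field names, so unknown ones still need review.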

Log Aggregation

Code
// Using Winston with transports
const winston = require('winston');

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.json(),
  defaultMeta: { service: 'api' },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
    new winston.transports.Console({
      format: winston.format.simple()
    })
  ]
});

// Send to external service (Datadog, ELK, etc.)
if (process.env.NODE_ENV === 'production') {
  logger.add(new winston.transports.Http({
    host: 'logs.example.com',
    port: 443,
    path: '/logs',
    ssl: true // port 443 expects HTTPS; the Http transport uses plain HTTP unless ssl is set
  }));
}

Metrics

Types of Metrics

1. Counters

Things that only go up:

Code
const prometheus = require('prom-client');

const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status', 'endpoint']
});

// Increment
httpRequestsTotal.inc({ method: 'GET', status: '200', endpoint: '/users' });

2. Gauges

Values that go up and down:

Code
const activeConnections = new prometheus.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

// Set value
activeConnections.set(42);

// Increment/decrement
activeConnections.inc();
activeConnections.dec();

3. Histograms

Distribution of values:

Code
const requestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Record duration
const end = requestDuration.startTimer();
// ... do work ...
end();
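Prometheus-style histograms are cumulative: each bucket counts every observation less than or equal to its upper bound. The class below is a simplified model of that behavior, not the prom-client implementation, and the sample durations are made up for illustration.

```javascript
// Simplified model of a cumulative histogram.
class SimpleHistogram {
  constructor(buckets) {
    this.bounds = [...buckets].sort((a, b) => a - b);
    this.counts = new Array(this.bounds.length).fill(0);
    this.sum = 0;
    this.count = 0; // total observations, equivalent to the +Inf bucket
  }

  observe(value) {
    this.sum += value;
    this.count += 1;
    // Increment every bucket whose upper bound is >= the observed value.
    this.bounds.forEach((bound, i) => {
      if (value <= bound) this.counts[i] += 1;
    });
  }
}

const h = new SimpleHistogram([0.1, 0.5, 1, 2, 5]);
[0.05, 0.3, 0.3, 1.7, 8].forEach((v) => h.observe(v));

// Bucket counts are [1, 3, 3, 4, 4] with 5 observations in total.
console.log(h.bounds.map((b, i) => `le=${b}: ${h.counts[i]}`).join(', '), `le=+Inf: ${h.count}`);
```

The 8-second request falls outside every finite bucket, so it only shows up in the total. That is a hint to pick bucket bounds that cover the latencies you actually expect.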

Key Metrics to Track

Code
// Application metrics
- Request rate (requests/second)
- Error rate (errors/second)
- Response time (P50, P95, P99)
- Throughput (items processed/second)

// System metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O

// Business metrics
- Orders per minute
- Revenue per hour
- Active users
- Conversion rate
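The P50, P95, and P99 response times listed above are percentiles: the value below which that share of requests falls. Here is a minimal sketch using the nearest-rank method on raw samples. The `percentile` helper and the latency values are assumptions for illustration; monitoring systems usually estimate percentiles from histogram buckets instead.

```javascript
// Nearest-rank percentile: sort the samples and pick the value at rank ceil(p% * n).
function percentile(values, p) {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical response times in milliseconds: mostly fast, with two slow outliers.
const latenciesMs = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900];

const mean = latenciesMs.reduce((sum, v) => sum + v, 0) / latenciesMs.length;

console.log('mean:', mean);                        // 126.1
console.log('P50:', percentile(latenciesMs, 50));  // 14
console.log('P95:', percentile(latenciesMs, 95));  // 900
```

The mean of 126.1 ms describes no real request here. P50 shows the typical experience and P95 exposes the slow tail, which is why percentiles are tracked rather than averages.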

Metrics Collection

Code
// Express middleware
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = Date.now() - start;
    
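    httpRequestsTotal.inc({
      method: req.method,
      status: String(res.statusCode),
      // Prefer the matched route pattern (e.g. '/users/:id') over the raw path
      // so that each distinct URL does not create its own time series.
      endpoint: req.route ? req.baseUrl + req.route.path : req.path
    });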
    
    requestDuration.observe(duration / 1000);
  });
  
  next();
});
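Label values deserve some care. Every distinct value creates a separate time series, so raw URLs containing IDs can produce an unbounded number of them. When no route pattern is available, one option is to normalize the path first. The `normalizePath` helper below is a hypothetical sketch that only handles numeric and UUID segments.

```javascript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical helper: collapses variable path segments into placeholders
// so that '/users/1' and '/users/2' share one label value.
function normalizePath(path) {
  return path
    .split('/')
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ':id';
      if (UUID_PATTERN.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

console.log(normalizePath('/users/42/orders/7'));
console.log(normalizePath('/files/123e4567-e89b-12d3-a456-426614174000'));
console.log(normalizePath('/users'));
```

The normalized value can then be used for the `endpoint` label in place of the raw path.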

Distributed Tracing

The Problem

In microservices, a request flows through multiple services:

Code
Client → API Gateway → Auth Service → User Service → Database
                      → Order Service → Payment Service

How do you trace a request through all these services?

OpenTelemetry

Code
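// Current package names; older releases published these classes as
// '@opentelemetry/node' and '@opentelemetry/tracing'.
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

// SDK 1.x style. SDK 2.x takes processors through the
// `spanProcessors` constructor option instead of addSpanProcessor().
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new SimpleSpanProcessor(new JaegerExporter())
);
provider.register();

const tracer = provider.getTracer('my-service');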

Creating Spans

Code
const { context, trace, SpanStatusCode } = require('@opentelemetry/api');

async function processOrder(orderId) {
  const tracer = trace.getTracer('order-service');
  const span = tracer.startSpan('processOrder');

  span.setAttribute('order.id', orderId);

  // Child spans are linked to their parent through a context object,
  // which startSpan accepts as its third argument.
  const parentContext = trace.setSpan(context.active(), span);

  try {
    // Step 1: Validate order
    const validateSpan = tracer.startSpan('validateOrder', undefined, parentContext);
    const order = await validateOrder(orderId);
    validateSpan.end();

    // Step 2: Process payment
    const paymentSpan = tracer.startSpan('processPayment', undefined, parentContext);
    const payment = await processPayment(order);
    paymentSpan.setAttribute('payment.amount', payment.amount);
    paymentSpan.end();

    span.setStatus({ code: SpanStatusCode.OK });
    return order;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}

Propagating Trace Context

Code
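// Client sends request
const { context, propagation, trace } = require('@opentelemetry/api');

const headers = {};
// Writes the active trace context into the outgoing headers
propagation.inject(context.active(), headers);

fetch('http://api.example.com/orders', {
  headers
});

// Server receives request
const tracer = trace.getTracer('order-service');
// Rebuilds the caller's context from the incoming headers
const parentContext = propagation.extract(context.active(), req.headers);
const span = tracer.startSpan('handleRequest', undefined, parentContext);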
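What actually travels between services is a small header. With the W3C Trace Context format, it is `traceparent`, made of a version, a 32-character trace ID, a 16-character parent span ID, and flags. The sketch below builds and parses that header by hand to show the idea. The helper names are hypothetical, and in practice the OpenTelemetry propagator does this for you.

```javascript
const { randomBytes } = require('node:crypto');

// Builds a traceparent header: version "00", trace ID, new span ID, "01" = sampled.
function createTraceparent(traceId = randomBytes(16).toString('hex')) {
  const spanId = randomBytes(8).toString('hex');
  return `00-${traceId}-${spanId}-01`;
}

// Parses a traceparent header; returns null when it is malformed.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header || '');
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Service A starts a trace and sends the header downstream.
const outgoing = createTraceparent();

// Service B parses it and continues the same trace with its own span ID.
const incoming = parseTraceparent(outgoing);
const next = createTraceparent(incoming.traceId);

console.log(outgoing);
console.log(next);
```

Both headers share the same trace ID, which is how a tracing backend can stitch spans from different services into one request timeline.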

Putting It All Together

Example: Observable API Endpoint

Code
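const express = require('express');
const winston = require('winston');
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const prometheus = require('prom-client');

const app = express();
const tracer = trace.getTracer('api-service');

// Logger (a minimal setup; see the Log Aggregation section for transports)
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.json(),
  transports: [new winston.transports.Console()]
});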

// Metrics
const requestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests',
  labelNames: ['method', 'route', 'status']
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  const span = tracer.startSpan('http_request');
  
  span.setAttribute('http.method', req.method);
  span.setAttribute('http.route', req.path);
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    
    // Metrics
    requestDuration.observe(
      { method: req.method, route: req.path, status: res.statusCode },
      duration
    );
    
    // Logging
    logger.info('Request completed', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration,
      requestId: req.id
    });
    
    // Tracing
    span.setAttribute('http.status_code', res.statusCode);
    span.setAttribute('http.duration', duration);
    span.end();
  });
  
  next();
});

// Endpoint
app.get('/api/users/:id', async (req, res) => {
  const span = tracer.startSpan('getUser');
  span.setAttribute('user.id', req.params.id);
  
  try {
    const user = await getUser(req.params.id);
    
    logger.info('User retrieved', {
      userId: user.id,
      requestId: req.id
    });
    
    span.setStatus({ code: SpanStatusCode.OK });
    res.json(user);
  } catch (error) {
    logger.error('Failed to get user', {
      userId: req.params.id,
      error: error.message,
      requestId: req.id
    });
    
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    res.status(500).json({ error: 'Internal server error' });
  } finally {
    span.end();
  }
});

Alerting

Setting Up Alerts

Code
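// Pseudocode: errorRate, p95ResponseTime, memoryUsage, and alert()
// stand in for values and hooks provided by your monitoring system.

// Alert on error rate (failed requests as a fraction of all requests)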
if (errorRate > 0.05) {
  alert({
    severity: 'high',
    message: 'Error rate above 5%',
    metric: 'error_rate',
    value: errorRate
  });
}

// Alert on slow responses (p95ResponseTime in milliseconds)
if (p95ResponseTime > 1000) {
  alert({
    severity: 'medium',
    message: 'P95 response time above 1s',
    metric: 'p95_response_time',
    value: p95ResponseTime
  });
}

// Alert on resource usage
if (memoryUsage > 0.9) {
  alert({
    severity: 'high',
    message: 'Memory usage above 90%',
    metric: 'memory_usage',
    value: memoryUsage
  });
}
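The checks above can be expressed as data, which keeps thresholds in one place and makes them easy to test. This is a minimal sketch under stated assumptions: the rule format, the `evaluateAlerts` and `errorRate` helpers, and the snapshot values are all hypothetical. With Prometheus you would more likely write these as alerting rules evaluated by the server.

```javascript
// Hypothetical rule format: a metric name, a threshold, a severity, and a message.
const rules = [
  { metric: 'errorRate', threshold: 0.05, severity: 'high', message: 'Error rate above 5%' },
  { metric: 'p95ResponseTimeMs', threshold: 1000, severity: 'medium', message: 'P95 response time above 1s' },
  { metric: 'memoryUsage', threshold: 0.9, severity: 'high', message: 'Memory usage above 90%' }
];

// Error rate as a ratio: failed requests divided by all requests in the window.
function errorRate(errors, total) {
  return total === 0 ? 0 : errors / total;
}

// Returns the rules whose metric exceeds its threshold, with the observed value attached.
function evaluateAlerts(snapshot, ruleSet) {
  return ruleSet
    .filter((rule) => snapshot[rule.metric] > rule.threshold)
    .map((rule) => ({ ...rule, value: snapshot[rule.metric] }));
}

// Made-up measurements for one evaluation window.
const snapshot = {
  errorRate: errorRate(12, 150), // 0.08
  p95ResponseTimeMs: 420,
  memoryUsage: 0.93
};

const firing = evaluateAlerts(snapshot, rules);
console.log(firing.map((a) => `${a.severity}: ${a.message} (value: ${a.value})`));
```

Here the error-rate and memory rules fire while the latency rule stays quiet. Note that the error rate is a ratio, so the 5% threshold keeps its meaning whether traffic is high or low.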

Tools and Platforms

Logging Tools

  • ELK Stack (Elasticsearch, Logstash, Kibana)
  • Loki (Grafana)
  • Datadog Logs
  • CloudWatch Logs

Metrics Tools

  • Prometheus + Grafana
  • Datadog
  • New Relic
  • CloudWatch Metrics

Tracing Tools

  • Jaeger
  • Zipkin
  • Datadog APM
  • New Relic

Best Practices

  1. Log in JSON - Easier to parse and search
  2. Include context - Request IDs, user IDs, timestamps
  3. Set log levels appropriately - Don't log debug in production
  4. Track key metrics - Error rate, latency, throughput
  5. Use distributed tracing - Essential for microservices
  6. Set up alerts - Know when things go wrong
  7. Retain appropriately - Balance cost and usefulness
  8. Protect sensitive data - Don't log secrets

Real-World Example

System: E-commerce platform with 10 microservices

Observability Setup:

  1. Logging: All services log to centralized ELK stack
  2. Metrics: Prometheus collects metrics, Grafana for visualization
  3. Tracing: Jaeger for distributed tracing
  4. Alerting: PagerDuty for critical alerts

Benefits:

  • Debug time: 2 hours → 15 minutes
  • Mean time to detect: 5 minutes
  • Mean time to resolve: 30 minutes

Conclusion

Observability is an investment that pays dividends. The three pillars work together:

  • Logs tell you what happened
  • Metrics tell you how much
  • Traces tell you where

Start with logging, add metrics, then tracing. You don't need everything at once, but you do need visibility into your systems.

Remember: You can't improve what you can't measure. Good observability is the foundation of reliable systems.

What observability challenges have you faced? What tools have worked best for your systems?
