Observability in Modern Applications: Logging, Metrics, and Tracing
You can't fix what you can't see. In modern distributed systems, observability isn't optional—it's essential. After building observable systems that handle millions of requests, I've learned that good observability is the difference between debugging for hours and debugging for minutes.
The Three Pillars of Observability
1. Logging
What happened? - Events and errors
2. Metrics
How much? - Quantitative measurements
3. Tracing
Where? - Request flow through systems
Logging
Structured Logging
// Bad: Unstructured
console.log('User created', userId, email);
// Good: Structured
logger.info('User created', {
  userId: user.id,
  email: user.email,
  requestId: req.id,
  service: 'user-service'
});
// Most loggers add the level and timestamp for you; don't duplicate them in metadata
Log Levels
logger.error('Critical error', { error, context });
logger.warn('Potential issue', { warning, context });
logger.info('Important event', { event, context });
logger.debug('Debug information', { debug, context });
What to Log
Do log:
- Request/response IDs
- User actions (with privacy in mind)
- Errors with context
- Performance metrics
- Business events
Don't log:
- Passwords or secrets
- Credit card numbers
- Personal information (unless necessary)
- Excessive debug info in production
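One way to enforce that "don't log" list is to scrub sensitive fields before they reach the logger. A minimal sketch, assuming a flat set of forbidden key names (the `SENSITIVE_KEYS` set and `redact` helper are illustrative, not part of any logging library):

```javascript
// Keys that must never appear in logs (extend for your domain)
const SENSITIVE_KEYS = new Set(['password', 'secret', 'token', 'creditCard', 'ssn']);

// Recursively replace sensitive values with a placeholder before logging
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== 'object') return value;
  const clean = {};
  for (const [key, val] of Object.entries(value)) {
    clean[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redact(val);
  }
  return clean;
}

// Usage: logger.info('User created', redact({ userId: 1, password: 'hunter2' }));
```

Wiring this in as a logger format (for example, a Winston custom format) keeps the redaction in one place instead of at every call site.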
Log Aggregation
// Using Winston with transports
const winston = require('winston');
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.json(),
  defaultMeta: { service: 'api' },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
    new winston.transports.Console({
      format: winston.format.simple()
    })
  ]
});
// Send to external service (Datadog, ELK, etc.)
if (process.env.NODE_ENV === 'production') {
  logger.add(new winston.transports.Http({
    host: 'logs.example.com',
    port: 443,
    path: '/logs',
    ssl: true
  }));
}
Metrics
Types of Metrics
1. Counters
Things that only go up:
const prometheus = require('prom-client');
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status', 'endpoint']
});
// Increment
httpRequestsTotal.inc({ method: 'GET', status: '200', endpoint: '/users' });
2. Gauges
Values that go up and down:
const activeConnections = new prometheus.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});
// Set value
activeConnections.set(42);
// Increment/decrement
activeConnections.inc();
activeConnections.dec();
3. Histograms
Distribution of values:
const requestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  buckets: [0.1, 0.5, 1, 2, 5]
});
// Record duration
const end = requestDuration.startTimer();
// ... do work ...
end();
Key Metrics to Track
// Application metrics
- Request rate (requests/second)
- Error rate (errors/second)
- Response time (P50, P95, P99)
- Throughput (items processed/second)
// System metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
// Business metrics
- Orders per minute
- Revenue per hour
- Active users
- Conversion rate
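The percentile latencies above (P50, P95, P99) are just order statistics over observed durations. A quick sketch using the nearest-rank method (in practice, monitoring systems like Prometheus estimate percentiles from histogram buckets rather than raw samples):

```javascript
// Nearest-rank percentile over raw samples
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Response times in milliseconds
const durationsMs = [120, 80, 400, 95, 110, 900, 150, 130, 100, 200];
percentile(durationsMs, 50); // P50 (median): 120
percentile(durationsMs, 95); // P95: 900 (one slow outlier dominates the tail)
```

This is why P95/P99 matter: the average of the samples above hides the 900 ms outlier that one in twenty users actually experiences.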
Metrics Collection
// Express middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestsTotal.inc({
      method: req.method,
      status: res.statusCode,
      // Prefer the matched route pattern (req.route?.path) over req.path
      // so IDs in URLs don't explode label cardinality
      endpoint: req.path
    });
    requestDuration.observe(duration / 1000);
  });
  next();
});
Distributed Tracing
The Problem
In microservices, a request flows through multiple services:
Client → API Gateway → Auth Service → User Service → Database
→ Order Service → Payment Service
How do you trace a request through all these services?
OpenTelemetry
// Package names for the current OpenTelemetry JS SDK
// (older guides use @opentelemetry/node and @opentelemetry/tracing)
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new JaegerExporter({ endpoint: 'http://localhost:14268/api/traces' })
  )
);
provider.register();
const tracer = provider.getTracer('my-service');
Creating Spans
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');
async function processOrder(orderId) {
  const tracer = trace.getTracer('order-service');
  const span = tracer.startSpan('processOrder');
  span.setAttribute('order.id', orderId);
  // Make this span the active one so child spans nest under it
  // (the OTel JS API takes a context argument, not a `parent` option)
  const ctx = trace.setSpan(context.active(), span);
  try {
    // Step 1: Validate order
    const validateSpan = tracer.startSpan('validateOrder', {}, ctx);
    const order = await validateOrder(orderId);
    validateSpan.end();
    // Step 2: Process payment
    const paymentSpan = tracer.startSpan('processPayment', {}, ctx);
    const payment = await processPayment(order);
    paymentSpan.setAttribute('payment.amount', payment.amount);
    paymentSpan.end();
    span.setStatus({ code: SpanStatusCode.OK });
    return order;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}
Propagating Trace Context
// Client sends request
const { context, propagation } = require('@opentelemetry/api');
const headers = {};
propagation.inject(context.active(), headers);
fetch('http://api.example.com/orders', {
  headers
});
// Server receives request
const extractedContext = propagation.extract(context.active(), req.headers);
const span = tracer.startSpan('handleRequest', {}, extractedContext);
Putting It All Together
Example: Observable API Endpoint
const express = require('express');
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const prometheus = require('prom-client');
const app = express();
const tracer = trace.getTracer('api-service');
// Metrics
const requestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests',
  labelNames: ['method', 'route', 'status']
});
// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  const span = tracer.startSpan('http_request');
  span.setAttribute('http.method', req.method);
  span.setAttribute('http.route', req.path);
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Metrics
    requestDuration.observe(
      { method: req.method, route: req.path, status: res.statusCode },
      duration
    );
    // Logging
    logger.info('Request completed', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration,
      requestId: req.id
    });
    // Tracing
    span.setAttribute('http.status_code', res.statusCode);
    span.setAttribute('http.duration', duration);
    span.end();
  });
  next();
});
// Endpoint
app.get('/api/users/:id', async (req, res) => {
  const span = tracer.startSpan('getUser');
  span.setAttribute('user.id', req.params.id);
  try {
    const user = await getUser(req.params.id);
    logger.info('User retrieved', {
      userId: user.id,
      requestId: req.id
    });
    span.setStatus({ code: SpanStatusCode.OK });
    res.json(user);
  } catch (error) {
    logger.error('Failed to get user', {
      userId: req.params.id,
      error: error.message,
      requestId: req.id
    });
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    res.status(500).json({ error: 'Internal server error' });
  } finally {
    span.end();
  }
});
Alerting
Setting Up Alerts
// Pseudocode: in practice these rules usually live in your alerting
// system (Prometheus Alertmanager, Datadog monitors, etc.), not app code
// Alert on error rate
if (errorRate > 0.05) {
  alert({
    severity: 'high',
    message: 'Error rate above 5%',
    metric: 'error_rate',
    value: errorRate
  });
}
// Alert on slow responses
if (p95ResponseTime > 1000) {
  alert({
    severity: 'medium',
    message: 'P95 response time above 1s',
    metric: 'p95_response_time',
    value: p95ResponseTime
  });
}
// Alert on resource usage
if (memoryUsage > 0.9) {
  alert({
    severity: 'high',
    message: 'Memory usage above 90%',
    metric: 'memory_usage',
    value: memoryUsage
  });
}
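To make the error-rate check concrete: since counters only go up, the rate is the delta in errors divided by the delta in requests between two snapshots one window apart. A minimal sketch with hypothetical snapshots (`errorRate` and `checkErrorRate` are illustrative names, not a real alerting API):

```javascript
// Derive an error rate from two counter snapshots taken one window apart
function errorRate(prev, curr) {
  const requests = curr.totalRequests - prev.totalRequests;
  const errors = curr.totalErrors - prev.totalErrors;
  return requests === 0 ? 0 : errors / requests;
}

// Map a rate to an alert, mirroring the 5% threshold above
function checkErrorRate(rate) {
  if (rate > 0.05) {
    return { severity: 'high', message: 'Error rate above 5%', value: rate };
  }
  return null; // below threshold, no alert
}

const prev = { totalRequests: 1000, totalErrors: 10 };
const curr = { totalRequests: 1200, totalErrors: 30 };
const rate = errorRate(prev, curr); // 20 errors / 200 requests = 0.1
```

Working on deltas rather than absolute counts is what lets counters survive process restarts without skewing the alert.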
Tools and Platforms
Logging Tools
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Loki (Grafana)
- Datadog Logs
- CloudWatch Logs
Metrics Tools
- Prometheus + Grafana
- Datadog
- New Relic
- CloudWatch Metrics
Tracing Tools
- Jaeger
- Zipkin
- Datadog APM
- New Relic
Best Practices
- Log in JSON - Easier to parse and search
- Include context - Request IDs, user IDs, timestamps
- Set log levels appropriately - Don't log debug in production
- Track key metrics - Error rate, latency, throughput
- Use distributed tracing - Essential for microservices
- Set up alerts - Know when things go wrong
- Retain appropriately - Balance cost and usefulness
- Protect sensitive data - Don't log secrets
Real-World Example
System: E-commerce platform with 10 microservices
Observability Setup:
- Logging: All services log to centralized ELK stack
- Metrics: Prometheus collects metrics, Grafana for visualization
- Tracing: Jaeger for distributed tracing
- Alerting: PagerDuty for critical alerts
Benefits:
- Debug time: 2 hours → 15 minutes
- Mean time to detect: 5 minutes
- Mean time to resolve: 30 minutes
Conclusion
Observability is an investment that pays dividends. The three pillars work together:
- Logs tell you what happened
- Metrics tell you how much
- Traces tell you where
Start with logging, add metrics, then tracing. You don't need everything at once, but you do need visibility into your systems.
Remember: You can't improve what you can't measure. Good observability is the foundation of reliable systems.
What observability challenges have you faced? What tools have worked best for your systems?