02/20/26

Distributed Tracing in Microservices: A Practical Guide

Follow requests across service boundaries without losing context

11 Min Read

A request hits your API gateway, flows through an auth service, queries an orders service, which calls inventory and then payment. The response comes back slow. Which service is the bottleneck? Without distributed tracing, the answer is "check the logs and hope you can correlate timestamps." With tracing, you get a single timeline showing every service call, every database query, and exactly where the time went.

In a monolith, a stack trace gives you the full picture. In microservices, a stack trace ends at the network boundary. Distributed tracing fills that gap by threading a unique identifier through every service-to-service call, so you can reconstruct the full journey of a request after the fact.

This guide covers how distributed tracing works, what makes it hard, and how to set it up in a TypeScript microservices system. We'll start with the manual approach using OpenTelemetry, then look at how modern frameworks can eliminate most of this work entirely.

How Distributed Tracing Works

Three concepts make up the core model, borrowed from Google's Dapper paper and standardized by OpenTelemetry:

Trace: a single request's journey through the system. Every trace gets a unique ID (typically a 128-bit hex string) that stays the same across all services.

Span: one unit of work within a trace. A span has a name, a start time, a duration, and optional metadata (attributes). Each service call, each database query, each cache lookup is a span.

Parent-child relationships: spans form a tree. The initial request creates a root span. When that service calls another service, the outgoing call creates a child span. The child's span ID is linked to the parent's span ID. This is what lets you build the waterfall view.
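The parent-child model is simple enough to hold in code. As an illustrative sketch (the field names here are hypothetical, not any specific SDK's types), this is essentially what a trace backend does to rebuild the tree from a flat list of reported spans:

```typescript
// Minimal span model and tree reconstruction (hypothetical field names).
interface Span {
  spanId: string;
  parentId?: string; // undefined for the root span
  name: string;
  children?: Span[];
}

// Group spans by ID, then link each span to its parent by span ID.
// The span with no resolvable parent is the root of the waterfall.
function buildTree(spans: Span[]): Span | undefined {
  const byId = new Map(spans.map((s) => [s.spanId, { ...s, children: [] as Span[] }]));
  let root: Span | undefined;
  for (const span of byId.values()) {
    if (span.parentId && byId.has(span.parentId)) {
      byId.get(span.parentId)!.children!.push(span);
    } else {
      root = span;
    }
  }
  return root;
}
```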

Here's what a trace looks like for an order creation that touches three services:

Trace ID: abc123

orders.create        [===============================] 120ms
  users.get          [====]                             15ms
    db.query          [==]                               8ms
  inventory.reserve        [========]                   35ms
    db.query                [=====]                     22ms
  payment.charge                     [============]     55ms
    stripe.api                        [=========]       42ms

From this waterfall, you can see that the payment service takes nearly half the total time, and within it, the Stripe API call is the bottleneck. That's the kind of insight you can't get from logs alone.

The Context Propagation Problem

The tracing model is straightforward. The hard part is getting trace context from one service to another.

When service A calls service B over HTTP, the trace ID and parent span ID need to travel with the request. The W3C Trace Context standard defines a traceparent header for this:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             |  |                                |                |
         version           trace-id                 parent-id   flags
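To make the field layout concrete, here's a minimal sketch of parsing a traceparent header into its parts. This is illustrative only; in practice the OpenTelemetry SDK's W3C trace context propagator handles parsing and validation for you:

```typescript
// Illustrative traceparent parser; does not cover every rule in the
// W3C Trace Context spec (e.g. all-zero IDs are invalid).
interface TraceContext {
  version: string;  // 2 hex chars
  traceId: string;  // 32 hex chars (128-bit)
  parentId: string; // 16 hex chars (64-bit span ID)
  sampled: boolean; // low bit of the flags field
}

function parseTraceparent(header: string): TraceContext | null {
  const parts = header.trim().split("-");
  if (parts.length !== 4) return null;
  const [version, traceId, parentId, flags] = parts;
  if (!/^[0-9a-f]{2}$/.test(version)) return null;
  if (!/^[0-9a-f]{32}$/.test(traceId)) return null;
  if (!/^[0-9a-f]{16}$/.test(parentId)) return null;
  if (!/^[0-9a-f]{2}$/.test(flags)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```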

Every service in the chain has to:

  1. Extract the traceparent header from incoming requests
  2. Create a new span linked to the parent
  3. Inject the updated traceparent header into any outgoing requests
  4. Report the completed span to a collector

Miss any of those steps and you get broken traces. Spans float disconnected from their parents, or entire services disappear from the timeline.

This is where most teams hit friction. It's not that any single step is difficult. It's that every service, every HTTP client, every message queue consumer, and every background job needs to do it consistently. One service that forgets to propagate headers breaks the trace for everything downstream.

A Multi-Service Example

Let's build a concrete system to trace. Three services handle order processing:

  • orders: receives the order request, orchestrates the flow
  • inventory: checks and reserves stock
  • payment: charges the customer

The request flow for creating an order:

Client --> orders.create
               |--> inventory.check
               |--> inventory.reserve
               '--> payment.charge

The orders service calls inventory twice (check, then reserve) and payment once. A single client request fans out into three service-to-service calls, each needing trace context.

Manual Approach with OpenTelemetry

Here's what instrumenting this system looks like with the OpenTelemetry Node.js SDK. You need to set up a tracer provider, configure an exporter, and manually instrument every service.

First, the shared tracing setup that every service needs:

// tracing.ts (shared across all services)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";

export function initTracing(serviceName: string) {
  const sdk = new NodeSDK({
    resource: new Resource({
      [ATTR_SERVICE_NAME]: serviceName,
    }),
    traceExporter: new OTLPTraceExporter({
      url: "http://localhost:4318/v1/traces",
    }),
    instrumentations: [getNodeAutoInstrumentations()],
  });
  sdk.start();
}

Now the orders service. Notice how much code is tracing infrastructure vs business logic:

// orders/index.ts
import { initTracing } from "../tracing";
initTracing("orders"); // Must run before anything else

import express from "express";
import { trace, context, propagation, SpanStatusCode } from "@opentelemetry/api";

const app = express();
app.use(express.json());
const tracer = trace.getTracer("orders");

app.post("/orders", async (req, res) => {
  // Extract trace context from incoming request
  const parentContext = propagation.extract(context.active(), req.headers);

  // Start a span within the extracted context
  await context.with(parentContext, async () => {
    const span = tracer.startSpan("orders.create");
    try {
      // Call inventory service (must propagate context)
      const headers: Record<string, string> = {};
      propagation.inject(context.active(), headers);
      const stockCheck = await fetch("http://localhost:3001/inventory/check", {
        method: "POST",
        headers: { "Content-Type": "application/json", ...headers },
        body: JSON.stringify({ productId: req.body.productId, quantity: req.body.quantity }),
      });
      if (!stockCheck.ok) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: "Out of stock" });
        return res.status(400).json({ error: "Insufficient inventory" });
      }

      // Call payment service (must propagate context again)
      const paymentHeaders: Record<string, string> = {};
      propagation.inject(context.active(), paymentHeaders);
      const payment = await fetch("http://localhost:3002/payment/charge", {
        method: "POST",
        headers: { "Content-Type": "application/json", ...paymentHeaders },
        body: JSON.stringify({ userId: req.body.userId, amount: req.body.amount }),
      });
      if (!payment.ok) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: "Payment failed" });
        return res.status(400).json({ error: "Payment failed" });
      }

      // Reserve inventory (propagate context once more)
      const reserveHeaders: Record<string, string> = {};
      propagation.inject(context.active(), reserveHeaders);
      await fetch("http://localhost:3001/inventory/reserve", {
        method: "POST",
        headers: { "Content-Type": "application/json", ...reserveHeaders },
        body: JSON.stringify({ productId: req.body.productId, quantity: req.body.quantity }),
      });

      span.setStatus({ code: SpanStatusCode.OK });
      res.json({ orderId: "order-123", status: "confirmed" });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.recordException(err as Error);
      res.status(500).json({ error: "Internal error" });
    } finally {
      span.end();
    }
  });
});

app.listen(3000);

The inventory and payment services need similar boilerplate: extract context, create spans, inject context on any outgoing calls. Here's inventory:

// inventory/index.ts
import { initTracing } from "../tracing";
initTracing("inventory");

import express from "express";
import { trace, context, propagation, SpanStatusCode } from "@opentelemetry/api";

const app = express();
app.use(express.json());
const tracer = trace.getTracer("inventory");

app.post("/inventory/check", async (req, res) => {
  const parentContext = propagation.extract(context.active(), req.headers);
  await context.with(parentContext, async () => {
    const span = tracer.startSpan("inventory.check");
    try {
      // Database query to check stock
      const available = await checkStock(req.body.productId, req.body.quantity);
      span.setAttribute("product.id", req.body.productId);
      span.setAttribute("inventory.available", available);
      span.setStatus({ code: SpanStatusCode.OK });
      res.json({ available });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.recordException(err as Error);
      res.status(500).json({ error: "Check failed" });
    } finally {
      span.end();
    }
  });
});

app.post("/inventory/reserve", async (req, res) => {
  const parentContext = propagation.extract(context.active(), req.headers);
  await context.with(parentContext, async () => {
    const span = tracer.startSpan("inventory.reserve");
    try {
      await reserveStock(req.body.productId, req.body.quantity);
      span.setStatus({ code: SpanStatusCode.OK });
      res.json({ reserved: true });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.recordException(err as Error);
      res.status(500).json({ error: "Reserve failed" });
    } finally {
      span.end();
    }
  });
});

app.listen(3001);

That's a lot of code that has nothing to do with orders, inventory, or payments. And this is the happy path. It doesn't include retry logic, timeout handling, or the configuration needed to run a Jaeger or Tempo collector to actually view the traces.

Every new service, every new endpoint, every new inter-service call needs the same extract-span-inject pattern. It's mechanical and repetitive, but you can't skip it. One missing propagation.inject() call and traces break silently.

Automatic Approach with Encore

Encore takes a different approach. Available for both TypeScript and Go, Encore owns the transport layer between services and handles context propagation automatically. You write the same three services, and tracing works without any instrumentation code.

Here's the same system in Encore:

// inventory/inventory.ts
import { api, APIError } from "encore.dev/api";
import { db } from "./db";

interface StockCheck {
  available: boolean;
}

export const check = api(
  { method: "POST", path: "/inventory/check" },
  async ({ productId, quantity }: { productId: string; quantity: number }): Promise<StockCheck> => {
    const row = await db.queryRow<{ inventory: number }>`
      SELECT inventory FROM products WHERE id = ${productId}
    `;
    return { available: (row?.inventory ?? 0) >= quantity };
  }
);

export const reserve = api(
  { method: "POST", path: "/inventory/reserve" },
  async ({ productId, quantity }: { productId: string; quantity: number }): Promise<void> => {
    // RETURNING lets us tell whether a row was actually updated.
    const row = await db.queryRow<{ id: string }>`
      UPDATE products SET inventory = inventory - ${quantity}
      WHERE id = ${productId} AND inventory >= ${quantity}
      RETURNING id
    `;
    if (!row) {
      throw APIError.failedPrecondition("insufficient inventory");
    }
  }
);

// payment/payment.ts
import { api, APIError } from "encore.dev/api";

interface ChargeResult {
  chargeId: string;
}

export const charge = api(
  { method: "POST", path: "/payment/charge" },
  async ({ userId, amount }: { userId: string; amount: number }): Promise<ChargeResult> => {
    // Call payment provider
    const result = await processPayment(userId, amount);
    return { chargeId: result.id };
  }
);

// orders/orders.ts
import { api } from "encore.dev/api";
import { inventory, payment } from "~encore/clients";
import { db } from "./db";

interface Order {
  orderId: string;
  status: string;
}

export const create = api(
  { expose: true, method: "POST", path: "/orders" },
  async (req: { userId: string; productId: string; quantity: number; amount: number }): Promise<Order> => {
    // Check inventory
    const stock = await inventory.check({ productId: req.productId, quantity: req.quantity });
    if (!stock.available) {
      throw new Error("Insufficient inventory");
    }

    // Charge payment
    await payment.charge({ userId: req.userId, amount: req.amount });

    // Reserve inventory
    await inventory.reserve({ productId: req.productId, quantity: req.quantity });

    // Create order record
    const row = await db.queryRow<{ id: string }>`
      INSERT INTO orders (user_id, product_id, quantity, amount, status)
      VALUES (${req.userId}, ${req.productId}, ${req.quantity}, ${req.amount}, 'confirmed')
      RETURNING id
    `;
    return { orderId: row!.id, status: "confirmed" };
  }
);

All tracing is handled automatically. The service-to-service calls through ~encore/clients carry trace context because Encore controls the RPC layer. Database queries appear in traces too, since Encore manages the database connections. You focus entirely on business logic, and the instrumentation just works.

The resulting trace looks the same: a waterfall showing orders.create calling inventory.check, payment.charge, and inventory.reserve with timing for each. You just didn't write any of the plumbing.

Reading Traces Across Services

Having traces is only useful if you can read them. A distributed trace viewer presents requests as a waterfall timeline, with each row representing a span. The hierarchy (indentation) shows which service called which.

Here's what to look for:

Long spans with no children. The service itself is slow. Look at database queries or external API calls within that span. If payment.charge takes 200ms and the Stripe API call inside takes 195ms, the bottleneck is Stripe, not your code.

Many sequential children. The parent is making calls one after another when some could run in parallel. If orders.create calls inventory.check, waits, then calls payment.charge, waits, then calls inventory.reserve, you might be able to overlap the check and charge steps.
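As a sketch of that overlap (with stub functions standing in for the real service clients, and assuming your business rules allow charging concurrently with the stock check; a failed check would then need a compensating refund):

```typescript
// Stubs standing in for the real inventory and payment clients.
async function checkInventory(productId: string): Promise<boolean> {
  return true; // pretend stock is available
}
async function chargePayment(userId: string, amount: number): Promise<string> {
  return "charge-123"; // pretend the charge succeeded
}

// Sequential calls cost check + charge; overlapped calls cost
// max(check, charge), which a trace waterfall makes visible.
async function createOrder(productId: string, userId: string, amount: number): Promise<string> {
  const [available, chargeId] = await Promise.all([
    checkInventory(productId),
    chargePayment(userId, amount),
  ]);
  if (!available) {
    // A real system would refund the charge here (compensation).
    throw new Error("insufficient inventory");
  }
  return chargeId;
}
```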

Gaps between spans. Time is being spent between service calls. This usually means application logic, serialization overhead, or connection pool contention.

Error propagation. When a child span fails, the error bubbles up. Traces show you exactly which service in the chain returned the error and what it was, so you're not guessing from a generic 500 response.

In Encore, the local development dashboard at localhost:9400 shows traces for every request during development. Click any request to see the waterfall. In production, Encore Cloud provides the same view with search and filtering across environments.

Production Considerations

Development tracing is simple: capture everything, store it locally, look at it when something breaks. Production tracing introduces trade-offs.

Sampling. Tracing every request in a high-traffic system generates enormous data volumes. A service handling 10,000 requests per second produces millions of spans per minute. Most tracing systems support sampling strategies: head-based sampling (decide at the start whether to trace), tail-based sampling (decide after the fact based on duration or errors), or a combination. A common starting point is tracing 1-10% of normal requests and 100% of errors.
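The idea behind head-based ratio sampling fits in a few lines. This is an illustrative sketch, not a real SDK API (OpenTelemetry ships a TraceIdRatioBasedSampler for production use); deriving the decision from the trace ID, rather than rolling a die in each service, means every service makes the same keep/drop choice for a given trace. The "100% of errors" part requires tail-based sampling, since errors aren't known when the trace starts:

```typescript
// Illustrative head-based sampler: keep roughly `ratio` of traces,
// deciding deterministically from the trace ID so all services agree.
function shouldSample(traceId: string, ratio: number): boolean {
  // Map the last 8 hex chars of the 128-bit trace ID to [0, 1).
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}
```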

Storage and cost. Spans are small individually but add up fast. A single traced request through five services might produce 15-20 spans. At scale, trace storage becomes a meaningful cost. Jaeger, Grafana Tempo, and Honeycomb all have different storage backends and retention models. Evaluate based on your query patterns: do you need traces from last week, or just the last hour?

Span attributes. Adding custom attributes to spans (user ID, order ID, feature flags) makes traces searchable. But avoid putting high-cardinality values in span names, and be careful not to include sensitive data. Trace data often has different access controls than application logs.
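One defensive pattern is a small sanitizer that drops obviously sensitive keys before attaching attributes to a span. The key list below is a hypothetical example; tune it to your own data:

```typescript
// Hypothetical attribute sanitizer: filter out keys that look sensitive
// before passing the result to something like span.setAttributes(...).
const SENSITIVE_KEY_PARTS = ["password", "token", "secret", "card", "ssn"];

function safeAttributes(
  attrs: Record<string, string | number | boolean>
): Record<string, string | number | boolean> {
  return Object.fromEntries(
    Object.entries(attrs).filter(
      ([key]) => !SENSITIVE_KEY_PARTS.some((part) => key.toLowerCase().includes(part))
    )
  );
}
```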

Service mesh interactions. If you're running a service mesh like Istio or Linkerd, the mesh can add its own spans. This is useful for network-level visibility but can clutter traces. Make sure your application-level spans and mesh-level spans use the same trace context format so they stitch together correctly.

Alerting on traces. Traces aren't just for debugging after the fact. Set up alerts for traces exceeding duration thresholds or traces with specific error patterns. This catches regressions before they become incidents.

With Encore Cloud, sampling and storage are handled for you. Tracing is available on all plans with usage-based pricing. Traces are retained and searchable from the dashboard without running your own collector infrastructure.

Choosing Your Approach

The manual OpenTelemetry approach gives you full control. You decide exactly what to instrument, which attributes to attach, and where to send the data. The cost is boilerplate in every service and the ongoing maintenance of keeping instrumentation consistent as your system evolves.

The framework approach with Encore gives you distributed tracing from the first service call. Everything is instrumented automatically, the collector is built in during development, and traces are always complete across service boundaries. You trade some customization for the guarantee that tracing always works.

For most teams building TypeScript microservices, the instrumentation burden is the main reason tracing gets deferred or done inconsistently. A framework that handles it automatically means you have traces from day one, when the system is still simple enough to understand, rather than trying to retrofit them after a production incident forces the issue.

Ready to escape the maze of complexity?

Encore Cloud is the development platform for building robust type-safe distributed systems with declarative infrastructure.