Building a Service Mesh on Cloudflare Workers
How to architect worker-to-worker communication at the edge. Service bindings vs HTTP, error handling, retry logic, and observability patterns.
When you move from a monolithic worker to a distributed system, you need a communication layer. Traditional service meshes like Istio or Linkerd don't exist at the edge. You have to build your own.
This is the architecture pattern running in production across 28 workers, handling millions of requests with sub-50ms latency.
The Architecture
A service mesh at the edge looks different from traditional microservices. There's no central orchestrator, no sidecar proxies. Each worker is both a service and a potential mesh participant.
Service Bindings vs HTTP Calls
There are two ways for workers to communicate: Service Bindings (direct invocation) and HTTP calls (network round-trip). The choice has significant implications.
| Factor | Service Bindings | HTTP Calls |
|---|---|---|
| Latency | <1ms overhead | 5-15ms overhead |
| Cold Starts | None | Possible |
| Billing | Single request | Multiple requests |
| Configuration | wrangler.toml | None needed |
| Cross-account | Not supported | Fully supported |
| Debugging | Harder to trace | Standard HTTP tools |
Rule of thumb: Use Service Bindings for internal, high-frequency calls. Use HTTP for external integrations and cross-account communication.
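For the cross-account case, the call is an ordinary `fetch` to the other worker's public hostname. A minimal sketch (the URL and bearer-token scheme here are assumptions, not part of the examples below; in practice you would also wrap this in the retry and timeout helpers covered later):

```typescript
// Hypothetical cross-account call over plain HTTP.
// The URL and auth scheme are placeholders for illustration.
async function verifyCrossAccount(token: string): Promise<boolean> {
  const response = await fetch('https://auth.example.workers.dev/verify', {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}` },
  });
  return response.ok;
}
```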
Service Binding Configuration
# Define service bindings in the calling worker
[[services]]
binding = "AUTH"
service = "auth-worker"
[[services]]
binding = "NOTIFY"
service = "notification-worker"
[[services]]
binding = "METRICS"
service = "metrics-worker"
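The calling worker also needs matching types. A minimal sketch of the `Env` interface these bindings produce (the `Fetcher` shape mirrors what the Workers runtime exposes; the stubbed env shown is a hypothetical test double, useful for unit-testing outside the runtime):

```typescript
// Minimal types matching the wrangler.toml bindings above (a sketch;
// real projects typically generate these with `wrangler types`).
interface Fetcher {
  fetch(request: Request): Promise<Response>;
}

interface Env {
  AUTH: Fetcher;
  NOTIFY: Fetcher;
  METRICS: Fetcher;
}

// A stubbed Env for unit tests outside the Workers runtime
const stubEnv: Env = {
  AUTH: { fetch: async () => new Response('ok') },
  NOTIFY: { fetch: async () => new Response('queued') },
  METRICS: { fetch: async () => new Response('recorded') },
};

const res = await stubEnv.AUTH.fetch(new Request('https://auth/verify'));
```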
Calling via Service Binding
export default {
async fetch(request: Request, env: Env): Promise<Response> {
// Service binding call - no network overhead
const authResponse = await env.AUTH.fetch(
new Request('https://auth/verify', {
method: 'POST',
headers: { 'Authorization': request.headers.get('Authorization') ?? '' }
})
);
if (!authResponse.ok) {
return new Response('Unauthorized', { status: 401 });
}
// Continue with authenticated request...
}
}
The URL passed to env.SERVICE.fetch() is ignored for routing; it appears only in logs. The binding itself determines which worker receives the request.
Building a Request Router
The gateway worker needs to route requests to appropriate services. Here's a pattern that scales:
type RouteHandler = (req: Request, env: Env) => Promise<Response>;
const routes: Record<string, RouteHandler> = {
'/api/leads': (req, env) => env.LEADS.fetch(req),
'/api/valuation': (req, env) => env.VALUATION.fetch(req),
'/api/offers': (req, env) => env.OFFERS.fetch(req),
'/api/notify': (req, env) => env.NOTIFY.fetch(req),
};
export function route(request: Request, env: Env): Promise<Response> {
const url = new URL(request.url);
// Find matching route
for (const [pattern, handler] of Object.entries(routes)) {
if (url.pathname.startsWith(pattern)) {
return handler(request, env);
}
}
return new Response('Not Found', { status: 404 });
}
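One benefit of this pattern is that the router is trivially unit-testable outside the Workers runtime: stub the bindings and call `route` directly. A condensed, self-contained sketch (single `LEADS` binding; the stub responses are hypothetical):

```typescript
// Condensed router + stubbed binding for local testing.
interface Env {
  LEADS: { fetch(req: Request): Promise<Response> };
}
type RouteHandler = (req: Request, env: Env) => Promise<Response>;

const routes: Record<string, RouteHandler> = {
  '/api/leads': (req, env) => env.LEADS.fetch(req),
};

function route(request: Request, env: Env): Promise<Response> {
  const url = new URL(request.url);
  for (const [pattern, handler] of Object.entries(routes)) {
    if (url.pathname.startsWith(pattern)) return handler(request, env);
  }
  return Promise.resolve(new Response('Not Found', { status: 404 }));
}

// Stubbed env stands in for the real service binding
const env: Env = {
  LEADS: { fetch: async () => new Response('leads ok', { status: 200 }) },
};

const hit = await route(new Request('https://gateway/api/leads'), env);
const miss = await route(new Request('https://gateway/api/unknown'), env);
```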
Error Handling & Retry Logic
Distributed systems fail. The question is how gracefully. Here's a retry wrapper with exponential backoff:
interface RetryOptions {
maxAttempts: number;
baseDelay: number;
maxDelay: number;
}
const defaults: RetryOptions = {
maxAttempts: 3,
baseDelay: 100,
maxDelay: 2000
};
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));
export async function withRetry<T>(
fn: () => Promise<T>,
options: Partial<RetryOptions> = {}
): Promise<T> {
const opts = { ...defaults, ...options };
let lastError: Error | undefined;
for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
if (attempt === opts.maxAttempts) break;
// Exponential backoff with jitter
const delay = Math.min(
opts.baseDelay * Math.pow(2, attempt - 1),
opts.maxDelay
);
const jitter = delay * 0.1 * Math.random();
await sleep(delay + jitter);
}
}
throw lastError;
}
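In use, the wrapper goes around the service call itself. A self-contained sketch with a condensed variant of the helper (short delays and a deliberately flaky call, so the retry path is easy to observe; the full version above adds the jitter and max-delay cap):

```typescript
// Condensed withRetry variant plus a flaky call that fails twice, then succeeds.
const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelay = 10
): Promise<T> {
  let lastError: Error | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (e) {
      lastError = e as Error;
      if (attempt < maxAttempts) await sleep(baseDelay * 2 ** (attempt - 1));
    }
  }
  throw lastError;
}

// Simulated upstream that recovers on the third attempt
let calls = 0;
const flaky = async () => {
  calls++;
  if (calls < 3) throw new Error('upstream 503');
  return 'ok';
};

const result = await withRetry(flaky);
```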
Circuit Breaker Pattern
For services that might be down, implement a circuit breaker to fail fast:
interface CircuitState {
failures: number;
lastFailure: number;
state: 'closed' | 'open' | 'half-open';
}
export class CircuitBreaker {
private state: CircuitState = { failures: 0, lastFailure: 0, state: 'closed' };
private threshold = 5;
private timeout = 30000; // 30 seconds
async call<T>(fn: () => Promise<T>, fallback?: () => T): Promise<T> {
// Check if circuit should stay open
if (this.state.state === 'open') {
if (Date.now() - this.state.lastFailure < this.timeout) {
if (fallback) return fallback();
throw new Error('Circuit breaker is open');
}
this.state.state = 'half-open';
}
try {
const result = await fn();
this.reset();
return result;
} catch (error) {
this.recordFailure();
if (fallback) return fallback();
throw error;
}
}
private recordFailure() {
this.state.failures++;
this.state.lastFailure = Date.now();
if (this.state.failures >= this.threshold) {
this.state.state = 'open';
}
}
private reset() {
this.state = { failures: 0, lastFailure: 0, state: 'closed' };
}
}
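The important behavior to verify is the trip: after the threshold is crossed, the wrapped function is never invoked again and the fallback is served directly. A condensed, self-contained demo (threshold lowered to 2 and the half-open timeout omitted, so the transition is easy to observe):

```typescript
// Minimal breaker demonstrating the closed -> open transition.
class MiniBreaker {
  failures = 0;
  state: 'closed' | 'open' = 'closed';
  constructor(private threshold = 2) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === 'open') return fallback(); // fail fast
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch {
      if (++this.failures >= this.threshold) this.state = 'open';
      return fallback();
    }
  }
}

const breaker = new MiniBreaker();
const fallback = () => 'cached response';

let attempts = 0;
const failing = async (): Promise<string> => {
  attempts++;
  throw new Error('service down');
};

await breaker.call(failing, fallback); // failure 1
await breaker.call(failing, fallback); // failure 2 - trips the breaker
const r = await breaker.call(failing, fallback); // short-circuits, no call
```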
Observability Layer
Without observability, distributed debugging is impossible. Every request through the mesh needs tracing:
interface TraceContext {
traceId: string;
spanId: string;
parentSpanId?: string;
service: string;
startTime: number;
}
export function createTrace(service: string, parentCtx?: TraceContext): TraceContext {
return {
traceId: parentCtx?.traceId || crypto.randomUUID(),
spanId: crypto.randomUUID().slice(0, 8),
parentSpanId: parentCtx?.spanId,
service,
startTime: Date.now()
};
}
export function injectTraceHeaders(headers: Headers, ctx: TraceContext) {
headers.set('x-trace-id', ctx.traceId);
headers.set('x-span-id', ctx.spanId);
if (ctx.parentSpanId) {
headers.set('x-parent-span-id', ctx.parentSpanId);
}
}
export function logTrace(ctx: TraceContext, env: Env, execCtx: ExecutionContext) {
const duration = Date.now() - ctx.startTime;
// Fire and forget to the logging worker. waitUntil keeps the write
// alive after the response returns; without it the runtime may cancel it.
execCtx.waitUntil(
env.LOGGER.fetch(new Request('https://log/trace', {
method: 'POST',
body: JSON.stringify({ ...ctx, duration })
}))
);
}
Production Metrics
After running this architecture in production: 28 workers communicating through the mesh, millions of requests served, and sub-50ms latency end to end.
Common Pitfalls
- Circular dependencies. Worker A calls B, B calls A. Use dependency injection and clear service boundaries.
- Missing timeouts. Always set timeouts on service calls. Default to 10 seconds max.
- No fallbacks. Every external call should have a degraded response path.
- Over-fetching context. Don't pass the entire request through the mesh. Extract what's needed.
- Ignoring cold starts. Even with bindings, first calls may be slower. Warm critical paths.
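The "missing timeouts" pitfall is worth a sketch. A minimal Promise.race-based wrapper, with a shortened deadline so the demo runs quickly (an assumption for illustration; in real Workers code you would typically pass an AbortSignal to fetch instead, and default the deadline to the 10-second guidance above):

```typescript
// Race a service call against a deadline; reject if the deadline wins.
async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([p, timeout]);
  } finally {
    clearTimeout(timer); // avoid a stray timer on the fast path
  }
}

// A call slower than its deadline fails; a fast one passes through
const slow = new Promise<string>(resolve => setTimeout(() => resolve('late'), 500));
let timedOut = false;
try {
  await withTimeout(slow, 50);
} catch {
  timedOut = true;
}
const fast = await withTimeout(Promise.resolve('ok'), 50);
```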
Implementation Checklist
- Service bindings configured for internal communication
- HTTP calls only for external services
- Retry logic with exponential backoff
- Circuit breakers on critical paths
- Trace context propagation across all calls
- Centralized logging worker
- Timeouts on every service call
- Fallback responses defined
A service mesh isn't a product you install. It's a pattern you implement. At the edge, you build it yourself, but you also control it completely.
Related Articles
Next: Choosing the Right Storage
KV, D1, R2, Durable Objects: when to use each.
→ Read Storage Guide