Zalando’s Product Read API handles the traffic that powers product pages, search results and checkout flows across 25 European markets. Millions of requests per second flow through it. Single-digit millisecond latency isn’t a nice-to-have. A brief slowdown shows up in sales numbers.
Ownership of the routing path changed everything.
For years the API’s batch endpoint unpacked a single request into as many as 100 parallel calls. Each one passed through the cluster’s shared Skipper ingress load balancer. Skipper adds only a couple hundred microseconds per hop. Yet a batch finishes only when the slowest of its 100 calls returns, so even small per-hop jitter gets amplified into visible tail latency. Spikes appeared. Blame was hard to assign. Was it the service? Or the shared router sitting in the hot path?
Conor Gallagher, senior principal engineer at Zalando, laid out the problem and the fix in a detailed post. The team built an in-process client-side load balancer. It now directs more than a million requests per second of internal fan-out traffic. (Zalando Engineering Blog, June 23, 2026)
They didn’t rip out Skipper. Edge traffic still uses it. Internal batch calls now route directly from the calling process to product pods. The result? Latency spikes flattened. Skipper’s fleet scaled down sharply. Daily infrastructure cost for those routes dropped from around $450 to roughly $110.
But getting there required matching Skipper’s behavior exactly. Hash parity came first. Without it, caches would split and DynamoDB load would double.
The team replicated Skipper’s xxHash64 algorithm on a 64-bit virtual-node ring. Each endpoint sits at 100 positions. A binary search finds the nearest clockwise match for a hashed product ID. Unit tests run on every build to guarantee the client-side ring produces identical placements to Skipper for any pod set. In production canary traffic, cache hit ratios stayed identical between paths. No silent drift.
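The ring lookup described above can be sketched in a few lines of Java. This is a minimal illustration, not Zalando's code: the class name HashRing, the "endpoint#i" virtual-node key format and the hash64 function are all assumptions. In particular, hash64 is a stand-in (FNV-1a with a mixing finalizer) so the example runs on the plain JDK; it is not xxHash64, and real parity with Skipper would require the actual algorithm and Skipper's own key format.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a consistent-hash ring with virtual nodes; all names are illustrative. */
final class HashRing {
    private static final int VNODES = 100; // positions per endpoint, as in the article

    private static final class Slot {
        final long position;
        final String owner;
        Slot(long position, String owner) { this.position = position; this.owner = owner; }
    }

    private final long[] positions; // sorted as unsigned 64-bit values
    private final String[] owners;  // owners[i] is the endpoint at positions[i]

    HashRing(List<String> endpoints) {
        if (endpoints.isEmpty()) throw new IllegalArgumentException("no endpoints");
        List<Slot> slots = new ArrayList<>();
        for (String endpoint : endpoints) {
            for (int i = 0; i < VNODES; i++) {
                // Assumed virtual-node key format; Skipper's real format may differ.
                slots.add(new Slot(hash64(endpoint + "#" + i), endpoint));
            }
        }
        slots.sort((a, b) -> Long.compareUnsigned(a.position, b.position));
        positions = new long[slots.size()];
        owners = new String[slots.size()];
        for (int i = 0; i < slots.size(); i++) {
            positions[i] = slots.get(i).position;
            owners[i] = slots.get(i).owner;
        }
    }

    /** Binary search for the first position clockwise from the key's hash, wrapping at the end. */
    String route(String productId) {
        long h = hash64(productId);
        int lo = 0, hi = positions.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (Long.compareUnsigned(positions[mid], h) < 0) lo = mid + 1; else hi = mid;
        }
        return owners[lo == positions.length ? 0 : lo];
    }

    /** Stand-in 64-bit hash (FNV-1a plus a finalizer). NOT xxHash64. */
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }
}
```

The property that matters for cache warmth is that removing one endpoint only moves the keys that endpoint owned; every other product ID keeps its pod.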
The implementation lives in a standalone JVM module. It depends on a zero-allocation hashing library and standard JDK components. Kubernetes discovery and Micrometer metrics sit at the edges. Nothing heavy inside the request path.
Discovery itself demanded care. An early design polled the Kubernetes EndpointSlice API, which risked repeating past incidents where high-frequency calls overwhelmed the control plane. Hundreds of product-sets pods polling on independent schedules would have created exactly that aggregate load.
They switched to a watch-based informer. It lists current slices on startup, then streams add, update and delete events. A two-second debounce folds rapid changes during scale events into one ring rebuild. If the Kubernetes API becomes unreachable, the balancer keeps routing on the last known good state. Connection errors and caller retries cover any endpoints that have gone stale in the meantime.
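The debounce step is simple to picture in code. The sketch below assumes a trailing debounce, where each event restarts a quiet-period timer and the rebuild runs once the events stop; the article does not say whether Zalando uses this variant or a fixed window. The class name RebuildDebouncer and its method names are hypothetical, and the informer itself is left out.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Sketch: collapse a burst of endpoint events into one ring rebuild. Names are illustrative. */
final class RebuildDebouncer {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "ring-rebuild-debounce");
            t.setDaemon(true); // do not keep the JVM alive for this timer
            return t;
        });
    private final long quietMillis;
    private final Runnable rebuild;
    private ScheduledFuture<?> pending;

    RebuildDebouncer(long quietMillis, Runnable rebuild) {
        this.quietMillis = quietMillis;
        this.rebuild = rebuild;
    }

    /** Called for every add, update or delete event from the watch stream. */
    synchronized void onEndpointEvent() {
        if (pending != null) {
            pending.cancel(false); // drop the rebuild scheduled by the previous event
        }
        pending = timer.schedule(rebuild, quietMillis, TimeUnit.MILLISECONDS);
    }
}
```

With the article's figure, the caller would construct it as new RebuildDebouncer(2000, ring::rebuild) for some hypothetical rebuild method. One tradeoff of the trailing variant: a continuous stream of events keeps postponing the rebuild, so a production version would likely also cap the maximum wait.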
None of the advanced features could ship until the deployment pipeline improved. Old gates and sleep timers had piled up over years. Builds took 21 minutes. Median deploys stretched nearly five hours. One ran over four days.
Three pull requests fixed it. Build caching trimmed wall time. Manual traffic steps collapsed into a single sequenced rollout across market groups. Median deploy time fell to 128 minutes. Small, reversible experiments became practical. Engineers began running cache-tuning tests in parallel. Velocity changed.
Rollout used three toggles: a boolean off switch, a percentage ramp for traffic split, and transparent fallback to Skipper on any client-side failure. They stepped from 1% in canary markets to 10%, 50% and finally 100%. Latency dropped immediately. Daily spikes vanished. At peak, over a million requests per second moved off Skipper.
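The three toggles fit together roughly as follows. This is a sketch under stated assumptions: the class RolloutGate and its methods are hypothetical, and the traffic split here buckets requests by a hash so a given key stays on one path during a ramp step. The article does not say whether the real split is hash-based or random per request.

```java
import java.util.function.Supplier;

/** Sketch of an off switch, a percentage ramp and a transparent fallback. Names are illustrative. */
final class RolloutGate {
    private volatile boolean enabled; // boolean off switch
    private volatile int percent;     // 0..100 share of traffic routed client-side

    RolloutGate(boolean enabled, int percent) {
        this.enabled = enabled;
        this.percent = percent;
    }

    void setEnabled(boolean enabled) { this.enabled = enabled; }
    void setPercent(int percent) { this.percent = percent; }

    /** Buckets the request into 0..99 and compares against the ramp percentage. */
    boolean useClientSide(long requestHash) {
        if (!enabled) return false;
        return Long.remainderUnsigned(requestHash, 100) < percent;
    }

    /** Tries the client-side path when selected; any runtime failure falls back to Skipper. */
    <T> T call(long requestHash, Supplier<T> clientSide, Supplier<T> viaSkipper) {
        if (useClientSide(requestHash)) {
            try {
                return clientSide.get();
            } catch (RuntimeException e) {
                // Transparent fallback: the caller never sees the client-side failure.
            }
        }
        return viaSkipper.get();
    }
}
```

Stepping the ramp from 1 to 10, 50 and 100 is then a matter of calling setPercent, and setEnabled(false) sends everything back through Skipper.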
Stability arrived. So did an unexpected cost saving. Skipper pods for these routes shrank from more than 50 to a minimum of eight. The project that began as a latency and observability effort turned into a capacity win.
Advanced techniques layered on top turned steady gains into structural resilience.
Scale-up spikes had long been accepted as normal. New pods arrived with cold caches. A naive ring would slam them with full traffic share. DynamoDB read bursts followed. Latency across the fleet jumped.
Skipper uses probabilistic fade-in. Zalando’s version improves on it with N-ring fade-in. Each scale event creates its own independent ring. That ring fades in over a default 30-second window on a squared-2.5 curve: slow at first, then rapid. Multiple overlapping fades coexist without disruption. Pods receive traffic matching their final steady-state assignment. Caches warm with exactly the right products. No wasted entries. No eviction churn.
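A rough sketch of the fade-in math follows. It reads the "squared-2.5 curve" as a power curve with exponent 2.5, which is an interpretation rather than something the article confirms, and it models the choice between overlapping rings as a chain of weighted coin flips from newest ring to oldest. The class FadeIn and both method names are hypothetical.

```java
import java.util.function.DoubleSupplier;

/** Sketch of a power-curve fade-in across several rings. Names and curve shape are assumptions. */
final class FadeIn {
    /** Fraction of traffic a new ring should receive: 0 at the start, 1 once the window has passed. */
    static double weight(long elapsedMillis, long windowMillis, double exponent) {
        if (elapsedMillis <= 0) return 0.0;
        if (elapsedMillis >= windowMillis) return 1.0;
        return Math.pow((double) elapsedMillis / windowMillis, exponent);
    }

    /**
     * Picks which ring serves a request. weightsNewestFirst[i] is the current fade weight
     * of ring i; the last (oldest) ring is the fallback and always accepts.
     * The array must not be empty. uniform must return values in [0, 1).
     */
    static int pickRing(double[] weightsNewestFirst, DoubleSupplier uniform) {
        for (int i = 0; i < weightsNewestFirst.length - 1; i++) {
            if (uniform.getAsDouble() < weightsNewestFirst[i]) return i;
        }
        return weightsNewestFirst.length - 1;
    }
}
```

With a 30-second window and exponent 2.5, a ring that is 15 seconds old would take about 18% of the traffic it will eventually own, which matches the "slow at first, then rapid" shape. Because a request that lands on the new ring is routed by that ring's final layout, a new pod only ever sees keys it will keep.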
Steady-state balance came next. Some pods ran hot while others idled. In-flight request count seemed the obvious signal. It proved misleading: a pod with a hot cache processes requests quickly and can carry more of them without overload, so a raw count says little about how busy the pod really is.
The team moved to occupancy-based bounded load. They measure actual work in flight adjusted for observed latency. The math keeps any pod from exceeding the fleet average by more than a configurable factor. Load spreads more evenly. Hot pods shed keys sooner. Idle pods pick them up.
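The bounded-load rule can be sketched as below. The article gives no formula, so both pieces here are assumptions: occupancy is modeled as in-flight requests scaled by the pod's observed latency relative to the fleet, and the bound is simply the configurable factor times the fleet's average occupancy. The class BoundedLoad and its method names are hypothetical.

```java
/** Sketch of occupancy-based bounded load on top of a hash ring. Formulas are assumptions. */
final class BoundedLoad {
    /** One possible latency adjustment: slow pods count as more occupied per in-flight request. */
    static double occupancy(int inFlight, double podLatencyMs, double fleetLatencyMs) {
        return inFlight * (podLatencyMs / fleetLatencyMs);
    }

    /**
     * Walks pods in clockwise ring order from the key and returns the first one whose
     * occupancy is within factor * fleet average. candidatesInRingOrder holds indexes
     * into occupancy and must not be empty; factor should be at least 1.0.
     */
    static int pick(int[] candidatesInRingOrder, double[] occupancy, double factor) {
        double total = 0;
        for (double o : occupancy) total += o;
        double bound = factor * (total / occupancy.length);
        for (int pod : candidatesInRingOrder) {
            if (occupancy[pod] <= bound) return pod;
        }
        // Only reachable if the candidate list omits every pod under the bound.
        return candidatesInRingOrder[0];
    }
}
```

A key whose home pod is over the bound spills to the next pod on the ring, so it still lands on a predictable neighbor rather than a random one.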
Availability-zone awareness followed. Cross-AZ traffic carries data-transfer fees in AWS. The load balancer now prefers same-zone endpoints when possible. A latency health factor adjusts scores so a slightly slower same-zone pod can still beat a faster cross-zone one if the cost difference justifies it. The factor tunes the tradeoff.
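One way to picture the zone tradeoff is a score that multiplies a cross-zone endpoint's observed latency by a penalty factor. The article does not describe Zalando's actual scoring, so the class ZoneAwarePicker, the Endpoint type and the crossZonePenalty parameter are all hypothetical, and a real implementation would apply this among ring candidates rather than a flat list.

```java
import java.util.List;

/** Sketch of zone-preferring endpoint choice. Scoring formula and names are assumptions. */
final class ZoneAwarePicker {
    static final class Endpoint {
        final String name;
        final String zone;
        final double latencyMs; // observed latency for this endpoint
        Endpoint(String name, String zone, double latencyMs) {
            this.name = name;
            this.zone = zone;
            this.latencyMs = latencyMs;
        }
    }

    /** Lowest score wins; cross-zone endpoints have their latency scaled by the penalty. */
    static Endpoint pick(List<Endpoint> candidates, String callerZone, double crossZonePenalty) {
        Endpoint best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (Endpoint e : candidates) {
            double score = e.latencyMs * (e.zone.equals(callerZone) ? 1.0 : crossZonePenalty);
            if (score < bestScore) {
                best = e;
                bestScore = score;
            }
        }
        return best; // null only when candidates is empty
    }
}
```

With a penalty of 1.5, a same-zone pod at 1.2 ms beats a cross-zone pod at 1.0 ms, but a same-zone pod at 3.0 ms does not. Raising or lowering the penalty shifts the balance between transfer fees and latency.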
These additions sit on top of the core consistent-hash ring. Each change swaps an immutable snapshot behind an atomic reference. Routing decisions read one consistent view without locks.
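The snapshot-swap pattern looks roughly like this in Java. It is a minimal sketch: the class RoutingTable and its Snapshot type are hypothetical, and the modulo lookup is only a placeholder for the real ring search. The point is the shape, where writers publish a new immutable object and readers take a single reference read.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Sketch of lock-free reads over an immutable routing snapshot. Names are illustrative. */
final class RoutingTable {
    static final class Snapshot {
        final List<String> endpoints; // unmodifiable copy
        final long version;
        Snapshot(List<String> endpoints, long version) {
            this.endpoints = List.copyOf(endpoints);
            this.version = version;
        }
    }

    private final AtomicReference<Snapshot> current =
        new AtomicReference<>(new Snapshot(List.of(), 0));

    /** Writer side: build a fresh snapshot and swap it in atomically. */
    void publish(List<String> endpoints) {
        current.updateAndGet(old -> new Snapshot(endpoints, old.version + 1));
    }

    long version() {
        return current.get().version;
    }

    /** Reader side: one reference read, then everything comes from that same snapshot. */
    String route(long keyHash) {
        Snapshot s = current.get();
        if (s.endpoints.isEmpty()) return null;
        // Placeholder selection; a real balancer would search the hash ring here.
        int index = (int) Long.remainderUnsigned(keyHash, s.endpoints.size());
        return s.endpoints.get(index);
    }
}
```

Because a request holds on to the snapshot it read, a rebuild that lands mid-request cannot hand it a half-updated view.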
Recent industry conversation shows the approach resonates. A Hacker News thread on the Zalando post highlighted the hash-ring replication, fade-in logic and AZ-aware routing as sophisticated distributed-systems work. One commenter noted the project eliminated ambiguity in incident root causes by owning the full path. (Hacker News, July 2, 2026)
Spring Cloud LoadBalancer remains the common starting point for many Java teams moving to client-side routing. Spring’s own guide demonstrates basic integration with RestTemplate or WebClient in microservices. Yet Zalando’s production system at million-request scale required far tighter alignment with existing infrastructure and custom algorithms beyond default round-robin. (Spring.io)
GeeksforGeeks explained the core distinction this year: client-side logic lives inside the caller, while server-side relies on a centralized balancer. The Zalando case shows what happens when that client logic matures to handle fan-out, cache warmth and cost-aware routing at extreme scale. (GeeksforGeeks, May 6, 2026)
Medium posts continue to surface practical implementations. One recent piece outlined how Spring Cloud LoadBalancer replaces legacy Netflix Ribbon in new applications, especially for latency-sensitive services running on Kubernetes. Zalando’s experience adds concrete evidence: when internal traffic dominates and shared edge components introduce uncertainty, moving the decision in-process pays off in observability, cost and predictability. (Medium, 2026)
The team kept the Skipper fallback in code. One ConfigMap change can redirect everything back. That safety net never triggers in normal operation. Yet its presence let them move fast.
Incidents that once mixed Skipper behavior with application code now carry clear logs from inside the process. Root cause becomes obvious. The service grew resilient to infrastructure hiccups beneath it.
Zalando didn’t set out to build a general-purpose client-side balancer. They solved a concrete problem in a high-stakes API. The result offers a model for any organization running Java services on Kubernetes at scale. Match the existing ring. Own discovery. Fix the pipeline first. Add fade-in, occupancy and zone awareness. Measure at every step.
One million requests per second no longer flow through a shared edge router. They route inside the caller. Latency smoothed. Costs fell. Clarity replaced ambiguity. The structural problem finally went away.


WebProNews is an iEntry Publication