My Next Hop Blog
The Networking Behind AI Inference at Scale — What Engineers Targeting Anthropic, Google, and Meta Need to Know
Prefill-decode disaggregation is now the default serving architecture at every major AI lab. The network connecting these pools — RoCEv2 fabric, KV cache transfer, L3 Clos over L2 — is becoming a core interview topic for senior infrastructure engineers.
There is an important distinction between the networking that connects GPUs during model training and the networking that serves a model in production — and as of 2026, both are interview topics. The GPU cluster networking post on this blog covered training: InfiniBand, RoCEv2, all-reduce collectives, and lossless fabric design. This post covers inference: the architecture that runs when a user sends a prompt to Claude, Gemini, or GPT, and why the network connecting the serving infrastructure has become a senior engineering interview topic at Anthropic, Google, Meta, and Microsoft.
Inference serving at scale has split into two computationally distinct phases that now run on separate hardware pools. The prefill phase processes the input prompt: it reads every token in the prompt, runs the full forward pass over all of them in parallel, and produces the KV cache, the per-layer attention key and value tensors that the model reuses during generation instead of recomputing them. This phase is compute-bound: it saturates the GPU's matrix-multiply units. The decode phase generates output tokens one at a time, autoregressively, reading the KV cache from memory on each step. This phase is memory-bandwidth-bound: each step does little arithmetic relative to the data it has to read, so most of the GPU's compute sits idle. Because the resource profiles differ, running both phases on the same GPU pool leaves either compute or memory bandwidth underutilised. The solution, now deployed by Meta, LinkedIn, Mistral, and Hugging Face in production, is prefill-decode disaggregation: separate GPU pools for each phase, with the network transferring the KV cache between them.
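The compute-bound versus memory-bandwidth-bound split can be checked with a rough roofline calculation. The sketch below is a simplification, not a profiler: it counts only dense weight reads (ignoring attention and KV cache traffic), and the hardware figures PEAK_FLOPS and MEM_BW are illustrative assumptions rather than the spec of any particular accelerator.

```python
# Rough roofline check: is a phase limited by compute or by memory bandwidth?
# All hardware numbers below are illustrative assumptions.


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    A dense layer does roughly 2 FLOPs per parameter per token, while the
    weights are read from memory once per pass regardless of token count.
    """
    flops_per_param = 2 * tokens_per_pass
    return flops_per_param / bytes_per_param


def bound(tokens_per_pass: int, peak_flops: float, mem_bw_bytes: float) -> str:
    """Classify a pass by comparing its intensity with the machine balance."""
    machine_balance = peak_flops / mem_bw_bytes  # FLOPs the chip can do per byte read
    if arithmetic_intensity(tokens_per_pass) >= machine_balance:
        return "compute-bound"
    return "memory-bandwidth-bound"


PEAK_FLOPS = 1.0e15  # assumed: 1 PFLOP/s of dense 16-bit compute
MEM_BW = 3.0e12      # assumed: 3 TB/s of memory bandwidth

prefill = bound(4096, PEAK_FLOPS, MEM_BW)  # whole 4096-token prompt in one pass
decode = bound(1, PEAK_FLOPS, MEM_BW)      # one new token per pass, batch of 1
print(prefill, decode)
```

Batching many decode requests together raises the decode intensity, which is one reason the two pools are sized and scheduled differently.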
That KV cache transfer is where the network becomes load-bearing rather than incidental. For a 70-billion-parameter model serving a long context window, the KV cache per request spans several gigabytes. Across hundreds of concurrent sessions, the aggregate working set is enormous. The constraint is tight: the handoff delay, meaning the time it takes to transfer the KV cache from a prefill GPU to a decode accelerator, cannot exceed the decode phase's per-token latency without degrading the user-facing response speed. That translates to a requirement for multi-gigabyte tensor transfers that sustain close to line rate with sub-millisecond tail latency on each message, across a fabric that also has to carry hundreds of such transfers simultaneously. The network technology that meets this requirement is RoCEv2: RDMA (Remote Direct Memory Access) over Converged Ethernet version 2, which encapsulates RDMA traffic in UDP/IP over standard Ethernet and lets a prefill GPU write directly into a decode accelerator's memory without involving the CPU on either side.
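To put a number on "several gigabytes", here is a back-of-envelope sketch. The model dimensions (80 layers, 8 key-value heads with grouped-query attention, head dimension 128, 16-bit values) are assumptions in the style of publicly documented 70B-class models, and the 400 Gb/s link speed is likewise an assumed figure, not something stated by any of the companies above.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one request.

    The leading factor of 2 accounts for one key tensor plus one value
    tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


def serialization_seconds(num_bytes: int, link_gbps: float) -> float:
    """Time to push num_bytes through one link at full line rate.

    Ignores protocol overhead, congestion, and propagation delay.
    """
    return num_bytes * 8 / (link_gbps * 1e9)


# Assumed 70B-class dimensions and a 32K-token prompt.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context_tokens=32_768)
seconds = serialization_seconds(size, link_gbps=400)

print(f"{size / 2**30:.1f} GiB per request")
print(f"{seconds * 1e3:.0f} ms on one 400 Gb/s link")
```

Under these assumptions a single request produces about 10 GiB of cache, and serialization over one link alone takes on the order of 200 ms, so sustained fabric bandwidth sits on the critical path alongside latency.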
The fabric design question, how to build the network that connects these pools, is where senior interview candidates get separated. The answer that appears in production at hyperscale is L3 to the host: a routed leaf-spine Clos topology with BGP unnumbered and no Layer 2 in the AI back-end fabric. The failure modes of the alternative, stretched Layer 2 or EVPN with bridge domains, are specific and severe in this context. MAC table explosion is the first problem: MAC state scales as O(endpoints × switches), and at the endpoint counts of a large inference fleet, merchant ASICs hit their MAC table limits and begin flooding unknown-unicast frames. That flooding puts extra traffic on the exact links carrying KV cache transfers, triggering the Priority Flow Control pause storms that destroy the tail latency those transfers depend on. The second problem is ingress replication amplification: broadcast events in large bridge domains generate hundreds of replicated unicast copies from ingress VTEPs, saturating the leaf-to-spine links at precisely the wrong moment. L3 eliminates both by confining state to hierarchically summarised IP routes and routing adjacencies, with ECMP providing full bisection bandwidth across all leaf-spine paths because no spanning tree is blocking any of them.
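The state-scaling argument is easy to make concrete. The sketch below compares worst-case MAC state in one flat bridge domain with route state in a routed fabric. It assumes every leaf advertises a single summary prefix for its hosts, and the fabric dimensions are invented for illustration.

```python
def l2_mac_entries(endpoints: int, switches: int) -> int:
    """Worst case for one flat bridge domain: every switch learns every MAC."""
    return endpoints * switches


def l3_route_entries(leaves: int, spines: int, hosts_per_leaf: int) -> int:
    """Fabric-wide route state, assuming one summary prefix per leaf."""
    per_leaf = hosts_per_leaf + (leaves - 1)  # local hosts + one prefix per remote leaf
    per_spine = leaves                        # one prefix per leaf
    return leaves * per_leaf + spines * per_spine


# Hypothetical fabric: 256 leaves, 16 spines, 32 hosts per leaf.
leaves, spines, hosts_per_leaf = 256, 16, 32
endpoints = leaves * hosts_per_leaf

l2_total = l2_mac_entries(endpoints, leaves + spines)
l3_total = l3_route_entries(leaves, spines, hosts_per_leaf)

print(f"endpoints:          {endpoints}")
print(f"L2 MAC entries:     {l2_total}")
print(f"L3 route entries:   {l3_total}")
print(f"per-switch L2 / L3: {endpoints} / {hosts_per_leaf + leaves - 1}")
```

The per-switch figure is the one that matters for ASIC table limits: in the L2 case every switch may need to hold an entry for every endpoint, while a leaf in the summarised L3 case holds its local hosts plus one prefix per remote leaf.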
RoCEv2 still requires near-lossless behavior even over a correctly designed L3 fabric. The mechanisms that provide this, Priority Flow Control (PFC) pause frames and DCQCN congestion notification, interact in ways that matter for interview answers. PFC pause frames stop transmission on a specific priority queue when a downstream buffer is about to overflow, preventing packet drops and the retransmission storms that destroy tail latency. DCQCN is the early congestion response: switches mark packets with Explicit Congestion Notification (ECN) as queues start to build, the receiving NIC returns congestion notification packets to the sender, and the sender lowers its rate before buffers fill and PFC is needed. The interview question that appears in this area asks what breaks when DCQCN is misconfigured or silently disabled on specific switches. The failure mode is subtle: average link utilization looks normal, PFC pause counters accumulate on queues that should not be paused, and KV cache transfers experience unpredictable tail latency spikes that are difficult to attribute without per-queue telemetry.
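The ordering of the two mechanisms can be shown with a small model of one switch queue. This is a minimal sketch: the marking curve is the usual three-parameter form (a minimum threshold, a maximum threshold, and a maximum marking probability), and every threshold value here is an illustrative assumption, not a vendor recommendation.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """Probability that a packet is ECN-marked at a given queue depth.

    No marking below kmin, a linear ramp up to pmax at kmax, and every
    packet marked once the queue exceeds kmax.
    """
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


def queue_state(queue_kb: float, kmin_kb: float = 100, kmax_kb: float = 400,
                pmax: float = 0.2, pfc_xoff_kb: float = 800,
                ecn_enabled: bool = True) -> tuple[float, bool]:
    """Return (ECN marking probability, whether PFC pause is asserted)."""
    mark_p = ecn_mark_probability(queue_kb, kmin_kb, kmax_kb, pmax) if ecn_enabled else 0.0
    pause = queue_kb >= pfc_xoff_kb  # PFC fires only when the buffer is nearly full
    return mark_p, pause


for depth in (50, 250, 900):
    print(depth, queue_state(depth), queue_state(depth, ecn_enabled=False))
```

With marking enabled, senders get a signal well below the pause threshold. With marking disabled on a switch, the first signal anyone receives from that hop is the pause itself, which matches the symptom of pause counters climbing while utilization looks normal.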
The most recent development in this space is MRC — Multipath Reliable Connection — published by OpenAI under the Open Compute Project in May 2026, co-developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA. MRC extends RDMA semantics to distribute a single connection across hundreds of network paths simultaneously, rather than the single path that a conventional RDMA queue pair uses. This eliminates the elephant-flow collisions that ECMP hash-based load balancing produces on large fabrics, and it provides microsecond failure rerouting when a link or switch fails — keeping inference jobs alive through failures that previously caused full request termination. MRC is already deployed across OpenAI's GB200 supercomputers and Microsoft's Fairwater infrastructure. For engineers targeting Anthropic, Microsoft, or Google, MRC is now the expected evolution beyond vanilla RoCEv2, and understanding how it differs from both InfiniBand's credit-based flow control and standard RDMA queue pair semantics is the depth level senior panels will probe.
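The elephant-flow collision problem is generic to per-flow hashing and can be illustrated without reference to any MRC internals. The sketch below is not MRC's algorithm. It only contrasts pinning each flow to one path by hashing its 5-tuple with an idealised even spread of every flow across all paths. The CRC32 hash, the addresses, and the flow rates are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter


def ecmp_path(flow: tuple, num_paths: int) -> int:
    """Pick one path per flow by hashing its 5-tuple (CRC32 as a stand-in hash)."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_paths


def per_flow_load(flows: list, num_paths: int) -> list:
    """Load per path in Gb/s when each flow is pinned to a single path."""
    load = Counter()
    for flow, gbps in flows:
        load[ecmp_path(flow, num_paths)] += gbps
    return [load[p] for p in range(num_paths)]


def sprayed_load(flows: list, num_paths: int) -> list:
    """Load per path in Gb/s when every flow is spread evenly over all paths."""
    total = sum(gbps for _, gbps in flows)
    return [total / num_paths] * num_paths


# Eight hypothetical 100 Gb/s RDMA flows: (src, dst, protocol, src port, dst port).
# Protocol 17 is UDP and 4791 is the RoCEv2 destination port.
flows = [((f"10.0.0.{i}", f"10.0.1.{i}", 17, 49152 + i, 4791), 100.0) for i in range(8)]

hashed = per_flow_load(flows, num_paths=8)
sprayed = sprayed_load(flows, num_paths=8)

print("per-flow hashing:", hashed)
print("even spraying:   ", sprayed)
```

A hash rarely lands eight flows on eight distinct paths, so some paths usually end up carrying two or more elephants while others sit idle. Spreading a connection over many paths removes that dependence on hash luck, at the cost of needing a transport that tolerates out-of-order arrival.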
The interview question pattern for these roles shifts depending on whether the company is operating an inference cluster or building one. At companies already running large inference fleets — Anthropic, Google, Meta — interviewers tend to run scenario-based questions: KV cache transfer latency has spiked on a subset of sessions, average link utilization is normal, what do you check? A strong answer starts with PFC pause frame counters per-queue on specific leaf switches, moves to DCQCN ECN marking rates, checks for MRC path distribution imbalance if MRC is in use, and considers whether a specific prefill-to-decode mapping is routing KV transfers through a congested spine. A weak answer describes what KV cache transfer is without reasoning about what fails and what telemetry surfaces the failure. The ability to reason through a failure chain in a system you have not personally operated — using only your knowledge of the mechanism — is what distinguishes strong candidates in this interview space.
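The first two steps of that strong answer can be sketched as a telemetry filter. Everything here is hypothetical: the QueueSample fields, the switch names, and the thresholds are invented for illustration and do not correspond to any vendor's counter names or to a real monitoring API.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """Per-queue counter deltas over one polling interval (hypothetical schema)."""
    switch: str
    queue: int
    pfc_pause_delta: int
    ecn_marked_delta: int
    tx_packets_delta: int


def suspects(samples: list, pause_threshold: int = 100,
             min_mark_ratio: float = 0.001) -> list:
    """Flag queues that are being paused without ECN marking ahead of it.

    Heavy pausing with almost no ECN marks suggests the early congestion
    signal is missing or misconfigured on that hop.
    """
    flagged = []
    for s in samples:
        mark_ratio = s.ecn_marked_delta / s.tx_packets_delta if s.tx_packets_delta else 0.0
        if s.pfc_pause_delta >= pause_threshold and mark_ratio < min_mark_ratio:
            flagged.append((s.switch, s.queue))
    return flagged


samples = [
    QueueSample("leaf-01", 3, pfc_pause_delta=0, ecn_marked_delta=1_200, tx_packets_delta=1_000_000),
    QueueSample("leaf-17", 3, pfc_pause_delta=4_500, ecn_marked_delta=0, tx_packets_delta=900_000),
    QueueSample("leaf-22", 3, pfc_pause_delta=300, ecn_marked_delta=9_000, tx_packets_delta=1_100_000),
]

flagged = suspects(samples)
print(flagged)
```

In this made-up data only leaf-17 is flagged: it is pausing heavily while marking nothing. leaf-22 is also pausing, but it is marking too, which points toward genuine congestion rather than a missing ECN configuration.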
For most network engineers, the preparation gap on this topic is not motivation — it is that inference networking is genuinely new territory. The CCNP and CCIE curriculum does not cover KV cache transfer, prefill-decode disaggregation, or MRC. The starting point is understanding the two-phase split and why it produces the network requirement it does. From there, build the mechanism layer by layer: why RoCEv2 is the transport, why PFC and DCQCN are both necessary, why L3 Clos beats L2 overlays at this scale, and what MRC adds beyond standard RDMA. Practise explaining that chain out loud, under challenge, until you can do it without notes — because the interviewer at Anthropic or Google will not stay at the definition layer. The engineers running these systems know the details. You need to demonstrate that you do too.