Parallelism Techniques for Large-Scale LLM Inference

Data Parallelism

Data parallelism replicates the entire model on each GPU (each colored box) to process different input data in parallel.

How it works: Data parallelism involves deploying multiple copies of the model on different GPUs or nodes. Each model replica handles a different portion of the incoming requests or batch independently, much like having multiple instances of the same microservice handling separate users . All GPUs have the full model loaded, so this technique does not split the model itself; instead it splits the data. For example, if you have 4 identical GPUs each with a copy of a 7B parameter model, you can send different input sequences to each GPU concurrently and get 4 times the throughput (in ideal conditions). Importantly, this does not help if a single inference is too large to fit on one GPU’s memory – data parallelism assumes the model does fit in one device’s memory .

Benefits:
- Linear throughput scaling: Increases overall serving throughput almost linearly with the number of replicas, since each GPU can handle a separate request in parallel . This is great for serving many independent queries simultaneously (e.g. many users at once).
- No model coordination latency: Each inference runs in isolation on one GPU, so there’s no cross-GPU communication needed during forward passes. This keeps per-request latency low (each request is just like running on a single GPU) and avoids networking overheads during inference.
- Simplicity: It’s straightforward to implement – essentially running multiple model instances – and is compatible with existing inference servers. Scaling to more GPUs or nodes is as simple as adding more replica instances.

Trade-offs:
- Memory duplication: Every GPU must hold a complete copy of the model weights. This is wasteful for very large models and can become impossible if the model is larger than a single GPU’s memory (e.g. a 175B model cannot be served on one 24GB GPU by data parallelism alone) . It also means using N GPUs requires N times the memory for weights.
- No single-query speedup: Data parallelism does not reduce the latency for a single input. One query can only run on one replica, so if you need to accelerate one long or complex inference (rather than many in parallel), data parallelism doesn’t help. It’s geared toward throughput over latency.
- Diminishing returns: If the workload (number of concurrent requests) is low, many model replicas will sit idle – scaling out beyond the demand just wastes resources. And in multi-node scenarios, distributing requests adds slight overhead (e.g. load balancing, network transfer of input/output), though typically minor compared to the inference compute.

Scalability: Data parallelism scales horizontally across GPUs and even across multiple machines. In theory, if you have 100 identical GPUs, you can serve ~100x more requests per second. In practice, near-linear scaling is achievable as long as the external factors (like request dispatch and network I/O) are not bottlenecks. Unlike model-sharding techniques, data parallel replicas don’t need high-speed interconnects between GPUs because they run independently – this makes it suitable for scaling across nodes with just standard networking. However, each replica still needs to fit the model in memory, so extremely large models might require model parallelism instead of or in addition to data parallelism.

Use cases: Data parallelism shines for multi-user inference serving and high-throughput APIs. For example, a deployment of a 13B model that easily fits on a GPU can be replicated 8 times to handle 8 concurrent requests with minimal latency impact, which is common in production systems . It’s also used in batch inference: a large batch can be split so each GPU processes a chunk of the batch in parallel, then results are combined. However, data parallelism fails when the model is too big for one GPU or when one needs to speed up a single inference beyond the capability of one device – in those cases, model or pipeline parallelism is needed.

Interview Questions – Data Parallelism

Throughput vs. latency: If you needed to increase the throughput of an LLM service (serving many queries per second), why is data parallelism a good choice? And conversely, why does it fail to improve the latency of a single query?

Memory overhead: In an inference deployment, what are the memory implications of using data parallelism with a 30B parameter model across 4 GPUs? How might this influence your decision to use data parallelism or not?

Scaling across nodes: When scaling data parallel inference to multiple machines, what network considerations arise (for input/output handling or model updates) even though the model replicas don’t directly communicate during inference?

Combining parallelism: If a model barely fits on one GPU, can you simply use data parallelism to leverage two GPUs for one request? Why not, and which parallelism technique would you consider in that scenario?

Use-case judgment: Imagine a scenario with sporadic, heavy single-user queries (long, expensive prompts) rather than many concurrent users. Would you invest in data parallel replicas or another approach to handle this load? Explain your reasoning.

Tensor (Model) Parallelism

Tensor parallelism (a type of model parallelism) splits each weight matrix or tensor across multiple devices. In this illustration, matrix B is split into two parts (green and blue) and multiplied with A in parallel; partial results are then combined (all-gather) to form the final output C .

How it works: Tensor parallelism (also called intra-layer model parallelism) partitions the computations within each model layer across multiple GPUs . Instead of replicating the whole model, different GPUs hold different slices of the model’s weight tensors. During inference, a single forward pass is distributed: each GPU computes its fragment of the layer and the partial results are then aggregated to produce the same output as the full layer. For example, if a transformer’s feed-forward layer has a large weight matrix, one can split that matrix into 2 or 4 chunks and place each chunk on a different GPU. Each GPU multiplies its chunk by the input simultaneously, and then the results are summed or concatenated to get the final output . Similarly, for multi-head attention, different heads (or groups of heads) can be assigned to different GPUs to be computed in parallel . In essence, tensor parallelism “slices” the tensor operations along a dimension and uses an all-reduce or gather operation to combine outputs at the end of the layer. This allows a single huge model to be spread across multiple devices within each layer.

Benefits:
- Enables larger models: Tensor parallelism makes it possible to serve models that are too large for one GPU’s memory by dividing the model’s parameters. Each GPU only stores a fraction of the weights (e.g. 1/4 of each large matrix if using 4-way parallelism), reducing per-GPU memory usage proportionally . This was key to deploying models like GPT-3 on clusters of GPUs.
- Potential speedups: By sharing the compute of a layer across multiple GPUs, you can shorten the wall-clock time for that layer’s forward pass. Each GPU handles a smaller matrix multiplication, which might complete faster, and the results are merged. In an ideal scenario with fast interconnect, splitting a compute-heavy layer across N GPUs could approach an N× speedup for that layer. This can reduce latency for a single large inference if the GPUs are efficiently utilized in parallel.
- Transparent model output: The outputs are the same as if computed on a single device (just assembled from parts), so accuracy isn’t affected. It parallelizes the math without approximating it. From the outside, it’s still one model – just partitioned under the hood.
- Mix-and-match with other parallelism: Tensor parallelism can be combined with data parallelism or pipeline parallelism. For example, you might shard the model across 2 GPUs (tensor parallel) and also have 2 replicas of that setup (data parallel) to serve more queries. It’s a flexible building block for scaling.

Trade-offs:
- Communication overhead: Splitting a layer means that at some point GPUs must exchange data (e.g. send their partial outputs to each other) to produce the final result . Typically this involves high-speed GPU-to-GPU communication (like an all-reduce or all-gather after every layer). This overhead can significantly hurt latency if the interconnect is slow. In fact, the network bandwidth between GPUs often becomes the bottleneck for tensor parallel scaling . Using NVLink or NVSwitch (on-node) or Infiniband (across nodes) is usually required to make it performant.
- Synchronization: All parallel GPUs must wait for each other at the synchronization points each layer. The inference can only move at the speed of the slowest participant (e.g., if one GPU is momentarily slower or has more load, it delays the others). This means latency doesn’t always scale linearly – adding more GPUs might give diminishing returns if communication and synchronization costs grow.
- Memory overheads: While each GPU stores only a slice of the big weight matrices, some parts of the model might still be replicated on all GPUs (for example, small layers like layernorm or embeddings might be kept full on each for simplicity, or you need memory for communication buffers). There’s also overhead in storing partial activations that must be exchanged. So memory per GPU is reduced, but not by exactly the factor of parallelism in all cases.
- Complex implementation: Using tensor parallel inference often requires a custom runtime or libraries (Megatron-LM, FasterTransformer, etc.) that know how to split the computations and do all-reduce. It adds engineering complexity compared to running a model on one device. Load balancing the split (deciding how to partition tensors) can also be complex for different layer types.

Scalability: Tensor parallelism is effective up to a point. Typically, we see good scaling for moderate numbers of GPUs (e.g. 2, 4, 8) especially with excellent interconnects. For instance, splitting an attention or MLP across 8 GPUs can work if they share NVSwitch, but beyond that, the communication overhead can dominate. Each additional GPU contributes less speedup if the network traffic (which grows with more shards) becomes the limiting factor . Also, some operations can’t be split arbitrarily – if a layer is very small, splitting it might not be worthwhile. In practice, large LLMs often use a combination of tensor parallel groups (e.g. groups of 8 GPUs) to shard the model, possibly combined with pipeline parallel groups for further scaling. The key scalability consideration is network bandwidth: a high-bandwidth, low-latency interconnect (such as NVLink or PCIe5 or better) is needed to make tensor parallel inference across GPUs efficient. Multi-node tensor parallelism is possible but requires extremely fast network (e.g. InfiniBand) and even then may incur significant latency per token.

Use cases: Tensor parallelism is commonly used when a model just cannot fit on one GPU’s memory. For example, serving a 175B parameter GPT-3 model might involve splitting each layer across 8 GPUs (Megatron-LM style) so that each GPU holds ~22B parameters. It also finds use in accelerating inference for very large batches or very large models – e.g. big batch transformer scoring can be sped up by distributing the matmuls. Real-world deployments of ultra-large models (like those by OpenAI, Microsoft, etc.) rely on model parallelism under the hood. However, if the model does fit on one GPU, pure tensor parallelism is usually avoided for latency reasons – one GPU is often faster than two with communication. Thus, it shines primarily in the “make it fit” scenario (or the “use more GPUs because one would be too slow” scenario) and is often combined with pipeline parallelism (model sharding by layers) for models that are both deep and wide.

Interview Questions – Tensor/Model Parallelism

Networking bottleneck: In an inference setup using tensor parallelism across 4 GPUs, what happens if the GPUs are connected only via a slow PCIe bus (and not NVLink)? How would this affect latency, and why is high-bandwidth GPU interconnect crucial in this context?

Parallelizing attention vs. MLP: Suppose you split the attention heads of a transformer across GPUs (each GPU handles some heads) and also split the feed-forward network matrix. Describe how the outputs are merged in each case. What communication (all-reduce or all-gather) is needed for attention and for the MLP, and what does that imply for scaling efficiency?

Memory trade-off: If you have a 60B parameter model and two 32GB GPUs, outline how you would use tensor parallelism to serve this model. What portions of the model might still be replicated on both GPUs, and how much memory could you save per GPU roughly? What additional memory overheads could appear when doing this?

Combining with data parallel: Imagine you have 8 GPUs and a model that just fits on 2 GPUs with tensor parallelism. You also need to handle 4 concurrent queries with low latency. How would you architect a solution with both tensor and data parallelism in this case? What would be the groups of GPUs and the communication pattern?

Limits to scaling: Why doesn’t splitting a model across 64 GPUs (with tensor parallelism alone) typically result in 64× faster inference for a single prompt? Discuss the factors that cause sublinear scaling and how an inference engineer might detect and alleviate those issues (e.g. by profiling communication vs compute).

Pipeline Parallelism

Pipeline parallelism (layer-wise model parallelism) splits the model’s layers into stages across GPUs. In this diagram, the neural network’s layers are divided into three stages on GPU0, GPU1, GPU2 respectively. Each stage processes and then passes activations to the next, like an assembly line .

How it works: Pipeline parallelism partitions the model vertically by layers (or sets of layers) and assigns different layers to different GPUs . Each GPU is responsible for a consecutive chunk of the neural network’s layer stack. For example, in a model with 24 transformer layers and 4 GPUs, you might put layers 1-6 on GPU0, layers 7-12 on GPU1, 13-18 on GPU2, and 19-24 on GPU3. When an input comes in, it first passes through the layers on GPU0, then the intermediate result is sent to GPU1 to process the next set of layers, and so on in sequence until GPU3 produces the final output . In essence, the inference is like a relay: each device does its part and forwards the activations to the next. This allows a single forward pass to utilize multiple GPUs without each needing the full model. Moreover, when multiple inputs (or micro-batches) are processed, the pipeline can be kept busy – while GPU1 is working on the second batch’s layers, GPU0 can start on a third batch, etc., overlapping work like an assembly line. The key idea is to increase throughput by simultaneously processing different inputs at different stages of the network.

Benefits:
- Memory distribution: Like tensor parallelism, pipeline parallelism enables serving models larger than one GPU’s memory by distributing layers. Each GPU only needs to load the weights for the layers it owns. This is especially useful for very deep models. For instance, if one GPU can hold 10 layers worth of weights, an 30-layer model can be split across 3 GPUs in a pipeline.
- Increased throughput via concurrency: Once the pipeline is filled with multiple inputs, all GPUs can work in parallel on different stages. This means you can achieve high device utilization and throughput – while one input is in later layers, a new input can start processing in the earlier layers. With enough micro-batching, pipeline parallelism approaches the throughput of data parallelism, but using a single model copy split across GPUs.
- Limited communication: The communication in pipeline parallelism is only the passing of activations from one stage to the next. This is typically smaller in size than exchanging full model parameters. For example, sending an activation tensor (e.g. batch_size × hidden_dim) forward is often less data than synchronizing large weight gradients in training or combining large weight shards as in tensor parallel. Thus, pipeline parallelism’s networking cost can be moderate, mainly point-to-point between adjacent stages.
- Combining with tensor parallel: In practice, pipeline parallelism is often combined with tensor parallelism to form 2D parallelism. For example, within each stage you might still shard the layer across 2 GPUs (tensor parallel) and you have, say, 4 such stages in sequence (pipeline). This hybrid can scale to very large numbers of GPUs for extreme model sizes while balancing memory and compute load.

Trade-offs:
- Sequential latency: An inference still has to traverse all pipeline stages in order. For a single input, the end-to-end latency is the sum of all stage computations plus the inter-stage communication overhead. Unlike tensor parallel (which splits one layer’s work concurrently), pipeline parallelism doesn’t reduce the critical path latency – in fact, it can increase it if there’s communication delay between stages . So a single inference may be slightly slower than if it ran on one GPU (due to network hops between stages).
- Pipeline “bubble”: To get throughput benefits, you need multiple inputs in flight to fill the pipeline. The first input incurs a bubble – GPU2 and GPU3 are idle until the input reaches them. Similarly, after the last input, the earlier GPUs go idle while later stages finish. For short bursts of requests or small batch sizes, these bubbles (idle times) reduce efficiency. If you cannot keep the pipeline full, you won’t get good utilization.
- Load balancing: All stages should ideally take roughly equal time, otherwise the slowest stage becomes a bottleneck that holds up others. But different layers have different compute characteristics. For example, in transformers the later layers might have the same dimension sizes as earlier ones (so similar compute per token). If the model is evenly divisible, great – if not, you may need to distribute layers unevenly or even duplicate some compute to balance. An imbalance means some GPUs sit idle waiting for a straggling stage.
- Complexity and failure modes: Implementing pipeline parallel inference requires coordination – you need to manage asynchronous transfers of data between stages, possibly use technologies like CUDA streams or NCCL for communication, and handle batch splits. Also, if one stage fails or is overloaded, it stalls the whole pipeline. There’s added complexity in coding and debugging compared to single-device inference.

Scalability: Pipeline parallelism can scale to a large number of GPUs by increasing the number of stages (you can even assign one layer per GPU if you want, up to the number of layers). It’s commonly used across nodes (e.g., a model split across 2 or 4 servers). The scalability is mainly limited by the depth of the model and the overhead of stage-to-stage communication. Each stage only directly communicates with its neighbor, so bandwidth requirements are typically between one pair at a time, which is more manageable than all-to-all communication. With efficient scheduling (such as interleaving micro-batches), pipeline parallelism achieves near-linear throughput scaling with number of stages, as long as there are enough concurrent inputs to keep all stages busy. However, the latency for a single input will always include all stages serially. Recent optimizations like 1F1B (one forward, one backward micro-batch schedule in training) are not needed for inference (no backward pass), but similar ideas of overlapping stage processing can maximize device utilization. In summary, it scales well for throughput on large models, but if you had, say, 16 pipeline stages and only one query, it would be very slow – so it’s best when batch or request concurrency is high.

Use cases: Pipeline parallelism is a go-to for very deep models or sequence of model segments. For example, if serving a gigantic transformer with 100+ layers, one might allocate 10 layers per GPU across 10 GPUs in a pipeline. It’s used in deployment of models like PaLM or GPT variants where model size mandates multi-GPU splitting. It shines in batch inference scenarios where you can batch many inputs and achieve high throughput (all GPUs busy). In interactive systems with low batch sizes, pipeline parallelism is mainly used just to fit the model in memory, accepting some latency hit. A real-world scenario: A long-context LLM with 64 layers might be split over 4 GPUs; if you can batch 16 requests, each GPU is busy on a different request’s portion, giving high throughput. If you only have 1 request at a time, that request still has to pass GPU1 → GPU2 → … → GPU4 sequentially, incurring extra hops – that’s the latency trade-off . Pipeline parallelism can also fail or become inefficient if one stage becomes a bottleneck (e.g., if one partition of layers is especially slow or memory-bound) or if the pipeline frequently stalls due to lack of input workload.

Interview Questions – Pipeline Parallelism

Latency vs throughput: Explain why pipeline parallelism tends to increase single-query latency but can increase overall throughput. How does the concept of pipeline fill/bubble illustrate this trade-off?

Stage partitioning: Suppose you have a 12-layer transformer and 3 GPUs. Give an example of how you would assign layers to each GPU. What factors would you consider to ensure each GPU’s workload is balanced, and what could happen if one stage takes twice as long as the others?

Pipeline in deployment: In an inference server, you receive a burst of 8 requests and then nothing for a while. How would pipeline parallelism handle this compared to data parallelism? Discuss what happens during the burst (pipeline fill) and after (pipeline flush) in terms of efficiency.

Inter-stage communication: What kind of data is sent between pipeline stages during LLM inference? How does the size of this data compare to, say, the model size or the size of activations in tensor parallelism? (Consider, for example, sending hidden states of shape [batch, seq_len, hidden_dim] to the next stage.)

Hybrid parallel strategies: In a deployment of a very large model, you decide to use 8 GPUs with a combination of pipeline and tensor parallelism. How might you organize these 8 GPUs (e.g., 4 pipeline stages with 2-way tensor parallel in each)? What benefits does this hybrid approach offer over pure pipeline or pure tensor parallel alone in terms of scalability and resource usage?

Expert Parallelism (Mixture-of-Experts)

How it works: Expert parallelism refers to distributing parts of a Mixture-of-Experts (MoE) model across multiple GPUs . An MoE model contains multiple “expert” sub-networks (for example, many feed-forward networks specialized on different token patterns) and a gating mechanism that activates only a few experts per input. During inference, the router (gating network) examines each input token or sequence and decides which expert(s) should handle it. Only those expert networks are executed for that input, and their outputs are combined to produce the final result. Different experts reside on different GPUs – effectively, each GPU might host a subset of the experts. When a token needs a particular expert, the token’s data is sent to the GPU owning that expert, processed there, and the result is sent back or onward . Because each token only uses a small fraction of all experts, most of the model’s parameters are “idle” for that token, and the computation is sparse. For instance, imagine a model with 16 experts where each token is handled by 2 experts: if you have 4 GPUs, each could hold 4 experts. For a given token, the system activates the chosen 2 experts (say one on GPU2 and one on GPU3); those GPUs compute their part in parallel and return their outputs which are combined (the combination might happen on the original GPU or a central place). Expert parallelism is essentially model parallelism at the level of entire sub-networks: different GPUs hold different chunks of the model (the experts), and an inference dynamically uses only a subset of those chunks.

Benefits:
- Massive model capacity: MoE allows extremely large models (hundreds of billions of parameters) by spreading many experts across many devices, without requiring that every inference activates all those parameters. You can increase the number of experts (and hence parameters) almost linearly with number of GPUs, gaining model capacity and potential quality improvements, while each individual inference only pays for a few experts’ worth of compute . This means you get the quality of a huge model at a fraction of the inference cost per request (if the gating is effective).
- Per-token efficiency: Since only a subset of experts are used for a given input, the computation and memory access is sparse. Each GPU works on the tokens that need its experts and is idle for tokens that don’t – which can be efficient if workload is balanced. Inactive experts don’t consume compute for that token . Compared to a dense model of equal size, an MoE can be faster because not all weights are processed for each token.
- Scalability: Need a bigger model? Add more experts (and GPUs to host them). Expert parallelism scales out model size almost arbitrarily – many recent MoE research models scale to dozens or hundreds of experts. It’s easier to reach trillion-parameter scales by adding experts than by making layers or hidden sizes huge. This is attractive for deployment because you can trade-off between model quality and inference cost by adjusting how many experts are used per token.
- Better hardware utilization than pure model slicing: Unlike tensor parallelism which forces every GPU to work on every token (incurring synchronization), expert parallelism lets different GPUs truly work on different tokens (when routing differs). This can lead to better overall hardware utilization when serving batches of varied inputs – GPUs can operate more independently on the tokens assigned to their experts, with less fine-grained communication. In ideal cases, it’s more like a scatter-gather pattern than lockstep synchronization, which can reduce communication overhead .

Trade-offs:
- Complex routing & communication: The gating and routing of tokens to experts introduces communication overhead and complexity. In inference, you might have to perform an all-to-all exchange where each GPU sends the token data that chose its experts to the appropriate devices, then gather results back. This can be a heavy communication pattern, especially if many experts (GPUs) are involved per token . If the interconnect is not fast, the latency added by routing can be large.
- Load imbalance: One expert might end up being popular for many inputs (depending on the data distribution and gating decisions) while others are rarely used. This can create hot spots where one GPU (with a popular expert) gets overloaded with work while others sit mostly idle. MoE inference performance is very sensitive to how evenly the token load can be balanced across experts. Systems often have to implement load-balancing strategies or capacity limits (each expert only handles X tokens, extra tokens go to a secondary choice etc.) which complicates the design.
- Memory overhead: While each GPU doesn’t need the full model, an MoE often still has a shared backbone (like the transformer layers without the feed-forward experts, or embedding layers) that might be replicated on all GPUs. Also, all experts are loaded in memory across the cluster. For example, a 1T-parameter MoE with 64 experts might load ~15B params per GPU if evenly divided, which is manageable, but if gating only uses a few experts per token, you carry a lot of “inactive” weights in memory that are only occasionally used. Also the gating network itself is usually replicated and small.
- Consistency and debugging: The indeterministic nature of routing (depending on input content) can make testing and debugging harder. It’s not as straightforward as a single sequence always taking the same code path – small input changes might activate different experts. For deployment, you must ensure the system can handle any routing patterns (including worst-case where everyone hits the same expert). Caching of expert outputs or dynamic batching of tokens per expert might be needed to maximize efficiency, adding to system complexity.

Scalability: Expert parallelism shines in scalability of model size. You can scale to many GPUs by adding more experts and distributing them. It has been demonstrated on clusters with thousands of experts spread over hundreds of GPUs. The scalability of throughput and latency, however, depends on the routing efficiency. In the best case (perfectly balanced, low-cost routing), you could achieve near-linear throughput scaling: more GPUs = more experts = more tokens processed in parallel (since different tokens use different experts). In practice, implementations like GShard or DeepSpeed MoE use techniques to reduce communication (e.g. each token only goes to top-2 experts, maybe on two GPUs) to make scaling manageable. The networking fabric is a big factor: high-bandwidth interconnect can allow frequent token exchanges. There is also a notion of capacity factor – how many tokens an expert can handle at once – which if set properly, can keep each GPU busy. If you increase batch size or number of concurrent tokens, you might need to proportionally increase experts to maintain low latency. Thus, scaling an MoE inference system involves scaling the number of experts (GPUs) and careful load balancing. It is worth noting that beyond a certain point, the overhead of coordinating hundreds of experts can limit latency gains – so often you might use expert parallelism to reach a model quality target with fewer flops, rather than to make inference super fast. Recent systems research (e.g. Tutel, FastMoE) provide optimized communication patterns for experts to improve scalability.

Use cases: Expert parallelism is used in extremely large language models where a dense model would be too slow or too memory-heavy to serve. For instance, Google’s Switch Transformer and GLaM, or META’s LM on TorchMoE, use MoE to get big model performance with fewer FLOPs per token. In inference, MoEs are useful when different inputs vary a lot – the model can specialize experts for different topics, and only those experts run for relevant inputs. It shines when you need the capacity of a huge model but want to keep inference cost per token lower. A real deployment scenario might be a multilingual model that has different experts for different languages – when a Chinese query comes in, the Chinese-language experts activate (others don’t), and for an English query, a different set of experts activates. This can save time relative to running a giant dense model with all weights. However, MoE can fail or become less beneficial if the overhead negates the sparse computation advantage – e.g., if every token ends up needing many experts or if load balancing is poor. In the worst case, if an MoE’s gating isn’t effective, you could end up using most experts for most tokens, which turns it into a bloated dense model with extra communication. Therefore, MoEs are most advantageous in mixed or broad deployment contexts where inputs truly differ and can be routed to specialized parts of the network . Together AI and others exploring inference infra for MoE focus on ensuring that the routing latency and imbalance issues are mitigated so that this approach yields actual speedups for large-scale systems.

Interview Questions – Expert Parallelism (MoE)

Routing overhead: In a distributed MoE model, each token may need to be sent to its selected experts on various GPUs. Describe the communication pattern you’d expect in an inference step for a batch of tokens (each token going to top-2 experts for example). What challenges does this pattern pose compared to the more predictable all-reduce in tensor parallelism?

Load balancing: Suppose in production you observe that one expert GPU is maxed out while others are underutilized (many inputs are routed to the same expert). What strategies could you employ to alleviate this bottleneck? (Consider solutions like gating adjustments, adding duplicate experts, or capacity throttling.)

Comparing dense vs sparse: If you have a dense 20B model and a sparse MoE model also totaling 20B parameters (say 4 experts of 5B each, with 2 experts used per token), compare the inference process. Which model might be faster per token and why? What conditions need to hold true for the MoE to actually be faster than the dense model?

Failure modes: During inference of an MoE model, what could go wrong if the gating network suddenly starts routing every token to all experts instead of one or two? How would that affect performance and memory, essentially turning the MoE into what kind of parallel workload?

System design: Imagine you are designing a serving system for a 64-expert MoE model on 16 GPUs. How would you map the experts to GPUs (e.g., 4 experts per GPU)? And how would you orchestrate the inference pipeline to minimize latency – for instance, would you gather all token representations to a central GPU for gating then scatter to experts, or perform gating on each GPU for its local tokens? Discuss the design decisions and their implications on performance.

Sequence Parallelism

How it works: Sequence parallelism is a parallelization strategy that splits the sequence length dimension of the input (and corresponding intermediate activations) across multiple GPUs . This is particularly useful for transformer models when dealing with very long input sequences or when certain operations (like layer normalization or dropout) are not easily partitioned by other means. In practice, sequence parallelism might mean each GPU is responsible for a different chunk of the input tokens. For example, if you have a sequence of 10,000 tokens to process (extremely long context) and 4 GPUs, you could assign ~2,500 tokens to each GPU for certain parts of the computation. Unlike data parallelism, which would replicate the model for different data samples, sequence parallelism is working on the same sequence but dividing the work by positions within that sequence. Each GPU holds a full copy of the model (or at least the parts of the model that will be sequence-partitioned) but only processes its subset of tokens, and they exchange information as needed to produce the final result.

In transformers, one way this manifests is by partitioning parts of the forward pass that are element-wise or independent per token. For instance, layer normalization and dropout at each time step can be done independently on each token, so those can be executed in parallel for different token subsets on different GPUs . More ambitiously, some approaches also split the attention computation across sequence: e.g., GPU0 handles attention updates for tokens 1-2500 attending to all tokens, GPU1 handles tokens 2501-5000, etc., and they share key/value info. Essentially, sequence parallelism tries to distribute the memory and compute load of very long sequences by chunking the sequence into segments that can be processed concurrently, then stitched back together.

Benefits:
- Handles long contexts: Sequence parallelism shines when the sequence length (number of tokens) is huge – potentially in the thousands or even millions. It allows multiple GPUs to cooperatively handle a single long sequence that would otherwise not fit in memory or would be too slow on one GPU. By splitting the sequence, each GPU deals with a smaller effective sequence length, reducing memory usage for activations (which often scale with sequence length) .
- Memory savings on activations: In a transformer, the memory required for activations (especially for attention key/value caches or intermediate states) grows with sequence length. By partitioning the sequence, each GPU only needs to store activations for its portion of the sequence, not the entire sequence. This can drastically cut memory per GPU, enabling, for example, a 16k token sequence to be handled by 4 GPUs each seeing 4k tokens. This is critical for long-context LLM inference where KV cache can consume tens of GB of memory for long inputs.
- Parallel speed for long ops: Some operations that scale with sequence length (like certain matrix multiplies in attention with complexity O(n^2) in sequence length) can be divided among GPUs. By splitting the sequence, those operations can run in parallel, potentially giving a speed boost. For example, computing attention scores for 10k tokens attending to 10k tokens is heavy; splitting into two 5k-token chunks per GPU can roughly half the work each does (with some overhead for combining results). Recent research shows significantly lower latency for long sequences using sequence parallel methods – e.g. Snowflake’s Ulysses technique achieved about 3.4× lower latency on long context inference by distributing the sequence across GPUs .
- Complements model parallelism: Sequence parallelism can be used alongside tensor or pipeline parallelism. In fact, some frameworks (Megatron, DeepSpeed) integrate sequence parallelism as an option when using tensor parallelism, to handle those layernorm/dropout parts more efficiently. It’s another dimension to exploit: you can shard across tokens in addition to sharding across layers or across weight parameters. This can further reduce memory bottlenecks and improve throughput when context is the limiting factor.

Trade-offs:
- Complex communication: Unlike data parallel (different sequences) or tensor parallel (combining partial sums), sequence parallelism often requires exchanging partial results that depend on the other sequence parts. For example, in self-attention each token attends to all others, so if tokens are split across GPUs, they must share their key/value representations or partial attention scores. This means heavy communication of activations between GPUs at certain stages. If not carefully optimized (with strategies like ring-based all-reduce or scatter-gather pattern), this can eat up the benefit of parallelizing the work.
- Added synchronization points: All the GPUs working on different parts of the same sequence will need to sync up at points where the sequence segments interact (e.g., after computing attention for their tokens, they might need to exchange to get the complete result for each token). These synchronization barriers can reduce parallel efficiency, especially if sequence chunks are imbalanced or some require more compute (e.g., maybe one chunk has tokens that attend to many others heavily in a sparse attention scenario).
- Limited by sequence independence: Some parts of the model are perfectly parallelizable by sequence (like layernorm on each token), but other parts inherently involve the full sequence (like softmax across attention scores or a final classification over the entire sequence). Sequence parallel methods have to find ways to partition those or otherwise still handle global operations. In some cases, you might still need to gather the full sequence on one GPU for a final step, which could become a bottleneck.
- Software complexity: Implementing sequence parallelism can be quite complex. It’s less straightforward than splitting by layers or weights, because you’re effectively parallelizing the data within a single sequence in the middle of model computations. Ensuring correctness (that each token’s result is as if it saw the whole sequence) requires careful coordination. Tools like DeepSpeed-Ulysses introduce custom kernels and scheduling to manage this. It’s cutting-edge enough that not all inference frameworks support it out-of-the-box.

Scalability: Sequence parallelism is specifically targeted at scaling with longer sequences rather than more data items. Its scalability is measured by how long of a sequence you can handle, or how much you can cut latency for a given long sequence by adding GPUs. If you have an extremely long sequence, you can split it over more GPUs – for instance, an input of 1 million tokens could be conceptually split over 16 or 32 GPUs to make each chunk manageable. The latency can decrease sub-linearly as you add GPUs because each GPU’s share of work goes down, though communication might start to dominate beyond a point. The elastic sequence parallelism idea (as in LoongServe) even suggests dynamically adjusting the number of GPUs based on sequence length – using more GPUs for longer inputs to keep latency in check, and fewer for shorter ones . In terms of across nodes: yes, you can do sequence parallel across machines too, but you’ll need an extremely fast network, since you’ll be sending lots of activation data around (potentially every token embedding). So, usually sequence parallel is done on GPUs with fast interconnect (within a server or across servers with InfiniBand). As with other parallelism forms, diminishing returns hit when the overhead of merging results outweighs the reduced compute per GPU. But recent advances (like ring-based attention algorithms) attempt to reduce communication so that sequence parallel can scale to more GPUs efficiently by only exchanging needed parts. Overall, it scales the feasible context length roughly linearly with the number of GPUs (each GPU adds more memory and compute for more tokens), and can reduce latency for a fixed long context, though not perfectly linearly due to overhead.

Use cases: Sequence parallelism is most relevant for long-context LLM inference. If you have a model that can handle, say, 32k or 100k tokens context (for long documents, multi-document QA, etc.), a single GPU might be too slow or run out of memory storing the attention keys/values for that many tokens. Splitting the sequence across 4 or 8 GPUs can make such long context use cases feasible in reasonable time. Real-world use: A company wants to deploy a summarization model for long legal documents (tens of thousands of words) – they could use sequence parallelism to split the document across multiple GPUs and process it in parallel, enabling a result in, say, 5 seconds instead of 30. Sequence parallelism also appears in training long-context models, but for inference, systems like Ulysses by Snowflake specifically target serving 256k-token contexts by dividing work among GPUs. It shines when memory for sequence is the limiting factor – e.g., serving many long conversations or code files. If sequences are short (a few hundred tokens), sequence parallelism isn’t needed and would only add unnecessary overhead. It also may not help if the model’s bottleneck isn’t the sequence length (for instance, if compute per token is the issue rather than number of tokens). But as we push toward models with very long contexts, sequence parallelism or similar “context splitting” strategies are becoming key enabling techniques for inference on those models .

Interview Questions – Sequence Parallelism

Long context challenge: Suppose you want to serve an LLM with a 64k token context window on GPUs that normally can only handle 8k tokens before running out of memory. How could sequence parallelism be applied here? Describe what each GPU would do and how they would collaborate to handle an input nearing 64k tokens.

Attention computation: In sequence parallelism for self-attention, each GPU might handle a subset of query tokens. How do these GPUs get the information about the key/value for tokens they didn’t compute? Describe a mechanism to exchange or replicate data so that each query token still attends to the whole sequence.

When not to use: If an LLM usually processes inputs of only 256 tokens, would sequence parallelism provide any benefit? Explain why splitting such a short sequence across multiple GPUs might be counterproductive, touching on overhead vs. work.

Combining with other parallelism: Imagine a scenario with both a very large model and very long inputs – e.g., a 50B parameter model with 32k context. You have 8 GPUs. How might you combine tensor/model parallelism and sequence parallelism to handle this? (Hint: you could shard the model weights and also shard the sequence length).

Performance debugging: You tried sequence parallel inference on a 10k-token input with 2 GPUs and found it was slower than using 1 GPU. What might be the causes for this slowdown? How would you determine if the problem is communication overhead, poor load balance, or something else?

Operator Fusion / Kernel-Level Parallelism

How it works: Operator fusion is a low-level optimization where multiple operations in the model’s computation graph are combined into a single GPU kernel launch. While not parallelism in the multi-GPU sense, it exploits the parallel computing resources within a GPU more efficiently. In large LM inference, there are many sequential tensor operations (matrix multiplications, additions, activations, layer norms, etc.). Normally, each of these operations would be executed as a separate GPU kernel call: the GPU loads data from memory, does the op, writes results, then moves to the next op. With operator fusion, we merge these steps so that one kernel performs several operations in one pass over the data, reducing the overhead of launching kernels and intermediate memory transactions .

For example, in a transformer block, instead of launching one kernel for matrix multiply, one for adding bias, one for activation (GELU), one for dropout, etc., a fused kernel could do “GEMM + bias + GELU” together on each element. This means the data for that layer is read from memory once, transformed through all those operations by the GPU’s threads, and written out once at the end. Another prominent example is FlashAttention, which fuses the attention score computation, softmax, and weighted sum operations into a single highly-optimized kernel to avoid excessive memory reads/writes and to utilize on-chip memory better .

In inference, operator fusion often involves using compiler optimizations (like TensorRT, XLA, or custom CUDA kernels) that detect sequences of ops that can be merged. It can also involve using specialized libraries (e.g., fused multi-head attention kernels). The end result is that the GPU executes a smaller number of heavier kernels that each do more work in parallel.

Benefits:
- Lower kernel launch overhead: Each kernel launch has some CPU overhead and latency. By fusing operations, you launch fewer kernels overall. This is especially important in inference where batch sizes might be small (even 1) and the model is large – the overhead of scheduling thousands of tiny operations can become a significant portion of latency. Fused kernels amortize this overhead by doing more per launch .
- Better memory locality: Fusion reduces the number of times intermediate results are written to and read from GPU global memory. Instead, data can stay in registers or shared memory for the duration of the fused computation. This cuts down on memory bandwidth usage and avoids unnecessary data traffic, which is often a bottleneck. In a memory-bound scenario (common in LLM decode where we shuffle a lot of data), this can improve throughput.
- Improved parallel efficiency: A single fused kernel can be optimized to use the GPU’s parallel threads and warps more effectively. For instance, combining elementwise ops means one thread can do an element’s series of ops sequentially without handing off to another thread or idling. It often increases occupancy and uses fewer global synchronizations. Overall, this means the GPU spends more time doing actual math and less time idling or switching context.
- Tailored optimizations: Fused kernels can implement algorithmic optimizations that aren’t possible in a fragmented implementation. FlashAttention is a great example: by fusing, it can tile the attention computation to avoid storing the entire big attention matrix, drastically reducing memory usage and speeding up computation. This kind of optimization yields big gains (FlashAttention can 2-4× speed up the attention step and enable longer sequences) which benefit inference latency and throughput.

Trade-offs:
- Complex development: Writing and maintaining fused kernels (especially for complex sequences of operations) is hard. It often requires low-level CUDA or assembly programming and careful handling of different GPU architectures. Each time the model changes (say a new activation function), you might need new fusion code. Relying on compiler frameworks can abstract this, but compilers may not catch all fusion opportunities or might have bugs. So there’s an engineering cost and less flexibility.
- Less modularity: Once ops are fused, it’s harder to debug or modify them individually. If a fused kernel is malfunctioning, it’s a monolithic block to troubleshoot. Also, you lose the ease of mixing and matching different operations (unless you generate a new fused kernel). This is usually fine for well-known architectures but can slow down iteration on new model designs.
- Hardware-specific tuning: A fused kernel might be optimized for a specific GPU generation or require tuning parameters (block size, thread count) for efficiency. This can make it less portable. For example, a kernel fused and tuned on NVIDIA V100 might not be optimal on NVIDIA A100 unless re-tuned. Often, vendor libraries (NVIDIA’s FasterTransformer, etc.) provide these kernels but one must update them for new hardware.
- Diminishing returns: Not every sequence of ops is worth fusing. Fusing two very large matrix multiplies, for instance, is not possible if there’s a data dependency between them (you need the result of first before second). Many critical ops (like the big GEMMs in transformers) are already quite optimized individually. The benefit of fusion is mostly on the “glue” operations (activations, elementwise adds, small matrix ops). Once those are fused, additional fusion won’t help much – so you eventually hit a point of no further speed-up. Also, if a kernel becomes too large or complex, it may use a lot of registers or not scale as well, potentially even hurting performance by reducing occupancy. So there’s a sweet spot in how much to fuse.

Scalability: Operator fusion is about optimizing single-GPU performance. It doesn’t directly scale across GPUs, but it helps you better saturate each GPU’s compute and memory bandwidth. In multi-GPU inference, you’d typically apply the same fused kernels on each GPU (for whatever portion of the model it runs). The fused kernels ensure each device runs as efficiently as possible, which indirectly improves overall scalability because you need fewer GPUs to achieve a certain throughput. If each GPU is 30% faster due to fusion, you scale out 30% less to meet a throughput target. Additionally, by reducing memory bandwidth needs, fusion can make other parallelism schemes more effective (for example, if tensor parallel partitions have less data to exchange because some intermediate wasn’t materialized due to fusion, that’s a win). But mostly, scalability here means handling larger batch sizes or sequence lengths on a single GPU more efficiently. As batch size increases, kernel launch overhead amortizes naturally, so fusion is most impactful at smaller batch or streaming inference. Modern compilers can fuse many ops regardless of batch size, maintaining efficiency from batch=1 to batch=N. In summary, fused kernels keep the GPU pipelines busy and minimize idle gaps, ensuring you get closer to peak theoretical performance. This is crucial when trying to maximize throughput per GPU (e.g., to serve as many tokens/s from an expensive accelerator as possible).

Use cases: Practically every high-performance inference engine uses operator fusion. For LLMs, when using frameworks like NVIDIA TensorRT, Hugging Face Accelerate, or ONNX Runtime with optimizations, a lot of fusion is done under the hood. For example, during the decode phase of autoregressive inference (generating one token at a time), the batch is often 1 and there are many small kernel launches – here fusion yields big latency improvements, making the model responsive. A concrete case: vLLM and FasterTransformer fuse the sampling and output projection steps, or fuse multiple small operations in the transformer block, to cut latency. FlashAttention (which is now widely used) is a case of kernel-level parallelism providing both speed and ability to handle longer sequences without blowing memory. Operator fusion especially shines in edge deployments or CPU inference too, but in GPUs it’s about squeezing out every bit of performance. One scenario where it might “fail” to help is if the model is already dominated by one huge operation (like a giant matmul) – you can’t fuse much into that, it’s already one kernel. But for Transformers, there are plenty of medium-sized ops to fuse. Essentially, whenever you see mention of “optimized kernels” or “TensorRT graph optimizations,” it’s operator fusion at work. It’s a behind-the-scenes technique, but inference engineers absolutely rely on it to hit latency and throughput targets .

Interview Questions – Operator Fusion

Latency at small batch: In an LLM serving scenario with batch size 1, why can the overhead of kernel launches significantly affect latency? How does operator fusion mitigate this, and what kind of latency improvement might you expect from fusing a chain of elementwise operations?

Memory bandwidth vs compute: If a transformer model’s inference is memory-bound (not all GPU cores are busy because it’s waiting on memory transfers), what role can kernel fusion play in improving performance? Give an example with specific operations (e.g., softmax and scaling in attention) that illustrates this.

FlashAttention insight: FlashAttention is a fused-kernel approach for the attention mechanism. Can you explain what it does differently compared to a naive implementation of attention in terms of kernel-level operations and memory access? Why is this especially relevant for long sequences?

Trade-off scenario: Fusing operations can sometimes make a kernel very large. What could go wrong if you fuse too many operations into one kernel? (Consider things like GPU register pressure, complexity of scheduling, and debugging.) How do engineers decide what to fuse and what to leave separate?

Tooling: As an inference engineer, would you prefer to write custom fused kernels by hand or rely on a compiler/graph optimizer to do it for you? Discuss the pros and cons of using automated tools like NVidia TensorRT or TVM to perform operator fusion versus writing custom CUDA kernels for your model. How might the answer change if you need to deploy on multiple hardware platforms (GPU, CPU, etc.)?

Asynchronous / Overlapping Prefetch Parallelism

How it works: Asynchronous execution and overlapping prefetching are techniques to utilize computation and communication concurrently, effectively creating a pipeline between data transfer and compute. In LLM inference, this often comes into play with things like prefetching the next data (weights, activations, or key/value cache) while current computations are ongoing. Modern GPUs and frameworks allow us to use multiple streams or hardware engines so that, for example, while the GPU cores are busy computing a matrix multiply, the DMA engine can be fetching the next layer’s weights from CPU memory or the next batch of data from system RAM into GPU memory. By the time the compute finishes, the needed data is already “in place,” reducing waiting time.

A concrete example is KV cache prefetching during autoregressive generation. When generating the next token, the model needs to read the key/value tensors from all previous tokens (which might be in GPU memory or offloaded to CPU if memory is limited). If you wait until you need them and then load them, the GPU might sit idle waiting for data. Instead, with asynchronous prefetch, the system predicts which parts of the KV cache will be needed soon and issues a transfer to L2 cache or HBM memory ahead of time, overlapping that data movement with the current token’s computation . As a result, when the compute units are ready for that data, it’s already available in fast memory, avoiding a stall.

Another example: In pipeline parallelism across GPUs, one can overlap the communication of activations between stages with computation. As soon as GPU0 produces its output, it sends it to GPU1 while GPU1 is still finishing up its previous task, so that by the time GPU1 is free, the new data has partially or fully arrived (this often uses async send/recv operations).

From a systems perspective, this often involves techniques like double buffering (one buffer being used in computation while another is being filled with new data), using CUDA streams (one stream does compute, another handles memcopies, set to overlap), or specialized instructions (like CUDA’s cp.async to prefetch data into shared memory or cache) . The goal is to hide latency of memory I/O or inter-device communication behind useful computation, so that the GPU (or multi-GPU system) is never idle waiting on data.

Benefits:
- Latency hiding: The primary benefit is reduced effective latency. If, say, loading a layer’s weights from CPU to GPU takes 5 ms and computing on them takes 5 ms, doing them sequentially would be 10 ms per layer. But if you prefetch the next layer’s weights during the 5 ms of computing on the current layer, you can potentially get it down to just ~5 ms (plus perhaps some overhead) total, as the transfer and compute happen in parallel. This can drastically speed up inference when memory transfers (or network transfers) are on the critical path.
- Throughput improvement: By keeping the hardware busy (either doing compute or transfer at all times), you maximize utilization. This is especially important if the workload is mixture of compute-heavy and bandwidth-heavy tasks. Overlap ensures neither the compute units nor the communication channels are left idle. As a result, the overall tokens-per-second or requests-per-second of the system increases. For instance, an asynchronous KV cache prefetch was shown to improve attention throughput by ~2× in some research because the memory wait was eliminated.
- Scalability across heterogenous memory: Overlap techniques enable using resources like CPU memory or slower GPU memory without incurring as high a penalty. You can offload large buffers (like KV cache, or large embedding tables) to CPU or slower memory and use prefetch so that it appears “fast enough” when needed. This effectively increases the usable memory for inference (allowing longer contexts or larger models than pure HBM would allow) without linear latency cost. Systems like PagedAttention or other offloading strategies rely on overlapping GPU compute with CPU-GPU data transfer to work well.
- Smooth pipeline: In multi-request or batched scenarios, asynchronous scheduling allows one part of the system to prepare data for the next request while the current request is executing. For example, while the GPU is busy generating a token for user A, the CPU can simultaneously preprocess the request for user B and even transfer the input to GPU memory. This way, as soon as the GPU is free, user B’s data is ready to go. This leads to better overall responsiveness and throughput.

Trade-offs:
- Complex scheduling logic: Implementing effective overlap requires careful scheduling of events. You must predict what data will be needed (prefetching the wrong data or too early/late reduces benefit or can waste bandwidth). If done incorrectly, you could fetch data that isn’t ultimately used (wasting memory bandwidth) or you might still end up waiting because prefetch was late. It introduces complexity in code (using multiple streams, callbacks, etc.) and potential bugs (e.g., race conditions if not synchronized properly).
- Memory overhead: Double-buffering and prefetching mean you might hold extra buffers. For example, you might need two copies of a weight chunk in flight (one being used, one being loaded). This can increase memory footprint slightly. Also, if you prefetch a lot of data “just in case,” you might evict useful data from caches or use up GPU memory for things not yet needed.
- Hardware limitations: Not all operations perfectly overlap. Sometimes the compute is also using the memory bandwidth heavily, so a simultaneous transfer might contend for resources (PCIe or memory controllers), leading to less than ideal overlap. On some systems, there may be only one copy engine, etc., which means you can overlap compute with copy, but if you try to do two copies at once and a compute, maybe a bottleneck appears. Basically, theoretical overlap might be limited by underlying hardware concurrency.
- Difficult to generalize: The optimal prefetch distance (how far ahead to load something) or overlap strategy can be workload-specific. For instance, if generation is going token by token, you might always fetch the next token’s data one token ahead. But if the pattern is irregular, or if using beam search, etc., logic gets complicated. Tuning the overlap for different batch sizes or sequence lengths might require experimentation. There’s also a risk of making things worse if the overlap saturates some resource – e.g., launching too many asynchronous transfers could clog the bus while GPU is also trying to use it for other memory accesses.

Scalability: Overlap techniques help maintain scalability as you move to systems with multiple resources. On a single GPU, overlapping compute and memory operations lets you approach the hardware’s full capacity (both compute and bandwidth). In multi-GPU, overlapping communication (like all-reduce or point-to-point transfers) with computation is crucial for scaling efficiency. For example, in distributed data parallel training, techniques like overlapping gradient all-reduce with backward computation (known as wait-forward-backward propagation) are used; similarly for inference, overlapping the transfer of pipeline stage outputs with computation of those stages improves scaling in pipeline parallel inference. If you add more GPUs or more nodes, the communication overhead usually grows – overlapping helps hide that, so scaling is more linear. The asynchronous prefetch of KV cache is especially useful if you offload KV to CPU for many concurrent sequences – it basically parallelizes GPU compute and CPU-GPU IO, meaning adding more CPU memory (for longer context) doesn’t linearly slow you as long as you can overlap. Generally, a well-designed inference engine will use multiple threads/streams to ensure data movement, compute on GPU, and even CPU-side tasks (like token post-processing or scheduling) are all happening in parallel as much as possible. This turns the inference process into a pipeline with stages (data prep, compute, communication, etc.) overlapped. Scalability then improves because you’re not extending the critical path with each new component – you’re running parts concurrently. The limitation is when there’s no independent work to overlap with (then you’re back to sequential waits). So as model architectures evolve, one tries to restructure tasks to create opportunities for overlap.

Use cases: Asynchronous and overlapping techniques are ubiquitous in high-performance inference servers. For example:
- KV Cache Offloading: Many long-context inference implementations offload the oldest KV cache to CPU and use asynchronous prefetch (using cudaMemcpyAsync or even GPU direct RDMA) to fetch needed segments in time for attention. This allows, say, a 128GB CPU memory to supplement 40GB of GPU memory without stalling every generation step .
- Streaming generation: When streaming tokens to a client, you can overlap the preparation of network packets with the generation of the next token. The GPU computes token N+1 while the CPU thread sends token N out over the socket.
- Multi-request batching: Systems like HuggingFace Text Generation Inference or NVIDIA Triton will gather incoming requests into batches. They often use one thread to batch inputs while the GPU is busy on the current batch, thereby overlapping batching logic with computation.
- Pipeline parallel inference: Overlapping communication between pipeline stages (using CUDA streams that do sends/receives concurrently with compute) is critical to avoid pipeline bubbles. This means stage i is sending data to stage i+1 at the same time stage i+1 is still finishing its previous micro-batch – classic producer/consumer overlap.
- Where it fails: If the workload is purely compute-bound and there is little data transfer, then there’s not much to overlap – the GPU is 100% busy with math, and memory is mostly co-utilized. In such cases, async prefetch doesn’t buy much. Another tricky case is if dependencies are tight: e.g., you can’t preload data because you don’t know which branch of a model you will take (like in a conditional execution graph, though in inference this is rare for transformers). But by and large, any time there is wait time for data, we try to overlap. It’s a key part of achieving high throughput in systems like DeepSpeed and TensorRT engines. In fact, NVIDIA’s newer GPUs and software explicitly support prefetch instructions (like cp.async to L2 or shared mem) to facilitate exactly this kind of optimization .

Interview Questions – Asynchronous Prefetch & Overlap

Latency hiding principle: What does it mean to overlap communication with computation in the context of LLM inference? Can you give a specific example (e.g., overlapping data transfer of the next token’s data with the current token’s computation) and explain how it hides latency?

KV cache example: In a long-context generation, assume the GPU can only hold the cache for the last 1000 tokens, and older ones are on CPU. How would you design a prefetching strategy to ensure the GPU has the necessary key/value tensors when computing attention for token 1001? What factors would you consider (like how early to prefetch, how much to prefetch)?

Double buffering: Explain the concept of double buffering in GPU inference. For instance, how could double buffering be used when loading model weights from CPU to GPU in a partial offload scenario, or when sending inputs to a GPU, to keep the compute pipeline fed?

Streams and engines: Modern GPUs have separate copy engines and compute engines. How does this hardware feature enable overlap of data transfer and kernel execution? What must a programmer do in code to actually achieve this overlap (consider CUDA streams and asynchronous calls)?

Diagnosing overlap effectiveness: Suppose you implemented asynchronous prefetching of data between GPUs for pipeline inference, but you’re not seeing any speedup – the timeline still shows communication and computation happening sequentially. What could be some reasons for this (e.g., synchronization issues, resource contention, small message sizes)? How would you verify and fix the overlap so that they truly run in parallel?